hepd <- readr::read_lines("https://biostat2.uni.lu/practicals/data/hepd.fasta")
String manipulation
This little tutorial aims to make you familiar with some of the functions of the stringr
package and a few regular expressions.
It consists of a few basic questions and an application provided by Prof. Emma Schymanski for the Master in Data Science.
stringr functions
We will be using the words data set that is built into stringr. It is available to you as soon as you load the package.
Select words that contain a y
Retrieve a logical (boolean) vector that indicates which words start with y
Retrieve the indices for all words containing ch within the word. The ch must not be at the start or the end of the word.
Extract the y and the previous character.
Note: Wrap the results in unique() to avoid printing many empty matches.
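As a rough illustration of the four tasks above, here is a minimal sketch on a small toy vector. The vector toy is an assumption made for this example and stands in for stringr::words; the patterns are one possible choice, not the only valid one.

```r
library(stringr)

# Toy vector standing in for stringr::words (assumption: illustration only)
toy <- c("yes", "play", "church", "teacher", "each", "try")

# Words that contain a "y" anywhere
with_y <- str_subset(toy, "y")

# Logical vector: TRUE where the word starts with "y"
starts_y <- str_detect(toy, "^y")

# Indices of words with "ch" strictly inside the word:
# the "." on each side requires one character before and one after "ch"
inner_ch <- str_which(toy, ".ch.")

# "y" together with the character before it; NA where nothing matches,
# so unique() collapses the repeated NA values
y_prev <- unique(str_extract(toy, ".y"))
```

Note that "church" is not selected by ".ch." because both of its ch pairs sit at a word boundary, while "teacher" is.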
Virus Research
Read the genome sequence of the Hepatitis D virus from hepd.fasta. For now, just execute the readr::read_lines() call given at the top of this page; it stores the lines of the file in hepd.
What is the length of the genome sequence?
What is the sequence composition? How often does each character occur?
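One way to approach the length and composition questions is sketched below on toy data. The vector fasta is made up for this example, and the layout it mimics (a single ">" header line followed by wrapped sequence lines) is an assumption about hepd.fasta that should be checked against the real file.

```r
library(stringr)

# Toy FASTA lines standing in for hepd (assumption: one ">" header line,
# then the sequence wrapped over several lines)
fasta <- c(">toy_sequence", "ATGCCA", "TTGATG", "CA")

# Drop header lines and join the remaining lines into a single string
genome <- str_c(fasta[!str_detect(fasta, "^>")], collapse = "")

# Sequence length in characters
genome_length <- str_length(genome)

# Split into single characters and count each one
composition <- table(str_split(genome, "")[[1]])
```

Joining the lines first matters: applying str_length() to the raw vector would return one length per line, header included.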
Find all motifs in the sequence
Find all matches of the motif ATG in the genome sequence.
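A minimal sketch of motif searching, assuming the genome has already been collapsed into one string. The toy string genome is invented for this example.

```r
library(stringr)

# Toy sequence (assumption: stands in for the collapsed genome string)
genome <- "ATGCCATTGATGCA"

# Number of non-overlapping occurrences of the motif
n_atg <- str_count(genome, "ATG")

# Start and end position of every match, as a two-column matrix
atg_pos <- str_locate_all(genome, "ATG")[[1]]
```

str_locate_all() returns a list with one matrix per input string, hence the [[1]] for a single sequence.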
Cheminformatics Research
Cheminformatics manipulations in R can be challenging because they involve many string operations, often with escape characters. These exercises give some brief insights into cheminformatics with strings, using a dataset of agrochemicals (pesticides) from PubChem.
agro_data <- read_csv("data/PubChem_Agrochemicals_20231022.csv.gz")
Rows: 3081 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): cmpdname, mf, inchi, isosmiles, canonicalsmiles, inchikey, iupacna...
dbl (9): cid, mw, exactmass, monoisotopicmass, pclidcnt, gpidcnt, gpfamilyc...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
How many of these agrochemicals are also drugs (pharmaceuticals)?
Look in the annothits column for entries that have the tag Drug and Medication Information.
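Counting rows that carry a given tag could look like the sketch below. The data frame toy_agro and the way several tags are separated within annothits are assumptions made for illustration; the real column may be formatted differently.

```r
library(stringr)

# Hypothetical mini table (assumption: tags are stored together in one string)
toy_agro <- data.frame(
  cmpdname = c("compound A", "compound B", "compound C"),
  annothits = c(
    "Agrochemical Information|Drug and Medication Information",
    "Agrochemical Information|Toxicity",
    NA
  )
)

# fixed() treats the tag as literal text rather than a regular expression
is_drug <- str_detect(toy_agro$annothits, fixed("Drug and Medication Information"))

# Missing annotations give NA, so drop them when counting
n_drugs <- sum(is_drug, na.rm = TRUE)
```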
How many agrochemicals have stereochemistry information?
Stereochemistry information is encoded in the isosmiles column and detected by looking for the symbols @, \ or /.
What happens if you run this on the canonicalsmiles column instead?
What does this tell you about Canonical SMILES?
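The backslash is the tricky part here: it has to be escaped once for the regular expression and once more for the R string. The sketch below uses two invented SMILES vectors, iso and canonical, which are assumptions that imitate the two columns rather than values taken from the data set.

```r
library(stringr)

# Toy SMILES (assumption: illustrative strings, not rows of agro_data)
iso <- c("C[C@H](N)C(=O)O", "C/C=C/C", "F/C=C\\F", "CCO")
canonical <- c("CC(N)C(=O)O", "CC=CC", "FC=CF", "CCO")

# Character class with @, / and a literal backslash.
# The regex needs \\ for one backslash, which is "\\\\" inside an R string.
stereo_pattern <- "[@/\\\\]"

n_iso <- sum(str_detect(iso, stereo_pattern))
n_canonical <- sum(str_detect(canonical, stereo_pattern))
```

In this toy example the canonical strings contain none of the three symbols, which is the kind of difference the question asks you to look for in the real columns.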
Return the canonicalsmiles of all agrochemicals containing a triple bond
Triple bonds are encoded by # in the SMILES, whether canonical or isomeric. Use str_view() to look at those that are also salts (encoded by .).
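Both symbols can be matched literally, which avoids having to escape the dot. A minimal sketch follows; the vector smiles is invented for this example.

```r
library(stringr)

# Toy SMILES (assumption: illustrative strings only)
smiles <- c("C#N", "CC#CC(=O)[O-].[Na+]", "CCO", "[Na+].[Cl-]")

# Keep the strings that contain a triple bond
triple <- str_subset(smiles, fixed("#"))

# Of those, keep the salts: "." separates the components of a salt
triple_salts <- str_subset(triple, fixed("."))

# Highlight the separator in the remaining strings
str_view(triple_salts, fixed("."))
```

Without fixed(), a bare "." would be read as the regex wildcard and match every non-empty string.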
How many agrochemicals contain fluorine?
Hint: look at the mf column (molecular formula) and look for the element F, but be careful: there are also agrochemicals containing iron (Fe). Look at the column contents to see how you can separate these entries.
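One possible way to separate F from Fe is a negative lookahead, sketched below. The formulas in mf are toy values chosen for this example, and the pattern assumes that element symbols are written with a capital letter followed by an optional lowercase letter.

```r
library(stringr)

# Toy molecular formulas (assumption: illustrative values only)
mf <- c("C12H4Cl2F6N4OS", "C9H18FeN3S6", "C8H10N4O2")

# Naive search: also hits the F in Fe
naive <- str_detect(mf, "F")

# "F" that is not followed by a lowercase letter, so Fe is excluded
has_fluorine <- str_detect(mf, "F(?![a-z])")

n_naive <- sum(naive)
n_fluorine <- sum(has_fluorine)
```

Comparing n_naive with n_fluorine on the real column shows how many entries the naive pattern would have miscounted.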
Generate a list of the names of agrochemicals that have been contributed by Luxembourg contributors.
Use the sidsrcname column to detect Luxembourg.
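Filtering names by a match in another column could be sketched as follows. The data frame toy_agro and all source names in it are hypothetical placeholders; only the column names cmpdname and sidsrcname come from the data set.

```r
library(stringr)

# Hypothetical mini table (assumption: placeholder source names)
toy_agro <- data.frame(
  cmpdname = c("compound A", "compound B", "compound C"),
  sidsrcname = c(
    "Source X|Example Lab Luxembourg",
    "Source Z",
    "Example Institute of Luxembourg|Source X"
  )
)

# Logical index of rows whose source list mentions Luxembourg
from_lux <- str_detect(toy_agro$sidsrcname, "Luxembourg")

# Names of the matching compounds
lux_names <- toy_agro$cmpdname[from_lux]
```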
Finally, how many agrochemicals were contributed in 2023?
Use the cidcdate column, which is in YYYYMMDD format.
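Because the year occupies the first four characters, a prefix test is enough. The sketch below assumes the dates were read in as numbers, which is why they are converted to text first; the vector cidcdate is toy data.

```r
library(stringr)

# Toy dates in YYYYMMDD form (assumption: stored as numbers, as read_csv might do)
cidcdate <- c(20230115, 20051231, 20231022, 20120301)

# Convert to text, then test whether each string begins with the year
in_2023 <- str_starts(as.character(cidcdate), "2023")

n_2023 <- sum(in_2023)
```

If the column is already of type character, the as.character() call is harmless and can stay.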