Roland Krause/Emma Schymanski
R Workshop
February 10, 2025
This little tutorial aims to make you familiar with some of the functions of the stringr package and a few regular expressions.
It consists of introductory questions and an application in cheminformatics provided by Prof. Emma Schymanski for the Master in Data Science.
stringr functions
We will be using the words data set that is built into stringr. The data set is available to you if you load the package.
library(stringr)
length(words)
[1] 980
y
ch within the word. ch must not be present at the start or the end of the word.
y and the previous character.
Note: Use the function unique() around the results to avoid printing many empty matches.
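As a hedged sketch of how such patterns can be put together, the following uses a small made-up vector named toy rather than the words data. It assumes the tasks are to find ch strictly inside a word and to pull out a final y together with the character before it. The toy vector and the variable names are illustrative only.

```r
library(stringr)

# Hypothetical toy vector, used here instead of the built-in words data
toy <- c("church", "teacher", "chair", "much", "machine", "happy", "year")

# "ch" with at least one character on each side,
# and no "ch" at the very start or end of the word
inner_ch <- toy[str_detect(toy, ".ch.") & !str_detect(toy, "^ch|ch$")]
inner_ch

# Final "y" plus the character before it; words without a match give NA,
# so unique() collapses the repeated non-matches into a single entry
final_y <- unique(str_extract(toy, ".y$"))
final_y
```

Combining two simple str_detect() calls is one possible design; a single pattern with lookarounds would also work but is harder to read.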
Read the genome sequence of the Hepatitis D virus: hepd.fasta. For now, just execute the following:
Find all matches of the sequence ATG in the sequence.
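A minimal sketch of counting and locating a motif, using a short made-up string named dna in place of the genome (how the FASTA file is read is not shown here and is not assumed):

```r
library(stringr)

# Hypothetical short sequence standing in for the hepatitis D genome
dna <- "CCATGGATGA"

# Number of ATG matches
n_atg <- str_count(dna, "ATG")
n_atg

# Start and end positions of each match (one matrix per input string)
pos_atg <- str_locate_all(dna, "ATG")[[1]]
pos_atg
```

str_locate_all() returns a list with one matrix per input string, which is why the first element is selected with [[1]].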
Cheminformatics manipulations in R can be challenging because they involve many string operations, often on characters that must be escaped. These exercises give some brief insights into cheminformatics with strings, using a dataset of agrochemicals (pesticides) from PubChem.
Rows: 3081 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): cmpdname, mf, inchi, isosmiles, canonicalsmiles, inchikey, iupacna...
dbl (9): cid, mw, exactmass, monoisotopicmass, pclidcnt, gpidcnt, gpfamilyc...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Look in the annothits column for entries that have the tag Drug and Medication Information.
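One way to test for a literal tag is str_detect() with fixed(), sketched below on a made-up vector. The example values and the "|" separator between tags are assumptions for illustration and are not taken from the data set.

```r
library(stringr)

# Hypothetical annothits values; the "|" separator is an assumption
annothits <- c(
  "Agrochemical Information|Drug and Medication Information",
  "Agrochemical Information|Toxicity",
  NA
)

# fixed() matches the tag literally instead of treating it as a regex
has_drug_tag <- str_detect(annothits, fixed("Drug and Medication Information"))
has_drug_tag

# Count the entries carrying the tag, ignoring missing annotations
sum(has_drug_tag, na.rm = TRUE)
```

Missing values stay NA in the result, so counts need na.rm = TRUE.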
Stereochemistry information is encoded in the isosmiles column and detected by looking for @ or \ or / symbols.
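The backslash is the tricky part: it has to be escaped once for the regular expression and once more for the R string. A minimal sketch on made-up SMILES strings (the vector below is illustrative, not taken from the data set):

```r
library(stringr)

# Hypothetical isomeric SMILES; "\\" in R source is a single backslash
isosmiles <- c("C[C@H](N)C(=O)O", "C/C=C/C", "CCO", "F/C=C\\F")

# Character class containing @, / and a literal backslash
# (the regex is [@/\\], written as "[@/\\\\]" in an R string)
has_stereo <- str_detect(isosmiles, "[@/\\\\]")
has_stereo
```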
canonicalsmiles of all agrochemicals containing a triple bond
Triple bonds are encoded by the # in the SMILES, canonical or isomeric. Use str_view to look at those that are also salts (encoded by .).
# canonical SMILES of all entries containing a triple bond
t_bonds <- agro_data$canonicalsmiles[str_which(agro_data$canonicalsmiles, "[#]")]
# number of canonical SMILES containing a triple bond
length(str_which(agro_data$canonicalsmiles, "[#]"))
[1] 180
# view the triple-bond SMILES that are also salts (contain a ".")
str_view(t_bonds, "[.]")
[18] │ C(#N)[S-]<.>[NH4+]
[20] │ [C-]#N<.>[Na+]
[58] │ C(#N)[S-]<.>[Na+]
[59] │ C(#N)[S-]<.>[K+]
[83] │ C(#N)N=C([S-])[S-]<.>[Na+]<.>[Na+]
[88] │ C(#N)[S-]<.>[Cu+]
[89] │ C(#N)[S-]<.>[Cu+]
[101] │ C(#N)[S-]<.>C(#N)[S-]<.>[Cu+2]
[102] │ C(#N)[S-]<.>C(#N)[S-]<.>[Cu+2]
[103] │ C(#N)S<.>[Cu]
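# SMILES containing both a triple bond (#) and a salt separator (.),
# with both symbols highlighted; filter() and pull() come from dplyr
library(dplyr)
agro_data |>
  filter(str_detect(canonicalsmiles, "(#.*\\.)|(\\..*#)")) |>
  pull(canonicalsmiles) |>
  str_view("[#\\.]")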
[1] │ C(<#>N)[S-]<.>[NH4+]
[2] │ [C-]<#>N<.>[Na+]
[3] │ C(<#>N)[S-]<.>[Na+]
[4] │ C(<#>N)[S-]<.>[K+]
[5] │ C(<#>N)N=C([S-])[S-]<.>[Na+]<.>[Na+]
[6] │ C(<#>N)[S-]<.>[Cu+]
[7] │ C(<#>N)[S-]<.>[Cu+]
[8] │ C(<#>N)[S-]<.>C(<#>N)[S-]<.>[Cu+2]
[9] │ C(<#>N)[S-]<.>C(<#>N)[S-]<.>[Cu+2]
[10] │ C(<#>N)S<.>[Cu]
Hint: look at the mf column (molecular formula) and look for the element F, but be careful: there are also agrochemicals containing iron (Fe). Look at the column contents to see how you can separate these entries.
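One possible approach, sketched on made-up formulas, is to match an F that is not followed by a lowercase letter. This relies on the assumption that element symbols in mf are written as one capital letter plus an optional lowercase letter, so Fe is skipped.

```r
library(stringr)

# Hypothetical molecular formulas, not taken from the data set
mf <- c("C2HF3", "C4H6FeO4", "C6H5F", "C10H8")

# F not followed by a lowercase letter: matches fluorine, skips Fe
has_fluorine <- str_detect(mf, "F(?![a-z])")
has_fluorine
```

The negative lookahead (?![a-z]) also succeeds at the end of the string, so a formula ending in F is still detected.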