library(stringr)
length(words)
[1] 980
Roland Krause/Emma Schymanski
R Workshop
February 10, 2025
This little tutorial aims to make you familiar with some of the functions of the stringr package and a few regular expressions.
It consists of introductory questions and an application in cheminformatics provided by Prof. Emma Schymanski for the Master in Data Science.

stringr functions
We will be using the words data that is built into stringr. The data set is available to you if you load the package.
y
ch within the word. ch must not be present at the start or the end of the word.
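As an illustration of the "ch within the word" condition, here is a minimal sketch on a made-up vector (toy_words is hypothetical and stands in for words). It assumes one reading of the task: ch needs at least one character on either side.

```r
library(stringr)

# Hypothetical toy vector standing in for `words`
toy_words <- c("chair", "much", "teacher", "school", "machine", "apple")

# ".ch." requires one character before and one after "ch",
# so a "ch" at the very start or very end of a word cannot match
inner_ch <- str_subset(toy_words, ".ch.")
print(inner_ch)
```

Note that a word that starts with ch and also contains a second, inner ch would still match this pattern; anchoring with ^ and $ would be needed to exclude it.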
y and the previous character.
Note: Use the function unique() around the results to avoid printing many empty matches.
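The following sketch shows one possible way to extract y together with the character before it, again on a made-up vector (toy_words is hypothetical). With str_extract(), words without a match yield NA, which is dropped here before calling unique().

```r
library(stringr)

# Hypothetical toy vector standing in for `words`
toy_words <- c("yes", "happy", "copy", "play", "tree", "by")

# ".y" matches any single character followed by "y";
# a "y" at the start of a word has no previous character and does not match
pairs <- str_extract(toy_words, ".y")

# Drop the NA entries from non-matching words, then remove duplicates
unique_pairs <- unique(pairs[!is.na(pairs)])
print(unique_pairs)
```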
Read the genome sequence of the Hepatitis D virus: hepd.fasta. For now, just execute the following:
Find all matches of the motif ATG in the sequence.
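To show what locating a motif looks like without relying on the hepd.fasta file, here is a small sketch on a made-up string (toy_seq is hypothetical, not the Hepatitis D genome). It uses str_count() for the number of matches and str_locate_all() for their positions.

```r
library(stringr)

# Hypothetical toy sequence, not taken from hepd.fasta
toy_seq <- "ATGCCATGTTAATGA"

# Number of non-overlapping occurrences of the motif
n_atg <- str_count(toy_seq, "ATG")

# str_locate_all() returns a list with one start/end matrix per input string
atg_pos <- str_locate_all(toy_seq, "ATG")[[1]]
starts <- atg_pos[, "start"]

print(n_atg)
print(atg_pos)
```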
Cheminformatics manipulations in R can be challenging because they involve many string operations, often with characters that must be escaped. These exercises give some brief insights into cheminformatics with strings, using a dataset of agrochemicals (pesticides) from PubChem.
Rows: 3081 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): cmpdname, mf, inchi, isosmiles, canonicalsmiles, inchikey, iupacna...
dbl (9): cid, mw, exactmass, monoisotopicmass, pclidcnt, gpidcnt, gpfamilyc...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Look in the annothits column for entries that have the tag Drug and Medication Information.
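A sketch of the tag lookup on a made-up vector is shown below. The vector toy_annothits and the "|" separator between tags are assumptions for illustration; check the real annothits column to see how the tags are stored. fixed() is used so the tag is matched literally rather than as a regular expression.

```r
library(stringr)

# Hypothetical toy values; the "|" separator is an assumption
toy_annothits <- c(
  "Agrochemical Information|Drug and Medication Information|Toxicity",
  "Agrochemical Information|Safety and Hazards",
  NA
)

# TRUE where the tag occurs, NA where the entry itself is missing
has_drug_tag <- str_detect(toy_annothits, fixed("Drug and Medication Information"))

# Count the tagged entries, ignoring missing values
n_drug <- sum(has_drug_tag, na.rm = TRUE)
print(n_drug)
```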
Stereochemistry information is encoded in the isosmiles column and detected by looking for @ or \ or / symbols.
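The backslash is the tricky part of this pattern: it has to be escaped once for the regular expression and once more for the R string. The sketch below tries this on a few made-up SMILES (toy_smiles is hypothetical and not taken from the dataset).

```r
library(stringr)

# Hypothetical toy SMILES; "\\" in an R string is a single backslash
toy_smiles <- c("C[C@H](N)C(=O)O", "F/C=C/F", "F/C=C\\F", "CCO")

# Character class with @, / and a literal backslash.
# The R string "[@/\\\\]" is the regex [@/\\], where \\ is one backslash.
has_stereo <- str_detect(toy_smiles, "[@/\\\\]")
print(has_stereo)
```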
canonicalsmiles of all agrochemicals containing a triple bond
Triple bonds are encoded by the # in the SMILES - canonical or isomeric. Use str_view to look at those that are also salts (encoded by .).
# canonical SMILES that contain a triple bond
t_bonds <- agro_data$canonicalsmiles[str_which(agro_data$canonicalsmiles, "[#]")]
# number of canonical SMILES with a triple bond
length(str_which(agro_data$canonicalsmiles, "[#]"))
[1] 180
[18] │ C(#N)[S-]<.>[NH4+]
[20] │ [C-]#N<.>[Na+]
[58] │ C(#N)[S-]<.>[Na+]
[59] │ C(#N)[S-]<.>[K+]
[83] │ C(#N)N=C([S-])[S-]<.>[Na+]<.>[Na+]
[88] │ C(#N)[S-]<.>[Cu+]
[89] │ C(#N)[S-]<.>[Cu+]
[101] │ C(#N)[S-]<.>C(#N)[S-]<.>[Cu+2]
[102] │ C(#N)[S-]<.>C(#N)[S-]<.>[Cu+2]
[103] │ C(#N)S<.>[Cu]
# canonical SMILES with triple bonds AND salts (filter() and pull() are from dplyr)
agro_data |>
  filter(str_detect(canonicalsmiles, "(#.*\\.)|(\\..*#)")) |>
  pull(canonicalsmiles) |>
  str_view("[#\\.]")
[1] │ C(<#>N)[S-]<.>[NH4+]
[2] │ [C-]<#>N<.>[Na+]
[3] │ C(<#>N)[S-]<.>[Na+]
[4] │ C(<#>N)[S-]<.>[K+]
[5] │ C(<#>N)N=C([S-])[S-]<.>[Na+]<.>[Na+]
[6] │ C(<#>N)[S-]<.>[Cu+]
[7] │ C(<#>N)[S-]<.>[Cu+]
[8] │ C(<#>N)[S-]<.>C(<#>N)[S-]<.>[Cu+2]
[9] │ C(<#>N)[S-]<.>C(<#>N)[S-]<.>[Cu+2]
[10] │ C(<#>N)S<.>[Cu]
Hint: look at the mf column (molecular formula) for the element F, but be careful: some agrochemicals contain iron (Fe). Look at the column contents to see how you can separate these entries.
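One possible way to separate F from Fe is a negative lookahead, sketched here on a made-up vector (toy_mf is hypothetical and stands in for the mf column). The pattern assumes that element symbols are written as an uppercase letter optionally followed by a lowercase letter, so an F followed by any lowercase letter is part of a different element.

```r
library(stringr)

# Hypothetical toy formulas standing in for the mf column
toy_mf <- c("C13H16F3N3O4", "C9H18FeN3S6", "CH3F", "C2H6O")

# "F(?![a-z])" matches an F that is NOT followed by a lowercase letter,
# so the F in Fe is skipped, while F followed by a digit,
# another element, or the end of the string is kept
has_fluorine <- str_detect(toy_mf, "F(?![a-z])")
print(toy_mf[has_fluorine])
```

A simpler pattern such as "F[0-9]" would miss formulas with a single fluorine atom, because no count follows the F in that case.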