Roland Krause/Emma Schymanski
February 6, 2024
This short tutorial aims to make you familiar with some of the functions of the stringr package and a few regular expressions.
It consists of a few basic questions and an application provided by Prof. Emma Schymanski for the Master in Data Science.
stringr functions
We will be using the words data set that is built into stringr. The data set is available to you as soon as you load the package.
library(stringr)
length(words)
[1] 980
Words containing y:
[1] "already" "always" "any" "apply" "authority"
[6] "away" "baby" "beauty" "body" "boy"
[11] "busy" "buy" "by" "carry" "city"
[16] "community" "company" "copy" "country" "county"
[21] "day" "dry" "early" "easy" "economy"
[26] "employ" "enjoy" "every" "eye" "family"
[31] "fly" "friday" "germany" "goodbye" "guy"
[36] "happy" "heavy" "history" "holiday" "identify"
[41] "industry" "key" "lady" "lay" "likely"
[46] "many" "marry" "may" "maybe" "monday"
[51] "money" "necessary" "okay" "only" "opportunity"
[56] "party" "pay" "play" "policy" "pretty"
[61] "quality" "ready" "really" "saturday" "say"
[66] "secretary" "society" "sorry" "stay" "story"
[71] "strategy" "study" "sunday" "supply" "system"
[76] "they" "thirty" "thursday" "today" "try"
[81] "tuesday" "twenty" "type" "university" "very"
[86] "way" "wednesday" "why" "worry" "year"
[91] "yes" "yesterday" "yet" "you" "young"
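The listing above appears to contain every word with a y anywhere in it. A minimal sketch of how such a listing could be produced is shown here on a small made-up vector (toy_words is a stand-in for words, chosen only to keep the output short); it is one possible approach, not necessarily the one used in the tutorial.

```r
library(stringr)

# Toy vector standing in for stringr::words (an assumption made for brevity)
toy_words <- c("apple", "baby", "eye", "yes", "tree")

# str_subset() keeps only the elements that match the pattern
with_y <- str_subset(toy_words, "y")
with_y

# str_detect() returns the underlying logical vector, which is handy for counting
n_with_y <- sum(str_detect(toy_words, "y"))
n_with_y
```

Replacing toy_words with words should give a result comparable to the listing above.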
y
ch within the word. ch must not be present at the start or the end of the word.
y and the previous character. Note: Use the function unique() around the results to avoid printing many empty matches.
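The two conditions above (ch strictly inside a word, and y together with its preceding character) can be sketched with plain regular expressions. The vector toy_words is made up for illustration, and the patterns are one possible reading of the exercises rather than the official solution.

```r
library(stringr)

# Toy vector (an assumption); replace with words for the real exercise
toy_words <- c("church", "teacher", "much", "child", "baby", "yes", "day")

# ".ch." requires one character before and one character after "ch",
# so a "ch" at the very start or very end of a word cannot match on its own
inner_ch <- str_subset(toy_words, ".ch.")
inner_ch

# ".y" matches a y together with the character in front of it.
# str_extract() returns NA for words without a match; unique() collapses
# the repeated NA values and repeated matches into one entry each.
y_with_prev <- unique(str_extract(toy_words, ".y"))
y_with_prev
```

Note that "yes" yields NA here because its only y has no preceding character.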
Read the genome sequence of the Hepatitis D virus: hepd.fasta. For now, just execute the following:
Find all matches of the sequence ATG in the sequence.
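As a hedged illustration of locating a motif, the sketch below counts and locates ATG in a short made-up sequence. The object toy_seq is an assumption; in the exercise the input would be the sequence read from hepd.fasta, and the code for reading that file is not reproduced here.

```r
library(stringr)

# Short made-up sequence; the real input would be the Hepatitis D genome
toy_seq <- "CCATGGATGA"

# Number of (non-overlapping) occurrences of the motif
n_atg <- str_count(toy_seq, "ATG")
n_atg

# Start and end positions of every match, as a two-column integer matrix
atg_pos <- str_locate_all(toy_seq, "ATG")[[1]]
atg_pos
```

str_locate_all() returns a list with one matrix per input string, hence the [[1]] for a single sequence.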
Cheminformatics manipulations in R can be challenging because they involve many string operations, often with escape characters. These exercises give some brief insights into cheminformatics with strings, using a dataset of agrochemicals (pesticides) from PubChem.
Rows: 3081 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): cmpdname, mf, inchi, isosmiles, canonicalsmiles, inchikey, iupacna...
dbl (9): cid, mw, exactmass, monoisotopicmass, pclidcnt, gpidcnt, gpfamilyc...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Look in the annothits column for entries that have the tag Drug and Medication Information.
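A minimal sketch of the tag search follows. The vector toy_annothits and the "|" separator between tags are assumptions made for illustration; the actual layout of the annothits column should be checked in the data.

```r
library(stringr)

# Made-up annotation strings; the "|" separator is an assumption
toy_annothits <- c(
  "Agrochemical Information|Drug and Medication Information",
  "Agrochemical Information|Safety and Hazards",
  NA
)

# fixed() treats the tag as literal text rather than as a regular expression
has_drug_tag <- str_detect(toy_annothits, fixed("Drug and Medication Information"))
has_drug_tag

# Missing annotations give NA, so they are excluded explicitly when counting
n_drug <- sum(has_drug_tag, na.rm = TRUE)
n_drug
```

On the real data the same str_detect() call could be used inside filter() on agro_data.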
Stereochemistry information is encoded in the isosmiles column and detected by looking for @ or \ or / symbols.
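The backslash is the tricky part of this search, because it has to be escaped once for the regular expression and once more for the R string. The sketch below uses made-up SMILES strings (toy_smiles is an assumption) to show one way of writing the character class.

```r
library(stringr)

# Made-up SMILES: chiral centre (@), double-bond geometry (/ and \), no stereo
toy_smiles <- c("C[C@H](N)C(=O)O", "F/C=C\\F", "CCO")

# In an R string literal "\\\\" becomes the regex \\,
# which matches one literal backslash inside the character class
stereo_pattern <- "[@/\\\\]"

has_stereo <- str_detect(toy_smiles, stereo_pattern)
has_stereo
```

The same pattern can be applied to another SMILES column to compare the counts.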
canonicalsmiles column instead? What does this tell you about Canonical SMILES?
canonicalsmiles of all agrochemicals containing a triple bond
Triple bonds are encoded by the # in the SMILES, whether canonical or isomeric. Use str_view to look at those that are also salts (encoded by .).
# canonical SMILES of all entries containing a triple bond
t_bonds <- agro_data$canonicalsmiles[str_which(agro_data$canonicalsmiles, "[#]")]
# number of triple bond canonical SMILES
length(t_bonds)
[1] 180
[18] │ C(#N)[S-]<.>[NH4+]
[20] │ [C-]#N<.>[Na+]
[58] │ C(#N)[S-]<.>[Na+]
[59] │ C(#N)[S-]<.>[K+]
[83] │ C(#N)N=C([S-])[S-]<.>[Na+]<.>[Na+]
[88] │ C(#N)[S-]<.>[Cu+]
[89] │ C(#N)[S-]<.>[Cu+]
[101] │ C(#N)[S-]<.>C(#N)[S-]<.>[Cu+2]
[102] │ C(#N)[S-]<.>C(#N)[S-]<.>[Cu+2]
[103] │ C(#N)S<.>[Cu]
# SMILES containing both a triple bond (#) and a salt separator (.), in either order
agro_data |>
  filter(str_detect(canonicalsmiles, "(#.*\\.)|(\\..*#)")) |>
  pull(canonicalsmiles) |>
  str_view("[#\\.]")
[1] │ C(<#>N)[S-]<.>[NH4+]
[2] │ [C-]<#>N<.>[Na+]
[3] │ C(<#>N)[S-]<.>[Na+]
[4] │ C(<#>N)[S-]<.>[K+]
[5] │ C(<#>N)N=C([S-])[S-]<.>[Na+]<.>[Na+]
[6] │ C(<#>N)[S-]<.>[Cu+]
[7] │ C(<#>N)[S-]<.>[Cu+]
[8] │ C(<#>N)[S-]<.>C(<#>N)[S-]<.>[Cu+2]
[9] │ C(<#>N)[S-]<.>C(<#>N)[S-]<.>[Cu+2]
[10] │ C(<#>N)S<.>[Cu]
Hint: look at the mf column (molecular formula) and look for the element F, but be careful: there are also agrochemicals containing iron (Fe). Look at the column contents to see how you can separate these entries.
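One way to separate F from Fe is a negative lookahead, which stringr supports. The formulas in toy_mf are made up for illustration, and the pattern is a sketch under the assumption that element symbols in mf follow the usual convention of one capital letter optionally followed by a lowercase letter.

```r
library(stringr)

# Made-up molecular formulas: two with fluorine, one with iron, one with neither
toy_mf <- c("C9H10F3N", "C6H9FeN3S6", "C2H6O", "C10H5ClFN")

# "F(?![a-z])" matches an F that is NOT followed by a lowercase letter,
# so Fe (iron) is skipped while F3 or FN still count as fluorine
has_fluorine <- str_detect(toy_mf, "F(?![a-z])")
has_fluorine
```

A simpler pattern such as "F[^e]" would miss formulas that end in F, which is why the lookahead is used in this sketch.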
Use the sidsrcname column to detect Luxembourg.
# simple vector solution
lux_list <- agro_data$cmpdname[str_which(agro_data$sidsrcname, "Luxembourg")]
# alternative: split sidsrcname into one row per source, then filter
lux_list_sep <- agro_data |>
  separate_longer_delim(sidsrcname, "|") |> # this will split rows
  filter(str_detect(sidsrcname, "Luxembourg")) |>
  distinct(cmpdname) |>
  pull()
# both approaches should find the same number of compounds
length(lux_list) == length(lux_list_sep)
[1] TRUE
Use the cidcdate column, which is in YYYYMMDD format.
[1] 4
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
[1] 4
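The two identical results above suggest that the date question was solved in two ways, once with strings and once with dates. The sketch below illustrates both routes on made-up values; toy_cidcdate, the chosen year, and the use of base as.Date() instead of lubridate are all assumptions, since the original code is not shown.

```r
library(stringr)

# Made-up creation dates in YYYYMMDD format, stored as text
# (wrap the column in as.character() first if it was read as a number)
toy_cidcdate <- c("20050326", "20050809", "20180525", "20211130")

# Option 1: pure string handling, the first four characters are the year
years_str <- str_sub(toy_cidcdate, 1, 4)

# Option 2: parse to Date, then format the year back out
dates <- as.Date(toy_cidcdate, format = "%Y%m%d")
years_date <- format(dates, "%Y")

# Count the entries created in a given (hypothetical) year of interest
n_2005 <- sum(years_str == "2005")
n_2005
```

With lubridate loaded, ymd() and year() could replace the as.Date() and format() calls.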