String manipulation

Author

Roland Krause/Emma Schymanski

Published

February 6, 2024

Aims

This short tutorial introduces some functions of the stringr package and a few regular expressions.

It consists of a few basic questions and an application provided by Prof. Emma Schymanski for the Master in Data Science.

stringr functions

We will use the words data set that is built into stringr. It becomes available as soon as you load the package.

library(stringr)
length(words)
[1] 980

Select words that contain a y

str_subset(words, "y")
 [1] "already"     "always"      "any"         "apply"       "authority"  
 [6] "away"        "baby"        "beauty"      "body"        "boy"        
[11] "busy"        "buy"         "by"          "carry"       "city"       
[16] "community"   "company"     "copy"        "country"     "county"     
[21] "day"         "dry"         "early"       "easy"        "economy"    
[26] "employ"      "enjoy"       "every"       "eye"         "family"     
[31] "fly"         "friday"      "germany"     "goodbye"     "guy"        
[36] "happy"       "heavy"       "history"     "holiday"     "identify"   
[41] "industry"    "key"         "lady"        "lay"         "likely"     
[46] "many"        "marry"       "may"         "maybe"       "monday"     
[51] "money"       "necessary"   "okay"        "only"        "opportunity"
[56] "party"       "pay"         "play"        "policy"      "pretty"     
[61] "quality"     "ready"       "really"      "saturday"    "say"        
[66] "secretary"   "society"     "sorry"       "stay"        "story"      
[71] "strategy"    "study"       "sunday"      "supply"      "system"     
[76] "they"        "thirty"      "thursday"    "today"       "try"        
[81] "tuesday"     "twenty"      "type"        "university"  "very"       
[86] "way"         "wednesday"   "why"         "worry"       "year"       
[91] "yes"         "yesterday"   "yet"         "you"         "young"      

Retrieve a logical vector that indicates which words start with y

str_detect(words, '^y') |> head()
[1] FALSE FALSE FALSE FALSE FALSE FALSE
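The relationship between str_detect(), str_subset() and str_which() can be checked on a small example. This is only a sketch: the vector fruits is a made-up toy input and not part of the tutorial data.

```r
library(stringr)

# Hypothetical toy vector, used only for illustration
fruits <- c("cherry", "yam", "apple", "yuzu")

has_y    <- str_detect(fruits, "y")   # logical vector, one entry per word
starts_y <- str_detect(fruits, "^y")  # ^ anchors the match at the start

fruits[has_y]     # same words as str_subset(fruits, "y")
which(starts_y)   # same indices as str_which(fruits, "^y")
sum(starts_y)     # number of words that start with y
```

Because head() only shows the first six entries, summing the logical vector is a quick way to see whether any word matched at all.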

Retrieve the indices for all words containing ch within the word.

ch must not be present at the start or the end of the word.

# One possible solution: it works on words, but it rejects every word
# that starts with c or ends with h, wherever the ch occurs
str_which(words, "^[^c].*ch.*[^h]$")
[1]   7 449 497 724 725
# The minimal (good) solution
# .ch. needs one character on each side of ch, so a ch at the
# start or at the end of a word cannot match
str_which(words, ".ch.")
[1]   7 449 497 724 725
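To see how the two patterns above behave on edge cases, it can help to run them on a few hand-picked words. The vector test_words below is a hypothetical example chosen for that purpose. It is not taken from words.

```r
library(stringr)

# Hypothetical test words covering ch at the start, at the end and inside
test_words <- c("chair", "much", "church", "teacher", "coachman")

# One character is required on each side of ch
inner_ch <- str_detect(test_words, ".ch.")

# Additionally forbids a leading c and a trailing h
strict <- str_detect(test_words, "^[^c].*ch.*[^h]$")

data.frame(test_words, inner_ch, strict)
```

The two patterns disagree on "coachman": the ch sits inside the word, but the stricter pattern rejects it because the word starts with c.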

Extract the y and the previous character.

Note: Wrap the result in unique() to avoid printing repeated matches and the many NA entries for words without a match.

str_match(words, ".y") |> unique()
      [,1]
 [1,] NA  
 [2,] "dy"
 [3,] "ay"
 [4,] "ny"
 [5,] "ly"
 [6,] "ty"
 [7,] "by"
 [8,] "oy"
 [9,] "sy"
[10,] "uy"
[11,] "ry"
[12,] "py"
[13,] "my"
[14,] "ey"
[15,] "vy"
[16,] "fy"
[17,] "cy"
[18,] "gy"
[19,] "hy"
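str_match() returns a matrix and only the first match per word. As a sketch of two alternatives, str_extract() returns a plain character vector and str_extract_all() returns every match. The vector toy is a made-up input for illustration.

```r
library(stringr)

# Hypothetical toy input
toy <- c("yesterday", "sky", "yes", "mystery")

# First match per word; NA when no character precedes a y
first_hit <- str_extract(toy, ".y")
first_hit

# All matches per word, returned as a list
all_hits <- str_extract_all(toy, ".y")
all_hits[[4]]
```

A y at the very start of a word, as in "yes", cannot match because the dot needs a character in front of it.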

Virus Research

Read the genome sequence of the Hepatitis D virus: hepd.fasta. For now, just execute the following:

hepd <- readr::read_lines("https://biostat2.uni.lu/practicals/data/hepd.fasta")

What is the length of the genome sequence?

str_length(hepd)
[1] 1682

What is the sequence composition? How often does each character occur?

# Both the data and the pattern can be vectors;
# here the single sequence is counted once per base, in the order A, C, G, T
str_count(hepd, c("A", "C", "G","T"))
[1] 339 504 485 354
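The vectorised count is easier to read with names attached, and the counts should add up to the sequence length. The following sketch uses a short made-up sequence, toy_seq, instead of the virus genome.

```r
library(stringr)

# Hypothetical short sequence, not the Hepatitis D genome
toy_seq <- "GATTACA"
bases   <- c("A", "C", "G", "T")

# One count per pattern, labelled with the base it belongs to
counts <- str_count(toy_seq, bases)
names(counts) <- bases
counts

# Sanity check: the four counts cover the whole sequence
sum(counts) == str_length(toy_seq)

# Alternative: split into single characters and tabulate
char_table <- table(str_split(toy_seq, "")[[1]])
char_table
```

The tabulating variant would also reveal unexpected characters, such as N for an unknown base, which the fixed list of four patterns silently ignores.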

Find all motifs in the sequence

Find all occurrences of the motif ATG in the sequence.

str_locate_all(hepd, "ATG")
[[1]]
      start  end
 [1,]     1    3
 [2,]   130  132
 [3,]   378  380
 [4,]   581  583
 [5,]   586  588
 [6,]   637  639
 [7,]   686  688
 [8,]   695  697
 [9,]   758  760
[10,]   765  767
[11,]   858  860
[12,]   888  890
[13,]   893  895
[14,]  1015 1017
[15,]  1038 1040
[16,]  1089 1091
[17,]  1440 1442
[18,]  1457 1459
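str_locate_all() returns a list with one matrix per input string. A minimal sketch on a made-up sequence, toy_seq, shows how to pull out the start positions and how to verify the hits with str_sub().

```r
library(stringr)

# Hypothetical short sequence
toy_seq <- "ATGATGCATG"

# [[1]] selects the matrix that belongs to the single input string
hits <- str_locate_all(toy_seq, "ATG")[[1]]

hits[, "start"]   # start position of every match
nrow(hits)        # number of matches, same as str_count(toy_seq, "ATG")

# Cut the matches back out of the sequence to verify them
found <- str_sub(toy_seq, hits[, "start"], hits[, "end"])
found
```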

Cheminformatics Research

Cheminformatics in R can be challenging because it involves many string operations, often on characters that must be escaped in regular expressions. These exercises give some brief insights into cheminformatics with strings, using a data set of agrochemicals (pesticides) from PubChem.

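# read_csv() comes from readr; the tibble solutions below also use dplyr and tidyr
library(readr)
library(dplyr)
library(tidyr)
agro_data <- read_csv("data/PubChem_Agrochemicals_20231022.csv.gz")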
Rows: 3081 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): cmpdname, mf, inchi, isosmiles, canonicalsmiles, inchikey, iupacna...
dbl  (9): cid, mw, exactmass, monoisotopicmass, pclidcnt, gpidcnt, gpfamilyc...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

How many of these agrochemicals are also drugs (pharmaceuticals)?

Look in the annothits column for entries that have the tag Drug and Medication Information.

# using a column of the data frame as vector
str_subset(agro_data$annothits, "Drug and Medication Information") |> 
  length()
[1] 530
# Tibble context
agro_data |> 
  filter(str_detect(annothits, "Drug and Medication Information")) |> 
  nrow()
[1] 530
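The two approaches agree here, but they treat missing values differently, which matters when a third variant, summing str_detect(), is used. The sketch below uses a made-up vector annot; separating the tags with | is an assumption for illustration.

```r
library(stringr)

# Hypothetical annotation strings, including a missing value
annot <- c("Drug and Medication Information|Toxicity", "Toxicity", NA)
tag   <- "Drug and Medication Information"

# str_subset() silently drops NA
n_subset <- length(str_subset(annot, tag))

# str_detect() keeps NA, so a plain sum() becomes NA
n_detect <- sum(str_detect(annot, tag))

# Removing NA explicitly gives the same count as str_subset()
n_detect_clean <- sum(str_detect(annot, tag), na.rm = TRUE)

c(n_subset, n_detect, n_detect_clean)
```

Inside filter(), rows where the condition is NA are dropped as well, so the tibble solution behaves like str_subset().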

How many agrochemicals have stereochemistry information?

Stereochemistry information is encoded in the isosmiles column and detected by looking for @ or \ or / symbols.

str_detect(agro_data$isosmiles, "[@\\\\/]") |> 
  sum() # relies on counting TRUE and FALSE as 1 and 0; legit R but ugly
[1] 603
# tibble context
agro_data |> 
  filter(str_detect(isosmiles, "[@\\\\/]")) |> 
  # relocate(isosmiles) |> # just to check
  nrow()
[1] 603
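The four backslashes are needed because the pattern is parsed twice: once as an R string and once as a regular expression. The following sketch prints the pattern as the regex engine receives it and tries it on a few hand-written SMILES, which are illustrative and not taken from agro_data.

```r
library(stringr)

stereo_pattern <- "[@\\\\/]"

# The R string holds only two backslashes: the regex [@\\/]
writeLines(stereo_pattern)
nchar(stereo_pattern)

# Hypothetical SMILES: chiral centre, double-bond geometry, no stereo
smiles <- c("C[C@H](N)C(=O)O", "F/C=C/F", "F\\C=C\\F", "CCO")

has_stereo <- str_detect(smiles, stereo_pattern)
has_stereo
```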

What happens if you run this on the canonicalsmiles column instead?

What does this tell you about Canonical SMILES?

length(str_which(agro_data$canonicalsmiles,"[@/\\\\]"))
[1] 0
# Canonical SMILES are stereo-agnostic: they carry no stereochemistry markers

Return the canonicalsmiles of all agrochemicals containing a triple bond

Triple bonds are encoded by the # in the SMILES - canonical or isomeric. Use str_view to look at those that are also salts (encoded by .).

# canonical SMILES that contain a triple bond
t_bonds <- agro_data$canonicalsmiles[str_which(agro_data$canonicalsmiles,"[#]")]
# number of canonical SMILES with a triple bond
length(str_which(agro_data$canonicalsmiles,"[#]"))
[1] 180
# look at those that are also salts
str_view(t_bonds,pattern = "\\.")
 [18] │ C(#N)[S-]<.>[NH4+]
 [20] │ [C-]#N<.>[Na+]
 [58] │ C(#N)[S-]<.>[Na+]
 [59] │ C(#N)[S-]<.>[K+]
 [83] │ C(#N)N=C([S-])[S-]<.>[Na+]<.>[Na+]
 [88] │ C(#N)[S-]<.>[Cu+]
 [89] │ C(#N)[S-]<.>[Cu+]
[101] │ C(#N)[S-]<.>C(#N)[S-]<.>[Cu+2]
[102] │ C(#N)[S-]<.>C(#N)[S-]<.>[Cu+2]
[103] │ C(#N)S<.>[Cu]
# tibble context: SMILES with a triple bond AND a salt, both symbols highlighted
agro_data |> 
  filter(str_detect(canonicalsmiles, "(#.*\\.)|(\\..*#)")) |> 
  pull(canonicalsmiles) |> 
  str_view("[#\\.]")
 [1] │ C(<#>N)[S-]<.>[NH4+]
 [2] │ [C-]<#>N<.>[Na+]
 [3] │ C(<#>N)[S-]<.>[Na+]
 [4] │ C(<#>N)[S-]<.>[K+]
 [5] │ C(<#>N)N=C([S-])[S-]<.>[Na+]<.>[Na+]
 [6] │ C(<#>N)[S-]<.>[Cu+]
 [7] │ C(<#>N)[S-]<.>[Cu+]
 [8] │ C(<#>N)[S-]<.>C(<#>N)[S-]<.>[Cu+2]
 [9] │ C(<#>N)[S-]<.>C(<#>N)[S-]<.>[Cu+2]
[10] │ C(<#>N)S<.>[Cu]
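The dot has to be escaped because an unescaped dot matches any character. A small sketch on made-up SMILES shows the difference, together with fixed() as an alternative to escaping and a way to combine two conditions without an alternation.

```r
library(stringr)

# Hypothetical SMILES: salt with triple bond, triple bond only, salt only
smiles <- c("[C-]#N.[Na+]", "CC#N", "[Na+].[Cl-]")

any_char <- str_detect(smiles, ".")         # matches every non-empty string
escaped  <- str_detect(smiles, "\\.")       # matches a literal dot
literal  <- str_detect(smiles, fixed("."))  # same result without escaping

# Triple bond AND salt, written as two simple tests
salt_triple <- str_detect(smiles, fixed("#")) & str_detect(smiles, fixed("."))
smiles[salt_triple]
```

Combining two str_detect() calls with & does not depend on the order of the symbols, so no alternation such as (#.*\\.)|(\\..*#) is needed.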

How many agrochemicals contain fluorine?

Hint: look at the mf column (molecular formula) and look for the element F but be careful, there are also agrochemicals containing iron (Fe) as well. Look at the column contents to see how you can separate these entries.

# vector solution
# F followed by a digit or an uppercase letter is fluorine; Fe is iron
length(str_which(agro_data$mf,"F[\\d[A-Z]$]"))
[1] 481
# tibble solution
agro_data |> 
  filter(str_detect(mf, "F[\\d[A-Z]$]")) |> 
  relocate(mf) |> 
  nrow()
[1] 481
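Another way to separate fluorine from iron is a negative lookahead: an F that is not followed by a lowercase letter. This is a sketch of an alternative technique, not the solution used above, and the formulas are hand-written examples. It also covers formulas that end in F and shows what an end anchor after the character class does.

```r
library(stringr)

# Hypothetical molecular formulas in Hill notation
formulas <- c("C6H5F", "CF4", "C10H10Fe", "C2H4FNO", "C6H12O6")

# F not followed by a lowercase letter: fluorine, but not Fe
has_fluorine <- str_detect(formulas, "F(?![a-z])")
has_fluorine

# An end anchor after the class only matches F plus one final character
anchored <- str_detect(formulas, "F[A-Z0-9]$")
anchored
```

The anchored pattern misses "C6H5F" and "C2H4FNO", which is why a misplaced $ leads to a much smaller count.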

Generate a list of the names of agrochemicals that have been contributed by Luxembourg contributors.

Use the sidsrcname column to detect Luxembourg.

# simple vector solution

lux_list <- agro_data$cmpdname[str_which(agro_data$sidsrcname,"Luxembourg")]


# tibble solution
lux_list_sep <- agro_data |> 
  separate_longer_delim(sidsrcname , "|")|>   # one row per contributing source
  filter(str_detect(sidsrcname, "Luxembourg")) |> 
  distinct(cmpdname)  |> pull()

length(lux_list) == length(lux_list_sep)
[1] TRUE
# Both solutions match any source name that contains "Luxembourg", so they would
# also pick up "Jardin du Luxembourg" or "Luxembourg province, Belgium" (not present here)
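The effect of separate_longer_delim() is easiest to see on a tiny table. The following sketch uses a made-up tibble, toy, with invented compound and source names; only the column names follow the real data.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical data: several sources per compound, separated by |
toy <- tibble(
  cmpdname   = c("Compound A", "Compound B", "Compound C"),
  sidsrcname = c("Source X|Luxembourg Lab 1",
                 "Source X|Source Y",
                 "Luxembourg Lab 1|Luxembourg Lab 2")
)

# One row per source, so each source is tested on its own
long <- toy |>
  separate_longer_delim(sidsrcname, delim = "|")
nrow(long)

lux_names <- long |>
  filter(str_detect(sidsrcname, "Luxembourg")) |>
  distinct(cmpdname) |>   # Compound C matches twice but is listed once
  pull(cmpdname)
lux_names
```

After splitting, a compound with two matching sources appears in two rows, which is why distinct() is needed before the names are pulled out.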

Finally, how many agrochemicals were contributed in 2023?

Use the cidcdate column, which is in YYYYMMDD format.

# straightforward solution
length(str_which(agro_data$cidcdate,"^2023"))
[1] 4
# lubridate version using year() and ymd()
library(lubridate)

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
agro_data |> 
  filter(year(ymd(cidcdate)) == 2023) |> 
  nrow()
[1] 4
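Because the YYYYMMDD format has a fixed width, the date parts can also be cut out by position with str_sub(). A minimal sketch on made-up dates, stored as character strings for simplicity:

```r
library(stringr)

# Hypothetical creation dates in YYYYMMDD format
created <- c("20230115", "20051231", "20231022")

# ^ anchors the year at the start, so 2023 elsewhere in the string is ignored
in_2023 <- str_which(created, "^2023")
in_2023

# Fixed-width fields can be extracted by position
years  <- as.integer(str_sub(created, 1, 4))
months <- str_sub(created, 5, 6)
table(years)
```

Tabulating the extracted years gives the counts for all years at once instead of one regular expression per year.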