String manipulation

Author
Affiliation

Roland Krause/Emma Schymanski

R Workshop

Published

February 10, 2025

Aims

This little tutorial aims to make you familiar with some of the functions of the stringr package and a few regular expressions.

It consists of introductory questions and an application in cheminformatics provided by Prof. Emma Schymanski for the Master in Data Science.

stringr functions

We will use the words data set that is built into stringr. It becomes available as soon as you load the package.

library(stringr)
length(words)
[1] 980

Retrieve a logical vector that indicates which words start with y. The call to head() shows only the first six elements.

str_detect(words, '^y') |> head()
[1] FALSE FALSE FALSE FALSE FALSE FALSE
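The logical vector can be put to work directly. The following is a small sketch using only stringr and base R: summing the vector counts the matches, and str_subset() returns the matching words themselves. The variable names are illustrative and not part of the exercise.

```r
library(stringr)

# TRUE for every word that starts with y
starts_with_y <- str_detect(words, "^y")

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of matches
n_y <- sum(starts_with_y)

# str_subset() keeps the matching words; same result as words[starts_with_y]
y_words <- str_subset(words, "^y")
y_words
```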

Retrieve the indices for all words containing ch within the word.

ch must not be present at the start or the end of the word.

# One solution that works for the words data - many exist.
# Note that it also rejects every word that starts with c or ends with h,
# even when ch occurs in the middle.
str_which(words, "^[^c].*ch.*[^h]$")
[1]   7 449 497 724 725
# The minimal (good) solution:
# requiring one character before and one character after ch means that
# ch at the start or the end of a word can never match
str_which(words, ".ch.")
[1]   7 449 497 724 725
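Both patterns return the same indices for the words data, but they are not equivalent in general. The sketch below uses a handful of made-up test words (an assumption, they are not taken from words) to show where the two differ.

```r
library(stringr)

# hypothetical test words, not taken from the words data
toy <- c("church", "achieve", "cachet", "chat", "each", "teacher")

# strict pattern: misses "cachet" because the word starts with c
strict <- str_which(toy, "^[^c].*ch.*[^h]$")
strict
# [1] 2 6

# minimal pattern: finds every word with ch strictly inside
minimal <- str_which(toy, ".ch.")
minimal
# [1] 2 3 6
```

The minimal pattern is therefore the safer choice when the input is not known in advance.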

Extract the y and the previous character.

Note: Wrap the result in unique() to avoid printing repeated matches and the many NA values from words without a match.

str_match(words, ".y") |> unique()
      [,1]
 [1,] NA  
 [2,] "dy"
 [3,] "ay"
 [4,] "ny"
 [5,] "ly"
 [6,] "ty"
 [7,] "by"
 [8,] "oy"
 [9,] "sy"
[10,] "uy"
[11,] "ry"
[12,] "py"
[13,] "my"
[14,] "ey"
[15,] "vy"
[16,] "fy"
[17,] "cy"
[18,] "gy"
[19,] "hy"
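str_match() returns a matrix because it is designed for capture groups. When only the whole match is needed, str_extract() is a possible alternative that returns a plain character vector. A minimal sketch on made-up words:

```r
library(stringr)

toy <- c("day", "yes", "try", "sky")  # hypothetical examples

# str_extract() returns a character vector, NA where nothing matches;
# "yes" gives NA because no character precedes its y
pairs <- str_extract(toy, ".y")
pairs
# [1] "ay" NA   "ry" "ky"

# drop the NAs and duplicates
unique(pairs[!is.na(pairs)])
# [1] "ay" "ry" "ky"
```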

Virus Research

Read the genome sequence of the Hepatitis D virus: hepd.fasta. For now, just execute the following:

hepd <- readr::read_lines("https://biostat2.uni.lu/practicals/data/hepd.fasta")

What is the length of the genome sequence?

str_length(hepd)
[1] 1682

What is the sequence composition? How often does each character occur?

# str_count() is vectorised over both the string and the pattern,
# so each of the four bases is counted in turn
str_count(hepd, c("A", "C", "G", "T"))
[1] 339 504 485 354
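To make the vectorised count easier to read, the result can be labelled and checked against the sequence length. The sketch below uses a short made-up sequence, not the HDV genome; the names toy_seq, bases, counts and gc are illustrative.

```r
library(stringr)

toy_seq <- "ATGCGATACG"  # hypothetical sequence, not the HDV genome
bases <- c("A", "C", "G", "T")

# one count per base; setNames() labels the result
counts <- setNames(str_count(toy_seq, bases), bases)
counts
# A C G T 
# 3 2 3 2 

# sanity check: the counts add up to the sequence length
sum(counts) == str_length(toy_seq)

# GC content as a fraction of the sequence
gc <- unname((counts["G"] + counts["C"]) / sum(counts))
gc
# [1] 0.5
```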

Find all motifs in the sequence

Find all occurrences of the motif ATG in the sequence.

str_locate_all(hepd, "ATG")
[[1]]
      start  end
 [1,]     1    3
 [2,]   130  132
 [3,]   378  380
 [4,]   581  583
 [5,]   586  588
 [6,]   637  639
 [7,]   686  688
 [8,]   695  697
 [9,]   758  760
[10,]   765  767
[11,]   858  860
[12,]   888  890
[13,]   893  895
[14,]  1015 1017
[15,]  1038 1040
[16,]  1089 1091
[17,]  1440 1442
[18,]  1457 1459
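The result of str_locate_all() is a list holding one start/end matrix per input string. The sketch below, on a short made-up sequence, shows one way to pull out the start positions and to verify the matches with str_sub().

```r
library(stringr)

toy_seq <- "ATGCCATGATG"  # hypothetical sequence

# str_locate_all() returns a list with one matrix per input string
hits <- str_locate_all(toy_seq, "ATG")[[1]]
hits[, "start"]
# [1] 1 6 9

# str_sub() is vectorised over start and end, so this recovers each match
str_sub(toy_seq, hits[, "start"], hits[, "end"])
# [1] "ATG" "ATG" "ATG"

# the number of rows equals the number of matches
nrow(hits) == str_count(toy_seq, "ATG")
```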

Cheminformatics Research

Cheminformatics manipulations in R can be challenging because they involve many string operations, often on characters that must be escaped. These exercises give some brief insights into cheminformatics with strings, using a dataset of agrochemicals (pesticides) from PubChem.

library(tidyverse) # provides read_csv(), filter(), pull() and relocate() used below
agro_data <- read_csv("data/PubChem_Agrochemicals_20231022.csv.gz")
Rows: 3081 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): cmpdname, mf, inchi, isosmiles, canonicalsmiles, inchikey, iupacna...
dbl  (9): cid, mw, exactmass, monoisotopicmass, pclidcnt, gpidcnt, gpfamilyc...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

How many of these agrochemicals are also drugs (pharmaceuticals)?

Look in the annothits column for entries that have the tag Drug and Medication Information.

# using a column of the data frame as vector
str_subset(agro_data$annothits, "Drug and Medication Information") |> 
  length()
[1] 530
# Tibble context
agro_data |> 
  filter(str_detect(annothits, "Drug and Medication Information")) |> 
  nrow()
[1] 530
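The tag is a literal string, so it can also be wrapped in fixed(), which switches off regular expression interpretation. The sketch below uses made-up annotation strings (the "|" separator is an assumption about the column format) and shows how missing values are handled: str_subset() drops them, while str_detect() returns NA, which filter() treats as a non-match.

```r
library(stringr)

# hypothetical annotation strings; the "|" separator is an assumption
annot <- c(
  "Agrochemical Information|Drug and Medication Information",
  "Agrochemical Information",
  NA
)
tag <- fixed("Drug and Medication Information")

# str_detect() propagates NA, str_subset() silently drops it
str_detect(annot, tag)
# [1]  TRUE FALSE    NA

n_vec <- length(str_subset(annot, tag))
n_sum <- sum(str_detect(annot, tag), na.rm = TRUE)
c(n_vec, n_sum)
# [1] 1 1
```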

How many agrochemicals have stereochemistry information?

Stereochemistry information is encoded in the isosmiles column and detected by looking for @ or \ or / symbols.

str_detect(agro_data$isosmiles, "[@\\\\/]") |> 
  sum() # relies on counting TRUE and FALSE as 1 and 0; legit R but ugly
[1] 603
# tibble context
agro_data |> 
  filter(str_detect(isosmiles, "[@\\\\/]")) |> 
  # relocate(isosmiles) |> # just to check
  nrow()
[1] 603
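The four backslashes are the result of two layers of escaping: the R string parser halves them, and the regex engine halves them again to obtain one literal backslash. The sketch below, on made-up SMILES strings, makes both layers visible and shows the raw string syntax (R 4.0 or later) as a possible shortcut.

```r
library(stringr)

# hypothetical SMILES strings; in R source a literal backslash is written "\\"
smiles <- c("C/C=C/C", "C[C@H](N)C(=O)O", "CCO", "F/C=C\\F")

pattern <- "[@\\\\/]"
# writeLines() shows what the regex engine receives: [@\\/]
writeLines(pattern)

str_detect(smiles, pattern)
# [1]  TRUE  TRUE FALSE  TRUE

# a raw string (R >= 4.0) needs only the regex-level escaping
identical(pattern, r"([@\\/])")
# [1] TRUE
```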

Return the canonicalsmiles of all agrochemicals containing a triple bond

Triple bonds are encoded by the # in the SMILES - canonical or isomeric. Use str_view to look at those that are also salts (encoded by .).

# canonical SMILES that contain a triple bond
t_bonds <- agro_data$canonicalsmiles[str_which(agro_data$canonicalsmiles, "[#]")]
# number of canonical SMILES with a triple bond
length(str_which(agro_data$canonicalsmiles, "[#]"))
[1] 180
# looking at those that are also salts
str_view(t_bonds, pattern = "\\.")
 [18] │ C(#N)[S-]<.>[NH4+]
 [20] │ [C-]#N<.>[Na+]
 [58] │ C(#N)[S-]<.>[Na+]
 [59] │ C(#N)[S-]<.>[K+]
 [83] │ C(#N)N=C([S-])[S-]<.>[Na+]<.>[Na+]
 [88] │ C(#N)[S-]<.>[Cu+]
 [89] │ C(#N)[S-]<.>[Cu+]
[101] │ C(#N)[S-]<.>C(#N)[S-]<.>[Cu+2]
[102] │ C(#N)[S-]<.>C(#N)[S-]<.>[Cu+2]
[103] │ C(#N)S<.>[Cu]
# SMILES containing both a triple bond AND a salt separator (tibble context)
agro_data |> 
  filter(str_detect(canonicalsmiles, "(#.*\\.)|(\\..*#)")) |> 
  pull(canonicalsmiles) |> 
  str_view("[#\\.]")
 [1] │ C(<#>N)[S-]<.>[NH4+]
 [2] │ [C-]<#>N<.>[Na+]
 [3] │ C(<#>N)[S-]<.>[Na+]
 [4] │ C(<#>N)[S-]<.>[K+]
 [5] │ C(<#>N)N=C([S-])[S-]<.>[Na+]<.>[Na+]
 [6] │ C(<#>N)[S-]<.>[Cu+]
 [7] │ C(<#>N)[S-]<.>[Cu+]
 [8] │ C(<#>N)[S-]<.>C(<#>N)[S-]<.>[Cu+2]
 [9] │ C(<#>N)[S-]<.>C(<#>N)[S-]<.>[Cu+2]
[10] │ C(<#>N)S<.>[Cu]
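The alternation pattern has to cover both orders of # and . in the string. A possibly more readable alternative, sketched here on made-up SMILES strings, is to run two literal tests and combine them with &.

```r
library(stringr)

# hypothetical SMILES strings
smiles <- c("C(#N)[S-].[Na+]", "CC#N", "[Na+].[Cl-]", "CCO")

# two literal tests combined with &
both <- str_detect(smiles, fixed("#")) & str_detect(smiles, fixed("."))
both
# [1]  TRUE FALSE FALSE FALSE

# same result as the single alternation pattern
identical(both, str_detect(smiles, "(#.*\\.)|(\\..*#)"))
# [1] TRUE
```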

How many agrochemicals contain fluorine?

Hint: look at the mf column (molecular formula) for the element F, but be careful: some agrochemicals contain iron (Fe). Look at the column contents to see how you can separate these entries.

# pipe-free solution, using the same pattern as the tibble solution
length(str_which(agro_data$mf, "F[\\d[A-Z]$]"))
[1] 481

# tibble solution
agro_data |> 
  filter(str_detect(mf, "F[\\d[A-Z]$]")) |> 
  relocate(mf) |> 
  nrow()
[1] 481
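Another way to separate F from Fe, offered here as an alternative sketch and not as the tutorial's solution, is a negative lookahead: match an F that is not followed by a lowercase letter. The formulas below are made up for illustration, and the resulting count on the real data has not been checked.

```r
library(stringr)

# hypothetical molecular formulas
mf <- c("C7H5F3O", "C6H5F", "C10H12FeN2O8", "CH4")

# F that is not followed by a lowercase letter, so Fe is excluded
lookahead <- str_detect(mf, "F(?![a-z])")
lookahead
# [1]  TRUE  TRUE FALSE FALSE

# equivalent here: F followed by an uppercase letter, a digit, or the end
alternation <- str_detect(mf, "F([A-Z0-9]|$)")
identical(lookahead, alternation)
# [1] TRUE
```

The lookahead also covers a formula that ends in F, such as the second entry.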