String manipulation

Author

Roland Krause/Emma Schymanski

Published

February 10, 2025

Aims

This short tutorial introduces some of the functions of the stringr package and a few regular expressions.

It consists of a few basic questions and an application provided by Prof. Emma Schymanski for the Master in Data Science.

stringr functions

We will use the words data set that is built into stringr. It becomes available as soon as you load the package.

Select words that contain a y
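
One possible approach, sketched on a small toy vector instead of words itself (toy_words is an assumption made for illustration): str_subset() returns the elements that match a pattern.

```r
library(stringr)

# Toy stand-in for stringr::words (illustrative assumption, not the real data)
toy_words <- c("yes", "any", "achieve", "chair", "much", "table")

# str_subset() keeps only the elements that match the pattern
with_y <- str_subset(toy_words, "y")
with_y
```

Replacing toy_words with words should apply the same selection to the built-in data set.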

Retrieve a logical vector that indicates which words start with y
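
A minimal sketch, again on an assumed toy vector: str_detect() returns one TRUE or FALSE per element, and the anchor ^ restricts the match to the start of the string.

```r
library(stringr)

# Toy stand-in for stringr::words (illustrative assumption)
toy_words <- c("yes", "any", "year", "table")

# "^y" matches a y only at the very beginning of a word
starts_with_y <- str_detect(toy_words, "^y")
starts_with_y
```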

Retrieve the indices for all words containing ch within the word.

ch must not be present at the start or the end of the word.
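
One way to express "ch inside the word" is to require at least one character on each side of ch. The sketch below uses an assumed toy vector; str_which() returns the positions of the matching elements.

```r
library(stringr)

# Toy stand-in for stringr::words (illustrative assumption)
toy_words <- c("yes", "any", "achieve", "chair", "much", "teacher")

# ".ch." needs one character before and one after "ch",
# so words that only start or only end with "ch" do not match
inner_ch <- str_which(toy_words, ".ch.")
inner_ch
```

Here "chair" and "much" are skipped because their only ch sits at the start or the end.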

Extract the y and the previous character.

Note: Use the function unique() around the results to avoid printing many empty matches.
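
A hedged sketch on an assumed toy vector: the pattern ".y" matches any single character followed by y, and str_extract() returns NA for words without a match, which is why unique() shortens the output.

```r
library(stringr)

# Toy stand-in for stringr::words (illustrative assumption)
toy_words <- c("yes", "any", "every", "table", "many")

# str_extract() returns the first match per word, or NA if there is none;
# "yes" gives NA because no character precedes its y
pairs <- unique(str_extract(toy_words, ".y"))
pairs
```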

Virus Research

Read the genome sequence of the Hepatitis D virus: hepd.fasta. For now, just execute the following:

hepd <- readr::read_lines("https://biostat2.uni.lu/practicals/data/hepd.fasta")

What is the length of the genome sequence?

What is the sequence composition? How often does each character occur?
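
The sketch below works on a made-up FASTA-like vector (toy_fasta is an assumption, not the real hepd data). It assumes that, as in the FASTA format, header lines start with > and the remaining lines hold the sequence.

```r
library(stringr)

# Toy stand-in for the lines returned by read_lines() (illustrative assumption)
toy_fasta <- c(">toy_sequence example header", "ATGCCGTA", "GGATGA")

# Drop header lines and join the sequence lines into one string
seq_lines <- toy_fasta[!str_detect(toy_fasta, "^>")]
genome <- str_c(seq_lines, collapse = "")

# Length of the sequence
genome_length <- str_length(genome)

# Composition: count each nucleotide (str_count() is vectorised over patterns)
bases <- c("A", "C", "G", "T")
composition <- setNames(str_count(genome, bases), bases)

genome_length
composition
```

An alternative for the composition is table(str_split(genome, "")[[1]]), which also reveals unexpected characters.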

Find all motifs in the sequence

Find all matches of the motif ATG in the sequence.
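
A minimal sketch on an assumed toy sequence: str_locate_all() returns the start and end position of every match, and str_count() gives the number of matches.

```r
library(stringr)

# Toy sequence (illustrative assumption, not the Hepatitis D genome)
genome <- "ATGCCGTAGGATGA"

# One matrix per input string, with columns "start" and "end"
hits <- str_locate_all(genome, "ATG")[[1]]
n_hits <- str_count(genome, "ATG")

hits
n_hits
```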

Cheminformatics Research

Cheminformatics manipulations in R can be challenging, as there are many string operations, also involving escape characters. These exercises give some brief insights into cheminformatics with strings, using a dataset of agrochemicals (pesticides) from PubChem.

agro_data <- readr::read_csv("data/PubChem_Agrochemicals_20231022.csv.gz")
Rows: 3081 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): cmpdname, mf, inchi, isosmiles, canonicalsmiles, inchikey, iupacna...
dbl  (9): cid, mw, exactmass, monoisotopicmass, pclidcnt, gpidcnt, gpfamilyc...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

How many of these agrochemicals are also drugs (pharmaceuticals)?

Look in the annothits column for entries that have the tag Drug and Medication Information.
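
A sketch of the counting step on a toy data frame. The layout of annothits (several tags joined by |) and the toy values are assumptions for illustration; only the tag name comes from the exercise.

```r
library(stringr)

# Toy stand-in for agro_data (illustrative assumption)
toy_agro <- data.frame(
  cmpdname = c("compound A", "compound B", "compound C"),
  annothits = c(
    "Agrochemical Information|Drug and Medication Information",
    "Agrochemical Information",
    NA
  ),
  stringsAsFactors = FALSE
)

# fixed() matches the tag literally; missing entries yield NA
is_drug <- str_detect(toy_agro$annothits, fixed("Drug and Medication Information"))

# Count the TRUE values, ignoring rows without annotation
n_drugs <- sum(is_drug, na.rm = TRUE)
n_drugs
```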

How many agrochemicals have stereochemistry information?

Stereochemistry information is encoded in the isosmiles column and detected by looking for @ or \ or / symbols.

What happens if you run this on the canonicalsmiles column instead?

What does this tell you about Canonical SMILES?
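
The escaping is the tricky part here: a literal backslash needs four backslashes in an R string to reach the regex engine as an escaped backslash. The sketch below uses assumed toy SMILES, not the PubChem data.

```r
library(stringr)

# Toy SMILES strings (illustrative assumption)
isosmiles <- c("C[C@H](N)C(=O)O", "C/C=C/C", "F/C=C\\F", "CCO")
canonicalsmiles <- c("CC(N)C(=O)O", "CC=CC", "C(=CF)F", "CCO")

# Character class containing @, backslash and forward slash;
# "\\\\" in R source becomes \\ in the regex, i.e. one literal backslash
stereo_pattern <- "[@\\\\/]"

n_stereo_iso <- sum(str_detect(isosmiles, stereo_pattern))
n_stereo_can <- sum(str_detect(canonicalsmiles, stereo_pattern))

n_stereo_iso
n_stereo_can
```

In this toy example the second column carries none of the stereo symbols, which is the kind of difference the two questions above ask you to look for in the real data.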

Return the canonicalsmiles of all agrochemicals containing a triple bond

Triple bonds are encoded by # in both canonical and isomeric SMILES. Use str_view to look at those that are also salts (encoded by .).
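
A hedged sketch on assumed toy SMILES. Because . is a wildcard in regular expressions, it is wrapped in fixed() so that it is matched literally.

```r
library(stringr)

# Toy canonical SMILES (illustrative assumption)
canonicalsmiles <- c("C#N", "CC#N", "[Na+].[C-]#N", "CCO", "[Na+].[Cl-]")

# All entries containing a triple bond
triple <- str_subset(canonicalsmiles, fixed("#"))

# Of those, the ones that are also salts (contain a literal ".")
triple_salts <- str_subset(triple, fixed("."))

triple
triple_salts
```

Calling str_view(triple_salts, fixed(".")) on the result should then highlight the separator between the salt components.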

How many agrochemicals contain fluorine?

Hint: look at the mf column (molecular formula) and search for the element F, but be careful: some agrochemicals contain iron (Fe). Look at the column contents to see how you can separate these entries.
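
One possible pattern, sketched on assumed toy formulas: a negative lookahead accepts an F only when it is not followed by a lowercase letter, which excludes Fe and any other two-letter symbol starting with F.

```r
library(stringr)

# Toy molecular formulas (illustrative assumption)
mf <- c("C2H4FNO", "C4H2FeO4", "C12H4Cl2F6N4OS", "C10H19O6PS2")

# "F(?![a-z])": an F that is NOT followed by a lowercase letter
has_fluorine <- str_detect(mf, "F(?![a-z])")
n_fluorine <- sum(has_fluorine)

has_fluorine
n_fluorine
```

Whether this pattern is sufficient for the real column depends on how the formulas are written there, so checking a few matches by eye is still worthwhile.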

Generate a list of the names of agrochemicals that were contributed by sources from Luxembourg.

Use the sidsrcname column to detect Luxembourg.
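
A minimal sketch on a toy data frame. The source names and the |-separated layout are hypothetical placeholders; only the column names come from the exercise.

```r
library(stringr)

# Toy stand-in for agro_data (hypothetical source names)
toy_agro <- data.frame(
  cmpdname = c("compound A", "compound B", "compound C"),
  sidsrcname = c(
    "Hypothetical Source X|Hypothetical Luxembourg Lab",
    "Hypothetical Source Y",
    "Hypothetical Luxembourg Lab"
  ),
  stringsAsFactors = FALSE
)

# str_which() gives row positions and silently skips missing values
lux_rows <- str_which(toy_agro$sidsrcname, "Luxembourg")
lux_names <- toy_agro$cmpdname[lux_rows]
lux_names
```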

Finally, how many agrochemicals were contributed in 2023?

Use the cidcdate column, which is in YYYYMMDD format.
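
A sketch on assumed toy dates. It assumes the column may have been read as a number, so the values are converted to character before the first four characters (the year) are compared.

```r
library(stringr)

# Toy stand-in for the cidcdate column (illustrative assumption)
cidcdate <- c(20050326, 20230115, 20231001, 20191231)

# Convert to text, then take characters 1 to 4 as the year
year <- str_sub(as.character(cidcdate), 1, 4)
n_2023 <- sum(year == "2023")

year
n_2023
```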