String manipulation

stringr and regular expressions

Roland Krause

Rworkshop

Wednesday, 7 February 2024

Introduction

Summary

stringr package

  • Simplifies and unifies string operations
  • Gentle stringr introduction
  • Different matching engines providing locale-sensitive matches.
  • Current version is 1.5.1.

Session set-up

Learning objectives

  • Perform pattern matching and string manipulation
    • Detection, extraction, counting, subsetting
    • Print text nicely and well-formatted
  • Clean up input data programmatically
  • Quality control and processing

Material: Regular expressions

Strings in R

String examples in Base R

Strings are character objects

# A character object = colloquially called "string"
my_string <- "cat"

my_string
[1] "cat"
my_other_string <- 'catastrophe' # single quotes

not_so_numeric <- as.character(3.1415)

not_so_numeric
[1] "3.1415"
# A character vector

my_string_vec <- c("atg", "ttg", "tga")

Printing complex objects

C style with placeholder

sprintf("Hello %s, how is day %d of this course?", 
        "John Doe", 12)
[1] "Hello John Doe, how is day 12 of this course?"
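
For comparison, the same sentence can be built with named placeholders instead of positional ones. This is a minimal sketch using str_glue(), stringr's wrapper around the glue package; the variable names participant and course_day are made up for illustration.

```r
library(stringr)

# Hypothetical values, chosen to mirror the sprintf() call above
participant <- "John Doe"
course_day <- 12

# Expressions inside {} are evaluated and inserted into the string
greeting <- str_glue("Hello {participant}, how is day {course_day} of this course?")
greeting
```

Named placeholders tend to be easier to read than %s and %d once a template holds more than two or three values.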

Comparisons of Base R and stringr

pattern <- "r"
my_words <- c( "cat", "cart","carrot", "catastrophe",
               "dog","rat",  "bet")

Base R

grep(pattern, my_words)
[1] 2 3 4 6
grep(pattern, my_words, value = TRUE)
[1] "cart"        "carrot"      "catastrophe" "rat"        
substr(my_words, 1, 3)
[1] "cat" "car" "car" "cat" "dog" "rat" "bet"
gsub(pattern, "R", my_words)
[1] "cat"         "caRt"        "caRRot"      "catastRophe" "dog"        
[6] "Rat"         "bet"        

With stringr

str_which(my_words, pattern)
[1] 2 3 4 6
str_subset(my_words, pattern)
[1] "cart"        "carrot"      "catastrophe" "rat"        
str_sub(my_words, 1, 3)
[1] "cat" "car" "car" "cat" "dog" "rat" "bet"
str_replace(my_words, pattern, "R")
[1] "cat"         "caRt"        "caRrot"      "catastRophe" "dog"        
[6] "Rat"         "bet"        
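
Note that the two outputs above differ for "carrot": gsub() replaces every match, whereas str_replace() replaces only the first one. The sketch below pairs each base R function with its closest stringr counterpart; my_words is re-created so the block runs on its own.

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# First match only: sub() corresponds to str_replace()
first_base    <- sub("r", "R", my_words)
first_stringr <- str_replace(my_words, "r", "R")

# Every match: gsub() corresponds to str_replace_all()
all_base    <- gsub("r", "R", my_words)
all_stringr <- str_replace_all(my_words, "r", "R")

identical(first_base, first_stringr)
identical(all_base, all_stringr)
```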

Strings with stringr

Why use stringr?

Easy to use

  • Consistency

  • Less typing and looking up things

  • All functions in stringr start with str_

  • All take a vector of strings as the first argument

  • (“data first”)

  • Pipes work as expected

  • All functions properly vectorised
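
Because the string vector always comes first, stringr calls chain naturally with the pipe. A small sketch, assuming R 4.1 or later for the native pipe |>:

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

piped <- my_words |>
  str_subset("a") |>        # keep the words that contain an "a"
  str_to_upper() |>         # convert them to upper case
  str_c(collapse = ", ")    # join them into a single string

piped
```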

Useful additions

Viewing matches, rendered in ANSI colors in the console. Matches are enclosed in chevrons (< >).

str_view(my_words, pattern)
[2] │ ca<r>t
[3] │ ca<r><r>ot
[4] │ catast<r>ophe
[6] │ <r>at

Well documented

(http://stringr.tidyverse.org)

Cheat sheets

Cheat sheets (cont.)

Matching strings

Detect matching strings

my_words
[1] "cat"         "cart"        "carrot"      "catastrophe" "dog"        
[6] "rat"         "bet"        
str_detect(my_words, "a")
[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE

Retrieve indices

str_which(my_words, "a")
[1] 1 2 3 4 6

Retrieving (only) matching strings

str_subset(my_words, "a")
[1] "cat"         "cart"        "carrot"      "catastrophe" "rat"        

Inverting matches with negate = TRUE

str_subset(my_words, "a", negate = TRUE)
[1] "dog" "bet"

Working with matched positions

Extracting matches

str_extract(my_words, "a")
[1] "a" "a" "a" "a" NA  "a" NA 

Why not str_match()?

str_match() also returns capture groups (see regular expressions) and produces a matrix. This is usually more complicated than what you wanted.

str_match(my_words, "a")
     [,1]
[1,] "a" 
[2,] "a" 
[3,] "a" 
[4,] "a" 
[5,] NA  
[6,] "a" 
[7,] NA  

Position of a match

str_locate(my_words, "a")
     start end
[1,]     2   2
[2,]     2   2
[3,]     2   2
[4,]     2   2
[5,]    NA  NA
[6,]     2   2
[7,]    NA  NA

Locate all matches

str_locate_all(my_words, "a")
[[1]]
     start end
[1,]     2   2

[[2]]
     start end
[1,]     2   2

[[3]]
     start end
[1,]     2   2

[[4]]
     start end
[1,]     2   2
[2,]     4   4

[[5]]
     start end

[[6]]
     start end
[1,]     2   2

[[7]]
     start end

Warning

Returns list objects. Usually a lot more complicated than what you wanted.
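
When only the number of matches per string is needed, a flat vector is easier to handle than the list returned by str_locate_all(). A minimal sketch with str_count(), re-creating my_words so the block runs on its own:

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# One integer per input string: how often does "a" occur?
a_counts <- str_count(my_words, "a")
a_counts
```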

How long is my string?

Length of items in a character vector

str_length(my_words)
[1]  3  4  6 11  3  3  3

Vector length

length(my_words)
[1] 7

Elements of strings

Substrings

my_words
[1] "cat"         "cart"        "carrot"      "catastrophe" "dog"        
[6] "rat"         "bet"        
str_sub(my_words, 1, 4)
[1] "cat"  "cart" "carr" "cata" "dog"  "rat"  "bet" 
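
str_sub() also accepts negative positions, which count from the end of each string. A short sketch of this, using the same vector:

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# Last three characters of each word
last_three <- str_sub(my_words, -3)

# Everything except the first and the last character
inner <- str_sub(my_words, 2, -2)

last_three
inner
```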

Replace

str_replace(my_words, "a", "#")
[1] "c#t"         "c#rt"        "c#rrot"      "c#tastrophe" "dog"        
[6] "r#t"         "bet"        

Check c#tastrophe!

str_replace_all()

str_replace_all(my_words, 'a', '#')
[1] "c#t"         "c#rt"        "c#rrot"      "c#t#strophe" "dog"        
[6] "r#t"         "bet"        

Splitting strings

Basic splitting leads to complex output

str_split(my_words, "a")
[[1]]
[1] "c" "t"

[[2]]
[1] "c"  "rt"

[[3]]
[1] "c"    "rrot"

[[4]]
[1] "c"       "t"       "strophe"

[[5]]
[1] "dog"

[[6]]
[1] "r" "t"

[[7]]
[1] "bet"

str_split() creates lists with vectors of different lengths. Harder to work on programmatically.

Simplification

str_split(my_words, "a", simplify = TRUE)
     [,1]  [,2]   [,3]     
[1,] "c"   "t"    ""       
[2,] "c"   "rt"   ""       
[3,] "c"   "rrot" ""       
[4,] "c"   "t"    "strophe"
[5,] "dog" ""     ""       
[6,] "r"   "t"    ""       
[7,] "bet" ""     ""       

Creates a matrix, all rows have the same length.

Joining strings

Concatenation

str_c(my_words, collapse = "|")
[1] "cat|cart|carrot|catastrophe|dog|rat|bet"

Vectorization of concatenation

str_c(my_words, my_words, sep = ": ")
[1] "cat: cat"                 "cart: cart"              
[3] "carrot: carrot"           "catastrophe: catastrophe"
[5] "dog: dog"                 "rat: rat"                
[7] "bet: bet"                

Padding

str_pad(my_words, 6)
[1] "   cat"      "  cart"      "carrot"      "catastrophe" "   dog"     
[6] "   rat"      "   bet"     
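
Padding is on the left with spaces by default; the side and pad arguments change this. A minimal sketch, where the ids vector is a made-up example:

```r
library(stringr)

# Hypothetical identifiers of uneven width
ids <- c("7", "42", "123")

# Left-pad with zeros to a common width of 3
padded_ids <- str_pad(ids, width = 3, pad = "0")

# Pad on the right instead of the left
right_padded <- str_pad("cat", width = 6, side = "right")

padded_ids
right_padded
```

Strings that are already at least as long as width are returned unchanged, as "catastrophe" shows above.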

Truncating

str_trunc(c("anachronism", 
            "antebellum", 
            "antithesis"), 6)
[1] "ana..." "ant..." "ant..."
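
str_trunc() shortens strings to a maximum width. Removing unwanted whitespace is a separate task, handled by str_trim() and str_squish(). A small sketch with an invented messy vector:

```r
library(stringr)

# Hypothetical input with stray whitespace
messy <- c("  cat ", "carrot\n", " cat  astrophe ")

# Remove whitespace at both ends only
trimmed <- str_trim(messy)

# Additionally collapse runs of inner whitespace to a single space
squished <- str_squish(messy)

trimmed
squished
```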

Better treatment of conversion?

my_col <- c("F", "M", "female", "male", "male", "female", "female", "männlich")


convert_gender <- function(x){
  case_when(
    str_detect(x, "m") ~ "Male",
    str_detect(x, "M") ~ "Male", 
    str_detect(x, "F") ~ "Female", 
    str_detect(x, "f") ~ "Female", 

        TRUE ~ x
  )
}

convert_gender(my_col)
[1] "Female" "Male"   "Male"   "Male"   "Male"   "Male"   "Male"   "Male"  

Need a better way to express what we want to match!

Regular expressions!
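
As a preview of where this is heading, here is one possible rewrite of convert_gender() with regular expressions. It is only a sketch: the function name convert_gender_regex is made up, and it assumes that the first letter alone decides the category. The anchor ^ restricts the match to the start of the string, and regex(ignore_case = TRUE) makes it case-insensitive.

```r
library(stringr)
library(dplyr)

my_col <- c("F", "M", "female", "male", "male", "female", "female", "männlich")

convert_gender_regex <- function(x) {
  case_when(
    # starts with m or M
    str_detect(x, regex("^m", ignore_case = TRUE)) ~ "Male",
    # starts with f or F
    str_detect(x, regex("^f", ignore_case = TRUE)) ~ "Female",
    # anything else is left untouched
    TRUE ~ x
  )
}

converted <- convert_gender_regex(my_col)
converted
```

"female" is no longer caught by the "m" rule, because the match is now tied to the first character.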

Getting started with regular expressions

Higher aims

  • Extract particular characters, e.g. numbers only
  • Express a variety of character following or preceding patterns
  • Matching any character
  • Not matching a particular character

Prerequisites for complex matching

  • Regular expressions are written as strings and converted to a regular expression object when used as a pattern.
  • The conversion can be done explicitly with regex(), which is rarely necessary.
  • print() shows the quoted string with its escape sequences and is therefore misleading.
  • Use cat() or writeLines() to see the string as it actually is, with escapes resolved.
  • writeLines() is preferred for writing.
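
The difference between print() and writeLines() can be checked on a short pattern. A minimal sketch; the variable name escaped_dot is arbitrary:

```r
# Two characters in memory: a backslash followed by a dot
escaped_dot <- "\\."

print(escaped_dot)       # shows the escaped form, as typed in code
writeLines(escaped_dot)  # shows the actual content of the string
nchar(escaped_dot)       # number of characters actually stored
```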

Flexible matching through metacharacters

Metacharacters

Symbols matching groups of characters.

. (dot) represents any character except for the newline character (\n)

str_view(my_words, ".at")
[1] │ <cat>
[4] │ <cat>astrophe
[6] │ <rat>

. on its own matches exactly one character

 str_subset(my_words, "c..t")
[1] "cart"

+ (plus) represents one or more occurrences

str_subset(my_words, 'c.r+')
[1] "cart"   "carrot"

* (star) represents zero or more occurrences

str_subset(my_words, 'c.r*')
[1] "cat"         "cart"        "carrot"      "catastrophe"

Grouping

Group terms with parentheses ( and )

str_extract(my_words, '(at)|(og)+')
[1] "at" NA   NA   "at" "og" "at" NA  

Alternation operator | (logical OR) for groups

str_match(my_words, '(c.t)|(c.rt)')
     [,1]   [,2]  [,3]  
[1,] "cat"  "cat" NA    
[2,] "cart" NA    "cart"
[3,] NA     NA    NA    
[4,] "cat"  "cat" NA    
[5,] NA     NA    NA    
[6,] NA     NA    NA    
[7,] NA     NA    NA    

Capture groups with str_match()

str_match(my_words, 'c(a.)+t')
     [,1]     [,2]
[1,] NA       NA  
[2,] "cart"   "ar"
[3,] NA       NA  
[4,] "catast" "as"
[5,] NA       NA  
[6,] NA       NA  
[7,] NA       NA  

The output becomes complex when there are multiple capture groups.

Quantifying a number of matches

Quantifiers

The preceding item will be matched …

? at most once.
* zero or more times.
+ one or more times.
{n} exactly n times.
{n,} n or more times.
{n,m} at least n times, but not more than m times.

Examples

?

dna <- "ATGGTAACCGGTAGGTAGTAAAGGTCCC"
str_view(dna, "AA?")
[1] │ <A>TGGT<AA>CCGGT<A>GGT<A>GT<AA><A>GGTCCC

+

str_view(dna, "AA+")
[1] │ ATGGT<AA>CCGGTAGGTAGT<AAA>GGTCCC

{3,}

str_view(dna, "A{3,}")
[1] │ ATGGTAACCGGTAGGTAGT<AAA>GGTCCC
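
The remaining quantifiers, {n} and {n,m}, work the same way. A short sketch on the same sequence, using str_extract_all() and str_count() so the results are plain values rather than a rendered view:

```r
library(stringr)

dna <- "ATGGTAACCGGTAGGTAGTAAAGGTCCC"

# {2,3}: runs of two or three C, taking three where possible (greedy)
c_runs <- str_extract_all(dna, "C{2,3}")[[1]]

# {2}: exactly two G in a row; count the non-overlapping occurrences
g_pairs <- str_count(dna, "G{2}")

c_runs
g_pairs
```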

Greedy and lazy matching

Matches are greedy by default

Match the longest possible subsequence.

dna <- "ATGGTAACCGGTAGGTAGTAAAGGTCCC"
str_extract(dna, "AAG.{2,5}")
[1] "AAGGTCCC"
str_extract(dna, "ATG.+C")
[1] "ATGGTAACCGGTAGGTAGTAAAGGTCCC"

Lazy matching

Adding ? after a quantifier makes it lazy, so it returns the shortest possible match.

str_extract(dna, "AAG.{2,5}?")
[1] "AAGGT"
str_extract(dna, "ATG.+?C")
[1] "ATGGTAAC"

Anchors

^ Start of string

my_words
[1] "cat"         "cart"        "carrot"      "catastrophe" "dog"        
[6] "rat"         "bet"        
str_subset(my_words , '^c')
[1] "cat"         "cart"        "carrot"      "catastrophe"

$ End of string

str_subset(my_words, 'r.$')
[1] "cart"
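
Using both anchors in one pattern forces the whole string to match, not just a part of it. A minimal sketch:

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# Exactly three characters between start and end
three_letters <- str_subset(my_words, "^.{3}$")

# Starts with c and ends with t, anything in between
c_to_t <- str_subset(my_words, "^c.*t$")

three_letters
c_to_t
```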

Character classes

Special characters

Pattern  Matches          Complement  Matches
\d       Digit            \D          No digit
\s       Whitespace       \S          No whitespace
\w       Word characters  \W          No word character
\b       Word boundary    \B          Within a word

Examples

uniprot <- c("Q6QU88_CALBL", "CO1A2_HUMAN", "SAMH1_HUMAN", 
             "NPRL2_DROME", "GLUC_HUMAN")
str_extract(uniprot, "\\d+")
[1] "6" "1" "1" "2" NA 
str_count(uniprot, "\\w+")
[1] 1 1 1 1 1

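Note: the underscore is also a word character, so \w+ matches each identifier as one single run and every string counts exactly one match.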

Negation can be a good choice

str_extract(uniprot, "\\S+_")
[1] "Q6QU88_" "CO1A2_"  "SAMH1_"  "NPRL2_"  "GLUC_"  

Extended list of regular expressions

Readable short cuts

Expression  Description
[:upper:]   Upper-case letters.
[:lower:]   Lower-case letters.
[:alpha:]   Alphabetic characters: [:lower:] and [:upper:].
[:digit:]   Digits: 0 1 2 3 4 5 6 7 8 9.
[:punct:]   Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ and others.
[:space:]   Space characters: tab, newline, vertical tab, form feed, carriage return, and space.
[:blank:]   Blank characters: space and tab.
[:alnum:]   Alphanumeric characters: [:alpha:] and [:digit:].
[:graph:]   Graphical characters: [:alnum:] and [:punct:].

Built-in (stringi - stringr)

Examples

uniprot <- c("Q6QU88_CALBL", "CO1A2_HUMAN", "SAMH1_HUMAN", 
             "NPRL2_DROME", "GLUC_HUMAN")
str_subset(uniprot, "[:digit:]")
[1] "Q6QU88_CALBL" "CO1A2_HUMAN"  "SAMH1_HUMAN"  "NPRL2_DROME" 
str_extract(uniprot, "[:digit:]")
[1] "6" "1" "1" "2" NA 

Roll your own character class

Define groups

Group     Matches
[a-z]     lowercase (ASCII) letters
[a-zA-Z]  any (ASCII) letter
[0-9]     any digit
[aeiou]   any lowercase vowel
[0-7ivx]  any of 0 to 7, i, v, and x

Example

str_subset(c("Rpl12", "Rpn12", 
             "Rps1",
             "Pre1"), 'Rp[ls]')
[1] "Rpl12" "Rps1" 
str_subset(c("Rpl12", "Rpn12", 
             "Rps1",
             "Pre1"), 'Rp[^ls]')
[1] "Rpn12"

Rely on built-in groups where possible

Note that the set of alphabetic characters, [:alpha:], includes letters such as ß, ç or ö, which are very common in some languages. It is more general than [A-Za-z], which matches ASCII letters only.
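
The difference shows up as soon as a non-ASCII letter appears. A small sketch with invented words; the \u escapes are only used to keep the code independent of the file encoding:

```r
library(stringr)

# "männlich" and "garçon", written with Unicode escapes
words <- c("male", "m\u00e4nnlich", "gar\u00e7on")

# ASCII-only class: words containing ä or ç do not match completely
ascii_only <- str_detect(words, "^[a-z]+$")

# Built-in alphabetic class: all three consist of letters only
any_alpha <- str_detect(words, "^[[:alpha:]]+$")

ascii_only
any_alpha
```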

Matching metacharacters

Strings containing a full stop

We saw special characters such as ., +, * or $ having special meaning in regular expressions.

vec2 <- c("YKL045W-A", "12+45=57", "$1200.00", "ID2.2")

str_subset(vec2 , '.')
[1] "YKL045W-A" "12+45=57"  "$1200.00"  "ID2.2"    

Not what we wanted!

Use an escape character

str_subset(vec2, '\.')
Error: '\.' is an unrecognized escape in character string (<text>:1:20)

Errors we don’t want either!

Need to escape with \ twice!

str_subset(vec2, '\\.')
[1] "$1200.00" "ID2.2"   
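
Two alternatives avoid the double backslash altogether: wrapping the pattern in fixed(), which switches off regular expression syntax, or putting the dot into a character class, where it loses its special meaning. A minimal sketch:

```r
library(stringr)

vec2 <- c("YKL045W-A", "12+45=57", "$1200.00", "ID2.2")

# fixed() compares the pattern literally, so "." is just a full stop
with_fixed <- str_subset(vec2, fixed("."))

# Inside square brackets the dot is a literal character
with_class <- str_subset(vec2, "[.]")

with_fixed
with_class
```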

Implicit conversion

Regular expressions are written as ordinary R strings, without any explicit conversion by the user. R resolves backslash escapes when it parses the string, so by the time the string is converted to a regular expression internally, each \\ typed in the code has already become a single \.

No escape

To match a \, our pattern must represent \\.

How to match c("a\\backslash", "nobackslash", "slash", "\n")?

Note the difference when printing meta-characters.

slash_vec <- c("a\\backslash", "nobackslash", "slash","\n")
print(slash_vec)
[1] "a\\backslash" "nobackslash"  "slash"        "\n"          
cat(slash_vec)
a\backslash nobackslash slash 
str_subset(slash_vec, '\\') 
Error in stri_subset_regex(string, pattern, omit_na = TRUE, negate = negate, : Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE, context=`\`)

Use more backslashes!!!

Our string must contain 4 backslashes!

str_subset(slash_vec, '\\\\')
[1] "a\\backslash"

Or raw strings

Raw strings are written as r"( )" and do not require additional escape sequences.

str_subset(slash_vec, r"(\\)")
[1] "a\\backslash"

See R for Data Science, 14.2.2

Disambiguation

^ is …

  • anchor at start of sequence and
  • negator in groups.
(uniprot)
[1] "Q6QU88_CALBL" "CO1A2_HUMAN"  "SAMH1_HUMAN"  "NPRL2_DROME"  "GLUC_HUMAN"  
str_view(uniprot, "^[^S]", html = TRUE)
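
Since the HTML view is not reproduced here, the same pattern can be checked with str_subset(). In this sketch the first ^ anchors the match at the start of the string, while the ^ inside the brackets negates the class:

```r
library(stringr)

uniprot <- c("Q6QU88_CALBL", "CO1A2_HUMAN", "SAMH1_HUMAN",
             "NPRL2_DROME", "GLUC_HUMAN")

# Identifiers whose first character is anything but S
not_s_first <- str_subset(uniprot, "^[^S]")

# Identifiers whose first character is S
s_first <- str_subset(uniprot, "^S")

not_s_first
s_first
```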

? is …

  • quantifier and
  • lazy switch.
str_extract(uniprot, "\\w?") # quantifier (0, 1, greedy!)
[1] "Q" "C" "S" "N" "G"
str_extract(uniprot, "\\w??") # quantifier and lazy switch
[1] "" "" "" "" ""

Grouping

( ) Grouping with round brackets

str_view(uniprot, "(SAM)", html = TRUE)

[ ] Grouping characters into a class with square brackets (matches any single one of S, A or M)

str_view(uniprot, "[SAM]", html = TRUE)
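
The two kinds of brackets behave very differently, which is easier to see with str_extract() than in the rendered view. A minimal sketch:

```r
library(stringr)

uniprot <- c("Q6QU88_CALBL", "CO1A2_HUMAN", "SAMH1_HUMAN",
             "NPRL2_DROME", "GLUC_HUMAN")

# Round brackets: the literal sequence S, A, M in that order
as_sequence <- str_extract(uniprot, "(SAM)")

# Square brackets: the first single character that is S, A or M
as_class <- str_extract(uniprot, "[SAM]")

as_sequence
as_class
```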

Backreferences

Group matches

\1, \2 and so forth refer to groups matched with ().

Constructing new strings from regular expression matches

str_replace(uniprot, '(\\S+)_(\\S+)', "\\2: \\1")
[1] "CALBL: Q6QU88" "HUMAN: CO1A2"  "HUMAN: SAMH1"  "DROME: NPRL2" 
[5] "HUMAN: GLUC"  
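
Backreferences also work inside the pattern itself, where \1 stands for whatever the first group just matched. A short sketch that finds words with a doubled character:

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# (.) captures any character, \\1 requires the same character again
doubled <- str_subset(my_words, "(.)\\1")
doubled
```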

Helpers

regexplain

Simple addin for RStudio by Garrick Aden-Buie

  • Test regular expressions on the fly
  • Reference library
  • Cheatsheet
  • Test it live
devtools::install_github("gadenbuie/regexplain")

regexplain::regexplain_gadget()

Application of string matching with tidyr

Separate wider functions

Key-value pairs

# A tibble: 6 × 2
  subject_id gender_age
       <int> <chr>     
1       1001 m-46      
2       1002 m-56      
3       1003 m-44      
4       1004 f-47      
5       1005 f-47      
6       1006 f-36      
patient |> 
  separate_wider_delim(gender_age, 
           names = c("sex", "age"), 
           delim = "-")
# A tibble: 6 × 3
  subject_id sex   age  
       <int> <chr> <chr>
1       1001 m     46   
2       1002 m     56   
3       1003 m     44   
4       1004 f     47   
5       1005 f     47   
6       1006 f     36   

No separator - position

# A tibble: 6 × 2
  subject_id gender_age
       <int> <glue>    
1       1001 f37       
2       1002 f34       
3       1003 m54       
4       1004 m57       
5       1005 f62       
6       1006 m41       
patient |> 
  separate_wider_position(gender_age, 
                          c(sex = 1, age = 2))
# A tibble: 6 × 3
  subject_id sex   age  
       <int> <chr> <chr>
1       1001 f     37   
2       1002 f     34   
3       1003 m     54   
4       1004 m     57   
5       1005 f     62   
6       1006 m     41   
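
When neither a delimiter nor fixed widths describe the column, tidyr also offers separate_wider_regex(), which splits by regular expressions. The sketch below assumes tidyr 1.3.0 or later and uses a shortened, re-created version of the patient tibble; the named patterns become the new columns and together must match the whole string.

```r
library(tidyr)
library(tibble)

# Re-created example data, first three subjects only
patient <- tibble(
  subject_id = 1001:1003,
  gender_age = c("f37", "f34", "m54")
)

patient_split <- patient |>
  separate_wider_regex(
    gender_age,
    patterns = c(sex = "[fm]", age = "\\d+")  # one letter, then digits
  )

patient_split
```

Unlike the position-based split, this keeps working if some ages have one or three digits.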

Separate longer exists

Check separate_longer_position() or separate_longer_delim() if you split repeating values.
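
A minimal sketch of separate_longer_delim(), assuming tidyr 1.3.0 or later; the visits tibble is invented for illustration. Each delimited value gets its own row, and the other columns are repeated.

```r
library(tidyr)
library(tibble)

# Hypothetical data: several visits stored in one cell
visits <- tibble(
  subject_id = c(1001L, 1002L),
  visit_id   = c("1,2,3", "1")
)

visits_long <- visits |>
  separate_longer_delim(visit_id, delim = ",")

visits_long
```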

Pasting with unite()

Input tibble

data_value <-
  tibble(
    year = c(2015, 2014, 2014),
    month = c(11, 2, 4),
    day = c(23, 1, 30),
    value = c("high", "low", "low"))

data_value
# A tibble: 3 × 4
   year month   day value
  <dbl> <dbl> <dbl> <chr>
1  2015    11    23 high 
2  2014     2     1 low  
3  2014     4    30 low  

Demonstration only.

Use the package lubridate for actually working with dates!

unite()

date_unite <-  unite(data_value, 
                    date, year, month, day, 
                    sep = "-") 

date_unite
# A tibble: 3 × 2
  date       value
  <chr>      <chr>
1 2015-11-23 high 
2 2014-2-1   low  
3 2014-4-30  low  

No need to clean up old columns.
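
The united column is still a character vector, and "2014-2-1" is not zero-padded. For real dates, one option is lubridate::make_date(), which builds a Date from numeric parts. A minimal sketch on the same data:

```r
library(tibble)
library(dplyr)
library(lubridate)

data_value <- tibble(
  year  = c(2015, 2014, 2014),
  month = c(11, 2, 4),
  day   = c(23, 1, 30),
  value = c("high", "low", "low")
)

# Build a proper Date column from the three numeric columns
date_parsed <- data_value |>
  mutate(date = make_date(year, month, day))

date_parsed
```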

Parsing dates with lubridate functions

A gift from your collaborators

visit_times <- tribble(
  ~subject, ~visit_date,
  1, "01/07/2001",
  2, "01.MAY.2012",
  3, "12-07-2015",
  4, "4/5/14",
  5, "12. Jun 1999"
)

Lubridate to the rescue!

visit_times |> 
  mutate(good_date = 
           lubridate::dmy(visit_date))
# A tibble: 5 × 3
  subject visit_date   good_date 
    <dbl> <chr>        <date>    
1       1 01/07/2001   2001-07-01
2       2 01.MAY.2012  2012-05-01
3       3 12-07-2015   2015-07-12
4       4 4/5/14       2014-05-04
5       5 12. Jun 1999 1999-06-12

lubridate has a range of functions for parsing ill-formatted dates and times.

Separate rows with multiple entries

Multiple values per cell

patient_df <- tibble(
    subject_id = 1001:1003, 
    visit_id = c("1,2, 3", "1|2", "1"),
    measured = c("9,0, 11", "11, 3", "12"))
patient_df
# A tibble: 3 × 3
  subject_id visit_id measured
       <int> <chr>    <chr>   
1       1001 1,2, 3   9,0, 11 
2       1002 1|2      11, 3   
3       1003 1        12      

Note the inconsistent white space and separators.

Combinations of variables

patient_df |> 
  separate_rows(visit_id, measured,
                convert = TRUE) -> patient_separate
patient_separate
# A tibble: 6 × 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1003        1       12

Fill all combinations with complete()

Combinations of variables

patient_separate |> 
  complete(subject_id, 
           nesting(visit_id))
# A tibble: 9 × 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1002        3       NA
7       1003        1       12
8       1003        2       NA
9       1003        3       NA

Determine the fill value with a list

patient_separate |> 
  complete(subject_id, 
           nesting(visit_id), fill = list(measured = 0))
# A tibble: 9 × 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1002        3        0
7       1003        1       12
8       1003        2        0
9       1003        3        0

Use <NA>

Don’t use 0 for missing data in real-life applications.

Before we stop

Resources

  • stringi – General implementation of regular expressions
  • stringr – Wrapper for vectorisation and convenience functions
  • glue – formatting complex strings - for reference

Acknowledgments

  • Charlotte Wickham
  • Hadley Wickham
  • Marek Gagolewski (Author of stringi implementation)

Thank you for your attention!