String manipulation

stringr and regular expressions

Roland Krause

R Workshop

Monday, 10 February 2025

Introduction

Summary

`stringr` package

Simplifies and unifies string operations
Gentle stringr introduction
Different matching engines providing locale-sensitive matches.
Current version is 1.5.1.

Session set-up

Learning objectives

Perform pattern matching and string manipulation
- Detection, extraction, counting, subsetting
- Print text nicely and well-formatted
Clean up input data programmatically
Quality control and processing

Material: Regular expressions

Matching and substituting of strings with meta-characters
"^lecture([0-9]{1,2}).*[^_].qmd$/\1.qmd/g"
R for Data Science
- Strings
- Regular expressions

Strings in R

String examples in Base R

Strings are character objects

# A character object = colloquially called "string"
my_string <- "cat"

my_string

[1] "cat"

my_other_string <- 'catastrophe' # single quotes

not_so_numeric <- as.character(3.1415)

not_so_numeric

[1] "3.1415"

# A character vector

my_string_vec <- c("atg", "ttg", "tga")

Printing complex objects

C style with placeholder

sprintf("Hello %s, how is day %d of this course?", 
        "John Doe", 12)

[1] "Hello John Doe, how is day 12 of this course?"
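As an alternative to positional placeholders, stringr also offers str_glue(), which interpolates R expressions written inside braces. A minimal sketch; the variable names participant and course_day are made up for this illustration:

```r
library(stringr)

# Hypothetical variable names, chosen for this illustration only
participant <- "John Doe"
course_day <- 12

# Expressions inside {} are evaluated and inserted into the string
greeting <- str_glue("Hello {participant}, how is day {course_day} of this course?")
greeting
```

The result is a glue object that behaves like a character vector of length one.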

Comparisons of Base R and `stringr`

pattern <- "r"
my_words <- c( "cat", "cart","carrot", "catastrophe",
               "dog","rat",  "bet")

Base R

grep(pattern, my_words)

[1] 2 3 4 6

grep(pattern, my_words, value = TRUE)

[1] "cart"        "carrot"      "catastrophe" "rat"

substr(my_words, 1, 3)

[1] "cat" "car" "car" "cat" "dog" "rat" "bet"

gsub(pattern, "R", my_words)

[1] "cat"         "caRt"        "caRRot"      "catastRophe" "dog"        
[6] "Rat"         "bet"

With `stringr`

str_which(my_words, pattern)

[1] 2 3 4 6

str_subset(my_words, pattern)

[1] "cart"        "carrot"      "catastrophe" "rat"

str_sub(my_words, 1, 3)

[1] "cat" "car" "car" "cat" "dog" "rat" "bet"

str_replace_all(my_words, pattern, "R")

[1] "cat"         "caRt"        "caRRot"      "catastRophe" "dog"        
[6] "Rat"         "bet"

Strings with `stringr`

Why use `stringr`?

Easy to use

Consistency
Less typing and looking up things
All functions in stringr start with str_
All take a vector of strings as the first argument
(“data first”)
Pipes work as expected
All functions properly vectorised
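Because the string vector is always the first argument, stringr calls chain naturally with the pipe. A small sketch, assuming R 4.1 or later for the native pipe |> and re-creating my_words so the block runs on its own:

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# Each step receives the character vector from the previous step
piped <- my_words |>
  str_subset("a") |>          # keep words containing "a"
  str_to_upper() |>           # convert to upper case
  str_c(collapse = ", ")      # join into a single string

piped
#> [1] "CAT, CART, CARROT, CATASTROPHE, RAT"
```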

Useful additions

Viewing matches rendered in ANSI colors
Matches enclosed in chevrons (< >)

str_view(my_words, pattern)

[2] │ ca<r>t
[3] │ ca<r><r>ot
[4] │ catast<r>ophe
[6] │ <r>at

Well documented

(http://stringr.tidyverse.org)

Cheat sheets

Cheat sheets (cont.)

Matching strings

Detect matching strings

my_words

[1] "cat"         "cart"        "carrot"      "catastrophe" "dog"        
[6] "rat"         "bet"

str_detect(my_words, "a")

[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE

Retrieve indices

str_which(my_words, "a")

[1] 1 2 3 4 6

Retrieving (only) matching strings

str_subset(my_words, "a")

[1] "cat"         "cart"        "carrot"      "catastrophe" "rat"

Inverting matches with `negate`

str_subset(my_words, "a", negate = TRUE)

[1] "dog" "bet"

Working with matched positions

Extracting matches

str_extract(my_words, "a")

[1] "a" "a" "a" "a" NA  "a" NA

Why not str_match()?

str_match() also returns capture groups (see regular expressions) as a character matrix. Usually more complicated than what you wanted.

str_match(my_words, "a")

     [,1]
[1,] "a" 
[2,] "a" 
[3,] "a" 
[4,] "a" 
[5,] NA  
[6,] "a" 
[7,] NA

Position of a match

str_locate(my_words, "a")

     start end
[1,]     2   2
[2,]     2   2
[3,]     2   2
[4,]     2   2
[5,]    NA  NA
[6,]     2   2
[7,]    NA  NA

Locate all matches

str_locate_all(my_words, "a")

[[1]]
     start end
[1,]     2   2

[[2]]
     start end
[1,]     2   2

[[3]]
     start end
[1,]     2   2

[[4]]
     start end
[1,]     2   2
[2,]     4   4

[[5]]
     start end

[[6]]
     start end
[1,]     2   2

[[7]]
     start end

Warning

Returns list objects. Usually a lot more complicated than what you wanted.
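If only the number of matches per string is needed, the list can be reduced to a plain vector. A minimal sketch that re-creates my_words so it runs on its own; str_count() gives the same result directly:

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# One matrix per input string, one row per match
hits <- str_locate_all(my_words, "a")

# Number of rows = number of matches in each string
n_hits <- vapply(hits, nrow, integer(1))
n_hits
#> [1] 1 1 1 2 0 1 0

# The direct route
str_count(my_words, "a")
#> [1] 1 1 1 2 0 1 0
```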

How long is my string?

Length of items in a character vector

str_length(my_words)

[1]  3  4  6 11  3  3  3

Vector length

length(my_words)

[1] 7

Elements of strings

Substrings

my_words

[1] "cat"         "cart"        "carrot"      "catastrophe" "dog"        
[6] "rat"         "bet"

str_sub(my_words, 1, 4)

[1] "cat"  "cart" "carr" "cata" "dog"  "rat"  "bet"

Replace

str_replace(my_words, "a", "#")

[1] "c#t"         "c#rt"        "c#rrot"      "c#tastrophe" "dog"        
[6] "r#t"         "bet"

Check c#tastrophe: only the first match is replaced!

`str_replace_all()`

str_replace_all(my_words, 'a', '#')

[1] "c#t"         "c#rt"        "c#rrot"      "c#t#strophe" "dog"        
[6] "r#t"         "bet"

Splitting strings

Basic splitting leads to complex output

str_split(my_words, "a")

[[1]]
[1] "c" "t"

[[2]]
[1] "c"  "rt"

[[3]]
[1] "c"    "rrot"

[[4]]
[1] "c"       "t"       "strophe"

[[5]]
[1] "dog"

[[6]]
[1] "r" "t"

[[7]]
[1] "bet"

str_split() creates lists with vectors of different lengths. Harder to work on programmatically.

Simplification

str_split(my_words, "a", simplify = TRUE)

     [,1]  [,2]   [,3]     
[1,] "c"   "t"    ""       
[2,] "c"   "rt"   ""       
[3,] "c"   "rrot" ""       
[4,] "c"   "t"    "strophe"
[5,] "dog" ""     ""       
[6,] "r"   "t"    ""       
[7,] "bet" ""     ""

Creates a matrix, all rows have the same length.
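When only one piece of each split is needed, str_split_i() (available since stringr 1.5.0) returns a plain character vector instead of a list or matrix. A small sketch using the same words:

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# Keep only the first piece of each split
first_piece <- str_split_i(my_words, "a", i = 1)
first_piece
#> [1] "c"   "c"   "c"   "c"   "dog" "r"   "bet"
```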

Joining strings

Concatenation

str_c(my_words, collapse = "|")

[1] "cat|cart|carrot|catastrophe|dog|rat|bet"

Vectorization of concatenation

str_c(my_words, my_words, sep = ": ")

[1] "cat: cat"                 "cart: cart"              
[3] "carrot: carrot"           "catastrophe: catastrophe"
[5] "dog: dog"                 "rat: rat"                
[7] "bet: bet"

Padding

str_pad(my_words, 6)

[1] "   cat"      "  cart"      "carrot"      "catastrophe" "   dog"     
[6] "   rat"      "   bet"
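str_pad() also accepts the side to pad on and the padding character. A short sketch with made-up identifiers (ids is a hypothetical example vector):

```r
library(stringr)

# Hypothetical identifiers of varying width
ids <- c("7", "42", "365")

# Pad on the left with zeros up to width 3
padded_left <- str_pad(ids, width = 3, side = "left", pad = "0")
padded_left
#> [1] "007" "042" "365"

# Pad on the right with dots up to width 5
padded_right <- str_pad(ids, width = 5, side = "right", pad = ".")
padded_right
#> [1] "7...." "42..." "365.."
```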

Truncating

str_trunc(c("anachronism", 
            "antebellum", 
            "antithesis"), 6)

[1] "ana..." "ant..." "ant..."

Better treatment of conversion?

my_col <- c("F", "M", "female", "male", "male", "female", "female", "männlich")


convert_gender <- function(x){
  case_when(
    str_detect(x, "m") ~ "Male",
    str_detect(x, "M") ~ "Male", 
    str_detect(x, "F") ~ "Female", 
    str_detect(x, "f") ~ "Female", 

        TRUE ~ x
  )
}

convert_gender(my_col)

[1] "Female" "Male"   "Male"   "Male"   "Male"   "Male"   "Male"   "Male"

Need a better way to express what we want to match!

Regular expressions!

Getting started with regular expressions

Higher aims

Extract particular characters, e.g. numbers only
Express a variety of characters following or preceding a pattern
Matching any character
Not matching a particular character

Prerequisites for complex matching

Regular expressions are written as strings and converted to a regular expression object internally.
The conversion can be made explicit with regex(), which is rarely necessary.
print() shows the quoted, escaped representation of a string and is therefore misleading.
Use cat() or writeLines() to see the characters a string actually contains.
writeLines() is preferred for writing.
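The difference between the two views can be checked on a small pattern. A base R sketch; the variable name dot_pattern is only illustrative:

```r
# A pattern meant to match a literal full stop
dot_pattern <- "\\."

# print() shows the escaped source form
print(dot_pattern)
#> [1] "\\."

# writeLines() shows what the string really contains
writeLines(dot_pattern)
#> \.

# The string holds two characters: a backslash and a dot
nchar(dot_pattern)
#> [1] 2
```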

Flexible matching through metacharacters

Metacharacters

Symbols matching groups of characters.

. (dot) represents any character except for the newline character (\n)

str_view(my_words, ".at")

[1] │ <cat>
[4] │ <cat>astrophe
[6] │ <rat>

. on its own matches exactly one character

 str_subset(my_words, "c..t")

[1] "cart"

+ (plus) represents one or more occurrences

str_subset(my_words, 'c.r+')

[1] "cart"   "carrot"

* (star) represents zero or more occurrences

str_subset(my_words, 'c.r*')

[1] "cat"         "cart"        "carrot"      "catastrophe"

Quantifying a number of matches

Quantifiers

The preceding item will be matched …

? at most once.
* zero or more times.
+ one or more times.
{n} exactly n times.
{n,} n or more times.
{n,m} at least n times, but not more than m times.

Examples

?

dna <- "ATGGTAACCGGTAGGTAGTAAAGGTCCC"
str_view(dna, "AA?")

[1] │ <A>TGGT<AA>CCGGT<A>GGT<A>GT<AA><A>GGTCCC

+

str_view(dna, "AA+")

[1] │ ATGGT<AA>CCGGTAGGTAGT<AAA>GGTCCC

{3,}

str_view(dna, "A{3,}")

[1] │ ATGGTAACCGGTAGGTAGT<AAA>GGTCCC
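The bounded form {n,m} is not shown above. A short sketch on the same sequence, using str_extract_all() to list every run of two or three A and str_count() to count runs of any length:

```r
library(stringr)

dna <- "ATGGTAACCGGTAGGTAGTAAAGGTCCC"

# All runs of two to three consecutive A
a_runs <- str_extract_all(dna, "A{2,3}")[[1]]
a_runs
#> [1] "AA"  "AAA"

# Number of runs of one or more A
n_runs <- str_count(dna, "A+")
n_runs
#> [1] 5
```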

Greedy and lazy matching

Matches are greedy by default

Match the longest possible subsequence.

dna <- "ATGGTAACCGGTAGGTAGTAAAGGTCCC"
str_extract(dna, "AAG.{2,5}")

[1] "AAGGTCCC"

str_extract(dna, "ATG.+C")

[1] "ATGGTAACCGGTAGGTAGTAAAGGTCCC"

Lazy matching

Adding ? after a quantifier makes it lazy, so that it returns the shortest possible match.

str_extract(dna, "AAG.{2,5}?")

[1] "AAGGT"

str_extract(dna, "ATG.+?C")

[1] "ATGGTAAC"

Anchors

`^` Start of string

my_words

[1] "cat"         "cart"        "carrot"      "catastrophe" "dog"        
[6] "rat"         "bet"

str_subset(my_words , '^c')

[1] "cat"         "cart"        "carrot"      "catastrophe"

`$` End of string

str_subset(my_words, 'r.$')

[1] "cart"
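With anchors, the gender recoding from the "Better treatment of conversion?" slide can be expressed more precisely. A possible sketch, not the only solution: it assumes dplyr is available for case_when() and uses regex(ignore_case = TRUE) so that one pattern covers both upper and lower case.

```r
library(stringr)
library(dplyr)

my_col <- c("F", "M", "female", "male", "male", "female", "female", "männlich")

convert_gender <- function(x) {
  case_when(
    # Anchoring at the start avoids matching the "m" inside "female"
    str_detect(x, regex("^m", ignore_case = TRUE)) ~ "Male",
    str_detect(x, regex("^f", ignore_case = TRUE)) ~ "Female",
    TRUE ~ x
  )
}

converted <- convert_gender(my_col)
converted
#> [1] "Female" "Male"   "Female" "Male"   "Male"   "Female" "Female" "Male"
```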

[ ] Character classes

Define character classes

Group	Matches
[a-z]	Lowercase letters
[a-zA-Z]	Any (ascii) letter
[0-9]	Digits
[aeiou]	Vowel
[0-7ivx]	Any of 0 to 7, i, v, and x
[^a-z]	No lowercase letters (negation)

Example

str_subset(c("Rpl12", "Rpn12", 
             "Rps1",
             "Pre1"), 'Rp[ls]')

[1] "Rpl12" "Rps1"

str_subset(c("Rpl12", "Rpn12", 
             "Rps1",
             "Pre1"), 'Rp[^ls]')

[1] "Rpn12"

Character classes

Shorthand classes

Pattern	Matches	Complement	Matches
`\\d`	Digit	`\\D`	No digit
`\\s`	Whitespace	`\\S`	No whitespace
`\\w`	Word chars	`\\W`	No word char
`\\b`	Boundaries	`\\B`	Within words

Learn these!

These classes cover most applications.

Examples

uniprot <- c("Q6QU88_CALBL", "CO1A2_HUMAN", "SAMH1_HUMAN", 
             "NPRL2_DROME", "GLUC_HUMAN")

str_extract(uniprot, "\\d+")

[1] "6" "1" "1" "2" NA

str_count(uniprot, "\\w+")

[1] 1 1 1 1 1

Tip

Single matches count as 1.

Negation can be a good choice

str_extract(uniprot, "\\S+_")

[1] "Q6QU88_" "CO1A2_"  "SAMH1_"  "NPRL2_"  "GLUC_"
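The boundary shorthand `\\b` from the table is not used in the examples above. A small sketch with made-up phrases (the vector phrases is hypothetical) showing how it restricts a match to whole words:

```r
library(stringr)

# Hypothetical example strings
phrases <- c("cat", "catastrophe", "a cat sat")

# Without boundaries, "cat" also matches inside "catastrophe"
any_cat <- str_detect(phrases, "cat")
any_cat
#> [1] TRUE TRUE TRUE

# With boundaries on both sides, only the whole word matches
whole_cat <- str_detect(phrases, "\\bcat\\b")
whole_cat
#> [1]  TRUE FALSE  TRUE
```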

Built-in character classes

Readable shortcuts

Expression	Description
[:upper:]	Upper-case letters.
[:lower:]	Lower-case letters.
[:alpha:]	Alphabetic characters: ‘[:lower:]’ and ‘[:upper:]’.
[:digit:]	Digits: ‘0 1 2 3 4 5 6 7 8 9’.
[:punct:]	Punctuation characters: ‘! " # $ % & ' ( ) * + , - . / : ; < = > ? @’ and others.
[:space:]	Space characters: tab, newline, vertical tab, form feed, carriage return, and space.
[:blank:]	Blank characters: space and tab.
[:alnum:]	Alphanumeric characters: ‘[:alpha:]’ and ‘[:digit:]’.
[:graph:]	Graphical characters: ‘[:alnum:]’ and ‘[:punct:]’.

Examples

uniprot <- c("Q6QU88_CALBL", "CO1A2_HUMAN", "SAMH1_HUMAN", 
             "NPRL2_DROME", "GLUC_HUMAN")

str_extract(uniprot, "[:digit:]+")

[1] "6" "1" "1" "2" NA

Watch for locale settings!

Note that the set of alphabetic characters includes letters such as ß, ç or ö, which are very common in some languages. [:alpha:] is therefore more general than [A-Za-z], which matches ASCII letters only.
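The difference shows up as soon as a word contains non-ASCII letters. A minimal sketch; the German word is written with Unicode escapes so the block does not depend on the file encoding, and the class is written in the bracketed form [[:alpha:]]:

```r
library(stringr)

# "Größe" spelled with Unicode escapes for ö and ß
word <- "Gr\u00f6\u00dfe"

# The ASCII range stops at the first non-ASCII letter
ascii_part <- str_extract(word, "[A-Za-z]+")
ascii_part
#> [1] "Gr"

# The alphabetic class covers the whole word
alpha_part <- str_extract(word, "[[:alpha:]]+")
identical(alpha_part, word)
#> [1] TRUE
```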

Matching metacharacters

Strings containing a full stop

We saw special characters such as ., +, * or $ having special meaning in regular expressions.

vec2 <- c("YKL045W-A", "12+45=57", "$1200.00", "ID2.2")

str_subset(vec2 , '.')

[1] "YKL045W-A" "12+45=57"  "$1200.00"  "ID2.2"

Not what we wanted!

Use an escape character

str_subset(vec2, '\.')

Error: '\.' is an unrecognized escape in character string (<input>:1:20)

Errors we don’t want either!

Need to escape with `\` twice!

str_subset(vec2, '\\.')

[1] "$1200.00" "ID2.2"
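When the pattern is meant literally, escaping can be avoided altogether. A short sketch of two alternatives: wrapping the pattern in fixed(), or putting the metacharacter into a character class.

```r
library(stringr)

vec2 <- c("YKL045W-A", "12+45=57", "$1200.00", "ID2.2")

# fixed() compares the pattern literally, no regular expression involved
with_dot <- str_subset(vec2, fixed("."))
with_dot
#> [1] "$1200.00" "ID2.2"

with_plus <- str_subset(vec2, fixed("+"))
with_plus
#> [1] "12+45=57"

# Inside a character class the dot loses its special meaning
with_dot_class <- str_subset(vec2, "[.]")
with_dot_class
#> [1] "$1200.00" "ID2.2"
```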

Implicit conversion

Patterns are written as ordinary R strings and converted to regular expressions implicitly. R processes backslash escapes when it reads the string, before the regular expression engine sees it. To pass a single \ to the engine, the string must therefore be written with \\.

No escape

To match a \, our pattern must represent \\.

How to match c("a\\backslash", "nobackslash", "slash","\n")?

Note the difference when printing meta-characters.

slash_vec <- c("a\\backslash", "nobackslash", "slash","\n")
print(slash_vec)

[1] "a\\backslash" "nobackslash"  "slash"        "\n"

cat(slash_vec)

a\backslash nobackslash slash

str_subset(slash_vec, '\\')

Error in stri_subset_regex(string, pattern, omit_na = TRUE, negate = negate, : Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE, context=`\`)

Use more backslashes!!!

Our string must contain 4 backslashes!

str_subset(slash_vec, '\\\\')

[1] "a\\backslash"

Or raw strings

Raw strings are written as r"( )" and do not require additional escape sequences.

str_subset(slash_vec, r"(\\)")

[1] "a\\backslash"
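Raw strings are also handy for the shorthand classes. A small sketch, assuming R 4.0.0 or later (where raw strings were introduced), showing that both spellings produce the same pattern:

```r
library(stringr)

uniprot <- c("Q6QU88_CALBL", "CO1A2_HUMAN", "SAMH1_HUMAN",
             "NPRL2_DROME", "GLUC_HUMAN")

# The raw string and the escaped string are the same two-character-class pattern
same_pattern <- identical(r"(\d+)", "\\d+")
same_pattern
#> [1] TRUE

# Used as a pattern, the raw string needs no doubled backslash
digits <- str_extract(uniprot, r"(\d+)")
digits
#> [1] "6" "1" "1" "2" NA
```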

See R for Data Science, 14.2.2

Disambiguation

`^` is …

Anchor at start of sequence and
Negator in groups.

(uniprot)

[1] "Q6QU88_CALBL" "CO1A2_HUMAN"  "SAMH1_HUMAN"  "NPRL2_DROME"  "GLUC_HUMAN"

str_view(uniprot, "^[^S]", html = TRUE)

`?` is …

quantifier and
lazy switch.

str_extract(uniprot, "\\w?") # quantifier (0, 1, greedy!)

[1] "Q" "C" "S" "N" "G"

str_extract(uniprot, "\\w??") # quantifier and lazy switch

[1] "" "" "" "" ""

Grouping

( ) Grouping with parentheses

str_view(uniprot, "(SAM)", html = TRUE)

[ ] Character classes with square brackets

str_view(uniprot, "[SAM]", html = TRUE)

Combining groupings

Alternation operator `|` ( logical OR )

str_extract(my_words, '[cd](a.|o.)+')

[1] "cat"   "car"   "car"   "catas" "dog"   NA      NA

Group terms within parentheses `(` and `)`

str_match(my_words, '(cat)|(dog)')

     [,1]  [,2]  [,3] 
[1,] "cat" "cat" NA   
[2,] NA    NA    NA   
[3,] NA    NA    NA   
[4,] "cat" "cat" NA   
[5,] "dog" NA    "dog"
[6,] NA    NA    NA   
[7,] NA    NA    NA

Capture groups with `str_match()`

str_match(my_words, 'c(a.)+t')

     [,1]     [,2]
[1,] NA       NA  
[2,] "cart"   "ar"
[3,] NA       NA  
[4,] "catast" "as"
[5,] NA       NA  
[6,] NA       NA  
[7,] NA       NA

The output becomes complex when there are multiple capture groups.

Backreferences

Group matches

\1, \2 and so forth refer to groups matched with ().

Constructing new strings from regular expression matches

str_replace(uniprot, '(\\S+)_(\\S+)', "\\2: \\1")

[1] "CALBL: Q6QU88" "HUMAN: CO1A2"  "HUMAN: SAMH1"  "DROME: NPRL2" 
[5] "HUMAN: GLUC"
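Backreferences can also be used inside the pattern itself, to require that a captured group occurs again. A minimal sketch that finds words with a doubled character:

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# (.) captures any character, \\1 requires the same character again
doubled <- str_subset(my_words, "(.)\\1")
doubled
#> [1] "carrot"
```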

Helpers

`regexplain`

Simple addin for RStudio by Garrick Aden-Buie

Test regular expressions on the fly
Reference library
Cheatsheet
test it live

devtools::install_github("gadenbuie/regexplain")

regexplain::regexplain_gadget()

Application of string matching with tidyr

Separate wider functions

Key-value pairs

# A tibble: 6 × 2
  subject_id gender_age
       <int> <chr>     
1       1001 m-23      
2       1002 m-31      
3       1003 m-63      
4       1004 f-53      
5       1005 m-22      
6       1006 m-29

patient |> 
  separate_wider_delim(gender_age, 
           names = c("sex", "age"), 
           delim = "-")

# A tibble: 6 × 3
  subject_id sex   age  
       <int> <chr> <chr>
1       1001 m     23   
2       1002 m     31   
3       1003 m     63   
4       1004 f     53   
5       1005 m     22   
6       1006 m     29

No separator - position

# A tibble: 6 × 2
  subject_id gender_age
       <int> <glue>    
1       1001 f41       
2       1002 m34       
3       1003 m30       
4       1004 f34       
5       1005 f56       
6       1006 m61

patient |> 
  separate_wider_position(gender_age, 
                          c(sex = 1, age = 2))

# A tibble: 6 × 3
  subject_id sex   age  
       <int> <chr> <chr>
1       1001 f     41   
2       1002 m     34   
3       1003 m     30   
4       1004 f     34   
5       1005 f     56   
6       1006 m     61

Separate longer exists

Check separate_longer_position() or separate_longer_delim() if you split repeating values.
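A minimal sketch of the longer variant, assuming tidyr 1.3.0 or later; the tibble visits is a made-up example, not the workshop data:

```r
library(tidyr)
library(tibble)

# Hypothetical data: several visits stored in one cell
visits <- tibble(
  subject_id = 1001:1002,
  visit_id = c("1,2,3", "1,2")
)

# One row per delimited value; other columns are repeated
visits_long <- separate_longer_delim(visits, visit_id, delim = ",")
visits_long
```

The result has five rows, and visit_id remains a character column.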

Pasting with `unite()`

Input tibble

data_value <-
  tibble(
    year = c(2015, 2014, 2014),
    month = c(11, 2, 4),
    day = c(23, 1, 30),
    value = c("high", "low", "low"))

data_value

# A tibble: 3 × 4
   year month   day value
  <dbl> <dbl> <dbl> <chr>
1  2015    11    23 high 
2  2014     2     1 low  
3  2014     4    30 low

Demonstration only.

Use the package lubridate for actually working with dates!

`unite()`

date_unite <-  unite(data_value, 
                    date, year, month, day, 
                    sep = "-") 

date_unite

# A tibble: 3 × 2
  date       value
  <chr>      <chr>
1 2015-11-23 high 
2 2014-2-1   low  
3 2014-4-30  low

No need to clean up old columns.
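Note that unite() pastes the values as they are, so months and days keep their single digits. If zero-padded components are wanted, str_pad() can be combined with str_c(). A sketch on plain vectors with the same values (for demonstration only, as with the slide above):

```r
library(stringr)

year  <- c(2015, 2014, 2014)
month <- c(11, 2, 4)
day   <- c(23, 1, 30)

# Pad month and day to two digits before joining
iso_date <- str_c(
  as.character(year),
  str_pad(as.character(month), width = 2, pad = "0"),
  str_pad(as.character(day), width = 2, pad = "0"),
  sep = "-"
)
iso_date
#> [1] "2015-11-23" "2014-02-01" "2014-04-30"
```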

Parsing dates with `lubridate` functions

A gift from your collaborators

visit_times <- tribble(
  ~subject, ~visit_date,
  1, "01/07/2001",
  2, "01.MAY.2012",
  3, "12-07-2015",
  4, "4/5/14",
  5, "12. Jun 1999"
)

Lubridate to the rescue!

visit_times |> 
  mutate(good_date = 
           lubridate::dmy(visit_date))

# A tibble: 5 × 3
  subject visit_date   good_date 
    <dbl> <chr>        <date>    
1       1 01/07/2001   2001-07-01
2       2 01.MAY.2012  2012-05-01
3       3 12-07-2015   2015-07-12
4       4 4/5/14       2014-05-04
5       5 12. Jun 1999 1999-06-12

lubridate has a range of functions for parsing ill-formatted dates and times.

Separate rows with multiple entries

Multiple values per cell

patient_df <- tibble(
    subject_id = 1001:1003, 
    visit_id = c("1,2, 3", "1|2", "1"),
    measured = c("9,0, 11", "11, 3", "12"))
patient_df

# A tibble: 3 × 3
  subject_id visit_id measured
       <int> <chr>    <chr>   
1       1001 1,2, 3   9,0, 11 
2       1002 1|2      11, 3   
3       1003 1        12

Note the inconsistent white space and separators.

Combinations of variables

patient_df |> 
  separate_rows(visit_id, measured,
                convert = TRUE) -> patient_separate
patient_separate

# A tibble: 6 × 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1003        1       12

Fill all combinations with `complete()`

Combinations of variables

patient_separate |> 
  complete(subject_id, 
           nesting(visit_id))

# A tibble: 9 × 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1002        3       NA
7       1003        1       12
8       1003        2       NA
9       1003        3       NA

Determine the fill value with a list

patient_separate |> 
  complete(subject_id, 
           nesting(visit_id), fill = list(measured = 0))

# A tibble: 9 × 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1002        3        0
7       1003        1       12
8       1003        2        0
9       1003        3        0

Use <NA>

Don’t use 0 for missing data in real-life applications.

Before we stop

Resources

stringi – General implementation of regular expressions
stringr – Wrapper for vectorisation and convenience functions
glue – formatting complex strings - for reference

Acknowledgments

Charlotte Wickham
Hadley Wickham
Marek Gagolewski (Author of stringi implementation)

Further reading

R for Data Science
- Strings
- Regular expressions
