class: title-slide,center
# String manipulation ## Reproducible data munging <img src="https://raw.githubusercontent.com/tidyverse/stringr/master/man/figures/logo.png" width="100px"/>] ### Roland Krause | rworkshop | 2021-09-08 --- # Session set-up .pull-left[ .w-100.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.gbox[Learning objectives]] .float-img[
] * Perform pattern matching and string manipulation * Print text nicely and well-formatted ] .w-100.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.ybox[`stringr` package]]
* Simplifies and unifies string operations in base R * Detection, extraction, counting, subsetting * Gentle [`stringr` introduction](http://stringr.tidyverse.org) * Different matching engines, e.g. locale-sensitive ]] -- .pull-right[ .w-100.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.gbox[Regular expressions]] .float-img[
] * Matching and substituting of strings * "^lecture([0-9]{1,2}).*[^_].Rmd$/\1.Rmd/g" * See [R for data science](http://r4ds.had.co.nz/strings.html#matching-patterns-with-regular-expressions) ] .w-100.bg-washed-blue.b--blue.ba.bw2.br3.pb2.shadow-5.ph3.mt3.mr1[ .ybox[`glue` package]
* Join and output complex strings * Concise [`glue` introduction](http://glue.tidyverse.org) ] ] --- # String examples in Base R -- .pull-left[ .bbox[Strings are character objects] ```r # A character object = colloquially called "string" my_string <- "cat" my_string ``` ``` [1] "cat" ``` ```r my_other_string <- 'catastrophe' # single quotes not_so_numeric <- as.character(3.1415) not_so_numeric ``` ``` [1] "3.1415" ``` ```r # A character vector my_string_vec <- c("atg", "ttg", "tga") ``` ] .pull-right[ .bbox[Printing complex objects] #### C style with placeholder ```r sprintf("Hello %s, how is day %d of this course?", "John Doe", 12) ``` ``` [1] "Hello John Doe, how is day 12 of this course?" ``` ] --- # Comparisons of Base R and `stringr` ```r pattern <- "r" my_words <- c( "cat", "cart","carrot", "catastrophe", "dog","rat", "bet") ``` .pull-left[ ### Base R ```r grep(pattern, my_words) ``` ``` [1] 2 3 4 6 ``` ```r grep(pattern, my_words, value = TRUE) ``` ``` [1] "cart" "carrot" "catastrophe" "rat" ``` ```r substr(my_words, 1, 3) ``` ``` [1] "cat" "car" "car" "cat" "dog" "rat" "bet" ``` ```r gsub(pattern, "R", my_words) ``` ``` [1] "cat" "caRt" "caRRot" "catastRophe" "dog" [6] "Rat" "bet" ``` ] -- .pull-right[ ### `stringr` ```r library(stringr) str_which(my_words, pattern) ``` ``` [1] 2 3 4 6 ``` ```r str_subset(my_words, pattern) ``` ``` [1] "cart" "carrot" "catastrophe" "rat" ``` ```r str_sub(my_words, 1, 2) ``` ``` [1] "ca" "ca" "ca" "ca" "do" "ra" "be" ``` ```r str_replace(my_words, pattern, "R") ``` ``` [1] "cat" "caRt" "caRrot" "catastRophe" "dog" [6] "Rat" "bet" ``` ] --- # Why use `stringr`? .pull-left[ .bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.gbox[Motivation]]
* Consistency
* Less typing and looking up things

.bbox[Usage]

* All functions in `stringr` start with `str_`
* All take a vector of strings as the first argument
  + ("data first")
  + `%>%` now works
* All functions are properly vectorised
]]

.pull-right[
.flex[
.w-70.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[
.bbox[Useful additions]

Viewing matches rendered in HTML

<style type="text/css">
.box .html-widget {
  width: 100% !important;
  padding: 10px;
  margin: 10px;
  background-color: white;
  color: black;
}
.str_view .match {
  border: 2px solid orange;
  background-color: yellow;
}
</style>

```r
str_view_all(my_words, pattern)
```
]]] --- class: nvs3 # `stringr` overview
--- class: nvs3, inverse
--- class: nvs3, inverse
--- # Length .pull-left[.w-100.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph4[ .large[.bbox[Length of items in character vector]] ```r str_length(my_words) ``` ``` [1] 3 4 6 11 3 3 3 ``` ]] -- .pull-right[ .w-100.bg-washed-red.b--red.ba.bw2.br3.shadow-5.ph3.mr1[ .large[.rbox[Warning]] ```r length(my_words) ``` ``` [1] 7 ``` ]] --- # Elements of strings .pull-left[ ### Substrings ```r my_words ``` ``` [1] "cat" "cart" "carrot" "catastrophe" "dog" [6] "rat" "bet" ``` ```r str_sub(my_words, 1, 4) ``` ``` [1] "cat" "cart" "carr" "cata" "dog" "rat" "bet" ``` ### Replace
```r
str_replace(my_words, "a", "A")
```

```
[1] "cAt"         "cArt"        "cArrot"      "cAtastrophe" "dog"
[6] "rAt"         "bet"
```
]

--

.pull-right[
### Splitting strings

```r
str_split(my_words, "a")
```

```
[[1]]
[1] "c" "t"

[[2]]
[1] "c"  "rt"

[[3]]
[1] "c"    "rrot"

[[4]]
[1] "c"       "t"       "strophe"

[[5]]
[1] "dog"

[[6]]
[1] "r" "t"

[[7]]
[1] "bet"
```
]

---

# Matching strings

.pull-left[
### Detect matching strings
```r
str_detect(my_words, "o")
```

```
[1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
```

### Retrieving (only) matching strings

```r
str_subset(my_words, "r")
```

```
[1] "cart"        "carrot"      "catastrophe" "rat"
```

.w-100.bg-washed-red.b--red.ba.bw2.br3.shadow-5.ph3.mr1[
Primarily useful interactively. Dangerous in programming!
]
]

--

.pull-right[
### Retrieving matching strings

```r
str_match(my_words, "a")
```

```
     [,1]
[1,] "a"
[2,] "a"
[3,] "a"
[4,] "a"
[5,] NA
[6,] "a"
[7,] NA
```

Includes capture groups (see regular expressions)

### Extracting matches

```r
str_extract(my_words, "a")
```

```
[1] "a" "a" "a" "a" NA  "a" NA
```
]

---

# Better treatment of conversion

```r
library(dplyr) # provides case_when()

my_col <- c("F", "M", "female", "male", "male", "female", "female", "männlich")

# Map free-text entries to two labels based on their first letter
convert_gender <- function(x) {
  case_when(
    str_detect(x, "^[Ff]") ~ "Female",
    str_detect(x, "^[Mm]") ~ "Male",
    TRUE ~ x # leave anything else unchanged
  )
}
convert_gender(my_col)
```

```
[1] "Female" "Male"   "Female" "Male"   "Male"   "Female" "Female" "Male"
```

---
class: middle, center, inverse

# Regular expressions

---

# Getting started with regular expressions

.pull-left[
.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[
.large[.gbox[Higher aims]]

* Extract particular characters, e.g. numbers only
* Express a variety of characters following or preceding patterns
* Matching any character
* Not matching a particular character
]
]

--

.pull-right[
.bg-washed-blue.b--blue.ba.bw2.br3.shadow-5.ph3.mt3.mr1[
.large[.bbox[Prerequisites for complex matching]]

* Regular expressions look like strings but are converted to a particular expression object.
* The conversion can be made explicit with `regex()` -- rarely necessary
* `print()` shows the quoted, escaped string, which can be misleading
* Use `cat()` or `writeLines()` to see the string as it actually is, with escapes resolved.
  + `writeLines()` preferred for writing
]]

---

# Flexible matching through *metacharacters*

.pull-left[
.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mr1.large[
.bbox[Metacharacters ]

**Symbols matching a variety of characters as opposed to literal matches.**
]

.bg-washed-blue.b--blue.ba.bw2.br3.shadow-5.ph3.mt1.mr1[
* `.` (dot) represents any character
* Exception is the newline character (`\n`)

```r
str_view(my_words, ".at")
```
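
`str_view()` renders an HTML widget, so nothing shows up in a plain console log. A minimal text-only sketch of the same pattern, assuming `my_words` as defined on the earlier slides:

```r
library(stringr)

# Same vector as on the earlier slides
my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# "." stands for exactly one arbitrary character, followed by the literal "at"
dot_matches <- str_extract(my_words, ".at")
dot_matches
# "cat" NA NA "cat" NA "rat" NA
```

Only the first match per string is returned; strings without a match yield `NA`.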
] ] -- .pull-right[ .bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt1.mr1[ * `.` matches exactly one occurrence ```r str_subset(my_words, "c.t") ``` ``` [1] "cat" "catastrophe" ``` * `+` (plus) represents one or more occurrences ```r str_subset(my_words, 'c.r+') ``` ``` [1] "cart" "carrot" ``` * `*` (star) represents zero or more occurrences ```r str_subset(my_words, 'c.r*') ``` ``` [1] "cat" "cart" "carrot" "catastrophe" ``` ]] --- # Grouping .pull-left[ .bg-washed-blue.b--blue.ba.bw2.br3.shadow-5.ph3.mt1.mr1[ .box[#### Group terms with parentheses `(` and `)`] ```r str_view(my_words, 'c(.r)+t') ```
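
As a text-only sketch of the same grouped pattern (again assuming `my_words` from the earlier slides): the two-character unit `(.r)` has to occur at least once between `c` and `t`.

```r
library(stringr)

# Same vector as on the earlier slides
my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# "(.r)+" repeats the unit "any character, then r" one or more times
grouped <- str_subset(my_words, "c(.r)+t")
grouped
# "cart"
```

"carrot" drops out because after `c` + `ar` the next character is neither `t` nor the start of another `.r` unit.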
]]

--

.pull-right[
.bg-washed-blue.b--blue.ba.bw2.br3.shadow-5.ph3.mt1.mr1[
.box[#### Capture groups with `str_match()`]

```r
str_match(my_words, 'c(.r)*t')
```
]]

---

### Alternation operator `|` ( logical OR )

```r
str_subset(my_words, '(c.t)|(c.rt)')
```

```
[1] "cat"         "cart"        "catastrophe"
```

---

# Quantifying a number of matches

.pull-left[
.w-70.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[
.bbox[The preceding item will be matched ...]

* `?` at most once.
* `*` zero or more times.
* `+` one or more times.
* `{n}` exactly 'n' times.
* `{n,}` 'n' or more times.
* `{n,m}` at least 'n' times, but not more than 'm' times.
]]

.pull-right[
```r
dna <- "ATGGTAACCGGTAGGTAGTAAAGGTCCC"
str_view_all(dna, "AA?")
```
```r str_view_all(dna, "AA+") ```
```r str_view_all(dna, "A{3,}") ```
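
The highlighted widgets are not visible outside the browser. A rough text equivalent of the three calls, sketched with `str_extract_all()` on the same `dna` string:

```r
library(stringr)

dna <- "ATGGTAACCGGTAGGTAGTAAAGGTCCC"

# str_extract_all() returns a list with one element per input string,
# so [[1]] picks the matches for our single sequence
optional_a <- str_extract_all(dna, "AA?")[[1]]   # "A", optionally followed by one more "A"
repeated_a <- str_extract_all(dna, "AA+")[[1]]   # "A" followed by one or more "A"
three_plus <- str_extract_all(dna, "A{3,}")[[1]] # at least three "A" in a row

optional_a # "A" "AA" "A" "A" "AA" "A"
repeated_a # "AA" "AAA"
three_plus # "AAA"
```

Note how `AA?` splits the run `AAA` into `AA` and `A`: matches never overlap, and each one is as long as the quantifier allows.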
]

---

# *Greedy* and *lazy* matching

.pull-left[
.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3[
.bbox[Matches are *greedy* by default ]

Match the longest possible subsequence.

```r
dna <- "ATGGTAACCGGTAGGTAGTAAAGGTCCC"
str_extract(dna, "AAG.{2,5}")
```

```
[1] "AAGGTCCC"
```

```r
str_extract(dna, "ATG.+C")
```

```
[1] "ATGGTAACCGGTAGGTAGTAAAGGTCCC"
```
]]

--

.pull-right[
.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3[
.bbox[*Lazy* matching]

Adding `?` after a quantifier makes it *lazy*, so it returns the shortest possible match.

```r
str_extract(dna, "AAG.{2,5}?")
```

```
[1] "AAGGT"
```

```r
str_extract(dna, "ATG.+?C")
```

```
[1] "ATGGTAAC"
```
]]

---

# Anchors

.pull-left[
# `^` Start of string

```r
my_words
```

```
[1] "cat"         "cart"        "carrot"      "catastrophe" "dog"
[6] "rat"         "bet"
```

```r
str_subset(my_words , '^c')
```

```
[1] "cat"         "cart"        "carrot"      "catastrophe"
```
]

--

.pull-right[
# `$` End of string

```r
str_subset(my_words, 'r.$')
```

```
[1] "cart"
```
]

---

# Character classes

.pull-left[
### Special characters

.fl[
|Pattern | Matches    | Complement | Matches           |
|:------:|------------|:----------:|-------------------|
|`\d`    | Digit      | `\D`       | No digit          |
|`\s`    | Whitespace | `\S`       | No whitespace     |
|`\w`    | Word chars | `\W`       | No word char      |
|`\b`    | Boundaries | `\B`       | Within words      |
|`\p{}`  | Property   | `\P{}`     | Not that property |
]]

.pull-right[
### Example for Unicode properties

.left.fl[
| Code | Description |
|-----:|-------------|
|Ll    | Lowercase letter |
|Lu    | Uppercase letter |
|Sc    | A currency sign |
|Sm    | Symbol of mathematical use|
| ...  | See [documentation](http://www.unicode.org/reports/tr44/#Property_Index)|

Powerful but complex to use.
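
A small sketch of the property codes from the table in use. The `prices` vector is a made-up example, not part of the course data:

```r
library(stringr)

# Hypothetical example strings (ASCII only, to stay independent of the locale)
prices <- c("EUR 30", "$35", "12 + 3 = 15")

currency <- str_extract(prices, "\\p{Sc}")            # first currency sign per string
maths    <- str_extract_all(prices[3], "\\p{Sm}")[[1]] # all mathematical symbols
upper    <- str_extract(prices, "\\p{Lu}+")           # first run of uppercase letters

currency # NA "$" NA
maths    # "+" "="
upper    # "EUR" NA NA
```

As with the other escapes, the backslash has to be doubled inside an R string.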
]] --- # Examples .pull-left[ .gbox[Matching digits] ```r str_extract(my_words, "\\d+") ``` ``` [1] NA NA NA NA NA NA NA ``` .bbox[Counting word elements] ```r str_count(my_words, "\\w+") ``` ``` [1] 1 1 1 1 1 1 1 ``` ```r str_length(my_words) ``` ``` [1] 3 4 6 11 3 3 3 ``` ```r str_count(my_words, "\\w") ``` ``` [1] 3 4 6 11 3 3 3 ``` Counting all matches to words. ] -- .pull-right[ .gbox[ Matching "everything that you want" ] ```r str_extract(my_words, "\\S+") ``` ``` [1] "cat" "cart" "carrot" "catastrophe" "dog" [6] "rat" "bet" ``` .gbox[Unicode ] ```r str_extract(my_words, "\\p{Lu}+") ``` ``` [1] NA NA NA NA NA NA NA ``` ] --- # Extended list of regular expressions .pull-left[ ### Readable short cuts .w-70.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3[ Built-in (`stringi` - `stringr`) Requires `perl = TRUE` flag in base R. Works out of the box in `stringr`. ] - **[:upper:]** Upper-case letters. - **[:lower:]** Lower-case letters. - **[:alpha:]** Alphabetic characters: '[:lower:]' and '[:upper:]'. - **[:digit:]** Digits: '0 1 2 3 4 5 6 7 8 9'. - **[:punct:]** Punctuation characters: '! " # $ % & ' ( ) * + , - . / : ; < = > ? @ ' and others. - **[:space:]** Space characters: tab, newline, vertical tab, form feed, carriage return, and space. - **[:blank:]** Blank characters: space and tab. - **[:alnum:]** Alphanumeric characters: '[:alpha:]' and '[:digit:]'. - **[:graph:]** Graphical characters: '[:alnum:]' and '[:punct:]'. 
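
A short sketch of two of these classes at work. The `ids` vector is a hypothetical example, and the classes are written in the double-bracket form `[[:digit:]]`, which should also be accepted by base R functions such as `grepl()`:

```r
library(stringr)

# Hypothetical identifiers, not part of the course data
ids <- c("Rpl12", "YKL045W", "let7")

digits <- str_extract(ids, "[[:digit:]]+") # first run of digits in each string
uppers <- str_count(ids, "[[:upper:]]")    # number of upper-case letters per string

digits # "12" "045" "7"
uppers # 1 4 0
```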
]

--

.pull-right[
```r
my_words
```

```
[1] "cat"         "cart"        "carrot"      "catastrophe" "dog"
[6] "rat"         "bet"
```

```r
str_subset(my_words, "[:punct:]")
```

```
character(0)
```

```r
str_extract(my_words, "[:punct:]")
```

```
[1] NA NA NA NA NA NA NA
```
]

---

# Roll your own character class

.flex[
.w-30.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph4.mt1[
.large[.gbox[Define groups]]

* **[a-z]** lowercase letters
* **[a-zA-Z]** any (ASCII) letter
* **[0-9]** any digit
* **[aeiou]** any vowel
* **[0-7ivx]** any of 0 to 7, i, v, and x
]
.w-40.bg-washed-blue.b--blue.ba.bw2.br3.shadow-5.ph4.mt1.mr1.ml1[
.bbox[Example]

```r
str_subset(c("Rpl12", "Rpn12", "Rps1", "Pre1"), 'Rp[ls]')
```

```
[1] "Rpl12" "Rps1"
```
]
.w-30.bg-washed-red.b--red.ba.bw2.br3.shadow-5.ph3.mt1.mr1[
.rbox[Rely on built-in groups where possible]

Note that the set of alphabetic characters includes letters such as ß, ç or ö, which are very common in some languages. The built-in classes are therefore more general than **[A-Za-z]**, which covers ASCII characters only.

.float-img[
]
]
]

---

# Matching metacharacters

.pull-left[
.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[
We saw special characters such as

- `.`
- `+`
- `*`
- or `$`

What if we want to match them?
]]

--

.pull-right[
.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[
.large[.gbox[Strings containing a full stop]]

```r
vec2 <- c("YKL045W-A", "12+45=57", "$1200.00", "ID2.2")
str_subset(vec2 , '.')
```

```
[1] "YKL045W-A" "12+45=57"  "$1200.00"  "ID2.2"
```

Not what we wanted.
]
]

---

# Excursion

.pull-left[
.bg-washed-red.b--red.ba.bw2.br3.shadow-5.ph3.mt3.mr1[
.rbox[Implicit conversion]

R passes regular expressions around as ordinary strings, without any explicit action by the user. By the time a *string* is converted to a *regular expression* internally, single backslashes (`\`) have already been interpreted as string escapes.

.right[
] ] .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .bbox[Solution] .mt1[
Need to escape `\` with an additional one `-> \\`. ] ] ] -- .pull-right[ ### Double escape ```r str_subset(vec2, '\.') ``` ``` Error: '\.' is an unrecognized escape in character string starting "'\." ``` ```r str_subset(vec2, '\\.') ``` ``` [1] "$1200.00" "ID2.2" ``` ] --- # No escape .pull-left[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ To match a `\`, our pattern must represent `\\`. How to match `c("a\\backslash", "nobackslash", "slash","\n")`? Note the difference when printing meta-characters. ] ```r slash_vec <- c("a\\backslash", "nobackslash", "slash","\n") print(slash_vec) ``` ``` [1] "a\\backslash" "nobackslash" "slash" "\n" ``` ```r cat(slash_vec) ``` ``` a\backslash nobackslash slash ``` ```r str_subset(slash_vec, '\\') ``` ``` Error in stri_subset_regex(string, pattern, omit_na = TRUE, negate = negate, : Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE, context=`\`) ``` ] .pull-right[ ### Use more backslashes!!! Our string must contain 4 backslashes! ```r str_subset(slash_vec, '\\\\') ``` ``` [1] "a\\backslash" ``` ] --- # Search and replace .pull-left[ ### `str_replace()` ```r mirna <- c("dme-bantam", "dme-let-7", "dme-mir-1", "dme-mir-2a-1", "mmu-let-7f-2") str_replace(mirna, '-', '|') ``` ``` [1] "dme|bantam" "dme|let-7" "dme|mir-1" "dme|mir-2a-1" "mmu|let-7f-2" ``` Only the first match is replaced! ] -- .pull-right[ ### `str_replace_all()` ```r str_replace_all(mirna, '-', '|') ``` ``` [1] "dme|bantam" "dme|let|7" "dme|mir|1" "dme|mir|2a|1" "mmu|let|7f|2" ``` ] --- # Backreferences ### Group matches `\1`, `\2` and so forth refer to groups matched with `()`. 
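
Backreferences can also be used inside the pattern itself. A minimal sketch, assuming `my_words` from the earlier slides, that finds words with a doubled character:

```r
library(stringr)

# Same vector as on the earlier slides
my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# "(.)" captures any single character, "\\1" requires the same character again
doubled <- str_subset(my_words, "(.)\\1")
doubled
# "carrot"
```
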
# Constructing new strings from regular expression matches

```r
uniprot <- c("Q6QU88_CALBL", "CO1A2_HUMAN", "SAMH1_HUMAN", "NPRL2_DROME")
str_replace(uniprot, '(\\S+)_(\\S+)', "\\2: \\1")
```

```
[1] "CALBL: Q6QU88" "HUMAN: CO1A2"  "HUMAN: SAMH1"  "DROME: NPRL2"
```

---

# Helpers

.pull-left[
### `regexplain`

Simple [addin for RStudio](https://github.com/gadenbuie/regexplain) by [Garrick Aden-Buie](https://github.com/gadenbuie/)

* Test regular expressions on the fly
* Reference library
* Cheatsheet
* Test it [live](https://www.garrickadenbuie.com/project/regexplain/)

```r
devtools::install_github("gadenbuie/regexplain")
regexplain::regexplain_gadget()
```
]

.pull-right[
]

---

# Constructing strings with `glue` and `stringr`

.w-70.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.ml5[
.large[.ybox[`glue` is not loaded with `library(tidyverse)`]]

```r
library(glue)
```

```
Attaching package: 'glue'
```

```
The following object is masked from 'package:dplyr':

    collapse
```
]

---

# Joining strings

.w-100.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[
.ybox[Base R and `glue` ]

There are several ways to join strings, even within the tidyverse. This is the `stringr` way; the same options appear in other functions too, e.g. `paste()`.
]

.pull-left[
### Concatenation

```r
str_c(my_words, collapse = "|")
```

```
[1] "cat|cart|carrot|catastrophe|dog|rat|bet"
```

### Vectorization of concatenation

```r
str_c(my_words, my_words, sep = ": ")
```

```
[1] "cat: cat"                 "cart: cart"
[3] "carrot: carrot"           "catastrophe: catastrophe"
[5] "dog: dog"                 "rat: rat"
[7] "bet: bet"
```
]

--

.pull-right[
### Padding

```r
str_pad(my_words, 12)
```

```
[1] "         cat" "        cart" "      carrot" " catastrophe" "         dog"
[6] "         rat" "         bet"
```

### Truncating

```r
str_trunc(c("anachronism", "antebellum", "antithesis"), 6)
```

```
[1] "ana..." "ant..." "ant..."
```
]

---

# Examples

### Functions of `glue`

* Format complex character objects with an easy syntax
* Concatenate strings with useful whitespace treatment
* Made to work with `%>%`

```r
year_pub <- 1881
book <- "The Formation of Vegetable Mould through the Action of Worms"
author <- "Charles Darwin"
glue("The author {author}",
     ' also wrote "{book}"', # note the use of single quotes to escape double quotes
     " in {year_pub}.")
```

```
The author Charles Darwin also wrote "The Formation of Vegetable Mould through the Action of Worms" in 1881.
```

---
class: hide_logo

# Collapsing

```r
glue::glue_collapse(letters, '&')
```

```
a&b&c&d&e&f&g&h&i&j&k&l&m&n&o&p&q&r&s&t&u&v&w&x&y&z
```

# Vectorised operation utilizing pipes, `glue_data()`

```r
head(patient)
```

```
# A tibble: 6 × 2
  subject_id gender_age
       <int> <chr>
1       1001 m-34
2       1002 f-24
3       1003 m-53
4       1004 f-44
5       1005 m-24
6       1006 f-30
```

```r
# with pipes and non-standard evaluation
sample_n(patient, 5) %>%
  separate(gender_age, into = c("gender", "age")) %>%
  glue_data("This case is {age} years old and reports gender as '{gender}'. ")
```

```
This case is 30 years old and reports gender as 'f'.
This case is 53 years old and reports gender as 'm'.
This case is 34 years old and reports gender as 'm'.
This case is 44 years old and reports gender as 'f'.
This case is 24 years old and reports gender as 'f'.
```

---

## Before we stop

.flex[
.w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.ml1[
.large[.gbox[Resources]

* `stringi` -- General implementation of regular expressions
* `stringr` -- Wrapper for vectorisation and convenience functions
* `glue` -- formatting complex strings
]]
.w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.ml2[
.large[.bbox[Acknowledgments 🙏 👏]]

- Charlotte Wickham
- Hadley Wickham
- Marek Gagolewski (Author of `stringi` implementation)
]]

.flex[
.w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.ml1[
.large[.ybox[Further reading
]] * [Strings in R for Data Science](http://r4ds.had.co.nz/strings.html) ] .w-50.pv2.ph3.mt2.ml1[ .huge[.bbox[Thank you for your attention!]] ] ] --- # Try it all yourself! - Select the practical "String manipulation" for this lecture.