class: title-slide # Functional programming ## with `purrr` .center[<img src="img/00/logo_purrr.png" width="100px"/>] ### A. Ginolhac | rworkshop | 2021-09-10
--- # Learning objectives .flex[ .w-60.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.ml6[ .large[.gbox[You will learn]] .float-img[] .Large[ - Functional programming approach to focus on `actions` - Iteration machinery done by someone else - Pass `functions` as `arguments` to higher order functions - Use `map()` to replace `for` loops ]]] --- class: middle, inverse, center # Reminders --- # Vectors .left-column[ **Atomic** means only one type of data - The type of each atom is the .bold.Large[same] - The size of each atom is .bold.Large[1] (single element) - is the .bold[conversion] between types - can be + **explicit** (using `as.*()` functions) + **implicit** ] .right-column[ ```r # Logical c(TRUE, FALSE, TRUE) ``` ``` [1] TRUE FALSE TRUE ``` ```r # double c(1, 5, 7) ``` ``` [1] 1 5 7 ``` ```r # automatic coercion str(c(1, 5, 7, "seven")) ``` ``` chr [1:4] "1" "5" "7" "seven" ``` ```r # volontary coercion as.character(c(1, 5, 7, "seven")) ``` ``` [1] "1" "5" "7" "seven" ``` ] .footnote[Adapted from the [tutorial](https://jennybc.github.io/purrr-tutorial/ls03_map-function-syntax.html) of .bold[Jennifer Bryan]] --- # Lists as trains .pull-left[ #### The steam machine is the `list()` structure   ] -- .pull-right[ #### Sub-selection  ] .footnote[from [Advanced R, second Ed.](https://adv-r.hadley.nz/) by .bold[Hadley Wickham]] --- # Actions: functions .pull-left[ ### Declared function ```r my_function <- function(my_argument) { my_argument + 1 } ``` - Is defined in the global environment ```r ls() ``` ``` [1] "my_function" ``` ```r ls.str() ``` ``` my_function : function (my_argument) ``` - Is reusable ```r my_function(2) ``` ``` [1] 3 ``` ] -- .pull-right[ ### Anonymous functions - Are not stored in an object and are used "on the fly" ```r (function(x) { x + 2 })(2) # (\(x) x + 2)(2) ``` ``` [1] 4 ``` - Does not alter the global environment ```r ls() ``` ``` [1] "my_function" ``` ```r # remove the previous my_function to convince you rm(my_function) (function(x) { x + 1 })(2) ``` ``` [1] 3 ``` ```r ls() ``` ``` character(0) ``` ] --- class: middle, center, inverse  .flex[ .w-70.ml6[ ### .large[purrr] enhances R with consistent tools for working with functions and vectors .large[ > .Large[Functional programming] [...] treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data .tr[ — _Wikipedia_] ]]] --- # Function, the LEGO figures example #### Consider a hypothetic `put_on` function .flex.justify-center[ .w-25.mh5.mt3.ml3[  ] .w-25.mh2.ph3.mt5[ .huge.bold[+] ] .w-25.mh5.mt3[  ] .w-25.mh2.mt5[ .huge.bold[=] ] .w-25.mh5.mt3.mr3[  ] ] -- .center.Large[`put_on(figures, antenna)` returns a LEGO figure with antenna] .footnote[Figures LEGO pictures courtesy by .bold[Jennifer Bryan]] --- # Iteration, the LEGO figures example #### Illustration for .bold[several] legos, how to apply `put_on()` to more than 1 input? .flex.justify-center[ .w-40.mh5.mt3.ml3.mt4[  ] .w-25.mh2.ph3.mt5[ .huge.bold[+] ] .w-25.mh5.mt4[  ]] -- .flex.justify-center[ .w-40.mh2.mt3.ml5[ .Large[`put_on(figures, antenna)`] ] .w-15.mh2.mt3[ .huge.bold[=] ] .w-30.mh5.mr5[  ]] .footnote[[LEGO figure pictures from Jennifer Bryan](https://github.com/Jenniferbc/lego-rstats)] --- # Back to R programming .pull-left[ ### For loop approach ```r out <- vector("list", length(legos)) for (i in seq_along(legos)) { * out[[i]] <- put_on(legos[[i]], antenna) } out ``` ] .pull-right[ ### Functional programming approach - named function ```r antennate <- function(x) put_on(x, antenna) map(figures, antennate) ``` - anonymous function ```r map(figures, function(x) put_on(x, antenna)) ``` ] -- .flex[ .w40.ml4.mr6[ .huge[ >Of course, someone has to write loops. It doesn't have to be you. .tr[ — _Jenny Bryan_]] ]] --- class: slide-practical # Your turn
04
:
00
.flex[ .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.ml2[ .large[.gbox[Questions
]] .large[ Calculate the `mean` of each column of the `swiss` dataset, which is packaged with base `R`. ] ] .w-50.bg-washed-blue.b--blue.ba.bw2.br3.shadow-5.ph3.mt3.ml2[ .large[.ybox[Tips]] - `purrr::map()` expects 2 arguments: 1. a `list` 1. a `function` - a data frame is a list - Each column represents an element of the list _i.e._ a **data frame is a list of columns** ]] --- # Answer .pull-left[ ### Functional programming focuses on actions ```r map(swiss, mean) %>% str() ``` ``` List of 6 $ Fertility : num 70.1 $ Agriculture : num 50.7 $ Examination : num 16.5 $ Education : num 11 $ Catholic : num 41.1 $ Infant.Mortality: num 19.9 ``` ] -- .pull-right[ ### The for loop machinery ```r means <- vector("list", ncol(swiss)) for (i in seq_along(swiss)) { * means[i] <- mean(swiss[[i]]) } # need to manually add names names(means) <- names(swiss) means %>% str() ``` ``` List of 6 $ Fertility : num 70.1 $ Agriculture : num 50.7 $ Examination : num 16.5 $ Education : num 11 $ Catholic : num 41.1 $ Infant.Mortality: num 19.9 ``` ] --- class: nvs1 # For loops are fine .flex[ .w-50.bg-washed-red.b--red.ba.bw1.br3.shadow-5.ph3.mt1.mr1[ .large[.rbox[Growing vector]] ```r for_loop <- function(x) { res <- c() # initialize an empty vector for (i in seq_len(x)) { res[i] <- i } } ``` ] .w-50.bg-washed-green.b--green.ba.bw1.br3.shadow-5.ph3.mt1.mr1[ .large[.gbox[Correct memory allocation]] ```r for_loop <- function(x) { res <- vector(mode = "integer", length = x) for (i in seq_len(x)) { res[i] <- i } } ``` ]] -- .flex[ .w-50.bg-washed-yellow.b--yellow.ba.bw1.br3.shadow-5.ph3.mt1.mr1[ .large[.gbox[Using `Rcpp`]] ```r library(Rcpp) # binds cpp compilation to R cppFunction("NumericVector rcpp(int x) { NumericVector res(x); for (int i=0; i < x; i++) { res[i] = i; } }") ``` ] .w-50.ph3.mt1.mr1[  ]] --- # The `purrr::map()` family of functions .pull-left[ - Are designed to be consistent - `map()` is the general function and close to `base::lapply()` - `map()` introduces shortcuts (absent in `lapply()`) - Variants to specify the type of vectorized output: + `map_lgl()` + `map_int()` + `map_dbl()` + `map_chr()` + `map_dfr()` data.frame rows + `map_dfc()` data.frame cols - .red.bold[Fail] if coercion is impossible ] -- .pull-right[ ```r map_dbl(swiss, mean) ``` ``` Fertility Agriculture Examination Education 70.14255 50.65957 16.48936 10.97872 Catholic Infant.Mortality 41.14383 19.94255 ``` ```r map_chr(swiss, mean) ``` ``` Fertility Agriculture Examination Education "70.142553" "50.659574" "16.489362" "10.978723" Catholic Infant.Mortality "41.143830" "19.942553" ``` ```r map_int(swiss, mean) ``` ``` Error: Can't coerce element 1 from a double to a integer ``` ] --- class: hide_logo # Linear modelling example ### Palmer penguins from previous lecture .pull-left[ ```r library(palmerpenguins) ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, colour = species)) + geom_point() + geom_smooth(method = "lm", formula = 'y ~ x', # no standard error ribbon se = FALSE) + facet_grid(island ~ .) ``` <img src="lecture07_purrr_files/figure-html/unnamed-chunk-18-1.png" width="432" /> ] -- .pull-right[ .huge[How to perform those 5 linear model?] .tiny[From [R for Data Science](http://r4ds.had.co.nz/iteration.html#the-map-functions)] .footnote[`palmerpenguins` from _Horst AM, Hill AP, Gorman KB (2020)_] ] --- # Reminder: fit a linear model - Using the `pinguins` dataset we can fit a linear model to explain the `bill depth` by the `bill length` using: .pull-left[ ```r lm(bill_depth_mm ~ bill_length_mm, data = penguins) ``` ``` Call: lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins) Coefficients: (Intercept) bill_length_mm 20.88547 -0.08502 ``` ] .pull-right[  ] --- # Reminder: `lm` outputs complex objects .left-column[ #### Summarise a linear model with `base::summary()` - `\(r^2\)` is low (`0.05525`) because we mix .bold[all] individuals. - We need .bold[one] tibble .bold.large[per] group ] .right-column[ ```r summary(lm(bill_depth_mm ~ bill_length_mm, data = penguins)) ``` ``` Call: lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins) Residuals: Min 1Q Median 3Q Max -4.1381 -1.4263 0.0164 1.3841 4.5255 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 20.88547 0.84388 24.749 < 2e-16 *** bill_length_mm -0.08502 0.01907 -4.459 1.12e-05 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 1.922 on 340 degrees of freedom (2 observations deleted due to missingness) Multiple R-squared: 0.05525, Adjusted R-squared: 0.05247 F-statistic: 19.88 on 1 and 340 DF, p-value: 1.12e-05 ``` ] --- # Split the penguin data .pull-left[ ```r # split a data.frame by group, returns a list split_peng <- group_split(palmerpenguins::penguins, island, species) # dimensions of each tibble # returns a list map(split_peng, dim) # How many observations per group? # coerce to a double map_dbl(split_peng, nrow) ``` ] .pull-right[ ``` [[1]] [1] 44 8 [[2]] [1] 124 8 [[3]] [1] 56 8 [[4]] [1] 68 8 [[5]] [1] 52 8 ``` ``` [1] 44 124 56 68 52 ``` ] -- .center.Large[Coercion from a list to a numeric vector is required by the user.] --- # Map the linear model - `map(YOUR_LIST, YOUR_FUNCTION)` - `YOUR_LIST` = `spl_mtcars` - `YOUR_FUNCTION` can be an anonymous function (declared on the fly) .pull-left[ ```r map(split_peng, function(x) { lm(bill_depth_mm ~ bill_length_mm, data = x) }) ``` .large[ Curly braces can be used for: - Several code lines - Wrap lines to improve readability ] ] -- .pull-right[ ``` [[1]] Call: lm(formula = bill_depth_mm ~ bill_length_mm, data = x) Coefficients: (Intercept) bill_length_mm 9.6263 0.2244 [[2]] Call: lm(formula = bill_depth_mm ~ bill_length_mm, data = x) Coefficients: (Intercept) bill_length_mm 5.2510 0.2048 [[3]] Call: lm(formula = bill_depth_mm ~ bill_length_mm, data = x) Coefficients: (Intercept) bill_length_mm 9.2607 0.2335 [[4]] Call: lm(formula = bill_depth_mm ~ bill_length_mm, data = x) Coefficients: (Intercept) bill_length_mm 7.5691 0.2222 [[5]] Call: lm(formula = bill_depth_mm ~ bill_length_mm, data = x) Coefficients: (Intercept) bill_length_mm 14.1359 0.1102 ``` ] --- # To extract `\(r^2\)` .pull-left[ `base::summary()` generates a list ```r lm_all <- summary(lm(bill_depth_mm ~ bill_length_mm, data = penguins)) str(lm_all, max.level = 1, give.attr = FALSE) ``` ``` List of 12 $ call : language lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins) $ terms :Classes 'terms', 'formula' language bill_depth_mm ~ bill_length_mm $ residuals : Named num [1:342] 1.139 -0.127 0.541 1.535 3.056 ... $ coefficients : num [1:2, 1:4] 20.8855 -0.085 0.8439 0.0191 24.7492 ... $ aliased : Named logi [1:2] FALSE FALSE $ sigma : num 1.92 $ df : int [1:3] 2 340 2 $ r.squared : num 0.0552 $ adj.r.squared: num 0.0525 $ fstatistic : Named num [1:3] 19.9 1 340 $ cov.unscaled : num [1:2, 1:2] 1.93e-01 -4.32e-03 -4.32e-03 9.84e-05 $ na.action : 'omit' Named int [1:2] 4 272 ``` ] .pull-right[ ### base R or `purrr` list elements extraction ```r lm_all$r.squared ``` ``` [1] 0.05524985 ``` ```r lm_all[["r.squared"]] ``` ``` [1] 0.05524985 ``` ```r pluck(lm_all, "r.squared") ``` ``` [1] 0.05524985 ``` ] --- # Extract `\(r^2\)` for all groups .pull-left[ ```r split_peng %>% map(function(x) lm(bill_depth_mm ~ bill_length_mm, data = x)) %>% map(summary) %>% map(function(x) pluck(x, "r.squared")) ``` ] -- .pull-left[ ``` [[1]] [1] 0.2192052 [[2]] [1] 0.4139429 [[3]] [1] 0.2579242 [[4]] [1] 0.4271096 [[5]] [1] 0.06198376 ``` ] --- # `purrr::map()` shortcuts I ## Anonymous functions - One sided formula create anonymous functions: + Define the function **using `~`** (*) + Use the **placeholder `.x`** to refer to the current list element (`.x` represents the argument of the anonymous function) .pull-left[ ### Initial code ```r split_peng %>% map(function(x) lm(bill_depth_mm ~ bill_length_mm, data = x)) %>% map(summary) %>% map(function(x) pluck(x, "r.squared")) ``` ] -- .pull-right[ ### With anonymous function shortcuts (.large[`~`]) ```r split_peng %>% * map(~ lm(bill_depth_mm ~ bill_length_mm, data = .x)) %>% map(summary) %>% * map(~ pluck(.x, "r.squared")) ``` ] .footnote[*: R base next version `4.1` is likely to get a shortcut too for anonymous functions ([Luke Tierney, useR!2020](https://youtu.be/X_eDHNVceCU?t=4048))] --- # `purrr::map()` shortcuts II .pull-left.large[ ### With shortcuts I ```r split_peng %>% map(~ lm(bill_depth_mm ~ bill_length_mm, data = .x)) %>% map(summary) %>% map(~ pluck(.x, "r.squared")) ``` ] -- .pull-right.large[ ### With shortcuts II ```r split_peng %>% map(~ lm(bill_depth_mm ~ bill_length_mm, data = .x)) %>% map(summary) %>% * map("r.squared") ``` ] --- # Obtain a vector .pull-left[ ### With shortcuts II ```r split_peng %>% map(~ lm(bill_depth_mm ~ bill_length_mm, data = .x)) %>% map(summary) %>% map("r.squared") ``` ] -- .pull-right[ ### With double coercion ```r split_peng %>% map(~ lm(bill_depth_mm ~ bill_length_mm, data = .x)) %>% map(summary) %>% * map_dbl("r.squared") ``` ``` [1] 0.21920517 0.41394290 0.25792423 0.42710958 0.06198376 ``` ] --- # Lists as a column in a tibble .left-column[ ### Example  .footnote[Picture by .bold[Jennifer Bryan]] ] .right-column[ ```r tibble(numbers = 1:8, my_list = list(a = c("a", "b"), b = 2.56, c = c("a", "b", "c"), d = rep(TRUE, 4), d = 2:3, e = 4:6, f = FALSE, g = c(1, 4, 5, 6))) ``` ``` # A tibble: 8 × 2 numbers my_list <int> <named list> 1 1 <chr [2]> 2 2 <dbl [1]> 3 3 <chr [3]> 4 4 <lgl [4]> 5 5 <int [2]> 6 6 <int [3]> 7 7 <lgl [1]> 8 8 <dbl [4]> ``` ] --- # Rewriting our previous example ### Nesting the `tibble` by `island` and `species` .pull-left[ ```r penguins %>% group_by(island, species) %>% tidyr::nest() ``` ] .pull-right[ ``` # A tibble: 5 × 3 # Groups: species, island [5] species island data <fct> <fct> <list> 1 Adelie Torgersen <tibble [52 × 6]> 2 Adelie Biscoe <tibble [44 × 6]> 3 Adelie Dream <tibble [56 × 6]> 4 Gentoo Biscoe <tibble [124 × 6]> 5 Chinstrap Dream <tibble [68 × 6]> ``` ] --- # Rewriting our previous example ### With modelling using `mutate` and `map` .pull-left[ ```r penguins %>% group_by(island, species) %>% tidyr::nest() %>% mutate(model = map(data, ~ lm(bill_depth_mm ~ bill_length_mm, data = .x)), summary = map(model, summary), r_squared = map_dbl(summary, "r.squared")) ``` .large[ - Very powerful - Data rectangle - Next lecture will show you how `dplyr`, `tidyr`, `tibble`, `purrr` and `broom` nicely work together ] ] .pull-right[ ``` # A tibble: 5 × 6 # Groups: species, island [5] species island data model summary r_squared <fct> <fct> <list> <list> <list> <dbl> 1 Adelie Torgersen <tibble [52 × 6]> <lm> <smmry.lm> 0.0620 2 Adelie Biscoe <tibble [44 × 6]> <lm> <smmry.lm> 0.219 3 Adelie Dream <tibble [56 × 6]> <lm> <smmry.lm> 0.258 4 Gentoo Biscoe <tibble [124 × 6]> <lm> <smmry.lm> 0.414 5 Chinstrap Dream <tibble [68 × 6]> <lm> <smmry.lm> 0.427 ``` ] --- # Without `map` and `dplyr` 1.0 ### With modelling using `mutate` and `list` .pull-left[ ```r penguins %>% nest_by(island, species) %>% mutate(model = list(lm(bill_depth_mm ~ bill_length_mm, data = data)), summary = list(summary(model)), r_squared = pluck(summary, "r.squared")) ``` - Might be easier for some people - `rowwise` integrated so no `map` - `summarise` .bold[v1.0] handles more than one ouput per group ] .pull-right[ ``` # A tibble: 5 × 6 # Rowwise: island, species island species data model summary r_squared <fct> <fct> <list<tibble[,6]>> <list> <list> <dbl> 1 Biscoe Adelie [44 × 6] <lm> <smmry.lm> 0.219 2 Biscoe Gentoo [124 × 6] <lm> <smmry.lm> 0.414 3 Dream Adelie [56 × 6] <lm> <smmry.lm> 0.258 4 Dream Chinstrap [68 × 6] <lm> <smmry.lm> 0.427 5 Torgersen Adelie [52 × 6] <lm> <smmry.lm> 0.0620 ``` ] --- # Wrap up .flex[ .w-40.ml1.mr1[ .large[.ybox[`map()` or `walk()`]]  ] .w-40.ml2[ .large[.ybox[`map2()` or `walk2()`]]  ]] .flex[ .w-40.ml1.mr1[ .large[.ybox[`pmap()`]]  ] .w-40[ .footnote[Pictures from [Lise Vaudor's blog](http://perso.ens-lyon.fr/lise.vaudor/iterer-des-fonctions-avec-purrr/)] ]] --- # Don't forget vectorisation ## .red[Warning] .float-img[] Don't overmap functions! Use `map` only if required (non vectorised function) .pull-left[ ```r nums <- sample(1:10, size = 1000, replace = TRUE) log_vec <- log(nums) log_map <- map_dbl(nums, log) identical(log_vec, log_map) bench::mark( vectorised = log(nums), mapped = map_dbl(nums, log)) %>% autoplot() ``` ] .pull-right[ ``` [1] TRUE ``` <img src="lecture07_purrr_files/figure-html/benchmark-1.png" width="288" /> ] --- # Before we stop .flex[ .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.ml1[ .large[.gbox[You learned to:] - Functional programming: focus on `actions` - `for` loops are fine, but don't write them - Pass `functions` as `arguments` - Apprehend nested tibbles with `list-columns` ]] .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.ml2[ .large[.bbox[Acknowledgments 🙏 👏]] * Eric Koncina for writing the initial content * Jennifer Bryan ([LEGO pictures](https://github.com/Jenniferbc/lego-rstats), courtesy CC licence) * Hadley Wickham * Lise Vaudor * Ian Lyttle * Jim Hester ]] .flex[ .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt1.ml1[ .large[.ybox[Further reading
]] - Jennifer Bryan - [lessons & tutorial](https://Jenniferbc.github.io/purrr-tutorial/) - Hadley Wickham - [R for data science](http://r4ds.had.co.nz) ([iteration](http://r4ds.had.co.nz/iteration.html), [many models](http://r4ds.had.co.nz/many-models.html)) - Ian Lyttle - [purrr applied for engineering](http://ijlyttle.github.io/isugg_purrr/presentation.html#%281%29) - Robert Rudis - [purrr, comparison with base](https://rud.is/b/2016/07/26/use-quick-formula-functions-in-purrrmap-base-vs-tidtyverse-idiom-comparisonsexamples/) - Rstudio's blog - [purrr 0.2 release](https://blog.rstudio.org/2016/01/06/purrr-0-2-0/) [purrr 0.3 release](https://www.tidyverse.org/articles/2019/02/purrr-0-3-0/) - Kris Jenkins - What is Functional Programming? ([Blog version](http://blog.jenkster.com/2015/12/what-is-functional-programming.html) and [Talk video](https://www.youtube.com/watch?v=tQRtTSIpye4)) - Lise Vaudor - [R-atique](http://perso.ens-lyon.fr/lise.vaudor/) (_in french_) ] .w-50.pv2.ph3.mt3.ml1[ .huge[.bbox[Thank you for your attention!]] ] ]