Functional Programming

with purrr

Roland Krause

Rworkshop

Thursday, 8 February 2024

Learning objectives

Learning objectives

  • Functional programming approach to focus on actions
  • Iteration machinery written by someone else
  • Pass functions as arguments to higher order functions
  • Use map() to replace remaining for loops
  • Working with lists
  • Nesting

Motivation

  • Building multiple models from tidy data (purrr)

  • Working with multiple models (broom, next lecture)

Reminders for R

Vectors

Atomic vectors

Atomic means only one type of data

The type of each atom is the same. The size of each atom is one single element.

The conversion between types can be + Explicit (using as.*() functions) + or Implicit

# Logical
c(TRUE, FALSE, TRUE)
[1]  TRUE FALSE  TRUE
# double
c(1, 5, 7)
[1] 1 5 7
# automatic coercion
str(c(1, 5, 7, "seven"))
 chr [1:4] "1" "5" "7" "seven"
# voluntary coercion
as.character(c(1, 5, 7, "seven"))
[1] "1"     "5"     "7"     "seven"

The tidyverse vctrs prevent any automatic coercion

Lists

Lists as trains

The steam machine is the list() structure

Sub-selection

Lists are cool

Can contain anything, named or not

list(head(women, 3), 
     sw = head(swiss, 3), c("ab", "c"))
[[1]]
  height weight
1     58    115
2     59    117
3     60    120

$sw
             Fertility Agriculture Examination Education Catholic
Courtelary        80.2        17.0          15        12     9.96
Delemont          83.1        45.1           6         9    84.84
Franches-Mnt      92.5        39.7           5         5    93.40
             Infant.Mortality
Courtelary               22.2
Delemont                 22.2
Franches-Mnt             20.2

[[3]]
[1] "ab" "c" 

Even empty slot

list(head(women, 3), NULL, 
     sw = head(swiss, 3), c("ab", "c"))
[[1]]
  height weight
1     58    115
2     59    117
3     60    120

[[2]]
NULL

$sw
             Fertility Agriculture Examination Education Catholic
Courtelary        80.2        17.0          15        12     9.96
Delemont          83.1        45.1           6         9    84.84
Franches-Mnt      92.5        39.7           5         5    93.40
             Infant.Mortality
Courtelary               22.2
Delemont                 22.2
Franches-Mnt             20.2

[[4]]
[1] "ab" "c" 

Lists inside rectangle data

Rectangle, tibble/data.frame fits our mindset

2D representation of data we see everywhere

cms <- sample_n(tidyr::cms_patient_care, size = 4)
cms
# A tibble: 4 × 5
  ccn    facility_name                          measure_abbr      score type    
  <chr>  <chr>                                  <chr>             <dbl> <chr>   
1 011500 BAPTIST HOSPICE                        composite_process   202 denomin…
2 011514 ST VINCENT'S HOSPICE                   pain_screening      210 denomin…
3 011502 COMFORT CARE COASTAL HOSPICE - BALDWIN pain_screening      295 denomin…
4 011517 HOSPICE OF WEST ALABAMA                dyspnea_screening   400 denomin…
  • Are columns with more complex structures useful?
  • Can we not model everything flat?
  • How to store some JSON?

Use list-columns

cms |> 
  mutate(metadata = list(
    women,
    vec = c("ab", "c"),
    lm(height ~ weight, data = women),
    NULL
  ), .after = ccn)
# A tibble: 4 × 6
  ccn    metadata      facility_name                    measure_abbr score type 
  <chr>  <named list>  <chr>                            <chr>        <dbl> <chr>
1 011500 <df [15 × 2]> BAPTIST HOSPICE                  composite_p…   202 deno…
2 011514 <chr [2]>     ST VINCENT'S HOSPICE             pain_screen…   210 deno…
3 011502 <lm>          COMFORT CARE COASTAL HOSPICE - … pain_screen…   295 deno…
4 011517 <NULL>        HOSPICE OF WEST ALABAMA          dyspnea_scr…   400 deno…
  • However, how can we iterate along list-columns?

Functions as actions

Functions declared …

my_function <- function(my_argument) {
  my_argument + 1
}

in the global environment

    ls()
[1] "my_function"
    ls.str()
my_function : function (my_argument)  

are reusable.

    my_function(2)
[1] 3

Anonymous functions …

are not stored in an object and are used “on the fly” and

(function(x) { x + 2 })(2) # 
[1] 4

do not alter the global environment.

 ls()
[1] "my_function"
# remove the previous my_function to convince you
rm(my_function)
(\(x) x + 2)(2)
[1] 4
ls()
character(0)

purrr

purrr enhances R with consistent tools for working with functions, lists and vectors.

Functional programming […] treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data

Wikipedia

Function, the LEGO figures example

Consider a hypothetic put_on function

legos

antenna

put_on(figures, antenna) returns a LEGO figure with antenna

lego with antenna

Iteration with LEGO figures

Illustration for several lego

How to apply put_on() to more than 1 input?

legos

antenna

put_on(figures, antenna)

figures with antennas

Functional programming

Back to R programming

For loop approach

out <- vector("list", length(legos))
for (i in seq_along(legos)) {
  out[[i]] <- put_on(legos[[i]], antenna) #<<
}
out

Functional programming approach

  • Named function
antennate <- function(x) put_on(x, antenna)
map(figures, antennate)
  • Anonymous function
map(figures, \(x) put_on(x, antenna))

[Of course, someone has to write loops. It doesn’t have to be you.

Jenny Bryan]{.center}

Your turn

04:00

Questions

Calculate the mean of each column of the swiss dataset, which is packaged with base R.

Tips

  • purrr::map() expects 2 arguments:
    1. a list
    2. a function
  • A data.frame is a list
  • Each column represents an element of the list i.e. a data frame is a list of vectors

Answer

Functional programming focuses on actions

map(swiss, mean) |>
  str()
List of 6
 $ Fertility       : num 70.1
 $ Agriculture     : num 50.7
 $ Examination     : num 16.5
 $ Education       : num 11
 $ Catholic        : num 41.1
 $ Infant.Mortality: num 19.9

The for loop machinery

means <- vector("list", ncol(swiss))
for (i in seq_along(swiss)) {
  means[i] <- mean(swiss[[i]]) #<<
}
# Need to manually add names
names(means) <- names(swiss)
means |>
  str()
List of 6
 $ Fertility       : num 70.1
 $ Agriculture     : num 50.7
 $ Examination     : num 16.5
 $ Education       : num 11
 $ Catholic        : num 41.1
 $ Infant.Mortality: num 19.9

The purrr::map() family of functions

  • Are designed to be consistent
  • map() is the general function and close to base::lapply()
  • map() introduces shortcuts (absent in lapply())
  • Variants to specify the type of vectorized output:
    • map_lgl()
    • map_int()
    • map_dbl()
    • map_chr()
    • map_dfr() data.frame rows
    • map_dfc() data.frame cols
  • Fail if coercion is impossible
map_dbl(swiss, mean)
       Fertility      Agriculture      Examination        Education 
        70.14255         50.65957         16.48936         10.97872 
        Catholic Infant.Mortality 
        41.14383         19.94255 
map_chr(swiss, mean)
       Fertility      Agriculture      Examination        Education 
     "70.142553"      "50.659574"      "16.489362"      "10.978723" 
        Catholic Infant.Mortality 
     "41.143830"      "19.942553" 
map_int(swiss, mean)
Error in `map_int()`:
ℹ In index: 1.
ℹ With name: Fertility.
Caused by error:
! Can't coerce from a number to an integer.

Iterating on 2 lists in parallel

legos

enhair <- function(x, y) put_on(x, y)
map2(legos, hairs, enhair)

map2 example and pmap

marks <- list(report = c(14, 13),
              practical = c(10, 12),
              theoritical = c(17, 8))
marks
$report
[1] 14 13

$practical
[1] 10 12

$theoritical
[1] 17  8
coef_basv <- c(2, 0.5, 1.5)

map2(marks, coef_basv,  \(x, y) x * y) #<<
$report
[1] 28 26

$practical
[1] 5 6

$theoritical
[1] 25.5 12.0

Iterate on list, but vectorized on atomic vectors

Generate grid of values for named arguments of a desired function

expand_grid(n = c(3, 6),
            min = 1,
            max = c(4, 8)) -> unif_grid
unif_grid
# A tibble: 4 × 3
      n   min   max
  <dbl> <dbl> <dbl>
1     3     1     4
2     3     1     8
3     6     1     4
4     6     1     8
pmap(unif_grid, runif)
[[1]]
[1] 2.610502 2.574514 2.522395

[[2]]
[1] 7.210280 5.271757 5.246017

[[3]]
[1] 1.549327 1.237971 3.628081 2.490355 1.289043 1.467250

[[4]]
[1] 3.533091 6.507228 2.663833 5.503842 1.488231 7.394657

Linear modelling example

Palmer penguins from previous lecture

library(palmerpenguins)
ggplot(penguins,
       aes(x = bill_length_mm,
           y = bill_depth_mm,
           colour = species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = 'y ~ x',
      # no standard error ribbon
              se = FALSE) +
  facet_grid(island ~ .) 

How to perform those 5 linear model?

From R for Data Science

Reminder: Fitting a linear model

Using the pinguins dataset we can fit a linear model to explain the bill depth by the bill_length_mm.

lm(bill_depth_mm ~ bill_length_mm, data = penguins)

Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)

Coefficients:
   (Intercept)  bill_length_mm  
      20.88547        -0.08502  

Reminder: lm outputs complex objects

Summarise a linear model with base::summary()

  • \(R^2\) is low (0.05525) because we mix all individuals.
  • We need one tibble per group
summary(lm(bill_depth_mm ~ bill_length_mm, data = penguins))

Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1381 -1.4263  0.0164  1.3841  4.5255 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    20.88547    0.84388  24.749  < 2e-16 ***
bill_length_mm -0.08502    0.01907  -4.459 1.12e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.922 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.05525,   Adjusted R-squared:  0.05247 
F-statistic: 19.88 on 1 and 340 DF,  p-value: 1.12e-05

Split the penguin data

Split a data.frame by group into a list of tibble

split_peng <- group_split(palmerpenguins::penguins, 
                          island, 
                          species)

Get the dimensions of each tibble in the list

map(split_peng, dim)
[[1]]
[1] 44  8

[[2]]
[1] 124   8

[[3]]
[1] 56  8

[[4]]
[1] 68  8

[[5]]
[1] 52  8

Coercion

  • How many observations per group?
  • Coerce to a double vector
map_dbl(split_peng, nrow)
[1]  44 124  56  68  52

Map the linear model

General syntax: map(list, function)

  • list split_peng
  • function can be an anonymous function

Curly braces are optional but helpful

  • Several code lines
  • Wrap lines to improve readability
map(split_peng, \(x) {
  lm(bill_depth_mm ~ bill_length_mm, 
     data = x)
})
[[1]]

Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)

Coefficients:
   (Intercept)  bill_length_mm  
        9.6263          0.2244  


[[2]]

Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)

Coefficients:
   (Intercept)  bill_length_mm  
        5.2510          0.2048  


[[3]]

Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)

Coefficients:
   (Intercept)  bill_length_mm  
        9.2607          0.2335  


[[4]]

Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)

Coefficients:
   (Intercept)  bill_length_mm  
        7.5691          0.2222  


[[5]]

Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)

Coefficients:
   (Intercept)  bill_length_mm  
       14.1359          0.1102  

To extract \(R^2\)

base::summary() generates a list

lm_all <- summary(lm(bill_depth_mm ~ bill_length_mm,
                     data = penguins))
str(lm_all, max.level = 1, give.attr = FALSE)
List of 12
 $ call         : language lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)
 $ terms        :Classes 'terms', 'formula'  language bill_depth_mm ~ bill_length_mm
 $ residuals    : Named num [1:342] 1.139 -0.127 0.541 1.535 3.056 ...
 $ coefficients : num [1:2, 1:4] 20.8855 -0.085 0.8439 0.0191 24.7492 ...
 $ aliased      : Named logi [1:2] FALSE FALSE
 $ sigma        : num 1.92
 $ df           : int [1:3] 2 340 2
 $ r.squared    : num 0.0552
 $ adj.r.squared: num 0.0525
 $ fstatistic   : Named num [1:3] 19.9 1 340
 $ cov.unscaled : num [1:2, 1:2] 1.93e-01 -4.32e-03 -4.32e-03 9.84e-05
 $ na.action    : 'omit' Named int [1:2] 4 272

Extract elements from lists

lm_all$r.squared            # Base R $ notation
[1] 0.05524985
lm_all[["r.squared"]]       # Base R double brackets notation
[1] 0.05524985
pluck(lm_all, "r.squared")  # purrr::pluck()
[1] 0.05524985

Extract \(R^2\) for all groups

Using anonymous functions (i. e lambdas)

split_peng |>
  map(\(x) lm(bill_depth_mm ~ bill_length_mm,
              data = x)) |> 
  map(summary) |>
  map(\(x) pluck(x, "r.squared"))
[[1]]
[1] 0.2192052

[[2]]
[1] 0.4139429

[[3]]
[1] 0.2579242

[[4]]
[1] 0.4271096

[[5]]
[1] 0.06198376

Further simplifications

With shortcuts from purrr

split_peng |>
  map(\(grp) lm(bill_depth_mm ~ bill_length_mm, 
                data = grp)) |> 
  map(summary) |>
  map("r.squared") #<<
[[1]]
[1] 0.2192052

[[2]]
[1] 0.4139429

[[3]]
[1] 0.2579242

[[4]]
[1] 0.4271096

[[5]]
[1] 0.06198376

With coercion to doubles vector map_dbl()

split_peng |>
  map(\(x) lm(bill_depth_mm ~ bill_length_mm, 
              data = x)) |> 
  map(summary) |>
  map_dbl("r.squared") #<<
[1] 0.21920517 0.41394290 0.25792423 0.42710958 0.06198376

purrr anonmyous functions

  • The package contains its own lambda, which predates the \() syntax in base R.
  • It uses ~ in place of \() and has default placeholders starting with ..
  • We advise against its use.

Lists as a column in a tibble

Example

tibble(numbers = 1:8,
       my_list = list(a = c("a", "b"), b = 2.56, 
                      c = c("a", "b", "c"), d = rep(TRUE, 4),
                      d = 2:3, e = 4:6, f = FALSE, g = c(1, 4, 5, 6)))
# A tibble: 8 × 2
  numbers my_list     
    <int> <named list>
1       1 <chr [2]>   
2       2 <dbl [1]>   
3       3 <chr [3]>   
4       4 <lgl [4]>   
5       5 <int [2]>   
6       6 <int [3]>   
7       7 <lgl [1]>   
8       8 <dbl [4]>   

Nesting

Rewriting our previous example

Nesting the tibble by island and species

penguins |>
  nest(.by = c(island, species))
# A tibble: 5 × 3
  island    species   data              
  <fct>     <fct>     <list>            
1 Torgersen Adelie    <tibble [52 × 6]> 
2 Biscoe    Adelie    <tibble [44 × 6]> 
3 Dream     Adelie    <tibble [56 × 6]> 
4 Biscoe    Gentoo    <tibble [124 × 6]>
5 Dream     Chinstrap <tibble [68 × 6]> 

Rewriting our previous example

With modelling using mutate and map

penguins |>
  nest(.by = c(island, species)) |>
  mutate(model = map(data, 
                     \(x) lm(bill_depth_mm ~ bill_length_mm,
                          data = x)),
         summary = map(model, summary),
         r_squared = map_dbl(summary, "r.squared"))
  • Very powerful
  • Data rectangle
  • Next lecture will show you how dplyr, tidyr, tibble, purrr and broom nicely work together
# A tibble: 5 × 6
  island    species   data               model  summary    r_squared
  <fct>     <fct>     <list>             <list> <list>         <dbl>
1 Torgersen Adelie    <tibble [52 × 6]>  <lm>   <smmry.lm>    0.0620
2 Biscoe    Adelie    <tibble [44 × 6]>  <lm>   <smmry.lm>    0.219 
3 Dream     Adelie    <tibble [56 × 6]>  <lm>   <smmry.lm>    0.258 
4 Biscoe    Gentoo    <tibble [124 × 6]> <lm>   <smmry.lm>    0.414 
5 Dream     Chinstrap <tibble [68 × 6]>  <lm>   <smmry.lm>    0.427 

Don’t forget vectorisation!

Don’t “overmap” functions!

  • Use map() only if required for functions that are not vectorised.
nums <- sample(1:10,
               size = 1000,
               replace = TRUE) 

log_vec <- log(nums)
log_map <- map_dbl(nums, log)

identical(log_vec, log_map)
[1] TRUE

Benchmarking

bench::mark(
  vectorised = log(nums),
  mapped = map_dbl(nums, log)) |>
  autoplot() +
  theme_bw(18)

Calling a function for its side-effects

Side effects are not pure

  • Output on screen
  • Save files to disk
  • The return values are not used for further list-wise computation.

The map family should not be used.

Use the walk family of function

  • walk(), walk2()
  • Returns the input list (invisibly)

Saving 5 plots using indices for filenames

iwalk(split_peng,  \(x, y) {
  ggplot(x, aes(x = island)) +
    geom_bar() +
    # extract vector then unique species string 
    labs(title = pull(x, species) |>
                   str_unique())
    ggsave(glue::glue("{y}_peng.pdf"))
})

fs::dir_ls(glob = "*_peng.pdf")
1_peng.pdf 2_peng.pdf 3_peng.pdf 4_peng.pdf 5_peng.pdf 

Wrap up

map() or walk()

map2() or walk2()

pmap()

Before we stop

You learned to

  • Functional programming: focus on actions
  • for loops are fine, but don’t write them
  • Pass functions as arguments
  • Apprehend nested tibbles with list-columns

Acknowledgments

  • Eric Koncina for writing the initial content
  • Jennifer Bryan (LEGO pictures, courtesy CC licence)
  • Hadley Wickham
  • Lise Vaudor
  • Ian Lyttle
  • Jim Hester

Further reading

Thank you for your attention!