Advanced R

for Functional Programming

Roland Krause

R Workshop

Wednesday, 12 February 2025

Introduction

Material

purrr Tutorial by Jenny Bryan

R for Data Science

Iteration

Advanced R

Functions
Functional programming
- Several chapters

Learning objectives

Functional programming approach to focus on actions
Iteration machinery written by someone else
Pass functions as arguments to higher order functions
Use map() to replace remaining for loops
Working with lists
Nesting of data

Outlook

Building multiple models from tidy data (purrr)
Working with multiple models (broom, next lecture)

Types and classes

Retrieve the data type with `typeof()`

typeof("RIntro!")

[1] "character"

typeof(TRUE)

[1] "logical"

typeof(2+0i)

[1] "complex"

Class of data structure

class(2)

[1] "numeric"

class(3.0)

[1] "numeric"

Storage mode

mode(2)

[1] "numeric"

# character vector converted as date
mode(as.Date("2024-09-23"))

[1] "numeric"

Vectors

Atomic vectors

Atomic means only one type of data

The type of each atom is the same. The size of each atom is one single element.

The conversion between types can be

Explicit (using as.*() functions)
or Implicit

# double
c(1, 5, 7)

[1] 1 5 7

# automatic coercion
str(c(1, 5, 7, "seven"))

 chr [1:4] "1" "5" "7" "seven"

# voluntary coercion
as.character(c(1, 5, 7, "seven"))

[1] "1"     "5"     "7"     "seven"

Names in vectors

Elements in a vector can receive names.

c(a = "apple", b = "banana")

       a        b 
 "apple" "banana"

names(c(a = "apple", b = "banana"))

[1] "a" "b"

Elements can be retrieved by name

c(a = "apple", b = "banana")["a"]

      a 
"apple"

c(a = "apple", b = "banana")[2]

       b 
"banana"

Factors

Categorical values

Factors are useful - to restrict possible values - to order character values

Using strings

(rivers_str <- c("Alzette","Moselle", "Rhine", "Sauer"))

[1] "Alzette" "Moselle" "Rhine"   "Sauer"

Rivers flow into each other but sort() would not know.

Unordered by default

factor(c("Alzette", "Sauer", "Moselle", "Rhine"))

[1] Alzette Sauer   Moselle Rhine  
Levels: Alzette Moselle Rhine Sauer

Creating an ordered list

(rivers_fct <- factor(rivers_str, 
                      levels = c("Alzette", "Sauer", "Moselle", "Rhine"), 
                        ordered = TRUE))

[1] Alzette Moselle Rhine   Sauer  
Levels: Alzette < Sauer < Moselle < Rhine

sort(rivers_fct)

[1] Alzette Sauer   Moselle Rhine  
Levels: Alzette < Sauer < Moselle < Rhine

forcats

Factors are seen in statistical applications and plotting order. Manipulation of factors is greatly simplified by the forcats package.

We will exercise this in the context of ggplot.

library(forcats)
fct_rev(rivers_fct)

[1] Alzette Moselle Rhine   Sauer  
Levels: Rhine < Moselle < Sauer < Alzette

Date and Time are inbuilt classes

Date

today()

[1] "2025-03-31"

class(today())

[1] "Date"

mode(today())

[1] "numeric"

Datetime

now()

[1] "2025-03-31 11:37:43 UTC"

class(now())

[1] "POSIXct" "POSIXt"

Difftime

today() -
  as.Date("1972-02-26")

Time difference of 19392 days

class(
today() -
  as.Date("1972-02-26")
)

[1] "difftime"

Converting to dates

This year as Date

as.Date(2024)

[1] "1975-07-18"

as.Date("2024")

Error in charToDate(x): character string is not in a standard unambiguous format

as.Date("2024", "%Y")

[1] "2024-03-31"

Import

readr::read_csv("year\n2024", 
                col_types = cols(year = col_date("%Y")))

# A tibble: 1 × 1
  year      
  <date>    
1 2024-01-01

`lubridate` to the rescue

lubridate::ymd("2024", truncated = 2L)

[1] "2024-01-01"

lubridate cheatsheet

You can do a lot of things in base R with dates without lubridate but it provides an excellent cheatsheet lubridate has a range of functions for parsing ill-formatted dates and times.

Parsing dates with `lubridate` functions (reprise)

A gift from your collaborators

visit_times <- tribble(
  ~subject, ~visit_date,
  1, "01/07/2001",
  2, "01.MAY.2012",
  3, "12-07-2015",
  4, "4/5/14",
  5, "12. Jun 1999"
)

Lubridate to the rescue!

visit_times |> 
  mutate(good_date = 
           lubridate::dmy(visit_date))

# A tibble: 5 × 3
  subject visit_date   good_date 
    <dbl> <chr>        <date>    
1       1 01/07/2001   2001-07-01
2       2 01.MAY.2012  2012-05-01
3       3 12-07-2015   2015-07-12
4       4 4/5/14       2014-05-04
5       5 12. Jun 1999 1999-06-12

All dates must use date types

Don’t be cheap - it’s easy to convert dates to date; all major programming environments support date and time classes.

Lists

Introduction to lists

Lists are vectors

list("a", 1:4, TRUE)

[[1]]
[1] "a"

[[2]]
[1] 1 2 3 4

[[3]]
[1] TRUE

Lists are not flat

Contain other, complex, data structures, unlike regular vectors.

c("a", 1:4, TRUE)

[1] "a"    "1"    "2"    "3"    "4"    "TRUE"

source: H. Wickham - R for data science, licence CC

Lists are cool

Can contain anything, named or not

list(head(women, 3), 
     sw = head(swiss, 3), c("ab", "c"))

[[1]]
  height weight
1     58    115
2     59    117
3     60    120

$sw
             Fertility Agriculture Examination Education Catholic
Courtelary        80.2        17.0          15        12     9.96
Delemont          83.1        45.1           6         9    84.84
Franches-Mnt      92.5        39.7           5         5    93.40
             Infant.Mortality
Courtelary               22.2
Delemont                 22.2
Franches-Mnt             20.2

[[3]]
[1] "ab" "c"

Even empty slot

list(head(women, 3), NULL, 
     sw = head(swiss, 3), c("ab", "c"))

[[1]]
  height weight
1     58    115
2     59    117
3     60    120

[[2]]
NULL

$sw
             Fertility Agriculture Examination Education Catholic
Courtelary        80.2        17.0          15        12     9.96
Delemont          83.1        45.1           6         9    84.84
Franches-Mnt      92.5        39.7           5         5    93.40
             Infant.Mortality
Courtelary               22.2
Delemont                 22.2
Franches-Mnt             20.2

[[4]]
[1] "ab" "c"

Working with lists

items <- list(head(women, 3), 
              NULL, 
              sw = head(swiss, 3), 
              char = c("ab", "c"))

items[[4]]

[1] "ab" "c"

items[4]

$char
[1] "ab" "c"

What’s the difference?

items$char

[1] "ab" "c"

items[["char"]]

[1] "ab" "c"

items["char"]

$char
[1] "ab" "c"

Accessing elements of a list

The steam machine is the list() structure

[["element]] == $element

The $ notation is a shorthand for the double brackets that does not require quoting the name of the item.

Sub-selection

Lists inside rectangle data

Rectangle, tibble/data.frame fits our mindset

2D representation of data we see everywhere

cms <- sample_n(tidyr::cms_patient_care, size = 4)
cms

# A tibble: 4 × 5
  ccn    facility_name    measure_abbr score type 
  <chr>  <chr>            <chr>        <dbl> <chr>
1 011502 COMFORT CARE CO… visits_immi…   390 deno…
2 011517 HOSPICE OF WEST… pain_assess…   100 obse…
3 011508 SOUTHERNCARE NE… opioid_bowel   100 obse…
4 011501 SOUTHERNCARE NE… pain_assess…   325 deno…

Are columns with more complex structures useful?
Can we not model everything flat?
How to store some JSON?

Use list-columns

cms |> 
  mutate(metadata = list(
    women,
    vec = c("ab", "c"),
    lm(height ~ weight, data = women),
    NULL
  ), .after = ccn)

# A tibble: 4 × 6
  ccn    metadata facility_name measure_abbr score
  <chr>  <named > <chr>         <chr>        <dbl>
1 011502 <df>     COMFORT CARE… visits_immi…   390
2 011517 <chr>    HOSPICE OF W… pain_assess…   100
3 011508 <lm>     SOUTHERNCARE… opioid_bowel   100
4 011501 <NULL>   SOUTHERNCARE… pain_assess…   325
# ℹ 1 more variable: type <chr>

However, how can we iterate along list-columns?

Functions

Functions in R

Data science workflows can be well abstracted with functional programming.

No tracking of states.
No complex objects that require methods or inheritance
No user interfaces

R can be used in functional programming context.

Functions are the primary organizing programming element
Functions have no side effects
Functions pass input to each other

Functions as actions

Functions declared …

my_function <- function(my_argument) {
  my_argument + 1
}

in the global environment …

ls()

[1] "my_function"

ls.str()

my_function : function (my_argument)

are reusable.

my_function(2)

[1] 3

Anonymous functions …

are not stored in an object and are used on the fly and

(function(x) { x + 2 })(2) #

[1] 4

do not alter the global environment.

 ls()

[1] "my_function"

# remove the previous my_function to convince you
rm(my_function)
(\(x) x + 2)(2)

[1] 4

ls()

character(0)

purrr

purrr is functional programming toolkit which enhances R with consistent tools for working with functions, lists and vectors.

Functional programming […] treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data

— Wikipedia

An example function

Consider a hypothetic `put_on()` function

put_on(figures, antenna) returns a minifig with antenna

Iteration with LEGO figures

Illustration for several lego

How to apply put_on() to more than 1 input?

`put_on(figures, antenna)`

Functional programming

Back to R programming

For loop approach

out <- vector("list", length(legos))
for (i in seq_along(legos)) {
  out[[i]] <- put_on(legos[[i]], antenna) 
}
out

Functional programming approach

Named function

antennate <- function(x) put_on(x, antenna)
map(figures, antennate)

Anonymous function

map(figures, \(x) put_on(x, antenna))

Answer

The for loop machinery

means <- vector("list", ncol(swiss))
for (i in seq_along(swiss)) {
  means[i] <- mean(swiss[[i]])
}
# Need to manually add names
names(means) <- names(swiss)
means |>
  str()

List of 6
 $ Fertility       : num 70.1
 $ Agriculture     : num 50.7
 $ Examination     : num 16.5
 $ Education       : num 11
 $ Catholic        : num 41.1
 $ Infant.Mortality: num 19.9

Better answer

Functional programming focuses on actions

map(swiss, mean) |>
  str()

List of 6
 $ Fertility       : num 70.1
 $ Agriculture     : num 50.7
 $ Examination     : num 16.5
 $ Education       : num 11
 $ Catholic        : num 41.1
 $ Infant.Mortality: num 19.9

The `purrr::map()` family of functions

Are designed to be consistent
- map(), map2() and pmap() return a lists
Typed variants with vectorized output:
- map_lgl()
- map_int()
- map_dbl()
- map_chr()
- map_dfr() data.frame rows
- map_dfc() data.frame cols
Fail if coercion is impossible

map_dbl(swiss, mean)

       Fertility      Agriculture      Examination        Education 
        70.14255         50.65957         16.48936         10.97872 
        Catholic Infant.Mortality 
        41.14383         19.94255

map_chr(swiss, mean)

       Fertility      Agriculture      Examination        Education 
     "70.142553"      "50.659574"      "16.489362"      "10.978723" 
        Catholic Infant.Mortality 
     "41.143830"      "19.942553"

map_int(swiss, mean)

Error in `map_int()`:
ℹ In index: 1.
ℹ With name: Fertility.
Caused by error:
! Can't coerce from a number to an integer.

purrr and base R

`apply()` family of functions

map() is the general function and close to base::lapply()
- Both return lists
map_dbl() generating vectors are similar to sapply()
map() introduces shortcuts
- Generally easier to use and more consistent
- Similar in idea to map() in Python
apply(), tapply() and vapply() have varying return types and are best avoided even when using pure base R.

Examples

map_dbl(swiss, mean)

       Fertility      Agriculture      Examination        Education 
        70.14255         50.65957         16.48936         10.97872 
        Catholic Infant.Mortality 
        41.14383         19.94255

lapply(swiss, mean)

$Fertility
[1] 70.14255

$Agriculture
[1] 50.65957

$Examination
[1] 16.48936

$Education
[1] 10.97872

$Catholic
[1] 41.14383

$Infant.Mortality
[1] 19.94255

sapply(swiss, mean)

       Fertility      Agriculture      Examination        Education 
        70.14255         50.65957         16.48936         10.97872 
        Catholic Infant.Mortality 
        41.14383         19.94255

Iterating on 2 lists in parallel

enhair <- function(x, y) put_on(x, y)
map2(legos, hairs, enhair)

Practical examples

`map2()`

marks <- list(report = c(14, 13),
              practical = c(10, 12),
              theoretical = c(17, 8))

weights <- c(2, 0.5, 1.5)

map2(marks, weights,  \(x, y) x * y)

$report
[1] 28 26

$practical
[1] 5 6

$theoretical
[1] 25.5 12.0

Iterate on list (marks), but vectorized on atomic vectors (weights).

`pmap()`

unif_grid <- expand_grid(n = c(3, 6),
                         min = 1,
                         max = c(4, 8))

Generate grid of values for named arguments of a function

pmap(unif_grid, runif)

[[1]]
[1] 2.734999 2.792766 2.933394

[[2]]
[1] 4.587340 7.047169 7.901624

[[3]]
[1] 2.366165 3.732995 3.982275 1.814529 1.757976 2.534823

[[4]]
[1] 2.827777 2.979828 7.966154 4.722096 3.719253 2.846717

Linear modelling

Palmer penguins

library(palmerpenguins)
ggplot(penguins,
       aes(x = bill_length_mm, y = bill_depth_mm, colour = species)) +
  geom_point() +
  geom_smooth(method = "lm",
              formula = 'y ~ x',
              se = FALSE) +
  facet_wrap(vars(island)) +
  labs(title = "Species by Islands")

Fitting a linear model

Using the pinguins dataset we can fit a linear model to explain the bill depth by the bill_length_mm.

lm(bill_depth_mm ~ bill_length_mm, data = penguins)


Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)

Coefficients:
   (Intercept)  bill_length_mm  
      20.88547        -0.08502

Printed output of a lm()

`lm` outputs complex lists

summary(lm(bill_depth_mm ~ bill_length_mm, 
           data = penguins))


Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1381 -1.4263  0.0164  1.3841  4.5255 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    20.88547    0.84388  24.749  < 2e-16 ***
bill_length_mm -0.08502    0.01907  -4.459 1.12e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.922 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.05525,   Adjusted R-squared:  0.05247 
F-statistic: 19.88 on 1 and 340 DF,  p-value: 1.12e-05

Summarise a linear model with

base::summary()

$R^2$ is low (0.05525) because we mix all individuals.
We need one tibble per group

Split the penguin data

Split a data.frame by group into a list of tibble

split_peng <- group_split(palmerpenguins::penguins, 
                          island, 
                          species)

Get the dimensions of each tibble in the list

map(split_peng, dim)

[[1]]
[1] 44  8

[[2]]
[1] 124   8

[[3]]
[1] 56  8

[[4]]
[1] 68  8

[[5]]
[1] 52  8

Coercion

How many observations per group?
Coerce to a double vector

map_dbl(split_peng, nrow)

[1]  44 124  56  68  52

Map the linear model

General syntax: map(list, function)

list split_peng
function can be an anonymous function

Curly braces are optional but helpful

Several code lines
Wrap lines to improve readability

map(split_peng, \(x) {
  lm(bill_depth_mm ~ bill_length_mm, 
     data = x)
})

[[1]]

Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)

Coefficients:
   (Intercept)  bill_length_mm  
        9.6263          0.2244  


[[2]]

Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)

Coefficients:
   (Intercept)  bill_length_mm  
        5.2510          0.2048  


[[3]]

Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)

Coefficients:
   (Intercept)  bill_length_mm  
        9.2607          0.2335  


[[4]]

Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)

Coefficients:
   (Intercept)  bill_length_mm  
        7.5691          0.2222  


[[5]]

Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)

Coefficients:
   (Intercept)  bill_length_mm  
       14.1359          0.1102

To extract $R^2$

`base::summary()` generates a list

lm_all <- summary(lm(bill_depth_mm ~ bill_length_mm,
                     data = penguins))
str(lm_all, max.level = 1, give.attr = FALSE)

List of 12
 $ call         : language lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)
 $ terms        :Classes 'terms', 'formula'  language bill_depth_mm ~ bill_length_mm
 $ residuals    : Named num [1:342] 1.139 -0.127 0.541 1.535 3.056 ...
 $ coefficients : num [1:2, 1:4] 20.8855 -0.085 0.8439 0.0191 24.7492 ...
 $ aliased      : Named logi [1:2] FALSE FALSE
 $ sigma        : num 1.92
 $ df           : int [1:3] 2 340 2
 $ r.squared    : num 0.0552
 $ adj.r.squared: num 0.0525
 $ fstatistic   : Named num [1:3] 19.9 1 340
 $ cov.unscaled : num [1:2, 1:2] 1.93e-01 -4.32e-03 -4.32e-03 9.84e-05
 $ na.action    : 'omit' Named int [1:2] 4 272

Extract elements from lists

lm_all$r.squared            # Base R $ notation

[1] 0.05524985

lm_all[["r.squared"]]       # Base R double brackets notation

[1] 0.05524985

pluck(lm_all, "r.squared")  # purrr::pluck()

[1] 0.05524985

Extract $R^2$ for all groups

Using anonymous functions

split_peng |>
  map(\(x) lm(bill_depth_mm ~ bill_length_mm,
              data = x)) |> 
  map(summary) |>
  map(\(x) pluck(x, "r.squared"))

[[1]]
[1] 0.2192052

[[2]]
[1] 0.4139429

[[3]]
[1] 0.2579242

[[4]]
[1] 0.4271096

[[5]]
[1] 0.06198376

Further simplifications

With shortcuts from `purrr`

split_peng |>
  map(\(grp) lm(bill_depth_mm ~ bill_length_mm, 
                data = grp)) |> 
  map(summary) |>
  map("r.squared")

[[1]]
[1] 0.2192052

[[2]]
[1] 0.4139429

[[3]]
[1] 0.2579242

[[4]]
[1] 0.4271096

[[5]]
[1] 0.06198376

With coercion to doubles vector `map_dbl()`

split_peng |>
  map(\(x) lm(bill_depth_mm ~ bill_length_mm, 
              data = x)) |> 
  map(summary) |>
  map_dbl("r.squared")

[1] 0.21920517 0.41394290 0.25792423 0.42710958 0.06198376

purrr anonmyous functions

The package contains its own lambda, which predates the \() syntax in base R.
It uses ~ in place of \() and has default placeholders starting with ..
We’re not going to use it.

Lists as a column in a tibble

Example

tibble(numbers = 1:8,
       my_list = list(a = c("a", "b"), b = 2.56, 
                      c = c("a", "b", "c"), d = rep(TRUE, 4),
                      d = 2:3, e = 4:6, f = FALSE, g = c(1, 4, 5, 6)))

# A tibble: 8 × 2
  numbers my_list     
    <int> <named list>
1       1 <chr [2]>   
2       2 <dbl [1]>   
3       3 <chr [3]>   
4       4 <lgl [4]>   
5       5 <int [2]>   
6       6 <int [3]>   
7       7 <lgl [1]>   
8       8 <dbl [4]>

Nesting

Rewriting our previous example

Nesting the `tibble` by `island` and `species`

penguins |>
  nest(.by = c(island, species))

# A tibble: 5 × 3
  island    species   data              
  <fct>     <fct>     <list>            
1 Torgersen Adelie    <tibble [52 × 6]> 
2 Biscoe    Adelie    <tibble [44 × 6]> 
3 Dream     Adelie    <tibble [56 × 6]> 
4 Biscoe    Gentoo    <tibble [124 × 6]>
5 Dream     Chinstrap <tibble [68 × 6]>

Rewriting our previous example

With modelling using `mutate` and `map`

penguins |>
  nest(.by = c(island, species)) |>
  mutate(
    model = map(data, \(x) lm(bill_depth_mm ~ 
                                bill_length_mm, 
                              data = x)),
    summary = map(model, summary),
    r_squared = map_dbl(summary, "r.squared")
  )

Very powerful
Data rectangle
Next lecture will show you how dplyr, tidyr, tibble, purrr and broom nicely work together

# A tibble: 5 × 6
  island    species   data     model  summary   
  <fct>     <fct>     <list>   <list> <list>    
1 Torgersen Adelie    <tibble> <lm>   <smmry.lm>
2 Biscoe    Adelie    <tibble> <lm>   <smmry.lm>
3 Dream     Adelie    <tibble> <lm>   <smmry.lm>
4 Biscoe    Gentoo    <tibble> <lm>   <smmry.lm>
5 Dream     Chinstrap <tibble> <lm>   <smmry.lm>
# ℹ 1 more variable: r_squared <dbl>

Don’t forget vectorisation!

Don’t “overmap” functions!

Use map() only if required for functions that are not vectorised.

nums <- sample(1:10,
               size = 1000,
               replace = TRUE) 

log_vec <- log(nums)
log_map <- map_dbl(nums, log)

identical(log_vec, log_map)

[1] TRUE

Benchmarking

bench::mark(
  vectorised = log(nums),
  mapped = map_dbl(nums, log),
  memory = FALSE) |>
  autoplot() +
  theme_bw(18)

Calling a function for its side-effects

Side effects are not pure

Output on screen
Save files to disk
The return values are not used for further list-wise computation.

The map family should not be used.

Use the `walk` family of function

walk(), walk2()
Returns the input list (invisibly)

Saving 5 plots using indices for filenames

iwalk(split_peng,  \(x, y) {
  ggplot(x, aes(x = island)) +
    geom_bar() +
    # extract vector then unique species string 
    labs(title = pull(x, species) |>
                   str_unique())
    ggsave(glue::glue("{y}_peng.pdf"))
})

fs::dir_ls(glob = "*_peng.pdf")

1_peng.pdf 2_peng.pdf 3_peng.pdf 4_peng.pdf 5_peng.pdf

Wrap up

map() or walk()

map2() or walk2()

pmap()

Assignment and environments

Other assignment operators

<- Left assignment operator

Most commonly used.

Rstudio has the built-in shortcut Alt+- for <-.

-> Right assignment operator

3 -> t

Right assignment in long pipe statements

swiss |> 
  as_tibble(rownames = "Province") |> 
  mutate(First = str_extract(Province, "^.")) |> 
  group_by(First) |> 
  summarize(across(where(is.numeric), 
                   mean)) -> swiss_province_group

swiss_province_group

# A tibble: 17 × 7
   First Fertility Agriculture Examination
   <chr>     <dbl>       <dbl>       <dbl>
 1 A          66.6        63.4       18   
 2 B          77.1        54.3       21   
 3 C          72.5        57.4       13.3 
 4 D          83.1        45.1        6   
 5 E          68.8        78.8       12.5 
 6 F          92.5        39.7        5   
 7 G          82.2        51.7       14.3 
 8 H          77.3        89.7        5   
 9 L          62.7        26.4       25.4 
10 M          73.2        58.9       13.4 
11 N          66.0        37.3       24.7 
12 O          65.0        62.6       16   
13 P          74.1        52.3        9.67
14 R          49.3        45.0       18   
15 S          79.8        67.2       10.2 
16 V          65.1        29.8       23.2 
17 Y          65.4        49.5       15   
# ℹ 3 more variables: Education <dbl>,
#   Catholic <dbl>, Infant.Mortality <dbl>

The case of the equal sign

= Named parameter operator

swiss |> 
  as_tibble(rownames = "Province") |> 
  mutate(First = str_extract(Province, "^.")) |> 
  group_by(First) |> 
  summarize(across(where(is.numeric), mean)) -> 
    swiss_province_group

Double meaning of =

Alias of left assignment <-.
Parameter - argument operator

mean(1:4, na.rm <- TRUE)

Error in mean.default(1:4, na.rm <- TRUE): 'trim' must be numeric of length one

mean(1:4, na.rm = TRUE)

[1] 2.5

Take home message

Do not use = as regular left assignment operator! Learn the keyboard shortcut.

Parental assignment

Parental assignment operators

Left <<- and

Right ->>

Work as -> and <- but assign in Global Environment.

assign() function

assign(name, data, envir)

Breaking values out of pipes

  n_swiss <- 0
swiss_tbl |>
  (\(x) { n_swiss <<- nrow(x); x })() |>
  summarise(mean_ag = mean(Agriculture))

# A tibble: 1 × 1
  mean_ag
    <dbl>
1    50.7

n_swiss

[1] 47

With local assignment

n_swiss <- 0
swiss_tbl |>
  (\(x) { n_swiss <- nrow(x); x })() |>
  select(Agriculture)

# A tibble: 47 × 1
   Agriculture
         <dbl>
 1        17  
 2        45.1
 3        39.7
 4        36.5
 5        43.5
 6        35.3
 7        70.2
 8        67.8
 9        53.3
10        45.2
# ℹ 37 more rows

n_swiss

[1] 0

::::

Use of parental assignment operators for side effects

Multiple filtering steps

swiss_tbl |> 
  filter(Agriculture > 40) |> 
  # how many rows?
  filter(Education   < 50) |> 
  # how many rows?
  filter(Examination < 10)

# A tibble: 8 × 7
  Province     Fertility Agriculture Examination
  <chr>            <dbl>       <dbl>       <int>
1 Delemont          83.1        45.1           6
2 Paysd'enhaut      72          63.5           6
3 Conthey           75.5        85.9           3
4 Entremont         69.3        84.9           7
5 Herens            77.3        89.7           5
6 Monthey           79.4        64.9           7
7 St Maurice        65          75.9           9
8 Sierre            92.2        84.6           3
# ℹ 3 more variables: Education <int>,
#   Catholic <dbl>, Infant.Mortality <dbl>

Custom counting function

counter <- function(df){
  # Parental assignment
  n_df <<- nrow(df)
  # return 
  df 
}
counter(swiss_tbl) -> dont_print 
n_df

[1] 47

Better `counter` function

counter_name <- function(df, n_name){
  assign(n_name, nrow(df))
  df
}
counter_name(swiss_tbl, "n_filter_1" )

# A tibble: 47 × 7
   Province     Fertility Agriculture Examination
   <chr>            <dbl>       <dbl>       <int>
 1 Courtelary        80.2        17            15
 2 Delemont          83.1        45.1           6
 3 Franches-Mnt      92.5        39.7           5
 4 Moutier           85.8        36.5          12
 5 Neuveville        76.9        43.5          17
 6 Porrentruy        76.1        35.3           9
 7 Broye             83.8        70.2          16
 8 Glane             92.4        67.8          14
 9 Gruyere           82.4        53.3          12
10 Sarine            82.9        45.2          16
# ℹ 37 more rows
# ℹ 3 more variables: Education <int>,
#   Catholic <dbl>, Infant.Mortality <dbl>

n_filter_1

Error: object 'n_filter_1' not found

Need to set an environment

counter_global <- function(df, n_name){
  assign(n_name,
         nrow(df),  
         envir = .GlobalEnv)
  df
}
counter_global(swiss_tbl, "n_filter_2") -> dont_print
n_filter_2

[1] 47

Final options

Global name space pollution

swiss_tbl |> 
  filter(Agriculture > 40) |> 
  counter_global("n_agr") |> 
  filter(Education   > 5) |> 
  counter_global("n_edu") |> 
  filter(Examination < 10) -> dont_print

Low readability, custom names

swiss_tbl |>
  filter(Agriculture > 40) |> 
  (\(df) { n_agr <<- nrow(df); df })() |> 
  filter(Education   > 5) |> 
  (\(df) { n_edu <<- nrow(df); df })() |> 
  filter(Examination < 10) -> dont_print

n_agr

[1] 31

n_edu

[1] 22

Before we stop

You learned to

Functional programming: focus on actions
for loops are fine, but don’t write them
Pass functions as arguments
Apprehend nested tibbles with list-columns

Acknowledgments

Eric Koncina for writing the initial content
Jennifer Bryan (LEGO pictures, courtesy CC licence)
Hadley Wickham
Lise Vaudor
Ian Lyttle
Jim Hester

Further reading

Jennifer Bryan - lessons & tutorial
Hadley Wickham - R for data science (iteration, many models)
Ian Lyttle - purrr applied for engineering
Robert Rudis - purrr, comparison with base
Rstudio’s blog - purrr 0.2 release purrr 0.3 release
Kris Jenkins - What is Functional Programming? (Blog version and Talk video)
Lise Vaudor - R-atique (in french)

Advanced R

Introduction

Material

R for Data Science

Advanced R

Learning objectives

Learning objectives

Outlook

Types and classes

Types and classes

Retrieve the data type with typeof()

Class of data structure

Storage mode

Vectors

Factors

Categorical values

Using strings

Unordered by default

Creating an ordered list

Date and Time are inbuilt classes

Date

Datetime

Difftime

Converting to dates

This year as Date

Import

lubridate to the rescue

Parsing dates with lubridate functions (reprise)

A gift from your collaborators

Lubridate to the rescue!

Your turn!

Lists

Introduction to lists

Lists are vectors

Lists are cool

Can contain anything, named or not

Even empty slot

Working with lists

Accessing elements of a list

Lists inside rectangle data

Rectangle, tibble/data.frame fits our mindset

Use list-columns

Functions

Your turn

Functions in R

Functions as actions

Functions declared …

Anonymous functions …

purrr

An example function

Consider a hypothetic put_on() function

Iteration with LEGO figures

Illustration for several lego

put_on(figures, antenna)

Functional programming

Back to R programming

For loop approach

Functional programming approach

Your turn

Answer

The for loop machinery

Better answer

Functional programming focuses on actions

The purrr::map() family of functions

purrr and base R

apply() family of functions

Iterating on 2 lists in parallel

Practical examples

map2()

pmap()

Linear modelling

Palmer penguins

Fitting a linear model

lm outputs complex lists

Split the penguin data

Split a data.frame by group into a list of tibble

Coercion

Map the linear model

To extract \(R^2\)

base::summary() generates a list

Retrieve the data type with `typeof()`

`lubridate` to the rescue

Parsing dates with `lubridate` functions (reprise)

Consider a hypothetic `put_on()` function

`put_on(figures, antenna)`

The `purrr::map()` family of functions

`apply()` family of functions

`map2()`

`pmap()`

`lm` outputs complex lists

`base::summary()` generates a list

With shortcuts from `purrr`

With coercion to doubles vector `map_dbl()`

Nesting the `tibble` by `island` and `species`

With modelling using `mutate` and `map`

Use the `walk` family of function

Better `counter` function