with purrr
R workshop
Thursday, 8 February 2024
Learning objectives
actions
functions as arguments to higher order functions
map() to replace remaining for loops
Motivation
Building multiple models from tidy data (purrr)
Working with multiple models (broom, next lecture)
Atomic vectors
Atomic means only one type of data
The type of each atom is the same. The size of each atom is one single element.
The conversion between types can be explicit (using as.*() functions) or implicit.
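As a small illustration of the two kinds of conversion, here is a sketch in base R only. The variable names (mixed, with_text) are made up for this example.

```r
# Implicit coercion: c() promotes mixed inputs to the most general type
mixed <- c(1L, 2.5, TRUE)       # integer + double + logical -> double
typeof(mixed)

with_text <- c(1, "a", TRUE)    # anything combined with character -> character
typeof(with_text)

# Explicit coercion: the as.*() functions convert on request
as.integer("42")
as.numeric(c("1.5", "2"))
as.character(c(TRUE, FALSE))
```

Implicit coercion always moves towards the more general type (logical, integer, double, character), which is why an atomic vector never holds two types at once.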
list() structure
[[1]]
height weight
1 58 115
2 59 117
3 60 120
$sw
Fertility Agriculture Examination Education Catholic
Courtelary 80.2 17.0 15 12 9.96
Delemont 83.1 45.1 6 9 84.84
Franches-Mnt 92.5 39.7 5 5 93.40
Infant.Mortality
Courtelary 22.2
Delemont 22.2
Franches-Mnt 20.2
[[3]]
[1] "ab" "c"
[[1]]
height weight
1 58 115
2 59 117
3 60 120
[[2]]
NULL
$sw
Fertility Agriculture Examination Education Catholic
Courtelary 80.2 17.0 15 12 9.96
Delemont 83.1 45.1 6 9 84.84
Franches-Mnt 92.5 39.7 5 5 93.40
Infant.Mortality
Courtelary 22.2
Delemont 22.2
Franches-Mnt 20.2
[[4]]
[1] "ab" "c"
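A list like the one printed above could be built as follows. This is a reconstruction from the output, assuming the first three rows of the built-in women and swiss datasets were used; the name my_list is hypothetical.

```r
# A list can hold elements of different types and sizes,
# including data frames, NULL and character vectors.
my_list <- list(
  head(women, 3),        # unnamed data.frame element
  NULL,                  # NULL is kept as an element by list()
  sw = head(swiss, 3),   # named element, printed as $sw
  c("ab", "c")           # character vector of length 2
)

length(my_list)
names(my_list)
str(my_list, max.level = 1)
```

Unnamed elements are printed with their position ([[1]], [[2]], ...), named ones with $name.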
2D representation of data we see everywhere
# A tibble: 4 × 5
ccn facility_name measure_abbr score type
<chr> <chr> <chr> <dbl> <chr>
1 011500 BAPTIST HOSPICE composite_process 202 denomin…
2 011514 ST VINCENT'S HOSPICE pain_screening 210 denomin…
3 011502 COMFORT CARE COASTAL HOSPICE - BALDWIN pain_screening 295 denomin…
4 011517 HOSPICE OF WEST ALABAMA dyspnea_screening 400 denomin…
cms |>
mutate(metadata = list(
women,
vec = c("ab", "c"),
lm(height ~ weight, data = women),
NULL
), .after = ccn)
# A tibble: 4 × 6
ccn metadata facility_name measure_abbr score type
<chr> <named list> <chr> <chr> <dbl> <chr>
1 011500 <df [15 × 2]> BAPTIST HOSPICE composite_p… 202 deno…
2 011514 <chr [2]> ST VINCENT'S HOSPICE pain_screen… 210 deno…
3 011502 <lm> COMFORT CARE COASTAL HOSPICE - … pain_screen… 295 deno…
4 011517 <NULL> HOSPICE OF WEST ALABAMA dyspnea_scr… 400 deno…
in the global environment …
are reusable.
purrr enhances R with consistent tools for working with functions, lists and vectors.
Functional programming […] treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data
— Wikipedia
put_on function
put_on(figures, antenna) returns a LEGO figure with antenna
How to apply put_on() to more than 1 input?
put_on(figures, antenna)
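The LEGO picture can be mimicked in code. In this sketch put_on() is a hypothetical stand-in (a "figure" is just a string and the result a small list), not a function from any package.

```r
library(purrr)

# Hypothetical stand-in for the LEGO example: works on ONE figure at a time
put_on <- function(figure, accessory) {
  list(figure = figure, accessory = accessory)
}

figures <- c("pilot", "chef", "pirate")

# map() calls put_on() once per figure; the accessory is fixed
# inside the anonymous function
with_antenna <- map(figures, \(fig) put_on(fig, "antenna"))

length(with_antenna)
with_antenna[[2]]
```

map() always returns a list with one element per input, so three figures go in and three equipped figures come out.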
Of course, someone has to write loops. It doesn’t have to be you.
— Jenny Bryan
Questions
Calculate the mean of each column of the swiss dataset, which is packaged with base R.
Tips
purrr::map() expects 2 arguments:
list
function
data.frame is a list
means <- vector("list", ncol(swiss))
for (i in seq_along(swiss)) {
  # [[ ]] assigns into a single list element
  means[[i]] <- mean(swiss[[i]])
}
# Need to manually add names
names(means) <- names(swiss)
means |>
str()
List of 6
$ Fertility : num 70.1
$ Agriculture : num 50.7
$ Examination : num 16.5
$ Education : num 11
$ Catholic : num 41.1
$ Infant.Mortality: num 19.9
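For comparison, a minimal sketch of the same computation with purrr::map(), relying on the fact that a data.frame is a list of columns:

```r
library(purrr)

# One call replaces the pre-allocation, the loop and the manual naming
means <- map(swiss, mean)

str(means)
```

The names of the input are carried over to the output, so the names(means) <- names(swiss) step is not needed.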
purrr::map() family of functions
map() is the general function and close to base::lapply()
map() introduces shortcuts (absent in lapply())
map_lgl()
map_int()
map_dbl()
map_chr()
map_dfr() data.frame rows
map_dfc() data.frame cols
       Fertility      Agriculture      Examination        Education
70.14255 50.65957 16.48936 10.97872
Catholic Infant.Mortality
41.14383 19.94255
Fertility Agriculture Examination Education
"70.142553" "50.659574" "16.489362" "10.978723"
Catholic Infant.Mortality
"41.143830" "19.942553"
Error in `map_int()`:
ℹ In index: 1.
ℹ With name: Fertility.
Caused by error:
! Can't coerce from a number to an integer.
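The outputs above can be reproduced along these lines. This is a sketch on the swiss dataset; the object names are illustrative, and tryCatch() is only there so that the failing call does not stop the script.

```r
library(purrr)

# Typed variants return an atomic vector instead of a list
means_dbl   <- map_dbl(swiss, mean)        # double vector, names kept
n_values    <- map_int(swiss, length)      # length() returns integers
all_numeric <- map_lgl(swiss, is.numeric)  # one TRUE/FALSE per column

# map_int() refuses to turn a non-whole double into an integer
int_attempt <- tryCatch(
  map_int(swiss, mean),
  error = function(e) "map_int() failed"
)

means_dbl
int_attempt
```

The strictness is the point: a typed variant either returns exactly the promised type or fails loudly.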
$report
[1] 14 13
$practical
[1] 10 12
$theoritical
[1] 17 8
$report
[1] 28 26
$practical
[1] 5 6
$theoritical
[1] 25.5 12.0
Iterate on list, but vectorized on atomic vectors
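The two printed lists are consistent with a map2() call that multiplies each set of marks by a weight. The sketch below is a reconstruction: the element names are copied from the printed output, while the weights (2, 0.5, 1.5) are inferred from the numbers and are an assumption.

```r
library(purrr)

exams <- list(
  report      = c(14, 13),
  practical   = c(10, 12),
  theoritical = c(17, 8)
)
weights <- list(2, 0.5, 1.5)  # assumed, inferred from the output

# map2() walks over both lists in parallel (element 1 with element 1, ...);
# inside each call, `score * w` is ordinary vectorised arithmetic
weighted <- map2(exams, weights, \(score, w) score * w)

weighted
```

The iteration happens over list elements, while the multiplication inside each call is vectorised over the atomic vector of marks.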
# A tibble: 4 × 3
n min max
<dbl> <dbl> <dbl>
1 3 1 4
2 3 1 8
3 6 1 4
4 6 1 8
[[1]]
[1] 2.610502 2.574514 2.522395
[[2]]
[1] 7.210280 5.271757 5.246017
[[3]]
[1] 1.549327 1.237971 3.628081 2.490355 1.289043 1.467250
[[4]]
[1] 3.533091 6.507228 2.663833 5.503842 1.488231 7.394657
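A table of parameters followed by a list of random draws is the typical shape of a pmap() call. A minimal sketch, assuming the draws came from runif() (the column names n, min and max match its arguments); the exact numbers depend on the random seed and will differ from those printed above.

```r
library(purrr)

# One row per call; column names match the arguments of runif()
params <- data.frame(
  n   = c(3, 3, 6, 6),
  min = c(1, 1, 1, 1),
  max = c(4, 8, 4, 8)
)

set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
draws <- pmap(params, runif)

lengths(draws)
```

pmap() matches the columns to the function arguments by name, so the column order does not matter.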
From R for Data Science
lm outputs complex objects
base::summary()
0.05525) because we mix all individuals.
Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-4.1381 -1.4263 0.0164 1.3841 4.5255
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.88547 0.84388 24.749 < 2e-16 ***
bill_length_mm -0.08502 0.01907 -4.459 1.12e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.922 on 340 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.05525, Adjusted R-squared: 0.05247
F-statistic: 19.88 on 1 and 340 DF, p-value: 1.12e-05
Get the dimensions of each tibble in the list
General syntax: map(list, function)
list: split_peng
function: can be an anonymous function
Curly braces are optional but helpful
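A sketch of the dimension exercise. How split_peng was created is not shown here, so the group_split() call below is an assumption (it yields five tibbles, one per island and species combination); penguins comes from the palmerpenguins package.

```r
library(purrr)
library(dplyr)
library(palmerpenguins)

# Assumed construction: one tibble per island/species combination
split_peng <- group_split(penguins, island, species)

# A named function as second argument
dims <- map(split_peng, dim)

# The same with an anonymous function; the braces are optional
dims_anon <- map(split_peng, \(x) {
  dim(x)
})

dims
```

Each element of dims is a length-2 integer vector: number of rows, then number of columns.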
[[1]]
Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)
Coefficients:
(Intercept) bill_length_mm
9.6263 0.2244
[[2]]
Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)
Coefficients:
(Intercept) bill_length_mm
5.2510 0.2048
[[3]]
Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)
Coefficients:
(Intercept) bill_length_mm
9.2607 0.2335
[[4]]
Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)
Coefficients:
(Intercept) bill_length_mm
7.5691 0.2222
[[5]]
Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = x)
Coefficients:
(Intercept) bill_length_mm
14.1359 0.1102
base::summary() generates a list
lm_all <- summary(lm(bill_depth_mm ~ bill_length_mm,
data = penguins))
str(lm_all, max.level = 1, give.attr = FALSE)
List of 12
$ call : language lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)
$ terms :Classes 'terms', 'formula' language bill_depth_mm ~ bill_length_mm
$ residuals : Named num [1:342] 1.139 -0.127 0.541 1.535 3.056 ...
$ coefficients : num [1:2, 1:4] 20.8855 -0.085 0.8439 0.0191 24.7492 ...
$ aliased : Named logi [1:2] FALSE FALSE
$ sigma : num 1.92
$ df : int [1:3] 2 340 2
$ r.squared : num 0.0552
$ adj.r.squared: num 0.0525
$ fstatistic : Named num [1:3] 19.9 1 340
$ cov.unscaled : num [1:2, 1:2] 1.93e-01 -4.32e-03 -4.32e-03 9.84e-05
$ na.action : 'omit' Named int [1:2] 4 272
purrr
map_dbl()
split_peng |>
map(\(x) lm(bill_depth_mm ~ bill_length_mm,
data = x)) |>
map(summary) |>
map_dbl("r.squared")
[1] 0.21920517 0.41394290 0.25792423 0.42710958 0.06198376
purrr anonymous functions
\() syntax in base R.
~ in place of \() and has default placeholders starting with ".".
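The three ways of writing an anonymous function can be compared side by side. This sketch uses the swiss dataset and an arbitrary computation (mean divided by standard deviation) chosen only for illustration.

```r
library(purrr)

# Classic base R anonymous function
by_function <- map_dbl(swiss, function(x) mean(x) / sd(x))

# Backslash shorthand, base R >= 4.1
by_lambda <- map_dbl(swiss, \(x) mean(x) / sd(x))

# purrr formula shorthand: ~ starts the function, .x is the placeholder
by_formula <- map_dbl(swiss, ~ mean(.x) / sd(.x))

identical(by_function, by_lambda)
identical(by_lambda, by_formula)
```

All three give the same result; the \(x) form has the advantage of working outside purrr as well.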
tibble(numbers = 1:8,
my_list = list(a = c("a", "b"), b = 2.56,
c = c("a", "b", "c"), d = rep(TRUE, 4),
d = 2:3, e = 4:6, f = FALSE, g = c(1, 4, 5, 6)))
# A tibble: 8 × 2
numbers my_list
<int> <named list>
1 1 <chr [2]>
2 2 <dbl [1]>
3 3 <chr [3]>
4 4 <lgl [4]>
5 5 <int [2]>
6 6 <int [3]>
7 7 <lgl [1]>
8 8 <dbl [4]>
tibble by island and species
mutate and map
dplyr, tidyr, tibble, purrr and broom nicely work together
# A tibble: 5 × 6
island species data model summary r_squared
<fct> <fct> <list> <list> <list> <dbl>
1 Torgersen Adelie <tibble [52 × 6]> <lm> <smmry.lm> 0.0620
2 Biscoe Adelie <tibble [44 × 6]> <lm> <smmry.lm> 0.219
3 Dream Adelie <tibble [56 × 6]> <lm> <smmry.lm> 0.258
4 Biscoe Gentoo <tibble [124 × 6]> <lm> <smmry.lm> 0.414
5 Dream Chinstrap <tibble [68 × 6]> <lm> <smmry.lm> 0.427
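A table with these columns can be obtained by nesting and then mapping inside mutate(). The pipeline below is a sketch that produces the same columns, not necessarily the exact code behind the output above; peng_models is a hypothetical name and penguins comes from palmerpenguins.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(palmerpenguins)

peng_models <- penguins |>
  group_by(island, species) |>
  nest() |>      # one row per group, observations stored in `data`
  ungroup() |>
  mutate(
    # list-columns: one lm object and one summary per row
    model     = map(data, \(x) lm(bill_depth_mm ~ bill_length_mm, data = x)),
    summary   = map(model, \(m) summary(m)),
    # extract a single number per row, hence map_dbl()
    r_squared = map_dbl(summary, "r.squared")
  )

peng_models
```

Keeping data, models and summaries in the same row means they cannot get out of sync when the table is filtered or sorted.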
Don’t “overmap” functions!
map() only if required for functions that are not vectorised.
The map family should not be used.
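To illustrate the advice against overmapping: sqrt() is already vectorised, so wrapping it in map_dbl() adds a loop without changing the result. The vector x is arbitrary.

```r
library(purrr)

x <- c(1, 4, 9, 16)

# Vectorised function: call it directly on the whole vector
direct <- sqrt(x)

# Works, but iterates element by element for no benefit
overmapped <- map_dbl(x, sqrt)

identical(direct, overmapped)
```

Functions such as lm() or summary(), which take one whole object at a time, are the ones that need map().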
walk family of functions
walk(), walk2()
map() or walk()
map2() or walk2()
pmap()
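The walk functions are meant for calls made for their side effects (writing files, printing, plotting). A minimal sketch with walk2(), writing two built-in datasets to a temporary folder; the file names are derived from the list names and are illustrative.

```r
library(purrr)

datasets <- list(women = women, swiss = swiss)
paths <- file.path(tempdir(), paste0(names(datasets), ".csv"))

# walk2() pairs each data.frame with its path and calls write.csv()
# for the side effect; it returns its first argument invisibly
returned <- walk2(datasets, paths, \(df, path) write.csv(df, path))

file.exists(paths)
```

Because the input is returned unchanged, a walk() call can sit in the middle of a pipeline.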
You learned to
actions
for loops are fine, but don’t write them
functions as arguments
list-columns
Acknowledgments
Further reading