Case study - gapminder

Guided tutorial

Roland Krause

R Workshop

Thursday, 13 February 2025

Learning objectives

You will learn

Use dplyr / purrr for efficient data manipulation
Tidying multiple linear models using broom
Managing related things together in one tibble
Summarise findings in one ggplot using relevant aesthetics

Guided practical

Interactive session

Use the provided template practical06_gapminder.qmd in your rworkshop-practicals-username repo.

Managing multiple models

Tutorial based on the great conference by Hadley Wickham

List-column cheatsheet, reminder

`list` in a `data.frame/tibble`

`list of tibbles` in a `data.frame/tibble`

Keep all analyses together

Workflow from last time

palmerpenguins::penguins |>
  # Nest only the relevant cols (bill measurements)
  # and ID penguins with biology criteria
  nest(data = starts_with("bill"), 
       .by = c(island, species)) |>
  mutate(model = map(data, \(x){ lm(bill_depth_mm ~ bill_length_mm,
                                   data = x) } ),
         summary = map(model, summary),
         r_squared = map_dbl(summary, "r.squared"))

# A tibble: 5 × 6
  island    species   data               model  summary    r_squared
  <fct>     <fct>     <list>             <list> <list>         <dbl>
1 Torgersen Adelie    <tibble [52 × 2]>  <lm>   <smmry.lm>    0.0620
2 Biscoe    Adelie    <tibble [44 × 2]>  <lm>   <smmry.lm>    0.219 
3 Dream     Adelie    <tibble [56 × 2]>  <lm>   <smmry.lm>    0.258 
4 Biscoe    Gentoo    <tibble [124 × 2]> <lm>   <smmry.lm>    0.414 
5 Dream     Chinstrap <tibble [68 × 2]>  <lm>   <smmry.lm>    0.427

nest(.by = col) is roughly equivalent to:

penguins |>
  group_by(island, species) |>
  summarise(data = list(cur_data())) |>
  rowwise()

rowwise() ensures that operations are performed on each row

Gapminder

Gapminder is a fact tank

Dataset

From an independent Swedish foundation
R package by Jenny Bryan!
Install from CRAN: gapminder

Hans Rosling

Fundamentally optimistic
Great talk

Gapminder colors

Let’s take a look at the Gapminder vignette

Your turn: Explore Gapminder!

Install the gapminder package
Load gapminder and tidyverse packages
Use the pipe |> to pass gapminder to ggplot()
Plot the Life expectancy (lifeExp) on y vs. the year (on x).
Use geom_line()

Caution

Mind the grouping!

10:00

Gapminder countries

# install.packages("gapminder")
library(gapminder)
gapminder |>
  ggplot(aes(x = year, y = lifeExp,
             group = country)) + 
  geom_line()

Gapminder with country colors

gapminder |>
  ggplot(aes(x = year, y = lifeExp, colour = country,
             group = country)) + 
  geom_line() +
  scale_color_manual(values = country_colors) +
  theme(legend.position = "none")

Keep related things together

Nest per country

by_country <- gapminder |>
  mutate(year1950 = year - 1950) |>

  nest(.by = c(continent, country))
by_country

# A tibble: 142 × 3
   continent country     data             
   <fct>     <fct>       <list>           
 1 Asia      Afghanistan <tibble [12 × 5]>
 2 Europe    Albania     <tibble [12 × 5]>
 3 Africa    Algeria     <tibble [12 × 5]>
 4 Africa    Angola      <tibble [12 × 5]>
 5 Americas  Argentina   <tibble [12 × 5]>
 6 Oceania   Australia   <tibble [12 × 5]>
 7 Europe    Austria     <tibble [12 × 5]>
 8 Asia      Bahrain     <tibble [12 × 5]>
 9 Asia      Bangladesh  <tibble [12 × 5]>
10 Europe    Belgium     <tibble [12 × 5]>
# ℹ 132 more rows

Helpers

By default the list column is named data
year1950 will help to get meaningful intercepts
Add continent to keep it along with country

One country example: Germany

From original tibble

gapminder |>
  filter(country == "Germany") |>
  select(!c(country, continent))

# A tibble: 12 × 4
    year lifeExp      pop gdpPercap
   <int>   <dbl>    <int>     <dbl>
 1  1952    67.5 69145952     7144.
 2  1957    69.1 71019069    10188.
 3  1962    70.3 73739117    12902.
 4  1967    70.8 76368453    14746.
 5  1972    71   78717088    18016.
 6  1977    72.5 78160773    20513.
 7  1982    73.8 78335266    22032.
 8  1987    74.8 77718298    24639.
 9  1992    76.1 80597764    26505.
10  1997    77.3 82011073    27789.
11  2002    78.7 82350671    30036.
12  2007    79.4 82400996    32170.

Nested tibble

by_country |>
  filter(country == "Germany")

# A tibble: 1 × 3
  continent country data             
  <fct>     <fct>   <list>           
1 Europe    Germany <tibble [12 × 5]>

by_country |>
  filter(country == "Germany") |>
  unnest(data)

# A tibble: 12 × 7
   continent country  year lifeExp      pop
   <fct>     <fct>   <int>   <dbl>    <int>
 1 Europe    Germany  1952    67.5 69145952
 2 Europe    Germany  1957    69.1 71019069
 3 Europe    Germany  1962    70.3 73739117
 4 Europe    Germany  1967    70.8 76368453
 5 Europe    Germany  1972    71   78717088
 6 Europe    Germany  1977    72.5 78160773
 7 Europe    Germany  1982    73.8 78335266
 8 Europe    Germany  1987    74.8 77718298
 9 Europe    Germany  1992    76.1 80597764
10 Europe    Germany  1997    77.3 82011073
11 Europe    Germany  2002    78.7 82350671
12 Europe    Germany  2007    79.4 82400996
# ℹ 2 more variables: gdpPercap <dbl>,
#   year1950 <dbl>

Your turn: Add linear models

Using by_country
Add a new column model with linear regressions of lifeExp on year1950
Save as by_country_lm

10:00

Linear models

Linear model per country

by_country_lm <- by_country |>
  mutate(model = map(data, \(x) lm(lifeExp ~ year1950, 
                                   data = x)))
by_country_lm

# A tibble: 142 × 4
   continent country     data              model 
   <fct>     <fct>       <list>            <list>
 1 Asia      Afghanistan <tibble [12 × 5]> <lm>  
 2 Europe    Albania     <tibble [12 × 5]> <lm>  
 3 Africa    Algeria     <tibble [12 × 5]> <lm>  
 4 Africa    Angola      <tibble [12 × 5]> <lm>  
 5 Americas  Argentina   <tibble [12 × 5]> <lm>  
 6 Oceania   Australia   <tibble [12 × 5]> <lm>  
 7 Europe    Austria     <tibble [12 × 5]> <lm>  
 8 Asia      Bahrain     <tibble [12 × 5]> <lm>  
 9 Asia      Bangladesh  <tibble [12 × 5]> <lm>  
10 Europe    Belgium     <tibble [12 × 5]> <lm>  
# ℹ 132 more rows

# A tibble: 142 × 4
   continent country     data              model 
   <fct>     <fct>       <list>            <list>
 1 Asia      Afghanistan <tibble [12 × 5]> <lm>  
 2 Europe    Albania     <tibble [12 × 5]> <lm>  
 3 Africa    Algeria     <tibble [12 × 5]> <lm>  
 4 Africa    Angola      <tibble [12 × 5]> <lm>  
 5 Americas  Argentina   <tibble [12 × 5]> <lm>  
 6 Oceania   Australia   <tibble [12 × 5]> <lm>  
 7 Europe    Austria     <tibble [12 × 5]> <lm>  
 8 Asia      Bahrain     <tibble [12 × 5]> <lm>  
 9 Asia      Bangladesh  <tibble [12 × 5]> <lm>  
10 Europe    Belgium     <tibble [12 × 5]> <lm>  
# ℹ 132 more rows

Explore a list column, counting

10:00

Count the number of rows per country in the data column.
Does any country have less data than others?
Try writing a single statement that solves this questions!

Note

A list column is a list, you need to iterate through elements
But remember the tibble is still rowwise grouped (good).
distinct() will help to find unique values
But remember the tibble is still rowwise grouped (bad).

Explore a list column

by_country |>
  # so n is displayed next to grouped columns
  mutate(n = map_int(data, nrow), .after = continent)

# A tibble: 142 × 4
   continent     n country     data             
   <fct>     <int> <fct>       <list>           
 1 Asia         12 Afghanistan <tibble [12 × 5]>
 2 Europe       12 Albania     <tibble [12 × 5]>
 3 Africa       12 Algeria     <tibble [12 × 5]>
 4 Africa       12 Angola      <tibble [12 × 5]>
 5 Americas     12 Argentina   <tibble [12 × 5]>
 6 Oceania      12 Australia   <tibble [12 × 5]>
 7 Europe       12 Austria     <tibble [12 × 5]>
 8 Asia         12 Bahrain     <tibble [12 × 5]>
 9 Asia         12 Bangladesh  <tibble [12 × 5]>
10 Europe       12 Belgium     <tibble [12 × 5]>
# ℹ 132 more rows

by_country |>
  mutate(n = map_int(data, nrow), .after = continent) |>
  distinct(n)

# A tibble: 1 × 1
      n
  <int>
1    12

Explore a list column, plotting

05:00

Plot lifeExp ~ year1950 for Bulgaria by unnesting data

Note

filter() for the desired country
unnest() raw data
Pipe to ggplot()

Explore a list column, plotting

by_country |>
  filter(country == "Bulgaria") |>
  unnest(data) |>
  ggplot(aes(x = year1950, y = lifeExp)) +
  geom_line()

Explore nested tibble with `lm`

10:00

Display the \(R^2\) for the linear model of Rwanda
How do you interpret the \(R^2\) for this particular model?

Warning

filter() for the desired country
Use map(model, summary) to run summary() on the linear model
To extract the named "r.squared", use the map_dbl(summary, "r.squared") (purrr syntax)

Linear model for Rwanda

by_country_lm |>
  filter(country == "Rwanda") |>
  mutate(summary = map(model, summary),
         r2 = map_dbl(summary, "r.squared")) |> 
  select(country, r2)

# A tibble: 1 × 2
  country     r2
  <fct>    <dbl>
1 Rwanda  0.0172

\(R^2\) is close to 0, linearity sounds broken
broom will cleanup linear model elements into tibbles

Cleanup dirty objects!

Extract nice tibbles from complex objects

Tidying models

15:00

Install broom from CRAN
Using by_country_lm, add 4 new columns:
- glance, using the broom function on the model column
- tidy, using the broom function on the model column
- augment, using the broom function on the model column
- rsq from the glance column
Save as models
Why extracting the \(R^2\) in the main tibble is useful?

Remember

Use map() when dealing with a list column like glance = map(model, glance)

Tidying models

Useful info

Coefficients estimates:
- slope
- intercept
\(R^2\)
Residuals

library(broom)
models <- by_country_lm |>
  mutate(glance  = map(model, glance),
         tidy    = map(model, tidy),
         augment = map(model, augment),
         rsq     = map_dbl(glance, "r.squared"))

Extracting \(R^2\) in main tibble

Why? no need to unnest for sorting / filtering

Plotting \(R^2\) for countries

10:00

Plot country ~ rsq
Color points per continent
Reorder country levels by \(R^2\) (rsq): snake plot
Which continent shows most of the low \(R^2\) values?

Note

To reorder the discrete values of country:

Use the forcats package
fct_reorder(country, rsq) to reorder based on the rsq continuous variable

Do linear models fit all countries?

Snake plot

library(forcats)
models |>
  ggplot(aes(x = rsq, 
             y = fct_reorder(country,
                             rsq))) +
  geom_point(aes(colour = continent), 
             alpha = 0.8, size = 3) +
  theme_classic(18) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = c(0.25, 0.75)) +
   guides(color = guide_legend(
     override.aes = list(alpha = 1))) +
  labs(x = expression(R^2),
       y = "Country",
       color = "Contient") +
  scale_color_manual(values = continent_colors)

Display the raw life expectancy for countries with a low \(R^2\)

10:00

Focus on non-linear trends from models
Filter the 20 countries with the lowest \(R^2\)
unnest column data
Plot lifeExp ~ year with lines
Colour per continent
Facet per country
Same questions for the top 20 \(R^2\)

Note

You must ungroup() as we work currently by row
slice_min(col, n = 5) returns the 5 minimal values of col

Focus on non-linear trends

models |>
  ungroup() |>
  slice_min(rsq, n = 20) |> 
  unnest(data) |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(colour = country)) +
  facet_wrap(vars(country), nrow = 4) +
  theme(legend.position = "none") +
  labs(x = NULL,
       y = NULL,
       title  = "Life Expectancy development by country") +
  guides(x = guide_axis(n.dodge = 2)) +
  scale_color_manual(values = country_colors)

Focus on best linear trends

models |>
  ungroup() |>
  slice_max(rsq, n = 20) |>
  unnest(data) |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(colour = country)) +
  facet_wrap(vars(country), nrow = 4) +
  theme(legend.position = "none") +
  labs(x = NULL,
       y = NULL,
       title  = "Life Expectancy development by country") +
  guides(x = guide_axis(n.dodge = 2)) +
  scale_color_manual(values = country_colors)

Interpreting the linear model

Regression

What represents the intercept?
- Using year1950?
- Using year?
- Justify Hadley choice
What represents the slope?

Coefficients with predictor year - 1950

filter(models, country == "Germany") |>
  unnest(tidy) |>
  select(continent, country, estimate)

# A tibble: 2 × 3
  continent country estimate
  <fct>     <fct>      <dbl>
1 Europe    Germany   67.1  
2 Europe    Germany    0.214

Compare with model and original years

gapminder |>
  filter(country == "Germany") |>
  # Native placeholder: '_'
  lm(lifeExp ~ year, data = _) |> 
  tidy() |>
  select(term, estimate)

# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept) -350.   
2 year           0.214

Summarise on one plot

Unnest coefficients (tidy column)
- Mind to keep the continent, country and rsq columns
Put intercept and slope in their own columns
- In wide format, only one value can be used.
- Discard unused columns.
Plot slope ~ intercept (watch out the (Intercept) name which needs to be called between backticks ‘`’)
Colour per continent
Size per \(R^2\) (use for scale_size_area() for readability)
Add tendency with geom_smooth(method = "loess")

20:00

The final plot in Gapminder colors

gapminder |>
  mutate(year1950 = year - 1950) |>
  nest_by(continent, country) |>
  mutate(model = list(lm(lifeExp ~ year1950,
                         data = data)),
         glance  = list(glance(model)),
         tidy    = list(tidy(model)),
         rsq     = pluck(glance, "r.squared")) |> 
  unnest(tidy) |>
  select(continent, country, rsq, term, estimate) |>
  pivot_wider(names_from = term,
              values_from = estimate) |>
  ggplot(aes(x = `(Intercept)`, y = year1950)) +
  geom_point(aes(colour = continent,
                 size = rsq)) +
  geom_smooth(se = FALSE,
              method = "loess",
              formula = "y ~ x") +
  scale_size_area() +
  labs(title = "Development of Life Expectancy",
    x = "Life expectancy (1950)",
       y = "Yearly improvement",
       colour = "Continent") +
  scale_colour_manual(values = continent_colors)

Animation with `gganimate`

library(gganimate)
gapminder |>
  ggplot(aes(x = gdpPercap,
             y = lifeExp,
             size = pop, 
             color = continent)) +
  transition_time(year) +
  ease_aes("linear") +
  scale_size(range = c(2, 12)) +
  geom_point() +
  theme_bw(16) +
  labs(title = "Year: {frame_time}", 
       x = "GDP per capita", 
       y = "Life expectancy") +
  scale_x_log10() -> p
animate(p)
anim_save("gapminder.gif")

Takes ~ 2 minutes due to easing

Before we stop

You learned to

Keep related things together:
- Input data
- Meaningful grouping ids
- Perform modelling
- Extract relevant model components
- Explore visually your findings

Acknowledgments

Hadley Wickham
Jennifer Bryan
David Robinson
Thomas Pedersen

Contributions

Eric Koncina
Aurélien Ginolhac

Case study - gapminder

Learning objectives

Managing multiple models

Tutorial based on the great conference by Hadley Wickham

List-column cheatsheet, reminder

list in a data.frame/tibble

list of tibbles in a data.frame/tibble

Keep all analyses together

Workflow from last time

Gapminder

Gapminder is a fact tank

Dataset

Hans Rosling

Gapminder colors

Your turn: Explore Gapminder!

Gapminder countries

Gapminder with country colors

Keep related things together using list-column

Keep related things together

Nest per country

Helpers

One country example: Germany

From original tibble

Nested tibble

Your turn: Add linear models

Linear models

Linear model per country

Explore a list column, counting

Explore a list column

Explore a list column, plotting

Explore a list column, plotting

Explore nested tibble with lm

Linear model for Rwanda

Cleanup dirty objects!

Extract nice tibbles from complex objects

Tidying models

Tidying models

Useful info

Extracting \(R^2\) in main tibble

Plotting \(R^2\) for countries

Do linear models fit all countries?

Snake plot

Display the raw life expectancy for countries with a low \(R^2\)

Focus on non-linear trends

Focus on best linear trends

Interpreting the linear model

Summarise on one plot

The final plot in Gapminder colors

Animation with gganimate

Before we stop

Contributions

Thank you for your attention!

`list` in a `data.frame/tibble`

`list of tibbles` in a `data.frame/tibble`

Explore nested tibble with `lm`

Animation with `gganimate`