Case study - gapminder

Guided tutorial

Roland Krause

Rworkshop

Friday, 9 February 2024

Learning objectives

You will learn

  • Use dplyr / purrr for efficient data manipulation
  • Tidying multiple linear models using broom
  • Managing related things together in one tibble
  • Summarise findings in one ggplot using relevant aesthetics

Guided practical

Interactive session

  • Use the provided template practical06_gapminder.qmd in your rworkshop-practicals-username repo.

Managing multiple models

Tutorial based on the great conference by Hadley Wickham

List-column cheatsheet, reminder

list in a data.frame/tibble

list of tibbles in a data.frame/tibble

Keep all analyses together

Workflow from last time

palmerpenguins::penguins |>
  # Nest only the relevant cols (bill measurements)
  # and ID penguins with biology criteria
  nest(data = starts_with("bill"), 
       .by = c(island, species)) |>
  mutate(model = map(data, \(x){ lm(bill_depth_mm ~ bill_length_mm,
                                   data = x) } ),
         summary = map(model, summary),
         r_squared = map_dbl(summary, "r.squared"))
# A tibble: 5 × 6
  island    species   data               model  summary    r_squared
  <fct>     <fct>     <list>             <list> <list>         <dbl>
1 Torgersen Adelie    <tibble [52 × 2]>  <lm>   <smmry.lm>    0.0620
2 Biscoe    Adelie    <tibble [44 × 2]>  <lm>   <smmry.lm>    0.219 
3 Dream     Adelie    <tibble [56 × 2]>  <lm>   <smmry.lm>    0.258 
4 Biscoe    Gentoo    <tibble [124 × 2]> <lm>   <smmry.lm>    0.414 
5 Dream     Chinstrap <tibble [68 × 2]>  <lm>   <smmry.lm>    0.427 

nest(.by = col) is roughly equivalent to:

penguins |>
  group_by(island, species) |>
  summarise(data = list(cur_data())) |>
  rowwise() 

rowwise() ensures that operations are performed on each row

Gapminder

Gapminder is a fact tank

Dataset

Hans Rosling

  • Fundamentally optimistic
  • Great talk

Gapminder colors

Let’s take a look at the Gapminder vignette

Gapminder palette

Your turn: Explore Gapminder!

  • Install the gapminder package
  • Load gapminder and tidyverse packages
  • Use the pipe |> to pass gapminder to ggplot()
  • Plot the Life expectancy (lifeExp) on y vs. the year (on x).
  • Use geom_line()

Caution

Mind the grouping!

10:00

Gapminder countries

# install.packages("gapminder")
library(gapminder)
gapminder |>
  ggplot(aes(x = year, y = lifeExp,
             group = country)) + 
  geom_line()

Gapminder with country colors

gapminder |>
  ggplot(aes(x = year, y = lifeExp, colour = country,
             group = country)) + 
  geom_line() +
  scale_color_manual(values = country_colors) +
  theme(legend.position = "none")

One country example: Germany

From original tibble

gapminder |>
  filter(country == "Germany") |>
  select(-country, -continent)
# A tibble: 12 × 4
    year lifeExp      pop gdpPercap
   <int>   <dbl>    <int>     <dbl>
 1  1952    67.5 69145952     7144.
 2  1957    69.1 71019069    10188.
 3  1962    70.3 73739117    12902.
 4  1967    70.8 76368453    14746.
 5  1972    71   78717088    18016.
 6  1977    72.5 78160773    20513.
 7  1982    73.8 78335266    22032.
 8  1987    74.8 77718298    24639.
 9  1992    76.1 80597764    26505.
10  1997    77.3 82011073    27789.
11  2002    78.7 82350671    30036.
12  2007    79.4 82400996    32170.

Nested tibble

by_country |>
  filter(country == "Germany")
# A tibble: 1 × 3
  continent country data             
  <fct>     <fct>   <list>           
1 Europe    Germany <tibble [12 × 5]>
by_country |>
  filter(country == "Germany") |>
  unnest(data)
# A tibble: 12 × 7
   continent country  year lifeExp      pop gdpPercap year1950
   <fct>     <fct>   <int>   <dbl>    <int>     <dbl>    <dbl>
 1 Europe    Germany  1952    67.5 69145952     7144.        2
 2 Europe    Germany  1957    69.1 71019069    10188.        7
 3 Europe    Germany  1962    70.3 73739117    12902.       12
 4 Europe    Germany  1967    70.8 76368453    14746.       17
 5 Europe    Germany  1972    71   78717088    18016.       22
 6 Europe    Germany  1977    72.5 78160773    20513.       27
 7 Europe    Germany  1982    73.8 78335266    22032.       32
 8 Europe    Germany  1987    74.8 77718298    24639.       37
 9 Europe    Germany  1992    76.1 80597764    26505.       42
10 Europe    Germany  1997    77.3 82011073    27789.       47
11 Europe    Germany  2002    78.7 82350671    30036.       52
12 Europe    Germany  2007    79.4 82400996    32170.       57

Your turn: Add linear models

  • Using by_country
  • Add a new column model with linear regressions of lifeExp on year1950
  • Save as by_country_lm
10:00

Linear models

Linear model per country

by_country_lm <- by_country |>
  mutate(model = map(data, \(x) lm(lifeExp ~ year1950, 
                                   data = x)))
by_country_lm
# A tibble: 142 × 4
   continent country     data              model 
   <fct>     <fct>       <list>            <list>
 1 Asia      Afghanistan <tibble [12 × 5]> <lm>  
 2 Europe    Albania     <tibble [12 × 5]> <lm>  
 3 Africa    Algeria     <tibble [12 × 5]> <lm>  
 4 Africa    Angola      <tibble [12 × 5]> <lm>  
 5 Americas  Argentina   <tibble [12 × 5]> <lm>  
 6 Oceania   Australia   <tibble [12 × 5]> <lm>  
 7 Europe    Austria     <tibble [12 × 5]> <lm>  
 8 Asia      Bahrain     <tibble [12 × 5]> <lm>  
 9 Asia      Bangladesh  <tibble [12 × 5]> <lm>  
10 Europe    Belgium     <tibble [12 × 5]> <lm>  
# ℹ 132 more rows
# A tibble: 142 × 4
   continent country     data              model 
   <fct>     <fct>       <list>            <list>
 1 Asia      Afghanistan <tibble [12 × 5]> <lm>  
 2 Europe    Albania     <tibble [12 × 5]> <lm>  
 3 Africa    Algeria     <tibble [12 × 5]> <lm>  
 4 Africa    Angola      <tibble [12 × 5]> <lm>  
 5 Americas  Argentina   <tibble [12 × 5]> <lm>  
 6 Oceania   Australia   <tibble [12 × 5]> <lm>  
 7 Europe    Austria     <tibble [12 × 5]> <lm>  
 8 Asia      Bahrain     <tibble [12 × 5]> <lm>  
 9 Asia      Bangladesh  <tibble [12 × 5]> <lm>  
10 Europe    Belgium     <tibble [12 × 5]> <lm>  
# ℹ 132 more rows

Explore a list column, counting

10:00
  • Count the number of rows per country in the data column.
  • Does any country have less data than others?
  • Try writing a single statement that solves this questions!

Note

  • A list column is a list, you need to iterate through elements
  • But remember the tibble is still rowwise grouped (good).
  • distinct() will help to find unique values
  • But remember the tibble is still rowwise grouped (bad).

Explore a list column

by_country |>
  # so n is displayed next to grouped columns
  mutate(n = map_int(data, nrow), .after = continent)
# A tibble: 142 × 4
   continent     n country     data             
   <fct>     <int> <fct>       <list>           
 1 Asia         12 Afghanistan <tibble [12 × 5]>
 2 Europe       12 Albania     <tibble [12 × 5]>
 3 Africa       12 Algeria     <tibble [12 × 5]>
 4 Africa       12 Angola      <tibble [12 × 5]>
 5 Americas     12 Argentina   <tibble [12 × 5]>
 6 Oceania      12 Australia   <tibble [12 × 5]>
 7 Europe       12 Austria     <tibble [12 × 5]>
 8 Asia         12 Bahrain     <tibble [12 × 5]>
 9 Asia         12 Bangladesh  <tibble [12 × 5]>
10 Europe       12 Belgium     <tibble [12 × 5]>
# ℹ 132 more rows
by_country |>
  mutate(n = map_int(data, nrow), .after = continent) |>
  distinct(n)
# A tibble: 1 × 1
      n
  <int>
1    12

Explore a list column, plotting

05:00
  • Plot lifeExp ~ year1950 for Bulgaria by unnesting data

Note

  • filter() for the desired country
  • unnest() raw data
  • Pipe to ggplot()

Explore a list column, plotting

by_country |>
  filter(country == "Bulgaria") |>
  unnest(data) |>
  ggplot(aes(x = year1950, y = lifeExp)) +
  geom_line() 

Explore nested tibble with lm

10:00
  • Display the \(R^2\) for the linear model of Rwanda
  • How do you interpret the \(R^2\) for this particular model?

Warning

  • filter() for the desired country
  • Use map(model, summary) to run summary() on the linear model
  • To extract the named "r.squared", use the map_dbl(summary, "r.squared") (purrr syntax)

Linear model for Rwanda

by_country_lm |>
  filter(country == "Rwanda") |>
  mutate(summary = map(model, summary),
         r2 = map_dbl(summary, "r.squared")) |> 
  select(country, r2)
# A tibble: 1 × 2
  country     r2
  <fct>    <dbl>
1 Rwanda  0.0172
  • \(R^2\) is close to 0, linearity sounds broken
  • broom will cleanup linear model elements into tibbles

Cleanup dirty objects!

Extract nice tibbles from complex objects

Tidying models

15:00
  • Install broom from CRAN
  • Using by_country_lm, add 4 new columns:
    • glance, using the broom function on the model column
    • tidy, using the broom function on the model column
    • augment, using the broom function on the model column
    • rsq from the glance column
  • Save as models
  • Why extracting the \(R^2\) in the main tibble is useful?

Remember

Use map() when dealing with a list column like glance = map(model, glance)

Tidying models

Useful info

  • Coefficients estimates:
    • slope
    • intercept
  • \(R^2\)
  • Residuals
library(broom)
models <- by_country_lm |>
  mutate(glance  = map(model, glance),
         tidy    = map(model, tidy),
         augment = map(model, augment),
         rsq     = map_dbl(glance, "r.squared"))

Extracting \(R^2\) in main tibble

Why? no need to unnest for sorting / filtering

Plotting \(R^2\) for countries

10:00
  • Plot country ~ rsq
  • Color points per continent
  • Reorder country levels by \(R^2\) (rsq): snake plot
  • Which continent shows most of the low \(R^2\) values?

Note

To reorder the discrete values of country:

  • Use the forcats package
  • fct_reorder(country, rsq) to reorder based on the rsq continuous variable

Do linear models fit all countries?

Snake plot

library(forcats)
models |>
  ggplot(aes(x = rsq, 
             y = fct_reorder(country,
                             rsq))) +
  geom_point(aes(colour = continent), 
             alpha = 0.8, size = 3) +
  theme_classic(18) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = c(0.25, 0.75)) +
   guides(color = guide_legend(
     override.aes = list(alpha = 1))) +
  labs(x = expression(R^2),
       y = "Country",
       color = "Contient") +
  scale_color_manual(values = continent_colors) 

Display the raw life expectancy for countries with a low \(R^2\)

10:00
  • Focus on non-linear trends from models
  • Filter the 20 countries with the lowest \(R^2\)
  • unnest column data
  • Plot lifeExp ~ year with lines
  • Colour per continent
  • Facet per country
  • Same questions for the top 20 \(R^2\)

Note

  • You must ungroup() as we work currently by row
  • slice_min(col, n = 5) returns the 5 minimal values of col

Interpreting the linear model

Regression

  • What represents the intercept?
    • Using year1950?
    • Using year?
    • Justify Hadley choice
  • What represents the slope?

  • Coefficients with predictor year - 1950
filter(models, country == "Germany") |>
  unnest(tidy) |>
  select(continent, country, estimate)
# A tibble: 2 × 3
  continent country estimate
  <fct>     <fct>      <dbl>
1 Europe    Germany   67.1  
2 Europe    Germany    0.214
  • Compare with model and original years
gapminder |>
  filter(country == "Germany") |>
  # Native placeholder: '_'
  lm(lifeExp ~ year, data = _) |> 
  tidy() |>
  select(term, estimate)
# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept) -350.   
2 year           0.214

Summarise on one plot

  • Unnest coefficients (tidy column)
    • Mind to keep the continent, country and rsq columns
  • Put intercept and slope in their own columns
    • In wide format, only one value can be used.
    • Discard unused columns.
  • Plot slope ~ intercept (watch out the (Intercept) name which needs to be called between backticks ‘`’)
  • Colour per continent
  • Size per \(R^2\) (use for scale_size_area() for readability)
  • Add tendency with geom_smooth(method = "loess")
20:00

The final plot in Gapminder colors

gapminder |>
  mutate(year1950 = year - 1950) |>
  nest_by(continent, country) |>
  mutate(model = list(lm(lifeExp ~ year1950,
                         data = data)),
         glance  = list(glance(model)),
         tidy    = list(tidy(model)),
         rsq     = pluck(glance, "r.squared")) |> 
  unnest(tidy) |>
  select(continent, country, rsq, term, estimate) |>
  pivot_wider(names_from = term,
              values_from = estimate) |>
  ggplot(aes(x = `(Intercept)`, y = year1950)) +
  geom_point(aes(colour = continent,
                 size = rsq)) +
  geom_smooth(se = FALSE,
              method = "loess",
              formula = "y ~ x") +
  scale_size_area() +
  labs(title = "Development of Life Expectancy",
    x = "Life expectancy (1950)",
       y = "Yearly improvement",
       colour = "Continent") +
  scale_colour_manual(values = continent_colors) 

Animation with gganimate

library(gganimate)
gapminder |>
  ggplot(aes(x = gdpPercap,
             y = lifeExp,
             size = pop, 
             color = continent)) +
  transition_time(year) +
  ease_aes("linear") +
  scale_size(range = c(2, 12)) +
  geom_point() +
  theme_bw(16) +
  labs(title = "Year: {frame_time}", 
       x = "GDP per capita", 
       y = "Life expectancy") +
  scale_x_log10() -> p
animate(p)
anim_save("gapminder.gif")

Takes ~ 2 minutes due to easing

Before we stop

You learned to

  • Keep related things together:
    • Input data
    • Meaningful grouping ids
    • Perform modelling
    • Extract relevant model components
    • Explore visually your findings

Acknowledgments

  • Hadley Wickham
  • Jennifer Bryan
  • David Robinson
  • Thomas Pedersen

Contributions

  • Eric Koncina
  • Aurélien Ginolhac

Thank you for your attention!