Gapminder tutorial

Authors

Affiliation

Published

February 13, 2025

Guided practical

Getting started

Load gapminder and tidyverse packages by running the setup chunk
Use the pipe |> to pass gapminder to ggplot()
Plot the life expectency (lifeExp in y) ~ year (x)
Use geom_line()

Mind the grouping!

Solution

# install.packages("gapminder")
library(gapminder)

gapminder |>
  ggplot(aes(x = year, 
             y = lifeExp,
             group = country)) + 
  geom_line()

Keep related things together using list-column

Add a column using mutate() named year1950 which is:
year - 1950
Nest with nest_by() the tibble by country and continent
How many rows will you get?
Save the object as by_country

Solution

by_country <- gapminder |>
  mutate(year1950 = year - 1950) |>
  nest(.by = c(continent, country))

Add linear models

Using by_country
Add a new column model with linear regressions of lifeExp on year1950
Save as by_country_lm

Solution

by_country_lm <- by_country |>
  mutate(model = map(data, \(x) lm(lifeExp ~ year1950, data = x)))

Explore a list column

Count the number of rows per country in the data column.
Does any country have less data than others?
Try writing a single statement that solves this questions!

Solution

by_country |>
  mutate(n = map_int(data, nrow), .after = continent) |>
  distinct(n) |> 
  nrow() != 1

[1] FALSE

# Only one row, all countries have the same number of rows

Explore a list column by plotting

Plot lifeExp ~ year1950 for Bulgaria by unnesting data
filter() for the desired country
unnest() raw data
Pipe to ggplot()

Solution

by_country_lm |>
  filter(country == "Rwanda") |>
  mutate(summary = map(model, summary),
         r2 = map_dbl(summary, "r.squared")) |> 
  select(country, r2)

# A tibble: 1 × 2
  country     r2
  <fct>    <dbl>
1 Rwanda  0.0172

Display the `summary` for the linear model of Rwanda

How do you interpret the \(R^2\) for this particular model?
filter() for the desired country
Use list() to run summary() on the linear model
To extract the named "r.squared", use the pluck(sumary, "r.squared"), purrr syntax

Solution

by_country |>
  filter(country == "Bulgaria") |>
  unnest(data) |>
  ggplot(aes(x = year1950, y = lifeExp)) +
  geom_line()

Cleanup using `broom`

Check that broom is loaded
Using by_country_lm, add 4 new columns:
- glance, using the broom function on the model column
- tidy, using the broom function on the model column
- augment, using the broom function on the model column
- rsq from the glance column
Save as models
Why is extracting the \(R^2\) in the main tibble is useful?
Use list() when dealing with a list column rowwise grouped

Solution

library(broom)
models <- by_country_lm |>
  mutate(glance  = map(model, glance),
         tidy    = map(model, tidy),
         augment = map(model, augment),
         rsq     = map_dbl(glance, "r.squared")) # lists

Plotting \(R^2\) for countries

Plot country ~ rsq
Color points per continent
Reorder country levels by \(R^2\) (rsq): snake plot
Which continent shows most of the low \(R^2\) values?
To reorder the discrete values of country:

Hints

Use the forcats package
fct_reorder(country, rsq) to reorder based on the rsq continuous variable

Solution

library(forcats)
models |>
  ggplot(aes(x = rsq, 
             y = fct_reorder(country,
                             rsq))) +
  geom_point(aes(colour = continent), 
             alpha = 0.8, size = 3) +
  theme_classic(18) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = c(0.25, 0.75)) +
   guides(color = guide_legend(
     override.aes = list(alpha = 1))) +
  labs(x = expression(R^2),
       y = "Country",
       color = "Contient") +
  scale_color_manual(values = continent_colors)

Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.

Display the real data for countries with a low \(R^2\)

Focus on non-linear trends
Filter the 20 countries with the lowest \(R^2\)
unnest column data
Plot lifeExp ~ year with lines
Colour per continent
Facet per country

Hints

You must ungroup() as we work currently by row.
slice_min(col, n = 5) returns the 5 minimal values of col

Solution

models |>
  ungroup() |>
  slice_min(rsq, n = 20) |> 
  unnest(data) |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(colour = country)) +
  facet_wrap(vars(country), nrow = 4) +
  theme(legend.position = "none") +
  labs(x = NULL,
       y = NULL,
       title  = "Life Expectancy development by country") +
  guides(x = guide_axis(n.dodge = 2)) +
  scale_color_manual(values = country_colors)

Same questions for the top 20 \(R^2\)

Solution

models |>
  ungroup() |>
  slice_max(rsq, n = 20) |> 
  unnest(data) |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(colour = country)) +
  facet_wrap(vars(country), nrow = 4) +
  theme(legend.position = "none") +
  labs(x = NULL,
       y = NULL,
       title  = "Life Expectancy development by country") +
  guides(x = guide_axis(n.dodge = 2)) +
  scale_color_manual(values = country_colors)

Summarise on one plot

Unnest coefficients (tidy column)
- Mind to keep the continent, country and rsq columns
Put intercept and slope in their own columns
- In wide format, only one value can be used.
- Discard unused columns.
Plot slope ~ intercept (watch out the (Intercept) name which needs to be called between backticks!
Colour per continent
Size per \(R^2\) (use for scale_size_area() for readability)
Add tendency with geom_smooth(method = "loess")

Solution

gapminder |>
  mutate(year1950 = year - 1950) |>
  nest_by(continent, country) |>
  mutate(model = list(lm(lifeExp ~ year1950,
                         data = data)),
         glance  = list(glance(model)),
         tidy    = list(tidy(model)),
         rsq     = pluck(glance, "r.squared")) |> 
  unnest(tidy) |>
  select(continent, country, rsq, term, estimate) |>
  pivot_wider(names_from = term,
              values_from = estimate) |>
  ggplot(aes(x = `(Intercept)`, y = year1950)) +
  geom_point(aes(colour = continent,
                 size = rsq)) +
  geom_smooth(se = FALSE,
              method = "loess",
              formula = "y ~ x") +
  scale_size_area() +
  labs(title = "Development of Life Expectancy",
    x = "Life expectancy (1950)",
       y = "Yearly improvement",
       colour = "Continent") +
  scale_colour_manual(values = continent_colors)

Guided practical

Getting started

Keep related things together using list-column

Add linear models

Explore a list column

Explore a list column by plotting

Display the summary for the linear model of Rwanda

Cleanup using broom

Plotting \(R^2\) for countries

Display the real data for countries with a low \(R^2\)

Same questions for the top 20 \(R^2\)

Summarise on one plot

Display the `summary` for the linear model of Rwanda

Cleanup using `broom`