Gapminder tutorial

Published

February 8, 2024

Guided practical

Getting started

  • Load gapminder and tidyverse packages by running the setup chunk
  • Use the pipe |> to pass gapminder to ggplot()
  • Plot the life expectency (lifeExp in y) ~ year (x)
  • Use geom_line()

Mind the grouping!

Solution
# install.packages("gapminder")
library(gapminder)

gapminder |>
  ggplot(aes(x = year, 
             y = lifeExp,
             group = country)) + 
  geom_line()

Add linear models

  • Using by_country
  • Add a new column model with linear regressions of lifeExp on year1950
  • Save as by_country_lm
by_country_lm <- by_country |>
  mutate(model = map(data, \(x) lm(lifeExp ~ year1950, data = x)))

Explore a list column

  • Count the number of rows per country in the data column.
  • Does any country have less data than others?
  • Try writing a single statement that solves this questions!
by_country |>
  mutate(n = map_int(data, nrow), .after = continent) |>
  distinct(n) |> 
  nrow() != 1
[1] FALSE
# Only one row, all countries have the same number of rows

Explore a list column by plotting

  • Plot lifeExp ~ year1950 for Bulgaria by unnesting data
  • filter() for the desired country
  • unnest() raw data
  • Pipe to ggplot()
Solution
by_country_lm |>
  filter(country == "Rwanda") |>
  mutate(summary = map(model, summary),
         r2 = map_dbl(summary, "r.squared")) |> 
  select(country, r2)
# A tibble: 1 × 2
  country     r2
  <fct>    <dbl>
1 Rwanda  0.0172

Display the summary for the linear model of Rwanda

  • How do you interpret the \(R^2\) for this particular model?
  • filter() for the desired country
  • Use list() to run summary() on the linear model
  • To extract the named "r.squared", use the pluck(sumary, "r.squared"), purrr syntax
Solution
by_country |>
  filter(country == "Bulgaria") |>
  unnest(data) |>
  ggplot(aes(x = year1950, y = lifeExp)) +
  geom_line() 

Cleanup using broom

  • Check that broom is loaded
  • Using by_country_lm, add 4 new columns:
    • glance, using the broom function on the model column
    • tidy, using the broom function on the model column
    • augment, using the broom function on the model column
    • rsq from the glance column
  • Save as models
  • Why is extracting the \(R^2\) in the main tibble is useful?
  • Use list() when dealing with a list column rowwise grouped
Solution
library(broom)
models <- by_country_lm |>
  mutate(glance  = map(model, glance),
         tidy    = map(model, tidy),
         augment = map(model, augment),
         rsq     = map_dbl(glance, "r.squared")) # lists

Plotting \(R^2\) for countries

  • Plot country ~ rsq
  • Color points per continent
  • Reorder country levels by \(R^2\) (rsq): snake plot
  • Which continent shows most of the low \(R^2\) values?
  • To reorder the discrete values of country:
Hints
  • Use the forcats package
  • fct_reorder(country, rsq) to reorder based on the rsq continuous variable
Solution
library(forcats)
models |>
  ggplot(aes(x = rsq, 
             y = fct_reorder(country,
                             rsq))) +
  geom_point(aes(colour = continent), 
             alpha = 0.8, size = 3) +
  theme_classic(18) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = c(0.25, 0.75)) +
   guides(color = guide_legend(
     override.aes = list(alpha = 1))) +
  labs(x = expression(R^2),
       y = "Country",
       color = "Contient") +
  scale_color_manual(values = continent_colors) 

Display the real data for countries with a low \(R^2\)

  • Focus on non-linear trends
  • Filter the 20 countries with the lowest \(R^2\)
  • unnest column data
  • Plot lifeExp ~ year with lines
  • Colour per continent
  • Facet per country
Hints
  • You must ungroup() as we work currently by row.
  • slice_min(col, n = 5) returns the 5 minimal values of col
Solution
models |>
  ungroup() |>
  slice_min(rsq, n = 20) |> 
  unnest(data) |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(colour = country)) +
  facet_wrap(vars(country), nrow = 4) +
  theme(legend.position = "none") +
  labs(x = NULL,
       y = NULL,
       title  = "Life Expectancy development by country") +
  guides(x = guide_axis(n.dodge = 2)) +
  scale_color_manual(values = country_colors) 

Same questions for the top 20 \(R^2\)

models |>
  ungroup() |>
  slice_max(rsq, n = 20) |> 
  unnest(data) |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(colour = country)) +
  facet_wrap(vars(country), nrow = 4) +
  theme(legend.position = "none") +
  labs(x = NULL,
       y = NULL,
       title  = "Life Expectancy development by country") +
  guides(x = guide_axis(n.dodge = 2)) +
  scale_color_manual(values = country_colors) 

Summarise on one plot

  • Unnest coefficients (tidy column)
    • Mind to keep the continent, country and rsq columns
  • Put intercept and slope in their own columns
    • In wide format, only one value can be used.
    • Discard unused columns.
  • Plot slope ~ intercept (watch out the (Intercept) name which needs to be called between backticks!
  • Colour per continent
  • Size per \(R^2\) (use for scale_size_area() for readability)
  • Add tendency with geom_smooth(method = "loess")
Solution
gapminder |>
  mutate(year1950 = year - 1950) |>
  nest_by(continent, country) |>
  mutate(model = list(lm(lifeExp ~ year1950,
                         data = data)),
         glance  = list(glance(model)),
         tidy    = list(tidy(model)),
         rsq     = pluck(glance, "r.squared")) |> 
  unnest(tidy) |>
  select(continent, country, rsq, term, estimate) |>
  pivot_wider(names_from = term,
              values_from = estimate) |>
  ggplot(aes(x = `(Intercept)`, y = year1950)) +
  geom_point(aes(colour = continent,
                 size = rsq)) +
  geom_smooth(se = FALSE,
              method = "loess",
              formula = "y ~ x") +
  scale_size_area() +
  labs(title = "Development of Life Expectancy",
    x = "Life expectancy (1950)",
       y = "Yearly improvement",
       colour = "Continent") +
  scale_colour_manual(values = continent_colors)