Plotting data

ggplot, part 1

Roland Krause

Rworkshop

Thursday, 8 February 2024

Introduction

Motivation for this layered system

Adapted from this article by Colin Fay thinkR

ggplot2

About this lecture

Learning objectives

  • Learn and apply the basic grammar of graphics
  • Understand how it is implemented in ggplot2
    • Input data structures as data.frame/tibble
    • Mapping columns to display features (aesthetics)
    • Types of graphics (geometries)
    • Multiple and repeating graphics (facets)
    • Transforming plots (scales)
    • Using different coordinate systems
    • Customizing graphs with themes
  • Make exploratory plots of your multidimensional data.

Introduction

ggplot2

  • Stands for grammar of graphics plot v2
  • Inspired by Leland Wilkinson's work on The Grammar of Graphics (2005).

Graphs are split into layers

  • Such as axes, curve(s), labels.
  • Three elements are required: data, aesthetics, and at least one geometry (\(\geqslant 1\))

Creating a plot

Data

A B C D
2 3 4 a
1 2 1 a
4 5 15 b
9 10 80 b

Aesthetics function

x = A

y = C

shape = D

Scaling to physical units

\(x = \frac{A - \min(A)}{\operatorname{range}(A)} \times \text{width}\)

\(y = \frac{C - \min(C)}{\operatorname{range}(C)} \times \text{height}\)

\(\text{shape} = f_{s}(D)\)

Geometry

Scatter plot

Data drawn as points

Mapped data

x y shape
25 11 circle
0 0 circle
75 53 square
200 300 square
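
The mapped values can be reproduced with a few lines of base R. This is only a sketch: the plotting area of 200 × 300 units is an assumption inferred from the mapped table, rescale_to() is a hypothetical helper name, and the ifelse() call merely stands in for the shape function \(f_s\).

```r
# Toy data from the "Data" slide
d <- data.frame(A = c(2, 1, 4, 9),
                B = c(3, 2, 5, 10),
                C = c(4, 1, 15, 80),
                D = c("a", "a", "b", "b"))

# Assumed size of the plotting area (inferred from the mapped table)
width  <- 200
height <- 300

# Linear rescaling to physical units.
# Note: base R's range() returns c(min, max), so the span is diff(range(v)).
rescale_to <- function(v, size) (v - min(v)) / diff(range(v)) * size

mapped <- data.frame(
  x     = round(rescale_to(d$A, width)),
  y     = round(rescale_to(d$C, height)),
  shape = ifelse(d$D == "a", "circle", "square")  # stand-in for f_s(D)
)
mapped
```

In ggplot2 this rescaling is handled by the scales, so it never has to be written by hand.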

Result

What if we want to split circles and squares into separate panels?

Faceting

Split by shape, aka trellis or lattice plots

Redundancy

  • shape and facets provide the same information.
  • The shape aesthetic is free for another variable.

Familiar country shape and data

Combining layers

All ggplot layers are functions

library(ggplot2)
swiss |>
  ggplot() +
  aes(x = Education, 
      y = Examination) +
  geom_point() +
  scale_colour_brewer()

Warning

  • ggplot2 layers are combined with +!
  • The magrittr pipe %>% and the base pipe |> do not work between layers!
  • This introduces a break in the workflow.

Using the pipe between layers raises an explicit error

  ggplot(swiss) |> 
  aes(x = Education, 
      y = Examination) |> 
  geom_point() +
  scale_colour_brewer()
Error:
! Cannot add <ggproto> objects together
ℹ Did you forget to add this object to a <ggplot> object?

Palmer penguins

Install with install.packages("palmerpenguins")

Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package v0.1.0

library(palmerpenguins)

Capabilities in ggplot

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Capabilities in ggplot

ggplot(penguins)

Capabilities in ggplot

ggplot(penguins) +
  aes(x = flipper_length_mm, 
      y = body_mass_g, 
      color = sex)

Capabilities in ggplot

ggplot(penguins) +
  aes(x = flipper_length_mm, 
      y = body_mass_g, 
      color = sex) +
  geom_point()

Capabilities in ggplot

ggplot(penguins) +
  aes(x = flipper_length_mm, 
      y = body_mass_g, 
      color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), 
                     na.translate = FALSE) 

Capabilities in ggplot

ggplot(penguins) +
  aes(x = flipper_length_mm,
      y = body_mass_g, 
      color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), 
                     na.translate = FALSE) +
  labs(title = "Penguin flipper and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = "Dimensions for male/female Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER", 
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin sex")

Capabilities in ggplot

ggplot(penguins) +
  aes(x = flipper_length_mm,
      y = body_mass_g,
      color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), na.translate = FALSE) +
  labs(title = "Penguin flipper and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = "Dimensions for male/female Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin sex") +
  theme(plot.subtitle = element_text(size = 13),
        axis.title   = element_text(size = 11)) 

Capabilities in ggplot

ggplot(penguins) +
  aes(x = flipper_length_mm,
      y = body_mass_g,
      color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), na.translate = FALSE) +
  labs(title = "Penguin flipper and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = "Dimensions for male/female Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER", 
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin sex") +
  theme(plot.subtitle = element_text(size = 13),
        axis.title   = element_text(size = 11)) +
  theme(legend.position = "bottom",
        legend.background = element_rect(fill = "white", color = NA)) 

Capabilities in ggplot

ggplot(penguins) +
  aes(x = flipper_length_mm, y = body_mass_g, color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), na.translate = FALSE) +
  labs(title = "Penguin flipper and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = "Dimensions for male/female Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin sex") +
  theme(plot.subtitle = element_text(size = 13),
        axis.title   = element_text(size = 11)) +
  theme(legend.position = "bottom",
        legend.background = element_rect(fill = "white", color = NA)) +
  theme(plot.caption = element_text(hjust = 0, face = "italic"),
        plot.caption.position = "plot") +
  facet_wrap(vars(species))

Capabilities in ggplot

ggplot(penguins) +
  aes(x = flipper_length_mm, y = body_mass_g, color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), na.translate = FALSE) +
  labs(title = "Penguin flipper and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = "Dimensions for male/female Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin sex") +
  theme(plot.subtitle = element_text(size = 13),
        axis.title   = element_text(size = 11)) +
  theme(legend.position = "bottom",
        legend.background = element_rect(fill = "white", color = NA)) +
  theme(plot.caption = element_text(hjust = 0, face = "italic"),
        plot.caption.position = "plot") +
  facet_wrap(vars(species)) +
  scale_x_continuous(guide = guide_axis(n.dodge = 2)) +
  scale_y_continuous(labels = scales::label_comma())

Geometric objects define the plot type to be drawn

geom_point()

geom_line()

geom_bar()

geom_violin()

geom_histogram()

geom_density()

Layers

Core layers

Other layers

They are always present; plots work without specifying them because they have sensible defaults:

  • Theme is theme_grey
  • Coordinate is cartesian
  • Statistic is identity
  • Facets are disabled

Three layers are enough

  • Data
  • Aesthetics, mapping data to plot components
  • Geometry, at least one

Your first plot

library(palmerpenguins)
library(ggplot2)
ggplot(data = penguins) +
  geom_point(mapping = aes(x = bill_length_mm,
                           y = bill_depth_mm,
                           colour = species))

My first plot

ggplot(data = penguins) +
  geom_point(mapping = aes(x = bill_length_mm,
                           y = bill_depth_mm,
                           colour = "green"))

Mapping aesthetics

Requirements

  • aes() maps columns (variables) of the data to aesthetics
  • Specific geometries (geom) have different expectations:
    • univariate: one of x or y (y for flipped axes)
    • bivariate: x and y, as in a scatterplot
  • Continuous or discrete variables
    • Continuous for color ➡️ gradient
    • Discrete for color ➡️ qualitative

geom_point() requires both x and y coordinates

ggplot(penguins) +
  geom_point(aes(x = bill_length_mm,
                 y = bill_depth_mm))
  • Same as previous slide
  • Without colour mapping

Unmapped parameters

  • geom_point() accepts additional arguments such as the colour
  • Set them to a fixed value, outside aes(), without mapping
ggplot(penguins) +
  geom_point(aes(x = bill_length_mm,
                 y = bill_depth_mm),
             colour = "black")

Important

Parameters defined outside the aesthetics aes() are applied to all data.

Mapped parameters

Mapped parameters require two conditions:

  • Being defined inside the aesthetics aes()
  • Refer to one of the data columns; here is a mistake:
ggplot(penguins) +
  geom_point(aes(x = bill_length_mm,
                 y = bill_depth_mm,
                 colour = country)) #<<
Error in FUN(X[[i]], ...): object 'country' not found
  • Passing the unknown column as a string has a different effect:
ggplot(penguins) +
  geom_point(aes(x = bill_length_mm,
                 y = bill_depth_mm,
                 colour = "country"))

  • This is hardly useful, but we shall see an application later.
  • Stick to the two mapping rules:
    • In aes()
    • Refer to a valid column.
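
One known use of a constant string inside aes() is to label whole layers in a legend. Whether this is the application meant above is an assumption; the sketch below only illustrates the idiom, and the labels "linear model" and "loess" are arbitrary.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  # fixed colour, outside aes(): applies to all points, no legend entry
  geom_point(colour = "grey60", na.rm = TRUE) +
  # constant strings inside aes(): each string becomes a legend key
  geom_smooth(aes(colour = "linear model"), method = "lm",
              formula = y ~ x, se = FALSE, na.rm = TRUE) +
  geom_smooth(aes(colour = "loess"), method = "loess",
              formula = y ~ x, se = FALSE, na.rm = TRUE) +
  labs(colour = "Smoother")
p
```

The strings are treated as a discrete variable with one level per layer, so the colour scale assigns each smoother its own colour and legend entry.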

Mapping aesthetics correctly

In aes() and refer to a data column

ggplot(penguins) +
  geom_bar(aes(y = species,
               fill = sex))

species and sex are 2 valid columns in penguins

Advantages:

  • The legend 🟥/🟦 for free
  • Missing data are highlighted in grey ⬜️
  • Using y axis for categories eases reading

Why no string for mapping?

  • Could we pass an expression?
  • Which penguins are above 4 kg?
  • Use body_mass_g > 4000, which returns a logical value, to find out
ggplot(penguins) +
  geom_bar(aes(y = species,
               fill = body_mass_g > 4000))

The expression was evaluated in the context of penguins.

It is obvious that Gentoo penguins are heavier than the two other species.
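
The numbers behind these bars can be checked by evaluating the same expression with dplyr::count(), which also looks up column names in the data. A minimal sketch, assuming dplyr is installed; the column name above_4kg is made up for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Same logical expression as in aes(), evaluated in the context of penguins
mass_table <- penguins |>
  count(species, above_4kg = body_mass_g > 4000)
mass_table
```

Rows where above_4kg is NA correspond to penguins without a recorded body mass, shown in grey on the plot.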

Inheritance of arguments across layers

Compare the code and the results

ggplot(penguins,
       aes(x = bill_length_mm,
           y = bill_depth_mm)) +
  geom_point(
    aes(colour = species)) +
  geom_smooth(method = "lm", formula = 'y ~ x')

ggplot(penguins,
       aes(x = bill_length_mm,
           y = bill_depth_mm,
           colour = species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = 'y ~ x')

Note

  • Aesthetics in ggplot() are passed on to all geometries.
  • Aesthetics in geom_*() are specific to that layer (and can override inherited ones).

Simpson’s paradox

A statistical correlation can change, or even reverse, depending on stratification.
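
The reversal can be checked numerically on the bill measurements. This is a small sketch, assuming dplyr is installed; the object names overall and by_species are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Correlation over all penguins, ignoring species
overall <- penguins |>
  summarise(r = cor(bill_length_mm, bill_depth_mm, use = "complete.obs"))

# Correlation within each species
by_species <- penguins |>
  group_by(species) |>
  summarise(r = cor(bill_length_mm, bill_depth_mm, use = "complete.obs"))

overall
by_species
```

The pooled correlation is negative, while the correlation within each species is positive, matching the single regression line versus the three per-species lines.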

Your turn!

  • Use the classroom practical 5.
  • Install palmerpenguins package if you haven’t yet
  • Use the penguins data set and plot bill_length_mm, bill_depth_mm and species.
  • Map the variable island to the aesthetics shape.
  • Add a regression line using a linear model.
  • All dots (circles / triangles / squares) with:
    • A size of 5
    • A transparency of 30% (alpha = 0.7)

Goal

Answer:

ggplot(penguins,
       aes(x = bill_length_mm,
           y = bill_depth_mm,
           colour = species)) +
  geom_point(aes(shape = island),
             size = 5, alpha = 0.7) +
  geom_smooth(method = "lm",
              formula = 'y ~ x')

Joining observations

library(tibble)
library(tidyr)
set.seed(212)
# tidyr::crossing() generates all combinations of x and g
tib <- tibble(crossing(x = letters[1:4], 
                       g = factor(1:2)), 
              y = rnorm(8))

Suppose we want to connect the dots of the same colour

tib
# A tibble: 8 × 3
  x     g          y
  <chr> <fct>  <dbl>
1 a     1     -0.239
2 a     2      0.677
3 b     1     -2.44 
4 b     2      1.24 
5 c     1     -0.327
6 c     2      0.154
7 d     1      1.04 
8 d     2     -0.780

Warning

Should be the job of geom_line()

Invisible aesthetic: grouping

ggplot(tib, aes(x, y, colour = g)) +
  geom_line() + 
  geom_point(size = 4)

ggplot(tib, aes(x, y, colour = g)) +
  geom_line(aes(group = g)) +  #<<
  geom_point(size = 4)

ggplot(tib, aes(x, y, colour = g)) +
  geom_line(aes(group = 1)) + #<<
  geom_point(size = 4)
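
The three plots differ only in how rows are grouped before the lines are drawn. By default, ggplot2 groups by the interaction of all discrete variables mapped in the plot. Here both x and g are discrete, so every point ends up in its own group and geom_line() has nothing to connect. The sketch below rebuilds tib and counts the groups with layer_data(); n_groups() is a hypothetical helper introduced for illustration.

```r
library(ggplot2)
library(tibble)
library(tidyr)

set.seed(212)
tib <- tibble(crossing(x = letters[1:4], g = factor(1:2)),
              y = rnorm(8))

# Hypothetical helper: number of distinct groups in the first layer
n_groups <- function(p) length(unique(layer_data(p, 1)$group))

p_default <- ggplot(tib, aes(x, y, colour = g)) + geom_line()
p_by_g    <- ggplot(tib, aes(x, y, colour = g)) + geom_line(aes(group = g))
p_single  <- ggplot(tib, aes(x, y, colour = g)) + geom_line(aes(group = 1))

n_groups(p_default)  # one group per row: no lines are drawn
n_groups(p_by_g)     # one group, hence one line, per level of g
n_groups(p_single)   # a single group: one line through all points
```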

Labels

ggplot(penguins,
       aes(x = bill_length_mm,
           y = bill_depth_mm,
           shape = island,
           colour = species)) +
  geom_point() +
  geom_smooth(method = "lm",
              formula = 'y ~ x') +
  labs(title = "Bill ratios of Palmer penguins",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = "Split per species / island",
       shape = "Islands",
       x = "culmen length (mm)",
       y = "culmen depth (mm)")

Statistics / geometries are interchangeable

ggplot(penguins) +
  geom_bar(aes(y = species))

ggplot(penguins) +
  stat_count(aes(y = species))

Tip

  • geom_*() feels more natural since it names the visual result
  • But this is just a preference
  • Most code in the wild uses geom_*()

Let ggplot2 do the stat for you

stat = "count" could be omitted since it is the default for geom_bar()

ggplot(penguins, aes(x = species)) +
  geom_bar(stat = "count")

stat_count acts on the mapped var like dplyr::count()

library(dplyr)
count(penguins, species)
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
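
To check that geom_bar() computes the same numbers, the data computed for the layer can be inspected with layer_data(). A small sketch, assuming ggplot2, dplyr and palmerpenguins are installed; the object names are illustrative.

```r
library(ggplot2)
library(dplyr)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = species)) +
  geom_bar()

# Counts computed by stat_count(), one row per bar
bar_counts <- layer_data(p)$count

# Counts computed explicitly with dplyr
tbl_counts <- count(penguins, species)$n

# TRUE when both approaches agree
all(bar_counts == tbl_counts)
```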

Or do it yourself, but with geom_col()

If you give counts, change the stat

count(penguins, species) |>
  ggplot(aes(x = species, y = n)) +
  geom_bar(stat = "identity") #<<

geom_col() has the default identity

count(penguins, species) |>
  ggplot(aes(x = species, y = n)) +
  geom_col() #<<

The after_stat() function allows computation, like proportions

after_stat() replaces stat(), which is deprecated since ggplot2 3.4.0.

Classic counting

ggplot(penguins, aes(y = species)) +
  geom_bar(aes(x = after_stat(count)))

See the list of computed variables in the help pages

ggplot(penguins, aes(y = species)) +
  geom_bar(aes(x = after_stat(count / sum(count)))) +
  scale_x_continuous(labels = scales::label_percent())

  • Now compute proportions
  • Bonus: get x scale in % using scales

Flexibility in the aesthetics for flipping axes

geom_bar() requires x OR y

penguins |>
  # horizontal brings readability
  ggplot(aes(y = species)) +
  geom_bar()

Cleanup plot

penguins |>
  ggplot(aes(y = species)) +
  geom_bar() +
  labs(y = NULL) +
  scale_x_continuous(expand = c(0, NA))

It is annoying to see those 3 bars in no particular order

Reorder the categorical variable (forcats)

Using the function fct_infreq()

library(forcats)
penguins |>
  ggplot(aes(y = fct_infreq(species))) + #<<
  geom_bar() +
  scale_x_continuous(expand = c(0, NA)) +
  labs(title = "Palmer penguins species",
       y = NULL) +
  theme_minimal(14) +
  # nice trick from T. Pedersen
  theme(panel.ontop = TRUE,
        # better to hide the horizontal grid lines
        panel.grid.major.y = element_blank())

Geometries catalogue

Histograms

penguins |>
ggplot(aes(x = body_mass_g,
           fill = species)) +
  geom_histogram(bins = 35,
                 alpha = 0.7, 
                 position = "identity")

  • The default number of bins is 30; ggplot2 prints a message when bins is left unset
  • The default position is stack. Here we overlay the histograms and use transparency

Density plots

penguins |>
ggplot(aes(x = body_mass_g,
           fill = species,
           colour = species)) +
  geom_density(alpha = 0.7)

  • Use both colour and fill mapped to the same variable for cosmetic purposes

Overlay density and histogram

Naive approach: scale issue

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 30) +
  geom_density(colour = "red")

Solution: scale the histogram to the density

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 30,
                 # after_stat() replaces the deprecated stat()
                 aes(y = after_stat(density))) +
  geom_density(colour = "red")

Barplot, categories

Default: position = "stack"

ggplot(penguins) +
  geom_bar(aes(y = species, 
               fill = island))

Dodging island: side by side

ggplot(penguins) +
  geom_bar(aes(y = species, fill = island),
           position = "dodge")

But global width per species is preserved

Preserve single bar

ggplot(penguins) +
  geom_bar(aes(y = species,
               fill = island),
           position = position_dodge2(preserve = "single")) +
  labs(y = NULL)

Stacked barchart for proportions

penguins |> 
  drop_na(sex) |> # from tidyr
  ggplot() +
  geom_bar(aes(y = species,
               fill = sex),
           position = "fill") + #<<
  scale_x_continuous(labels = scales::label_percent(),
                     position = "top",
                     expand = c(0, 0)) +
  labs(x = NULL, y = NULL) +
  theme_classic(16)

Pie charts involve another coordinate system

ggplot(penguins) +
  geom_bar(aes(y = species,
               fill = island),
           position = "fill") +
  labs(x = NULL, y = NULL) +
  coord_polar() #<<

penguins |> 
  ggplot() +
  geom_bar(aes(y = species,
               fill = island)) +
  labs(x = NULL, y = NULL) +
  coord_polar() #<<

Boxplot, a continuous y by a categorical x

ggplot(penguins) +
  geom_boxplot(aes(y = body_mass_g,
                   x = species))

Note

geom_boxplot() works out the orientation from the variable types:

  • body_mass_g is continuous
  • species is categorical/discrete

Boxplot, dodging by default

Filter out NA to avoid this category

penguins |> # alternative to tidyr::drop_na()
  filter(!is.na(sex)) |>
  ggplot() +
  geom_boxplot(aes(y = body_mass_g,
                   x = species,
                   fill = sex))

Better: violin and jitter

Show the data

penguins |> 
  filter(!is.na(sex)) |>
  # define aes here for both geometries
  ggplot(aes(y = body_mass_g,
             x = species,
             fill = sex,
             # for violin contours and dots
             colour = sex
  )) +        # very transparent filling
  geom_violin(alpha = 0.1, trim = FALSE) +
  geom_point(position = position_jitterdodge(dodge.width = 0.9),
             alpha = 0.5,
             # don't need dots in legend
             show.legend = FALSE)

Even better: beeswarm

ggplot extension ggbeeswarm

library(ggbeeswarm)
penguins |> 
  filter(!is.na(sex)) |>
  ggplot(aes(y = body_mass_g,
             x = species,
             colour = sex)) +
  geom_quasirandom(dodge.width = 1) #<<

Raincloud plots

library(ggdist)
ggplot(penguins, 
       aes(y = species, 
           x = bill_depth_mm / bill_length_mm, 
           color = species, fill = species)) +
  geom_violin(width = .5, fill = "white", alpha = 0.4,
              size = 1.1, trim = FALSE) +
  ggdist::stat_halfeye(
    adjust = .33, width = .67, 
    alpha = 0.6, trim = FALSE,
    position = position_nudge(y = .35)) +
  ggbeeswarm::geom_quasirandom(groupOnX = FALSE,
                               alpha = .5, size = 3, 
                               width = 0.25) +
  scale_color_brewer(palette = "Set1", type = "qual") +
  scale_fill_brewer(palette = "Set1", type = "qual") +
  labs(x = "Bill ratio", y = NULL) +
  theme(legend.position = "none",
        axis.line = element_blank(),
        panel.grid.major.x = element_line(colour = "grey90"),
        axis.ticks = element_blank())

Coding mistake

What is wrong with the code below?

(Hint: think about inherited aesthetics)

penguins |>
  ggplot() +
  geom_point(aes(x = bill_length_mm, 
                 y = body_mass_g)) +
  geom_smooth(method = "lm")
Error in `geom_smooth()`:
! Problem while computing stat.
ℹ Error occurred in the 2nd layer.
Caused by error in `compute_layer()`:
! `stat_smooth()` requires the following missing aesthetics: x and y

Inheritance of aesthetics in main ggplot()

penguins |>
  ggplot(aes(x = bill_length_mm, 
             y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Control the dots plotting order

ggplot2 draws dots in the order they appear in the input data

tibble(x = LETTERS[1:3],
       y = x) |> 
  ggplot(aes(x, y)) +
  geom_point(aes(colour = x),
             show.legend = FALSE,
             size = 125) +
  scale_color_brewer(palette = "Dark2") +
  theme_classic(20)

tibble(x = LETTERS[1:3], y = x) |> 
  arrange(desc(x)) |> 
  ggplot(aes(x, y)) +
  geom_point(aes(colour = x),
             show.legend = FALSE,
             size = 125) +
  scale_color_brewer(palette = "Dark2") +
  theme_classic(20)

Before we stop

You learned to:

  • Understand graphics as a language
  • Embrace the layer system
  • Link data columns to aesthetics
  • Discover geometries

Acknowledgments

Further reading

Thank you for your attention!