DatasauRus

Author: Aurélien Ginolhac

Affiliation: R Workshop

Published: February 10, 2025

Aims

This guided practical demonstrates how the tidyverse lets you compute summary statistics and visualize datasets efficiently.

`datasauRus` package

Check whether the datasauRus package is installed, then load it.

library(datasauRus)

This should return nothing.

If the error there is no package called ‘datasauRus’ appears, the package needs to be installed. Use this:

install.packages("datasauRus")
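The check and the installation can also be combined so that the package is installed only when it is missing. This is a minimal sketch, assuming a CRAN mirror is configured and reachable; `requireNamespace()` is base R and returns `FALSE` when the package cannot be found.

```r
# Install datasauRus only when it is not already available, then load it
if (!requireNamespace("datasauRus", quietly = TRUE)) {
  install.packages("datasauRus")
}
library(datasauRus)
```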

Explore the dataset

Since we are dealing with a tibble, we can simply type its name:

datasaurus_dozen

# A tibble: 1,846 × 3
   dataset     x     y
   <chr>   <dbl> <dbl>
 1 dino     55.4  97.2
 2 dino     51.5  96.0
 3 dino     46.2  94.5
 4 dino     42.8  91.4
 5 dino     40.8  88.3
 6 dino     38.7  84.9
 7 dino     35.6  79.9
 8 dino     33.1  77.6
 9 dino     29.0  74.5
10 dino     26.2  71.4
# ℹ 1,836 more rows

Only the first 10 rows are displayed.
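If 10 rows are not enough, the display can be adjusted. The sketch below shows two possible options, assuming dplyr is installed: `print()` with its `n` argument, and `glimpse()`, which lists the columns vertically. The value `n = 20` is an arbitrary choice for illustration.

```r
library(datasauRus)
library(dplyr)

# Print more rows than the default 10
print(datasaurus_dozen, n = 20)

# One line per column, with its type and first values
glimpse(datasaurus_dozen)
```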

What are the dimensions of this dataset? Rows and columns?

Base version, using either dim(), or ncol() and nrow():

# dim() returns the dimensions of the data frame, i.e. the number of rows and columns
dim(datasaurus_dozen)

[1] 1846    3

# ncol() returns only the number of columns
ncol(datasaurus_dozen)

[1] 3

# nrow() returns only the number of rows
nrow(datasaurus_dozen)

[1] 1846

tidyverse version

# Nothing to be done: a `tibble` displays its dimensions in its first line, which starts with a comment character ('#')

Assign `datasaurus_dozen` to the name `ds_dozen`. This populates the Global Environment

ds_dozen <- datasaurus_dozen

How many datasets are present?

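# the tidyverse provides summarise() and n_distinct() (dplyr), plus the tidyr, purrr and ggplot2 functions used later
library(tidyverse)
# n_distinct() counts the unique elements in a given vector.
# We use summarise() to return only the desired column, named n here.
# We use English verbs and no subsetting characters, and we do not drop dimensions (the result is still a tibble).
summarise(ds_dozen, n = n_distinct(dataset))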

# A tibble: 1 × 1
      n
  <int>
1    13
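For comparison, the same number can be obtained in other ways. The sketch below is only illustrative: it contrasts a base R approach, which returns a bare integer, with a dplyr pipeline based on `distinct()`. The names `n_base` and `n_dplyr` are introduced here and are not part of the practical.

```r
library(datasauRus)
library(dplyr)

# Base R: extract the column, keep its unique values, count them
n_base <- length(unique(datasaurus_dozen$dataset))

# dplyr: keep one row per distinct dataset, then count the rows
n_dplyr <- datasaurus_dozen |>
  distinct(dataset) |>
  nrow()

n_base
n_dplyr
```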

Even better, compute and display the number of rows per dataset

Tip

The function count() in dplyr performs a group_by() on the specified column followed by summarise(n = n()), which returns the number of observations per group.

count(ds_dozen, dataset)

# A tibble: 13 × 2
   dataset        n
   <chr>      <int>
 1 away         142
 2 bullseye     142
 3 circle       142
 4 dino         142
 5 dots         142
 6 h_lines      142
 7 high_lines   142
 8 slant_down   142
 9 slant_up     142
10 star         142
11 v_lines      142
12 wide_lines   142
13 x_shape      142
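To make the shortcut explicit, the sketch below writes out the `group_by()` + `summarise(n = n())` pipeline and compares it with `count()`. The names `by_count` and `by_group` are introduced for this illustration only.

```r
library(datasauRus)
library(dplyr)

# The shortcut
by_count <- count(datasaurus_dozen, dataset)

# The explicit version it stands for
by_group <- datasaurus_dozen |>
  group_by(dataset) |>
  summarise(n = n())

# Compare both results
all.equal(by_count, by_group)
```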

Check summary statistics per dataset

Compute the mean of the `x` & `y` columns. For this, you need to `group_by()` the appropriate column and then `summarise()`

Tip

In summarise() you can define as many new columns as you wish. No need to call it for every single variable.

ds_dozen |>
  group_by(dataset) |>
  summarise(mean_x = mean(x),
            mean_y = mean(y))

# A tibble: 13 × 3
   dataset    mean_x mean_y
   <chr>       <dbl>  <dbl>
 1 away         54.3   47.8
 2 bullseye     54.3   47.8
 3 circle       54.3   47.8
 4 dino         54.3   47.8
 5 dots         54.3   47.8
 6 h_lines      54.3   47.8
 7 high_lines   54.3   47.8
 8 slant_down   54.3   47.8
 9 slant_up     54.3   47.8
10 star         54.3   47.8
11 v_lines      54.3   47.8
12 wide_lines   54.3   47.8
13 x_shape      54.3   47.8

Compute both mean and standard deviation (sd) in one go using `across()`

ds_dozen |>
  # across() takes first the columns to select, then the function(s) to apply to them
  # 2 other possibilities to select columns:
  # summarise(across(where(is.double), list(mean = mean, sd = sd)))
  # by default since v1.0.5, grouping variables are excluded from across()
  # summarise(across(everything(), list(mean = mean, sd = sd)))
  # we can use the .by argument (dplyr >= 1.1.0) instead of a group_by()
  summarise(across(c(x, y), list(mean = mean, sd = sd)), .by = dataset)

# A tibble: 13 × 5
   dataset    x_mean  x_sd y_mean  y_sd
   <chr>       <dbl> <dbl>  <dbl> <dbl>
 1 dino         54.3  16.8   47.8  26.9
 2 away         54.3  16.8   47.8  26.9
 3 h_lines      54.3  16.8   47.8  26.9
 4 v_lines      54.3  16.8   47.8  26.9
 5 x_shape      54.3  16.8   47.8  26.9
 6 star         54.3  16.8   47.8  26.9
 7 high_lines   54.3  16.8   47.8  26.9
 8 dots         54.3  16.8   47.8  26.9
 9 circle       54.3  16.8   47.8  26.9
10 bullseye     54.3  16.8   47.8  26.9
11 slant_up     54.3  16.8   47.8  26.9
12 slant_down   54.3  16.8   47.8  26.9
13 wide_lines   54.3  16.8   47.8  26.9
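The column names produced by `across()` follow the pattern column, underscore, function. If another naming scheme is preferred, the `.names` argument accepts a glue specification built from `{.col}` and `{.fn}`. A minimal sketch, assuming dplyr >= 1.1.0 for the `.by` argument; the name `res` is introduced for illustration.

```r
library(datasauRus)
library(dplyr)

res <- datasaurus_dozen |>
  summarise(
    # put the function name first: mean_x, sd_x, mean_y, sd_y
    across(c(x, y), list(mean = mean, sd = sd), .names = "{.fn}_{.col}"),
    .by = dataset
  )

names(res)
```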

An alternative to across(), using pivoting:

ds_dozen |> 
  pivot_longer(cols = c(x, y),
               # to get all x values first, then all y values, instead of x and y interleaved
               cols_vary = "slowest",
               names_to = "variables",
               values_to = "values") |> 
  summarise(means = mean(values),
            sds = sd(values),
            .by = c(dataset, variables)) |> 
  print(n = Inf)

# A tibble: 26 × 4
   dataset    variables means   sds
   <chr>      <chr>     <dbl> <dbl>
 1 dino       x          54.3  16.8
 2 away       x          54.3  16.8
 3 h_lines    x          54.3  16.8
 4 v_lines    x          54.3  16.8
 5 x_shape    x          54.3  16.8
 6 star       x          54.3  16.8
 7 high_lines x          54.3  16.8
 8 dots       x          54.3  16.8
 9 circle     x          54.3  16.8
10 bullseye   x          54.3  16.8
11 slant_up   x          54.3  16.8
12 slant_down x          54.3  16.8
13 wide_lines x          54.3  16.8
14 dino       y          47.8  26.9
15 away       y          47.8  26.9
16 h_lines    y          47.8  26.9
17 v_lines    y          47.8  26.9
18 x_shape    y          47.8  26.9
19 star       y          47.8  26.9
20 high_lines y          47.8  26.9
21 dots       y          47.8  26.9
22 circle     y          47.8  26.9
23 bullseye   y          47.8  26.9
24 slant_up   y          47.8  26.9
25 slant_down y          47.8  26.9
26 wide_lines y          47.8  26.9

Compute the Pearson correlation between x and y per dataset

# Pearson is the default method of cor(), but it is worth making it explicit
summarise(ds_dozen, pearson_cor = cor(x, y, method = "pearson"), .by = dataset)

# A tibble: 13 × 2
   dataset    pearson_cor
   <chr>            <dbl>
 1 dino           -0.0645
 2 away           -0.0641
 3 h_lines        -0.0617
 4 v_lines        -0.0694
 5 x_shape        -0.0656
 6 star           -0.0630
 7 high_lines     -0.0685
 8 dots           -0.0603
 9 circle         -0.0683
10 bullseye       -0.0686
11 slant_up       -0.0686
12 slant_down     -0.0690
13 wide_lines     -0.0666
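Since `method` is an argument of `cor()`, other coefficients can be computed in the same call to `summarise()`. The sketch below adds the rank-based Spearman correlation next to Pearson, purely as an illustration; this comparison is not part of the original exercise, and the name `cors` is introduced here.

```r
library(datasauRus)
library(dplyr)

cors <- datasaurus_dozen |>
  summarise(pearson  = cor(x, y, method = "pearson"),
            spearman = cor(x, y, method = "spearman"),
            .by = dataset)

cors
```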

Perform a linear model of y explained by x per dataset

Correlation is easy enough, as cor() takes vectors as input and returns a double. For a linear model, the R syntax lm(y ~ x, data = dino) expects a data frame and returns a model object, which makes it more complex to apply per dataset.

One elegant solution is to use functional programming and nesting. Combining this with broom allows a clean conversion of the model outputs (lists) into rectangular tibbles.

ds_dozen |>
  nest(points = c(x, y), .by = dataset) |> 
  # fit y explained by x on each nested tibble
  mutate(lm_model = map(points, \(df) lm(y ~ x, data = df)),
         glance_lm = map(lm_model, broom::glance),
         r_squared = map_dbl(glance_lm, \(g) pull(g, r.squared)))

# A tibble: 13 × 5
   dataset    points             lm_model glance_lm         r_squared
   <chr>      <list>             <list>   <list>                <dbl>
 1 dino       <tibble [142 × 2]> <lm>     <tibble [1 × 12]>   0.00416
 2 away       <tibble [142 × 2]> <lm>     <tibble [1 × 12]>   0.00411
 3 h_lines    <tibble [142 × 2]> <lm>     <tibble [1 × 12]>   0.00381
 4 v_lines    <tibble [142 × 2]> <lm>     <tibble [1 × 12]>   0.00482
 5 x_shape    <tibble [142 × 2]> <lm>     <tibble [1 × 12]>   0.00430
 6 star       <tibble [142 × 2]> <lm>     <tibble [1 × 12]>   0.00396
 7 high_lines <tibble [142 × 2]> <lm>     <tibble [1 × 12]>   0.00469
 8 dots       <tibble [142 × 2]> <lm>     <tibble [1 × 12]>   0.00364
 9 circle     <tibble [142 × 2]> <lm>     <tibble [1 × 12]>   0.00467
10 bullseye   <tibble [142 × 2]> <lm>     <tibble [1 × 12]>   0.00470
11 slant_up   <tibble [142 × 2]> <lm>     <tibble [1 × 12]>   0.00471
12 slant_down <tibble [142 × 2]> <lm>     <tibble [1 × 12]>   0.00476
13 wide_lines <tibble [142 × 2]> <lm>     <tibble [1 × 12]>   0.00443
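glance() gives one row of model-level statistics per fit. To look at the fitted intercept and slope instead, broom::tidy() returns one row per coefficient, which can then be unnested into a regular tibble. A minimal sketch, assuming the broom package is installed; the names `coefs` and `df` are introduced for illustration.

```r
library(datasauRus)
library(dplyr)
library(tidyr)
library(purrr)

coefs <- datasaurus_dozen |>
  nest(points = c(x, y), .by = dataset) |>
  mutate(lm_model = map(points, \(df) lm(y ~ x, data = df)),
         # one row per coefficient: (Intercept) and x
         tidy_lm = map(lm_model, broom::tidy)) |>
  select(dataset, tidy_lm) |>
  unnest(tidy_lm)

coefs
```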

What can you conclude?

At the displayed precision, the means and standard deviations are identical for the 13 datasets, and the correlations are nearly so. Only the correlations and R^2 differ slightly.
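Note that R^2 is not independent information here: for a linear model with a single predictor, R^2 equals the squared Pearson correlation between x and y. The sketch below checks this on the dino dataset; the names `dino`, `r` and `r2` are introduced for illustration.

```r
library(datasauRus)
library(dplyr)

dino <- filter(datasaurus_dozen, dataset == "dino")

# Pearson correlation between x and y
r <- cor(dino$x, dino$y)

# R squared of the simple linear model y ~ x
r2 <- summary(lm(y ~ x, data = dino))$r.squared

# Both quantities agree up to numerical tolerance
all.equal(r^2, r2)
```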

Plot the datasauRus

Plot `ds_dozen` with `ggplot()` such that the aesthetics are `aes(x = x, y = y)`

with the geometry geom_point()

Tip

The ggplot() and geom_point() functions must be linked with a + sign

ggplot(ds_dozen, aes(x = x, y = y)) +
  geom_point()

Reuse the above command, now colouring the points by the `dataset` column

ggplot(ds_dozen, 
       aes(x = x, 
           y = y, 
           colour = dataset)) +
  geom_point()

Too many datasets are displayed in a single panel.

Plot one `dataset` per facet
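One way to untangle the overlapping points is to give each dataset its own panel with `facet_wrap()`. The sketch below is a minimal version; `ncol = 3` is an arbitrary layout choice and the name `p` is introduced for illustration.

```r
library(datasauRus)
library(ggplot2)

p <- ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  # one panel per dataset, laid out on 3 columns
  facet_wrap(vars(dataset), ncol = 3)

p
```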

Tweak the theme: use `theme_void()` and remove the legend

ds_dozen |>
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  theme_void() +
  theme(legend.position = "none") +
  facet_wrap(vars(dataset), ncol = 3)

Are the datasets actually that similar?

No ;) We were fooled by the summary stats

Animation

Plots can be animated, see for example what can be done with gganimate. Instead of panels, states are made across datasets and transitions smoothed with an afterglow effect.
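As a rough idea of what such an animation could look like, here is a sketch assuming the gganimate package is installed. It is not the code behind the original animation: the transition lengths, the easing function and the `wake_length` value are arbitrary choices, and `shadow_wake()` is used here as one possible way to get an afterglow-like trail.

```r
library(datasauRus)
library(ggplot2)
library(gganimate)

anim <- ggplot(datasaurus_dozen, aes(x = x, y = y)) +
  geom_point() +
  theme_void() +
  # show the name of the dataset currently displayed
  labs(title = "{closest_state}") +
  # one state per dataset, with smoothed transitions between them
  transition_states(dataset, transition_length = 3, state_length = 1) +
  ease_aes("cubic-in-out") +
  # points leave a short fading trail while moving
  shadow_wake(wake_length = 0.05)

# Rendering is left commented out, as it needs a renderer (e.g. the gifski package)
# animate(anim)
```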

Conclusion

Never trust summary statistics alone; always visualize your data | Alberto Cairo

Authors

Alberto Cairo (creator)
Justin Matejka
George Fitzmaurice
Lucy McGowan

From this post
