DatasauRus

Author

Aurélien Ginolhac

Published

February 6, 2024

Aims

This guided practical demonstrates that the tidyverse lets you compute summary statistics and visualize datasets efficiently. The dataset is already stored in a tidy tibble; cleaning steps will come in future practicals.

Questions of this kind are optional

datasauRus package

Check if you have the package datasauRus installed

library(datasauRus)
  • This should return nothing.

If the error there is no package called ‘datasauRus’ appears, the package needs to be installed. Use this:

install.packages("datasauRus")
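The check and the installation can also be combined into one step. This is a minimal sketch, assuming a CRAN mirror is reachable; the variable name has_pkg is introduced here for illustration only.

```r
# requireNamespace() returns TRUE or FALSE instead of raising an error
has_pkg <- requireNamespace("datasauRus", quietly = TRUE)

# install only when the package is missing
if (!has_pkg) {
  install.packages("datasauRus")
}
```

Unlike library(), requireNamespace() does not attach the package, so library(datasauRus) is still needed afterwards.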

Explore the dataset

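Since we are dealing with a tibble, we can simply type its name (with the tidyverse loaded through library(tidyverse), so that the tibble print method is available)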

datasaurus_dozen
# A tibble: 1,846 × 3
   dataset     x     y
   <chr>   <dbl> <dbl>
 1 dino     55.4  97.2
 2 dino     51.5  96.0
 3 dino     46.2  94.5
 4 dino     42.8  91.4
 5 dino     40.8  88.3
 6 dino     38.7  84.9
 7 dino     35.6  79.9
 8 dino     33.1  77.6
 9 dino     29.0  74.5
10 dino     26.2  71.4
# ℹ 1,836 more rows

Only the first 10 rows are displayed.

What are the dimensions of this dataset? Rows and columns?

  • base version, using dim(), or ncol() and nrow()
# dim() returns the dimensions of the data frame, i.e. the number of rows and columns
dim(datasaurus_dozen)
[1] 1846    3
# ncol() only number of columns
ncol(datasaurus_dozen)
[1] 3
# nrow() only number of rows
nrow(datasaurus_dozen)
[1] 1846
  • tidyverse version
# Nothing to be done: a `tibble` displays its dimensions in the first line of its printout, which starts with a comment ('#' character)

Assign datasaurus_dozen to the name ds_dozen. This makes the object appear in the Global Environment

ds_dozen <- datasaurus_dozen

How many datasets are present?

# n_distinct() counts the unique elements in a given vector.
# we use summarise() to return only the desired column, named n here.
# we use English verbs and no subsetting characters, and we do not change the data structure (the result is still a tibble)
summarise(ds_dozen, n = n_distinct(dataset))
# A tibble: 1 × 1
      n
  <int>
1    13
  • Even better, compute and display the number of rows per dataset

The count() function in dplyr groups by the specified column and then applies summarise(n = n()), which returns the number of observations per group.

count(ds_dozen, dataset)
# A tibble: 13 × 2
   dataset        n
   <chr>      <int>
 1 away         142
 2 bullseye     142
 3 circle       142
 4 dino         142
 5 dots         142
 6 h_lines      142
 7 high_lines   142
 8 slant_down   142
 9 slant_up     142
10 star         142
11 v_lines      142
12 wide_lines   142
13 x_shape      142
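The equivalence between count() and the group_by() + summarise() pair can be checked directly. This is a minimal sketch; the names via_count and via_summarise are introduced here for illustration.

```r
library(dplyr)
library(datasauRus)

# shortcut version
via_count <- count(datasaurus_dozen, dataset)

# explicit version: group, then count the rows of each group with n()
via_summarise <- datasaurus_dozen |>
  group_by(dataset) |>
  summarise(n = n())

# compare the two results
all.equal(via_count, via_summarise)
```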

Check summary statistics per dataset

Compute the mean of the x and y columns. For this, you need to group_by() the appropriate column and then summarise()

In summarise() you can define as many new columns as you wish. No need to call it for every single variable.

ds_dozen |>
  group_by(dataset) |>
  summarise(mean_x = mean(x),
            mean_y = mean(y))
# A tibble: 13 × 3
   dataset    mean_x mean_y
   <chr>       <dbl>  <dbl>
 1 away         54.3   47.8
 2 bullseye     54.3   47.8
 3 circle       54.3   47.8
 4 dino         54.3   47.8
 5 dots         54.3   47.8
 6 h_lines      54.3   47.8
 7 high_lines   54.3   47.8
 8 slant_down   54.3   47.8
 9 slant_up     54.3   47.8
10 star         54.3   47.8
11 v_lines      54.3   47.8
12 wide_lines   54.3   47.8
13 x_shape      54.3   47.8
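The printed means look identical partly because a tibble shows only three significant digits by default. As a hedged sketch, the spread of the per-dataset means can be measured explicitly; the names mean_spread, spread_x and spread_y are introduced here for illustration.

```r
library(dplyr)
library(datasauRus)

mean_spread <- datasaurus_dozen |>
  # one row per dataset with its two means
  summarise(mean_x = mean(x), mean_y = mean(y), .by = dataset) |>
  # largest minus smallest mean across the 13 datasets
  summarise(spread_x = max(mean_x) - min(mean_x),
            spread_y = max(mean_y) - min(mean_y))

mean_spread
```

A small but non-zero spread would mean the datasets agree only up to a few decimal places, which is enough to look identical in the default printout.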

Compute both mean and standard deviation (sd) in one go using across()

ds_dozen |>
  # across() takes first the columns to select, then the function(s) to apply to that selection
  # 2 other ways to select the columns (to be combined with a grouping, e.g. .by = dataset)
  # summarise(across(where(is.double), list(mean = mean, sd = sd)))
  # by default since v1.0.5, grouping variables are excluded from across()
  # summarise(across(everything(), list(mean = mean, sd = sd)))
  # we can use the new .by argument instead of a group_by()
  summarise(across(c(x, y), list(mean = mean, sd = sd)), .by = dataset)
# A tibble: 13 × 5
   dataset    x_mean  x_sd y_mean  y_sd
   <chr>       <dbl> <dbl>  <dbl> <dbl>
 1 dino         54.3  16.8   47.8  26.9
 2 away         54.3  16.8   47.8  26.9
 3 h_lines      54.3  16.8   47.8  26.9
 4 v_lines      54.3  16.8   47.8  26.9
 5 x_shape      54.3  16.8   47.8  26.9
 6 star         54.3  16.8   47.8  26.9
 7 high_lines   54.3  16.8   47.8  26.9
 8 dots         54.3  16.8   47.8  26.9
 9 circle       54.3  16.8   47.8  26.9
10 bullseye     54.3  16.8   47.8  26.9
11 slant_up     54.3  16.8   47.8  26.9
12 slant_down   54.3  16.8   47.8  26.9
13 wide_lines   54.3  16.8   47.8  26.9

An alternative to across(), using pivoting:

ds_dozen |> 
  pivot_longer(cols = c(x, y),
               # to get all x first, then the y instead of x/y mingled
               cols_vary = "slowest",
               names_to = "variables",
               values_to = "values") |> 
  summarise(means = mean(values),
            sds = sd(values),
            .by = c(dataset, variables)) |> 
  print(n = Inf)
# A tibble: 26 × 4
   dataset    variables means   sds
   <chr>      <chr>     <dbl> <dbl>
 1 dino       x          54.3  16.8
 2 away       x          54.3  16.8
 3 h_lines    x          54.3  16.8
 4 v_lines    x          54.3  16.8
 5 x_shape    x          54.3  16.8
 6 star       x          54.3  16.8
 7 high_lines x          54.3  16.8
 8 dots       x          54.3  16.8
 9 circle     x          54.3  16.8
10 bullseye   x          54.3  16.8
11 slant_up   x          54.3  16.8
12 slant_down x          54.3  16.8
13 wide_lines x          54.3  16.8
14 dino       y          47.8  26.9
15 away       y          47.8  26.9
16 h_lines    y          47.8  26.9
17 v_lines    y          47.8  26.9
18 x_shape    y          47.8  26.9
19 star       y          47.8  26.9
20 high_lines y          47.8  26.9
21 dots       y          47.8  26.9
22 circle     y          47.8  26.9
23 bullseye   y          47.8  26.9
24 slant_up   y          47.8  26.9
25 slant_down y          47.8  26.9
26 wide_lines y          47.8  26.9

Compute the Pearson correlation between x and y per dataset

# Pearson is the default method of cor(), but it is worth stating explicitly
summarise(ds_dozen, pearson_cor = cor(x, y, method = "pearson"), .by = dataset)
# A tibble: 13 × 2
   dataset    pearson_cor
   <chr>            <dbl>
 1 dino           -0.0645
 2 away           -0.0641
 3 h_lines        -0.0617
 4 v_lines        -0.0694
 5 x_shape        -0.0656
 6 star           -0.0630
 7 high_lines     -0.0685
 8 dots           -0.0603
 9 circle         -0.0683
10 bullseye       -0.0686
11 slant_up       -0.0686
12 slant_down     -0.0690
13 wide_lines     -0.0666

Perform a linear model of y explained by x per dataset

Correlation is easy enough to compute, as cor() takes vectors as input and returns a double. A linear model is more complex to fit per group, because the R syntax lm(y ~ x, data = dino) expects a formula and a data frame.

One elegant solution is to use functional programming and nesting. Combined with broom, it converts the list-like model output into rectangular tibbles.

ds_dozen |>
  nest(.by = dataset) |> 
  # y is explained by x; the lambda argument is named d to avoid confusion with the x column
  mutate(lm = map(data, \(d) lm(y ~ x, data = d)),
         glance_lm = map(lm, broom::glance),
         r_squared = map_dbl(glance_lm, \(g) pull(g, r.squared)))
# A tibble: 13 × 5
   dataset    data               lm     glance_lm         r_squared
   <chr>      <list>             <list> <list>                <dbl>
 1 dino       <tibble [142 × 2]> <lm>   <tibble [1 × 12]>   0.00416
 2 away       <tibble [142 × 2]> <lm>   <tibble [1 × 12]>   0.00411
 3 h_lines    <tibble [142 × 2]> <lm>   <tibble [1 × 12]>   0.00381
 4 v_lines    <tibble [142 × 2]> <lm>   <tibble [1 × 12]>   0.00482
 5 x_shape    <tibble [142 × 2]> <lm>   <tibble [1 × 12]>   0.00430
 6 star       <tibble [142 × 2]> <lm>   <tibble [1 × 12]>   0.00396
 7 high_lines <tibble [142 × 2]> <lm>   <tibble [1 × 12]>   0.00469
 8 dots       <tibble [142 × 2]> <lm>   <tibble [1 × 12]>   0.00364
 9 circle     <tibble [142 × 2]> <lm>   <tibble [1 × 12]>   0.00467
10 bullseye   <tibble [142 × 2]> <lm>   <tibble [1 × 12]>   0.00470
11 slant_up   <tibble [142 × 2]> <lm>   <tibble [1 × 12]>   0.00471
12 slant_down <tibble [142 × 2]> <lm>   <tibble [1 × 12]>   0.00476
13 wide_lines <tibble [142 × 2]> <lm>   <tibble [1 × 12]>   0.00443
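The same nesting pattern can return the fitted coefficients instead of the model-level statistics. The following is a sketch, assuming broom, purrr and tidyr are installed; the names coefs, model and tidy_lm are introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(datasauRus)

coefs <- datasaurus_dozen |>
  nest(.by = dataset) |>
  # one linear model per nested tibble, y explained by x
  mutate(model = map(data, \(d) lm(y ~ x, data = d)),
         # broom::tidy() returns one row per coefficient (intercept and slope)
         tidy_lm = map(model, broom::tidy)) |>
  select(dataset, tidy_lm) |>
  # flatten the list-column back into a regular tibble
  unnest(tidy_lm)

coefs
```

With unnest(), the result is again a flat tibble with two rows per dataset, so the usual dplyr verbs apply to it.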

What can you conclude?

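Means and standard deviations are the same, to the displayed precision, for the 13 datasets, and the correlations are nearly the same (all between -0.06 and -0.07). R^2 differs only slightly too, which is expected: for a simple linear regression, it is the square of the Pearson correlation.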
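For a simple linear regression, R^2 equals the squared Pearson correlation, so both statistics carry the same information here. This can be verified with a short sketch; the name check and its column names are introduced here for illustration, and pick() assumes dplyr 1.1.0 or later.

```r
library(dplyr)
library(datasauRus)

check <- datasaurus_dozen |>
  summarise(
    cor_squared = cor(x, y)^2,
    # pick() hands the current group's x and y columns to lm() as a data frame
    r_squared = summary(lm(y ~ x, data = pick(x, y)))$r.squared,
    .by = dataset
  )

# TRUE when both columns agree up to numerical tolerance
all.equal(check$cor_squared, check$r_squared)
```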

Plot the datasauRus

Plot ds_dozen with ggplot(), using the aesthetics aes(x = x, y = y) and the geometry geom_point().

The ggplot() and geom_point() calls must be linked with a + sign.

ggplot(ds_dozen, aes(x = x, y = y)) +
  geom_point()

Reuse the above command, and now colored by the dataset column

ggplot(ds_dozen, 
       aes(x = x, 
           y = y, 
           colour = dataset)) +
  geom_point()

Too many datasets are displayed.

Plot one dataset per facet

ds_dozen |>
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(vars(dataset))

Tweak the theme: use theme_void() and remove the legend

ds_dozen |>
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  theme_void() +
  theme(legend.position = "none") +
  facet_wrap(vars(dataset), ncol = 3)

Are the datasets actually that similar?

No ;) We were fooled by the summary stats

Animation

Plots can be animated, see for example what can be done with gganimate. Instead of panels, states are made across datasets and transitions smoothed with an afterglow effect.
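A possible way to build such an animation is sketched below. It is not necessarily the code behind the original animation: the parameter values (transition_length, state_length, wake_length) are arbitrary choices made here, and the gganimate package is assumed to be installed.

```r
library(ggplot2)
library(gganimate)
library(datasauRus)

anim <- ggplot(datasaurus_dozen, aes(x = x, y = y)) +
  geom_point() +
  theme_minimal() +
  # one state per dataset, with tweened transitions between states
  transition_states(dataset, transition_length = 3, state_length = 1) +
  # smooth the movement of the points
  ease_aes("cubic-in-out") +
  # leave a short fading trail behind each point (afterglow)
  shadow_wake(wake_length = 0.05) +
  # show the name of the dataset currently displayed
  labs(title = "{closest_state}")

# rendering is done with animate(anim); a renderer such as gifski is needed to produce a GIF
```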

Conclusion

Never trust summary statistics alone; always visualize your data | Alberto Cairo

Authors

  • Alberto Cairo, (creator)
  • Justin Matejka
  • George Fitzmaurice
  • Lucy McGowan

From this post