DatasauRus
This guided practical will demonstrate that the tidyverse allows you to compute summary statistics and visualize datasets efficiently.
datasauRus package
Check if you have the package datasauRus installed, and load it:
library(datasauRus)
- should return nothing. If the error there is no package called 'datasauRus' appears, it means that the package needs to be installed. Use this:
install.packages("datasauRus")
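A small sketch to make the install step conditional, using nothing beyond base R: requireNamespace() checks for the package without attaching it.
# install datasauRus only if it is not already present
if (!requireNamespace("datasauRus", quietly = TRUE)) {
  install.packages("datasauRus")
}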
Explore the dataset
Since we are dealing with a tibble, we can simply type its name:
datasaurus_dozen
# A tibble: 1,846 × 3
dataset x y
<chr> <dbl> <dbl>
1 dino 55.4 97.2
2 dino 51.5 96.0
3 dino 46.2 94.5
4 dino 42.8 91.4
5 dino 40.8 88.3
6 dino 38.7 84.9
7 dino 35.6 79.9
8 dino 33.1 77.6
9 dino 29.0 74.5
10 dino 26.2 71.4
# ℹ 1,836 more rows
Only the first 10 rows are displayed.
What are the dimensions of this dataset? Rows and columns?
- base version, using either dim(), or ncol() and nrow()
# dim() returns the dimensions of the data frame, i.e. the number of rows and columns
dim(datasaurus_dozen)
[1] 1846 3
# ncol() only number of columns
ncol(datasaurus_dozen)
[1] 3
# nrow() only number of rows
nrow(datasaurus_dozen)
[1] 1846
- tidyverse version
# Nothing to be done: a tibble displays its dimensions on its first line, prefixed by a comment ('#') character
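If you want the dimensions plus the column types at a glance, one option (assuming the tidyverse is loaded) is glimpse():
# transposed preview: dimensions, column names, types and first values
glimpse(datasaurus_dozen)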
Assign the datasaurus_dozen to the name ds_dozen. This populates the Global Environment
ds_dozen <- datasaurus_dozen
How many datasets are present?
# n_distinct() counts the unique elements in a given vector.
# we use summarise() to return only the desired column, named n here.
# we use English verbs and no subsetting characters, nor do we change dimensions (we keep a tibble)
summarise(ds_dozen, n = n_distinct(dataset))
# A tibble: 1 × 1
n
<int>
1 13
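For comparison, a base R sketch of the same count; it returns a bare integer instead of a tibble:
# unique() deduplicates the vector, length() counts what remains
length(unique(ds_dozen$dataset))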
- Even better, compute and display the number of lines per dataset
count(ds_dozen, dataset)
# A tibble: 13 × 2
dataset n
<chr> <int>
1 away 142
2 bullseye 142
3 circle 142
4 dino 142
5 dots 142
6 h_lines 142
7 high_lines 142
8 slant_down 142
9 slant_up 142
10 star 142
11 v_lines 142
12 wide_lines 142
13 x_shape 142
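The base R counterpart of count(), for reference, is table(); it returns a named table rather than a tibble:
# number of rows per value of dataset
table(ds_dozen$dataset)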
Check summary statistics per dataset
Compute the mean of the x and y columns. For this, you need to group_by() the appropriate column and then summarise():
ds_dozen |>
  group_by(dataset) |>
  summarise(mean_x = mean(x),
            mean_y = mean(y))
# A tibble: 13 × 3
dataset mean_x mean_y
<chr> <dbl> <dbl>
1 away 54.3 47.8
2 bullseye 54.3 47.8
3 circle 54.3 47.8
4 dino 54.3 47.8
5 dots 54.3 47.8
6 h_lines 54.3 47.8
7 high_lines 54.3 47.8
8 slant_down 54.3 47.8
9 slant_up 54.3 47.8
10 star 54.3 47.8
11 v_lines 54.3 47.8
12 wide_lines 54.3 47.8
13 x_shape 54.3 47.8
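As a side note, the same result can be obtained without group_by(), using the .by argument of summarise() (available since dplyr 1.1.0); this per-operation grouping is used later in this practical:
# grouping scoped to this one summarise() call, no ungroup() needed
ds_dozen |>
  summarise(mean_x = mean(x),
            mean_y = mean(y),
            .by = dataset)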
Compute both mean and standard deviation (sd) in one go using across()
ds_dozen |>
  # across() takes first the columns to select, and second what to perform on the selection
  # 2 possibilities to select the columns:
  # summarise(across(where(is.double), list(mean = mean, sd = sd)))
  # by default since v1.0.5, grouped variables are excluded from across()
  # summarise(across(everything(), list(mean = mean, sd = sd)))
  # we can use the new .by argument instead of a group_by()
  summarise(across(c(x, y), list(mean = mean, sd = sd)), .by = dataset)
# A tibble: 13 × 5
dataset x_mean x_sd y_mean y_sd
<chr> <dbl> <dbl> <dbl> <dbl>
1 dino 54.3 16.8 47.8 26.9
2 away 54.3 16.8 47.8 26.9
3 h_lines 54.3 16.8 47.8 26.9
4 v_lines 54.3 16.8 47.8 26.9
5 x_shape 54.3 16.8 47.8 26.9
6 star 54.3 16.8 47.8 26.9
7 high_lines 54.3 16.8 47.8 26.9
8 dots 54.3 16.8 47.8 26.9
9 circle 54.3 16.8 47.8 26.9
10 bullseye 54.3 16.8 47.8 26.9
11 slant_up 54.3 16.8 47.8 26.9
12 slant_down 54.3 16.8 47.8 26.9
13 wide_lines 54.3 16.8 47.8 26.9
An alternative to across(), using pivoting:
ds_dozen |>
  pivot_longer(cols = c(x, y),
               # to get all x first, then all y, instead of x/y interleaved
               cols_vary = "slowest",
               names_to = "variables",
               values_to = "values") |>
  summarise(means = mean(values),
            sds = sd(values),
            .by = c(dataset, variables)) |>
  print(n = Inf)
# A tibble: 26 × 4
dataset variables means sds
<chr> <chr> <dbl> <dbl>
1 dino x 54.3 16.8
2 away x 54.3 16.8
3 h_lines x 54.3 16.8
4 v_lines x 54.3 16.8
5 x_shape x 54.3 16.8
6 star x 54.3 16.8
7 high_lines x 54.3 16.8
8 dots x 54.3 16.8
9 circle x 54.3 16.8
10 bullseye x 54.3 16.8
11 slant_up x 54.3 16.8
12 slant_down x 54.3 16.8
13 wide_lines x 54.3 16.8
14 dino y 47.8 26.9
15 away y 47.8 26.9
16 h_lines y 47.8 26.9
17 v_lines y 47.8 26.9
18 x_shape y 47.8 26.9
19 star y 47.8 26.9
20 high_lines y 47.8 26.9
21 dots y 47.8 26.9
22 circle y 47.8 26.9
23 bullseye y 47.8 26.9
24 slant_up y 47.8 26.9
25 slant_down y 47.8 26.9
26 wide_lines y 47.8 26.9
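To get back to one row per dataset, pivot_wider() undoes the lengthening. A sketch of the round trip, which reproduces the across() layout (columns named means_x, sds_x, and so on):
ds_dozen |>
  pivot_longer(cols = c(x, y),
               names_to = "variables",
               values_to = "values") |>
  summarise(means = mean(values),
            sds = sd(values),
            .by = c(dataset, variables)) |>
  # one column per variables/statistic combination
  pivot_wider(names_from = variables,
              values_from = c(means, sds))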
Compute the Pearson correlation between x and y per dataset.
# "pearson" is the default method of cor(), but it is worth making it explicit
summarise(ds_dozen, pearson_cor = cor(x, y, method = "pearson"), .by = dataset)
# A tibble: 13 × 2
dataset pearson_cor
<chr> <dbl>
1 dino -0.0645
2 away -0.0641
3 h_lines -0.0617
4 v_lines -0.0694
5 x_shape -0.0656
6 star -0.0630
7 high_lines -0.0685
8 dots -0.0603
9 circle -0.0683
10 bullseye -0.0686
11 slant_up -0.0686
12 slant_down -0.0690
13 wide_lines -0.0666
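A base R version of the same computation, for comparison: split() chunks the rows per dataset and sapply() applies cor() to each chunk, returning a named vector.
sapply(split(ds_dozen, ds_dozen$dataset),
       function(d) cor(d$x, d$y))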
Perform a linear model of y explained by x per dataset
Correlation is easy enough, as cor() takes vectors as input and returns a double. A linear model is more complex to perform per dataset, since the R syntax lm(y ~ x, data = dino) expects one data frame per model.
One elegant solution is to use functional programming and nesting. Combining this with broom allows a nice conversion of the list of model outputs to rectangular tibbles.
ds_dozen |>
  nest(points = c(x, y), .by = dataset) |>
  mutate(lm_model = map(points, \(tbl) lm(y ~ x, data = tbl)),
         glance_lm = map(lm_model, broom::glance),
         r_squared = map_dbl(glance_lm, \(tbl) pull(tbl, r.squared)))
# A tibble: 13 × 5
dataset points lm_model glance_lm r_squared
<chr> <list> <list> <list> <dbl>
1 dino <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00416
2 away <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00411
3 h_lines <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00381
4 v_lines <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00482
5 x_shape <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00430
6 star <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00396
7 high_lines <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00469
8 dots <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00364
9 circle <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00467
10 bullseye <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00470
11 slant_up <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00471
12 slant_down <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00476
13 wide_lines <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00443
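A quick cross-check without fitting any model: for a simple linear regression with a single predictor, R^2 is the square of the Pearson correlation, so the r_squared column can be reproduced directly:
# R^2 of a one-predictor lm() equals cor()^2
summarise(ds_dozen, r_squared = cor(x, y)^2, .by = dataset)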
What can you conclude?
All means, standard deviations and correlations are identical across the 13 datasets. Only the R^2 values differ, and only slightly.
Plot the datasauRus
Plot the ds_dozen with ggplot such that the aesthetics are aes(x = x, y = y), with the geometry geom_point():
ggplot(ds_dozen, aes(x = x, y = y)) +
geom_point()
Reuse the above command, now colored by the dataset column:
ggplot(ds_dozen,
aes(x = x,
y = y,
colour = dataset)) +
geom_point()
Too many datasets are displayed. Plot one dataset per facet.
ds_dozen |>
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(vars(dataset))
Tweak the theme: use theme_void() and remove the legend.
ds_dozen |>
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  theme_void() +
  theme(legend.position = "none") +
  facet_wrap(vars(dataset), ncol = 3)
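To inspect a single shape in isolation, one possible sketch is to filter() before plotting; the dataset name "dino" is taken from the count table above:
ds_dozen |>
  filter(dataset == "dino") |>
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  theme_void()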
Are the datasets actually that similar?
No ;) We were fooled by the summary stats
Animation
Plots can be animated; see for example what can be done with gganimate. Instead of panels, each dataset becomes a state, and the transitions between states are smoothed with an afterglow effect.
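A minimal sketch of such an animation, assuming gganimate and a renderer such as gifski are installed; transition_states() morphs the points across datasets and shadow_wake() adds the afterglow:
library(gganimate)
ggplot(ds_dozen, aes(x = x, y = y)) +
  geom_point() +
  # one animation state per dataset, smooth transitions in between
  transition_states(dataset, transition_length = 3, state_length = 1) +
  # short fading trail behind the moving points (the afterglow)
  shadow_wake(wake_length = 0.05)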
Conclusion
Never trust summary statistics alone; always visualize your data | Alberto Cairo
Authors
- Alberto Cairo, (creator)
- Justin Matejka
- George Fitzmaurice
- Lucy McGowan
From this post