library(datasauRus)
DatasauRus
This guided practical will demonstrate that the tidyverse allows us to compute summary statistics and visualize datasets efficiently. The dataset is already stored in a tidy tibble; cleaning steps will come in future practicals.
These kinds of questions are optional.
datasauRus package
Check that you have the datasauRus package installed: the library(datasauRus) call above should return nothing.
If the error "there is no package called ‘datasauRus’" appears, it means that the package needs to be installed. Use this:
install.packages("datasauRus")
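As a side note (not part of the original exercise), base R's requireNamespace() lets you test for a package programmatically: it returns FALSE instead of raising an error when the package is missing. A minimal sketch:

```r
# requireNamespace() returns TRUE/FALSE instead of raising an error,
# so it can be used as a condition before installing
has_pkg <- requireNamespace("datasauRus", quietly = TRUE)
if (!has_pkg) {
  message('Run install.packages("datasauRus") first')
}
```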
Explore the dataset
Since we are dealing with a tibble, we can simply type its name to print it:
datasaurus_dozen
# A tibble: 1,846 × 3
dataset x y
<chr> <dbl> <dbl>
1 dino 55.4 97.2
2 dino 51.5 96.0
3 dino 46.2 94.5
4 dino 42.8 91.4
5 dino 40.8 88.3
6 dino 38.7 84.9
7 dino 35.6 79.9
8 dino 33.1 77.6
9 dino 29.0 74.5
10 dino 26.2 71.4
# ℹ 1,836 more rows
Only the first 10 rows are displayed.
What are the dimensions of this dataset? Rows and columns?
- base version, using either dim(), ncol() or nrow()
# dim() returns the dimensions of the data frame, i.e number of rows and columns
dim(datasaurus_dozen)
[1] 1846 3
# ncol() only number of columns
ncol(datasaurus_dozen)
[1] 3
# nrow() only number of rows
nrow(datasaurus_dozen)
[1] 1846
- tidyverse version
# Nothing to be done: a tibble displays its dimensions in its header (the line starting with the '#' character)
Assign datasaurus_dozen to the name ds_dozen. This populates the Global Environment.
ds_dozen <- datasaurus_dozen
How many datasets are present?
# n_distinct counts the unique elements in a given vector.
# we use summarise to return only the desired column named n here.
# we use English verbs and no subsetting characters, nor do we change dimensions (we keep a tibble)
summarise(ds_dozen, n = n_distinct(dataset))
# A tibble: 1 × 1
n
<int>
1 13
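Under the hood, dplyr's n_distinct(x) is essentially a faster length(unique(x)); a tiny base-R illustration on made-up labels:

```r
# made-up labels, standing in for the dataset column
datasets <- c("dino", "dino", "away", "star", "away")
# dplyr's n_distinct(datasets) is equivalent to:
length(unique(datasets))
# [1] 3
```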
- Even better, compute and display the number of lines per dataset.
The function count() in dplyr does the group_by() by the specified column followed by summarise(n = n()), which returns the number of observations per defined group.
count(ds_dozen, dataset)
# A tibble: 13 × 2
dataset n
<chr> <int>
1 away 142
2 bullseye 142
3 circle 142
4 dino 142
5 dots 142
6 h_lines 142
7 high_lines 142
8 slant_down 142
9 slant_up 142
10 star 142
11 v_lines 142
12 wide_lines 142
13 x_shape 142
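For comparison, base R's table() produces the same per-group tally that count() wraps in a tibble; a sketch on made-up labels:

```r
# made-up labels, standing in for the dataset column
labels <- c("dino", "away", "dino", "star", "dino")
# table() tallies the occurrences of each unique value
tab <- table(labels)
tab[["dino"]]  # number of rows labelled "dino"
```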
Check summary statistics per dataset
Compute the mean of the x & y columns. For this, you need to group_by() the appropriate column and then summarise().
In summarise() you can define as many new columns as you wish; no need to call it once per variable.
ds_dozen |>
  group_by(dataset) |>
summarise(mean_x = mean(x),
mean_y = mean(y))
# A tibble: 13 × 3
dataset mean_x mean_y
<chr> <dbl> <dbl>
1 away 54.3 47.8
2 bullseye 54.3 47.8
3 circle 54.3 47.8
4 dino 54.3 47.8
5 dots 54.3 47.8
6 h_lines 54.3 47.8
7 high_lines 54.3 47.8
8 slant_down 54.3 47.8
9 slant_up 54.3 47.8
10 star 54.3 47.8
11 v_lines 54.3 47.8
12 wide_lines 54.3 47.8
13 x_shape 54.3 47.8
Compute both mean and standard deviation (sd) in one go using across()
ds_dozen |>
  # across() takes first the columns to select, and second what to compute on that selection
# 2 possibilities to select columns
# summarise(across(where(is.double), list(mean = mean, sd = sd)))
# by default since v1.0.5, grouped variables are excluded from across
# summarise(across(everything(), list(mean = mean, sd = sd)))
# we can use the new .by argument instead of a group_by()
summarise(across(c(x, y), list(mean = mean, sd = sd)), .by = dataset)
# A tibble: 13 × 5
dataset x_mean x_sd y_mean y_sd
<chr> <dbl> <dbl> <dbl> <dbl>
1 dino 54.3 16.8 47.8 26.9
2 away 54.3 16.8 47.8 26.9
3 h_lines 54.3 16.8 47.8 26.9
4 v_lines 54.3 16.8 47.8 26.9
5 x_shape 54.3 16.8 47.8 26.9
6 star 54.3 16.8 47.8 26.9
7 high_lines 54.3 16.8 47.8 26.9
8 dots 54.3 16.8 47.8 26.9
9 circle 54.3 16.8 47.8 26.9
10 bullseye 54.3 16.8 47.8 26.9
11 slant_up 54.3 16.8 47.8 26.9
12 slant_down 54.3 16.8 47.8 26.9
13 wide_lines 54.3 16.8 47.8 26.9
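For readers without dplyr at hand, base R's aggregate() can compute several statistics per group in one call; a minimal sketch on synthetic data (not the datasauRus data):

```r
# base-R sketch of per-group mean and sd, on synthetic data
set.seed(1)
df <- data.frame(dataset = rep(c("a", "b"), each = 5),
                 x = rnorm(10),
                 y = rnorm(10))
# aggregate() applies FUN to each column on the left-hand side,
# split by the grouping variable on the right-hand side
res <- aggregate(cbind(x, y) ~ dataset, data = df,
                 FUN = function(v) c(mean = mean(v), sd = sd(v)))
res
```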
An alternative to across() using pivoting:
ds_dozen |>
  pivot_longer(cols = c(x, y),
# to get all x first, then the y instead of x/y mingled
cols_vary = "slowest",
names_to = "variables",
values_to = "values") |>
summarise(means = mean(values),
sds = sd(values),
.by = c(dataset, variables)) |>
print(n = Inf)
# A tibble: 26 × 4
dataset variables means sds
<chr> <chr> <dbl> <dbl>
1 dino x 54.3 16.8
2 away x 54.3 16.8
3 h_lines x 54.3 16.8
4 v_lines x 54.3 16.8
5 x_shape x 54.3 16.8
6 star x 54.3 16.8
7 high_lines x 54.3 16.8
8 dots x 54.3 16.8
9 circle x 54.3 16.8
10 bullseye x 54.3 16.8
11 slant_up x 54.3 16.8
12 slant_down x 54.3 16.8
13 wide_lines x 54.3 16.8
14 dino y 47.8 26.9
15 away y 47.8 26.9
16 h_lines y 47.8 26.9
17 v_lines y 47.8 26.9
18 x_shape y 47.8 26.9
19 star y 47.8 26.9
20 high_lines y 47.8 26.9
21 dots y 47.8 26.9
22 circle y 47.8 26.9
23 bullseye y 47.8 26.9
24 slant_up y 47.8 26.9
25 slant_down y 47.8 26.9
26 wide_lines y 47.8 26.9
Compute the Pearson correlation between x and y per dataset.
# "pearson" is cor()'s default method, but it is worth making it explicit
summarise(ds_dozen, pearson_cor = cor(x, y, method = "pearson"), .by = dataset)
# A tibble: 13 × 2
dataset pearson_cor
<chr> <dbl>
1 dino -0.0645
2 away -0.0641
3 h_lines -0.0617
4 v_lines -0.0694
5 x_shape -0.0656
6 star -0.0630
7 high_lines -0.0685
8 dots -0.0603
9 circle -0.0683
10 bullseye -0.0686
11 slant_up -0.0686
12 slant_down -0.0690
13 wide_lines -0.0666
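The same per-group correlation can be done in base R with split() and sapply(); a sketch on synthetic data (not the datasauRus data):

```r
# base-R sketch: split the rows by group, correlate within each piece
set.seed(42)
df <- data.frame(dataset = rep(c("a", "b"), each = 20),
                 x = rnorm(40))
df$y <- df$x + rnorm(40)
cors <- sapply(split(df, df$dataset),
               function(d) cor(d$x, d$y, method = "pearson"))
cors  # named vector, one correlation per group
```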
Perform a linear model of y explained by x per dataset
Correlation is easy enough, as cor() takes vectors as input and returns a double. A linear model is trickier to run per group, because the R syntax lm(y ~ x, data = dino) needs a formula plus a data frame.
One elegant solution is to use functional programming and nesting. Combining this with broom allows a clean conversion of the list model output into rectangular tibbles.
ds_dozen |>
  nest(.by = dataset) |>
  mutate(lm = map(data, \(tbl) lm(y ~ x, data = tbl)),
         glance_lm = map(lm, broom::glance),
         r_squared = map_dbl(glance_lm, \(x) pull(x, r.squared)))
# A tibble: 13 × 5
dataset data lm glance_lm r_squared
<chr> <list> <list> <list> <dbl>
1 dino <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00416
2 away <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00411
3 h_lines <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00381
4 v_lines <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00482
5 x_shape <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00430
6 star <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00396
7 high_lines <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00469
8 dots <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00364
9 circle <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00467
10 bullseye <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00470
11 slant_up <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00471
12 slant_down <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00476
13 wide_lines <tibble [142 × 2]> <lm> <tibble [1 × 12]> 0.00443
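If you prefer to avoid purrr and broom, the same per-group fit can be sketched in base R with split() and lapply()/sapply(); the example below uses synthetic data, not the datasauRus data:

```r
# base-R sketch of a per-group linear fit, on synthetic data
set.seed(7)
df <- data.frame(dataset = rep(c("a", "b"), each = 15),
                 x = rnorm(30))
df$y <- 2 * df$x + rnorm(30)
r2 <- sapply(split(df, df$dataset), function(d) {
  fit <- lm(y ~ x, data = d)  # y explained by x, within one group
  summary(fit)$r.squared
})
# for a simple linear regression, R^2 equals the squared Pearson
# correlation, which is why the r_squared values above are just the
# per-dataset correlations squared
stopifnot(isTRUE(all.equal(
  r2[["a"]],
  cor(df$x[df$dataset == "a"], df$y[df$dataset == "a"])^2
)))
```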
What can you conclude?
All means, standard deviations and correlations are the same for the 13 datasets. Only the R^2 values differ, and only slightly.
Plot the datasauRus
Plot the ds_dozen with ggplot such that the aesthetics are aes(x = x, y = y) and the geometry is geom_point().
The ggplot() and geom_point() functions must be linked with a + sign.
ggplot(ds_dozen, aes(x = x, y = y)) +
geom_point()
Reuse the above command, now coloured by the dataset column.
ggplot(ds_dozen,
aes(x = x,
y = y,
colour = dataset)) +
geom_point()
Too many datasets are displayed.
Plot one dataset per facet.
ds_dozen |>
  ggplot(aes(x = x, y = y, colour = dataset)) +
geom_point() +
facet_wrap(vars(dataset))
Tweak the theme: use theme_void() and remove the legend.
ds_dozen |>
  ggplot(aes(x = x, y = y, colour = dataset)) +
geom_point() +
theme_void() +
theme(legend.position = "none") +
facet_wrap(vars(dataset), ncol = 3)
Are the datasets actually that similar?
No ;) We were fooled by the summary stats
Animation
Plots can be animated; see for example what can be done with gganimate. Instead of panels, states are defined across datasets and the transitions are smoothed with an afterglow effect.
Conclusion
Never trust summary statistics alone; always visualize your data | Alberto Cairo
Authors
- Alberto Cairo (creator)
- Justin Matejka
- George Fitzmaurice
- Lucy McGowan
From this post