This guided practical demonstrates how the tidyverse lets you compute summary statistics and visualize datasets efficiently. The dataset is already stored in a tidy tibble; cleaning steps will come in future practicals.
Check that the datasauRus package is installed by loading it:
library(datasauRus)
If the error there is no package called ‘datasauRus’ appears, the package needs to be installed. Use this:
install.packages("datasauRus")
Since we are dealing with a tibble, we can just type its name; only the first 10 rows are displayed.
datasaurus_dozen
dataset | x | y |
---|---|---|
dino | 55.3846 | 97.1795 |
dino | 51.5385 | 96.0256 |
dino | 46.1538 | 94.4872 |
dino | 42.8205 | 91.4103 |
dino | 40.7692 | 88.3333 |
dino | 38.7179 | 84.8718 |
dino | 35.6410 | 79.8718 |
dino | 33.0769 | 77.5641 |
dino | 28.9744 | 74.4872 |
dino | 26.1538 | 71.4103 |
The dimensions of the dataset can be checked with dim(), ncol() and nrow().
# dim() returns the dimensions of the data frame, i.e. the number of rows and columns
dim(datasaurus_dozen)
## [1] 1846 3
# ncol() returns only the number of columns
ncol(datasaurus_dozen)
## [1] 3
# nrow() returns only the number of rows
nrow(datasaurus_dozen)
## [1] 1846
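To see how the three functions relate, here is a minimal sketch on a hypothetical three-row data frame named toy (it is not part of datasauRus and only serves as an illustration). dim() returns the number of rows first and the number of columns second, so its elements should match nrow() and ncol().

```r
# Hypothetical toy data frame, 3 rows and 2 columns
toy <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))

# dim() gives rows first, then columns
dims <- dim(toy)

# Each element of dim() matches the dedicated function
rows_match <- dims[1] == nrow(toy)
cols_match <- dims[2] == ncol(toy)
print(c(rows_match, cols_match))
```

The same relation holds for tibbles such as datasaurus_dozen.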
A tibble also displays its dimensions when printed, on a line starting with a comment (‘#’ character).
Assign datasaurus_dozen to the ds_dozen name. This aims at populating the Global Environment.
ds_dozen <- datasaurus_dozen
unique(ds_dozen$dataset) %>% length()
## [1] 13
# n_distinct counts the unique elements in a given vector.
# we use summarise to return only the desired column named n here.
summarise(ds_dozen, n = n_distinct(dataset))
## # A tibble: 1 x 1
## n
## <int>
## 1 13
Count the observations per dataset. count() in dplyr does the group_by() by the specified column + summarise(n = n()), which returns the number of observations per defined group.
count(ds_dozen, dataset)
## # A tibble: 13 x 2
## dataset n
## <chr> <int>
## 1 away 142
## 2 bullseye 142
## 3 circle 142
## 4 dino 142
## 5 dots 142
## 6 h_lines 142
## 7 high_lines 142
## 8 slant_down 142
## 9 slant_up 142
## 10 star 142
## 11 v_lines 142
## 12 wide_lines 142
## 13 x_shape 142
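The equivalence between count() and the group_by() + summarise(n = n()) pair can be checked on a small example. This is only a sketch: the toy tibble and its two groups are hypothetical and chosen with unequal sizes so the counts are easy to read.

```r
library(dplyr)

# Hypothetical toy data: group "a" has 3 rows, group "b" has 2
toy <- tibble(
  dataset = c("a", "a", "a", "b", "b"),
  x = c(1, 2, 3, 4, 5)
)

# Shortcut: a single call
by_count <- count(toy, dataset)

# Long form: group first, then count the rows of each group
by_summarise <- toy %>%
  group_by(dataset) %>%
  summarise(n = n())

# Both results hold the same groups and the same counts
same_groups <- identical(by_count$dataset, by_summarise$dataset)
same_counts <- identical(by_count$n, by_summarise$n)
print(c(same_groups, same_counts))
```

count() is simply the shorter way to write the same operation, which is why it is used here.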
Compute the mean of the x & y columns. For this, you need to group_by() the appropriate column and then summarise(). In summarise() you can define as many new columns as you wish. No need to call it for every single variable.
ds_dozen %>%
group_by(dataset) %>%
summarise(mean_x = mean(x),
mean_y = mean(y))
dataset | mean_x | mean_y |
---|---|---|
away | 54.26610 | 47.83472 |
bullseye | 54.26873 | 47.83082 |
circle | 54.26732 | 47.83772 |
dino | 54.26327 | 47.83225 |
dots | 54.26030 | 47.83983 |
h_lines | 54.26144 | 47.83025 |
high_lines | 54.26881 | 47.83545 |
slant_down | 54.26785 | 47.83590 |
slant_up | 54.26588 | 47.83150 |
star | 54.26734 | 47.83955 |
v_lines | 54.26993 | 47.83699 |
wide_lines | 54.26692 | 47.83160 |
x_shape | 54.26015 | 47.83972 |
Compute both the mean and the standard deviation in one go with across()
ds_dozen %>%
group_by(dataset) %>%
# across() takes first the columns to work on and second the function(s) to apply to that selection
# two possibilities to select columns: by type (commented out below) or by name
# summarise(across(where(is.double), list(mean = mean, sd = sd)))
summarise(across(c(x, y), list(mean = mean, sd = sd)))
dataset | x_mean | x_sd | y_mean | y_sd |
---|---|---|---|---|
away | 54.26610 | 16.76983 | 47.83472 | 26.93974 |
bullseye | 54.26873 | 16.76924 | 47.83082 | 26.93573 |
circle | 54.26732 | 16.76001 | 47.83772 | 26.93004 |
dino | 54.26327 | 16.76514 | 47.83225 | 26.93540 |
dots | 54.26030 | 16.76774 | 47.83983 | 26.93019 |
h_lines | 54.26144 | 16.76590 | 47.83025 | 26.93988 |
high_lines | 54.26881 | 16.76670 | 47.83545 | 26.94000 |
slant_down | 54.26785 | 16.76676 | 47.83590 | 26.93610 |
slant_up | 54.26588 | 16.76885 | 47.83150 | 26.93861 |
star | 54.26734 | 16.76896 | 47.83955 | 26.93027 |
v_lines | 54.26993 | 16.76996 | 47.83699 | 26.93768 |
wide_lines | 54.26692 | 16.77000 | 47.83160 | 26.93790 |
x_shape | 54.26015 | 16.76996 | 47.83972 | 26.93000 |
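The column names in the table above (x_mean, x_sd, ...) come from how across() combines the column name with the name given to each function in the list. The sketch below illustrates this on a hypothetical toy tibble, and also shows the .names argument of across(), which is not used in this practical but lets you change that naming pattern.

```r
library(dplyr)

# Hypothetical toy data with two groups
toy <- tibble(
  dataset = c("a", "a", "b", "b"),
  x = c(1, 3, 10, 20),
  y = c(2, 4, 30, 50)
)

# Default naming: column name, underscore, function name
stats <- toy %>%
  group_by(dataset) %>%
  summarise(across(c(x, y), list(mean = mean, sd = sd)))
print(names(stats))

# Optional: swap the order with the .names glue pattern
renamed <- toy %>%
  group_by(dataset) %>%
  summarise(across(c(x, y), list(mean = mean, sd = sd),
                   .names = "{.fn}_{.col}"))
print(names(renamed))
```

Naming the list elements (mean = mean, sd = sd) is what makes the resulting column names readable.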
Plot ds_dozen with ggplot such that the aesthetics are aes(x = x, y = y) with the geometry geom_point(). The ggplot() and geom_point() functions must be linked with a + sign.
ggplot(ds_dozen, aes(x = x, y = y)) +
geom_point()
Colour the points per dataset column.
ggplot(ds_dozen, aes(x = x, y = y, colour = dataset)) +
geom_point()
Too many datasets are displayed at once. Keep only one of them with filter(), here away.
ds_dozen %>%
filter(dataset == "away") %>%
ggplot(aes(x = x, y = y)) +
geom_point()
Use %in% to test whether the left operand matches any element of the right one (most likely a vector).
ds_dozen %>%
filter(dataset %in% c("away", "dino")) %>%
# alternative without %in% and using OR (|)
#filter(dataset == "away" | dataset == "dino") %>%
ggplot(aes(x = x, y = y, colour = dataset)) +
geom_point()
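What %in% returns can be inspected outside of filter(). The following sketch uses a short hypothetical character vector (not the real dataset column) and compares %in% with the alternative written with == and OR (|).

```r
# Hypothetical vector standing in for the dataset column
dataset <- c("away", "dino", "star", "dino", "circle")

# %in% returns one TRUE/FALSE per element of the left operand
keep_in <- dataset %in% c("away", "dino")

# Same test spelled out with == and OR
keep_or <- dataset == "away" | dataset == "dino"

# Both logical vectors are identical
print(identical(keep_in, keep_or))

# Elements that filter() would keep
print(dataset[keep_in])
```

The %in% form stays short when more values are added to the right-hand vector, whereas the OR form needs one extra comparison per value.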
Display one dataset per facet.
ds_dozen %>%
filter(dataset %in% c("away", "dino")) %>%
ggplot(aes(x = x, y = y, colour = dataset)) +
geom_point() +
facet_wrap(~ dataset)
ds_dozen %>%
ggplot(aes(x = x, y = y, colour = dataset)) +
geom_point() +
facet_wrap(~ dataset, ncol = 3)
Apply theme_void and remove the legend.
ggplot(ds_dozen, aes(x = x, y = y, colour = dataset)) +
geom_point() +
theme_void() +
theme(legend.position = "none") +
facet_wrap(~ dataset, ncol = 3)
If gifski can be installed on your machine, it makes the GIF creation faster. gifski is internally written in Rust, and building it requires cargo, the Rust build tool. See this article to get it installed on your machine. Install Rust first, before installing the R package gifski. Please note that the animate() step still takes about 3-5 minutes depending on your machine.
Install gganimate; its dependencies will be installed automatically.
install.packages("gganimate")
Pass the dataset variable as the argument of the transition_states() layer.
library(gganimate)
ds_dozen %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
# transition will be made using the dataset column
transition_states(dataset, transition_length = 5, state_length = 2) +
# for a firework effect!
shadow_wake(wake_length = 0.05) +
labs(title = "dataset: {closest_state}") +
theme_void(14) +
theme(legend.position = "none") -> ds_anim
# more frames to slow down the animation
ds_gif <- animate(ds_anim, nframes = 500, fps = 10, renderer = gifski_renderer())
ds_gif
anim_save(title_frame = TRUE, "./img/ds.gif")
ds_dozen %>%
group_by(dataset) %>%
summarise(across(c(x, y), list(mean = mean, sd = sd))) %>%
ggplot(aes(x = x_mean, y = y_mean, colour = dataset)) +
geom_point(size = 25, alpha = 0.6) +
# zoom in like crazy
coord_cartesian(xlim = c(54.25, 54.3), ylim = c(47.75, 47.9)) +
# animate
transition_states(dataset, transition_length = 5, state_length = 2) +
# do not remove previous states to pile up dots
shadow_mark() +
labs(title = "dataset: {closest_state}") +
theme_minimal(14) +
theme(legend.position = "none") -> ds_mean_anim
ds_mean_gif <- animate(ds_mean_anim, nframes = 100, fps = 10)
ds_mean_gif
anim_save("img/ds_mean.gif")
never trust summary statistics alone; always visualize your data | Alberto Cairo