This guided practical demonstrates how the tidyverse lets you compute summary statistics and visualize datasets efficiently. The dataset used here is already stored in a tidy tibble; cleaning steps will come in future practicals.

These kinds of questions are optional.

datasauRus package

  • Check whether the package datasauRus is installed:
library(datasauRus)
  • This should return nothing. If the message "there is no package called ‘datasauRus’" appears, the package needs to be installed. Use this:
install.packages("datasauRus")

Explore the dataset

Since we are dealing with a tibble, we can just type

datasaurus_dozen

Only the first 10 rows are displayed.

dataset x y
dino 55.3846 97.1795
dino 51.5385 96.0256
dino 46.1538 94.4872
dino 42.8205 91.4103
dino 40.7692 88.3333
dino 38.7179 84.8718
dino 35.6410 79.8718
dino 33.0769 77.5641
dino 28.9744 74.4872
dino 26.1538 71.4103
What are the dimensions of this dataset? Rows and columns?
  • base version, using either dim(), or ncol() and nrow()

  • tidyverse version
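Both versions can be sketched as follows. The choice of glimpse() for the tidyverse version is an assumption; simply printing the tibble also reports its dimensions in the header.

```r
library(datasauRus)
library(dplyr)

# Base R: dim() returns the number of rows, then the number of columns
dim(datasaurus_dozen)

# nrow() and ncol() return each dimension separately
nrow(datasaurus_dozen)
ncol(datasaurus_dozen)

# tidyverse: glimpse() reports rows, columns, and the type of each column
glimpse(datasaurus_dozen)
```

The dataset has 1846 rows and 3 columns.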

Assign datasaurus_dozen to the name ds_dozen. This populates the Global Environment.
In RStudio, those dimensions are now also reported within the interface. Where?
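A minimal sketch of the assignment. Once it has run, the Environment pane of RStudio lists ds_dozen together with its number of observations and variables.

```r
library(datasauRus)

# Create a copy under a shorter name; it now appears in the Global Environment
ds_dozen <- datasaurus_dozen
```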

How many datasets are present?

  • base version

Tip

You want to count the number of unique elements in the column dataset. The function unique() returns the distinct elements of a vector, and length() returns the length of a vector, such as that vector of unique elements.
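Following the tip, one possible base R solution combines unique() and length():

```r
library(datasauRus)

ds_dozen <- datasaurus_dozen

# Extract the dataset column, keep its unique values, and count them
length(unique(ds_dozen$dataset))
```

This returns 13.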
  • tidyverse version
# n_distinct counts the unique elements in a given vector.
# we use summarise to return only the desired column named n here.
summarise(ds_dozen, n = n_distinct(dataset))
## # A tibble: 1 x 1
##       n
##   <int>
## 1    13
  • an even better way: compute and display the number of rows per dataset

Tip

The function count() in dplyr performs a group_by() on the specified column followed by summarise(n = n()), which returns the number of observations per group.
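As described in the tip, count() and the explicit group_by() plus summarise() give the same table. A short sketch:

```r
library(datasauRus)
library(dplyr)

ds_dozen <- datasaurus_dozen

# Shortcut: one row per dataset with its number of observations in column n
per_dataset <- count(ds_dozen, dataset)
per_dataset

# Equivalent long form
ds_dozen %>%
  group_by(dataset) %>%
  summarise(n = n())
```

The result has one row per dataset, so the number of datasets can also be read from the tibble header.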

Check summary statistics per dataset

Compute the mean of the x and y columns. For this, you need to group_by() the appropriate column and then summarise().

Tip

In summarise() you can define as many new columns as you wish. There is no need to call it separately for every variable.
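One way to write this, grouping by the dataset column. The output column names mean_x and mean_y are arbitrary choices made for this sketch.

```r
library(datasauRus)
library(dplyr)

ds_dozen <- datasaurus_dozen

# One row per dataset, with both means computed in a single summarise() call
ds_means <- ds_dozen %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x),
            mean_y = mean(y))
ds_means
```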
Compute both mean and standard deviation (sd) in one go using across()
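A possible solution with across(): the first argument selects the columns, and the named list supplies the functions to apply. With the default naming scheme, the resulting columns are called x_mean, x_sd, y_mean and y_sd.

```r
library(datasauRus)
library(dplyr)

ds_dozen <- datasaurus_dozen

# Apply mean() and sd() to both x and y, per dataset
ds_stats <- ds_dozen %>%
  group_by(dataset) %>%
  summarise(across(c(x, y), list(mean = mean, sd = sd)))
ds_stats
```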
What can you conclude?

Plot the datasauRus

Plot ds_dozen with ggplot() such that the aesthetics are aes(x = x, y = y)

with the geometry geom_point()

Tip

The ggplot() and geom_point() calls must be linked with a + sign.
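Put together, the plot described above can be sketched as:

```r
library(datasauRus)
library(ggplot2)

ds_dozen <- datasaurus_dozen

# Map x and y to the axes, then add a point layer with +
p <- ggplot(ds_dozen, aes(x = x, y = y)) +
  geom_point()
p
```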
Reuse the above command, now coloring the points by the dataset column
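Coloring by dataset only requires one more aesthetic mapping. A minimal sketch:

```r
library(datasauRus)
library(ggplot2)

ds_dozen <- datasaurus_dozen

# colour = dataset assigns one color per dataset and adds a legend
p <- ggplot(ds_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point()
p
```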

Too many datasets are displayed.

How can we plot only one at a time?

Tip

You can filter for one dataset upstream of plotting
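For instance, filtering for the dino dataset before plotting. The choice of "dino" is only an example; any value of the dataset column works.

```r
library(datasauRus)
library(dplyr)
library(ggplot2)

ds_dozen <- datasaurus_dozen

# Keep only the rows of one dataset, then pipe the result into ggplot()
dino <- filter(ds_dozen, dataset == "dino")

p <- dino %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point()
p
```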
Adjust the filtering step to plot two datasets

Tip

R provides the infix operator %in% to test whether each element of the left operand matches any element of the right one (most often a vector).
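A sketch using %in% to keep two datasets. The pair "dino" and "star" is an arbitrary choice for illustration.

```r
library(datasauRus)
library(dplyr)
library(ggplot2)

ds_dozen <- datasaurus_dozen

# %in% returns TRUE for every row whose dataset is one of the listed names
two_sets <- filter(ds_dozen, dataset %in% c("dino", "star"))

p <- two_sets %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point()
p
```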
Expand now by getting one dataset per facet
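One way to get one dataset per facet is facet_wrap(), shown here on the same two example datasets ("dino" and "star", an arbitrary choice):

```r
library(datasauRus)
library(dplyr)
library(ggplot2)

ds_dozen <- datasaurus_dozen

# facet_wrap() draws one panel per value of the dataset column
p <- ds_dozen %>%
  filter(dataset %in% c("dino", "star")) %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(vars(dataset))
p
```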
Remove the filtering step to facet all datasets
Tweak the theme: use theme_void() and remove the legend
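Without the filtering step, all datasets are faceted. In this sketch theme_void() is added before the legend is removed, because a complete theme added afterwards would reset the legend setting. The number of facet columns (ncol = 3) is an arbitrary layout choice.

```r
library(datasauRus)
library(ggplot2)

ds_dozen <- datasaurus_dozen

p <- ggplot(ds_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(vars(dataset), ncol = 3) +
  # theme_void() strips axes, grid lines and background
  theme_void() +
  # the panel titles already name each dataset, so the legend is redundant
  theme(legend.position = "none")
p
```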
Are the datasets actually that similar?

Tip

The R package gifski can be installed on your machine to make GIF creation faster. gifski is written in Rust, and building it requires cargo, the Rust build tool. See this article to get it installed on your machine. Install Rust first, then install the R package gifski. Please note that the animate() step still takes about 3 to 5 minutes, depending on your machine.
Install gganimate; its dependencies will be installed automatically.
Pass the dataset variable to the transition_states() layer
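A minimal sketch of the animation, assuming gganimate is installed. The transition_length and state_length values and the title showing the current dataset are choices made for this example, not requirements of the exercise.

```r
library(datasauRus)
library(ggplot2)
library(gganimate)

ds_dozen <- datasaurus_dozen

# Each dataset becomes one state; points move smoothly from one state to the next
p <- ggplot(ds_dozen, aes(x = x, y = y)) +
  geom_point() +
  transition_states(dataset, transition_length = 5, state_length = 2) +
  # {closest_state} is replaced by the name of the dataset being shown
  labs(title = "{closest_state}")

# Rendering is slow, so it is left commented out here:
# animate(p)
```

Printing p or calling animate(p) renders the animation.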
Visualize the tiny differences in means for both coordinates
  • You need to zoom in tremendously to see the differences. Accumulate all states to better see the motion.
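One possible approach, sketched under assumptions: compute the per-dataset means first, so that the axes automatically zoom to their tiny range, and use shadow_mark() from gganimate to keep the points of past states visible. Whether the original solution used shadow_mark() is not stated.

```r
library(datasauRus)
library(dplyr)
library(ggplot2)
library(gganimate)

ds_dozen <- datasaurus_dozen

# One point per dataset: the mean of x and the mean of y
ds_means <- ds_dozen %>%
  group_by(dataset) %>%
  summarise(across(c(x, y), mean))

p <- ggplot(ds_means, aes(x = x, y = y, colour = dataset)) +
  geom_point(size = 3) +
  transition_states(dataset) +
  # keep the points of previous states on screen so they accumulate
  shadow_mark()

# Rendering is left commented out because it can be slow:
# animate(p)
```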

Conclusion

never trust summary statistics alone; always visualize your data | Alberto Cairo

Authors

  • Alberto Cairo (creator)
  • Justin Matejka
  • George Fitzmaurice
  • Lucy McGowan

from this post