This guided practical demonstrates that the tidyverse lets you compute summary statistics and visualize datasets efficiently. The dataset is already stored in a tidy tibble; cleaning steps will come in future practicals.
datasauRus package
Check that you have the package `datasauRus` installed:
library(datasauRus)
If the message there is no package called ‘datasauRus’ appears, the package needs to be installed. Use this:
install.packages("datasauRus")
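The check and the installation can also be combined in a single step. The sketch below assumes a CRAN mirror is configured; `requireNamespace()` is base R and only tests whether the package is available, without attaching it.

```r
# Install datasauRus only when it is missing, then attach it.
if (!requireNamespace("datasauRus", quietly = TRUE)) {
  install.packages("datasauRus")
}
library(datasauRus)
```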
Since we are dealing with a tibble, we can simply type its name:
datasaurus_dozen
Only the first 10 rows are displayed.
| dataset | x | y |
|---|---|---|
| dino | 55.3846 | 97.1795 |
| dino | 51.5385 | 96.0256 |
| dino | 46.1538 | 94.4872 |
| dino | 42.8205 | 91.4103 |
| dino | 40.7692 | 88.3333 |
| dino | 38.7179 | 84.8718 |
| dino | 35.6410 | 79.8718 |
| dino | 33.0769 | 77.5641 |
| dino | 28.9744 | 74.4872 |
| dino | 26.1538 | 71.4103 |
base version, using either `dim()`, or `ncol()` and `nrow()`, to get the dimensions of the tibble
tidyverse version
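Both versions could look like the sketch below. The choice of `glimpse()` for the tidyverse side is an assumption: it is one function that reports rows and columns, not necessarily the one intended by the practical.

```r
library(datasauRus)
library(dplyr)  # part of the tidyverse

# base version: dim() returns the number of rows, then of columns
dim(datasaurus_dozen)
nrow(datasaurus_dozen)
ncol(datasaurus_dozen)

# tidyverse version (assumed): glimpse() prints the dimensions
# together with the type and first values of each column
glimpse(datasaurus_dozen)
```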
Assign `datasaurus_dozen` to the `ds_dozen` name. This aims at populating the Global Environment.
`$` applied to a data.frame subsets the column and converts the 2D structure to 1D, i.e. a vector. The function `length()` returns the length of a vector, such as the vector of unique elements.
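In base R, the assignment and the count of unique datasets can be sketched as follows. The intermediate names `datasets` and `n_datasets` are only illustrative.

```r
library(datasauRus)

# Assign to a shorter name; ds_dozen now shows up in the Global Environment
ds_dozen <- datasaurus_dozen

# `$` extracts the dataset column as a plain vector (2D -> 1D)
datasets <- ds_dozen$dataset

# unique() keeps one copy of each value, length() counts them
n_datasets <- length(unique(datasets))
n_datasets
```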
# n_distinct counts the unique elements in a given vector.
# we use summarise to return only the desired column named n here.
# we use English verbs and no subsetting characters, nor do we change dimensions (we keep a tibble)
summarise(ds_dozen, n = n_distinct(dataset))
## # A tibble: 1 × 1
## n
## <int>
## 1 13
Count the observations per `dataset`.
`count()` in dplyr does the `group_by()` on the specified column plus `summarise(n = n())`, which returns the number of observations per defined group.
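To see the equivalence, here is a small sketch comparing the shortcut with its longer form. The names `per_dataset` and `long_form` are illustrative.

```r
library(datasauRus)
library(dplyr)

ds_dozen <- datasaurus_dozen

# shortcut: one row per dataset, with the number of rows in column n
per_dataset <- count(ds_dozen, dataset)

# long form: explicit grouping followed by summarise()
long_form <- ds_dozen %>%
  group_by(dataset) %>%
  summarise(n = n())

per_dataset
# compare the two results
all.equal(as.data.frame(per_dataset), as.data.frame(long_form))
```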
Compute summary statistics for the `x` & `y` columns. For this, you need to `group_by()` the appropriate column and then `summarise()`.
In `summarise()` you can define as many new columns as you wish. No need to call it for every single variable.
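A minimal sketch of this step, assuming the mean and standard deviation are the statistics of interest (the practical may ask for others). The column names `mean_x`, `mean_y` and the object `ds_stats` are illustrative. The second call uses `across()` from dplyr (version 1.0 or later) to apply the same functions to both columns at once.

```r
library(datasauRus)
library(dplyr)

ds_dozen <- datasaurus_dozen

# one summarise() call, several new columns
ds_dozen %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x), mean_y = mean(y))

# same idea with across(): every function is applied to every column,
# and the results are named <column>_<function>
ds_stats <- ds_dozen %>%
  group_by(dataset) %>%
  summarise(across(c(x, y), list(mean = mean, sd = sd)))
ds_stats
```

The summary values come out very close from one dataset to the next, which is the point of this collection.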
To apply the same summary functions to several columns in one call, use `across()`.
Plot `ds_dozen` with ggplot such that the aesthetics are `aes(x = x, y = y)`, with the geometry `geom_point()`.
The `ggplot()` and `geom_point()` functions must be linked with a `+` sign.
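The scatter plot described above can be sketched as follows. Storing the plot in `p` is optional and only done here for illustration.

```r
library(datasauRus)
library(ggplot2)

ds_dozen <- datasaurus_dozen

# ggplot() sets up data and aesthetics, `+` adds the point layer
p <- ggplot(ds_dozen, aes(x = x, y = y)) +
  geom_point()
p
```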
Colour the points by the `dataset` column.
Too many datasets are displayed.
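Assuming the task is to map the colour aesthetic to `dataset`, the plot could be written as below. The name `p_col` is illustrative.

```r
library(datasauRus)
library(ggplot2)

ds_dozen <- datasaurus_dozen

# colour is mapped inside aes(), so each dataset gets its own colour
p_col <- ggplot(ds_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point()
p_col
```

With all the datasets overlaid in a single panel, the individual shapes are hard to tell apart.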
Use `%in%` to test whether there is a match between the left operand and the right one (most probably a vector).
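A short sketch of `%in%` inside `filter()`, used to reduce the number of datasets displayed. The three names kept here are an arbitrary choice for illustration, and `keep` and `ds_subset` are hypothetical names.

```r
library(datasauRus)
library(dplyr)
library(ggplot2)

ds_dozen <- datasaurus_dozen

# arbitrary selection of datasets to keep
keep <- c("dino", "away", "star")

# %in% returns TRUE for each row whose dataset value is found in keep
ds_subset <- filter(ds_dozen, dataset %in% keep)

ggplot(ds_subset, aes(x = x, y = y, colour = dataset)) +
  geom_point()
```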
Display one `dataset` per facet.
Apply `theme_void` and remove the legend.
Install `gganimate`; its dependencies will be automatically installed.
If `gifski` can be installed on your machine, it makes the GIF creation faster. `gifski` is internally written in Rust, and building it requires cargo, the Rust build tool. See this article to get it installed on your machine. Install Rust first, before installing the R package `gifski`. Please note that the `animate()` step still takes ~3-5 minutes depending on your machine.
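The faceting, theming and animation steps above can be sketched as follows. The `transition_length` and `state_length` values, the title and the output file name are assumptions chosen for illustration. The rendering calls are left commented out because they take several minutes.

```r
library(datasauRus)
library(ggplot2)

ds_dozen <- datasaurus_dozen

# one panel per dataset, no axes or background, no legend
p_facet <- ggplot(ds_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(vars(dataset)) +
  theme_void() +
  theme(legend.position = "none")  # must come after theme_void()
p_facet

# animated version: one state per dataset instead of one facet
if (requireNamespace("gganimate", quietly = TRUE)) {
  library(gganimate)
  anim <- ggplot(ds_dozen, aes(x = x, y = y)) +
    geom_point() +
    # timing values are illustrative assumptions
    transition_states(dataset, transition_length = 5, state_length = 2) +
    labs(title = "dataset: {closest_state}")
  # Rendering is slow, uncomment to build and save the GIF:
  # animate(anim)
  # anim_save("datasaurus.gif")
}
```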
Pass the `dataset` variable as the argument of the `transition_states()` layer.
never trust summary statistics alone; always visualize your data | Alberto Cairo
Authors
from this post