Principles of tidy data

Rectangular data

Roland Krause

R Workshop

Monday, 10 February 2025

Session set-up

Learning objectives

Learn the principles of tidy data to structure data in tables
Identify errors in existing data sets

Comments

Focus on only tidy data
Introduction of tidyr ways
tidyr provides functions to clean and tidy them but …
cleaning data also requires dplyr for most practical purposes.

A definition of tidy data

What are we talking about?

Variables: A quantity, quality, or property that you can measure.
Observations: A set of values that display the relationship between variables. To be an observation, values need to be measured under similar conditions, usually measured on the same observational unit at the same time.
Values: The state of a variable that you observe when you measure it.

source: Garret Grolemund and vignette("tidy-data")

What is tidy data

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

Tidying messy data

For clean data ensure that

Column names are easy to use and informative
Row names are easy to use and informative
Obvious mistakes in the data have been removed
Variable values are internally consistent
Appropriate transformed variables have been added

On row names…

Tibbles don’t have row names…
Consider having a variable or groups of variables that characterize the observation
For data base applications we discuss keys

Bad data exercise

Question

The following table lists data from two questionnaires –stai and rec– recorded in different languages.
What’s wrong with the Excel sheet?
Which problems are tidy issues?

Tidy errors

Tidy errors

Error	Tidy violation	Comment
Person name	No	Data protection violation
Identical column names	Yes	Variable error
Inconsistent variables names	No	Bad practice
Non-English columns names	No	Bad practice
Color coding	No	The horror, the horror
Inconsistent dates	No	Use ISO8601
Multiple columns for one item	Yes	One observation per line
Redundant information	Yes	Each variable is in its own column
Repeated rows	Yes	Each observation is in its own row
Missing coding	Yes/No	Each value in its own cell
Unnecessary information (Birthdate, comments)	No	Bad practice
Name of the table	No	Bad practice

Can we clean that table programmatically?

readxl::read_excel("data/bad-table-psych.xlsx", 
                         skip = 2,
                         n_max = 12,
                   .name_repair = "universal") |> 
  select(-Name, -Birthday, -Comment) |> 
  select(where(~ !(all(is.na(.)) | all(. == "")))) |> 
  rename_with(\(x) str_to_lower(x))  |> 
  rename_with(\(x) str_replace_all(x, "\\.+", "_"))  |> 
  rename( "test-time" = `_9`,
          "rec-lang" = language_11) |> 
  separate_wider_delim(cols = test_reaction,              
                       delim = '=',                       
                       names = c("test", "reaction")) |>  
  mutate(gender = str_extract(gender, '^.'),
         language_native = 
           case_when(str_detect(language_4, "^[gG]") ~ "deu",
                     str_detect(language_4, "^[fF]") ~ "fra")) |> 
  rowid_to_column() |> 
  separate_longer_delim(stai,  delim = ",") |> 
  select(rowid, starts_with("stai"), everything()) |> 
  separate_wider_delim(stai,                              
                       names = c("stai_key", "stai_val"), 
                       delim = "=") |>                    
  mutate(stai_val = as.integer(stai_val),
         stai_key = str_c("stai_", trimws(stai_key))) |> 
  pivot_wider( names_from = stai_key, values_from = stai_val )

# A tibble: 12 × 20
   rowid    id gender language_4  test    reaction `test-time` `rec-lang` rec_1
   <int> <dbl> <chr>  <chr>       <chr>   <chr>          <dbl> <chr>      <dbl>
 1     1     1 m      German      dog     3                 92 d              1
 2     2     2 m      German      cat     3                122 d              4
 3     3     2 m      GErman      cat     3                123 d              2
 4     4     2 m      German      cat     3                122 d              3
 5     5     4 f      German      dog     2                112 d              3
 6     6     5 m      French      trolley 2                 99 <NA>           4
 7     7     5 <NA>   <NA>        <NA>    <NA>              NA <NA>           4
 8     8     7 f      Französisch trolley 2                101 f              4
 9     9     8 f      Französisch trolley 1                140 f              1
10    10     9 m      Deutsch     dog     1                138 d              3
11    11    10 m      Deutsch     dog     3                142 d              4
12    12    11 w      Deutsch     dog     5                 99 d              4
# ℹ 11 more variables: rec_2 <dbl>, rec_3 <dbl>, language_16 <chr>,
#   language_native <chr>, stai_1 <int>, stai_2 <int>, stai_3 <int>,
#   stai_6 <int>, stai_5 <int>, stai_4 <int>, `NA` <int>

The `tidyr` package functionality

Common tidy data violations

Multiple variables stored in one column (e.g tidyr::separate_* functions)
Column headers are values, not variable names (tidyr::pivot_* functions)
Variables are stored in both rows and columns (tidyr::pivot_* functions)
Repeated observations (nest or separate tables)
Inconsistent data from manual entry (stringr, dplyr)
Multiple types in one table (dplyr, data transformation)

Tidy data solutions

Reshapes tables to become longer or wider , either adding rows or columns respectively
Basic column-wise manipulation (separate)
- now after stringr
Reshaping operations (pivot)
- after dplyr
Nesting and advanced collapsing operations (nesting)
- after broom

tidyr 1.3.1

Major update of tidyr for column shaping in recent 1.3.0.
Functions separate() and extract() are superseded by more specific functions.
These older functions will be deprecated in 2024 .

Before we stop

Tidy data core

http://tidyr.tidyverse.org/
vignette("tidy-data")

Further reading

Rectangling - Taming JSON and other data into rectangles
- https://tidyr.tidyverse.org/dev/articles/rectangle.html

Acknowledgments

Hadley Wickham (Tidy concepts)
Alison Hill (tidyr)
Artwork by Allison Horst

Principles of tidy data

Session set-up

A definition of tidy data

Tidying messy data

For clean data ensure that

On row names…

Bad data exercise

Tidy errors

Can we clean that table programmatically?

The tidyr package functionality

Before we stop

Thank you for your attention!

The `tidyr` package functionality