Principles of tidy data

Rectangular data

Roland Krause

Rworkshop

Tuesday, 6 February 2024

Session set-up

Learning objectives

  • Learn the principles of tidy data to structure data in tables
  • Identify errors in existing data sets

Comments

  • Focus on only tidy data
  • Introduction of tidyr ways
  • tidyr provides functions to clean and tidy them but …
  • cleaning data also requires dplyr for most practical purposes.

A definition of tidy data

What are we talking about?

  • Variables: A quantity, quality, or property that you can measure.
  • Observations: A set of values that display the relationship between variables. To be an observation, values need to be measured under similar conditions, usually measured on the same observational unit at the same time.
  • Values: The state of a variable that you observe when you measure it.

source: Garret Grolemund and vignette("tidy-data")

What is tidy data

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Tidying messy data

For clean data ensure that

  • Column names are easy to use and informative
  • Row names are easy to use and informative
  • Obvious mistakes in the data have been removed
  • Variable values are internally consistent
  • Appropriate transformed variables have been added

On row names…

  • Tibbles don’t have row names…
  • Consider having a variable or groups of variables that characterize the observation
  • For data base applications we discuss keys

Bad data exercise

Question

  • The following table lists data from two questionnaires –stai and rec– recorded in different languages.
  • What’s wrong with the Excel sheet?
  • Which problems are tidy issues?

Tidy errors

Tidy errors

Error Tidy violation Comment
Person name No Data protection violation
Identical column names Yes Variable error
Inconsistent variables names No Bad practice
Non-English columns names No Bad practice
Color coding No The horror, the horror
Inconsistent dates No Use ISO8601
Multiple columns for one item Yes One observation per line
Redundant information Yes Each variable is in its own column
Repeated rows Yes Each observation is in its own row
Missing coding Yes/No Each value in its own cell
Unnecessary information (Birthdate, comments) No Bad practice
Name of the table No Bad practice

Can we clean that table programmatically?

readxl::read_excel("data/bad-table-psych.xlsx", 
                         skip = 2,
                         n_max = 12,
                   .name_repair = "universal") |> 
  select(-Name, -Birthday, -Comment) |> 
  select(where(~ !(all(is.na(.)) | all(. == "")))) |> 
  rename_with(\(x) str_to_lower(x))  |> 
  rename_with(\(x) str_replace_all(x, "\\.+", "_"))  |> 
  rename( "test-time" = `_9`,
          "rec-lang" = language_11) |> 
  separate_wider_delim(cols = test_reaction,              #<<
                       delim = '=',                       #<<
                       names = c("test", "reaction")) |>  #<<
  mutate(gender = str_extract(gender, '^.'),
         language_native = 
           case_when(str_detect(language_4, "^[gG]") ~ "deu",
                     str_detect(language_4, "^[fF]") ~ "fra")) |> 
  rowid_to_column() |> 
  separate_longer_delim(stai,  delim = ",") |> #<<
  select(rowid, starts_with("stai"), everything()) |> 
  separate_wider_delim(stai,                              #<<
                       names = c("stai_key", "stai_val"), #<<
                       delim = "=") |>                    #<<
  mutate(stai_val = as.integer(stai_val),
         stai_key = str_c("stai_", trimws(stai_key))) |> 
  pivot_wider( names_from = stai_key, values_from = stai_val )  #<< 
# A tibble: 12 × 20
   rowid    id gender language_4  test    reaction `test-time` `rec-lang` rec_1
   <int> <dbl> <chr>  <chr>       <chr>   <chr>          <dbl> <chr>      <dbl>
 1     1     1 m      German      dog     3                 92 d              1
 2     2     2 m      German      cat     3                122 d              4
 3     3     2 m      GErman      cat     3                123 d              2
 4     4     2 m      German      cat     3                122 d              3
 5     5     4 f      German      dog     2                112 d              3
 6     6     5 m      French      trolley 2                 99 <NA>           4
 7     7     5 <NA>   <NA>        <NA>    <NA>              NA <NA>           4
 8     8     7 f      Französisch trolley 2                101 f              4
 9     9     8 f      Französisch trolley 1                140 f              1
10    10     9 m      Deutsch     dog     1                138 d              3
11    11    10 m      Deutsch     dog     3                142 d              4
12    12    11 w      Deutsch     dog     5                 99 d              4
# ℹ 11 more variables: rec_2 <dbl>, rec_3 <dbl>, language_16 <chr>,
#   language_native <chr>, stai_1 <int>, stai_2 <int>, stai_3 <int>,
#   stai_6 <int>, stai_5 <int>, stai_4 <int>, `NA` <int>

The tidyr package functionality

Common tidy data violations

  • Multiple variables stored in one column (e.g tidyr::separate)
  • Column headers are values, not variable names (tidyr::pivot)
  • Variables are stored in both rows and columns (tidyr::pivot)
  • Repeated observations (nest or separate tables)
  • Inconsistent data from manual entry (stringr, dplyr)
  • Multiple types in one table (dplyr, data transformation)

Tidy data solutions

  • Reshapes tables to become longer or wider , either adding rows or columns respectively
  • Basic column-wise manipulation (separate)
    • now after stringr
  • Reshaping operations (pivot)
    • after dplyr
  • Nesting and advanced collapsing operations (nesting)
    • after broom

tidyr 1.3.1

  • Major update of tidyr for column shaping in recent 1.3.0.
  • Functions separate() and extract() are superseded by more specific functions.
  • These older functions will be deprecated in 2024 .

Before we stop

Tidy data core

Further reading

Acknowledgments

  • Hadley Wickham (Tidy concepts)
  • Alison Hill (tidyr)
  • Artwork by Allison Horst

Thank you for your attention!