class: title-slide
# Principles of tidy data ## An introduction to tabular data structures .center[<img src="https://raw.githubusercontent.com/tidyverse/tidyr/master/man/figures/logo.png" width="100px"/>] ### Roland Krause | rworkshop | 2021-09-08 --- ## Session set-up .pull-left[ .bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.gbox[Learning objectives]] .float-img.mt3[
] + Principles of *tidy data* to structure data in tables + Find errors in existing data sets + Structure data programmatically ] ] -- .pull-right[ .flex[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[Comments
]] + Cleaning data also requires `dplyr` + `tidyr` and `dplyr` are intertwined + Focus on only *tidy data* + Introduction of `tidyr` ways + Other `tidyr` functionality will be covered in `dplyr` and `broom` lectures. ] ]] --- # Typical flow of data .flex[ .w-25.bg-washed-maroon.b--red.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.gbox[Source data ➡️ ]] * Experimental data * External data sets * Manually collected data and meta data ] .w-25.bg-washed-green.b--lawngreen.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.gbox[Intermediate ➡️ ]] * Derived data * Computatation * Manual curation * **Tidy data** .float-img.center[
] ] .w-25.bg-aqua.b--blue.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[Analysis ➡️ ]] * Exploratory analysis * Statistical models * Hypothesis testing .float-img.center[
] ] .w-25.bg-washed-green.b--blue.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[Manuscript ➡️ ]] * Can you reproduce your work? + All numbers + Summaries + Images .float-img.center[
] ]] -- .bg-washed-green.b--blue.ba.bw2.br3.shadow-5.ph3.mt3[ .large[.bbox[One workflow ]] * No editing of data at any step * All code needed to reproduce from one ingestions to manuscript coded and repeatable ☀ ] ---  .footnote[Credit: Artwork by [Allison Horst](https://github.com/allisonhorst)] --- ## A definition of tidy data .pull-left[ .bg-washed-yellow.b--green.ba.bw2.br3.shadow-5.ph3.mr1[ .large[.gbox[Basic consideration]] + **Variable**: A quantity, quality, or property that you can measure. + **Observation**: A set of values that display the relationship between variables. To be an observation, values need to be measured under similar conditions, usually measured on the same observational unit at the same time. + **Value**: The state of a variable that you observe when you measure it. [source: Garret Grolemund](http://garrettgman.github.io/tidying/) and `vignette("tidy-data")` ] ] -- .pull-right[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mr1[ .large[.bbox[What is tidy data]] > 1. Each variable forms a column. > 2. Each observation forms a row. > 3. Each type of observational unit forms a table.  ]] --- ## Tidying messy data .flex[ .w-70.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[For clean data ensure that]] > - Column names are easy to use and informative > - Row names are easy to use and informative > - Obvious mistakes in the data have been removed > - Variable values are internally consistent > - Appropriate transformed variables have been added ]] --- class: slide-practical #
Bad data exercise .w-70.bg-washed-yellow.b--blue.ba.bw2.br3.shadow-5.ph3.mt3[ - The following table lists data from two questionnaires --`stai` and `rec`-- recorded in different languages. - What's wrong with [the Excel](https://biostat2.uni.lu/practicals/data/bad-table-psych.xlsx) sheet? - Which problems are `tidy` issues?] .w-70.bg-washed-yellow.b--green.ba.bw2.br3.shadow-5.ph3.mt3[  ] --- ## Tidy errors Error | Tidy violation | Comment ------|----------------|-------- Person name | No | Data protection violation Identical column names | Yes | Variable error Inconsistent variables names | No | Bad practice Non-English columns names | No | Bad practice Color coding | No | The horror, the horror Inconsistent dates |No | Use ISO8601 Multiple columns for one item | Yes | One observation per line Redundant information | Yes | Each variable is in its own column Repeated rows | Yes | Each observation is in its own row Missing coding | Yes/No | Each value in its own cell Unnecessary information (Birthdate, comments) | No | Bad practice Name of the table | No | Bad practice --- class: slide-practical ## Data cleaning exercise .pull-left[ .bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.gbox[Clean the "bad table"]] * Bring data into shape such that it conforms to tidy data requirements * Pay attention to details of format, less to actual data * Do not use R manipulaiont for doing the manipulations, just capture the results. ]] .pull-right[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ ### `tribble()` A handy way to jot down tibbles. ```r clean_data <- tribble( ~family, ~patient, "F1", 101, "F1", 102, "F2", 201 ) clean_data ``` ``` # A tibble: 3 × 2 family patient <chr> <dbl> 1 F1 101 2 F1 102 3 F2 201 ``` ]] --- class: hide_logo --- class: hide_logo # A possible solution ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% select(-Name, -Birthday, Comment) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% # OR janitor::remove_empty("cols") %>% janitor::clean_names() %>% rename( "test-time" = x9, "rec-lang" = language_11) %>% separate(`test_reaction`, into = c("test", "reaction")) %>% mutate(gender = str_extract(gender, '^.'), language_native = case_when(str_detect(language_4, "^[gG]") ~ "deu", str_detect(language_4, "^[fF]") ~ "fra")) %>% rowid_to_column() %>% separate_rows(stai, sep = ",") %>% select(rowid, starts_with("stai"), everything()) %>% separate(stai, into = c("stai_key", "stai_val"), sep = "=") %>% mutate(stai_val = as.integer(stai_val), stai_key = str_c("stai_", trimws(stai_key))) %>% pivot_wider( names_from = stai_key, values_from = stai_val ) %>% select(rowid, id, starts_with("stai"), everything()) ``` ``` # A tibble: 12 × 21 rowid id stai_1 stai_2 stai_3 stai_6 stai_5 stai_4 gender language_4 test <int> <dbl> <int> <int> <int> <int> <int> <int> <chr> <chr> <chr> 1 1 1 5 4 3 7 NA NA m German dog 2 2 2 5 4 3 NA 7 NA m German cat 3 3 2 5 4 2 NA NA 1 m GErman cat 4 4 2 5 4 3 NA NA 3 m German cat 5 5 4 5 4 4 7 NA NA f German dog 6 6 5 3 3 3 NA NA 2 m French trol… 7 7 5 NA NA NA NA NA NA <NA> <NA> <NA> 8 8 7 3 3 3 NA NA 4 f Französis… trol… 9 9 8 5 4 3 7 NA NA f Französis… trol… 10 10 9 5 4 3 NA NA 0 m Deutsch dog 11 11 10 5 4 3 NA 1 1 m Deutsch dog 12 12 11 2 4 3 NA 3 4 w Deutsch dog # … with 10 more variables: reaction <chr>, test-time <dbl>, rec-lang <chr>, # rec_1 <dbl>, rec_2 <dbl>, rec_3 <dbl>, language_16 <chr>, comment <chr>, # language_native <chr>, NA <int> ``` count: false # A possible solution .panel1-nosolution-auto[ ```r *readxl::read_excel("data/bad-table-psych.xlsx", * skip = 2, * n_max = 12) ``` ] .panel2-nosolution-auto[ ``` # A tibble: 12 × 18 Name ID Gender Language...4 Birthday ...6 `test-reaction` <chr> <dbl> <chr> <chr> <chr> <lgl> <chr> 1 Roland Krause 1 male German 38687 NA dog=3 2 Lars Juhl Krause 2 m German 26/2/1992 NA cat=3 3 <NA> 2 m GErman 26/2/1992 NA cat=3 4 <NA> 2 male German 26/2/1992 NA cat=3 5 Martina Krause 4 f German 12-03-2002 NA dog=2 6 Horst-Sergio Krause 5 m French 12/11/70 NA trolley=2 7 Horst-Sergio Krause 5 <NA> <NA> 11-12-70 NA <NA> 8 LJK 7 f Französisch <NA> NA trolley=2 9 DFK 8 f Französisch 2005-12-10 NA trolley=1 10 Horst Mond 9 male Deutsch 1972-10-01 NA dog=1 11 Kevin Mond 10 männlich Deutsch 5.4.1970 NA dog=3 12 Inga Mond 11 weiblich Deutsch 1/1/1990 NA dog=5 # … with 11 more variables: morale (by STAI test) <lgl>, ...9 <dbl>, # STAI <chr>, Language...11 <chr>, ...12 <lgl>, REC-1 <dbl>, Rec-2 <dbl>, # REC-3 <dbl>, Language...16 <chr>, ...17 <lgl>, Comment <chr> ``` ] --- count: false # A possible solution .panel1-nosolution-auto[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% * select(-Name, -Birthday, Comment) ``` ] .panel2-nosolution-auto[ ``` # A tibble: 12 × 16 ID Gender Language...4 ...6 `test-reaction` `morale (by STA… ...9 STAI <dbl> <chr> <chr> <lgl> <chr> <lgl> <dbl> <chr> 1 1 male German NA dog=3 NA 92 1=5,… 2 2 m German NA cat=3 NA 122 1=5,… 3 2 m GErman NA cat=3 NA 123 1=5,… 4 2 male German NA cat=3 NA 122 1=5,… 5 4 f German NA dog=2 NA 112 1=5,… 6 5 m French NA trolley=2 NA 99 1=3,… 7 5 <NA> <NA> NA <NA> NA NA <NA> 8 7 f Französisch NA trolley=2 NA 101 1=3,… 9 8 f Französisch NA trolley=1 NA 140 1=5,… 10 9 male Deutsch NA dog=1 NA 138 1=5,… 11 10 männli… Deutsch NA dog=3 NA 142 1=5,… 12 11 weibli… Deutsch NA dog=5 NA 99 1=2,… # … with 8 more variables: Language...11 <chr>, ...12 <lgl>, REC-1 <dbl>, # Rec-2 <dbl>, REC-3 <dbl>, Language...16 <chr>, ...17 <lgl>, Comment <chr> ``` ] --- count: false # A possible solution .panel1-nosolution-auto[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% select(-Name, -Birthday, Comment) %>% * select(where(~ !(all(is.na(.)) | all(. == "")))) # OR ``` ] .panel2-nosolution-auto[ ``` # A tibble: 12 × 12 ID Gender Language...4 `test-reaction` ...9 STAI Language...11 `REC-1` <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> 1 1 male German dog=3 92 1=5,… d 1 2 2 m German cat=3 122 1=5,… d 4 3 2 m GErman cat=3 123 1=5,… d 2 4 2 male German cat=3 122 1=5,… d 3 5 4 f German dog=2 112 1=5,… d 3 6 5 m French trolley=2 99 1=3,… <NA> 4 7 5 <NA> <NA> <NA> NA <NA> <NA> 4 8 7 f Französisch trolley=2 101 1=3,… f 4 9 8 f Französisch trolley=1 140 1=5,… f 1 10 9 male Deutsch dog=1 138 1=5,… d 3 11 10 männlich Deutsch dog=3 142 1=5,… d 4 12 11 weiblich Deutsch dog=5 99 1=2,… d 4 # … with 4 more variables: Rec-2 <dbl>, REC-3 <dbl>, Language...16 <chr>, # Comment <chr> ``` ] --- count: false # A possible solution .panel1-nosolution-auto[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% select(-Name, -Birthday, Comment) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% # OR * janitor::remove_empty("cols") ``` ] .panel2-nosolution-auto[ ``` # A tibble: 12 × 12 ID Gender Language...4 `test-reaction` ...9 STAI Language...11 `REC-1` <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> 1 1 male German dog=3 92 1=5,… d 1 2 2 m German cat=3 122 1=5,… d 4 3 2 m GErman cat=3 123 1=5,… d 2 4 2 male German cat=3 122 1=5,… d 3 5 4 f German dog=2 112 1=5,… d 3 6 5 m French trolley=2 99 1=3,… <NA> 4 7 5 <NA> <NA> <NA> NA <NA> <NA> 4 8 7 f Französisch trolley=2 101 1=3,… f 4 9 8 f Französisch trolley=1 140 1=5,… f 1 10 9 male Deutsch dog=1 138 1=5,… d 3 11 10 männlich Deutsch dog=3 142 1=5,… d 4 12 11 weiblich Deutsch dog=5 99 1=2,… d 4 # … with 4 more variables: Rec-2 <dbl>, REC-3 <dbl>, Language...16 <chr>, # Comment <chr> ``` ] --- count: false # A possible solution .panel1-nosolution-auto[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% select(-Name, -Birthday, Comment) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% # OR janitor::remove_empty("cols") %>% * janitor::clean_names() ``` ] .panel2-nosolution-auto[ ``` # A tibble: 12 × 12 id gender language_4 test_reaction x9 stai language_11 rec_1 rec_2 <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> 1 1 male German dog=3 92 1=5,2… d 1 1 2 2 m German cat=3 122 1=5,2… d 4 2 3 2 m GErman cat=3 123 1=5,2… d 2 3 4 2 male German cat=3 122 1=5,2… d 3 3 5 4 f German dog=2 112 1=5,2… d 3 3 6 5 m French trolley=2 99 1=3,2… <NA> 4 3 7 5 <NA> <NA> <NA> NA <NA> <NA> 4 2 8 7 f Französisch trolley=2 101 1=3,2… f 4 2 9 8 f Französisch trolley=1 140 1=5,2… f 1 2 10 9 male Deutsch dog=1 138 1=5,2… d 3 3 11 10 männlich Deutsch dog=3 142 1=5,2… d 4 3 12 11 weiblich Deutsch dog=5 99 1=2,2… d 4 2 # … with 3 more variables: rec_3 <dbl>, language_16 <chr>, comment <chr> ``` ] --- count: false # A possible solution .panel1-nosolution-auto[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% select(-Name, -Birthday, Comment) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% # OR janitor::remove_empty("cols") %>% janitor::clean_names() %>% * rename( "test-time" = x9, * "rec-lang" = language_11) ``` ] .panel2-nosolution-auto[ ``` # A tibble: 12 × 12 id gender language_4 test_reaction `test-time` stai `rec-lang` rec_1 <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> 1 1 male German dog=3 92 1=5,2=… d 1 2 2 m German cat=3 122 1=5,2=… d 4 3 2 m GErman cat=3 123 1=5,2=… d 2 4 2 male German cat=3 122 1=5,2=… d 3 5 4 f German dog=2 112 1=5,2=… d 3 6 5 m French trolley=2 99 1=3,2=… <NA> 4 7 5 <NA> <NA> <NA> NA <NA> <NA> 4 8 7 f Französisch trolley=2 101 1=3,2=… f 4 9 8 f Französisch trolley=1 140 1=5,2=… f 1 10 9 male Deutsch dog=1 138 1=5,2=… d 3 11 10 männlich Deutsch dog=3 142 1=5,2=… d 4 12 11 weiblich Deutsch dog=5 99 1=2,2=… d 4 # … with 4 more variables: rec_2 <dbl>, rec_3 <dbl>, language_16 <chr>, # comment <chr> ``` ] --- count: false # A possible solution .panel1-nosolution-auto[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% select(-Name, -Birthday, Comment) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% # OR janitor::remove_empty("cols") %>% janitor::clean_names() %>% rename( "test-time" = x9, "rec-lang" = language_11) %>% * separate(`test_reaction`, into = c("test", "reaction")) ``` ] .panel2-nosolution-auto[ ``` # A tibble: 12 × 13 id gender language_4 test reaction `test-time` stai `rec-lang` rec_1 <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> 1 1 male German dog 3 92 1=5,2… d 1 2 2 m German cat 3 122 1=5,2… d 4 3 2 m GErman cat 3 123 1=5,2… d 2 4 2 male German cat 3 122 1=5,2… d 3 5 4 f German dog 2 112 1=5,2… d 3 6 5 m French trol… 2 99 1=3,2… <NA> 4 7 5 <NA> <NA> <NA> <NA> NA <NA> <NA> 4 8 7 f Französisch trol… 2 101 1=3,2… f 4 9 8 f Französisch trol… 1 140 1=5,2… f 1 10 9 male Deutsch dog 1 138 1=5,2… d 3 11 10 männlich Deutsch dog 3 142 1=5,2… d 4 12 11 weiblich Deutsch dog 5 99 1=2,2… d 4 # … with 4 more variables: rec_2 <dbl>, rec_3 <dbl>, language_16 <chr>, # comment <chr> ``` ] --- count: false # A possible solution .panel1-nosolution-auto[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% select(-Name, -Birthday, Comment) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% # OR janitor::remove_empty("cols") %>% janitor::clean_names() %>% rename( "test-time" = x9, "rec-lang" = language_11) %>% separate(`test_reaction`, into = c("test", "reaction")) %>% * mutate(gender = str_extract(gender, '^.'), * language_native = * case_when(str_detect(language_4, "^[gG]") ~ "deu", * str_detect(language_4, "^[fF]") ~ "fra")) ``` ] .panel2-nosolution-auto[ ``` # A tibble: 12 × 14 id gender language_4 test reaction `test-time` stai `rec-lang` rec_1 <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> 1 1 m German dog 3 92 1=5,2… d 1 2 2 m German cat 3 122 1=5,2… d 4 3 2 m GErman cat 3 123 1=5,2… d 2 4 2 m German cat 3 122 1=5,2… d 3 5 4 f German dog 2 112 1=5,2… d 3 6 5 m French trolley 2 99 1=3,2… <NA> 4 7 5 <NA> <NA> <NA> <NA> NA <NA> <NA> 4 8 7 f Französisch trolley 2 101 1=3,2… f 4 9 8 f Französisch trolley 1 140 1=5,2… f 1 10 9 m Deutsch dog 1 138 1=5,2… d 3 11 10 m Deutsch dog 3 142 1=5,2… d 4 12 11 w Deutsch dog 5 99 1=2,2… d 4 # … with 5 more variables: rec_2 <dbl>, rec_3 <dbl>, language_16 <chr>, # comment <chr>, language_native <chr> ``` ] --- count: false # A possible solution .panel1-nosolution-auto[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% select(-Name, -Birthday, Comment) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% # OR janitor::remove_empty("cols") %>% janitor::clean_names() %>% rename( "test-time" = x9, "rec-lang" = language_11) %>% separate(`test_reaction`, into = c("test", "reaction")) %>% mutate(gender = str_extract(gender, '^.'), language_native = case_when(str_detect(language_4, "^[gG]") ~ "deu", str_detect(language_4, "^[fF]") ~ "fra")) %>% * rowid_to_column() ``` ] .panel2-nosolution-auto[ ``` # A tibble: 12 × 15 rowid id gender language_4 test reaction `test-time` stai `rec-lang` <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> 1 1 1 m German dog 3 92 1=5,2… d 2 2 2 m German cat 3 122 1=5,2… d 3 3 2 m GErman cat 3 123 1=5,2… d 4 4 2 m German cat 3 122 1=5,2… d 5 5 4 f German dog 2 112 1=5,2… d 6 6 5 m French trolley 2 99 1=3,2… <NA> 7 7 5 <NA> <NA> <NA> <NA> NA <NA> <NA> 8 8 7 f Französisch trolley 2 101 1=3,2… f 9 9 8 f Französisch trolley 1 140 1=5,2… f 10 10 9 m Deutsch dog 1 138 1=5,2… d 11 11 10 m Deutsch dog 3 142 1=5,2… d 12 12 11 w Deutsch dog 5 99 1=2,2… d # … with 6 more variables: rec_1 <dbl>, rec_2 <dbl>, rec_3 <dbl>, # language_16 <chr>, comment <chr>, language_native <chr> ``` ] --- count: false # A possible solution .panel1-nosolution-auto[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% select(-Name, -Birthday, Comment) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% # OR janitor::remove_empty("cols") %>% janitor::clean_names() %>% rename( "test-time" = x9, "rec-lang" = language_11) %>% separate(`test_reaction`, into = c("test", "reaction")) %>% mutate(gender = str_extract(gender, '^.'), language_native = case_when(str_detect(language_4, "^[gG]") ~ "deu", str_detect(language_4, "^[fF]") ~ "fra")) %>% rowid_to_column() %>% * separate_rows(stai, sep = ",") ``` ] .panel2-nosolution-auto[ ``` # A tibble: 47 × 15 rowid id gender language_4 test reaction `test-time` stai `rec-lang` <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> 1 1 1 m German dog 3 92 "1=5" d 2 1 1 m German dog 3 92 "2=4" d 3 1 1 m German dog 3 92 " 3=3" d 4 1 1 m German dog 3 92 "6=7" d 5 2 2 m German cat 3 122 "1=5" d 6 2 2 m German cat 3 122 "2=4" d 7 2 2 m German cat 3 122 " 3=3" d 8 2 2 m German cat 3 122 "5=7" d 9 3 2 m GErman cat 3 123 "1=5" d 10 3 2 m GErman cat 3 123 "2=4" d # … with 37 more rows, and 6 more variables: rec_1 <dbl>, rec_2 <dbl>, # rec_3 <dbl>, language_16 <chr>, comment <chr>, language_native <chr> ``` ] --- count: false # A possible solution .panel1-nosolution-auto[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% select(-Name, -Birthday, Comment) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% # OR janitor::remove_empty("cols") %>% janitor::clean_names() %>% rename( "test-time" = x9, "rec-lang" = language_11) %>% separate(`test_reaction`, into = c("test", "reaction")) %>% mutate(gender = str_extract(gender, '^.'), language_native = case_when(str_detect(language_4, "^[gG]") ~ "deu", str_detect(language_4, "^[fF]") ~ "fra")) %>% rowid_to_column() %>% separate_rows(stai, sep = ",") %>% * select(rowid, starts_with("stai"), everything()) ``` ] .panel2-nosolution-auto[ ``` # A tibble: 47 × 15 rowid stai id gender language_4 test reaction `test-time` `rec-lang` <int> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> 1 1 "1=5" 1 m German dog 3 92 d 2 1 "2=4" 1 m German dog 3 92 d 3 1 " 3=3" 1 m German dog 3 92 d 4 1 "6=7" 1 m German dog 3 92 d 5 2 "1=5" 2 m German cat 3 122 d 6 2 "2=4" 2 m German cat 3 122 d 7 2 " 3=3" 2 m German cat 3 122 d 8 2 "5=7" 2 m German cat 3 122 d 9 3 "1=5" 2 m GErman cat 3 123 d 10 3 "2=4" 2 m GErman cat 3 123 d # … with 37 more rows, and 6 more variables: rec_1 <dbl>, rec_2 <dbl>, # rec_3 <dbl>, language_16 <chr>, comment <chr>, language_native <chr> ``` ] --- count: false # A possible solution .panel1-nosolution-auto[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% select(-Name, -Birthday, Comment) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% # OR janitor::remove_empty("cols") %>% janitor::clean_names() %>% rename( "test-time" = x9, "rec-lang" = language_11) %>% separate(`test_reaction`, into = c("test", "reaction")) %>% mutate(gender = str_extract(gender, '^.'), language_native = case_when(str_detect(language_4, "^[gG]") ~ "deu", str_detect(language_4, "^[fF]") ~ "fra")) %>% rowid_to_column() %>% separate_rows(stai, sep = ",") %>% select(rowid, starts_with("stai"), everything()) %>% * separate(stai, into = c("stai_key", "stai_val"), sep = "=") ``` ] .panel2-nosolution-auto[ ``` # A tibble: 47 × 16 rowid stai_key stai_val id gender language_4 test reaction `test-time` <int> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> 1 1 "1" 5 1 m German dog 3 92 2 1 "2" 4 1 m German dog 3 92 3 1 " 3" 3 1 m German dog 3 92 4 1 "6" 7 1 m German dog 3 92 5 2 "1" 5 2 m German cat 3 122 6 2 "2" 4 2 m German cat 3 122 7 2 " 3" 3 2 m German cat 3 122 8 2 "5" 7 2 m German cat 3 122 9 3 "1" 5 2 m GErman cat 3 123 10 3 "2" 4 2 m GErman cat 3 123 # … with 37 more rows, and 7 more variables: rec-lang <chr>, rec_1 <dbl>, # rec_2 <dbl>, rec_3 <dbl>, language_16 <chr>, comment <chr>, # language_native <chr> ``` ] --- count: false # A possible solution .panel1-nosolution-auto[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% select(-Name, -Birthday, Comment) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% # OR janitor::remove_empty("cols") %>% janitor::clean_names() %>% rename( "test-time" = x9, "rec-lang" = language_11) %>% separate(`test_reaction`, into = c("test", "reaction")) %>% mutate(gender = str_extract(gender, '^.'), language_native = case_when(str_detect(language_4, "^[gG]") ~ "deu", str_detect(language_4, "^[fF]") ~ "fra")) %>% rowid_to_column() %>% separate_rows(stai, sep = ",") %>% select(rowid, starts_with("stai"), everything()) %>% separate(stai, into = c("stai_key", "stai_val"), sep = "=") %>% * mutate(stai_val = as.integer(stai_val), * stai_key = str_c("stai_", trimws(stai_key))) ``` ] .panel2-nosolution-auto[ ``` # A tibble: 47 × 16 rowid stai_key stai_val id gender language_4 test reaction `test-time` <int> <chr> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> 1 1 stai_1 5 1 m German dog 3 92 2 1 stai_2 4 1 m German dog 3 92 3 1 stai_3 3 1 m German dog 3 92 4 1 stai_6 7 1 m German dog 3 92 5 2 stai_1 5 2 m German cat 3 122 6 2 stai_2 4 2 m German cat 3 122 7 2 stai_3 3 2 m German cat 3 122 8 2 stai_5 7 2 m German cat 3 122 9 3 stai_1 5 2 m GErman cat 3 123 10 3 stai_2 4 2 m GErman cat 3 123 # … with 37 more rows, and 7 more variables: rec-lang <chr>, rec_1 <dbl>, # rec_2 <dbl>, rec_3 <dbl>, language_16 <chr>, comment <chr>, # language_native <chr> ``` ] --- count: false # A possible solution .panel1-nosolution-auto[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% select(-Name, -Birthday, Comment) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% # OR janitor::remove_empty("cols") %>% janitor::clean_names() %>% rename( "test-time" = x9, "rec-lang" = language_11) %>% separate(`test_reaction`, into = c("test", "reaction")) %>% mutate(gender = str_extract(gender, '^.'), language_native = case_when(str_detect(language_4, "^[gG]") ~ "deu", str_detect(language_4, "^[fF]") ~ "fra")) %>% rowid_to_column() %>% separate_rows(stai, sep = ",") %>% select(rowid, starts_with("stai"), everything()) %>% separate(stai, into = c("stai_key", "stai_val"), sep = "=") %>% mutate(stai_val = as.integer(stai_val), stai_key = str_c("stai_", trimws(stai_key))) %>% * pivot_wider( names_from = stai_key, values_from = stai_val ) ``` ] .panel2-nosolution-auto[ ``` # A tibble: 12 × 21 rowid id gender language_4 test reaction `test-time` `rec-lang` rec_1 <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> 1 1 1 m German dog 3 92 d 1 2 2 2 m German cat 3 122 d 4 3 3 2 m GErman cat 3 123 d 2 4 4 2 m German cat 3 122 d 3 5 5 4 f German dog 2 112 d 3 6 6 5 m French trolley 2 99 <NA> 4 7 7 5 <NA> <NA> <NA> <NA> NA <NA> 4 8 8 7 f Französisch trolley 2 101 f 4 9 9 8 f Französisch trolley 1 140 f 1 10 10 9 m Deutsch dog 1 138 d 3 11 11 10 m Deutsch dog 3 142 d 4 12 12 11 w Deutsch dog 5 99 d 4 # … with 12 more variables: rec_2 <dbl>, rec_3 <dbl>, language_16 <chr>, # comment <chr>, language_native <chr>, stai_1 <int>, stai_2 <int>, # stai_3 <int>, stai_6 <int>, stai_5 <int>, stai_4 <int>, NA <int> ``` ] --- count: false # A possible solution .panel1-nosolution-auto[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% select(-Name, -Birthday, Comment) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% # OR janitor::remove_empty("cols") %>% janitor::clean_names() %>% rename( "test-time" = x9, "rec-lang" = language_11) %>% separate(`test_reaction`, into = c("test", "reaction")) %>% mutate(gender = str_extract(gender, '^.'), language_native = case_when(str_detect(language_4, "^[gG]") ~ "deu", str_detect(language_4, "^[fF]") ~ "fra")) %>% rowid_to_column() %>% separate_rows(stai, sep = ",") %>% select(rowid, starts_with("stai"), everything()) %>% separate(stai, into = c("stai_key", "stai_val"), sep = "=") %>% mutate(stai_val = as.integer(stai_val), stai_key = str_c("stai_", trimws(stai_key))) %>% pivot_wider( names_from = stai_key, values_from = stai_val ) %>% * select(rowid, id, starts_with("stai"), everything()) ``` ] .panel2-nosolution-auto[ ``` # A tibble: 12 × 21 rowid id stai_1 stai_2 stai_3 stai_6 stai_5 stai_4 gender language_4 test <int> <dbl> <int> <int> <int> <int> <int> <int> <chr> <chr> <chr> 1 1 1 5 4 3 7 NA NA m German dog 2 2 2 5 4 3 NA 7 NA m German cat 3 3 2 5 4 2 NA NA 1 m GErman cat 4 4 2 5 4 3 NA NA 3 m German cat 5 5 4 5 4 4 7 NA NA f German dog 6 6 5 3 3 3 NA NA 2 m French trol… 7 7 5 NA NA NA NA NA NA <NA> <NA> <NA> 8 8 7 3 3 3 NA NA 4 f Französis… trol… 9 9 8 5 4 3 7 NA NA f Französis… trol… 10 10 9 5 4 3 NA NA 0 m Deutsch dog 11 11 10 5 4 3 NA 1 1 m Deutsch dog 12 12 11 2 4 3 NA 3 4 w Deutsch dog # … with 10 more variables: reaction <chr>, test-time <dbl>, rec-lang <chr>, # rec_1 <dbl>, rec_2 <dbl>, rec_3 <dbl>, language_16 <chr>, comment <chr>, # language_native <chr>, NA <int> ``` ] <style> .panel1-nosolution-auto { color: black; width: 44.5454545454545%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-nosolution-auto { color: black; width: 53.4545454545455%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-nosolution-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- # Learning objectives .pull-left[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[Common *tidy data* violations
]] > - Inconsistent data from manual entry (`stringr`, `tidyr`) > - Multiple variables stored in one column (e.g `tidyr::separate`) > - Multiple types in one table (`dplyr`, data transformation) > - One type in multiple tables (`dplyr`, combine into single table) > - Column headers are values, not variable names (`tidyr::pivot`) > - Variables are stored in both rows and columns (`tidyr::pivot`) > - Repeated observations (`nest` or separate tables) ] ] -- .pull-right[ .bg-washed-green.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.gbox[Tidy data solutions
]] * Nicer column names with `janitor` * Basic column-wise manipulation - now with `tidyr` and `stringr` * Reshaping operations - after `dplyr` * Nesting and advanced collapsing operations - after `broom` ]] --- class: hide_logo # Side note - `janitor` .pull-left[ .bg-washed-green.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2) ``` ``` # A tibble: 23 × 18 Name ID Gender Language...4 Birthday ...6 `test-reaction` <chr> <dbl> <chr> <chr> <chr> <lgl> <chr> 1 Roland Krause 1 male German 38687 NA dog=3 2 Lars Juhl Krause 2 m German 26/2/1992 NA cat=3 3 <NA> 2 m GErman 26/2/1992 NA cat=3 4 <NA> 2 male German 26/2/1992 NA cat=3 5 Martina Krause 4 f German 12-03-2002 NA dog=2 6 Horst-Sergio Krause 5 m French 12/11/70 NA trolley=2 7 Horst-Sergio Krause 5 <NA> <NA> 11-12-70 NA <NA> 8 LJK 7 f Französisch <NA> NA trolley=2 9 DFK 8 f Französisch 2005-12-10 NA trolley=1 10 Horst Mond 9 male Deutsch 1972-10-01 NA dog=1 # … with 13 more rows, and 11 more variables: morale (by STAI test) <chr>, # ...9 <chr>, STAI <chr>, Language...11 <chr>, ...12 <lgl>, REC-1 <dbl>, # Rec-2 <dbl>, REC-3 <dbl>, Language...16 <chr>, ...17 <lgl>, Comment <chr> ``` ]] .pull-right[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[Janitor
]] ```r readxl::read_excel("data/bad-table-psych.xlsx", skip = 2, n_max = 12) %>% janitor::clean_names() ``` ``` # A tibble: 12 × 18 name id gender language_4 birthday x6 test_reaction morale_by_stai_… <chr> <dbl> <chr> <chr> <chr> <lgl> <chr> <lgl> 1 Roland… 1 male German 38687 NA dog=3 NA 2 Lars J… 2 m German 26/2/19… NA cat=3 NA 3 <NA> 2 m GErman 26/2/19… NA cat=3 NA 4 <NA> 2 male German 26/2/19… NA cat=3 NA 5 Martin… 4 f German 12-03-2… NA dog=2 NA 6 Horst-… 5 m French 12/11/70 NA trolley=2 NA 7 Horst-… 5 <NA> <NA> 11-12-70 NA <NA> NA 8 LJK 7 f Französis… <NA> NA trolley=2 NA 9 DFK 8 f Französis… 2005-12… NA trolley=1 NA 10 Horst … 9 male Deutsch 1972-10… NA dog=1 NA 11 Kevin … 10 männl… Deutsch 5.4.1970 NA dog=3 NA 12 Inga M… 11 weibl… Deutsch 1/1/1990 NA dog=5 NA # … with 10 more variables: x9 <dbl>, stai <chr>, language_11 <chr>, x12 <lgl>, # rec_1 <dbl>, rec_2 <dbl>, rec_3 <dbl>, language_16 <chr>, x17 <lgl>, # comment <chr> ``` The data still sucks but at least you have nice column names. ]] --- class: inverse, center, middle # Basic rearrangements --- class: hide_logo ## Splitting values - `tidyr::separate()` .pull-left[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mr1[ .bbox[ Key-value pairs] ``` # A tibble: 6 × 2 subject_id gender_age <int> <chr> 1 1001 m-34 2 1002 f-24 3 1003 m-53 4 1004 f-44 5 1005 m-24 6 1006 f-30 ``` ```r patient %>% separate(gender_age, c("sex", "age"), convert = TRUE) ``` ``` # A tibble: 6 × 3 subject_id sex age <int> <chr> <int> 1 1001 m 34 2 1002 f 24 3 1003 m 53 4 1004 f 44 5 1005 m 24 6 1006 f 30 ``` ] ] -- .pull-right[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mr1[ .bbox[No separator] ``` # A tibble: 6 × 2 subject_id gender_age <int> <glue> 1 1001 f50 2 1002 m34 3 1003 m63 4 1004 m57 5 1005 f42 6 1006 m58 ``` ```r patient %>% separate(gender_age, sep = 1, c("sex", "age"), convert = TRUE) ``` ``` # A tibble: 6 × 3 subject_id sex age <int> <chr> <int> 1 1001 f 50 2 1002 m 34 3 1003 m 63 4 1004 m 57 5 1005 f 42 6 1006 m 58 ``` ] ] --- # Splitting and collating - `separate()` and `unite()` .pull-left[ ### Input tibble ```r data_value <- tibble( year = c(2015, 2014, 2014), month = c(11, 2, 4), day = c(23, 1, 30), value = c("high", "low", "low")) data_value ``` ``` # A tibble: 3 × 4 year month day value <dbl> <dbl> <dbl> <chr> 1 2015 11 23 high 2 2014 2 1 low 3 2014 4 30 low ``` .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3[
**Tip**: Demonstration only. Use the package `lubridate` for actually working with dates. ] ] .pull-right[ ### `unite()` ```r date_unite <- unite(data_value, date, year, month, day, sep = "-") date_unite ``` ``` # A tibble: 3 × 2 date value <chr> <chr> 1 2015-11-23 high 2 2014-2-1 low 3 2014-4-30 low ``` No need to clean up old columns. ] --- # Splitting columns continued .pull-left[ .bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.gbox[Split with `separate()`]] - Default split on non-alphanumeric characters - Use **quotes** to designate columns since we are not refering to objects. ] ] .pull-right[ ```r date_unite %>% separate(date, c("year", "month", "day")) ``` ``` # A tibble: 3 × 4 year month day value <chr> <chr> <chr> <chr> 1 2015 11 23 high 2 2014 2 1 low 3 2014 4 30 low ``` ] --- ## Separate rows with multiple entries .pull-left[ ### Multiple values per cell ```r patient_df <- tibble( subject_id = 1001:1003, visit_id = c("1,2, 3", "1|2", "1"), measured = c("9,0, 11", "11, 3", "12")) patient_df ``` ``` # A tibble: 3 × 3 subject_id visit_id measured <int> <chr> <chr> 1 1001 1,2, 3 9,0, 11 2 1002 1|2 11, 3 3 1003 1 12 ``` Note the incoherent white space and separators. ] .pull-right[ ### Combinations of variables ```r patient_df %>% separate_rows(visit_id, measured, convert = TRUE) -> patient_separate patient_separate ``` ``` # A tibble: 6 × 3 subject_id visit_id measured <int> <int> <int> 1 1001 1 9 2 1001 2 0 3 1001 3 11 4 1002 1 11 5 1002 2 3 6 1003 1 12 ``` ] --- class: nvs3, small ## Fill all combinations with `complete()` .pull-left[ ### Combinations of variables ```r patient_separate %>% complete(subject_id, nesting(visit_id)) ``` ``` # A tibble: 9 × 3 subject_id visit_id measured <int> <int> <int> 1 1001 1 9 2 1001 2 0 3 1001 3 11 4 1002 1 11 5 1002 2 3 6 1002 3 NA 7 1003 1 12 8 1003 2 NA 9 1003 3 NA ``` ] -- .pull-right[ ### Detemine filling element as list ```r patient_separate %>% complete(subject_id, nesting(visit_id), fill = list(measured = 0)) ``` ``` # A tibble: 9 × 3 subject_id visit_id measured <int> <int> <dbl> 1 1001 1 9 2 1001 2 0 3 1001 3 11 4 1002 1 11 5 1002 2 3 6 1002 3 0 7 1003 1 12 8 1003 2 0 9 1003 3 0 ``` ] -- .flex[ .w-45.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3[
**Tip**: Use `<NA>`, not `0` for missing data in real life application. ] ] --- # Data cleaning .pull-left[ ### "Manual" Curation ```r my_col <- c("F", "M", "female", "male", "male", "female", "female") write.csv(as.data.frame("a" <- my_col), "~/Downloads/important.csv") ``` .w-47.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ Open Excel .right[
]] ```r my_clean_data <-read.csv("~/Downloads/important.csv") ``` ] -- .pull-right[ ### Scripted Curation ```r my_col <- c("F", "M", "female", "male", "male", "female", "female", "männlich") convert_gender <- function(x) { case_when( str_detect(x, "^[Ff]") ~ "Female", str_detect(x, "^[Mm]") ~ "Male", TRUE ~ x ) } convert_gender(my_col) ``` ``` [1] "Female" "Male" "Female" "Male" "Male" "Female" "Female" "Male" ``` ] --- # Before we stop .flex[ .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.ml1[ .large[.gbox[Tidy data core] + <http://tidyr.tidyverse.org/> + `vignette("tidy-data")` ]] .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.ml2[ .large[.bbox[Acknowledgments 🙏 👏]] * Hadley Wickham (Tidy concepts) * Alison Hill (`tidyr`) * Artwork by [Allison Horst](https://twitter.com/allison_horst) * Initial development: Roland Krause ]] .flex[ .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.ml1[ .large[.ybox[Further reading 📚]] + **Rectangling** - Taming JSON and other data into rectangles - <https://tidyr.tidyverse.org/dev/articles/rectangle.html> ] .w-50.pv2.ph3.ml1[ .huge[.bbox[Thank you for your attention!]] ] ]