class: title-slide
# Data wrangling ## `dplyr` .center[<img src="https://raw.githubusercontent.com/tidyverse/dplyr/master/man/figures/logo.png" width="100px"/>] ### Roland Krause | rworkshop | 2021-09-09 --- # Introduction .flex.justify-around[ .w-30.bg-washed-yellow.b--blue.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[
Data munging ] * Preparing data is the most time consuming part of data analysis. * Individual steps might look *easy*. * Essential part of *understanding* the data you're working with. * Additional data preparation before modeling is impossible to avoid. ] ] .w-30.bg-washed-blue.b--blue.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[
At a glance ] `dplyr` is a tool box for working with data in *tibbles*, offering a unified language for operations scattered through base R. .center[<img src="https://raw.githubusercontent.com/tidyverse/dplyr/master/man/figures/logo.png" width="100px"/> ] ] ] .w-30.b--blue.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[
Key operations ]  .right[<span class = "small">source: [Lise Vaudor](http://perso.ens-lyon.fr/lise.vaudor/dplyr/)</span>] ] ] ] --- # This lecture .flex.justify-around[ .w-30.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[
Example data ]] Van 't Veer, Anna; Sleegers, Willem, .bold[2019], "Psychology data from an exploration of the effect of anticipatory stress on disgust vs. non-disgust related moral judgments". _Journal of Open Psychology Data_. * Data is (largely) *tidy*. * Typical data you might see in the wild. ] .w-30.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[
Learning objectives ]] + Learn the .bold[grammar] to operate on rows and columns of a table + Selection and manipulation of - observations, - variables and - values. + Grouping and summarizing + Joining and intersecting tibbles + Pivoting column headers and variables ] .w-30.b--blue.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[
Key operations ]  .right[<span class = "small">source: [Lise Vaudor](http://perso.ens-lyon.fr/lise.vaudor/dplyr/)</span>] ] ] ] --- # `dplyr` Introduction: [Cheat sheets](https://www.rstudio.com/resources/cheatsheets/) .center[ <img src="img/dplyr_cheatsheet.png" width="60%" /> ] --- .center.middle[<img src="img/dplyr_schema_berlin.jpg" width="80%">] .footnote[[R Berlin](https://twitter.com/rdataberlin/status/1321503543887568896/photo/1)] --- # Using the `dplyr` package .flex.justify-around[ .w-70.bg-washed-red.b--red.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .rbox.large[
Do not use these packages! ] `dplyr` supersedes previous packages from Hadley Wickham. * `reshape` * `reshape2` * `plyr` All functionality can be found in `dplyr` or `tidyr`. ] .w-70.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .bbox[
dplyr *current version is 1.0.7*] * `dplyr` has seen many changes. * Watch out for deprecated examples on Stack overflow! * Code will break sooner or later (you might get lucky) * But handled with clear [lifecycle stages](https://lifecycle.r-lib.org/articles/stages.html) now * Changes are generally introduced to simplify operations. ] .w-70.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .ybox[Loading `dplyr`] ```r library(dplyr) # AND library(tidyr) # OR (recommended # for the workshop) library(tidyverse) ``` ] ] --- class: hide_logo # Preparing the data .pull-left[ .w-90.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt1.mr1[ .large[.bbox[Description] Van 't Veer, Anna; Sleegers, Willem. (2019) "[Psychology data from an exploration of the effect of anticipatory stress on disgust vs. non-disgust related moral judgments](http://doi.org/10.5334/jopd.43)". *Journal of Open Psychology Data*.
* Moral dilemmata (trolley problem, survival after plane crash, etc. `moral`) * Standard questionnaires + Private Body Consciousness (`PBC`, range 0 - 4) + Rational-Experiential Inventory (`REI`, range 1 - 5) + Multidimensional Assessment of Interoceptive Awareness (`MAIA`, range 0 - 5 ) + State Trait Anxiety Inventory (`STAI`) ] ] ] ```r judge_url <- "https://biostat2.uni.lu/practicals/data/judgments.tsv" judgments <- readr::read_tsv(judge_url) ``` ``` Rows: 188 Columns: 158 ``` ``` ── Column specification ──────────────────────────────────────────────────────── Delimiter: "\t" chr (5): start_date, end_date, condition, gender, logbook dbl (153): finished, subject, age, mood_pre, mood_post, STAI_pre_1_1, STAI_p... ``` ``` ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ```r # https://dataverse.nl/api/access/datafile/11863 #"https://biostat2.uni.lu/practicals/data/judgments.tsv") ``` --- class:slide-practical # Your turn Load the data into your RStudio session.
02
:
00
```r judgments <- readr::read_tsv("https://biostat2.uni.lu/practicals/data/judgments.tsv") ``` ``` Rows: 188 Columns: 158 ``` ``` ── Column specification ──────────────────────────────────────────────────────── Delimiter: "\t" chr (5): start_date, end_date, condition, gender, logbook dbl (153): finished, subject, age, mood_pre, mood_post, STAI_pre_1_1, STAI_p... ``` ``` ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` --- # Pipes to simplify multiple function calls .flex[ .w-70.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .bbox[Speaking code] <!-- Didn't extend the font size formatting to chunks --> .large[ ```r verb(subject, complement) ``` ```r subject %>% verb(complement) ``` ] Example adapted from [Romain François](https://twitter.com/ParisRaddict/status/781267225420435461) ] .w-70.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .ybox[ `tidyverse` functions are consistently designed to work with `%>%`] .large[ **data** (_subject_) is the **first** argument. ] ] .w-70.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .bbox[Current developments] * Many languages are adding pipes. * Several packages exist in R. * R 4.1 introduced a base pipe `|>`. + Works mostly like the maggritr/tidyverse `%>%` but is not identical. * Will use `%>%` only in the course. [Luke Tierney useR!2020](https://youtu.be/X_eDHNVceCU?t=4084) ] ] --- # Best practices for using the pipe .left-column[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[💡 Reminder ]] First arguments for all `tidyverse` functions is the `.data`: A tibble / data.frame. But for single operations -- function calls -- it is recommended .bold[NOT] to use `%>%`. ] ] -- .flex.justify-around[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .rbox[⚠️ Warning] ```r judgments %>% mutate(judgments, # Wrong! new_col = mean(age)) ``` Don't specify `data` twice! ] .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .gbox[Better] ```r judgments %>% mutate(new_col = mean(age)) ``` Harder to read for simple operations. ] .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .bbox[Recommended] ```r mutate(judgments, new_col = mean(age)) ``` For single operations only. ] ] --- class: inverse, middle, center # Inspecting tibbles --- # Inspect tibbles with `glimpse()` .left-column[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[💡 Column-wise description ]] Shows some values and the type of **each** column. The *Environment* tab in RStudio tab does it too. Clicking object `judgments` triggers `View()`. Similar to the `base::str()` function ] ] .right-column[ ```r glimpse(judgments) ``` ``` Rows: 188 Columns: 158 $ start_date <chr> "11/3/2014", "11/3/2014", "11/3/2014", "11/… $ end_date <chr> "11/3/2014", "11/3/2014", "11/3/2014", "11/… $ finished <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… $ condition <chr> "control", "stress", "stress", "stress", "c… $ subject <dbl> 2, 1, 3, 4, 7, 6, 5, 9, 16, 13, 18, 14, 12,… $ gender <chr> "female", "female", "female", "female", "fe… $ age <dbl> 24, 19, 19, 22, 22, 22, 18, 20, 21, 19, 19,… $ mood_pre <dbl> 81, 59, 22, 53, 48, 73, NA, 100, 67, 30, 55… $ mood_post <dbl> NA, 42, 60, 68, NA, 73, NA, NA, 74, 68, 57,… $ STAI_pre_1_1 <dbl> 2, 3, 4, 2, 1, 2, 2, 1, 2, 4, 2, 1, 2, 1, 1… $ STAI_pre_1_2 <dbl> 1, 2, 3, 2, 1, 2, 2, 1, 2, 2, 3, 2, 2, 1, 1… $ STAI_pre_1_3 <dbl> 2, 3, 3, 2, 1, 1, 1, 1, 1, 3, 1, 2, 2, 2, 2… $ STAI_pre_1_4 <dbl> 2, 1, 3, 2, 1, 1, 1, 1, 1, 3, 1, 2, 1, 1, 1… $ STAI_pre_1_5 <dbl> 2, 3, 4, 3, 2, 2, 2, 1, 2, 3, 2, 2, 2, 2, 2… $ STAI_pre_1_6 <dbl> 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1… $ STAI_pre_1_7 <dbl> 2, 3, 3, 1, 1, 2, 1, 1, 1, 3, 1, 1, 2, 1, 3… $ STAI_pre_2_1 <dbl> 2, 3, 4, 3, 3, 2, 2, 2, 2, 4, 3, 3, 2, 4, 3… $ STAI_pre_2_2 <dbl> 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1… $ STAI_pre_2_3 <dbl> 1, 2, 3, 3, 3, 2, 2, 1, 2, 3, 2, 3, 3, 3, 2… $ STAI_pre_2_4 <dbl> 1, 2, 4, 3, 3, 2, 2, 1, 2, 4, 3, 3, 3, 3, 2… $ STAI_pre_2_5 <dbl> 1, 2, 4, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 2, 1… $ STAI_pre_2_6 <dbl> 1, 3, 4, 1, 1, 2, 1, 1, 1, 3, 1, 1, 1, 2, 2… $ STAI_pre_2_7 <dbl> 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 3, 1… $ STAI_pre_3_1 <dbl> 2, 3, 4, 3, 1, 2, 2, 1, 2, 4, 2, 2, 3, 2, 3… $ STAI_pre_3_2 <dbl> 2, 3, 3, 3, 2, 2, 2, 1, 2, 3, 2, 2, 2, 3, 2… $ STAI_pre_3_3 <dbl> 2, 3, 2, 2, 2, 2, 1, 1, 1, 3, 1, 1, 2, 1, 2… $ STAI_pre_3_4 <dbl> 1, 2, 3, 1, 1, 1, 2, 1, 2, 3, 1, 1, 1, 1, 1… $ STAI_pre_3_5 <dbl> 2, 3, 4, 3, 3, 2, 2, 1, 2, 4, 2, 2, 3, 2, 3… $ STAI_pre_3_6 <dbl> 2, 3, 4, 3, 3, 2, 2, 1, 2, 4, 2, 2, 2, 2, 2… $ STAI_post_1_1 <dbl> NA, 3, 3, 2, NA, 2, NA, NA, 2, 2, 2, 3, NA,… $ STAI_post_1_2 <dbl> NA, 3, 3, 2, NA, 2, NA, NA, 2, 2, 2, 2, NA,… $ STAI_post_1_3 <dbl> NA, 3, 2, 1, NA, 1, NA, NA, 1, 1, 2, 2, NA,… $ STAI_post_1_4 <dbl> NA, 3, 2, 1, NA, 1, NA, NA, 1, 2, 1, 2, NA,… $ STAI_post_1_5 <dbl> NA, 2, 2, 2, NA, 2, NA, NA, 2, 3, 3, 3, NA,… $ STAI_post_1_6 <dbl> NA, 2, 1, 1, NA, 1, NA, NA, 1, 1, 3, 2, NA,… $ STAI_post_1_7 <dbl> NA, 3, 1, 1, NA, 2, NA, NA, 1, 1, 1, 1, NA,… $ STAI_post_2_1 <dbl> NA, 2, 3, 2, NA, 2, NA, NA, 2, 2, 3, 2, NA,… $ STAI_post_2_2 <dbl> NA, 2, 1, 1, NA, 1, NA, NA, 1, 1, 1, 1, NA,… $ STAI_post_2_3 <dbl> NA, 3, 3, 2, NA, 2, NA, NA, 2, 3, 3, 3, NA,… $ STAI_post_2_4 <dbl> NA, 3, 3, 2, NA, 2, NA, NA, 2, 3, 3, 3, NA,… $ STAI_post_2_5 <dbl> NA, 3, 1, 1, NA, 1, NA, NA, 1, 1, 1, 1, NA,… $ STAI_post_2_6 <dbl> NA, 3, 1, 1, NA, 1, NA, NA, 1, 1, 2, 1, NA,… $ STAI_post_2_7 <dbl> NA, 1, 1, 2, NA, 1, NA, NA, 1, 1, 2, 1, NA,… $ STAI_post_3_1 <dbl> NA, 2, 3, 2, NA, 2, NA, NA, 2, 3, 3, 3, NA,… $ STAI_post_3_2 <dbl> NA, 2, 3, 2, NA, 2, NA, NA, 2, 3, 4, 3, NA,… $ STAI_post_3_3 <dbl> NA, 3, 1, 1, NA, 1, NA, NA, 1, 1, 1, 2, NA,… $ STAI_post_3_4 <dbl> NA, 2, 1, 1, NA, 1, NA, NA, 1, 1, 2, 1, NA,… $ STAI_post_3_5 <dbl> NA, 3, 3, 3, NA, 3, NA, NA, 2, 3, 3, 3, NA,… $ STAI_post_3_6 <dbl> NA, 3, 3, 2, NA, 3, NA, NA, 2, 3, 3, 3, NA,… $ moral_dilemma_dog <dbl> 9, 9, 8, 8, 3, 9, 9, 9, 6, 6, 8, 7, 9, 9, 6… $ moral_dilemma_wallet <dbl> 9, 9, 7, 4, 9, 9, 5, 4, 9, 8, 7, 9, 9, 9, 7… $ moral_dilemma_plane <dbl> 8, 9, 8, 8, 9, 9, 7, 1, 3, 9, 9, 6, 9, 9, 4… $ moral_dilemma_resume <dbl> 7, 8, 5, 6, 5, 9, 3, 7, 9, 8, 5, 9, 8, 9, 7… $ moral_dilemma_kitten <dbl> 9, 9, 8, 9, 5, 8, 6, 9, 9, 9, 8, 9, 7, 9, 6… $ moral_dilemma_trolley <dbl> 5, 3, 5, 2, 4, 5, 3, 1, 1, 9, 2, 4, 5, 5, 3… $ moral_dilemma_control <dbl> 9, 2, 9, 8, 8, 6, 8, 7, 8, 6, 7, 8, 8, 3, 7… $ presentation_experience <dbl> NA, 2, 1, 2, NA, 2, NA, NA, 2, 2, 2, 2, NA,… $ presentation_unpleasant <dbl> NA, 63, 68, 32, NA, 63, NA, NA, 14, 54, 82,… $ presentation_fun <dbl> NA, 58, 26, 59, NA, 54, NA, NA, 78, 42, 7, … $ presentation_challenge <dbl> NA, 58, 65, 80, NA, 50, NA, NA, 47, 64, 72,… $ PBC_1 <dbl> 3, NA, NA, NA, 3, NA, 3, 1, NA, NA, NA, NA,… $ PBC_2 <dbl> 3, NA, NA, NA, 3, NA, 3, 4, NA, NA, NA, NA,… $ PBC_3 <dbl> 5, NA, NA, NA, 3, NA, 3, 1, NA, NA, NA, NA,… $ PBC_4 <dbl> 5, NA, NA, NA, 3, NA, 5, 4, NA, NA, NA, NA,… $ PBC_5 <dbl> 5, NA, NA, NA, 2, NA, 5, 4, NA, NA, NA, NA,… $ REI_1 <dbl> 5, NA, NA, NA, 3, NA, 4, 4, NA, NA, NA, NA,… $ REI_2 <dbl> 4, NA, NA, NA, 3, NA, 4, 5, NA, NA, NA, NA,… $ REI_3 <dbl> 5, NA, NA, NA, 3, NA, 4, 5, NA, NA, NA, NA,… $ REI_4 <dbl> 4, NA, NA, NA, 3, NA, 4, 5, NA, NA, NA, NA,… $ REI_5 <dbl> 4, NA, NA, NA, 4, NA, 4, 5, NA, NA, NA, NA,… $ REI_6 <dbl> 5, NA, NA, NA, 3, NA, 4, 5, NA, NA, NA, NA,… $ REI_7 <dbl> 3, NA, NA, NA, 3, NA, 3, 5, NA, NA, NA, NA,… $ REI_8 <dbl> 4, NA, NA, NA, 3, NA, 4, 5, NA, NA, NA, NA,… $ REI_9 <dbl> 3, NA, NA, NA, 3, NA, 4, 3, NA, NA, NA, NA,… $ REI_10 <dbl> 4, NA, NA, NA, 3, NA, 4, 4, NA, NA, NA, NA,… $ REI_11 <dbl> 5, NA, NA, NA, 3, NA, 4, 5, NA, NA, NA, NA,… $ REI_12 <dbl> 5, NA, NA, NA, NA, NA, 3, 4, NA, NA, NA, NA… $ REI_13 <dbl> 3, NA, NA, NA, 3, NA, 4, 2, NA, NA, NA, NA,… $ REI_14 <dbl> 4, NA, NA, NA, 3, NA, 4, 2, NA, NA, NA, NA,… $ REI_15 <dbl> 4, NA, NA, NA, 3, NA, 3, 4, NA, NA, NA, NA,… $ REI_16 <dbl> 4, NA, NA, NA, 3, NA, 4, 4, NA, NA, NA, NA,… $ REI_17 <dbl> 3, NA, NA, NA, 3, NA, 3, 4, NA, NA, NA, NA,… $ REI_18 <dbl> 5, NA, NA, NA, 3, NA, 2, 5, NA, NA, NA, NA,… $ REI_19 <dbl> 1, NA, NA, NA, 3, NA, 4, 3, NA, NA, NA, NA,… $ REI_20 <dbl> 3, NA, NA, NA, 3, NA, 5, 5, NA, NA, NA, NA,… $ REI_21 <dbl> 5, NA, NA, NA, 3, NA, 4, 5, NA, NA, NA, NA,… $ REI_22 <dbl> 3, NA, NA, NA, 3, NA, 4, 3, NA, NA, NA, NA,… $ REI_23 <dbl> 4, NA, NA, NA, 3, NA, 5, 5, NA, NA, NA, NA,… $ REI_24 <dbl> 2, NA, NA, NA, 3, NA, 1, 5, NA, NA, NA, NA,… $ REI_25 <dbl> 3, NA, NA, NA, 3, NA, 2, 3, NA, NA, NA, NA,… $ REI_26 <dbl> 5, NA, NA, NA, 3, NA, 4, 5, NA, NA, NA, NA,… $ REI_27 <dbl> 5, NA, NA, NA, 3, NA, 2, 1, NA, NA, NA, NA,… $ REI_28 <dbl> 3, NA, NA, NA, 3, NA, 4, 3, NA, NA, NA, NA,… $ REI_29 <dbl> 3, NA, NA, NA, 3, NA, 4, 4, NA, NA, NA, NA,… $ REI_30 <dbl> 4, NA, NA, NA, 3, NA, 3, 4, NA, NA, NA, NA,… $ REI_31 <dbl> 3, NA, NA, NA, 3, NA, 4, 5, NA, NA, NA, NA,… $ REI_32 <dbl> 3, NA, NA, NA, 3, NA, 3, 1, NA, NA, NA, NA,… $ REI_33 <dbl> 4, NA, NA, NA, 3, NA, 4, 4, NA, NA, NA, NA,… $ REI_34 <dbl> 3, NA, NA, NA, 3, NA, 3, 4, NA, NA, NA, NA,… $ REI_35 <dbl> 4, NA, NA, NA, 3, NA, 5, 5, NA, NA, NA, NA,… $ REI_36 <dbl> 3, NA, NA, NA, 3, NA, 4, 4, NA, NA, NA, NA,… $ REI_37 <dbl> 4, NA, NA, NA, 3, NA, 4, 3, NA, NA, NA, NA,… $ REI_38 <dbl> 4, NA, NA, NA, 3, NA, 4, 4, NA, NA, NA, NA,… $ REI_39 <dbl> 4, NA, NA, NA, 3, NA, 4, 3, NA, NA, NA, NA,… $ REI_40 <dbl> 4, NA, NA, NA, 3, NA, 3, 5, NA, NA, NA, NA,… $ MAIA_1_1 <dbl> 2, NA, NA, NA, 3, NA, 4, 3, NA, NA, NA, NA,… $ MAIA_1_2 <dbl> 4, NA, NA, NA, 3, NA, 4, 4, NA, NA, NA, NA,… $ MAIA_1_3 <dbl> 4, NA, NA, NA, 3, NA, 4, 4, NA, NA, NA, NA,… $ MAIA_1_4 <dbl> 4, NA, NA, NA, 3, NA, 5, 4, NA, NA, NA, NA,… $ MAIA_1_5 <dbl> 2, NA, NA, NA, 3, NA, 2, 5, NA, NA, NA, NA,… $ MAIA_1_6 <dbl> 2, NA, NA, NA, 3, NA, 2, 2, NA, NA, NA, NA,… $ MAIA_1_7 <dbl> 2, NA, NA, NA, 3, NA, 4, 3, NA, NA, NA, NA,… $ MAIA_1_8 <dbl> 3, NA, NA, NA, 3, NA, 3, 5, NA, NA, NA, NA,… $ MAIA_1_9 <dbl> 4, NA, NA, NA, 3, NA, 5, 6, NA, NA, NA, NA,… $ MAIA_1_10 <dbl> 4, NA, NA, NA, 3, NA, 4, 2, NA, NA, NA, NA,… $ MAIA_1_11 <dbl> 4, NA, NA, NA, 3, NA, 1, 3, NA, NA, NA, NA,… $ MAIA_1_12 <dbl> 3, NA, NA, NA, 3, NA, 2, 3, NA, NA, NA, NA,… $ MAIA_1_13 <dbl> 4, NA, NA, NA, 3, NA, 3, 5, NA, NA, NA, NA,… $ MAIA_1_14 <dbl> 4, NA, NA, NA, 3, NA, 2, 5, NA, NA, NA, NA,… $ MAIA_1_15 <dbl> 4, NA, NA, NA, 3, NA, 3, 4, NA, NA, NA, NA,… $ MAIA_1_16 <dbl> 4, NA, NA, NA, 3, NA, 2, 5, NA, NA, NA, NA,… $ MAIA_2_1 <dbl> 4, NA, NA, NA, 3, NA, 3, 5, NA, NA, NA, NA,… $ MAIA_2_2 <dbl> 4, NA, NA, NA, 3, NA, 4, 5, NA, NA, NA, NA,… $ MAIA_2_3 <dbl> 4, NA, NA, NA, 3, NA, 3, 2, NA, NA, NA, NA,… $ MAIA_2_4 <dbl> 4, NA, NA, NA, 3, NA, 4, 4, NA, NA, NA, NA,… $ MAIA_2_5 <dbl> 4, NA, NA, NA, 3, NA, 4, 4, NA, NA, NA, NA,… $ MAIA_2_6 <dbl> 4, NA, NA, NA, 3, NA, 5, 4, NA, NA, NA, NA,… $ MAIA_2_7 <dbl> 4, NA, NA, NA, 3, NA, 3, 2, NA, NA, NA, NA,… $ MAIA_2_8 <dbl> 4, NA, NA, NA, 3, NA, 3, 3, NA, NA, NA, NA,… $ MAIA_2_9 <dbl> 4, NA, NA, NA, 3, NA, 4, 4, NA, NA, NA, NA,… $ MAIA_2_10 <dbl> 4, NA, NA, NA, 3, NA, 2, 4, NA, NA, NA, NA,… $ MAIA_2_11 <dbl> 4, NA, NA, NA, 3, NA, 3, 3, NA, NA, NA, NA,… $ MAIA_2_12 <dbl> 3, NA, NA, NA, 3, NA, 2, 0, NA, NA, NA, NA,… $ MAIA_2_13 <dbl> 3, NA, NA, NA, 3, NA, 3, 3, NA, NA, NA, NA,… $ MAIA_2_14 <dbl> 4, NA, NA, NA, 3, NA, 3, 4, NA, NA, NA, NA,… $ MAIA_2_15 <dbl> 4, NA, NA, NA, 3, NA, 4, 4, NA, NA, NA, NA,… $ MAIA_2_16 <dbl> 4, NA, NA, NA, 3, NA, 4, 4, NA, NA, NA, NA,… $ STAI_pre <dbl> 32, 49, 65, 42, 33, 34, 32, 21, 31, 60, 34,… $ STAI_post <dbl> NA, 51, 41, 32, NA, 33, NA, NA, 30, 38, 45,… $ MAIA_noticing <dbl> 14, NA, NA, NA, 12, NA, 17, 15, NA, NA, NA,… $ MAIA_not_distracting <dbl> 6, NA, NA, NA, 9, NA, 8, 10, NA, NA, NA, NA… $ MAIA_not_worrying <dbl> 11, NA, NA, NA, 9, NA, 12, 13, NA, NA, NA, … $ MAIA_attention_regulation <dbl> 27, NA, NA, NA, 21, NA, 16, 30, NA, NA, NA,… $ MAIA_emotional_awareness <dbl> 20, NA, NA, NA, 15, NA, 20, 19, NA, NA, NA,… $ MAIA_self_regulation <dbl> 16, NA, NA, NA, 12, NA, 12, 13, NA, NA, NA,… $ MAIA_body_listening <dbl> 10, NA, NA, NA, 9, NA, 8, 6, NA, NA, NA, NA… $ MAIA_trusting <dbl> 12, NA, NA, NA, 9, NA, 11, 12, NA, NA, NA, … $ PBC <dbl> 21, NA, NA, NA, 14, NA, 19, 14, NA, NA, NA,… $ REI_rational_ability <dbl> 38, NA, NA, NA, 31, NA, 37, 38, NA, NA, NA,… $ REI_rational_engagement <dbl> 38, NA, NA, NA, NA, NA, 26, 41, NA, NA, NA,… $ REI_experiental_ability <dbl> 36, NA, NA, NA, 30, NA, 43, 43, NA, NA, NA,… $ REI_experiental_engagement <dbl> 39, NA, NA, NA, 30, NA, 40, 38, NA, NA, NA,… $ moral_judgment <dbl> 8.000000, 7.000000, 7.142857, 6.428571, 6.1… $ moral_judgment_disgust <dbl> 8.666667, 9.000000, 8.000000, 8.333333, 5.6… $ moral_judgment_non_disgust <dbl> 7.000000, 6.666667, 5.666667, 4.000000, 6.0… $ presentation_evaluation <dbl> NA, 3, 3, 4, NA, 4, NA, NA, 4, 3, 3, 3, NA,… $ logbook <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ exclude <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ``` ] --- # Selecting *columns* .flex[ .w-70.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.ybox[ `select()`]] - Select the columns you want: `select(tibble, your_column1, ...)` ```r select(judgments, gender, age, condition) ``` ``` # A tibble: 188 × 3 gender age condition <chr> <dbl> <chr> 1 female 24 control 2 female 19 stress 3 female 19 stress 4 female 22 stress 5 female 22 control 6 female 22 stress 7 female 18 control 8 male 20 control 9 female 21 stress 10 female 19 stress # … with 178 more rows ``` ] .w-70.bg-washed-red.b--red.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.rbox[⚠️ Warning]] The `biomaRt` package of Bioconductor (amongst others) provides a `select()` function. If loaded, we need to address the `dplyr`-package using `::`! ```r judgments %>% dplyr::select(age, gender) ``` ``` # A tibble: 188 × 2 age gender <dbl> <chr> 1 24 female 2 19 female 3 19 female 4 22 female 5 22 female 6 22 female 7 18 female 8 20 male 9 21 female 10 19 female # … with 178 more rows ``` ] ] --- # Reminder `tidyselect` .pull-left[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mr1[ .large[.bbox[💡 Helper function ] To select columns with names that: * `contains()` - a string * `starts_with()` - a string * `ends_with()` - a string * `one_of()` - names in a character **vector** * `matches()` - using regular expressions * `everything()` - all **remaining** columns * `last_col()` - last column ]]] .pull-right[ ```r select(judgments, starts_with("moral")) ``` ``` # A tibble: 188 × 10 moral_dilemma_dog moral_dilemma_wallet moral_dilemma_plane moral_dilemma_res… <dbl> <dbl> <dbl> <dbl> 1 9 9 8 7 2 9 9 9 8 3 8 7 8 5 4 8 4 8 6 5 3 9 9 5 6 9 9 9 9 7 9 5 7 3 8 9 4 1 7 9 6 9 3 9 10 6 8 9 8 # … with 178 more rows, and 6 more variables: moral_dilemma_kitten <dbl>, # moral_dilemma_trolley <dbl>, moral_dilemma_control <dbl>, # moral_judgment <dbl>, moral_judgment_disgust <dbl>, # moral_judgment_non_disgust <dbl> ``` ] --- # Combining helpers .left-column[ .w-100.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .bbox[
Remark] Helpers are found in several functions, e.g. `across()`. ] ] .right-column[ ```r select(judgments, ends_with("date"), contains("dilemma")) ``` ``` # A tibble: 188 × 9 start_date end_date moral_dilemma_dog moral_dilemma_wallet moral_dilemma_pl… <chr> <chr> <dbl> <dbl> <dbl> 1 11/3/2014 11/3/2014 9 9 8 2 11/3/2014 11/3/2014 9 9 9 3 11/3/2014 11/3/2014 8 7 8 4 11/3/2014 11/3/2014 8 4 8 5 11/3/2014 11/3/2014 3 9 9 6 11/3/2014 11/3/2014 9 9 9 7 11/3/2014 11/3/2014 9 5 7 8 11/3/2014 11/3/2014 9 4 1 9 11/3/2014 11/3/2014 6 9 3 10 11/3/2014 11/3/2014 6 8 9 # … with 178 more rows, and 4 more variables: moral_dilemma_resume <dbl>, # moral_dilemma_kitten <dbl>, moral_dilemma_trolley <dbl>, # moral_dilemma_control <dbl> ``` ] --- # Selecting *columns* using `select()` .left-column[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt1.mr1[ .large[.bbox[💡 Negative selection ] Drop columns by **negating** their names .large.bold[`-`]. Works with the `tidyselect` helper functions. ] ] ] -- .right-column[ ```r select(judgments, -gender, -starts_with(c("STAI", "REI")), -ends_with("id")) ``` ``` # A tibble: 188 × 71 start_date end_date finished condition subject age mood_pre mood_post <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> 1 11/3/2014 11/3/2014 1 control 2 24 81 NA 2 11/3/2014 11/3/2014 1 stress 1 19 59 42 3 11/3/2014 11/3/2014 1 stress 3 19 22 60 4 11/3/2014 11/3/2014 1 stress 4 22 53 68 5 11/3/2014 11/3/2014 1 control 7 22 48 NA 6 11/3/2014 11/3/2014 1 stress 6 22 73 73 7 11/3/2014 11/3/2014 1 control 5 18 NA NA 8 11/3/2014 11/3/2014 1 control 9 20 100 NA 9 11/3/2014 11/3/2014 1 stress 16 21 67 74 10 11/3/2014 11/3/2014 1 stress 13 19 30 68 # … with 178 more rows, and 63 more variables: moral_dilemma_dog <dbl>, # moral_dilemma_wallet <dbl>, moral_dilemma_plane <dbl>, # moral_dilemma_resume <dbl>, moral_dilemma_kitten <dbl>, # moral_dilemma_trolley <dbl>, moral_dilemma_control <dbl>, # presentation_experience <dbl>, presentation_unpleasant <dbl>, # presentation_fun <dbl>, presentation_challenge <dbl>, PBC_1 <dbl>, # PBC_2 <dbl>, PBC_3 <dbl>, PBC_4 <dbl>, PBC_5 <dbl>, MAIA_1_1 <dbl>, … ``` ] --- # Filtering for rows: `filter()` .pull-left[ Let's take a look at all the data that was excluded. `exclude` is coded as `\(0=include\)` and `\(1=exclude\)`. ```r filter(judgments, exclude == 1) ``` ``` # A tibble: 3 × 158 start_date end_date finished condition subject gender age mood_pre mood_post <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> 1 11/3/2014 11/3/2014 1 stress 28 male 22 53 68 2 11/3/2014 11/3/2014 1 stress 32 female 19 74 77 3 11/7/2014 11/7/2014 1 stress 181 male 22 47 65 # … with 149 more variables: STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>, # STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>, # STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, # STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>, # STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>, # STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>, # STAI_pre_3_4 <dbl>, STAI_pre_3_5 <dbl>, STAI_pre_3_6 <dbl>, … ``` ] .pull-right[  .footnote[Artwork by .bold[[@allison_horst](https://github.com/allisonhorst/stats-illustrations)]] ] --- class: hide_logo # Filtering rows .pull-left[ ### Multiple conditions: AND - **comma** separated conditions are equivalent to `&` (**AND**) - Cases that are included and have a high mood before the experiment. ```r filter(judgments, exclude == 0, mood_pre > 85) ``` ``` # A tibble: 8 × 158 start_date end_date finished condition subject gender age mood_pre mood_post <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> 1 11/3/2014 11/3/2014 1 control 9 male 20 100 NA 2 11/3/2014 11/3/2014 1 control 19 male 23 100 NA 3 11/3/2014 11/3/2014 1 control 39 male 20 87 NA 4 11/4/2014 11/4/2014 1 control 74 female 23 94 NA 5 11/5/2014 11/5/2014 1 control 105 female 21 93 NA 6 11/6/2014 11/6/2014 1 stress 142 female 22 96 98 7 11/7/2014 11/7/2014 1 control 167 male 18 87 NA 8 11/7/2014 11/7/2014 1 control 168 female 21 95 NA # … with 149 more variables: STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>, # STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>, # STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, # STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>, # STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>, # STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>, # STAI_pre_3_4 <dbl>, STAI_pre_3_5 <dbl>, STAI_pre_3_6 <dbl>, … ``` ] -- .pull-right[ ### Multiple conditions: OR - **pipe (`|`)** separated conditions are combined with OR - Filter participants that have a low mood before or after the experiment. ```r filter(judgments, mood_pre < 20 | mood_post < 20 ) ``` ``` # A tibble: 8 × 158 start_date end_date finished condition subject gender age mood_pre mood_post <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> 1 11/3/2014 11/3/2014 1 stress 24 female 20 66 0 2 11/3/2014 11/3/2014 1 stress 22 female 18 13 37 3 11/3/2014 11/3/2014 1 control 40 male 22 19 NA 4 11/5/2014 11/5/2014 1 stress 81 female 17 27 11 5 11/6/2014 11/6/2014 1 stress 127 female 18 15 20 6 11/6/2014 11/6/2014 1 stress 138 male 21 18 44 7 11/6/2014 11/6/2014 1 stress 159 female 18 9 0 8 11/7/2014 11/7/2014 1 stress 161 female 19 11 31 # … with 149 more variables: STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>, # STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>, # STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, # STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>, # STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>, # STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>, # STAI_pre_3_4 <dbl>, STAI_pre_3_5 <dbl>, STAI_pre_3_6 <dbl>, … ``` ] --- # `relocate()` - Only present since `dplyr` version .bold[1.0] - `.before` and `.after` for fine placement. .center[
.footnote[Artwork by .bold[[@allison_horst](https://github.com/allisonhorst/stats-illustrations)]] ] --- # Filtering **out** rows .left-column[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[💡 Row vs column selection ]] - `filter()` acts on rows - `select()` acts on columns ]] .right-column[ - Remove excluded participants - Combine with `relocate()` to place `mood` column first ```r filter(judgments, exclude == 0) %>% relocate(contains("mood")) ``` ``` # A tibble: 185 × 158 mood_pre mood_post start_date end_date finished condition subject gender <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> 1 81 NA 11/3/2014 11/3/2014 1 control 2 female 2 59 42 11/3/2014 11/3/2014 1 stress 1 female 3 22 60 11/3/2014 11/3/2014 1 stress 3 female 4 53 68 11/3/2014 11/3/2014 1 stress 4 female 5 48 NA 11/3/2014 11/3/2014 1 control 7 female 6 73 73 11/3/2014 11/3/2014 1 stress 6 female 7 NA NA 11/3/2014 11/3/2014 1 control 5 female 8 100 NA 11/3/2014 11/3/2014 1 control 9 male 9 67 74 11/3/2014 11/3/2014 1 stress 16 female 10 30 68 11/3/2014 11/3/2014 1 stress 13 female # … with 175 more rows, and 150 more variables: age <dbl>, STAI_pre_1_1 <dbl>, # STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>, # STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, # STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, # STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, # STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, # STAI_pre_3_3 <dbl>, STAI_pre_3_4 <dbl>, STAI_pre_3_5 <dbl>, … ``` ] --- ## Filter rows, helper `between()` ```r judgments %>% filter(between(mood_pre, 40, 60)) %>% select(age, gender, condition, mood_pre, mood_post) ``` ``` # A tibble: 64 × 5 age gender condition mood_pre mood_post <dbl> <chr> <chr> <dbl> <dbl> 1 19 female stress 59 42 2 22 female stress 53 68 3 22 female control 48 NA 4 19 female stress 55 57 5 18 female stress 53 38 6 19 female control 59 NA 7 19 female stress 60 53 8 22 male stress 53 68 9 18 male control 50 NA 10 19 female stress 58 45 # … with 54 more rows ``` --- # Set operations with `filter()` - The two below are equivalent - For larger operations use `inner_join()` - Below, `tidyselect` helper .bold.large[`:`] for a range of columns .pull-left[ ```r judgments %>% filter(is.element(start_date, c("11/3/2014", "11/5/2014"))) ``` ``` # A tibble: 79 × 158 start_date end_date finished condition subject gender age mood_pre <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> 1 11/3/2014 11/3/2014 1 control 2 female 24 81 2 11/3/2014 11/3/2014 1 stress 1 female 19 59 3 11/3/2014 11/3/2014 1 stress 3 female 19 22 4 11/3/2014 11/3/2014 1 stress 4 female 22 53 5 11/3/2014 11/3/2014 1 control 7 female 22 48 6 11/3/2014 11/3/2014 1 stress 6 female 22 73 7 11/3/2014 11/3/2014 1 control 5 female 18 NA 8 11/3/2014 11/3/2014 1 control 9 male 20 100 9 11/3/2014 11/3/2014 1 stress 16 female 21 67 10 11/3/2014 11/3/2014 1 stress 13 female 19 30 # … with 69 more rows, and 150 more variables: mood_post <dbl>, # STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, # STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, # STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, # STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, # STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, # STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>, STAI_pre_3_4 <dbl>, … ``` ] .pull-right[ ```r judgments %>% filter(start_date %in% c("11/3/2014", "11/5/2014")) %>% select(start_date:age) ``` ``` # A tibble: 79 × 7 start_date end_date finished condition subject gender age <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> 1 11/3/2014 11/3/2014 1 control 2 female 24 2 11/3/2014 11/3/2014 1 stress 1 female 19 3 11/3/2014 11/3/2014 1 stress 3 female 19 4 11/3/2014 11/3/2014 1 stress 4 female 22 5 11/3/2014 11/3/2014 1 control 7 female 22 6 11/3/2014 11/3/2014 1 stress 6 female 22 7 11/3/2014 11/3/2014 1 control 5 female 18 8 11/3/2014 11/3/2014 1 control 9 male 20 9 11/3/2014 11/3/2014 1 stress 16 female 21 10 11/3/2014 11/3/2014 1 stress 13 female 19 # … with 69 more rows ``` ] --- # Filter out rows that are unique: `distinct()` .pull-left[ Suppose we select the dates and control status (`condition`) for a particular start date. Do we have different end dates? ```r judgments %>% select(start_date, end_date, condition) ``` ``` # A tibble: 188 × 3 start_date end_date condition <chr> <chr> <chr> 1 11/3/2014 11/3/2014 control 2 11/3/2014 11/3/2014 stress 3 11/3/2014 11/3/2014 stress 4 11/3/2014 11/3/2014 stress 5 11/3/2014 11/3/2014 control 6 11/3/2014 11/3/2014 stress 7 11/3/2014 11/3/2014 control 8 11/3/2014 11/3/2014 control 9 11/3/2014 11/3/2014 stress 10 11/3/2014 11/3/2014 stress # … with 178 more rows ``` ] -- .pull-right[ - Multiple identical rows. - Use `distinct()` to remove duplicated rows: ```r judgments %>% filter(exclude == 0) %>% select(start_date, end_date, condition) %>% distinct() ``` ``` # A tibble: 10 × 3 start_date end_date condition <chr> <chr> <chr> 1 11/3/2014 11/3/2014 control 2 11/3/2014 11/3/2014 stress 3 11/4/2014 11/4/2014 stress 4 11/4/2014 11/4/2014 control 5 11/5/2014 11/5/2014 stress 6 11/5/2014 11/5/2014 control 7 11/6/2014 11/6/2014 control 8 11/6/2014 11/6/2014 stress 9 11/7/2014 11/7/2014 stress 10 11/7/2014 11/7/2014 control ``` - Also possible (except columns order): ```r judgments %>% filter(exclude == 0) %>% distinct(start_date, end_date, condition) ``` ] --- # Sort columns: `arrange()` .pull-left[ ### A nested sorting example 1. sort by `mood_pre` 2. within each group of `mood_pre`, sort by `mood_post` ```r judgments %>% arrange(mood_pre, mood_post) %>% select(subject, mood_pre, mood_post) ``` ``` # A tibble: 188 × 3 subject mood_pre mood_post <dbl> <dbl> <dbl> 1 159 9 0 2 161 11 31 3 22 13 37 4 127 15 20 5 138 18 44 6 40 19 NA 7 99 20 28 8 150 20 62 9 84 21 48 10 3 22 60 # … with 178 more rows ``` ] .pull-right[ ### Reverse sort columns - Use `arrange()` with the helper function `desc()` - For example, *oldest* participant first ```r arrange(judgments, desc(age)) %>% select(subject, age, condition) ``` ``` # A tibble: 188 × 3 subject age condition <dbl> <dbl> <chr> 1 107 31 stress 2 61 27 stress 3 41 26 control 4 183 25 stress 5 2 24 control 6 93 24 stress 7 115 24 stress 8 137 24 control 9 23 23 control 10 19 23 control # … with 178 more rows ``` ] --- # Verbs to inspect data .flex[ .w-55.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1.ml1[ .large[.bbox[Summary]] - `glimpse()` to get an overview of each column's content - `select()` to pick and/or omit columns + helper functions - `relocate()` re-arrange columns order - `filter()` to subset + AND/OR conditions (`,`, `|`) - `arrange()` to sort + combine with `desc()` to reverse the sorting ] .w-45.mt2.ph3[
.footnote[Artwork by .bold[[@allison_horst](https://github.com/allisonhorst/stats-illustrations)]] ] ] --- background-image: url("https://raw.githubusercontent.com/courtiol/Rguides/master/pngs/dplyr_guide_for_one_table_part1.png") background-size: contain Source: .bold[Alexandre Couriol], R data Berlin --- class: inverse, center, middle # Transforming columns --- class: hide_logo # Changing column names .pull-left[ #### `rename()` `rename(tib, new_name = old_name)`. _to remember the order of appearance, consider `=` as "was"._ ```r rename(judgments, sex = gender, remove = exclude) ``` ``` # A tibble: 188 × 158 start_date end_date finished condition subject sex age mood_pre mood_post <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> 1 11/3/2014 11/3/2014 1 control 2 female 24 81 NA 2 11/3/2014 11/3/2014 1 stress 1 female 19 59 42 3 11/3/2014 11/3/2014 1 stress 3 female 19 22 60 4 11/3/2014 11/3/2014 1 stress 4 female 22 53 68 5 11/3/2014 11/3/2014 1 control 7 female 22 48 NA 6 11/3/2014 11/3/2014 1 stress 6 female 22 73 73 7 11/3/2014 11/3/2014 1 control 5 female 18 NA NA 8 11/3/2014 11/3/2014 1 control 9 male 20 100 NA 9 11/3/2014 11/3/2014 1 stress 16 female 21 67 74 10 11/3/2014 11/3/2014 1 stress 13 female 19 30 68 # … with 178 more rows, and 149 more variables: STAI_pre_1_1 <dbl>, # STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>, # STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, # STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, # STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, # STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, # STAI_pre_3_3 <dbl>, STAI_pre_3_4 <dbl>, STAI_pre_3_5 <dbl>, … ``` ] -- .pull-right[ #### With a function: `rename_with()` Uppercase certain columns using `stringr` and `tidyselect` ```r rename_with(judgments, stringr::str_to_lower, starts_with("STAI")) %>% select(contains("pre")) ``` ``` # A tibble: 188 × 27 mood_pre stai_pre_1_1 stai_pre_1_2 stai_pre_1_3 stai_pre_1_4 stai_pre_1_5 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 81 2 1 2 2 2 2 59 3 2 3 1 3 3 22 4 3 3 3 4 4 53 2 2 2 2 3 5 48 1 1 1 1 2 6 73 2 2 1 1 2 7 NA 2 2 1 1 2 8 100 1 1 1 1 1 9 67 2 2 1 1 2 10 30 4 2 3 3 3 # … with 178 more rows, and 21 more variables: stai_pre_1_6 <dbl>, # stai_pre_1_7 <dbl>, stai_pre_2_1 <dbl>, stai_pre_2_2 <dbl>, # stai_pre_2_3 <dbl>, stai_pre_2_4 <dbl>, stai_pre_2_5 <dbl>, # stai_pre_2_6 <dbl>, stai_pre_2_7 <dbl>, stai_pre_3_1 <dbl>, # stai_pre_3_2 <dbl>, stai_pre_3_3 <dbl>, stai_pre_3_4 <dbl>, # stai_pre_3_5 <dbl>, stai_pre_3_6 <dbl>, presentation_experience <dbl>, # presentation_unpleasant <dbl>, presentation_fun <dbl>, … ``` ] --- # Adding columns: `mutate()` .pull-left[ Let's create a new column `mood_change` that describes the change of the mood of the participant across the experiment. - New column name: `mood_change` - Computation: subtract `mood_pre` from `mood_post` ```r judgments %>% mutate(mood_change = mood_post - mood_pre) %>% relocate(starts_with("mood")) ``` ``` # A tibble: 188 × 159 mood_pre mood_post mood_change start_date end_date finished condition subject <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl> 1 81 NA NA 11/3/2014 11/3/20… 1 control 2 2 59 42 -17 11/3/2014 11/3/20… 1 stress 1 3 22 60 38 11/3/2014 11/3/20… 1 stress 3 4 53 68 15 11/3/2014 11/3/20… 1 stress 4 5 48 NA NA 11/3/2014 11/3/20… 1 control 7 6 73 73 0 11/3/2014 11/3/20… 1 stress 6 7 NA NA NA 11/3/2014 11/3/20… 1 control 5 8 100 NA NA 11/3/2014 11/3/20… 1 control 9 9 67 74 7 11/3/2014 11/3/20… 1 stress 16 10 30 68 38 11/3/2014 11/3/20… 1 stress 13 # … with 178 more rows, and 151 more variables: gender <chr>, age <dbl>, # STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, # STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, # STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, # STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, # STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, # STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>, STAI_pre_3_4 <dbl>, … ``` ] .pull-right[
.footnote[Artwork by .bold[[@allison_horst](https://github.com/allisonhorst/stats-illustrations)]] ] --- # Within one mutate statement .left-column[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt2.mr1[ .large[.bbox[💡 Instant availability ]] Use new variables in the same function call right away! ]] .right-column[ ```r judgments %>% mutate( mood_change = mood_post - mood_pre, mood_change_norm = abs(mood_change / mean(mood_change, na.rm = TRUE))) %>% relocate(starts_with("mood")) %>% arrange(desc(mood_change_norm)) ``` ``` # A tibble: 188 × 160 mood_pre mood_post mood_change mood_change_norm start_date end_date finished <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> 1 66 0 -66 9.12 11/3/2014 11/3/2014 1 2 77 22 -55 7.60 11/4/2014 11/4/2014 1 3 47 100 53 7.32 11/5/2014 11/5/2014 1 4 25 72 47 6.49 11/4/2014 11/4/2014 1 5 22 69 47 6.49 11/5/2014 11/5/2014 1 6 37 83 46 6.36 11/6/2014 11/6/2014 1 7 20 62 42 5.80 11/6/2014 11/6/2014 1 8 60 100 40 5.53 11/4/2014 11/4/2014 1 9 22 60 38 5.25 11/3/2014 11/3/2014 1 10 30 68 38 5.25 11/3/2014 11/3/2014 1 # … with 178 more rows, and 153 more variables: condition <chr>, subject <dbl>, # gender <chr>, age <dbl>, STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>, # STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>, # STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, # STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>, # STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>, # STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>, … ``` ] --- # Replacing columns .left-column[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[💡 Update] Using existing columns updates their content. ] ] .bg-washed-red.b--red.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.rbox[💡 Warn ]] If not using names the actions is used as name (avoid) ] ] .right-column[ #### `mutate()` existing columns .large[Rescaling `mood` columns] ```r judgments %>% mutate(mood_pre = mood_pre/mean(mood_pre, na.rm = TRUE), mood_post = mood_post/mean(mood_post, na.rm = TRUE), mood_pre / mean(mood_post, na.rm = TRUE)) %>% select( starts_with("mood")) ``` ``` # A tibble: 188 × 3 mood_pre mood_post `mood_pre/mean(mood_post, na.rm = TRUE)` <dbl> <dbl> <dbl> 1 1.36 NA 1.36 2 0.994 0.680 0.994 3 0.371 0.971 0.371 4 0.893 1.10 0.893 5 0.809 NA 0.809 6 1.23 1.18 1.23 7 NA NA NA 8 1.68 NA 1.68 9 1.13 1.20 1.13 10 0.505 1.10 0.505 # … with 178 more rows ``` ] --- # `case_when()` .pull-left[ Categorize the `mood_pre` column ```r judgments %>% mutate(mood_pre_cat = case_when( mood_pre < 25 ~ "poor", mood_pre > 75 ~ "great", TRUE ~ "normal")) %>% select(starts_with("mood")) %>% arrange(desc(mood_pre)) ``` ``` # A tibble: 188 × 3 mood_pre mood_post mood_pre_cat <dbl> <dbl> <chr> 1 100 NA great 2 100 NA great 3 96 98 great 4 95 NA great 5 94 NA great 6 93 NA great 7 87 NA great 8 87 NA great 9 85 91 great 10 84 NA great # … with 178 more rows ``` ] .pull-right[
Artwork by .bold[[@allison_horst](https://github.com/allisonhorst/stats-illustrations)] ] --- # Act on multiple columns at once using `across()` .pull-left[ .bg-washed-blue.b--blue.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[Usage 📚]] Can be plugged into `mutate()`, `filter()`, `summarise()`... `across`(.green[ON WHO], .red[DO WHAT]) - .green.bold[Columns] selection: + Argument `.cols =` + `tidyselect` *helpers* + `everything()` to select all columns. + **Conditions** (boolean) needs `where()`, `across(where(is.numeric))` - .red.bold[Actions] using functions: + Argument `.fns =` + `function, arg1, arg2` + `~ function(.x)`, with `placeholder` `.x` + Multiple functions need to be wrapped in a `list()` + New column names can be controlled ] ] .pull-right[ .bg-washed-red.b--red.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.rbox[Recent changes
]] The functions `mutate_all()`, `mutate_if()` and `mutate_at()` have been superseded by using `mutate()` with `across()` and `where()`. ]
.footnote[Artwork by .bold[[@allison_horst](https://github.com/allisonhorst/stats-illustrations)]] ] --- # Examples of `across()` usage .pull-left[ #### Find rows where both mood data columns are missing ```r judgments %>% filter(across(starts_with("mood_p"), is.na)) %>% select(subject, starts_with("mood")) ``` ``` # A tibble: 1 × 3 subject mood_pre mood_post <dbl> <dbl> <dbl> 1 5 NA NA ``` Of note: interesting .bold[new] function: `if_any()`: `filter(if_any(everything(), is.na))` ] -- .pull-right[ #### Add 1 to the PBC data To convert Likert scales ```r judgments %>% mutate(across(contains("STAI"), `+`, 1)) %>% select(starts_with("STAI")) ``` ``` # A tibble: 188 × 42 STAI_pre_1_1 STAI_pre_1_2 STAI_pre_1_3 STAI_pre_1_4 STAI_pre_1_5 STAI_pre_1_6 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 3 2 3 3 3 3 2 4 3 4 2 4 3 3 5 4 4 4 5 3 4 3 3 3 3 4 2 5 2 2 2 2 3 2 6 3 3 2 2 3 2 7 3 3 2 2 3 2 8 2 2 2 2 2 2 9 3 3 2 2 3 2 10 5 3 4 4 4 2 # … with 178 more rows, and 36 more variables: STAI_pre_1_7 <dbl>, # STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, # STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, # STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, # STAI_pre_3_3 <dbl>, STAI_pre_3_4 <dbl>, STAI_pre_3_5 <dbl>, # STAI_pre_3_6 <dbl>, STAI_post_1_1 <dbl>, STAI_post_1_2 <dbl>, # STAI_post_1_3 <dbl>, STAI_post_1_4 <dbl>, STAI_post_1_5 <dbl>, … ``` `` `+` `` to access the sum function name, a bit tricky ] --- class: hide_logo # Simplifying actions with the placeholder `.x` .pull-left[ ### Add 1 to the MAIA questionnaire data Activate the placeholder (`.x`) with the .huge.blue[~] ```r judgments %>% mutate(across(contains("MAIA"), ~ .x + 1)) %>% select(starts_with("MAIA")) ``` ``` # A tibble: 188 × 40 MAIA_1_1 MAIA_1_2 MAIA_1_3 MAIA_1_4 MAIA_1_5 MAIA_1_6 MAIA_1_7 MAIA_1_8 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 3 5 5 5 3 3 3 4 2 NA NA NA NA NA NA NA NA 3 NA NA NA NA NA NA NA NA 4 NA NA NA NA NA NA NA NA 5 4 4 4 4 4 4 4 4 6 NA NA NA NA NA NA NA NA 7 5 5 5 6 3 3 5 4 8 4 5 5 5 6 3 4 6 9 NA NA NA NA NA NA NA NA 10 NA NA NA NA NA NA NA NA # … with 178 more rows, and 32 more variables: MAIA_1_9 <dbl>, MAIA_1_10 <dbl>, # MAIA_1_11 <dbl>, MAIA_1_12 <dbl>, MAIA_1_13 <dbl>, MAIA_1_14 <dbl>, # MAIA_1_15 <dbl>, MAIA_1_16 <dbl>, MAIA_2_1 <dbl>, MAIA_2_2 <dbl>, # MAIA_2_3 <dbl>, MAIA_2_4 <dbl>, MAIA_2_5 <dbl>, MAIA_2_6 <dbl>, # MAIA_2_7 <dbl>, MAIA_2_8 <dbl>, MAIA_2_9 <dbl>, MAIA_2_10 <dbl>, # MAIA_2_11 <dbl>, MAIA_2_12 <dbl>, MAIA_2_13 <dbl>, MAIA_2_14 <dbl>, # MAIA_2_15 <dbl>, MAIA_2_16 <dbl>, MAIA_noticing <dbl>, … ``` `.x` will take the value of any column selected, similar to the `i` in `for` loops. ] -- .pull-right[ ### Placeholder allows to use function calls ```r judgments %>% mutate(across(contains("mood"), ~ .x / 100)) %>% select(starts_with("mood")) ``` ``` # A tibble: 188 × 2 mood_pre mood_post <dbl> <dbl> 1 0.81 NA 2 0.59 0.42 3 0.22 0.6 4 0.53 0.68 5 0.48 NA 6 0.73 0.73 7 NA NA 8 1 NA 9 0.67 0.74 10 0.3 0.68 # … with 178 more rows ``` .w-70.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ This is advanced usage, ok to find it hard.
] ] --- # Selecting columns with a predicate `where()` .pull-left[ ### Add 1 to numeric columns for all questionnaire data - predicate means return `TRUE` or `FALSE` ```r judgments %>% mutate(across(where(is.numeric), ~ .x + 1)) ``` ``` # A tibble: 188 × 158 start_date end_date finished condition subject gender age mood_pre <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> 1 11/3/2014 11/3/2014 2 control 3 female 25 82 2 11/3/2014 11/3/2014 2 stress 2 female 20 60 3 11/3/2014 11/3/2014 2 stress 4 female 20 23 4 11/3/2014 11/3/2014 2 stress 5 female 23 54 5 11/3/2014 11/3/2014 2 control 8 female 23 49 6 11/3/2014 11/3/2014 2 stress 7 female 23 74 7 11/3/2014 11/3/2014 2 control 6 female 19 NA 8 11/3/2014 11/3/2014 2 control 10 male 21 101 9 11/3/2014 11/3/2014 2 stress 17 female 22 68 10 11/3/2014 11/3/2014 2 stress 14 female 20 31 # … with 178 more rows, and 150 more variables: mood_post <dbl>, # STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, # STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, # STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, # STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, # STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, # STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>, STAI_pre_3_4 <dbl>, … ``` ] -- .pull-right[ .bg-washed-red.b--red.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.rbox[
Watchout
️]] Now we also get `subject` changed! ] ] --- # More advanced across: multiple functions .pull-left[ #### Summarise by the mean of mood ```r judgments %>% summarise(across(starts_with("mood"), list(mean, sd))) ``` ``` # A tibble: 1 × 4 mood_pre_1 mood_pre_2 mood_post_1 mood_post_2 <dbl> <dbl> <dbl> <dbl> 1 NA NA NA NA ``` NA is obviously not playing nice here. Need to supply additional arguments (`na.rm = TRUE`) ] -- .pull-right[ #### Better with naming and function arguments ```r judgments %>% summarise(across(starts_with("moral_dil"), list(aveg = ~ mean(.x, na.rm = TRUE), sdev = ~ sd(.x, na.rm = TRUE)))) ``` ``` # A tibble: 1 × 14 moral_dilemma_dog_aveg moral_dilemma_dog… moral_dilemma_wal… moral_dilemma_wa… <dbl> <dbl> <dbl> <dbl> 1 7.35 2.17 7.14 2.02 # … with 10 more variables: moral_dilemma_plane_aveg <dbl>, # moral_dilemma_plane_sdev <dbl>, moral_dilemma_resume_aveg <dbl>, # moral_dilemma_resume_sdev <dbl>, moral_dilemma_kitten_aveg <dbl>, # moral_dilemma_kitten_sdev <dbl>, moral_dilemma_trolley_aveg <dbl>, # moral_dilemma_trolley_sdev <dbl>, moral_dilemma_control_aveg <dbl>, # moral_dilemma_control_sdev <dbl> ``` ] --- class: hide_logo # Manipulation by row .pull-left[ .bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt1.mr1[ #### Summing all scores for the STAI questionnaire ```r judgments %>% mutate(total_stai = sum(across(contains("STAI")), na.rm = TRUE)) %>% select(subject, total_stai, contains("STAI")) ``` ``` # A tibble: 188 × 44 subject total_stai STAI_pre_1_1 STAI_pre_1_2 STAI_pre_1_3 STAI_pre_1_4 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 2 22866 2 1 2 2 2 1 22866 3 2 3 1 3 3 22866 4 3 3 3 4 4 22866 2 2 2 2 5 7 22866 1 1 1 1 6 6 22866 2 2 1 1 7 5 22866 2 2 1 1 8 9 22866 1 1 1 1 9 16 22866 2 2 1 1 10 13 22866 4 2 3 3 # … with 178 more rows, and 38 more variables: STAI_pre_1_5 <dbl>, # STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, # STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>, # STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>, # STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>, # STAI_pre_3_4 <dbl>, STAI_pre_3_5 <dbl>, STAI_pre_3_6 <dbl>, # STAI_post_1_1 <dbl>, STAI_post_1_2 <dbl>, STAI_post_1_3 <dbl>, … ``` ] ] -- .pull-right[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt1.mr1[ * `rowwise()` - computation by row * `c_across()` - selects columns with tidyselect helpers ```r judgments %>% * rowwise() %>% * mutate(total_stai = sum(c_across( starts_with("STAI")))) %>% select(subject, total_stai, contains("STAI")) ``` ``` # A tibble: 188 × 44 # Rowwise: subject total_stai STAI_pre_1_1 STAI_pre_1_2 STAI_pre_1_3 STAI_pre_1_4 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 2 NA 2 1 2 2 2 1 200 3 2 3 1 3 3 212 4 3 3 3 4 4 148 2 2 2 2 5 7 NA 1 1 1 1 6 6 134 2 2 1 1 7 5 NA 2 2 1 1 8 9 NA 1 1 1 1 9 16 122 2 2 1 1 10 13 196 4 2 3 3 # … with 178 more rows, and 38 more variables: STAI_pre_1_5 <dbl>, # STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, # STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>, # STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>, # STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>, # STAI_pre_3_4 <dbl>, STAI_pre_3_5 <dbl>, STAI_pre_3_6 <dbl>, # STAI_post_1_1 <dbl>, STAI_post_1_2 <dbl>, STAI_post_1_3 <dbl>, … ``` ] ] --- background-image: url("https://raw.githubusercontent.com/courtiol/Rguides/master/pngs/dplyr_guide_for_one_table_part2.png") background-size: contain Source: Alexandre Courtiol, R data Berlin --- class: inverse, middle, center # Grouping and summarizing --- # Counting elements .Large[How many participants per condition?] -- .pull-left[ ### Replacing calls to `table()` ```r head(as.data.frame(table(judgments$condition))) ``` ``` Var1 Freq 1 control 91 2 stress 97 ``` Ugly and convoluted.
] -- .pull-right[ - `count()` groups by specified columns - Result is a tibble that can be processed further - `sort = TRUE` avoid piping to `arrange(desc(n))` ```r count(judgments, condition, sort = TRUE) ``` ``` # A tibble: 2 × 2 condition n <chr> <int> 1 stress 97 2 control 91 ``` `count()` is a shortcut for: ```r group_by(judgments, condition) %>% summarise(n = n(), .groups = "drop") ``` ] --- # Summarise data .pull-left[ #### When counting is not enough ```r judgments %>% summarise(, min = min(mood_pre, na.rm = TRUE), max = max(mood_pre, na.rm = TRUE)) ``` ``` # A tibble: 1 × 2 min max <dbl> <dbl> 1 9 100 ``` - `summarise`returns as many rows as groups, one grouping being peeled off from the right for functions with a single return value. - `mutate` returns as many rows as given ```r judgments %>% mutate( min = min(mood_pre, na.rm = TRUE), max = max(mood_pre, na.rm = TRUE)) ``` ``` # A tibble: 188 × 160 start_date end_date finished condition subject gender age mood_pre <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> 1 11/3/2014 11/3/2014 1 control 2 female 24 81 2 11/3/2014 11/3/2014 1 stress 1 female 19 59 3 11/3/2014 11/3/2014 1 stress 3 female 19 22 4 11/3/2014 11/3/2014 1 stress 4 female 22 53 5 11/3/2014 11/3/2014 1 control 7 female 22 48 6 11/3/2014 11/3/2014 1 stress 6 female 22 73 7 11/3/2014 11/3/2014 1 control 5 female 18 NA 8 11/3/2014 11/3/2014 1 control 9 male 20 100 9 11/3/2014 11/3/2014 1 stress 16 female 21 67 10 11/3/2014 11/3/2014 1 stress 13 female 19 30 # … with 178 more rows, and 152 more variables: mood_post <dbl>, # STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, # STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, # STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, # STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, # STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, # STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>, STAI_pre_3_4 <dbl>, … ``` ] -- .pull-right[ #### If we want the min/max per biotype And keep the grouping at the same time ```r judgments %>% group_by(condition) %>% summarise(min = min(mood_pre, na.rm = TRUE), max = max(mood_pre, na.rm = TRUE), .groups = "keep") ``` ``` # A tibble: 2 × 3 # Groups: condition [2] condition min max <chr> <dbl> <dbl> 1 control 19 100 2 stress 9 96 ``` ] --- # Grouping by more than one variable .pull-left[ #### Peeling effect ```r judgments %>% group_by(condition, gender) %>% summarise(n = n()) -> n_part ``` ``` `summarise()` has grouped output by 'condition'. You can override using the `.groups` argument. ``` ```r slice_head(n_part, n = 1) ``` ``` # A tibble: 2 × 3 # Groups: condition [2] condition gender n <chr> <chr> <int> 1 control female 65 2 stress female 82 ``` .rbox[
Most functions in `dplyr` are group-aware!] ] -- .pull-right[ #### No grouping after additional calls to `summarise()` ```r summarise(n_part, n = n()) %>% slice_head(n = 3) ``` ``` # A tibble: 2 × 2 condition n <chr> <int> 1 control 2 2 stress 2 ``` ] --- # Grouping (cont.) .left-column[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[💡 Remark ]] Ask explicitly to .bold[ungroup] data or use the `ungroup()` function. ] ] .right-column[ #### How many different answers for a variable by condition and gender? ```r judgments %>% group_by(condition, gender ) %>% summarise(n_ans = n_distinct( STAI_post_1_1), .groups = "drop") ``` ``` # A tibble: 4 × 3 condition gender n_ans <chr> <chr> <int> 1 control female 1 2 control male 1 3 stress female 4 4 stress male 3 ``` ] --- # Summarise can deal with multiple values per group .pull-left[ #### Since v1.0 - `range()` returns min .bold[and] max - `summarise()` duplicates the key names ```r judgments %>% group_by(condition, gender) %>% summarise(range = range(mood_pre - mood_post), n = n(), .groups = "keep") ``` ``` # A tibble: 8 × 4 # Groups: condition, gender [4] condition gender range n <chr> <chr> <dbl> <int> 1 control female NA 65 2 control female NA 65 3 control male NA 26 4 control male NA 26 5 stress female -53 82 6 stress female 66 82 7 stress male -32 15 8 stress male 0 15 ``` ] -- .pull-right[ #### More advanced but useful: 3 quantiles ```r judgments %>% filter(!is.na(mood_pre)) %>% group_by(condition) %>% summarise( quan = quantile(mood_pre, c(0.25, 0.5, 0.75)), q = c(0.25, 0.5, 0.75), n = n(), .groups = "keep") %>% rmarkdown::paged_table() ``` <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["condition"],"name":[1],"type":["chr"],"align":["left"]},{"label":["quan"],"name":[2],"type":["dbl"],"align":["right"]},{"label":["q"],"name":[3],"type":["dbl"],"align":["right"]},{"label":["n"],"name":[4],"type":["int"],"align":["right"]}],"data":[{"1":"control","2":"53","3":"0.25","4":"90"},{"1":"control","2":"65","3":"0.50","4":"90"},{"1":"control","2":"76","3":"0.75","4":"90"},{"1":"stress","2":"44","3":"0.25","4":"97"},{"1":"stress","2":"58","3":"0.50","4":"97"},{"1":"stress","2":"67","3":"0.75","4":"97"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ] --- class: inverse, middle, center # Joining data frames ## Relational operations .center[<img src="img/dplyr_join_inner.png" width="200px"/>] --- # Matching subjects in 2 tibbles .pull-left[ #### Additional data for some participants for coffee consumption Manual input using `tribble()` ```r coffee_drinkers <- tribble(~student, ~coffee_shots, 21, 1, 23, 4, 28, 2) coffee_drinkers ``` ``` # A tibble: 3 × 2 student coffee_shots <dbl> <dbl> 1 21 1 2 23 4 3 28 2 ``` ] .pull-right[ #### Smaller sample set ```r subject_mood <- judgments %>% select(subject, condition, gender, starts_with("mood")) %>% distinct() ``` .footnote[This is made up data for demonstration purposes only.] ] --- # Combining the two tables .pull-left[ ### Inner join  ] .pull-right[ .Large[.red.bold[Error!] no common key.] ```r inner_join(coffee_drinkers, subject_mood) ``` ``` Error: `by` must be supplied when `x` and `y` have no common variables. ℹ use by = character()` to perform a cross-join. ``` #### Provide the corresponding columns names ```r inner_join(subject_mood, coffee_drinkers, by = c(subject = "student")) ``` ``` # A tibble: 3 × 6 subject condition gender mood_pre mood_post coffee_shots <dbl> <chr> <chr> <dbl> <dbl> <dbl> 1 23 control female 78 NA 4 2 21 control female 68 NA 1 3 28 stress male 53 68 2 ``` .footnote[bare name for `subject` as evaluated in `subject_mood`] ] --- # **Mutating** joins .pull-left[ #### Inner join  <span class = "small">credit: [Hadley Wickham, R for data science](http://r4ds.had.co.nz/relational-data.html)</span> ] -- .pull-right[ #### Creating new tables through *joins* * Key operations in data processing * Role of *observations* as row changes * `inner_join()` is the most strict *join* operations * `merge` is a similar operation in base R ] --- .pull-left[ .center[<img src="img/dplyr_join_outer.png" width="400px"/>] ] -- .pull-right[ ### Example with `full_join()` ```r subject_mood %>% full_join(coffee_drinkers, by = c(subject = "student")) ``` ``` # A tibble: 187 × 6 subject condition gender mood_pre mood_post coffee_shots <dbl> <chr> <chr> <dbl> <dbl> <dbl> 1 2 control female 81 NA NA 2 1 stress female 59 42 NA 3 3 stress female 22 60 NA 4 4 stress female 53 68 NA 5 7 control female 48 NA NA 6 6 stress female 73 73 NA 7 5 control female NA NA NA 8 9 control male 100 NA NA 9 16 stress female 67 74 NA 10 13 stress female 30 68 NA # … with 177 more rows ``` ] --- # **Filtering** joins - .large[ >Only the .bold[existence] of a match is important; it doesn’t matter which observation is matched. This means that .green[filtering] joins .bold[never duplicate rows] like .red[mutating] joins do .tr[ — _Hadley Wickam_ - [R for data science](http://r4ds.had.co.nz/relational-data.html#filtering-joins)] ] .pull-left[ ### `semi_join()` Filter matches in x, no duplicates.  ] .pull-right[ ### `anti_join()` Extract what does **not** match.  ] --- class: nvs1 ### `semi_join()` does not alter original .pull-left[ ```r (tx <- tribble(~ x, ~key, "x1", 1, "x2", 2, "x3", 2, "x4", 3)) ``` ``` # A tibble: 4 × 2 x key <chr> <dbl> 1 x1 1 2 x2 2 3 x3 2 4 x4 3 ``` ```r (ty <- tribble(~ y, ~key, "y1", 1, "y2", 2, "y3", 2, "y4", 3)) ``` ``` # A tibble: 4 × 2 y key <chr> <dbl> 1 y1 1 2 y2 2 3 y3 2 4 y4 3 ``` ] -- .pull-right-nvs[ .bbox[Filtering] ```r semi_join(tx, ty) ``` ``` Joining, by = "key" ``` ``` # A tibble: 4 × 2 x key <chr> <dbl> 1 x1 1 2 x2 2 3 x3 2 4 x4 3 ``` .bbox[Mutating] ```r inner_join(tx, ty) ``` ``` Joining, by = "key" ``` ``` # A tibble: 6 × 3 x key y <chr> <dbl> <chr> 1 x1 1 y1 2 x2 2 y2 3 x2 2 y3 4 x3 2 y2 5 x3 2 y3 6 x4 3 y4 ``` ] --- class: center, middle, inverse # Exchange of variables and values with pivot functions ## tidyr functionality --- ## Converting into long or wide formats - pivot functions - The wide format is generally **untidy** _but_ found in the majority of datasets - The **wide** format makes computation on columns sometimes easier - The new functions `pivot_longer()` and `pivot_wider()` supersede older functions called `gather()` and `spread()`. - `pivot` functions are easier to use. No really! .flex[ .w-70.bg-washed-white.b--gray.ba.bw2.br3.shadow-5.ph3.mt3.mr1[  ] .w-70.bg-washed-white.b--gray.ba.bw2.br3.shadow-5.ph3.mt3.mr1[  ]] --- # Making a wide data set longer .pull-left[ ### Calculations involving column and values ### A toy data set Variants by sample and positions. Why is this not `tidy`? ```r variants_wide <- tribble( ~sample_id, ~`3`, ~`5`, ~`8`, "L1002", "A", "C", NA, "L1034", "A", NA, "T", "L1234", NA, "C", "T" ) variants_wide ``` ``` # A tibble: 3 × 4 sample_id `3` `5` `8` <chr> <chr> <chr> <chr> 1 L1002 A C <NA> 2 L1034 A <NA> T 3 L1234 <NA> C T ``` ] .pull-left[ ### A longer, tidier table ```r pivot_longer(variants_wide, -contains("sample"), # columns argument, required names_to = "pos", values_to = "variant") ``` ``` # A tibble: 9 × 3 sample_id pos variant <chr> <chr> <chr> 1 L1002 3 A 2 L1002 5 C 3 L1002 8 <NA> 4 L1034 3 A 5 L1034 5 <NA> 6 L1034 8 T 7 L1234 3 <NA> 8 L1234 5 C 9 L1234 8 T ``` ] --- --- class: nvs1 count: false # Values across dilemmas .panel1-len_cats-auto[ ```r *readr::read_tsv("https://biostat2.uni.lu/practicals/data/judgments.tsv") ``` ] .panel2-len_cats-auto[ ``` Rows: 188 Columns: 158 ``` ``` ── Column specification ──────────────────────────────────────────────────────── Delimiter: "\t" chr (5): start_date, end_date, condition, gender, logbook dbl (153): finished, subject, age, mood_pre, mood_post, STAI_pre_1_1, STAI_p... ``` ``` ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ``` # A tibble: 188 × 158 start_date end_date finished condition subject gender age mood_pre <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> 1 11/3/2014 11/3/2014 1 control 2 female 24 81 2 11/3/2014 11/3/2014 1 stress 1 female 19 59 3 11/3/2014 11/3/2014 1 stress 3 female 19 22 4 11/3/2014 11/3/2014 1 stress 4 female 22 53 5 11/3/2014 11/3/2014 1 control 7 female 22 48 6 11/3/2014 11/3/2014 1 stress 6 female 22 73 7 11/3/2014 11/3/2014 1 control 5 female 18 NA 8 11/3/2014 11/3/2014 1 control 9 male 20 100 9 11/3/2014 11/3/2014 1 stress 16 female 21 67 10 11/3/2014 11/3/2014 1 stress 13 female 19 30 # … with 178 more rows, and 150 more variables: mood_post <dbl>, # STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, # STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, # STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, # STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, # STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, # STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>, STAI_pre_3_4 <dbl>, … ``` ] --- count: false # Values across dilemmas .panel1-len_cats-auto[ ```r readr::read_tsv("https://biostat2.uni.lu/practicals/data/judgments.tsv") %>% * select(subject, condition, age, starts_with("moral_dilemma") ) ``` ] .panel2-len_cats-auto[ ``` Rows: 188 Columns: 158 ``` ``` ── Column specification ──────────────────────────────────────────────────────── Delimiter: "\t" chr (5): start_date, end_date, condition, gender, logbook dbl (153): finished, subject, age, mood_pre, mood_post, STAI_pre_1_1, STAI_p... ``` ``` ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ``` # A tibble: 188 × 10 subject condition age moral_dilemma_dog moral_dilemma_wal… moral_dilemma_p… <dbl> <chr> <dbl> <dbl> <dbl> <dbl> 1 2 control 24 9 9 8 2 1 stress 19 9 9 9 3 3 stress 19 8 7 8 4 4 stress 22 8 4 8 5 7 control 22 3 9 9 6 6 stress 22 9 9 9 7 5 control 18 9 5 7 8 9 control 20 9 4 1 9 16 stress 21 6 9 3 10 13 stress 19 6 8 9 # … with 178 more rows, and 4 more variables: moral_dilemma_resume <dbl>, # moral_dilemma_kitten <dbl>, moral_dilemma_trolley <dbl>, # moral_dilemma_control <dbl> ``` ] --- count: false # Values across dilemmas .panel1-len_cats-auto[ ```r readr::read_tsv("https://biostat2.uni.lu/practicals/data/judgments.tsv") %>% select(subject, condition, age, starts_with("moral_dilemma") ) %>% * pivot_longer(starts_with("moral_dilemma") , * names_to = "dilemma", * values_to = "dilemma_val", * names_pattern = "moral_dilemma_(.*)") ``` ] .panel2-len_cats-auto[ ``` Rows: 188 Columns: 158 ``` ``` ── Column specification ──────────────────────────────────────────────────────── Delimiter: "\t" chr (5): start_date, end_date, condition, gender, logbook dbl (153): finished, subject, age, mood_pre, mood_post, STAI_pre_1_1, STAI_p... ``` ``` ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ``` # A tibble: 1,316 × 5 subject condition age dilemma dilemma_val <dbl> <chr> <dbl> <chr> <dbl> 1 2 control 24 dog 9 2 2 control 24 wallet 9 3 2 control 24 plane 8 4 2 control 24 resume 7 5 2 control 24 kitten 9 6 2 control 24 trolley 5 7 2 control 24 control 9 8 1 stress 19 dog 9 9 1 stress 19 wallet 9 10 1 stress 19 plane 9 # … with 1,306 more rows ``` ] --- count: false # Values across dilemmas .panel1-len_cats-auto[ ```r readr::read_tsv("https://biostat2.uni.lu/practicals/data/judgments.tsv") %>% select(subject, condition, age, starts_with("moral_dilemma") ) %>% pivot_longer(starts_with("moral_dilemma") , names_to = "dilemma", values_to = "dilemma_val", names_pattern = "moral_dilemma_(.*)") %>% * count(condition, dilemma_val) ``` ] .panel2-len_cats-auto[ ``` Rows: 188 Columns: 158 ``` ``` ── Column specification ──────────────────────────────────────────────────────── Delimiter: "\t" chr (5): start_date, end_date, condition, gender, logbook dbl (153): finished, subject, age, mood_pre, mood_post, STAI_pre_1_1, STAI_p... ``` ``` ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ``` # A tibble: 18 × 3 condition dilemma_val n <chr> <dbl> <int> 1 control 1 25 2 control 2 31 3 control 3 48 4 control 4 44 5 control 5 49 6 control 6 65 7 control 7 103 8 control 8 110 9 control 9 162 10 stress 1 13 11 stress 2 35 12 stress 3 32 13 stress 4 51 14 stress 5 56 15 stress 6 54 16 stress 7 103 17 stress 8 133 18 stress 9 202 ``` ] --- count: false # Values across dilemmas .panel1-len_cats-auto[ ```r readr::read_tsv("https://biostat2.uni.lu/practicals/data/judgments.tsv") %>% select(subject, condition, age, starts_with("moral_dilemma") ) %>% pivot_longer(starts_with("moral_dilemma") , names_to = "dilemma", values_to = "dilemma_val", names_pattern = "moral_dilemma_(.*)") %>% count(condition, dilemma_val) %>% * pivot_wider(names_from = dilemma_val, * values_from = n, * names_prefix = "n") ``` ] .panel2-len_cats-auto[ ``` Rows: 188 Columns: 158 ``` ``` ── Column specification ──────────────────────────────────────────────────────── Delimiter: "\t" chr (5): start_date, end_date, condition, gender, logbook dbl (153): finished, subject, age, mood_pre, mood_post, STAI_pre_1_1, STAI_p... ``` ``` ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ``` # A tibble: 2 × 10 condition n1 n2 n3 n4 n5 n6 n7 n8 n9 <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> 1 control 25 31 48 44 49 65 103 110 162 2 stress 13 35 32 51 56 54 103 133 202 ``` ] <style> .panel1-len_cats-auto { color: black; width: 44.5454545454545%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-len_cats-auto { color: black; width: 53.4545454545455%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-len_cats-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> .left.footnote[.bold[[`flipbookr`](https://evamaerey.github.io/flipbooks/about)] by [Gina Reynolds](https://github.com/EvaMaeRey)] --- class: inverse, middle, center # Helpful tools --- # Removing the data frame context through `pull()` .top[.left-column[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[💡 Remark ]] By default tidyverse operations that receive a `tibble` as input return a `tibble`. ]]] .right-column[ ```r pull(judgments, age)[1:5] ``` ``` [1] 24 19 19 22 22 ``` ```r pull(judgments, -20)[1:10] ``` ``` [1] NA 51 41 32 NA 33 NA NA 30 38 ``` ```r # TODO: make this work judgments %>% distinct(mood_pre) %>% pull(mood_pre) ``` ] --- # Comparing to data in other rows .left-column[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[💡 Leading and lagging rows ]] `lead()` for data in following row and `lag()` for data in the row above]] -- .right-column[ ### Calculate differences between subjects Let's assume the subject IDs are in order of the tests being conducted. What is the difference to the previous subject for the "initial mood"? ```r judgments %>% dplyr::select(subject, mood_pre) %>% arrange(subject) %>% mutate(prev_mood_pre = lag(mood_pre), * mood_diff = mood_pre - lag(mood_pre)) ``` ``` # A tibble: 188 × 4 subject mood_pre prev_mood_pre mood_diff <dbl> <dbl> <dbl> <dbl> 1 1 59 NA NA 2 2 81 59 22 3 3 22 81 -59 4 4 53 22 31 5 5 NA 53 NA 6 6 73 NA NA 7 7 48 73 -25 8 8 59 48 11 9 9 100 59 41 10 10 72 100 -28 # … with 178 more rows ``` ] --- # Summary .pull-left[ .flex[ .w-100.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .bbox[Most commonly used - 80% ] - `select()` - columns - `filter()` - rows meeting condition - `arrange()` - sort - `glimpse()` - inspect - `rename()` - change column name - `relocate()` - move columns - `mutate()` - create columns - `across()`, `c_across()` - work on >1 column - `group_by()`, `ungroup()`, `rowwise()` - `summarise()` - group-wise summaries - `lead()` and `lag()` - Values in other rows - `inner_join` and friends - Merging tables - `case_when()` simplifies if/else/if/else source: Lise Vaudor [blog](http://perso.ens-lyon.fr/lise.vaudor/dplyr/) ] ] ] -- .pull-right[ .flex[ .w-100.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .bbox[Comments] * Represent the verbs you will use 80% of your time. Go to the website to see additional functions. * `tidyr` and `dplyr` are replacing the `reshape` and `reshape2` packages - tidy data + <http://tidyr.tidyverse.org/> + `vignette("tidy-data")` - Further reading + `tidyjson` - Retrieve json data as nested tibbles  ]]] --- # The other 20% .pull-left[ .w-100.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .bbox[Occasionally handy ] - Assembly: `bind_rows`, `bind_cols` - Windows function, `min_rank`, `dense_rank`, `cumsum`. See [vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html) - working with [list-columns](https://tidyr.tidyverse.org/articles/nest.html) - [multidplyr](https://multidplyr.tidyverse.org/) for parallelized code ] ] .pull-right[ .w-100.bg-washed-blue.b--blue.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .bbox[SQL mapping allows database access ] - `dplyr` code can be translated into SQL and query databases online (using `dbplyr`) - different types of tabular data ([dplyr SQL backend](https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html), databases, ]] --- # Bigger data ### Go for `data.table` .large[ + See this interesting [thread](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) about comparing `data.table` versus `dplyr` + [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html), see [introduction](https://github.com/Rdatatable/data.table/wiki) is very efficient but the syntax is not so easy. + Main advantage: inline replacement (tidyverse is frequently copying) + As a summary: _tl;dr data.table for speed, dplyr for readability and convenience_ [Prashanth Sriram](https://www.quora.com/Which-is-better-to-use-for-data-manipulation-dplyr-package-or-data-table-library) + Hadley recommends that for data > 1-2 Gb, if speed is your main matter, go for `data.table` + [`dtplyr`](https://dtplyr.tidyverse.org/) is for learning the `data.table` from `dplyr` API input + [`tidytable`](https://markfairbanks.github.io/tidytable/) is actually using `data.table` but with `dplyr` syntax + `data.table` might be not useful for specialized applications with high volumes such as genomics. ]
--- # Before we stop .flex[ .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.ml1[ .large[.gbox[You learned to:
] + Selection and manipulation of - observations, - variables and - values. + Grouping and summarizing + Joining and intersecting tibbles + Reshaping column headers and variables ] ] .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.ml2[ .large[.bbox[Acknowledgments 🙏 👏] * Aurelien Ginolhac, Roland Krause (development) * Hadley Wickham * Lionel Henry * Romain François * [Gina Reynolds](https://github.com/EvaMaeRey) for fantastic flipbooks * [Lise Vaudor](https://twitter.com/allison_horst) nice blog * [Allison Horst](https://twitter.com/allison_horst) for the great ArtWork * [Alexandre Courtiol](https://github.com/courtiol/Rguides) for cheatsheets * Jenny Bryan * [poorman](https://nathaneastwood.github.io/poorman/) by Nathan Eastwood, a re-implementation of `dplyr` in base only ] ] ] .w-60.pv2.ph3.mt1.ml6[ .huge[.bbox[Thank you for your attention!]] ]