class: title-slide # What's new in 2021 ## in the `tidyverse` development .center[<img src="img/00/logo_tidyverse.png" width="100px"/>] ### A. Ginolhac | rworkshop | 2021-09-10
--- class: middle, center, inverse # Tidyverse: pros & cons --- # Workflow, by David Robinson .center[] --- class: nvs1 # Packages in processes .center[] --- class: nvs2 # List of components .pull-left[ #### Core - `ggplot2`, for data visualization - `dplyr`, for data manipulation - `tidyr`, for data tidying - `readr`, for data import (`vroom` default in the future?) - `purrr`, for functional programming - `tibble`, for tibbles, a modern re-imagining of data frames - `stringr`, for strings - `forcats`, for factors .footnote[source: https://tidyverse.tidyverse.org/. H.Wickham] ] .pull-right[ #### Extended - Modelling + `modelr`, for modelling within a pipeline + `broom`, for models -> tidy data - Programming + `rlang`, low-level API + `glue`, alternative to paste - Working with specific types of vectors: + `hms`, for times + `lubridate`, for date/times + `vctrs`, for vectors - Importing other types of data: + `feather`, for sharing data + `fs`, for cross platform file system ops + `haven`, for SPSS, SAS and Stata files + `httr`, for web apis + `jsonlite` for JSON + `readxl`, for `.xls` and `.xlsx` files + `rvest`, for web scraping + `xml2`, for XML files + `DBI`, for relational databases ] --- # Tidyverse criticism, a dialect <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/ucfagls">@ucfagls</a> yeah. I think the tidyverse is a dialect. 
But its accent isn’t so thick</p>— Hadley Wickham (@hadleywickham) <a href="https://twitter.com/hadleywickham/status/819610201946984451">12 January 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> --- # Criticism, controversy .pull-left[ - [In a Stack Overflow comment](http://stackoverflow.com/questions/41880796/grouped-multicolumn-gather-with-dplyr-tidyr-purrr)  - See the popularity of the [data.table versus dplyr](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) question. ] .pull-right[ ### Easily summarized .large[ - [`data.table`](https://github.com/Rdatatable/data.table/wiki) is faster, but for fewer than 10 million rows the difference is negligible. - [`tidyfast`](https://github.com/TysonStanley/tidyfast) for `data.table` speed and `tidyverse` syntax - [`tidytable`](https://github.com/markfairbanks/tidytable) for `data.table` speed and `tidyverse` syntax - [`poorman`](https://github.com/nathaneastwood/poorman) for zero dependencies but slow ;) ] ] --- # Criticism, finding a job .flex[ .w-50.ph3.mt3.mr1[ <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Realized today: <a href="https://twitter.com/hashtag/tidyverse?src=hash">#tidyverse</a> R and base <a href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a> have little in common. Beware when looking for job which requires knowledge of R.</p>— Yeedle N. (@Yeedle) <a href="https://twitter.com/Yeedle/status/837448170963668992">2 March 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> ] .w-50.bg-washed-red.b--red.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.ybox[Personal complaints]] - Still young and changing quickly, but see [lifecycle](https://lifecycle.r-lib.org/articles/stages.html) - Backward compatibility is not always maintained. 
- `tibbles` are nice, but the recent support for embedding `matrices` doesn't solve Bioconductor integration - `rownames` are still an issue, one must be careful not to lose them .large[.bbox[No need to oppose base R and the tidyverse] .center[Learning the _tidyverse_ does not prevent you from learning _base R_, it helps to get things done early in the process]] ]] --- # Community complaints .center[] .footnote[source: [SO, R chat room, 29 Nov 2017](https://chat.stackoverflow.com/rooms/106/r)] --- class: inverse, center, middle # 2021 developments --- # ragg, fonts working across platforms (Thomas Pedersen)
.pull-left[  ] .pull-right[  ] .footnote[Source: `ragg`, [blog post](https://www.tidyverse.org/blog/2021/02/modern-text-features/)] --- # How to use `ragg` .pull-left[ #### Use in RStudio  ] .pull-right[ #### With `knitr` ```r knitr::opts_chunk$set(dev = "ragg_png") ```  ] --- # Sliding windows (Davis Vaughan)
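A minimal sketch of how the window arguments of `slide_dbl()` behave at the edges of a vector. By default partial windows are evaluated; the `.complete = TRUE` argument restricts evaluation to full windows. The input `1:5` is only an illustration.

```r
library(slider)

# Window = current element plus one element before it.
# Default (.complete = FALSE): the first window is partial and still evaluated.
partial <- slide_dbl(1:5, mean, .before = 1)

# .complete = TRUE: partial windows are skipped and yield NA.
complete <- slide_dbl(1:5, mean, .before = 1, .complete = TRUE)

partial   # 1.0 1.5 2.5 3.5 4.5
complete  # NA  1.5 2.5 3.5 4.5
```
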
.pull-left[ #### [`slider`](https://davisvaughan.github.io/slider/) ```r library(slider) slide_dbl(1:5, ~ mean(.x), .before = 1, .after = 1) ``` ``` [1] 1.5 2.0 3.0 4.0 4.5 ``` - rolling mean ```r mutate(swiss, roll_agri = slide_mean(Agriculture, before = 7), .keep = "used") ``` ``` Agriculture roll_agri Courtelary 17.0 17.00000 Delemont 45.1 31.05000 Franches-Mnt 39.7 33.93333 Moutier 36.5 34.57500 Neuveville 43.5 36.36000 Porrentruy 35.3 36.18333 Broye 70.2 41.04286 Glane 67.8 44.38750 Gruyere 53.3 48.92500 Sarine 45.2 48.93750 Veveyse 64.5 52.03750 Aigle 62.0 55.22500 Aubonne 67.5 58.22500 Avenches 60.7 61.40000 Cossonay 69.3 61.28750 Echallens 72.6 61.88750 Grandson 34.0 59.47500 Lausanne 19.4 56.25000 La Vallee 15.2 50.08750 Lavaux 73.0 51.46250 Morges 59.8 50.50000 Moudon 55.1 49.80000 Nyone 50.9 47.50000 Orbe 54.1 45.18750 Oron 71.2 49.83750 Payerne 58.1 54.67500 Paysd'enhaut 63.5 60.71250 Rolle 60.8 59.18750 Vevey 26.8 55.06250 Yverdon 49.5 54.36250 Conthey 85.9 58.73750 Entremont 84.9 62.58750 Herens 89.7 64.90000 Martigwy 78.2 67.41250 Monthey 64.9 67.58750 St Maurice 75.9 69.47500 Sierre 84.6 76.70000 Sion 63.1 78.40000 Boudry 38.4 72.46250 La Chauxdfnd 7.7 62.81250 Le Locle 16.7 53.68750 Neuchatel 17.6 46.11250 Val de Ruz 37.6 42.70000 ValdeTravers 18.7 35.55000 V. 
De Geneve 1.2 25.12500 Rive Droite 46.6 23.06250 Rive Gauche 27.7 21.72500 ``` ] -- .pull-right[ #### [`clock`](https://clock.r-lib.org/) - Explicitly handle invalid dates - Explicitly handle daylight saving time issues - Expose naive types for representing date-times without a time zone - Provide calendar types for representing calendar “dates” in alternative ways - Provide variable precision date-time types ```r library(clock) date_seq(date_build(2019, 1), by = duration_months(2), total_size = 10) ``` ``` [1] "2019-01-01" "2019-03-01" "2019-05-01" "2019-07-01" "2019-09-01" [6] "2019-11-01" "2020-01-01" "2020-03-01" "2020-05-01" "2020-07-01" ``` ] --- # New functions in `dplyr`, `across` for filtering
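As a rough sketch of the semantics, `if_all()` behaves like combining the per-column tests with `&`, and `if_any()` like combining them with `|`. The tibble `toy` and its columns are hypothetical and only serve the illustration; `is_big()` is the same helper as on this slide. This assumes `dplyr` >= 1.0.4.

```r
library(dplyr)

# Hypothetical toy data, not taken from the slides
toy <- tibble(
  id          = 1:4,
  bill_length = c(10, 20, 30, 40),
  bill_depth  = c(4, 1, 3, 2)
)

# TRUE when a value is above its column mean
is_big <- function(x) {
  x > mean(x, na.rm = TRUE)
}

# if_all(): every selected column must pass the test
all_big        <- filter(toy, if_all(starts_with("bill"), is_big))
all_big_manual <- filter(toy, is_big(bill_length) & is_big(bill_depth))

# if_any(): at least one selected column must pass the test
any_big        <- filter(toy, if_any(starts_with("bill"), is_big))
any_big_manual <- filter(toy, is_big(bill_length) | is_big(bill_depth))

all_big$id  # 3
any_big$id  # 1 3 4
```
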
#### Following the success of `across()`, `if_any()` and `if_all()` were developed for filtering .pull-left[ ```r library(dplyr) library(palmerpenguins) is_big <- function(x) { x > mean(x, na.rm = TRUE) } # keep rows if all the selected columns are "big" filter(penguins, if_all(contains("bill"), is_big)) ``` ``` # A tibble: 61 × 8 species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 46 21.5 194 4200 2 Adelie Dream 44.1 19.7 196 4400 3 Adelie Torgersen 45.8 18.9 197 4150 4 Adelie Biscoe 45.6 20.3 191 4600 5 Adelie Torgersen 44.1 18 210 4000 6 Gentoo Biscoe 44.4 17.3 219 5250 7 Gentoo Biscoe 50.8 17.3 228 5600 8 Chinstrap Dream 46.5 17.9 192 3500 9 Chinstrap Dream 50 19.5 196 3900 10 Chinstrap Dream 51.3 19.2 193 3650 # … with 51 more rows, and 2 more variables: sex <fct>, year <int> ``` ] -- .pull-right[ ```r filter(penguins, if_any(contains("bill"), is_big)) ``` ``` # A tibble: 296 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen 36.7 19.3 193 3450 5 Adelie Torgersen 39.3 20.6 190 3650 6 Adelie Torgersen 38.9 17.8 181 3625 7 Adelie Torgersen 39.2 19.6 195 4675 8 Adelie Torgersen 34.1 18.1 193 3475 9 Adelie Torgersen 42 20.2 190 4250 10 Adelie Torgersen 37.8 17.3 180 3700 # … with 286 more rows, and 2 more variables: sex <fct>, year <int> ``` ] .footnote[Source: [blog post](https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/)] --- # Add-ons to `mutate`
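A small sketch comparing two of the `.keep` options of `mutate()` on an ungrouped table. The tibble `toy` and its columns are hypothetical; with no grouping keys, `"none"` is expected to return only the newly created column.

```r
library(dplyr)

# Hypothetical toy data, not taken from the slides
toy <- tibble(
  id     = 1:3,
  mass_g = c(3750, 3800, 3250),
  sex    = c("m", "f", "f")
)

# "unused": drop the input column (mass_g), keep the others plus the new one
unused <- mutate(toy, mass_kg = mass_g / 1000, .keep = "unused")

# "none": keep only the new column (and grouping keys, none here)
none <- mutate(toy, mass_kg = mass_g / 1000, .keep = "none")

names(unused)
names(none)
```
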
.large[ New experimental `.keep` argument: - "all", the default, retains all variables. - "used" retains only the variables used to make new variables, plus the new ones. - "unused" retains only the existing variables not used to make new variables, plus the new ones. - "none" retains only grouping keys and the new variables (like `transmute()`). ] .pull-left[ ```r group_by(penguins, species) %>% mutate(body_mass_kg = body_mass_g / 1000, .keep = "used") ``` ``` # A tibble: 344 × 3 # Groups: species [3] species body_mass_g body_mass_kg <fct> <int> <dbl> 1 Adelie 3750 3.75 2 Adelie 3800 3.8 3 Adelie 3250 3.25 4 Adelie NA NA 5 Adelie 3450 3.45 6 Adelie 3650 3.65 7 Adelie 3625 3.62 8 Adelie 4675 4.68 9 Adelie 3475 3.48 10 Adelie 4250 4.25 # … with 334 more rows ``` ] .pull-right[ ```r group_by(penguins, species) %>% mutate(body_mass_kg = body_mass_g / 1000, .keep = "none") ``` ``` # A tibble: 344 × 2 # Groups: species [3] species body_mass_kg <fct> <dbl> 1 Adelie 3.75 2 Adelie 3.8 3 Adelie 3.25 4 Adelie NA 5 Adelie 3.45 6 Adelie 3.65 7 Adelie 3.62 8 Adelie 4.68 9 Adelie 3.48 10 Adelie 4.25 # … with 334 more rows ``` ] --- # Add-ons to `pull`
.pull-left[ .large[ `pull()` can now output a **named** vector ] ```r as_tibble(swiss, rownames = "location") %>% pull(Agriculture, name = location) ``` ``` Courtelary Delemont Franches-Mnt Moutier Neuveville Porrentruy 17.0 45.1 39.7 36.5 43.5 35.3 Broye Glane Gruyere Sarine Veveyse Aigle 70.2 67.8 53.3 45.2 64.5 62.0 Aubonne Avenches Cossonay Echallens Grandson Lausanne 67.5 60.7 69.3 72.6 34.0 19.4 La Vallee Lavaux Morges Moudon Nyone Orbe 15.2 73.0 59.8 55.1 50.9 54.1 Oron Payerne Paysd'enhaut Rolle Vevey Yverdon 71.2 58.1 63.5 60.8 26.8 49.5 Conthey Entremont Herens Martigwy Monthey St Maurice 85.9 84.9 89.7 78.2 64.9 75.9 Sierre Sion Boudry La Chauxdfnd Le Locle Neuchatel 84.6 63.1 38.4 7.7 16.7 17.6 Val de Ruz ValdeTravers V. De Geneve Rive Droite Rive Gauche 37.6 18.7 1.2 46.6 27.7 ``` ] --- # Add-ons to `joins`
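A brief sketch of what `keep = TRUE` does when both tables share the same key name. It assumes the `band_instruments` dataset shipped with `dplyr` (keyed on `name`, unlike `band_instruments2`, which is keyed on `artist`). Since two columns cannot share a name, the keys are expected to come back with the usual `.x` / `.y` suffixes.

```r
library(dplyr)

# Both tables have a `name` column used as the join key
joined <- left_join(band_members, band_instruments, by = "name", keep = TRUE)

# Both copies of the key are retained, disambiguated by suffix
names(joined)
```
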
.large[ - New argument `keep` (TRUE/FALSE): should the join keys from both `x` and `y` be preserved in the output? ] .pull-left[ ```r band_members ``` ``` # A tibble: 3 × 2 name band <chr> <chr> 1 Mick Stones 2 John Beatles 3 Paul Beatles ``` ```r band_instruments2 ``` ``` # A tibble: 3 × 2 artist plays <chr> <chr> 1 John guitar 2 Paul bass 3 Keith guitar ``` ### Should we keep the `artist` column? ] -- .pull-right[ ```r left_join(band_members, band_instruments2, by = c("name" = "artist"), keep = TRUE) ``` ``` # A tibble: 3 × 4 name band artist plays <chr> <chr> <chr> <chr> 1 Mick Stones <NA> <NA> 2 John Beatles John guitar 3 Paul Beatles Paul bass ``` ```r left_join(band_members, band_instruments2, by = c("name" = "artist"), keep = FALSE) ``` ``` # A tibble: 3 × 3 name band plays <chr> <chr> <chr> 1 Mick Stones <NA> 2 John Beatles guitar 3 Paul Beatles bass ``` ] --- # `coalesce`, not new but tends to be forgotten
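One common use, sketched here with made-up values, is replacing missing entries by a single default: `coalesce()` recycles a length-one fallback across the whole vector.

```r
library(dplyr)

# Illustrative vector with a missing value
x <- c(1, NA, 3)

# A length-one fallback is recycled: every NA is replaced by 0
filled <- coalesce(x, 0)

filled  # 1 0 3
```
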
.pull-left[ ### Keep the first non-missing values ```r coalesce( c(1, NA, 3, NA), c(NA, 2, 5, 6) ) ``` ``` [1] 1 2 3 6 ``` ] .pull-right[ ### Also works on rectangular data ```r tribble( ~ a, ~ b, ~ c, "soil", NA, NA, NA, "tree", "Buch", NA, NA, "Birch") %>% mutate(col = coalesce(a, b, c)) ``` ``` # A tibble: 3 × 4 a b c col <chr> <chr> <chr> <chr> 1 soil <NA> <NA> soil 2 <NA> tree Buch tree 3 <NA> <NA> Birch Birch ``` ] --- # Before we stop .flex[ .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.ml1[ .large[.gbox[Wrap up:] - The RStudio team is growing every month - They maintain an incredible amount of code - They still keep it FOSS (MIT license) ]] .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.ml2[ .large[.bbox[Acknowledgments 🙏 👏] * [Alexandre Courtiol](https://github.com/courtiol) [lecture](https://courtiol.github.io/Rcourses/dplyr1.html#1) * [Romain François](https://github.com/romainfrancois) * [Lionel Henry](https://github.com/-lionel) * [Hadley Wickham](https://github.com/hadley) * [Jennifer Bryan](https://github.com/jennybc) * [Jim Hester](https://github.com/jimhester) ] ]] .flex[ .w-60.pv2.ph3.mt2.ml6[ .huge[.bbox[Thank you for your attention!]] ]] .footnote[Palmer penguins data are available under the CC-0 license; artwork by .bold[Allison Horst]]