class: title-slide
# Data import ## From flat files and excel files .center[<img src="img/00/logo_readr.png" width="100px"/> <img src="img/00/logo_readxl.png" width="100px"/>] ### Roland Krause | rworkshop | 2021-09-08 --- # Learning objectives .flex[ .w-70.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1.ml6[ .large[.gbox[You will learn to:] .float-img[
] - Create and use _Rstudio projects_ - Download a data file to a dedicated subfolder - Learn about `tibbles` - Use `readr` to import flat-files data into _R_ - Use the interactive RStudio interface to visualise your data import - Appreciate tibbles - Use `readxl` to import excel files into _R_ ] ] ] --- # Importing data .pull-left[ .bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.pb4.mt3.mr1[ .large[.gbox[Getting started]] - Represents probably the first step of your work - R can handle multiple data types + Flat files (`.csv`, `.tsv`, ...) + Excel files (`.xls`, `.xlsx`) + Foreign statistical formats + `.sas` from SAS + `.sav` from SPSS + `.dta` from Stata + databases (SQL, SQLite ...) ]] -- .pull-right[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1.pb3[ .large[.bbox[Tidyverse implementation]]
- .bold[R base] already provides functions for text files + `read.csv()` + `read.delim()` + ... - .bold[tidyverse] redefines these functions: + **speed** + **characters are not coerced to factors by default** (R base neither since `>=4.0`) + generates tibbles ]] --- # Tibbles .flex[ .w-70.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1.pb5[ .large[.gbox[Tibbles]
- Have a refined print method that shows only the first 10 rows. - Show all the columns that fit on screen and list the name of remaining ones. - Each column reports its type (`double, int, lgl, chr`) - Makes it much easier to work with large data. ] ] .w-30.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3[ .large[.bbox[Hint π‘]] Use `as_tibble()` to convert a `data.frame` to a tibble. ] ] --- class: hide_logo # `data.frame` vs `tibble` .pull-left[ .bbox[Data frame] ```r swiss ``` ``` Fertility Agriculture Examination Education Catholic Courtelary 80.2 17.0 15 12 9.96 Delemont 83.1 45.1 6 9 84.84 Franches-Mnt 92.5 39.7 5 5 93.40 Moutier 85.8 36.5 12 7 33.77 Neuveville 76.9 43.5 17 15 5.16 Porrentruy 76.1 35.3 9 7 90.57 Broye 83.8 70.2 16 7 92.85 Glane 92.4 67.8 14 8 97.16 Gruyere 82.4 53.3 12 7 97.67 Sarine 82.9 45.2 16 13 91.38 Veveyse 87.1 64.5 14 6 98.61 Aigle 64.1 62.0 21 12 8.52 Aubonne 66.9 67.5 14 7 2.27 Avenches 68.9 60.7 19 12 4.43 Cossonay 61.7 69.3 22 5 2.82 Echallens 68.3 72.6 18 2 24.20 Grandson 71.7 34.0 17 8 3.30 Lausanne 55.7 19.4 26 28 12.11 La Vallee 54.3 15.2 31 20 2.15 Lavaux 65.1 73.0 19 9 2.84 Morges 65.5 59.8 22 10 5.23 Moudon 65.0 55.1 14 3 4.52 Nyone 56.6 50.9 22 12 15.14 Orbe 57.4 54.1 20 6 4.20 Oron 72.5 71.2 12 1 2.40 Payerne 74.2 58.1 14 8 5.23 Paysd'enhaut 72.0 63.5 6 3 2.56 Rolle 60.5 60.8 16 10 7.72 Vevey 58.3 26.8 25 19 18.46 Yverdon 65.4 49.5 15 8 6.10 Conthey 75.5 85.9 3 2 99.71 Entremont 69.3 84.9 7 6 99.68 Herens 77.3 89.7 5 2 100.00 Martigwy 70.5 78.2 12 6 98.96 Monthey 79.4 64.9 7 3 98.22 St Maurice 65.0 75.9 9 9 99.06 Sierre 92.2 84.6 3 3 99.46 Sion 79.3 63.1 13 13 96.83 Boudry 70.4 38.4 26 12 5.62 La Chauxdfnd 65.7 7.7 29 11 13.79 Le Locle 72.7 16.7 22 13 11.22 Neuchatel 64.4 17.6 35 32 16.92 Val de Ruz 77.6 37.6 15 7 4.97 ValdeTravers 67.6 18.7 25 7 8.65 V. De Geneve 35.0 1.2 37 53 42.34 Rive Droite 44.7 46.6 16 29 50.43 Rive Gauche 42.8 27.7 22 29 58.33 Infant.Mortality Courtelary 22.2 Delemont 22.2 Franches-Mnt 20.2 Moutier 20.3 Neuveville 20.6 Porrentruy 26.6 Broye 23.6 Glane 24.9 Gruyere 21.0 Sarine 24.4 Veveyse 24.5 Aigle 16.5 Aubonne 19.1 Avenches 22.7 Cossonay 18.7 Echallens 21.2 Grandson 20.0 Lausanne 20.2 La Vallee 10.8 Lavaux 20.0 Morges 18.0 Moudon 22.4 Nyone 16.7 Orbe 15.3 Oron 21.0 Payerne 23.8 Paysd'enhaut 18.0 Rolle 16.3 Vevey 20.9 Yverdon 22.5 Conthey 15.1 Entremont 19.8 Herens 18.3 Martigwy 19.4 Monthey 20.2 St Maurice 17.8 Sierre 16.3 Sion 18.1 Boudry 20.3 La Chauxdfnd 20.5 Le Locle 18.9 Neuchatel 23.0 Val de Ruz 20.0 ValdeTravers 19.5 V. De Geneve 18.0 Rive Droite 18.2 Rive Gauche 19.3 ``` ] -- .pull-right[ .bbox[Tibble] ```r as_tibble(swiss) ``` ``` # A tibble: 47 Γ 6 Fertility Agriculture Examination Education Catholic Infant.Mortality <dbl> <dbl> <int> <int> <dbl> <dbl> 1 80.2 17 15 12 9.96 22.2 2 83.1 45.1 6 9 84.8 22.2 3 92.5 39.7 5 5 93.4 20.2 4 85.8 36.5 12 7 33.8 20.3 5 76.9 43.5 17 15 5.16 20.6 6 76.1 35.3 9 7 90.6 26.6 7 83.8 70.2 16 7 92.8 23.6 8 92.4 67.8 14 8 97.2 24.9 9 82.4 53.3 12 7 97.7 21 10 82.9 45.2 16 13 91.4 24.4 # β¦ with 37 more rows ``` ] --- # Tibbles .flex[ .w-70.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.ml5[ `tibbles` never use `rownames`. They are lost unless you assign them to a dedicated column ```r # library(tibble) as_tibble(swiss, rownames = "Province") ``` ``` # A tibble: 47 Γ 7 Province Fertility Agriculture Examination Education Catholic <chr> <dbl> <dbl> <int> <int> <dbl> 1 Courtelary 80.2 17 15 12 9.96 2 Delemont 83.1 45.1 6 9 84.8 3 Franches-Mnt 92.5 39.7 5 5 93.4 4 Moutier 85.8 36.5 12 7 33.8 5 Neuveville 76.9 43.5 17 15 5.16 6 Porrentruy 76.1 35.3 9 7 90.6 7 Broye 83.8 70.2 16 7 92.8 8 Glane 92.4 67.8 14 8 97.2 9 Gruyere 82.4 53.3 12 7 97.7 10 Sarine 82.9 45.2 16 13 91.4 # β¦ with 37 more rows, and 1 more variable: Infant.Mortality <dbl> ``` ] ] .flex[ .w-70.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt1.ml5[ .large[.bbox[Other enhancements]] - no characters to factors (R base neither since `>=4.0`) - colnames can be repaired, `"minimal", "unique", "universal", "check_unique"`, see [details](https://vctrs.r-lib.org/reference/vec_as_names.html) ] ] --- # The tidyverse packages to import your data .flex[ .w-70.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.mr1[ .large[.ybox[`readr`]] - `readr` package: + `read_csv()`: comma separated (`,`) + `read_csv2()`: separated (`;`) + `read_tsv()`: tab separated + `read_delim()`: general delimited files, auto-guesses delimiter + `read_fwf()`: fixed width files + `read_table()`: columns separated by white-space(s) ] .w-70.bg-washed-yellow.b--green.ba.bw2.br3.shadow-5.ph3.mt2.mr1[ .large[.ybox[`readxl`]] To import excel files (`.xls` and `.xlsx`): - `read_excel()` + `read_xls()` + `read_xlsx()` .right[] ] .w-70.bg-washed-blue.b--red.ba.bw2.br3.shadow-5.ph3.mt2.mr1[ .large[.rbox[haven]] - `read_sas()` for SAS - `read_sav()` for SPSS - `read_dta()` for Stata .right[] ] ] --- # Working directory set up with **RStudio projects** .pull-left[ .flex[ .w-100.bg-washed-red.b--red.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .red.bold[Avoid using setwd() + absolute file paths] .bold[Whatβs wrong with setwd()?] [Jenny Bryanβs great blog post on project based workflows](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/) - _setwd()_ set the working directory to a specific folder through an absolute file path ```r setwd("C:/Users/veronica.codoni/Projects/Survival_Analysis") ``` - What if this folder moved to /Downloads, or onto another machine? ```r setwd("C:/Users/veronica.codoni/Projects/Survival_Analysis") ``` ``` Error in setwd("C:/Users/veronica.codoni/Projects/Survival_Analysis"): cannot change working directory ``` This approach is **not self-contained** and **portable**!! ] ]] -- .pull-right[ .flex[ .w-100.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[**RStudio projects as alternative π‘**] Start a new research project/data analysis by creating a new _Rstudio Project_ - it helps to keep all the files together - it sets **automatically** the working directory to the project directory - it **helps resolve file path issues** by using **relative** paths Create a Project in Rstudio assures the project directory to be **stand-alone** and **portable**. ] ]] --- # How to create **RStudio projects** <style type="text/css"> #jbadv ul li ul li { line-height: 0.5em; } </style> .flex[ .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ Use the **Project** button in the top right corner  - Name the project properly (keeping it short, no spaces) - Once the project has been created, a `.Rproj` extension file is generated. This allows for automatic working directory set-up. ] .w-50.bg-washed-blue.b--blue.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.gbox[Exercise]] - Create new project called `Rworkshop` + in a new folder + no `git` (even if you _should_) - Create sub-folders `data` (`analysis`, `plots` ...) ] ] --- class: inverse, middle, center # Importing flat files <img src="img/00/logo_readr.png" width="150px"/> --- class: slide-practical ## Reading flat files .w-70.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.ml5[ .large[.gbox[Flat file example: _swiss.csv_]] - Use the new RStudio project (finding your files will be easier) - Download the [`swiss.csv`](https://biostat2.uni.lu/practicals/data/swiss.csv) file to your project folder, in the sub-folder `data` - Open the file with a text viewer and have a look at its content - Does the delimiter fit the file extension? ``` "Fertility","Agriculture","Examination","Education","Catholic","Infant.Mortality" 80.2,17,15,12,9.96,22.2 83.1,45.1,6,9,84.84,22.2 92.5,39.7,5,5,93.4,20.2 85.8,36.5,12,7,33.77,20.3 76.9,43.5,17,15,5.16,20.6 76.1,35.3,9,7,90.57,26.6 83.8,70.2,16,7,92.85,23.6 92.4,67.8,14,8,97.16,24.9 82.4,53.3,12,7,97.67,21 ... ``` ] --- class: slide-practical # Exercise, download a csv file in a project sub-folder ### 1. Download `swiss.csv` in `data` .pull-left[ .large[ In your browser: - Right-click on the .bold[link] - Choose .bold[Save Link As] ]  ] -- .pull-right[ .large[ - You .red.bold[MUST] have created a RStudio project - Use Finder/Explorer to select this project folder - If not present .bold[New Folder] called `data` - Go inside this sub-folder `data` - Hit .bold[Save] ]  ] --- class: slide-practical # Exercise, download a csv file in a project sub-folder ### 2. Check if file is present and preview it .pull-left[ .Large[In RStudio:] .arge[ - Use file panel - Go in `data` and check if `swiss.csv` is present - Right-click, menu should be: ]  ] -- .pull-right[ .large[ - If `swiss.csv` is present in the `data` folder - You can .bold[View File] to see its content - Cheat and .bold[Import Dataset] it - However, we want to import it programmatically + Ensures reproducibility + Enables sharing with others ] ] --- class: slide-practical, hide_logo
# Exercise, download a csv file in a project sub-folder ### 3. Import the file with `readr::read_delim()` .pull-left[ .large[The file says ".csv" but could be wrong] ```r # library(readr) or # library(tidyverse) read_delim("data/swiss.csv") ``` ``` Rows: 47 Columns: 6 ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββ Delimiter: "," dbl (6): Fertility, Agriculture, Examination, Education, Catholi... βΉ Use `spec()` to retrieve the full column specification for this data. βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message. # A tibble: 47 x 6 Fertility Agriculture Examination <dbl> <dbl> <dbl> 1 80.2 17 15 2 83.1 45.1 6 3 92.5 39.7 5 4 85.8 36.5 12 5 76.9 43.5 17 6 76.1 35.3 9 7 83.8 70.2 16 8 92.4 67.8 14 9 82.4 53.3 12 10 82.9 45.2 16 # β¦ with 37 more rows, and 3 more # variables: Education <dbl>, # Catholic <dbl>, Infant.Mortality <dbl> ``` ] -- .pull-right[ #### What do we observe with `readr::read_delim()`? - Guesses automatically the field separator: .bold[","] - Reports the dimensions of the table read - Also guessed the column types (all `double` here) - Returns a `tibble`by default .large.red[Watch out!] .large[we didn't assign the tibble to a name!] .large[To do so:] ```r swiss_db <- read_delim("data/swiss.csv") ``` ] --- # [readr 2.0](https://readr.tidyverse.org/) engine is now [`vroom`](https://vroom.r-lib.org/) .pull-left[ #### `readr` is the official import package of the tidyverse - [Jim Hester](https://github.com/jimhester) also developed `vroom`. All of the functionality are included in `readr` version 2.0. - Great features + Column type guessing + Reporting parsing problems + Progress bar + Reading from urls + Reading compressed files - Amazing features in `readr` with version 2.0 (released July 2021): + Delimiter guessing + **FAST!** + multi-threaded + use [ALTREP](https://svn.r-project.org/R/branches/ALTREP/ALTREP.html) + Column selection + Merge multiple files into one `tibble` ] -- .pull-right[  .footnote[source: `vroom` [benchmarks](https://vroom.r-lib.org/articles/benchmarks.html)] ] --- class: hide_logo # Catch and report reading issues .pull-left[ #### Toy example with 2 created issues ```r read_delim("data/swiss2.csv") ``` ``` Rows: 5 Columns: 5 ``` ``` ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ Delimiter: "," chr (1): fer dbl (4): agri, exa, edu, cath ``` ``` βΉ Use `spec()` to retrieve the full column specification for this data. βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ``` # A tibble: 5 Γ 5 fer agri exa edu cath <chr> <dbl> <dbl> <dbl> <dbl> 1 80.2 17 15 12 NA 2 a 45.1 6 9 84.8 3 92.5 39.7 5 5 NA 4 85.8 36.5 12 7 33.8 5 76.9 43.5 17 15 5.16 ``` - two missing data - `fer` is no longer of type `double` ] -- .pull-right[ #### Specifying column types `readr` record issues, `problems()` to see them ```r sw2 <- readr::read_delim("data/swiss2.csv", col_types = list(fer = col_double(), agri = col_double(), exa = col_integer(), edu = col_integer(), cath = col_double())) sw2 ``` ``` # A tibble: 5 Γ 5 fer agri exa edu cath <dbl> <dbl> <int> <int> <dbl> 1 80.2 17 15 12 NA 2 NA 45.1 6 9 84.8 3 92.5 39.7 5 5 NA 4 85.8 36.5 12 7 33.8 5 76.9 43.5 17 15 5.16 ``` ```r problems(sw2) ``` ``` # A tibble: 2 Γ 5 row col expected actual file <int> <int> <chr> <chr> <chr> 1 2 1 a double a /builds/r-training/rworkshop/site/lectures/data/sβ¦ 2 3 1 a double a /builds/r-training/rworkshop/site/lectures/data/sβ¦ ``` ] --- # Choosing which columns to read .pull-left[ #### Lighter column type definitions here, we want to skip the wrong columns: `fer` and `cath`. We let `read_delim` guessing the `agri` type (`?`), see all correspondences [here](https://vroom.r-lib.org/articles/vroom.html#column-types). ```r readr::read_delim("data/swiss2.csv", col_types = list(fer="_", agri = "?", exa = "i", edu = "i", cath = "_")) ``` ``` # A tibble: 5 x 3 agri exa edu <dbl> <int> <int> 1 17 15 12 2 45.1 6 9 3 39.7 5 5 4 36.5 12 7 5 43.5 17 15 ``` ] -- .pull-right[ #### Columns selection (support `tidyselect` syntax) ```r readr::read_delim("data/swiss2.csv", col_select = c(agri, edu)) ``` ``` # A tibble: 5 x 2 agri edu <dbl> <dbl> 1 17 12 2 45.1 9 3 39.7 5 4 36.5 7 5 43.5 15 ``` ] --- class: hide_logo # Column selection with [`tidyselect`](https://tidyselect.r-lib.org/) .bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt1.mr1.large[ This package is the backend for programatic column selection in the `tidyverse`. Features are described in [this article](https://tidyselect.r-lib.org/reference/language.html). ] -- .pull-left[ .bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt1.mr1[ .large[.bbox[Operators] - `[c()]` for .bold[combining] bare names - `[:]` for selecting a .bold[range] - `[-]` for .bold[negating] a selection .bbox[Helpers] - `everything()` .bold[all] columns - `last_col()` .bold[last] column .bbox[Using patterns] - `starts_with()` with quoted prefix - `ends_with()` with quoted suffix - `contains()` with quoted string ] ] ] -- .pull-right[ ```r readr::read_delim("data/swiss2.csv", show_col_types = FALSE, col_select = starts_with("e"), n_max = 1) ``` ``` # A tibble: 1 x 2 exa edu <dbl> <dbl> 1 15 12 ``` ```r readr::read_delim("data/swiss2.csv", show_col_types = FALSE, col_select = contains("e"), n_max = 1) ``` ``` # A tibble: 1 x 3 fer exa edu <dbl> <dbl> <dbl> 1 80.2 15 12 ``` ```r readr::read_delim("data/swiss2.csv", show_col_types = FALSE, col_select = fer:edu, n_max = 1) ``` ``` # A tibble: 1 x 4 fer agri exa edu <dbl> <dbl> <dbl> <dbl> 1 80.2 17 15 12 ``` ] --- # Header, what to do when absent .pull-left[ #### Example - Using the URL https://basv53.uni.lu/lectures/data/example.csv - This toy data contains 3 columns .arge[Content is:] ``` dog,red,1 cat,blue,2 chicken,green,6 ``` ] -- .pull-right[ #### Naive approach ```r exa <- "https://basv53.uni.lu/lectures/data/example.csv" readr::read_delim(exa) ``` ``` Rows: 2 Columns: 3 ``` ``` ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ Delimiter: "," chr (2): dog, red dbl (1): 1 ``` ``` βΉ Use `spec()` to retrieve the full column specification for this data. βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ``` # A tibble: 2 Γ 3 dog red `1` <chr> <chr> <dbl> 1 cat blue 2 2 chicken green 6 ``` .center[.bold.large[Satisfying?]] ] --- # Provide the header .pull-left[ #### Read all lines - Colnames are self-created - `show_col_types = FALSE` suppresses the verbose mode ```r read_delim(exa, col_names = FALSE, show_col_types = FALSE) ``` ``` # A tibble: 3 Γ 3 X1 X2 X3 <chr> <chr> <dbl> 1 dog red 1 2 cat blue 2 3 chicken green 6 ``` .center[.bold.large[Satisfying?]] ] -- .pull-right[ #### Supply the colnames - Remember we are still reading directly from the url in the `exa` variable ```r read_delim(exa, col_names = c("animal", "color", "value"), show_col_types = FALSE) ``` ``` # A tibble: 3 Γ 3 animal color value <chr> <chr> <dbl> 1 dog red 1 2 cat blue 2 3 chicken green 6 ``` .center[.bold.large[Better]] ] --- class: slide-practical, hide_logo # Exercise .flex[ .w-30.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.gbox[Override the detected column types] Import `example.csv` using the [URL](https://basv53.uni.lu/lectures/data/example.csv) **but** + **Skip** the `colour` column + Read in the `value` column as **integer** ] ] .w-70.bg-washed-yellow.b--yellow.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[Hints]] Column types are specified using `col_types = list()` Function | Short | Description ----|-----|----- `col_logical()` | .large.bold[l] | TRUE/FALSE, 1/0 `col_integer()` | .large.bold[i] | integers `col_double()` | .large.bold[d] | floating point values. `col_number()` | .large.bold[n] | numbers containing the grouping_mark `col_date(format = "")` | .large.bold[D] | with the locale's date_format `col_time(format = "")`| .large.bold[t] | with the locale's time_format `col_datetime(format = "")`| .large.bold[T] | ISO8601 date times `col_factor(levels, ordered)` |.large.bold[f] | a fixed set of values `col_character()`| .large.bold[c] |everything else `col_skip()` |.large.bold[_] | .large.bold[-], don't import this column `col_guess()` | .large.bold[?] | parse using the βbestβ type based on the input ] ] --- # Solution .pull-left[ #### Using one-letter shorcuts ```r readr::read_delim(exa, col_names = c("animal", "color", "value"), col_types = list( animal = "c", color = "_", value = "i" )) ``` ``` # A tibble: 3 Γ 2 animal value <chr> <int> 1 dog 1 2 cat 2 3 chicken 6 ``` ] -- .pull-right[ #### Even more compact ```r readr::read_delim(exa, col_names = c("animal", "color", "value"), col_types = "c_i") ``` ``` # A tibble: 3 Γ 2 animal value <chr> <int> 1 dog 1 2 cat 2 3 chicken 6 ``` - Use the single character code in the column order ] --- # Skipping lines .flex[ .w-70.bg-washed-blue.b--blue.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .bbox[Our example] ``` # A tibble: 3 Γ 2 animal value <chr> <int> 1 dog 1 2 cat 2 3 chicken 6 ``` ] .w-70.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .bbox[Comment] Do not read lines beginning with a character. ```r readr::read_delim( exa, * comment = "d", col_names = FALSE, show_col_types = FALSE) ``` ``` # A tibble: 2 Γ 3 X1 X2 X3 <chr> <chr> <dbl> 1 cat blue 2 2 chicken green 6 ``` ] .w-70.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .bbox[Skip lines] ```r readr::read_delim( exa, * skip = 1, col_names = FALSE, show_col_types = FALSE) ``` ``` # A tibble: 2 Γ 3 X1 X2 X3 <chr> <chr> <dbl> 1 cat blue 2 2 chicken green 6 ``` ] .w-70.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .bbox[Limited number of lines] ```r readr::read_delim( exa, skip = 1, * n_max = 1, col_names = FALSE, show_col_types = FALSE ) ``` ``` # A tibble: 1 Γ 3 X1 X2 X3 <chr> <chr> <dbl> 1 cat blue 2 ``` ] ] --- class: center # Alternatives .w-50.bg-washed-yellow.b--yellow.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large.left[.bbox[Need for (more) speed ] Check `fread()` from [`data.table`](https://github.com/Rdatatable/data.table/wiki)
- Stable - No dependencies - Overhauled - Could feed from `bash` command - Could be faster than `readr` functions ] ] --- class: inverse, middle, center # Importing Excel files <img src="img/00/logo_readxl.png" width="150px"/> ---
# Reading Excel files #### **readxl** package is part of tidyverse and makes it easy to import tabular data from Excel spreadsheets with several options. More details are described in this [pkgdown website](https://readxl.tidyverse.org/) .pull-left[ - `readxl` author: [Hadley Wickham](https://github.com/hadley) and [Jenny Bryan](https://github.com/jennybc) - `read_excel()` reads both `xls` and `xlsx` files and detects the format from the extension. Otherwise: + `read_xls()` + `read_xlsx()` - Return tibbles - Column type guessing - Discovers the minimal data rectangle and returns that, by default - Exert more control with `range`, `skip`, and `n_max` - `excel_sheets()` returns the sheet names - Combine multiple sheets, jointly with `purrr::map()` - No external library dependencies, _e.g._, Java or Perl ] --- class: hide_logo # Before we stop .flex[ .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.ml1[ .large[.gbox[You learned to:] - Appreciate the tibble features - Learn set-up working directory with RStudio `projects` - Learn the `tidyselect` syntax - Use `readr::read_delim` to import your flat files - Adjust the imported data types - Use `readxl` to import excel files ] .ybox[Further reading] - [R for Data Science](http://r4ds.had.co.nz/data-import.html) ] .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt2.ml2[ .large[.bbox[Acknowledgments π π] Development * Eric Koncina (initial installment) * Veronica Codoni (swiss data set) * AureliΓ©n Ginolhac (`vroom` development) * Roland Krause (`readr` 2.0 update) Input and inspiration * Jim Hester (`vroom`, `readr` development) * Jenny Bryan (advice with project organization) * Hadley Wickham * Nicholas Tierney ] ] ] .w-50.pv2.ph3.mt1.ml1[ .huge[.bbox[Thank you for your attention!]] ]