Import Data

Loading with readr and readxl

Roland Krause

Rworkshop

Tuesday, 6 February 2024

Introduction

Learning objectives

You will learn to:

  • Download a data file to a dedicated subfolder in your project
  • Learn about tibbles
  • Use readr to import flat-files data into R
  • Use the interactive RStudio interface to visualise your data import
  • Appreciate tibbles
  • Use readxl to import excel files into R

Importing data

Getting started

  • Represents the first step of your work

  • R base already provides functions for text files

  • read.csv(), read.delim(), …

  • But what about

    • Excel files (.xls, .xlsx)
    • Foreign statistical formats from SAS, SPSS or Stata
    • databases (SQL, SQLite …)

Tidyverse implementation

  • More speed
  • improved handling of column types generates more consistent data structures
  • generates tibbles

Tibbles

  • Have a refined print method that shows only the first 10 rows.
  • Show all the columns that fit on screen and list the name of remaining ones.
  • Each column reports its type (double, int, lgl, chr)
  • Makes it much easier to work with large data.

data.frame vs tibble

Data frame

swiss
             Fertility Agriculture Examination
Courtelary        80.2        17.0          15
Delemont          83.1        45.1           6
Franches-Mnt      92.5        39.7           5
Moutier           85.8        36.5          12
Neuveville        76.9        43.5          17
Porrentruy        76.1        35.3           9
Broye             83.8        70.2          16
Glane             92.4        67.8          14
Gruyere           82.4        53.3          12
Sarine            82.9        45.2          16
Veveyse           87.1        64.5          14
Aigle             64.1        62.0          21
Aubonne           66.9        67.5          14
Avenches          68.9        60.7          19
Cossonay          61.7        69.3          22
Echallens         68.3        72.6          18
Grandson          71.7        34.0          17
Lausanne          55.7        19.4          26
La Vallee         54.3        15.2          31
Lavaux            65.1        73.0          19
Morges            65.5        59.8          22
Moudon            65.0        55.1          14
Nyone             56.6        50.9          22
Orbe              57.4        54.1          20
Oron              72.5        71.2          12
Payerne           74.2        58.1          14
Paysd'enhaut      72.0        63.5           6
Rolle             60.5        60.8          16
Vevey             58.3        26.8          25
Yverdon           65.4        49.5          15
Conthey           75.5        85.9           3
Entremont         69.3        84.9           7
Herens            77.3        89.7           5
Martigwy          70.5        78.2          12
Monthey           79.4        64.9           7
St Maurice        65.0        75.9           9
Sierre            92.2        84.6           3
Sion              79.3        63.1          13
Boudry            70.4        38.4          26
La Chauxdfnd      65.7         7.7          29
Le Locle          72.7        16.7          22
Neuchatel         64.4        17.6          35
Val de Ruz        77.6        37.6          15
ValdeTravers      67.6        18.7          25
V. De Geneve      35.0         1.2          37
Rive Droite       44.7        46.6          16
Rive Gauche       42.8        27.7          22
             Education Catholic Infant.Mortality
Courtelary          12     9.96             22.2
Delemont             9    84.84             22.2
Franches-Mnt         5    93.40             20.2
Moutier              7    33.77             20.3
Neuveville          15     5.16             20.6
Porrentruy           7    90.57             26.6
Broye                7    92.85             23.6
Glane                8    97.16             24.9
Gruyere              7    97.67             21.0
Sarine              13    91.38             24.4
Veveyse              6    98.61             24.5
Aigle               12     8.52             16.5
Aubonne              7     2.27             19.1
Avenches            12     4.43             22.7
Cossonay             5     2.82             18.7
Echallens            2    24.20             21.2
Grandson             8     3.30             20.0
Lausanne            28    12.11             20.2
La Vallee           20     2.15             10.8
Lavaux               9     2.84             20.0
Morges              10     5.23             18.0
Moudon               3     4.52             22.4
Nyone               12    15.14             16.7
Orbe                 6     4.20             15.3
Oron                 1     2.40             21.0
Payerne              8     5.23             23.8
Paysd'enhaut         3     2.56             18.0
Rolle               10     7.72             16.3
Vevey               19    18.46             20.9
Yverdon              8     6.10             22.5
Conthey              2    99.71             15.1
Entremont            6    99.68             19.8
Herens               2   100.00             18.3
Martigwy             6    98.96             19.4
Monthey              3    98.22             20.2
St Maurice           9    99.06             17.8
Sierre               3    99.46             16.3
Sion                13    96.83             18.1
Boudry              12     5.62             20.3
La Chauxdfnd        11    13.79             20.5
Le Locle            13    11.22             18.9
Neuchatel           32    16.92             23.0
Val de Ruz           7     4.97             20.0
ValdeTravers         7     8.65             19.5
V. De Geneve        53    42.34             18.0
Rive Droite         29    50.43             18.2
Rive Gauche         29    58.33             19.3

Tibble

as_tibble(swiss)
# A tibble: 47 × 6
   Fertility Agriculture Examination Education
       <dbl>       <dbl>       <int>     <int>
 1      80.2        17            15        12
 2      83.1        45.1           6         9
 3      92.5        39.7           5         5
 4      85.8        36.5          12         7
 5      76.9        43.5          17        15
 6      76.1        35.3           9         7
 7      83.8        70.2          16         7
 8      92.4        67.8          14         8
 9      82.4        53.3          12         7
10      82.9        45.2          16        13
# ℹ 37 more rows
# ℹ 2 more variables: Catholic <dbl>,
#   Infant.Mortality <dbl>

Rownames in tibbles

Rownames

Base data frames have rownames, which a special type of column that need special treatments.

tibbles never use rownames!

When converting regular a data.frame they are lost unless you assign them to a dedicated column.

# library(tibble)
as_tibble(swiss, rownames = "Province")
# A tibble: 47 × 7
   Province     Fertility Agriculture Examination
   <chr>            <dbl>       <dbl>       <int>
 1 Courtelary        80.2        17            15
 2 Delemont          83.1        45.1           6
 3 Franches-Mnt      92.5        39.7           5
 4 Moutier           85.8        36.5          12
 5 Neuveville        76.9        43.5          17
 6 Porrentruy        76.1        35.3           9
 7 Broye             83.8        70.2          16
 8 Glane             92.4        67.8          14
 9 Gruyere           82.4        53.3          12
10 Sarine            82.9        45.2          16
# ℹ 37 more rows
# ℹ 3 more variables: Education <int>,
#   Catholic <dbl>, Infant.Mortality <dbl>

Importing data using readr function such as read_csv() will not create them.

Importing text files

Follow along: Downloading files

Flat file example: swiss.csv

  • Use the RStudio project – finding your files will be easier.
  • Download the swiss.csv file to your project folder, in the sub-folder data
  • Open the file with a text viewer and have a look at its content
  • Does the delimiter fit the file extension?
download.file("https://biostat2.uni.lu/practicals/data/swiss.csv", "data/swiss.csv")
readr::read_delim("data/swiss.csv")
# A tibble: 47 × 6
   Fertility Agriculture Examination Education
       <dbl>       <dbl>       <dbl>     <dbl>
 1      80.2        17            15        12
 2      83.1        45.1           6         9
 3      92.5        39.7           5         5
 4      85.8        36.5          12         7
 5      76.9        43.5          17        15
 6      76.1        35.3           9         7
 7      83.8        70.2          16         7
 8      92.4        67.8          14         8
 9      82.4        53.3          12         7
10      82.9        45.2          16        13
# ℹ 37 more rows
# ℹ 2 more variables: Catholic <dbl>,
#   Infant.Mortality <dbl>

Manual download

In your browser:

  • Right-click on the link
  • Choose Save Link As

  • Use Finder/Explorer to select this project folder
  • Go inside this sub-folder data
  • Hit Save

The readr 2.0 engine is fast

readr is the tidyverse import package

  • Column type guessing
  • Reporting parsing problems
  • Progress bar
  • Reading from URLs
  • Reading compressed files

Great features in readr with version 2.0

Jim Hester developed vroom. All of its functionality are included in the current readr (v2.1.2) No need to use vroom.

  • Delimiter guessing
  • FAST!
    • multi-threaded
    • use ALTREP
    • Sadly disabled by default due to Windows issues
  • Column selection
  • Merge multiple files into one tibble

Source: vroom benchmarks

Catch and report reading issues

Toy example with two issues

read_delim("data/swiss_ray.csv")
# A tibble: 5 × 6
  fer    agri   exa   edu  cath xray 
  <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
1 80.2   17      15    12 NA    NA   
2 a      45.1     6     9 84.8  NA   
3 92.5   39.7     5     5 NA    NA   
4 85.8   36.5    12     7 33.8  NA   
5 76.9   43.5    17    15  5.16 NA   
  • two columns with missing data
  • fer is no longer of type double
  • empty ray column imported as logical!

Specifying column types with cols()

readr records issues. Use problems() to see them

sw2 <- readr::read_delim("data/swiss2.csv",
             col_types = cols(fer = col_double(),
                              agri = col_double(),
                              exa = col_integer(),
                              edu = col_integer(),
                              cath = col_double()))

sw2
# A tibble: 5 × 5
    fer  agri   exa   edu  cath
  <dbl> <dbl> <int> <int> <dbl>
1  80.2  17      15    12 NA   
2  NA    45.1     6     9 84.8 
3  92.5  39.7     5     5 NA   
4  85.8  36.5    12     7 33.8 
5  76.9  43.5    17    15  5.16
problems(sw2)
# A tibble: 1 × 5
    row   col expected actual file                
  <int> <int> <chr>    <chr>  <chr>               
1     3     1 a double a      /builds/r-training/…

Choosing which columns to read

Lighter column type definitions

here, we want to skip the wrong columns: fer and cath.

We let read_delim guessing the agri type (?), see all correspondences here.

readr::read_delim("data/swiss2.csv",
      col_types = cols(fer="_", 
                       agri = "?",
                       exa = "i",
                       edu = "i",
                       cath = "_",
                       ray = "_"))
# A tibble: 5 × 3
   agri   exa   edu
  <dbl> <int> <int>
1  17      15    12
2  45.1     6     9
3  39.7     5     5
4  36.5    12     7
5  43.5    17    15

Columns selection

Supported by tidyselect syntax.

readr::read_delim("data/swiss_ray.csv",
      col_select = c(agri, edu))
# A tibble: 5 × 2
   agri   edu
  <dbl> <dbl>
1  17      12
2  45.1     9
3  39.7     5
4  36.5     7
5  43.5    15

Column selection with tidyselect

This package is the backend for programatic column selection in the tidyverse. Features are described in this article.

Operators and helper functions

Expression Description
[c()] Combining bare names
[:] Selecting a range
[-] Negating a selection
everything() All columns
last_col() Last column
starts_with() Quoted prefix
ends_with() Quoted suffix
contains() Quoted string
readr::read_delim("data/swiss2.csv", show_col_types = FALSE,
      col_select = starts_with("e"), n_max = 1)
# A tibble: 1 × 2
    exa   edu
  <dbl> <dbl>
1    15    12
readr::read_delim("data/swiss2.csv", show_col_types = FALSE,
      col_select = contains("e"), n_max = 1)
# A tibble: 1 × 3
    fer   exa   edu
  <dbl> <dbl> <dbl>
1  80.2    15    12
readr::read_delim("data/swiss2.csv", show_col_types = FALSE,
      col_select = fer:edu, n_max = 1)
# A tibble: 1 × 4
    fer  agri   exa   edu
  <dbl> <dbl> <dbl> <dbl>
1  80.2    17    15    12

Reading data with no column header

Example

The animal data contains three columns and three rows.

dog,red,1
cat,blue,2
chicken,green,6

Naive approach

animal_file <- "data/animals.csv"
readr::read_delim(animal_file)
# A tibble: 2 × 3
  dog     red     `1`
  <chr>   <chr> <dbl>
1 cat     blue      2
2 chicken green     6

Satisfying?

Read all lines

read_delim(animal_file, col_names = FALSE, show_col_types = FALSE)
# A tibble: 3 × 3
  X1      X2       X3
  <chr>   <chr> <dbl>
1 dog     red       1
2 cat     blue      2
3 chicken green     6

Satisfying?

  • Colnames are self-created.
  • show_col_types = FALSE suppresses the verbose mode

Supply the colnames

read_delim(animal_file, 
           col_names = c("animal", "color", "value"),
           show_col_types = FALSE)
# A tibble: 3 × 3
  animal  color value
  <chr>   <chr> <dbl>
1 dog     red       1
2 cat     blue      2
3 chicken green     6

Better

Reference for column types

Column types are specified using col_types = cols()

Function Short Description
col_logical() l TRUE/FALSE, 1/0
col_integer() i integers
col_double() d floating point values.
col_number() n numbers containing the grouping_mark
col_date(format = "") D with the locale’s date_format
col_time(format = "") t with the locale’s time_format
col_datetime(format = "") T ISO8601 date times
col_factor(levels, ordered) f a fixed set of values
col_character() c everything else
col_skip() _ don’t import this column
col_guess() ? parse using the “best” type based on the input

Examples for data-aware inputs

Using one-letter shortcuts

readr::read_delim(animal_file, 
      col_names = c("animal", "color", "value"),
      col_types = cols(
        animal = "c",
        color = "_",
        value = "i"
      ))
# A tibble: 3 × 2
  animal  value
  <chr>   <int>
1 dog         1
2 cat         2
3 chicken     6

Even more compact

readr::read_delim(animal_file, 
      col_names = c("animal", "color", "value"),
      col_types = "c_i")
# A tibble: 3 × 2
  animal  value
  <chr>   <int>
1 dog         1
2 cat         2
3 chicken     6
  • Use the single character code in the column order

Skipping lines

Our example

# A tibble: 3 × 2
  animal  value
  <chr>   <int>
1 dog         1
2 cat         2
3 chicken     6

Comment

readr::read_delim(
  animal_file,
  comment = "d", #<<
  col_names = FALSE,
  show_col_types = FALSE)
# A tibble: 2 × 3
  X1      X2       X3
  <chr>   <chr> <dbl>
1 cat     blue      2
2 chicken green     6

Do not read lines beginning with a character.

Skip lines

readr::read_delim(
  animal_file,
  skip = 1, #<<
  col_names = FALSE,
  show_col_types = FALSE)
# A tibble: 2 × 3
  X1      X2       X3
  <chr>   <chr> <dbl>
1 cat     blue      2
2 chicken green     6

Limited number of lines

readr::read_delim(
  animal_file,
  skip = 1,
  n_max = 1, #<<
  col_names = FALSE,
  show_col_types = FALSE
)
# A tibble: 1 × 3
  X1    X2       X3
  <chr> <chr> <dbl>
1 cat   blue      2

Several files at once

Multiple files are read into one tibble

readr::read_csv(c("data/swiss.csv",  "data/swiss.csv"))
# A tibble: 94 × 6
   Fertility Agriculture Examination Education
       <dbl>       <dbl>       <dbl>     <dbl>
 1      80.2        17            15        12
 2      83.1        45.1           6         9
 3      92.5        39.7           5         5
 4      85.8        36.5          12         7
 5      76.9        43.5          17        15
 6      76.1        35.3           9         7
 7      83.8        70.2          16         7
 8      92.4        67.8          14         8
 9      82.4        53.3          12         7
10      82.9        45.2          16        13
# ℹ 84 more rows
# ℹ 2 more variables: Catholic <dbl>,
#   Infant.Mortality <dbl>

Files with different columns are not read

readr::read_csv(c("data/swiss.csv", "data/swiss2.csv"))
Error: Files must all have 6 columns:
* File 2 has 5 columns

Filenames as extra column

File names are recorded by providing the id argument.

options(width = 100)
readr::read_csv(c("data/swiss_ray.csv", "data/swiss.csv"),
                 col_name = c("fer", "agri", "exa", "edu",
                              "cath", "inf", "ray"),
                id =  "orig",
                show_col_types = FALSE)
# A tibble: 54 × 7
   orig               fer       agri        exa         edu       cath     inf             
   <chr>              <chr>     <chr>       <chr>       <chr>     <chr>    <chr>           
 1 data/swiss_ray.csv fer       agri        exa         edu       cath     xray            
 2 data/swiss_ray.csv 80.2      17          15          12        <NA>     <NA>            
 3 data/swiss_ray.csv a         45.1        6           9         84.8     <NA>            
 4 data/swiss_ray.csv 92.5      39.7        5           5         <NA>     <NA>            
 5 data/swiss_ray.csv 85.8      36.5        12          7         33.8     <NA>            
 6 data/swiss_ray.csv 76.9      43.5        17          15        5.16     <NA>            
 7 data/swiss.csv     Fertility Agriculture Examination Education Catholic Infant.Mortality
 8 data/swiss.csv     80.2      17          15          12        9.96     22.2            
 9 data/swiss.csv     83.1      45.1        6           9         84.84    22.2            
10 data/swiss.csv     92.5      39.7        5           5         93.4     20.2            
# ℹ 44 more rows

Can you spot the error?

Pass the input column names if not in agreement.

Reading in multiple files of different formats

Reading in many files, e.g. database tables can be cumbersome.

Read into a list or data.frame structure (see nesting in tidyr)

Tricks and tips

Loading zipped files from the web

Loading from the web

read_csv("https://biostat2.uni.lu/practicals/data/swiss.csv")
# A tibble: 47 × 6
   Fertility Agriculture Examination Education
       <dbl>       <dbl>       <dbl>     <dbl>
 1      80.2        17            15        12
 2      83.1        45.1           6         9
 3      92.5        39.7           5         5
 4      85.8        36.5          12         7
 5      76.9        43.5          17        15
 6      76.1        35.3           9         7
 7      83.8        70.2          16         7
 8      92.4        67.8          14         8
 9      82.4        53.3          12         7
10      82.9        45.2          16        13
# ℹ 37 more rows
# ℹ 2 more variables: Catholic <dbl>,
#   Infant.Mortality <dbl>

No need to download files.

Reading zipped files

read_csv("data/mtcars.csv.gz")
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0
 2  21       6  160    110  3.9   2.88  17.0     0
 3  22.8     4  108     93  3.85  2.32  18.6     1
 4  21.4     6  258    110  3.08  3.22  19.4     1
 5  18.7     8  360    175  3.15  3.44  17.0     0
 6  18.1     6  225    105  2.76  3.46  20.2     1
 7  14.3     8  360    245  3.21  3.57  15.8     0
 8  24.4     4  147.    62  3.69  3.19  20       1
 9  22.8     4  141.    95  3.92  3.15  22.9     1
10  19.2     6  168.   123  3.92  3.44  18.3     1
# ℹ 22 more rows
# ℹ 3 more variables: am <dbl>, gear <dbl>,
#   carb <dbl>

Excel files

Reading Excel files

readxl package is part of tidyverse and makes it easy to import tabular data from Excel spreadsheets with several options.

More details are described in this pkgdown website

Features

  • readxl author: Hadley Wickham and Jenny Bryan
  • read_excel() reads both xls and xlsx files and detects the format from the extension. Otherwise:
    • read_xls()
    • read_xlsx()
  • Return tibbles
  • Column type guessing
  • Discovers the minimal data rectangle and returns that, by default
  • No external library dependencies, e.g., Java or Perl

Usage guidance

  • Exert more control with range, skip, and n_max
  • excel_sheets() returns the sheet names
  • Combine multiple sheets, jointly with purrr::map_df()
  • col_types arguments match Excel’s data types
    • “numeric”, “text”, “skip”, “guess”,

The tidyverse packages to import your data

readr

  • read_csv(): comma separated (,)
  • read_csv2(): separated (;)
  • read_tsv(): tab separated
  • read_delim(): general delimited files, auto-guesses delimiter
  • read_fwf(): fixed width files
  • read_table(): columns separated by white-space(s)

readxl

  • read_excel()
  • read_xls()
  • read_xlsx()

haven

  • read_sas() for SAS
  • read_sav() for SPSS
  • read_dta() for Stata

Additions to readr and readxl

Need for (more) speed

Check fread() from data.table

  • Stable
  • No dependencies
  • Overhauled
  • Could feed from bash command
  • Could be faster than readr functions
  • Only relevant if you have data sets in the Gbyte range to read

Import Google spreadsheets

  • Google spreadsheet can be read with googlesheets4 package
  • Well developed package

Good to know

Hell are other people’s column names

Fixing column names on import

Duplicated, Missing, Ugly (e.g. - , %)

Colnames are repaired with several options for their treatment ("minimal", "unique", "universal", "check_unique").

See details in the vctrs package description.

## readxl::read_excel("data/bad-table-psych.xlsx", 
##                         skip = 2,
##                         n_max = 12,
##                   .name_repair = "universal") |> 
## rename_with(\(x) str_to_lower(x))  |> 
##   rename_with(\(x) str_replace_all(x, "\\.+", "_"))

Best practices

General advice

  • Structure your input data process.
  • Load all data early in your process and in a central place unless there are specific performance issues.
  • Re-factor your inputs, don’t rely on changing things during analysis.
  • Do not read the whole file just because you can for your production code.
  • Avoid downloading manually - use API calls or download data programmatically for external data unless size or availability becomes an issue.

Tidy imports

  • Check the cheat sheet
  • Read columns you want explicitly with appropriate col_types()
  • You tidyselect helpers to specify many similar columns
  • Clean up column names with dplyr and stringr functions early.

Before we stop

You learned to:

  • Appreciate the tibble features
  • Learn set-up working directory with RStudio projects
  • Learn the tidyselect syntax
  • Use readr::read_delim to import your flat files
  • Adjust the imported data types
  • Use readxl to import excel files

Further reading

Check R for Data Science for Import and Spreadsheets.

Contains information on databases and specialised formats.

Acknowledgments

  • Development:

  • Eric Koncina (initial installment)

  • Veronica Codoni (swiss data set)

  • Aurélien Ginolhac (vroom development)

  • Roland Krause (readr 2.0 update)

  • Input and inspiration:

  • Jim Hester (vroom, readr development)

  • Jenny Bryan (advice with project organization)

  • Hadley Wickham

  • Nicholas Tierney

Thank you for your attention!