Importing files

Author: Eric Koncina/Veronica Codoni

Affiliation: R Workshop

Published: February 11, 2025

Aims

In this practical, you’ll learn how to set up your project and import flat files using the readr package.

Before you start

To perform reproducible research, it is good practice to store files in a standardized location. We will take advantage of RStudio projects and store data files in a sub-folder called data. This tutorial is meant to be completed as part of the repository that you use for all practicals of the R tidyverse workshop.

Prepare your project’s folder

Check that the project is active; its name should appear in the top-right corner of RStudio. You can specify the name when cloning the repository.
We will use the data within your project’s folder. Use the Files pane in the lower-right RStudio panel, the terminal or your favorite file browser to access it.
Download the file blood_fat.csv and place it in the data sub-folder.
Add a setup code chunk to this Quarto document with the lines below to load the libraries. You don’t need to install the packages if those lines run without error.

library(dplyr)
library(readr)

Don’t forget to run the chunk’s code to load the libraries during your interactive session.

Warning

If you load the libraries only in the console and forget to add a chunk that loads them, the knitting process will fail. When you click on the knit button (Render for Quarto documents), the chunks are evaluated in a new, fresh environment.

Use `readr` to load your first file

Load the `blood_fat` file

Tip

The relative path can be safely built as "data/blood_fat.csv" if you followed the preliminary steps above and downloaded the csv into a sub-folder data of an RStudio project. For example, your folder structure could look as follows (depending on the names you picked). Here:

RStudio project is rworkshop-practicals
Quarto document is practical02_import.qmd

.
├── data
│   └── blood_fat.csv
├── practical02_import.qmd
└── rworkshop-practicals.Rproj
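
Before reading the file, it can help to confirm from R that the relative path resolves as expected. The lines below are a minimal sketch in base R; the variable name blood_fat_path is only illustrative, and the check assumes that the working directory is the project root (which is the case when the RStudio project is active).

```r
# Relative path to the data file, built from its components
blood_fat_path <- file.path("data", "blood_fat.csv")

# Create the data sub-folder if it does not exist yet
if (!dir.exists("data")) dir.create("data")

# Working directory: should be the project's folder
getwd()

# TRUE once blood_fat.csv has been placed in data/
file.exists(blood_fat_path)
```

If file.exists() returns FALSE, the file is either missing from data/ or the working directory is not the project's folder.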

Solution

blood_fat_file <- "data/blood_fat.csv"

Solution

read_delim(blood_fat_file)

Rows: 25 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): group
dbl (4): id, weight, age, fat

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 25 × 5
      id group weight   age   fat
   <dbl> <chr>  <dbl> <dbl> <dbl>
 1     1 A         84    46  354.
 2     2 A         73    20  190.
 3     3 A         65    52  406.
 4     4 A         70    30  264.
 5     5 A         76    57  452.
 6     6 A         69    25  302.
 7     7 A         63    28  288.
 8     8 A         72    36  386.
 9     9 A         79    57  402.
10    10 A         75    44  366.
# ℹ 15 more rows

read_delim() reports the dimensions of the file, along with the guessed delimiter and the data type of each column.
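
The message also points to spec() for retrieving the full column specification. Here is a minimal sketch of that call on a small temporary file. The column names mirror blood_fat.csv, but the values and the names toy_file and toy are made up for illustration, and the delimiter is passed explicitly rather than guessed.

```r
library(readr)

# Toy file with the same column names as blood_fat.csv.
# The values are invented for illustration only.
toy_file <- tempfile(fileext = ".csv")
writeLines(c("id,group,weight,age,fat",
             "1,A,80,40,300.5",
             "2,B,70,30,250.2"),
           toy_file)

toy <- read_delim(toy_file, delim = ",")

# Full column specification that readr used for this file
spec(toy)
```

The printed specification is a cols() call, so it can be copied, adjusted and passed back through the col_types argument.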

If we are happy with the guessed delimiter and the column names and types, we can silence this report.

Load again the same file, silencing the `read_delim()` message

Solution

read_delim(blood_fat_file, show_col_types = FALSE)

# A tibble: 25 × 5
      id group weight   age   fat
   <dbl> <chr>  <dbl> <dbl> <dbl>
 1     1 A         84    46  354.
 2     2 A         73    20  190.
 3     3 A         65    52  406.
 4     4 A         70    30  264.
 5     5 A         76    57  452.
 6     6 A         69    25  302.
 7     7 A         63    28  288.
 8     8 A         72    36  386.
 9     9 A         79    57  402.
10    10 A         75    44  366.
# ℹ 15 more rows

The tibble

read_delim() loads the data as a tibble. The main advantage of tibbles over regular data frames is how they are printed.

Tibbles show some useful information such as the number of rows and columns:
- Look at the top of the tibble and find the information “A tibble: rows × cols”.
- How many rows are in the tibble?
The columns of a tibble report their type:
- Look at the tibble header; the type of a column is reported just below its name.
- What is the type of the age column?

Solution

The tibble has 25 rows and 5 columns.
The age column is of type double (dbl).
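
The same answers can also be obtained programmatically instead of being read off the printed tibble. This is a sketch on a small temporary file with made-up values (toy_file and toy are illustrative names, not part of the practical); the same calls should work on the tibble read from blood_fat.csv.

```r
library(readr)

# Toy file with the same column names as blood_fat.csv (invented values)
toy_file <- tempfile(fileext = ".csv")
writeLines(c("id,group,weight,age,fat",
             "1,A,80,40,300.5",
             "2,B,70,30,250.2"),
           toy_file)

toy <- read_delim(toy_file, delim = ",", show_col_types = FALSE)

# Number of rows and columns
nrow(toy)
ncol(toy)

# Storage type of a single column
typeof(toy$age)

# Storage type of every column, as a named character vector
vapply(toy, typeof, character(1))
```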

Actually, both age and id are integers, and should be read as such.

Read the `blood_fat.csv` specifying the data types of `age` and `id` as integers

Tip

Inside col_types = cols(...), you can use the bare column names and set each type either with the long form, such as col_integer(), or with the shortcut "i".

Solution

read_delim(blood_fat_file,
           col_types = cols(age = "i",
                            id = "i"))

# A tibble: 25 × 5
      id group weight   age   fat
   <int> <chr>  <dbl> <int> <dbl>
 1     1 A         84    46  354.
 2     2 A         73    20  190.
 3     3 A         65    52  406.
 4     4 A         70    30  264.
 5     5 A         76    57  452.
 6     6 A         69    25  302.
 7     7 A         63    28  288.
 8     8 A         72    36  386.
 9     9 A         79    57  402.
10    10 A         75    44  366.
# ℹ 15 more rows
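
The solution above uses the shortcut "i". As the tip mentions, the long form col_integer() can be used instead. The sketch below compares both on a small temporary file; the values and the names toy_file, short_form and long_form are made up for illustration.

```r
library(readr)

# Toy file with the same column names as blood_fat.csv (invented values)
toy_file <- tempfile(fileext = ".csv")
writeLines(c("id,group,weight,age,fat",
             "1,A,80,40,300.5",
             "2,B,70,30,250.2"),
           toy_file)

# Shortcut: one-letter codes
short_form <- read_delim(toy_file, delim = ",",
                         col_types = cols(age = "i",
                                          id = "i"))

# Long form: explicit column constructors
long_form <- read_delim(toy_file, delim = ",",
                        col_types = cols(age = col_integer(),
                                         id = col_integer()))

# Both calls store id and age as integers; the other columns are still guessed
vapply(short_form, typeof, character(1))
vapply(long_form, typeof, character(1))
```

Columns that are not listed in cols() keep their guessed type, which is why weight and fat remain doubles.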

Read the `blood_fat.csv` specifying the data types of `age` and `id` as integers, skipping `weight`.

Solution

read_delim(blood_fat_file,
           col_types = cols(age = "i",
                            weight = "_",
                            id = "i"))

# A tibble: 25 × 4
      id group   age   fat
   <int> <chr> <int> <dbl>
 1     1 A        46  354.
 2     2 A        20  190.
 3     3 A        52  406.
 4     4 A        30  264.
 5     5 A        57  452.
 6     6 A        25  302.
 7     7 A        28  288.
 8     8 A        36  386.
 9     9 A        57  402.
10    10 A        44  366.
# ℹ 15 more rows
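
The shortcut "_" also has a long form, col_skip(). A related option, which is not part of the exercise, is cols_only(), which keeps only the listed columns instead of naming the ones to drop. Both are sketched below on a small temporary file with made-up values (toy_file, skipped and only are illustrative names).

```r
library(readr)

# Toy file with the same column names as blood_fat.csv (invented values)
toy_file <- tempfile(fileext = ".csv")
writeLines(c("id,group,weight,age,fat",
             "1,A,80,40,300.5",
             "2,B,70,30,250.2"),
           toy_file)

# Long form of "_": drop weight, guess the remaining unlisted columns
skipped <- read_delim(toy_file, delim = ",",
                      col_types = cols(age = col_integer(),
                                       weight = col_skip(),
                                       id = col_integer()))

# cols_only(): read just the listed columns, everything else is dropped
only <- read_delim(toy_file, delim = ",",
                   col_types = cols_only(id = "i",
                                         group = "c",
                                         age = "i",
                                         fat = "d"))

names(skipped)
names(only)
```

cols_only() can be convenient when a file has many columns and only a few are needed, while "_" or col_skip() is shorter when dropping just one or two.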

Add the data and the changed Quarto document to your GitHub repository

You can use the Git tab in RStudio or the command line, as you prefer.

Add the new files. For now it is OK to also include the data file.
Commit the new and the changed files. Choose a commit message.
Pull from the repository (this is just good practice).
Push your changes to the repository.

This is how all practicals in this workshop are submitted.

Command line calls.

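git add data/blood_fat.csv
git commit -m "New data set"
# ... edit and finalise the exercise ...
# stage the changed document (file name as in the example folder structure above)
git add practical02_import.qmd
git commit -m "Solutions for import practical"
git pull
git push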
