Joins and pivots

Merging data

Milena Zizovic/Roland Krause

Rworkshop

Wednesday, 7 February 2024

Introduction

Motivation

Joining data frames

Combining data from two tables

How do we refer to a particular observation?

Primary key

The column that uniquely identifies an observation: - Frequently an ID - Can be introduced by row_number if no primary column exists bona fide observations

Composite key

Several columns that together identify an observation uniquely to serve as primary key.

Like patient and visit

Foreign key

A key referring to observations in another table

Your turn!

Warning

Tibbles do not have row names! Frequently these are the primary keys! Use as_tibble(data, rownames = "ID") to make the rownames into a proper column.

Which columns are keys?

  • In judgments

  • In datasauRus::datasaurus_dozen

  • In swiss

03:00

::::

Solution

Primary key in judgments

select(judgments, subject)
# A tibble: 188 × 1
   subject
     <dbl>
 1       2
 2       1
 3       3
 4       4
 5       7
 6       6
 7       5
 8       9
 9      16
10      13
# ℹ 178 more rows

Solution - Datasaurus

Composite key from all the variables

sample_n(datasauRus::datasaurus_dozen, 5)
# A tibble: 5 × 3
  dataset        x     y
  <chr>      <dbl> <dbl>
1 away        75.9  62.5
2 slant_down  39.2  33.4
3 v_lines     30.5  39.4
4 dots        75.8  17.1
5 dino        61.3  19.1

Or a primary key from row numbers

as_tibble(datasauRus::datasaurus_dozen, rownames = "ID")
# A tibble: 1,846 × 4
   ID    dataset     x     y
   <chr> <chr>   <dbl> <dbl>
 1 1     dino     55.4  97.2
 2 2     dino     51.5  96.0
 3 3     dino     46.2  94.5
 4 4     dino     42.8  91.4
 5 5     dino     40.8  88.3
 6 6     dino     38.7  84.9
 7 7     dino     35.6  79.9
 8 8     dino     33.1  77.6
 9 9     dino     29.0  74.5
10 10    dino     26.2  71.4
# ℹ 1,836 more rows

Solution swiss

Province is primary key.

Once you took care of the rownames.

as_tibble(swiss, rownames = "Province")
# A tibble: 47 × 7
   Province     Fertility Agriculture Examination
   <chr>            <dbl>       <dbl>       <int>
 1 Courtelary        80.2        17            15
 2 Delemont          83.1        45.1           6
 3 Franches-Mnt      92.5        39.7           5
 4 Moutier           85.8        36.5          12
 5 Neuveville        76.9        43.5          17
 6 Porrentruy        76.1        35.3           9
 7 Broye             83.8        70.2          16
 8 Glane             92.4        67.8          14
 9 Gruyere           82.4        53.3          12
10 Sarine            82.9        45.2          16
# ℹ 37 more rows
# ℹ 3 more variables: Education <int>,
#   Catholic <dbl>, Infant.Mortality <dbl>

Matching subjects in two tibbles

Additional data for some participants for coffee consumption

Manual input using tribble()

coffee_drinkers <- 
  tribble(~student, ~coffee_shots, 
          21,       1,
          23,       4,
          28,       2)

coffee_drinkers
# A tibble: 3 × 2
  student coffee_shots
    <dbl>        <dbl>
1      21            1
2      23            4
3      28            2

Smaller sample set

subject_mood <- judgments |>
  select(subject, condition, gender, 
         starts_with("mood")) |> 
  distinct()
subject_mood
# A tibble: 187 × 5
   subject condition gender mood_pre mood_post
     <dbl> <chr>     <chr>     <dbl>     <dbl>
 1       2 control   female       81        NA
 2       1 stress    female       59        42
 3       3 stress    female       22        60
 4       4 stress    female       53        68
 5       7 control   female       48        NA
 6       6 stress    female       73        73
 7       5 control   female       NA        NA
 8       9 control   male        100        NA
 9      16 stress    female       67        74
10      13 stress    female       30        68
# ℹ 177 more rows

Combining the two tables

Inner join

Error! no common key.

inner_join(coffee_drinkers, 
           subject_mood)
Error in `inner_join()`:
! `by` must be supplied when `x` and `y`
  have no common variables.
ℹ Use `cross_join()` to perform a cross-join.

Provide the corresponding columns names

inner_join(subject_mood,
           coffee_drinkers, 
           by = c(subject = "student")) 
# A tibble: 3 × 6
  subject condition gender mood_pre mood_post
    <dbl> <chr>     <chr>     <dbl>     <dbl>
1      23 control   female       78        NA
2      21 control   female       68        NA
3      28 stress    male         53        68
# ℹ 1 more variable: coffee_shots <dbl>

Mutating joins

Inner join

Creating new tables through joins

  • Key operations in data processing

  • Role of observations as row changes

  • inner_join() is the most strict join operations

  • merge is a similar operation in base R

Full Join

Example with full_join()

subject_mood |> 
  full_join(coffee_drinkers, 
             by = c(subject = "student")) 
# A tibble: 187 × 6
   subject condition gender mood_pre mood_post
     <dbl> <chr>     <chr>     <dbl>     <dbl>
 1       2 control   female       81        NA
 2       1 stress    female       59        42
 3       3 stress    female       22        60
 4       4 stress    female       53        68
 5       7 control   female       48        NA
 6       6 stress    female       73        73
 7       5 control   female       NA        NA
 8       9 control   male        100        NA
 9      16 stress    female       67        74
10      13 stress    female       30        68
# ℹ 177 more rows
# ℹ 1 more variable: coffee_shots <dbl>

Two tables - same column names

What if we have gender in both tables?

coffee_drinkers <- mutate(coffee_drinkers, 
                          gender = c("female", "female", "male"))
coffee_drinkers
# A tibble: 3 × 3
  student coffee_shots gender
    <dbl>        <dbl> <chr> 
1      21            1 female
2      23            4 female
3      28            2 male  

Mutating joins distinct the same column names by adding suffixes - .x and .y.

left_join(subject_mood,
          coffee_drinkers,
          by = c(subject = "student"))
# A tibble: 187 × 7
   subject condition gender.x mood_pre mood_post
     <dbl> <chr>     <chr>       <dbl>     <dbl>
 1       2 control   female         81        NA
 2       1 stress    female         59        42
 3       3 stress    female         22        60
 4       4 stress    female         53        68
 5       7 control   female         48        NA
 6       6 stress    female         73        73
 7       5 control   female         NA        NA
 8       9 control   male          100        NA
 9      16 stress    female         67        74
10      13 stress    female         30        68
# ℹ 177 more rows
# ℹ 2 more variables: coffee_shots <dbl>,
#   gender.y <chr>

Two tables - same column names

Remark

Suffixes for columns with the same name can be controlled.

Join by one column

right_join(subject_mood,
          coffee_drinkers,
          by = c(subject = "student"),
          suffix = c( "_mood", "_coffee"))
# A tibble: 3 × 7
  subject condition gender_mood mood_pre mood_post
    <dbl> <chr>     <chr>          <dbl>     <dbl>
1      23 control   female            78        NA
2      21 control   female            68        NA
3      28 stress    male              53        68
# ℹ 2 more variables: coffee_shots <dbl>,
#   gender_coffee <chr>

Join by two columns

right_join(subject_mood,
          coffee_drinkers,
          by = c(subject = "student", "gender"))
# A tibble: 3 × 6
  subject condition gender mood_pre mood_post
    <dbl> <chr>     <chr>     <dbl>     <dbl>
1      23 control   female       78        NA
2      21 control   female       68        NA
3      28 stress    male         53        68
# ℹ 1 more variable: coffee_shots <dbl>

Filtering joins -

Only the existence of a match is important; it doesn’t matter which observation is matched. This means that filtering joins never duplicate rows like mutating joins do

semi_join()

Filter matches in x, no duplicates

anti_join()

Extract what does not match

semi_join() does not alter original

(tx <- tribble(~key, ~x,
              1, "x1", 
              2, "x2", 
              3, "x3", 
              4, "x4"))
# A tibble: 4 × 2
    key x    
  <dbl> <chr>
1     1 x1   
2     2 x2   
3     3 x3   
4     4 x4   
(ty <- tribble(~key, ~y,
              1, "y1",
              2, "y2",
              2, "y3",
              3, "y4"))
# A tibble: 4 × 2
    key y    
  <dbl> <chr>
1     1 y1   
2     2 y2   
3     2 y3   
4     3 y4   

Filtering

semi_join(tx, ty)
# A tibble: 3 × 2
    key x    
  <dbl> <chr>
1     1 x1   
2     2 x2   
3     3 x3   

Mutating

inner_join(tx, ty)
# A tibble: 4 × 3
    key x     y    
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     2 x2    y3   
4     3 x3    y4   

When by = not precised, cols are reported

Exchange of variables and values with pivot functions

Converting into long or wide formats - pivot functions

  • The wide format is generally untidy but found in the majority of datasets.
  • The wide format makes computation on columns sometimes easier.
  • The functions are called pivot_longer() and pivot_wider().

Making a wide data set longer

Calculations involving column and values

A toy data set

Variants by sample and positions.

variants_wide <- tribble(
  ~sample_id, ~`3`, ~`5`, ~`8`,
  "L1002",    "A",  "C",  NA,
  "L1034",    "A",  NA,   "T",
  "L1234",    NA,   "C",  "T"
)
variants_wide
# A tibble: 3 × 4
  sample_id `3`   `5`   `8`  
  <chr>     <chr> <chr> <chr>
1 L1002     A     C     <NA> 
2 L1034     A     <NA>  T    
3 L1234     <NA>  C     T    

A longer, tidier table

variants_long <- 
  pivot_longer(variants_wide, 
  cols = -contains("sample"), # columns argument, required
  names_to = "pos",
  values_to = "variant")
variants_long
# A tibble: 9 × 3
  sample_id pos   variant
  <chr>     <chr> <chr>  
1 L1002     3     A      
2 L1002     5     C      
3 L1002     8     <NA>   
4 L1034     3     A      
5 L1034     5     <NA>   
6 L1034     8     T      
7 L1234     3     <NA>   
8 L1234     5     C      
9 L1234     8     T      

Example

judgments |> 
  select(subject, condition, age, 
         starts_with("moral_dilemma") ) 
# A tibble: 188 × 10
   subject condition   age moral_dilemma_dog moral_dilemma_wallet moral_dilemma_plane
     <dbl> <chr>     <dbl>             <dbl>                <dbl>               <dbl>
 1       2 control      24                 9                    9                   8
 2       1 stress       19                 9                    9                   9
 3       3 stress       19                 8                    7                   8
 4       4 stress       22                 8                    4                   8
 5       7 control      22                 3                    9                   9
 6       6 stress       22                 9                    9                   9
 7       5 control      18                 9                    5                   7
 8       9 control      20                 9                    4                   1
 9      16 stress       21                 6                    9                   3
10      13 stress       19                 6                    8                   9
# ℹ 178 more rows
# ℹ 4 more variables: moral_dilemma_resume <dbl>, moral_dilemma_kitten <dbl>,
#   moral_dilemma_trolley <dbl>, moral_dilemma_control <dbl>

Example

judgments |> 
  select(subject, condition, age, 
         starts_with("moral_dilemma") ) |> 
  pivot_longer(starts_with("moral_dilemma") , 
              names_to =  "dilemma",
              values_to = "dilemma_val",
              names_pattern = "moral_dilemma_(.*)")  
# A tibble: 1,316 × 5
   subject condition   age dilemma dilemma_val
     <dbl> <chr>     <dbl> <chr>         <dbl>
 1       2 control      24 dog               9
 2       2 control      24 wallet            9
 3       2 control      24 plane             8
 4       2 control      24 resume            7
 5       2 control      24 kitten            9
 6       2 control      24 trolley           5
 7       2 control      24 control           9
 8       1 stress       19 dog               9
 9       1 stress       19 wallet            9
10       1 stress       19 plane             9
# ℹ 1,306 more rows

Example

judgments |> 
  select(subject, condition, age, 
         starts_with("moral_dilemma") ) |> 
  pivot_longer(starts_with("moral_dilemma") , 
              names_to =  "dilemma",
              values_to = "dilemma_val",
              names_pattern = "moral_dilemma_(.*)")  |> 
  count(condition, dilemma_val) 
# A tibble: 18 × 3
   condition dilemma_val     n
   <chr>           <dbl> <int>
 1 control             1    25
 2 control             2    31
 3 control             3    48
 4 control             4    44
 5 control             5    49
 6 control             6    65
 7 control             7   103
 8 control             8   110
 9 control             9   162
10 stress              1    13
11 stress              2    35
12 stress              3    32
13 stress              4    51
14 stress              5    56
15 stress              6    54
16 stress              7   103
17 stress              8   133
18 stress              9   202

Example

judgments |> 
  select(subject, condition, age, 
         starts_with("moral_dilemma") ) |> 
  pivot_longer(starts_with("moral_dilemma") , 
              names_to =  "dilemma",
              values_to = "dilemma_val",
              names_pattern = "moral_dilemma_(.*)")  |> 
  count(condition, dilemma_val) |> 
  pivot_wider(names_from = dilemma_val, 
              values_from = n,
              names_prefix = "score")
# A tibble: 2 × 10
  condition score1 score2 score3 score4 score5 score6 score7 score8 score9
  <chr>      <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
1 control       25     31     48     44     49     65    103    110    162
2 stress        13     35     32     51     56     54    103    133    202

Helpful tools

Removing the data frame context through pull()

Remark

By default tidyverse operations that receive a tibble as input return a tibble.

Extract column by name as vector

pull(judgments, age)[1:10]
 [1] 24 19 19 22 22 22 18 20 21 19

Extract same vector along with names

pull(judgments, age, name = subject)[1:10]
 2  1  3  4  7  6  5  9 16 13 
24 19 19 22 22 22 18 20 21 19 

Comparing to data in other rows

Leading and lagging rows

lead() for data in following row and lag() for data in the row above

Calculate differences between subjects

Let’s assume the subject IDs are in order of the tests being conducted. What is the difference to the previous subject for the “initial mood”?

judgments |>
  select(subject, mood_pre) |> 
  arrange(subject) |> 
  mutate(prev_mood_pre = lag(mood_pre),
         mood_diff = mood_pre - lag(mood_pre)) #<<
# A tibble: 188 × 4
   subject mood_pre prev_mood_pre mood_diff
     <dbl>    <dbl>         <dbl>     <dbl>
 1       1       59            NA        NA
 2       2       81            59        22
 3       3       22            81       -59
 4       4       53            22        31
 5       5       NA            53        NA
 6       6       73            NA        NA
 7       7       48            73       -25
 8       8       59            48        11
 9       9      100            59        41
10      10       72           100       -28
# ℹ 178 more rows

The other 20% of dplyr

Occasionally handy

  • Assembly: bind_rows, bind_cols
  • Windows function, min_rank, dense_rank, cumsum. See vignette
  • Working with list-columns
  • multidplyr for parallelized code

SQL mapping allows database access

  • dplyr code can be translated into SQL and query databases online (using dbplyr)
  • Different types of tabular data (dplyr SQL backend, databases)

Your turn!

Exercises

Take the results from the previous exercise available as judgments_condition_stats and bring the data into a more readable format.

  1. Make the table longer. After this step, your tibble should contain three columns:
  • The condition
  • the name of the moral dilemma and stats (in one column)
  • the values of the stats by moral dilemma.
  1. Split the moral dilemma and stats column. You can use separate_wider_delim or mutate to create a column that contains the moral dilemma (dog, trolley, etc) and the stats (median, min, etc.).
  2. Make the table wider again and move the stats to individual columns.

Your turn!

Exercises

The tribble contains changes of the sequence of a gene. The format in the input is the expected sequence (the reference allele), the position and the variant, commonly called alternative allele. In T6G, T is the reference allele, 6 is the position (along the gene) and G is the variant allele.

  1. Clean the table of genetic variants such that all variants appear as a column labeled by their position.

  2. Select relevant variants. In this table the variants are labeled according to their effect on stability of the gene product.

  3. Identify the subjects in the table variants that carry variants labeled as damaging in variant_significance table. The final output should be vector of sample ids.

You can use several join flavours and the %in% operator to achieve the same result.

Input

```{r}
 tribble(
  ~sampleid, ~var1, ~var2, ~var3,
  "S1", "A3T", "T5G", "T6G",
  "S2", "A3G", "T5G", NA,
  "S3", "A3T", "T6C", "G10C",
  "S4", "A3T", "T6C", "G10C"
) 
```
15:00

Bigger data

Go for data.table

  • See this interesting thread about comparing data.table versus dplyr
  • data.table, see introduction is very efficient but the syntax is not so easy.
  • Main advantage: inline replacement (tidyverse is frequently copying)
  • As a summary: tl;dr data.table for speed, dplyr for readability and convenience Prashanth Sriram
  • Hadley recommends that for data > 1-2 Gb, if speed is your main matter, go for data.table
  • dtplyr is for learning the data.table from dplyr API input
  • tidytable is actually using data.table but with dplyr syntax
  • data.table might be not useful for specialized applications with high volumes such as genomics.

Before we stop

You learned to:

  • Joining and intersecting tibbles
  • Differ filtering from mutation joins
  • Reshaping column headers and variables

Further reading

Acknowledgments

Contributions

  • Milena Zizovic
  • Aurélien Ginolhac

Thank you for your attention!