03:00
Merging data
Rworkshop
Wednesday, 7 February 2024
Primary key
The column that uniquely identifies an observation: - Frequently an ID - Can be introduced by row_number
if no primary column exists bona fide observations
Composite key
Several columns that together identify an observation uniquely to serve as primary key.
Like patient
and visit
Foreign key
A key referring to observations in another table
Warning
Tibbles do not have row names! Frequently these are the primary keys! Use as_tibble(data, rownames = "ID")
to make the rownames
into a proper column.
In judgments
In datasauRus::datasaurus_dozen
In swiss
03:00
::::
Composite key from all the variables
Or a primary key from row numbers
# A tibble: 1,846 × 4
ID dataset x y
<chr> <chr> <dbl> <dbl>
1 1 dino 55.4 97.2
2 2 dino 51.5 96.0
3 3 dino 46.2 94.5
4 4 dino 42.8 91.4
5 5 dino 40.8 88.3
6 6 dino 38.7 84.9
7 7 dino 35.6 79.9
8 8 dino 33.1 77.6
9 9 dino 29.0 74.5
10 10 dino 26.2 71.4
# ℹ 1,836 more rows
swiss
Province
is primary key.
Once you took care of the rownames
.
# A tibble: 47 × 7
Province Fertility Agriculture Examination
<chr> <dbl> <dbl> <int>
1 Courtelary 80.2 17 15
2 Delemont 83.1 45.1 6
3 Franches-Mnt 92.5 39.7 5
4 Moutier 85.8 36.5 12
5 Neuveville 76.9 43.5 17
6 Porrentruy 76.1 35.3 9
7 Broye 83.8 70.2 16
8 Glane 92.4 67.8 14
9 Gruyere 82.4 53.3 12
10 Sarine 82.9 45.2 16
# ℹ 37 more rows
# ℹ 3 more variables: Education <int>,
# Catholic <dbl>, Infant.Mortality <dbl>
Manual input using tribble()
subject_mood <- judgments |>
select(subject, condition, gender,
starts_with("mood")) |>
distinct()
subject_mood
# A tibble: 187 × 5
subject condition gender mood_pre mood_post
<dbl> <chr> <chr> <dbl> <dbl>
1 2 control female 81 NA
2 1 stress female 59 42
3 3 stress female 22 60
4 4 stress female 53 68
5 7 control female 48 NA
6 6 stress female 73 73
7 5 control female NA NA
8 9 control male 100 NA
9 16 stress female 67 74
10 13 stress female 30 68
# ℹ 177 more rows
Error! no common key.
Error in `inner_join()`:
! `by` must be supplied when `x` and `y`
have no common variables.
ℹ Use `cross_join()` to perform a cross-join.
Key operations in data processing
Role of observations as row changes
inner_join()
is the most strict join operations
merge
is a similar operation in base R
full_join()
# A tibble: 187 × 6
subject condition gender mood_pre mood_post
<dbl> <chr> <chr> <dbl> <dbl>
1 2 control female 81 NA
2 1 stress female 59 42
3 3 stress female 22 60
4 4 stress female 53 68
5 7 control female 48 NA
6 6 stress female 73 73
7 5 control female NA NA
8 9 control male 100 NA
9 16 stress female 67 74
10 13 stress female 30 68
# ℹ 177 more rows
# ℹ 1 more variable: coffee_shots <dbl>
What if we have gender
in both tables?
Mutating joins distinct the same column names by adding suffixes - .x
and .y
.
# A tibble: 187 × 7
subject condition gender.x mood_pre mood_post
<dbl> <chr> <chr> <dbl> <dbl>
1 2 control female 81 NA
2 1 stress female 59 42
3 3 stress female 22 60
4 4 stress female 53 68
5 7 control female 48 NA
6 6 stress female 73 73
7 5 control female NA NA
8 9 control male 100 NA
9 16 stress female 67 74
10 13 stress female 30 68
# ℹ 177 more rows
# ℹ 2 more variables: coffee_shots <dbl>,
# gender.y <chr>
Remark
Suffixes for columns with the same name can be controlled.
right_join(subject_mood,
coffee_drinkers,
by = c(subject = "student"),
suffix = c( "_mood", "_coffee"))
# A tibble: 3 × 7
subject condition gender_mood mood_pre mood_post
<dbl> <chr> <chr> <dbl> <dbl>
1 23 control female 78 NA
2 21 control female 68 NA
3 28 stress male 53 68
# ℹ 2 more variables: coffee_shots <dbl>,
# gender_coffee <chr>
Only the existence of a match is important; it doesn’t matter which observation is matched. This means that filtering joins never duplicate rows like mutating joins do
semi_join()
Filter matches in x, no duplicates
anti_join()
Extract what does not match
semi_join()
does not alter original# A tibble: 4 × 2
key x
<dbl> <chr>
1 1 x1
2 2 x2
3 3 x3
4 4 x4
pivot_longer()
and pivot_wider()
.Variants by sample and positions.
variants_long <-
pivot_longer(variants_wide,
cols = -contains("sample"), # columns argument, required
names_to = "pos",
values_to = "variant")
variants_long
# A tibble: 9 × 3
sample_id pos variant
<chr> <chr> <chr>
1 L1002 3 A
2 L1002 5 C
3 L1002 8 <NA>
4 L1034 3 A
5 L1034 5 <NA>
6 L1034 8 T
7 L1234 3 <NA>
8 L1234 5 C
9 L1234 8 T
# A tibble: 188 × 10
subject condition age moral_dilemma_dog moral_dilemma_wallet moral_dilemma_plane
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 2 control 24 9 9 8
2 1 stress 19 9 9 9
3 3 stress 19 8 7 8
4 4 stress 22 8 4 8
5 7 control 22 3 9 9
6 6 stress 22 9 9 9
7 5 control 18 9 5 7
8 9 control 20 9 4 1
9 16 stress 21 6 9 3
10 13 stress 19 6 8 9
# ℹ 178 more rows
# ℹ 4 more variables: moral_dilemma_resume <dbl>, moral_dilemma_kitten <dbl>,
# moral_dilemma_trolley <dbl>, moral_dilemma_control <dbl>
judgments |>
select(subject, condition, age,
starts_with("moral_dilemma") ) |>
pivot_longer(starts_with("moral_dilemma") ,
names_to = "dilemma",
values_to = "dilemma_val",
names_pattern = "moral_dilemma_(.*)")
# A tibble: 1,316 × 5
subject condition age dilemma dilemma_val
<dbl> <chr> <dbl> <chr> <dbl>
1 2 control 24 dog 9
2 2 control 24 wallet 9
3 2 control 24 plane 8
4 2 control 24 resume 7
5 2 control 24 kitten 9
6 2 control 24 trolley 5
7 2 control 24 control 9
8 1 stress 19 dog 9
9 1 stress 19 wallet 9
10 1 stress 19 plane 9
# ℹ 1,306 more rows
judgments |>
select(subject, condition, age,
starts_with("moral_dilemma") ) |>
pivot_longer(starts_with("moral_dilemma") ,
names_to = "dilemma",
values_to = "dilemma_val",
names_pattern = "moral_dilemma_(.*)") |>
count(condition, dilemma_val)
# A tibble: 18 × 3
condition dilemma_val n
<chr> <dbl> <int>
1 control 1 25
2 control 2 31
3 control 3 48
4 control 4 44
5 control 5 49
6 control 6 65
7 control 7 103
8 control 8 110
9 control 9 162
10 stress 1 13
11 stress 2 35
12 stress 3 32
13 stress 4 51
14 stress 5 56
15 stress 6 54
16 stress 7 103
17 stress 8 133
18 stress 9 202
judgments |>
select(subject, condition, age,
starts_with("moral_dilemma") ) |>
pivot_longer(starts_with("moral_dilemma") ,
names_to = "dilemma",
values_to = "dilemma_val",
names_pattern = "moral_dilemma_(.*)") |>
count(condition, dilemma_val) |>
pivot_wider(names_from = dilemma_val,
values_from = n,
names_prefix = "score")
# A tibble: 2 × 10
condition score1 score2 score3 score4 score5 score6 score7 score8 score9
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 control 25 31 48 44 49 65 103 110 162
2 stress 13 35 32 51 56 54 103 133 202
pull()
By default tidyverse operations that receive a tibble
as input return a tibble
.
lead()
for data in following row and lag()
for data in the row above
Let’s assume the subject IDs are in order of the tests being conducted. What is the difference to the previous subject for the “initial mood”?
judgments |>
select(subject, mood_pre) |>
arrange(subject) |>
mutate(prev_mood_pre = lag(mood_pre),
mood_diff = mood_pre - lag(mood_pre)) #<<
# A tibble: 188 × 4
subject mood_pre prev_mood_pre mood_diff
<dbl> <dbl> <dbl> <dbl>
1 1 59 NA NA
2 2 81 59 22
3 3 22 81 -59
4 4 53 22 31
5 5 NA 53 NA
6 6 73 NA NA
7 7 48 73 -25
8 8 59 48 11
9 9 100 59 41
10 10 72 100 -28
# ℹ 178 more rows
dplyr
bind_rows
, bind_cols
min_rank
, dense_rank
, cumsum
. See vignettemultidplyr
for parallelized codedplyr
code can be translated into SQL and query databases online (using dbplyr
)Take the results from the previous exercise available as judgments_condition_stats
and bring the data into a more readable format.
separate_wider_delim
or mutate
to create a column that contains the moral dilemma (dog, trolley, etc) and the stats (median, min, etc.).Exercises
The tribble
contains changes of the sequence of a gene. The format in the input is the expected sequence (the reference allele), the position and the variant, commonly called alternative allele. In T6G
, T
is the reference allele, 6
is the position (along the gene) and G
is the variant allele.
Clean the table of genetic variants such that all variants appear as a column labeled by their position.
Select relevant variants. In this table the variants are labeled according to their effect on stability of the gene product.
Identify the subjects in the table variants
that carry variants labeled as damaging in variant_significance
table. The final output should be vector of sample ids.
You can use several join flavours and the %in%
operator to achieve the same result.
data.table
data.table
versus dplyr
data.table
, see introduction is very efficient but the syntax is not so easy.data.table
dtplyr
is for learning the data.table
from dplyr
API inputtidytable
is actually using data.table
but with dplyr
syntaxdata.table
might be not useful for specialized applications with high volumes such as genomics.You learned to:
Further reading
Chapter 19 Joins
Acknowledgments