as.data.frame(table(judgments$condition))
  Var1 Freq
1 control 91
2 stress 97
Grouping and summarizing
R Workshop
Tuesday, 11 February 2025
Work with summary data by
dplyr 1.1.4 features persistent and temporary grouping as well as new functions for summaries.
Base R uses vectors of factors with aggregate()
dplyr
How many participants per condition?
table()
Ugly and convoluted. The first column is a factor!
Counting
count() groups by specified columns.
Result is a tibble that can be processed further.
sort = TRUE avoids piping to arrange(desc(n))
count() is a shortcut for:
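The equivalence that count() abbreviates can be sketched on a small made-up table. The `toy` tibble and its values are hypothetical and only stand in for the workshop's judgments data.

```r
library(dplyr)

# Hypothetical toy data standing in for the judgments table
toy <- tibble(
  condition = c("control", "control", "stress", "stress", "stress"),
  mood_pre  = c(60, 45, 30, 80, 55)
)

# Short form: group, count, and sort in one call
via_count <- count(toy, condition, sort = TRUE)

# Long form: the same steps spelled out
via_summarise <- toy |>
  group_by(condition) |>
  summarise(n = n()) |>
  arrange(desc(n))

via_count
via_summarise
```

Both objects hold one row per condition with the row count in the column n, largest count first.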
# A tibble: 1 × 2
min max
<dbl> <dbl>
1 9 100
summarise() returns as many rows as groups - one if no groups.
mutate() returns as many rows as given.
mutate(judgments,
       min = min(mood_pre, na.rm = TRUE),
       max = max(mood_pre, na.rm = TRUE), .before = 1)
# A tibble: 188 × 160
min max start_date end_date finished
<dbl> <dbl> <chr> <chr> <dbl>
1 9 100 11/3/2014 11/3/2014 1
2 9 100 11/3/2014 11/3/2014 1
3 9 100 11/3/2014 11/3/2014 1
4 9 100 11/3/2014 11/3/2014 1
5 9 100 11/3/2014 11/3/2014 1
6 9 100 11/3/2014 11/3/2014 1
7 9 100 11/3/2014 11/3/2014 1
8 9 100 11/3/2014 11/3/2014 1
9 9 100 11/3/2014 11/3/2014 1
10 9 100 11/3/2014 11/3/2014 1
# ℹ 178 more rows
# ℹ 155 more variables: condition <chr>,
# subject <dbl>, gender <chr>, age <dbl>,
# mood_pre <dbl>, mood_post <dbl>,
# STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
# STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
# STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, …
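The difference in row counts between summarise() and mutate() can be checked on a small made-up table. The `toy` tibble is hypothetical and much smaller than the judgments data.

```r
library(dplyr)

# Hypothetical toy data standing in for the judgments table
toy <- tibble(
  condition = c("control", "control", "stress", "stress", "stress"),
  mood_pre  = c(60, 45, 30, 80, 55)
)

# No groups: summarise() collapses everything into a single row
one_row <- summarise(toy, min = min(mood_pre), max = max(mood_pre))

# With groups: summarise() returns one row per group
per_group <- toy |>
  group_by(condition) |>
  summarise(min = min(mood_pre), max = max(mood_pre))

# mutate() keeps every input row and repeats the summary value
all_rows <- mutate(toy, min = min(mood_pre), max = max(mood_pre), .before = 1)

nrow(one_row)
nrow(per_group)
nrow(all_rows)
```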
Commonly used
n() to count the number of rows
n_distinct() to count the number of distinct observations - used inside the dplyr verbs!
first() to extract the observation in the first position
last() to extract the observation in the last position
nth() to take the entry in a specified position
mean(), sd(), etc.
summarise(judgments,
          n_rows = n(),
          n_subject = n_distinct(subject),
          first_id = first(subject),
          last_id = last(subject),
          mean = mean(mood_pre, na.rm = TRUE),
          id_10 = nth(subject, n = 10))
# A tibble: 1 × 6
n_rows n_subject first_id last_id mean id_10
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 188 187 2 189 59.4 13
group_by()
group_by() results in a persistent group
# A tibble: 188 × 158
# Groups: condition [2]
start_date end_date finished condition subject
<chr> <chr> <dbl> <chr> <dbl>
1 11/3/2014 11/3/2014 1 control 2
2 11/3/2014 11/3/2014 1 stress 1
3 11/3/2014 11/3/2014 1 stress 3
4 11/3/2014 11/3/2014 1 stress 4
5 11/3/2014 11/3/2014 1 control 7
6 11/3/2014 11/3/2014 1 stress 6
7 11/3/2014 11/3/2014 1 control 5
8 11/3/2014 11/3/2014 1 control 9
9 11/3/2014 11/3/2014 1 stress 16
10 11/3/2014 11/3/2014 1 stress 13
# ℹ 178 more rows
# ℹ 153 more variables: gender <chr>, age <dbl>,
# mood_pre <dbl>, mood_post <dbl>,
# STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
# STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
# STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>,
# STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, …
The grouping is indicated in the resulting tibble.
When grouping by several variables, summarise() peels off one grouping level from the right
# A tibble: 2 × 3
# Groups: condition [2]
condition gender n
<chr> <chr> <int>
1 control female 65
2 stress female 82
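How one grouping level is peeled off per summarise() call can be traced with group_vars(). This is a minimal sketch on a hypothetical `toy` tibble; it relies on the default behaviour of summarise() when every group yields exactly one row.

```r
library(dplyr)

# Hypothetical toy data standing in for the judgments table
toy <- tibble(
  condition = c("control", "control", "stress", "stress", "stress"),
  gender    = c("female", "male", "female", "female", "male")
)

grouped <- group_by(toy, condition, gender)
group_vars(grouped)   # both grouping variables

# First summarise(): the rightmost grouping (gender) is dropped
counts <- summarise(grouped, n = n())
group_vars(counts)

# Second summarise(): the remaining grouping (condition) is dropped too
totals <- summarise(counts, n = sum(n))
group_vars(totals)
```

The first call also prints a message naming the grouping that is left, which is the same message shown in the judgments output elsewhere in these slides.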
Warning
Most functions in dplyr are group-aware!
Remark
Ask explicitly to ungroup data
ungroup()
.groups argument to keep or drop groups.
You can inspect the grouping of complex objects or programmatically by the `` functions
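The options for dropping or keeping groups can be sketched as follows. The `toy` tibble is hypothetical, and group_vars(), group_keys() and n_groups() are only one possible set of dplyr helpers for inspecting a grouping; the slide may have had others in mind.

```r
library(dplyr)

# Hypothetical toy data standing in for the judgments table
toy <- tibble(
  condition = c("control", "control", "stress", "stress", "stress"),
  gender    = c("female", "male", "female", "female", "male")
)

grouped <- group_by(toy, condition, gender)

# Inspect the grouping programmatically
n_groups(grouped)     # number of groups
group_vars(grouped)   # names of the grouping columns
group_keys(grouped)   # one row per group

# Decide what happens to the groups after summarising
dropped <- summarise(grouped, n = n(), .groups = "drop")
kept    <- summarise(grouped, n = n(), .groups = "keep")

# Remove the grouping explicitly
ungrouped <- ungroup(grouped)
is_grouped_df(ungrouped)
```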
arrange() can sort values by multiple columns
Warning
arrange() ignores groups!
judgments |>
  mutate(mood_pre_cat = case_when(
    mood_pre < 25 ~ "poor",
    mood_pre > 75 ~ "great",
    TRUE ~ "normal")) |>
  group_by(mood_pre_cat) |>
  arrange(desc(mood_post)) |>
  select(mood_pre_cat, mood_post) |>
  distinct()
# A tibble: 67 × 2
# Groups: mood_pre_cat [3]
mood_pre_cat mood_post
<chr> <dbl>
1 normal 100
2 normal 99
3 great 98
4 normal 94
5 great 91
6 great 89
7 great 85
8 great 83
9 normal 83
10 normal 82
# ℹ 57 more rows
arrange() and grouping
Solution
Use .by_group = TRUE
judgments |>
  mutate(mood_pre_cat = case_when(
    mood_pre < 25 ~ "poor",
    mood_pre > 75 ~ "great",
    TRUE ~ "normal")) |>
  group_by(mood_pre_cat) |>
  arrange(desc(mood_post), .by_group = TRUE) |>
  select(mood_pre_cat, mood_post) |>
  distinct()
# A tibble: 67 × 2
# Groups: mood_pre_cat [3]
mood_pre_cat mood_post
<chr> <dbl>
1 great 98
2 great 91
3 great 89
4 great 85
5 great 83
6 great 79
7 great 75
8 great 69
9 great 62
10 great 59
# ℹ 57 more rows
But you are better off using:
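One reading of this advice, which is an assumption here, is to name the grouping column in arrange() directly instead of relying on a grouping. The sketch below compares the three variants on a hypothetical `toy` tibble.

```r
library(dplyr)

# Hypothetical toy data with an already computed mood category
toy <- tibble(
  mood_pre_cat = c("normal", "great", "poor", "great", "normal"),
  mood_post    = c(70, 98, 20, 60, 90)
)

grouped <- group_by(toy, mood_pre_cat)

# Default: the grouping is ignored, rows are sorted globally
ignoring <- arrange(grouped, desc(mood_post))

# .by_group = TRUE: sort by the grouping column first
by_group <- arrange(grouped, desc(mood_post), .by_group = TRUE)

# Assumed alternative: no grouping, the sort keys are spelled out
explicit <- arrange(toy, mood_pre_cat, desc(mood_post))

ignoring$mood_post
by_group$mood_post
explicit$mood_post
```

The last two variants give the same row order; the explicit one does not leave a grouping behind on the result.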
range() returns min and max
summarise() duplicates keys
judgments |>
  group_by(condition, gender) |>
  summarise(range = range(mood_pre, na.rm = TRUE),
            n = n())
Warning: Returning more (or less) than 1 row per
`summarise()` group was deprecated in dplyr
1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to
`reframe()`, remember that `reframe()` always
returns an ungrouped data frame and adjust
accordingly.
`summarise()` has grouped output by 'condition',
'gender'. You can override using the `.groups`
argument.
# A tibble: 8 × 4
# Groups: condition, gender [4]
condition gender range n
<chr> <chr> <dbl> <int>
1 control female 29 65
2 control female 95 65
3 control male 19 26
4 control male 100 26
5 stress female 9 82
6 stress female 96 82
7 stress male 18 15
8 stress male 85 15
reframe()
# A tibble: 8 × 4
condition gender range n
<chr> <chr> <dbl> <int>
1 control female 29 65
2 control female 95 65
3 control male 19 26
4 control male 100 26
5 stress female 9 82
6 stress female 96 82
7 stress male 18 15
8 stress male 85 15
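A call that produces this kind of output without the deprecation warning might look like the sketch below. It uses a hypothetical `toy` tibble rather than the judgments data and assumes dplyr 1.1.0 or later, where reframe() is available.

```r
library(dplyr)

# Hypothetical toy data standing in for the judgments table
toy <- tibble(
  condition = c("control", "control", "stress", "stress", "stress"),
  mood_pre  = c(60, 45, 30, 80, 55)
)

# range() returns two values per group, so reframe() yields two rows per group;
# the single value from n() is recycled to match
ranges <- toy |>
  group_by(condition) |>
  reframe(range = range(mood_pre, na.rm = TRUE),
          n = n())

ranges
is_grouped_df(ranges)   # reframe() always returns an ungrouped result
```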
Return values
| Function | Receives | Returns |
|---|---|---|
| mutate() | n rows (or 1) | n rows |
| summarize() | n groups | n groups |
| reframe() | n groups by k outputs | an ungrouped tibble |
reframe() is new in dplyr 1.1.0 and adds safety when programming with dplyr.
judgments |>
  filter(!is.na(mood_pre)) |>
  group_by(condition, gender) |>
  reframe(
    quan = quantile(mood_pre,
                    c(0.25, 0.5, 0.75)),
    q = c(0.25, 0.5, 0.75),
    n = n())
# A tibble: 12 × 5
condition gender quan q n
<chr> <chr> <dbl> <dbl> <int>
1 control female 52.8 0.25 64
2 control female 66 0.5 64
3 control female 78.2 0.75 64
4 control male 53.2 0.25 26
5 control male 65 0.5 26
6 control male 72.8 0.75 26
7 stress female 44 0.25 82
8 stress female 58.5 0.5 82
9 stress female 67 0.75 82
10 stress male 45.5 0.25 15
11 stress male 53 0.5 15
12 stress male 69.5 0.75 15
mutate()
# A tibble: 188 × 159
# Groups: condition [2]
subject condition n start_date end_date
<dbl> <chr> <int> <chr> <chr>
1 2 control 91 11/3/2014 11/3/2014
2 1 stress 97 11/3/2014 11/3/2014
3 3 stress 97 11/3/2014 11/3/2014
4 4 stress 97 11/3/2014 11/3/2014
5 7 control 91 11/3/2014 11/3/2014
6 6 stress 97 11/3/2014 11/3/2014
7 5 control 91 11/3/2014 11/3/2014
8 9 control 91 11/3/2014 11/3/2014
9 16 stress 97 11/3/2014 11/3/2014
10 13 stress 97 11/3/2014 11/3/2014
# ℹ 178 more rows
# ℹ 154 more variables: finished <dbl>,
# gender <chr>, age <dbl>, mood_pre <dbl>,
# mood_post <dbl>, STAI_pre_1_1 <dbl>,
# STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>,
# STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
# STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, …
dplyr 1.1.0 introduced the .by argument
It also works with summarize(), but is most useful with mutate()
The grouping is temporary, so no ungroup() is needed
# A tibble: 188 × 159
subject condition n start_date end_date
<dbl> <chr> <int> <chr> <chr>
1 2 control 91 11/3/2014 11/3/2014
2 1 stress 97 11/3/2014 11/3/2014
3 3 stress 97 11/3/2014 11/3/2014
4 4 stress 97 11/3/2014 11/3/2014
5 7 control 91 11/3/2014 11/3/2014
6 6 stress 97 11/3/2014 11/3/2014
7 5 control 91 11/3/2014 11/3/2014
8 9 control 91 11/3/2014 11/3/2014
9 16 stress 97 11/3/2014 11/3/2014
10 13 stress 97 11/3/2014 11/3/2014
# ℹ 178 more rows
# ℹ 154 more variables: finished <dbl>,
# gender <chr>, age <dbl>, mood_pre <dbl>,
# mood_post <dbl>, STAI_pre_1_1 <dbl>,
# STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>,
# STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
# STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, …
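The contrast between persistent grouping with group_by() and temporary grouping with .by can be sketched on a hypothetical `toy` tibble. The sketch assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data standing in for the judgments table
toy <- tibble(
  subject   = 1:5,
  condition = c("control", "control", "stress", "stress", "stress")
)

# Persistent grouping: the result stays grouped until ungroup() is called
persistent <- toy |>
  group_by(condition) |>
  mutate(n = n())

# Temporary grouping: .by applies to this one call only
temporary <- toy |>
  mutate(n = n(), .by = condition)

is_grouped_df(persistent)
is_grouped_df(temporary)
temporary$n
```

Both variants add the same per-condition count to every row; only the grouping status of the result differs.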
Most commonly used - 80%
select() - columns
filter() - rows meeting condition
arrange() - sort
glimpse() - inspect
rename() - change column name
relocate() - move columns
mutate() - create columns
case_when() simplifies if/else/if/else
across(), c_across() - work on >1 column
group_by(), ungroup(), rowwise()
summarise() - group-wise summaries
Source: Lise Vaudor blog
Comments
tidyr and dplyr are replacing the reshape and reshape2 packages
vignette("tidy-data")
Use judgments to compute basic statistics for all moral dilemma columns considering the conditions:
Use the name med for median().
Assign judgments_condition_stats to the results.
In judgments:
stress group.
stress or control appear together).
judgments |>
  group_by(condition) |>
  summarize(across(starts_with("moral_dilemma"),
                   list(
                     mean = mean,
                     sd = sd,
                     med = median,
                     min = min,
                     max = max
                   ))) -> judgments_condition_stats
judgments_condition_stats
# A tibble: 2 × 36
condition moral_dilemma_dog_mean
<chr> <dbl>
1 control 7.24
2 stress 7.45
# ℹ 34 more variables:
# moral_dilemma_dog_sd <dbl>,
# moral_dilemma_dog_med <dbl>,
# moral_dilemma_dog_min <dbl>,
# moral_dilemma_dog_max <dbl>,
# moral_dilemma_wallet_mean <dbl>,
# moral_dilemma_wallet_sd <dbl>, …
judgments |>
  group_by(condition, gender, age) |>
  summarize(n = n()) |>
  arrange(desc(n), .by_group = TRUE) |>
  ungroup()
# A tibble: 33 × 4
condition gender age n
<chr> <chr> <dbl> <int>
1 control female 18 25
2 control female 19 17
3 control female 21 7
4 control female 20 4
5 control female 22 4
6 control female 23 3
7 control female 17 2
8 control female 24 2
9 control female 26 1
10 control male 18 7
# ℹ 23 more rows
You learned to:
Next step: joining tables
Further reading
Chapter 3 - Data transformation
Acknowledgments
dplyr in base only