Grouping and summarizing
Rworkshop
Wednesday, 7 February 2024
Work with summary data by
dplyr 1.1.4 features persistent and temporary grouping as well as new functions for summaries.
Base R uses vectors of factors with aggregate()
dplyr
How many participants per condition?
With table():
as.data.frame(table(judgments$condition))
Var1 Freq
1 control 91
2 stress 97
Ugly and convoluted. The first column is a factor!
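The base R route mentioned above can be sketched on a small example. The `toy` data frame and its values are hypothetical stand-ins for `judgments`, used only so the snippet runs on its own.

```r
# Hypothetical toy data standing in for the workshop's `judgments` table
toy <- data.frame(
  condition = c("control", "stress", "stress", "control", "stress"),
  mood_pre  = c(60, 40, 30, 80, 50)
)

# table() counts rows per level; as.data.frame() turns the counts into columns
counts <- as.data.frame(table(toy$condition))
str(counts)  # Var1 is a factor, Freq is an integer

# aggregate() splits mood_pre by condition and applies FUN to each piece
means <- aggregate(mood_pre ~ condition, data = toy, FUN = mean)
means
```

Both calls work, but the column names (`Var1`, `Freq`) and the factor column need cleaning up before further processing.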
Counting
count() groups by the specified columns.
The result is a tibble that can be processed further.
sort = TRUE avoids piping to arrange(desc(n)).
count() is a shortcut for group_by() followed by summarise(n = n()).
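As a minimal sketch of that equivalence, the snippet below counts the same hypothetical `toy` tibble both ways (the data are invented for illustration, not taken from `judgments`).

```r
library(dplyr)

# Hypothetical toy data: one row per participant
toy <- tibble(condition = c("control", "stress", "stress", "control", "stress"))

# Short form: group, count and sort in one call
by_count <- count(toy, condition, sort = TRUE)

# Long form: explicit grouping, summary and sorting
by_hand <- toy |>
  group_by(condition) |>
  summarise(n = n()) |>
  arrange(desc(n))

by_count
by_hand
```

Both results are ungrouped tibbles with a `condition` column and an integer `n` column.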
summarise() returns as many rows as groups, one if there are no groups.
summarise(judgments,
  min = min(mood_pre, na.rm = TRUE),
  max = max(mood_pre, na.rm = TRUE))
# A tibble: 1 × 2
min max
<dbl> <dbl>
1 9 100
mutate() returns as many rows as given.
mutate(judgments,
  min = min(mood_pre, na.rm = TRUE),
  max = max(mood_pre, na.rm = TRUE), .before = 1)
# A tibble: 188 × 160
min max start_date end_date finished
<dbl> <dbl> <chr> <chr> <dbl>
1 9 100 11/3/2014 11/3/2014 1
2 9 100 11/3/2014 11/3/2014 1
3 9 100 11/3/2014 11/3/2014 1
4 9 100 11/3/2014 11/3/2014 1
5 9 100 11/3/2014 11/3/2014 1
6 9 100 11/3/2014 11/3/2014 1
7 9 100 11/3/2014 11/3/2014 1
8 9 100 11/3/2014 11/3/2014 1
9 9 100 11/3/2014 11/3/2014 1
10 9 100 11/3/2014 11/3/2014 1
# ℹ 178 more rows
# ℹ 155 more variables: condition <chr>,
# subject <dbl>, gender <chr>, age <dbl>,
# mood_pre <dbl>, mood_post <dbl>,
# STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
# STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
# STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, …
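The difference in row counts is easier to see on a table that fits on screen. This is a small sketch on a hypothetical `toy` tibble, not on the workshop data.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(
  subject  = 1:5,
  mood_pre = c(60, 40, NA, 80, 50)
)

# summarise() collapses the ungrouped table to a single row
s <- summarise(toy,
  min = min(mood_pre, na.rm = TRUE),
  max = max(mood_pre, na.rm = TRUE))

# mutate() keeps every row and repeats the same values in new columns
m <- mutate(toy,
  min = min(mood_pre, na.rm = TRUE),
  max = max(mood_pre, na.rm = TRUE), .before = 1)

nrow(s)
nrow(m)
```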
Commonly used
n() to count the number of rows
n_distinct() to count the number of distinct observations - used inside the dplyr verbs!
first() to extract the observation in the first position
last() to extract the observation in the last position
nth() to take the entry in a specified position
mean(), sd(), etc.
summarise(judgments,
  n_rows = n(),
  n_subject = n_distinct(subject),
  first_id = first(subject),
  last_id = last(subject),
  mean = mean(mood_pre, na.rm = TRUE),
  id_10 = nth(subject, n = 10))
# A tibble: 1 × 6
n_rows n_subject first_id last_id mean id_10
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 188 187 2 189 59.4 13
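To check what each helper returns, it can help to run them on a table small enough to verify by eye. The `toy` tibble below is hypothetical and deliberately contains a repeated subject and a missing mood value.

```r
library(dplyr)

# Hypothetical toy data: subject 3 appears twice, one mood value is missing
toy <- tibble(
  subject  = c(2, 1, 3, 3, 7),
  mood_pre = c(60, 40, NA, 50, 70)
)

out <- summarise(toy,
  n_rows    = n(),                  # all rows, duplicates included
  n_subject = n_distinct(subject),  # unique subjects only
  first_id  = first(subject),       # value in row 1
  last_id   = last(subject),        # value in the last row
  third_id  = nth(subject, n = 3),  # value in row 3
  mean      = mean(mood_pre, na.rm = TRUE))
out
```

A gap between `n_rows` and `n_subject`, as in the `judgments` output above (188 versus 187), points to a duplicated identifier.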
group_by()
group_by() results in a persistent group
judgments |>
  group_by(condition)
# A tibble: 188 × 158
# Groups: condition [2]
start_date end_date finished condition subject
<chr> <chr> <dbl> <chr> <dbl>
1 11/3/2014 11/3/2014 1 control 2
2 11/3/2014 11/3/2014 1 stress 1
3 11/3/2014 11/3/2014 1 stress 3
4 11/3/2014 11/3/2014 1 stress 4
5 11/3/2014 11/3/2014 1 control 7
6 11/3/2014 11/3/2014 1 stress 6
7 11/3/2014 11/3/2014 1 control 5
8 11/3/2014 11/3/2014 1 control 9
9 11/3/2014 11/3/2014 1 stress 16
10 11/3/2014 11/3/2014 1 stress 13
# ℹ 178 more rows
# ℹ 153 more variables: gender <chr>, age <dbl>,
# mood_pre <dbl>, mood_post <dbl>,
# STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
# STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
# STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>,
# STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, …
The grouping is indicated in the resulting tibble.
When grouping by several variables, each summarise() peels off one grouping level, starting from the right
# A tibble: 2 × 3
# Groups: condition [2]
condition gender n
<chr> <chr> <int>
1 control female 65
2 stress female 82
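The peeling behaviour can be followed step by step on a small example. This sketch uses a hypothetical `toy` tibble, and `.groups = "drop_last"` is spelled out only to make the default behaviour explicit and to silence the informational message.

```r
library(dplyr)

# Hypothetical toy data with two grouping variables
toy <- tibble(
  condition = c("control", "control", "stress", "stress", "stress"),
  gender    = c("female", "male", "female", "female", "male")
)

# First summarise(): gender (the right-most variable) is peeled off
per_pair <- toy |>
  group_by(condition, gender) |>
  summarise(n = n(), .groups = "drop_last")
group_vars(per_pair)        # only "condition" remains

# Second summarise(): condition is peeled off, the result is ungrouped
per_condition <- per_pair |>
  summarise(n = sum(n))
group_vars(per_condition)   # no grouping left
```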
Warning
Most functions in dplyr are group-aware!
Remark
Ask explicitly to ungroup data:
ungroup()
.groups argument to keep or drop groups.
You can inspect the grouping of complex objects, or do so programmatically, with the `` functions
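A short sketch of these options follows, again on a hypothetical `toy` tibble. The inspection helpers used here, `group_vars()`, `n_groups()` and `is_grouped_df()`, are my choice of dplyr functions for illustration and may not be the ones the slide had in mind.

```r
library(dplyr)

# Hypothetical toy data with two grouping variables
toy <- tibble(
  condition = c("control", "control", "stress", "stress", "stress"),
  gender    = c("female", "male", "female", "female", "male")
)

grouped <- group_by(toy, condition, gender)

# Inspect the grouping programmatically
group_vars(grouped)   # names of the grouping columns
n_groups(grouped)     # number of distinct groups

# .groups controls what summarise() leaves behind
kept    <- summarise(grouped, n = n(), .groups = "keep")
dropped <- summarise(grouped, n = n(), .groups = "drop")

# ungroup() removes the grouping without changing the rows
flat <- ungroup(grouped)
is_grouped_df(flat)
```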
arrange() can sort values by multiple columns
Warning
Arranging ignores groups!
judgments |>
  mutate(mood_pre_cat = case_when(
    mood_pre < 25 ~ "poor",
    mood_pre > 75 ~ "great",
    TRUE ~ "normal")) |>
  group_by(mood_pre_cat) |>
  arrange(desc(mood_post)) |>
  select(mood_pre_cat, mood_post) |>
  distinct()
# A tibble: 67 × 2
# Groups: mood_pre_cat [3]
mood_pre_cat mood_post
<chr> <dbl>
1 normal 100
2 normal 99
3 great 98
4 normal 94
5 great 91
6 great 89
7 great 85
8 great 83
9 normal 83
10 normal 82
# ℹ 57 more rows
arrange() and grouping
Solution
Use .by_group = TRUE
judgments |>
  mutate(mood_pre_cat = case_when(
    mood_pre < 25 ~ "poor",
    mood_pre > 75 ~ "great",
    TRUE ~ "normal")) |>
  group_by(mood_pre_cat) |>
  arrange(desc(mood_post), .by_group = TRUE) |>
  select(mood_pre_cat, mood_post) |>
  distinct()
# A tibble: 67 × 2
# Groups: mood_pre_cat [3]
mood_pre_cat mood_post
<chr> <dbl>
1 great 98
2 great 91
3 great 89
4 great 85
5 great 83
6 great 79
7 great 75
8 great 69
9 great 62
10 great 59
# ℹ 57 more rows
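The effect of .by_group is easiest to verify on four rows. In this sketch the `toy` tibble and the shortened column name `cat` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two mood categories, two rows each
toy <- tibble(
  cat       = c("normal", "great", "normal", "great"),
  mood_post = c(100, 98, 82, 91)
)
grouped <- group_by(toy, cat)

# Default: the grouping is ignored, rows are sorted globally
plain <- arrange(grouped, desc(mood_post))

# .by_group = TRUE: sort by the grouping column first, then within each group
within <- arrange(grouped, desc(mood_post), .by_group = TRUE)

plain
within
```

In `plain` the two categories are interleaved. In `within` all "great" rows come first, each block sorted by descending `mood_post`.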
But, you are better off using:
range() returns min and max
summarise() duplicates keys
judgments |>
  group_by(condition, gender) |>
  summarise(range = range(mood_pre, na.rm = TRUE),
            n = n())
Warning: Returning more (or less) than 1 row per
`summarise()` group was deprecated in dplyr
1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to
`reframe()`, remember that `reframe()` always
returns an ungrouped data frame and adjust
accordingly.
`summarise()` has grouped output by 'condition',
'gender'. You can override using the `.groups`
argument.
# A tibble: 8 × 4
# Groups: condition, gender [4]
condition gender range n
<chr> <chr> <dbl> <int>
1 control female 29 65
2 control female 95 65
3 control male 19 26
4 control male 100 26
5 stress female 9 82
6 stress female 96 82
7 stress male 18 15
8 stress male 85 15
reframe()
# A tibble: 8 × 4
condition gender range n
<chr> <chr> <dbl> <int>
1 control female 29 65
2 control female 95 65
3 control male 19 26
4 control male 100 26
5 stress female 9 82
6 stress female 96 82
7 stress male 18 15
8 stress male 85 15
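A minimal sketch of the reframe() version is given below on a hypothetical `toy` tibble. The extra `stat` column is my addition for readability and labels which row holds the minimum and which the maximum.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  condition = c("control", "stress", "stress", "control", "stress"),
  mood_pre  = c(60, 40, 30, 80, 50)
)

res <- toy |>
  group_by(condition) |>
  reframe(
    stat  = c("min", "max"),                 # label for each row of range()
    value = range(mood_pre, na.rm = TRUE),   # two values per group
    n     = n())                             # recycled to two rows per group

res
is_grouped_df(res)   # reframe() always returns an ungrouped result
```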
Return values
Function | Receives | Returns |
---|---|---|
mutate() | n rows (or 1) | n rows |
summarize() | n groups | n groups |
reframe() | n groups by k outputs | an ungrouped tibble |
New in dplyr 1.1.0 for additional safety in programming with dplyr.
judgments |>
  filter(!is.na(mood_pre)) |>
  group_by(condition, gender) |>
  reframe(
    quan = quantile(mood_pre, c(0.25, 0.5, 0.75)),
    q = c(0.25, 0.5, 0.75),
    n = n())
# A tibble: 12 × 5
condition gender quan q n
<chr> <chr> <dbl> <dbl> <int>
1 control female 52.8 0.25 64
2 control female 66 0.5 64
3 control female 78.2 0.75 64
4 control male 53.2 0.25 26
5 control male 65 0.5 26
6 control male 72.8 0.75 26
7 stress female 44 0.25 82
8 stress female 58.5 0.5 82
9 stress female 67 0.75 82
10 stress male 45.5 0.25 15
11 stress male 53 0.5 15
12 stress male 69.5 0.75 15
mutate()
# A tibble: 188 × 159
# Groups: condition [2]
subject condition n start_date end_date
<dbl> <chr> <int> <chr> <chr>
1 2 control 91 11/3/2014 11/3/2014
2 1 stress 97 11/3/2014 11/3/2014
3 3 stress 97 11/3/2014 11/3/2014
4 4 stress 97 11/3/2014 11/3/2014
5 7 control 91 11/3/2014 11/3/2014
6 6 stress 97 11/3/2014 11/3/2014
7 5 control 91 11/3/2014 11/3/2014
8 9 control 91 11/3/2014 11/3/2014
9 16 stress 97 11/3/2014 11/3/2014
10 13 stress 97 11/3/2014 11/3/2014
# ℹ 178 more rows
# ℹ 154 more variables: finished <dbl>,
# gender <chr>, age <dbl>, mood_pre <dbl>,
# mood_post <dbl>, STAI_pre_1_1 <dbl>,
# STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>,
# STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
# STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, …
dplyr 1.1.0 introduced the .by argument for temporary, per-operation grouping.
It is available in summarize() and is most useful with mutate().
The result needs no ungroup().
# A tibble: 188 × 159
subject condition n start_date end_date
<dbl> <chr> <int> <chr> <chr>
1 2 control 91 11/3/2014 11/3/2014
2 1 stress 97 11/3/2014 11/3/2014
3 3 stress 97 11/3/2014 11/3/2014
4 4 stress 97 11/3/2014 11/3/2014
5 7 control 91 11/3/2014 11/3/2014
6 6 stress 97 11/3/2014 11/3/2014
7 5 control 91 11/3/2014 11/3/2014
8 9 control 91 11/3/2014 11/3/2014
9 16 stress 97 11/3/2014 11/3/2014
10 13 stress 97 11/3/2014 11/3/2014
# ℹ 178 more rows
# ℹ 154 more variables: finished <dbl>,
# gender <chr>, age <dbl>, mood_pre <dbl>,
# mood_post <dbl>, STAI_pre_1_1 <dbl>,
# STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>,
# STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
# STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, …
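The two styles can be compared side by side. This sketch adds a per-condition count to a hypothetical `toy` tibble, once with persistent grouping and once with `.by`.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  subject   = 1:5,
  condition = c("control", "stress", "stress", "control", "stress")
)

# Persistent grouping: must be undone explicitly
persistent <- toy |>
  group_by(condition) |>
  mutate(n = n()) |>
  ungroup()

# Temporary grouping: applies to this mutate() call only
temporary <- mutate(toy, n = n(), .by = condition)

temporary
is_grouped_df(temporary)
```

Both versions keep the original row order and produce the same columns. The `.by` form simply cannot leave a forgotten grouping behind.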
Most commonly used - 80%
select() - columns
filter() - rows meeting condition
arrange() - sort
glimpse() - inspect
rename() - change column name
relocate() - move columns
mutate() - create columns
case_when() simplifies if/else/if/else
across(), c_across() - work on >1 column
group_by(), ungroup(), rowwise()
summarise() - group-wise summaries
Source: Lise Vaudor blog
Comments
tidyr and dplyr are replacing the reshape and reshape2 packages
vignette("tidy-data")
Use judgments to compute basic statistics for all moral dilemma columns considering the conditions:
med for median().
judgments_condition_stats to the results.
In judgments:
stress group.
stress or control appear together).
You learned to:
Next step: joining tables
Further reading
Chapter 3 - Data transformation
Acknowledgments
dplyr in base only