Data Wrangling - Grouping

Grouping and summarizing

Milena Zizovic

R Workshop

Tuesday, 11 February 2025

Introduction

Summarising

Work with summary data by
- summarising,
- adding to the existing tibble or
- changing the number of rows.
dplyr 1.1.4 features persistent and temporary grouping as well as new functions for summaries.
Base R is using vectors of factors with aggregate()

Learning objectives

Compute groups with dplyr
Create summary statistics in an analysis workflow

Counting elements

How many participants per condition?

Replacing calls to `table()`

as.data.frame(table(judgments$condition))

     Var1 Freq
1 control   91
2  stress   97

Ugly and convoluted. First column is factor!

count(judgments, condition)

# A tibble: 2 × 2
  condition     n
  <chr>     <int>
1 control      91
2 stress       97

Counting

count() groups by specified columns.

Result is a tibble that can be processed further.

sort = TRUE avoids piping to arrange(desc(n))

count() is a shortcut for:

judgments |> 
  group_by(condition) |>
  summarise(n = n(), .groups = "drop")

Summarise data

When counting is not enough

summarise(judgments,
          min = min(mood_pre, na.rm = TRUE),
          max = max(mood_pre, na.rm = TRUE))

# A tibble: 1 × 2
    min   max
  <dbl> <dbl>
1     9   100

summarise returns as many rows as groups - one if no groups.
mutate returns as many rows as given.

mutate(judgments,
       min = min(mood_pre, na.rm = TRUE),
       max = max(mood_pre, na.rm = TRUE), .before = 1)

# A tibble: 188 × 160
     min   max start_date end_date  finished
   <dbl> <dbl> <chr>      <chr>        <dbl>
 1     9   100 11/3/2014  11/3/2014        1
 2     9   100 11/3/2014  11/3/2014        1
 3     9   100 11/3/2014  11/3/2014        1
 4     9   100 11/3/2014  11/3/2014        1
 5     9   100 11/3/2014  11/3/2014        1
 6     9   100 11/3/2014  11/3/2014        1
 7     9   100 11/3/2014  11/3/2014        1
 8     9   100 11/3/2014  11/3/2014        1
 9     9   100 11/3/2014  11/3/2014        1
10     9   100 11/3/2014  11/3/2014        1
# ℹ 178 more rows
# ℹ 155 more variables: condition <chr>,
#   subject <dbl>, gender <chr>, age <dbl>,
#   mood_pre <dbl>, mood_post <dbl>,
#   STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
#   STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, …

If we want the min/max per condition

judgments |> 
  group_by(condition) |> 
  summarise(min = min(mood_pre, na.rm = TRUE),
            max = max(mood_pre, na.rm = TRUE))

# A tibble: 2 × 3
  condition   min   max
  <chr>     <dbl> <dbl>
1 control      19   100
2 stress        9    96

Within one summarise statement

Commonly used

n() to count the number of rows
n_distinct() to count the number of distinct observations - used inside the dplyr verbs!
first() to extract the observation in the first position
last() to extract the observation in the last position
nth() to take the entry in a specified position

Stats functions

mean(), sd(), etc

summarise(judgments,
          n_rows = n(), 
          n_subject = n_distinct(subject),
          first_id = first(subject),
          last_id = last(subject),
          mean = mean(mood_pre, na.rm= TRUE),
          id_10 = nth(subject, n = 10))

# A tibble: 1 × 6
  n_rows n_subject first_id last_id  mean id_10
   <int>     <int>    <dbl>   <dbl> <dbl> <dbl>
1    188       187        2     189  59.4    13

Persistent of grouping with `group_by()`

`group_by()` results in a persistent group

Each call to the tibble will respect the grouping
Summarize will strip away one level of grouping

judgments |> group_by(condition)

# A tibble: 188 × 158
# Groups:   condition [2]
   start_date end_date  finished condition subject
   <chr>      <chr>        <dbl> <chr>       <dbl>
 1 11/3/2014  11/3/2014        1 control         2
 2 11/3/2014  11/3/2014        1 stress          1
 3 11/3/2014  11/3/2014        1 stress          3
 4 11/3/2014  11/3/2014        1 stress          4
 5 11/3/2014  11/3/2014        1 control         7
 6 11/3/2014  11/3/2014        1 stress          6
 7 11/3/2014  11/3/2014        1 control         5
 8 11/3/2014  11/3/2014        1 control         9
 9 11/3/2014  11/3/2014        1 stress         16
10 11/3/2014  11/3/2014        1 stress         13
# ℹ 178 more rows
# ℹ 153 more variables: gender <chr>, age <dbl>,
#   mood_pre <dbl>, mood_post <dbl>,
#   STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
#   STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>,
#   STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, …

The grouping is indicated in the resulting tibble.

Summarising is removing grouping

judgments |> 
  group_by(condition) |> 
  summarise(min = min(mood_pre, na.rm = TRUE))

# A tibble: 2 × 2
  condition   min
  <chr>     <dbl>
1 control      19
2 stress        9

Grouping removed

Not a concern for grouping by single variables but …

Grouping by more than one variable

Peeling effect

For variable grouping, one is peeled off from the right

judgments |>
  group_by(condition, gender) |> 
  summarise(n = n()) -> n_part

slice_head(n_part, n = 1)

# A tibble: 2 × 3
# Groups:   condition [2]
  condition gender     n
  <chr>     <chr>  <int>
1 control   female    65
2 stress    female    82

Warning

Most functions in dplyr are group-aware!

No grouping after additional calls to `summarise()`

The remaining condition group was peeled off too

summarise(n_part, n = n())

# A tibble: 2 × 2
  condition     n
  <chr>     <int>
1 control       2
2 stress        2

Manipulate retaining grouping

Remark

Ask explicitly to ungroup data

by ungroup()
by .groups argument to keep or drop groups.

You can inspect the grouping of complex objects or programmatically by the `` functions

How many different answers for a variable by condition and gender?

judgments |>
  group_by(condition, gender) |> 
  summarise(n_ans = n_distinct(STAI_post_1_1),
            .groups = "drop")

# A tibble: 4 × 3
  condition gender n_ans
  <chr>     <chr>  <int>
1 control   female     1
2 control   male       1
3 stress    female     4
4 stress    male       3

Arranging values within groups

`arrange()` can sort values by multiple columns

Warning

Arranging is ignoring groups!

judgments |> 
  mutate(mood_pre_cat = case_when(
    mood_pre < 25  ~ "poor", 
    mood_pre > 75 ~ "great",
    TRUE ~ "normal")) |> 
  group_by(mood_pre_cat) |> 
  arrange(desc(mood_post))|> 
  select(mood_pre_cat, mood_post) |> 
  distinct()

# A tibble: 67 × 2
# Groups:   mood_pre_cat [3]
   mood_pre_cat mood_post
   <chr>            <dbl>
 1 normal             100
 2 normal              99
 3 great               98
 4 normal              94
 5 great               91
 6 great               89
 7 great               85
 8 great               83
 9 normal              83
10 normal              82
# ℹ 57 more rows

Arranging values within groups (continued.)

For `arrange()` and grouping

Solution

Use .by_group = TRUE

judgments |> 
  mutate(mood_pre_cat = case_when(
    mood_pre < 25  ~ "poor", 
    mood_pre > 75 ~ "great",
    TRUE ~ "normal")) |> 
  group_by(mood_pre_cat) |> 
  arrange(desc(mood_post), .by_group = TRUE) |>  
  select(mood_pre_cat, mood_post) |> 
  distinct()

# A tibble: 67 × 2
# Groups:   mood_pre_cat [3]
   mood_pre_cat mood_post
   <chr>            <dbl>
 1 great               98
 2 great               91
 3 great               89
 4 great               85
 5 great               83
 6 great               79
 7 great               75
 8 great               69
 9 great               62
10 great               59
# ℹ 57 more rows

But, you are better off using:

  [...] |>
  arrange(mood_pre_cat, desc(mood_post)) |>  
  [...]

Dealing with multiple return values per group

range() returns min and max

`summarise()` duplicates keys

judgments |>
  group_by(condition, gender) |> 
  summarise(range = range(mood_pre, na.rm = TRUE),
            n = n())

Warning: Returning more (or less) than 1 row per
`summarise()` group was deprecated in dplyr
1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to
  `reframe()`, remember that `reframe()` always
  returns an ungrouped data frame and adjust
  accordingly.

`summarise()` has grouped output by 'condition',
'gender'. You can override using the `.groups`
argument.

# A tibble: 8 × 4
# Groups:   condition, gender [4]
  condition gender range     n
  <chr>     <chr>  <dbl> <int>
1 control   female    29    65
2 control   female    95    65
3 control   male      19    26
4 control   male     100    26
5 stress    female     9    82
6 stress    female    96    82
7 stress    male      18    15
8 stress    male      85    15

Using `reframe()`

judgments |>
  group_by(condition, gender) |> 
  reframe(range = range(mood_pre, na.rm = TRUE),
            n = n())

# A tibble: 8 × 4
  condition gender range     n
  <chr>     <chr>  <dbl> <int>
1 control   female    29    65
2 control   female    95    65
3 control   male      19    26
4 control   male     100    26
5 stress    female     9    82
6 stress    female    96    82
7 stress    male      18    15
8 stress    male      85    15

Reframe with more values

Return values

Function	Receives	Returns
`mutate()`	`n` rows (or 1)	`n` rows
`summarize()`	`n` groups	`n` groups
`reframe()`	`n` groups by `k` outputs	an ungrouped tibble.

New in dplyr 1.1.0 for additional safety in programming with dplyr.

Example: Three quantiles

judgments |>
  filter(!is.na(mood_pre)) |> 
  group_by(condition, gender) |>
  reframe(
    quan = quantile(mood_pre,
                    c(0.25, 0.5, 0.75)),
    q = c(0.25, 0.5, 0.75),
    n = n())

# A tibble: 12 × 5
   condition gender  quan     q     n
   <chr>     <chr>  <dbl> <dbl> <int>
 1 control   female  52.8  0.25    64
 2 control   female  66    0.5     64
 3 control   female  78.2  0.75    64
 4 control   male    53.2  0.25    26
 5 control   male    65    0.5     26
 6 control   male    72.8  0.75    26
 7 stress    female  44    0.25    82
 8 stress    female  58.5  0.5     82
 9 stress    female  67    0.75    82
10 stress    male    45.5  0.25    15
11 stress    male    53    0.5     15
12 stress    male    69.5  0.75    15

Grouping and `mutate()`

Grouping works with other verbs

Most useful with mutate()
- As rows are not collapsed, the data is simply computed as for the groups
Note that the tibble is grouped after the operation.

judgments |>
  group_by(condition) |>
  mutate(  
    n = n()) |> 
  relocate(subject, condition, n)

# A tibble: 188 × 159
# Groups:   condition [2]
   subject condition     n start_date end_date 
     <dbl> <chr>     <int> <chr>      <chr>    
 1       2 control      91 11/3/2014  11/3/2014
 2       1 stress       97 11/3/2014  11/3/2014
 3       3 stress       97 11/3/2014  11/3/2014
 4       4 stress       97 11/3/2014  11/3/2014
 5       7 control      91 11/3/2014  11/3/2014
 6       6 stress       97 11/3/2014  11/3/2014
 7       5 control      91 11/3/2014  11/3/2014
 8       9 control      91 11/3/2014  11/3/2014
 9      16 stress       97 11/3/2014  11/3/2014
10      13 stress       97 11/3/2014  11/3/2014
# ℹ 178 more rows
# ℹ 154 more variables: finished <dbl>,
#   gender <chr>, age <dbl>, mood_pre <dbl>,
#   mood_post <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>,
#   STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
#   STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, …

Temporary grouping

`dplyr` 1.1.0 introduced the `.by` argument

Works with summarize(), most useful with mutate()
No need to call ungroup()

judgments  |> 
  mutate(n = n(),
         .by = condition) |> 
  relocate(subject, condition, n)

# A tibble: 188 × 159
   subject condition     n start_date end_date 
     <dbl> <chr>     <int> <chr>      <chr>    
 1       2 control      91 11/3/2014  11/3/2014
 2       1 stress       97 11/3/2014  11/3/2014
 3       3 stress       97 11/3/2014  11/3/2014
 4       4 stress       97 11/3/2014  11/3/2014
 5       7 control      91 11/3/2014  11/3/2014
 6       6 stress       97 11/3/2014  11/3/2014
 7       5 control      91 11/3/2014  11/3/2014
 8       9 control      91 11/3/2014  11/3/2014
 9      16 stress       97 11/3/2014  11/3/2014
10      13 stress       97 11/3/2014  11/3/2014
# ℹ 178 more rows
# ℹ 154 more variables: finished <dbl>,
#   gender <chr>, age <dbl>, mood_pre <dbl>,
#   mood_post <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>,
#   STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
#   STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, …

Summary

Most commonly used - 80%

select() - columns
filter() - rows meeting condition
arrange() - sort
glimpse() - inspect
rename() - change column name
relocate() - move columns
mutate() - create columns
case_when() simplifies if/else/if/else
across(), c_across() - work on >1 column
group_by(), ungroup(), rowwise()
summarise() - group-wise summaries

Source: Lise Vaudor blog

Comments

Represent the verbs you will use 80% of your time. Go to the website to see additional functions.
tidyr and dplyr are replacing the reshape and reshape2 packages
tidy data
- http://tidyr.tidyverse.org/
- vignette("tidy-data")

Your turn!

Exercises

Use judgments to compute basic statistics for all moral dilemma columns considering the conditions:
1. Compute the mean, the median, the standard deviation as well as min and max values.
2. Find meaningful short names for the functions such as med for median().
3. Assign judgments_condition_stats to the results.
In judgments:
1. Find the number of subjects by age, gender and condition, e.g. how many 20 years of age females are in the stress group.
2. Sort the resulting tibble such that the condition that contains the most populous group is sorted first (i.e. stress or control appear together).
3. Ensure that the resulting tibble does not contain groups.

10:00

Solution

judgments |> 
  group_by(condition) |> 
  summarize(across(starts_with("moral_dilemma"),
                   list(
                     mean = mean,
                     sd = sd,
                     med = median ,
                     min = min,
                     max = max
                   ))) -> judgments_condition_stats
judgments_condition_stats

# A tibble: 2 × 36
  condition moral_dilemma_dog_mean
  <chr>                      <dbl>
1 control                     7.24
2 stress                      7.45
# ℹ 34 more variables:
#   moral_dilemma_dog_sd <dbl>,
#   moral_dilemma_dog_med <dbl>,
#   moral_dilemma_dog_min <dbl>,
#   moral_dilemma_dog_max <dbl>,
#   moral_dilemma_wallet_mean <dbl>,
#   moral_dilemma_wallet_sd <dbl>, …

judgments |> 
  group_by( condition, gender, age) |> 
  summarize(n = n()) |> 
  arrange(desc(n), .by_group = TRUE) |> 
  ungroup()

# A tibble: 33 × 4
   condition gender   age     n
   <chr>     <chr>  <dbl> <int>
 1 control   female    18    25
 2 control   female    19    17
 3 control   female    21     7
 4 control   female    20     4
 5 control   female    22     4
 6 control   female    23     3
 7 control   female    17     2
 8 control   female    24     2
 9 control   female    26     1
10 control   male      18     7
# ℹ 23 more rows

Before we stop

You learned to:

Selection and manipulation of
- observations,
- variables and
- values.
Grouping and summarizing

Next step: joining tables

Further reading

R for Data Science

Chapter 3 - Data transformation

Acknowledgments

Gina Reynolds for fantastic flipbooks
Lise Vaudor nice blog
Allison Horst for the great ArtWork
Alexandre Courtiol for cheatsheets
Jenny Bryan
poorman by Nathan Eastwood, a re-implementation of dplyr in base only

Contributions

Roland Krause
Aurélien Ginolhac

Data Wrangling - Grouping

Introduction

Summarising

Learning objectives

Counting elements

Replacing calls to table()

Summarise data

When counting is not enough

If we want the min/max per condition

Within one summarise statement

Stats functions

Persistent of grouping with group_by()

group_by() results in a persistent group

Summarising is removing grouping

Grouping removed

Grouping by more than one variable

Peeling effect

No grouping after additional calls to summarise()

Manipulate retaining grouping

How many different answers for a variable by condition and gender?

Arranging values within groups

arrange() can sort values by multiple columns

Arranging values within groups (continued.)

For arrange() and grouping

Dealing with multiple return values per group

summarise() duplicates keys

Using reframe()

Reframe with more values

Example: Three quantiles

Grouping and mutate()

Grouping works with other verbs

Temporary grouping

dplyr 1.1.0 introduced the .by argument

Summary

Your turn!

Exercises

Solution

Before we stop

Contributions

Thank you for your attention!

Replacing calls to `table()`

Persistent of grouping with `group_by()`

`group_by()` results in a persistent group

No grouping after additional calls to `summarise()`

`arrange()` can sort values by multiple columns

For `arrange()` and grouping

`summarise()` duplicates keys

Using `reframe()`

Grouping and `mutate()`

`dplyr` 1.1.0 introduced the `.by` argument