Data Wrangling - Grouping

Grouping and summarizing

Milena Zizovic

Rworkshop

Wednesday, 7 February 2024

Introduction

Summarising

  • Work with summary data by

    • summarising,
    • adding to the existing tibble or
    • changing the number of rows.
  • dplyr 1.1.4 features persistent and temporary grouping as well as new functions for summaries.

  • Base R is using vectors of factors with aggregate()

Learning objectives

  • Compute groups with dplyr
  • Create summary statistics in an analysis workflow

Counting elements

How many participants per condition?

Replacing calls to table()

as.data.frame(table(judgments$condition))
     Var1 Freq
1 control   91
2  stress   97

Ugly and convoluted. First column is factor!

count(judgments, condition)
# A tibble: 2 × 2
  condition     n
  <chr>     <int>
1 control      91
2 stress       97

Counting

count() groups by specified columns.

Result is a tibble that can be processed further.

sort = TRUE avoids piping to arrange(desc(n))

count() is a shortcut for:

judgments |> 
  group_by(condition) |>
  summarise(n = n(), .groups = "drop")

Summarise data

When counting is not enough

summarise(judgments,
          min = min(mood_pre, na.rm = TRUE),
          max = max(mood_pre, na.rm = TRUE))
# A tibble: 1 × 2
    min   max
  <dbl> <dbl>
1     9   100
  • summarise returns as many rows as groups - one if no groups.
  • mutate returns as many rows as given.
mutate(judgments,
       min = min(mood_pre, na.rm = TRUE),
       max = max(mood_pre, na.rm = TRUE), .before = 1)
# A tibble: 188 × 160
     min   max start_date end_date  finished
   <dbl> <dbl> <chr>      <chr>        <dbl>
 1     9   100 11/3/2014  11/3/2014        1
 2     9   100 11/3/2014  11/3/2014        1
 3     9   100 11/3/2014  11/3/2014        1
 4     9   100 11/3/2014  11/3/2014        1
 5     9   100 11/3/2014  11/3/2014        1
 6     9   100 11/3/2014  11/3/2014        1
 7     9   100 11/3/2014  11/3/2014        1
 8     9   100 11/3/2014  11/3/2014        1
 9     9   100 11/3/2014  11/3/2014        1
10     9   100 11/3/2014  11/3/2014        1
# ℹ 178 more rows
# ℹ 155 more variables: condition <chr>,
#   subject <dbl>, gender <chr>, age <dbl>,
#   mood_pre <dbl>, mood_post <dbl>,
#   STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
#   STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, …

If we want the min/max per condition

judgments |> 
  group_by(condition) |> 
  summarise(min = min(mood_pre, na.rm = TRUE),
            max = max(mood_pre, na.rm = TRUE))
# A tibble: 2 × 3
  condition   min   max
  <chr>     <dbl> <dbl>
1 control      19   100
2 stress        9    96

Within one summarise statement

Commonly used

  • n() to count the number of rows
  • n_distinct() to count the number of distinct observations - used inside the dplyr verbs!
  • first() to extract the observation in the first position
  • last() to extract the observation in the last position
  • nth() to take the entry in a specified position

Stats functions

  • mean(), sd(), etc
summarise(judgments,
          n_rows = n(), 
          n_subject = n_distinct(subject),
          first_id = first(subject),
          last_id = last(subject),
          mean = mean(mood_pre, na.rm= TRUE),
          id_10 = nth(subject, n = 10))
# A tibble: 1 × 6
  n_rows n_subject first_id last_id  mean id_10
   <int>     <int>    <dbl>   <dbl> <dbl> <dbl>
1    188       187        2     189  59.4    13

Persistent of grouping with group_by()

group_by() results in a persistent group

  • Each call to the tibble will respect the grouping
  • Summarize will strip away one level of grouping
judgments |> group_by(condition)
# A tibble: 188 × 158
# Groups:   condition [2]
   start_date end_date  finished condition subject
   <chr>      <chr>        <dbl> <chr>       <dbl>
 1 11/3/2014  11/3/2014        1 control         2
 2 11/3/2014  11/3/2014        1 stress          1
 3 11/3/2014  11/3/2014        1 stress          3
 4 11/3/2014  11/3/2014        1 stress          4
 5 11/3/2014  11/3/2014        1 control         7
 6 11/3/2014  11/3/2014        1 stress          6
 7 11/3/2014  11/3/2014        1 control         5
 8 11/3/2014  11/3/2014        1 control         9
 9 11/3/2014  11/3/2014        1 stress         16
10 11/3/2014  11/3/2014        1 stress         13
# ℹ 178 more rows
# ℹ 153 more variables: gender <chr>, age <dbl>,
#   mood_pre <dbl>, mood_post <dbl>,
#   STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
#   STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>,
#   STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, …

The grouping is indicated in the resulting tibble.

Summarising is removing grouping

judgments |> 
  group_by(condition) |> 
  summarise(min = min(mood_pre, na.rm = TRUE))
# A tibble: 2 × 2
  condition   min
  <chr>     <dbl>
1 control      19
2 stress        9

Grouping removed

Not a concern for grouping by single variables but …

Grouping by more than one variable

Peeling effect

For variable grouping, one is peeled off from the right

judgments |>
  group_by(condition, gender) |> 
  summarise(n = n()) -> n_part

slice_head(n_part, n = 1) 
# A tibble: 2 × 3
# Groups:   condition [2]
  condition gender     n
  <chr>     <chr>  <int>
1 control   female    65
2 stress    female    82

Warning

Most functions in dplyr are group-aware!

No grouping after additional calls to summarise()

The remaining condition group was peeled off too

summarise(n_part, n = n())
# A tibble: 2 × 2
  condition     n
  <chr>     <int>
1 control       2
2 stress        2

Manipulate retaining grouping

Remark

Ask explicitly to ungroup data

  • by ungroup()
  • by .groups argument to keep or drop groups.

You can inspect the grouping of complex objects or programmatically by the `` functions

How many different answers for a variable by condition and gender?

judgments |>
  group_by(condition, gender) |> 
  summarise(n_ans = n_distinct(STAI_post_1_1),
            .groups = "drop")
# A tibble: 4 × 3
  condition gender n_ans
  <chr>     <chr>  <int>
1 control   female     1
2 control   male       1
3 stress    female     4
4 stress    male       3

Arranging values within groups

arrange() can sort values by multiple columns

Warning

Arranging is ignoring groups!

judgments |> 
  mutate(mood_pre_cat = case_when(
    mood_pre < 25  ~ "poor", 
    mood_pre > 75 ~ "great",
    TRUE ~ "normal")) |> 
  group_by(mood_pre_cat) |> 
  arrange(desc(mood_post))|> 
  select(mood_pre_cat, mood_post) |> 
  distinct()
# A tibble: 67 × 2
# Groups:   mood_pre_cat [3]
   mood_pre_cat mood_post
   <chr>            <dbl>
 1 normal             100
 2 normal              99
 3 great               98
 4 normal              94
 5 great               91
 6 great               89
 7 great               85
 8 great               83
 9 normal              83
10 normal              82
# ℹ 57 more rows

Arranging values within groups (continued.)

For arrange() and grouping

Solution

Use .by_group = TRUE

judgments |> 
  mutate(mood_pre_cat = case_when(
    mood_pre < 25  ~ "poor", 
    mood_pre > 75 ~ "great",
    TRUE ~ "normal")) |> 
  group_by(mood_pre_cat) |> 
  arrange(desc(mood_post), .by_group = TRUE) |> #<<
  select(mood_pre_cat, mood_post) |> 
  distinct()
# A tibble: 67 × 2
# Groups:   mood_pre_cat [3]
   mood_pre_cat mood_post
   <chr>            <dbl>
 1 great               98
 2 great               91
 3 great               89
 4 great               85
 5 great               83
 6 great               79
 7 great               75
 8 great               69
 9 great               62
10 great               59
# ℹ 57 more rows

But, you are better off using:

  [...] |>
  arrange(mood_pre_cat, desc(mood_post)) |> #<<
  [...]

Dealing with multiple return values per group

range() returns min and max

summarise() duplicates keys

judgments |>
  group_by(condition, gender) |> 
  summarise(range = range(mood_pre, na.rm = TRUE),
            n = n())
Warning: Returning more (or less) than 1 row per
`summarise()` group was deprecated in dplyr
1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to
  `reframe()`, remember that `reframe()` always
  returns an ungrouped data frame and adjust
  accordingly.
`summarise()` has grouped output by 'condition',
'gender'. You can override using the `.groups`
argument.
# A tibble: 8 × 4
# Groups:   condition, gender [4]
  condition gender range     n
  <chr>     <chr>  <dbl> <int>
1 control   female    29    65
2 control   female    95    65
3 control   male      19    26
4 control   male     100    26
5 stress    female     9    82
6 stress    female    96    82
7 stress    male      18    15
8 stress    male      85    15

Using reframe()

judgments |>
  group_by(condition, gender) |> 
  reframe(range = range(mood_pre, na.rm = TRUE),
            n = n())
# A tibble: 8 × 4
  condition gender range     n
  <chr>     <chr>  <dbl> <int>
1 control   female    29    65
2 control   female    95    65
3 control   male      19    26
4 control   male     100    26
5 stress    female     9    82
6 stress    female    96    82
7 stress    male      18    15
8 stress    male      85    15

Reframe with more values

Return values

Function Receives Returns
mutate() n rows (or 1) n rows
summarize() n groups n groups
reframe() n groups by k outputs an ungrouped tibble.

New in dplyr 1.1.0 for additional safety in programming with dplyr.

Example: Three quantiles

judgments |>
  filter(!is.na(mood_pre)) |> 
  group_by(condition, gender) |>
  reframe(
    quan = quantile(mood_pre,
                    c(0.25, 0.5, 0.75)),
    q = c(0.25, 0.5, 0.75),
    n = n())
# A tibble: 12 × 5
   condition gender  quan     q     n
   <chr>     <chr>  <dbl> <dbl> <int>
 1 control   female  52.8  0.25    64
 2 control   female  66    0.5     64
 3 control   female  78.2  0.75    64
 4 control   male    53.2  0.25    26
 5 control   male    65    0.5     26
 6 control   male    72.8  0.75    26
 7 stress    female  44    0.25    82
 8 stress    female  58.5  0.5     82
 9 stress    female  67    0.75    82
10 stress    male    45.5  0.25    15
11 stress    male    53    0.5     15
12 stress    male    69.5  0.75    15

Grouping and mutate()

Grouping works with other verbs

  • Most useful with mutate()
    • As rows are not collapsed, the data is simply computed as for the groups
  • Note that the tibble is grouped after the operation.
judgments |>
  group_by(condition) |>
  mutate(  
    n = n()) |> 
  relocate(subject, condition, n)
# A tibble: 188 × 159
# Groups:   condition [2]
   subject condition     n start_date end_date 
     <dbl> <chr>     <int> <chr>      <chr>    
 1       2 control      91 11/3/2014  11/3/2014
 2       1 stress       97 11/3/2014  11/3/2014
 3       3 stress       97 11/3/2014  11/3/2014
 4       4 stress       97 11/3/2014  11/3/2014
 5       7 control      91 11/3/2014  11/3/2014
 6       6 stress       97 11/3/2014  11/3/2014
 7       5 control      91 11/3/2014  11/3/2014
 8       9 control      91 11/3/2014  11/3/2014
 9      16 stress       97 11/3/2014  11/3/2014
10      13 stress       97 11/3/2014  11/3/2014
# ℹ 178 more rows
# ℹ 154 more variables: finished <dbl>,
#   gender <chr>, age <dbl>, mood_pre <dbl>,
#   mood_post <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>,
#   STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
#   STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, …

Temporary grouping

dplyr 1.1.0 introduced the .by argument

  • Works with summarize(), most useful with mutate()
  • No need to call ungroup()
judgments  |> 
  mutate(n = n(),
         .by = condition) |> 
  relocate(subject, condition, n)
# A tibble: 188 × 159
   subject condition     n start_date end_date 
     <dbl> <chr>     <int> <chr>      <chr>    
 1       2 control      91 11/3/2014  11/3/2014
 2       1 stress       97 11/3/2014  11/3/2014
 3       3 stress       97 11/3/2014  11/3/2014
 4       4 stress       97 11/3/2014  11/3/2014
 5       7 control      91 11/3/2014  11/3/2014
 6       6 stress       97 11/3/2014  11/3/2014
 7       5 control      91 11/3/2014  11/3/2014
 8       9 control      91 11/3/2014  11/3/2014
 9      16 stress       97 11/3/2014  11/3/2014
10      13 stress       97 11/3/2014  11/3/2014
# ℹ 178 more rows
# ℹ 154 more variables: finished <dbl>,
#   gender <chr>, age <dbl>, mood_pre <dbl>,
#   mood_post <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>,
#   STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
#   STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, …

Summary

Most commonly used - 80%

  • select() - columns
  • filter() - rows meeting condition
  • arrange() - sort
  • glimpse() - inspect
  • rename() - change column name
  • relocate() - move columns
  • mutate() - create columns
  • case_when() simplifies if/else/if/else
  • across(), c_across() - work on >1 column
  • group_by(), ungroup(), rowwise()
  • summarise() - group-wise summaries

Source: Lise Vaudor blog

Comments

  • Represent the verbs you will use 80% of your time. Go to the website to see additional functions.
  • tidyr and dplyr are replacing the reshape and reshape2 packages
  • tidy data

Your turn!

Exercises

  • Use judgments to compute basic statistics for all moral dilemma columns considering the conditions:

    • 1 Compute the mean, the median, the standard deviation as well as min and max values.
    • 2 Find meaningful short names for the functions such as med for median().
    • 3 Assign judgments_condition_stats to the results.
  • In judgments:

    • 1 Find the number of subjects by age, gender and condition, e.g. how many 20 years of age females are in the stress group.
    • 2 Sort the resulting tibble such that the condition that contains the most populous group is sorted first (i.e. stress or control appear together).
    • 3 Ensure that the resulting tibble does not contain groups.
10:00

Before we stop

You learned to:

  • Selection and manipulation of
    • observations,
    • variables and
    • values.
  • Grouping and summarizing

Next step: joining tables

Further reading

Acknowledgments

Contributions

  • Milena Zizovic
  • Aurélien Ginolhac

Thank you for your attention!