as.data.frame(table(judgments$condition))
     Var1 Freq
1 control   91
2  stress   97
Grouping and summarizing
R Workshop
Tuesday, 11 February 2025
Work with summary data by
dplyr 1.1.4 features persistent and temporary grouping as well as new functions for summaries.
Base R uses vectors of factors with aggregate()
dplyr
How many participants per condition?
table()
Ugly and convoluted. The first column is a factor!
Counting
count() groups by specified columns.
Result is a tibble that can be processed further.
sort = TRUE avoids piping to arrange(desc(n))
count() is a shortcut for:
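A minimal sketch of the equivalence, assuming a small hypothetical table `toy` in place of the workshop's `judgments` data. The dplyr documentation describes count() as roughly equivalent to group_by() followed by summarise(n = n()); the arrange() step mirrors sort = TRUE.

```r
library(dplyr)

# Hypothetical toy data standing in for `judgments`
toy <- tibble(
  subject   = 1:6,
  condition = c("control", "stress", "stress", "control", "stress", "stress")
)

# Shortcut: one call, sorted by descending count
by_count <- count(toy, condition, sort = TRUE)

# Long form: group, count rows per group, then sort
by_hand <- toy |>
  group_by(condition) |>
  summarise(n = n()) |>
  arrange(desc(n))

by_count
by_hand
```

Both objects are ungrouped tibbles with one row per condition, so either can be piped into further verbs.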
# A tibble: 1 × 2
    min   max
  <dbl> <dbl>
1     9   100
summarise() returns as many rows as groups, or one if there are no groups.
mutate() returns as many rows as it is given.
mutate(judgments,
       min = min(mood_pre, na.rm = TRUE),
       max = max(mood_pre, na.rm = TRUE), .before = 1)
# A tibble: 188 × 160
     min   max start_date end_date  finished
   <dbl> <dbl> <chr>      <chr>        <dbl>
 1     9   100 11/3/2014  11/3/2014        1
 2     9   100 11/3/2014  11/3/2014        1
 3     9   100 11/3/2014  11/3/2014        1
 4     9   100 11/3/2014  11/3/2014        1
 5     9   100 11/3/2014  11/3/2014        1
 6     9   100 11/3/2014  11/3/2014        1
 7     9   100 11/3/2014  11/3/2014        1
 8     9   100 11/3/2014  11/3/2014        1
 9     9   100 11/3/2014  11/3/2014        1
10     9   100 11/3/2014  11/3/2014        1
# ℹ 178 more rows
# ℹ 155 more variables: condition <chr>,
#   subject <dbl>, gender <chr>, age <dbl>,
#   mood_pre <dbl>, mood_post <dbl>,
#   STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
#   STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, …
Commonly used
n() to count the number of rows
n_distinct() to count the number of distinct observations - used inside the dplyr verbs!
first() to extract the observation in the first position
last() to extract the observation in the last position
nth() to take the entry in a specified position
mean(), sd(), etc
summarise(judgments,
          n_rows = n(), 
          n_subject = n_distinct(subject),
          first_id = first(subject),
          last_id = last(subject),
          mean = mean(mood_pre, na.rm= TRUE),
          id_10 = nth(subject, n = 10))
# A tibble: 1 × 6
  n_rows n_subject first_id last_id  mean id_10
   <int>     <int>    <dbl>   <dbl> <dbl> <dbl>
1    188       187        2     189  59.4    13
group_by()
group_by() results in a persistent group
# A tibble: 188 × 158
# Groups:   condition [2]
   start_date end_date  finished condition subject
   <chr>      <chr>        <dbl> <chr>       <dbl>
 1 11/3/2014  11/3/2014        1 control         2
 2 11/3/2014  11/3/2014        1 stress          1
 3 11/3/2014  11/3/2014        1 stress          3
 4 11/3/2014  11/3/2014        1 stress          4
 5 11/3/2014  11/3/2014        1 control         7
 6 11/3/2014  11/3/2014        1 stress          6
 7 11/3/2014  11/3/2014        1 control         5
 8 11/3/2014  11/3/2014        1 control         9
 9 11/3/2014  11/3/2014        1 stress         16
10 11/3/2014  11/3/2014        1 stress         13
# ℹ 178 more rows
# ℹ 153 more variables: gender <chr>, age <dbl>,
#   mood_pre <dbl>, mood_post <dbl>,
#   STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
#   STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>,
#   STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, …
The grouping is indicated in the resulting tibble.
When grouping by several variables, each summarise() peels off one grouping level, starting from the right
# A tibble: 2 × 3
# Groups:   condition [2]
  condition gender     n
  <chr>     <chr>  <int>
1 control   female    65
2 stress    female    82
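The peeling behaviour can be sketched as follows, assuming a hypothetical `toy` table rather than the real `judgments` data. group_vars() reports which grouping variables remain after each summarise().

```r
library(dplyr)

# Hypothetical toy data standing in for `judgments`
toy <- tibble(
  condition = c("control", "stress", "stress", "control", "stress", "stress"),
  gender    = c("female", "female", "male", "male", "female", "female")
)

# Two grouping variables: the first summarise() drops the rightmost (gender)
per_cell <- toy |>
  group_by(condition, gender) |>
  summarise(n = n())
group_vars(per_cell)

# A second summarise() peels off the remaining level (condition)
per_condition <- per_cell |>
  summarise(n = sum(n))
group_vars(per_condition)
```

The first summarise() also prints a message about the grouped output; that message is informational, not an error.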
Warning
Most functions in dplyr are group-aware!
Remark
Ask explicitly to ungroup data
ungroup()
.groups argument to keep or drop groups.
You can inspect the grouping of complex objects or programmatically by the `` functions
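A short sketch of both ways to control grouping, on a hypothetical `toy` table. The slide does not say which inspection functions it means; group_vars() and is_grouped_df() are used here only as two dplyr helpers known to report grouping.

```r
library(dplyr)

# Hypothetical toy data standing in for `judgments`
toy <- tibble(
  condition = c("control", "stress", "stress", "control"),
  gender    = c("female", "female", "male", "male")
)

grouped <- group_by(toy, condition)
is_grouped_df(grouped)   # grouping persists on the object

# Explicitly remove the grouping
flat <- ungroup(grouped)
is_grouped_df(flat)

# Or decide at summary time what happens to the groups
dropped <- toy |>
  group_by(condition, gender) |>
  summarise(n = n(), .groups = "drop")   # result is ungrouped

kept <- toy |>
  group_by(condition, gender) |>
  summarise(n = n(), .groups = "keep")   # result keeps both grouping variables

group_vars(dropped)
group_vars(kept)
```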
arrange() can sort values by multiple columns
Warning
arrange() ignores groups!
judgments |> 
  mutate(mood_pre_cat = case_when(
    mood_pre < 25  ~ "poor", 
    mood_pre > 75 ~ "great",
    TRUE ~ "normal")) |> 
  group_by(mood_pre_cat) |> 
  arrange(desc(mood_post))|> 
  select(mood_pre_cat, mood_post) |> 
  distinct()
# A tibble: 67 × 2
# Groups:   mood_pre_cat [3]
   mood_pre_cat mood_post
   <chr>            <dbl>
 1 normal             100
 2 normal              99
 3 great               98
 4 normal              94
 5 great               91
 6 great               89
 7 great               85
 8 great               83
 9 normal              83
10 normal              82
# ℹ 57 more rows
arrange() and grouping
Solution
Use .by_group = TRUE
judgments |> 
  mutate(mood_pre_cat = case_when(
    mood_pre < 25  ~ "poor", 
    mood_pre > 75 ~ "great",
    TRUE ~ "normal")) |> 
  group_by(mood_pre_cat) |> 
  arrange(desc(mood_post), .by_group = TRUE) |>  
  select(mood_pre_cat, mood_post) |> 
  distinct()
# A tibble: 67 × 2
# Groups:   mood_pre_cat [3]
   mood_pre_cat mood_post
   <chr>            <dbl>
 1 great               98
 2 great               91
 3 great               89
 4 great               85
 5 great               83
 6 great               79
 7 great               75
 8 great               69
 9 great               62
10 great               59
# ℹ 57 more rows
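The difference is easier to see on a handful of rows. This is a minimal sketch on a hypothetical `toy` table, not the workshop data.

```r
library(dplyr)

# Hypothetical toy data standing in for `judgments`
toy <- tibble(
  condition = c("control", "stress", "stress", "control", "stress", "stress"),
  mood_pre  = c(50, 60, 70, 80, 40, 90)
)

grouped <- group_by(toy, condition)

# Default: groups are ignored, rows are sorted globally
ignoring <- arrange(grouped, desc(mood_pre))
ignoring$mood_pre

# .by_group = TRUE: sort by the grouping variable first, then within each group
within <- arrange(grouped, desc(mood_pre), .by_group = TRUE)
within$mood_pre
```

In `within`, the control rows come first (80, 50) and the stress rows follow (90, 70, 60, 40), whereas `ignoring` interleaves the two conditions.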
But, you are better off using:
range() returns min and max
summarise() duplicates keys
judgments |>
  group_by(condition, gender) |> 
  summarise(range = range(mood_pre, na.rm = TRUE),
            n = n())
Warning: Returning more (or less) than 1 row per
`summarise()` group was deprecated in dplyr
1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to
  `reframe()`, remember that `reframe()` always
  returns an ungrouped data frame and adjust
  accordingly.
`summarise()` has grouped output by 'condition',
'gender'. You can override using the `.groups`
argument.
# A tibble: 8 × 4
# Groups:   condition, gender [4]
  condition gender range     n
  <chr>     <chr>  <dbl> <int>
1 control   female    29    65
2 control   female    95    65
3 control   male      19    26
4 control   male     100    26
5 stress    female     9    82
6 stress    female    96    82
7 stress    male      18    15
8 stress    male      85    15
reframe()
# A tibble: 8 × 4
  condition gender range     n
  <chr>     <chr>  <dbl> <int>
1 control   female    29    65
2 control   female    95    65
3 control   male      19    26
4 control   male     100    26
5 stress    female     9    82
6 stress    female    96    82
7 stress    male      18    15
8 stress    male      85    15
Return values
| Function | Receives | Returns |
|---|---|---|
| mutate() | n rows (or 1) | n rows |
| summarize() | n groups | n groups |
| reframe() | n groups by k outputs | an ungrouped tibble |
reframe() is new in dplyr 1.1.0, for additional safety in programming with dplyr.
judgments |>
  filter(!is.na(mood_pre)) |> 
  group_by(condition, gender) |>
  reframe(
    quan = quantile(mood_pre,
                    c(0.25, 0.5, 0.75)),
    q = c(0.25, 0.5, 0.75),
    n = n())
# A tibble: 12 × 5
   condition gender  quan     q     n
   <chr>     <chr>  <dbl> <dbl> <int>
 1 control   female  52.8  0.25    64
 2 control   female  66    0.5     64
 3 control   female  78.2  0.75    64
 4 control   male    53.2  0.25    26
 5 control   male    65    0.5     26
 6 control   male    72.8  0.75    26
 7 stress    female  44    0.25    82
 8 stress    female  58.5  0.5     82
 9 stress    female  67    0.75    82
10 stress    male    45.5  0.25    15
11 stress    male    53    0.5     15
12 stress    male    69.5  0.75    15
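The three return shapes can be compared side by side. This is a sketch on a hypothetical `toy` table with six rows and two groups, and it assumes dplyr 1.1.0 or later for reframe().

```r
library(dplyr)

# Hypothetical toy data standing in for `judgments`
toy <- tibble(
  condition = c("control", "stress", "stress", "control", "stress", "stress"),
  mood_pre  = c(50, 60, 70, 80, 40, 90)
)
grouped <- group_by(toy, condition)

# mutate(): one output row per input row
mutated <- mutate(grouped, mean_mood = mean(mood_pre))
nrow(mutated)

# summarise(): one output row per group
summarised <- summarise(grouped, mean_mood = mean(mood_pre))
nrow(summarised)

# reframe(): any number of rows per group (here two: min and max),
# and the result is always ungrouped
reframed <- reframe(grouped, range = range(mood_pre))
nrow(reframed)
is_grouped_df(reframed)
```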
mutate()
# A tibble: 188 × 159
# Groups:   condition [2]
   subject condition     n start_date end_date 
     <dbl> <chr>     <int> <chr>      <chr>    
 1       2 control      91 11/3/2014  11/3/2014
 2       1 stress       97 11/3/2014  11/3/2014
 3       3 stress       97 11/3/2014  11/3/2014
 4       4 stress       97 11/3/2014  11/3/2014
 5       7 control      91 11/3/2014  11/3/2014
 6       6 stress       97 11/3/2014  11/3/2014
 7       5 control      91 11/3/2014  11/3/2014
 8       9 control      91 11/3/2014  11/3/2014
 9      16 stress       97 11/3/2014  11/3/2014
10      13 stress       97 11/3/2014  11/3/2014
# ℹ 178 more rows
# ℹ 154 more variables: finished <dbl>,
#   gender <chr>, age <dbl>, mood_pre <dbl>,
#   mood_post <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>,
#   STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
#   STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, …
dplyr 1.1.0 introduced the .by argument
summarize(), most useful with mutate()
ungroup()
# A tibble: 188 × 159
   subject condition     n start_date end_date 
     <dbl> <chr>     <int> <chr>      <chr>    
 1       2 control      91 11/3/2014  11/3/2014
 2       1 stress       97 11/3/2014  11/3/2014
 3       3 stress       97 11/3/2014  11/3/2014
 4       4 stress       97 11/3/2014  11/3/2014
 5       7 control      91 11/3/2014  11/3/2014
 6       6 stress       97 11/3/2014  11/3/2014
 7       5 control      91 11/3/2014  11/3/2014
 8       9 control      91 11/3/2014  11/3/2014
 9      16 stress       97 11/3/2014  11/3/2014
10      13 stress       97 11/3/2014  11/3/2014
# ℹ 178 more rows
# ℹ 154 more variables: finished <dbl>,
#   gender <chr>, age <dbl>, mood_pre <dbl>,
#   mood_post <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>,
#   STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
#   STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, …
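A minimal sketch of temporary grouping with .by, on a hypothetical `toy` table and assuming dplyr 1.1.0 or later. The grouping applies only inside the one call, so no ungroup() is needed afterwards.

```r
library(dplyr)

# Hypothetical toy data standing in for `judgments`
toy <- tibble(
  subject   = 1:6,
  condition = c("control", "stress", "stress", "control", "stress", "stress")
)

# Temporary grouping: the result is already ungrouped
with_by <- toy |>
  mutate(n = n(), .by = condition)

# Persistent grouping: same values, but ungroup() must be called explicitly
with_group <- toy |>
  group_by(condition) |>
  mutate(n = n()) |>
  ungroup()

is_grouped_df(with_by)
identical(with_by$n, with_group$n)
```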
Most commonly used - 80%
select() - columns
filter() - rows meeting condition
arrange() - sort
glimpse() - inspect
rename() - change column name
relocate() - move columns
mutate() - create columns
case_when() simplifies if/else/if/else
across(), c_across() - work on >1 column
group_by(), ungroup(), rowwise()
summarise() - group-wise summaries
Source: Lise Vaudor blog
Comments
tidyr and dplyr are replacing the reshape and reshape2 packages
vignette("tidy-data")
Use judgments to compute basic statistics for all moral dilemma columns considering the conditions:
med for median().
judgments_condition_stats to the results.
In judgments:
stress group.
stress or control appear together).
judgments |> 
  group_by(condition) |> 
  summarize(across(starts_with("moral_dilemma"),
                   list(
                     mean = mean,
                     sd = sd,
                     med = median ,
                     min = min,
                     max = max
                   ))) -> judgments_condition_stats
judgments_condition_stats
# A tibble: 2 × 36
  condition moral_dilemma_dog_mean
  <chr>                      <dbl>
1 control                     7.24
2 stress                      7.45
# ℹ 34 more variables:
#   moral_dilemma_dog_sd <dbl>,
#   moral_dilemma_dog_med <dbl>,
#   moral_dilemma_dog_min <dbl>,
#   moral_dilemma_dog_max <dbl>,
#   moral_dilemma_wallet_mean <dbl>,
#   moral_dilemma_wallet_sd <dbl>, …
judgments |> 
  group_by( condition, gender, age) |> 
  summarize(n = n()) |> 
  arrange(desc(n), .by_group = TRUE) |> 
  ungroup()
# A tibble: 33 × 4
   condition gender   age     n
   <chr>     <chr>  <dbl> <int>
 1 control   female    18    25
 2 control   female    19    17
 3 control   female    21     7
 4 control   female    20     4
 5 control   female    22     4
 6 control   female    23     3
 7 control   female    17     2
 8 control   female    24     2
 9 control   female    26     1
10 control   male      18     7
# ℹ 23 more rows
You learned to:
Next step: joining tables
Further reading
Chapter 3 - Data transformation
Acknowledgments
dplyr in base only