7.4 summarise & group_by

summarise collapses a data frame to a 1-row tibble (base R equivalent of aggregate())

  • Get average length of terms:
summarise(presidential2, 
          mean(duration_days))
## # A tibble: 1 x 1
##   `mean(duration_days)`
##   <drtn>               
## 1 2125.091 days
  • Get average length of terms + count of the total of entries:
summarise(presidential2,
          mean(duration_days), 
          n())
## # A tibble: 1 x 2
##   `mean(duration_days)` `n()`
##   <drtn>                <int>
## 1 2125.091 days            11
  • You can also give a name to each of the calculations you produce with summarise (and add more calculations!):
summarise(presidential2,
          mean_term=mean(duration_days), 
          min_term=min(duration_days),
          max_term=max(duration_days),
          count_presidents=n())
## # A tibble: 1 x 4
##   mean_term     min_term max_term  count_presidents
##   <drtn>        <drtn>   <drtn>               <int>
## 1 2125.091 days 895 days 2922 days               11
  • You can combine summarise with group_by to get, for example, the number of presidents per political party:

    • group_by defines a grouping based on existing variables.
    • summarise then processes the command based on the grouping
# group based on the "party" column (that contains "Democratic" or "Republican")
groups <- group_by(presidential2,
                   party)

# count the number of presidents per party:
summarise(groups, 
          n())
## # A tibble: 2 x 2
##   party      `n()`
## * <chr>      <int>
## 1 Democratic     5
## 2 Republican     6
# One line, using the %>% operator:
group_by(presidential2, 
         party) %>% 
      summarise(n())
## # A tibble: 2 x 2
##   party      `n()`
## * <chr>      <int>
## 1 Democratic     5
## 2 Republican     6

Note: the row above is equivalent to using count (a wrapper): count(presidential, party)

count(presidential2, party)
## # A tibble: 2 x 2
##   party          n
## * <chr>      <int>
## 1 Democratic     5
## 2 Republican     6
  • Use the same structure to calculate the average length of terms per political party:
group_by(presidential2, 
         party) %>% 
  summarise(mean(duration_days))
## # A tibble: 2 x 2
##   party      `mean(duration_days)`
## * <chr>      <drtn>               
## 1 Democratic 2045.4 days          
## 2 Republican 2191.5 days
  • Note that you can group using more than one variable:
group_by(starwars, 
         species, hair_color, gender) %>%
      summarise(n())
  • Grouping variables also influences how other dplyr functions work!

For example, let’s group our starwars characters by both species and gender variables:

sw_sg <- group_by(starwars, 
                  species, gender) 

We can then use slice_max (function that retrieves the row that contains the maximum value in a selected variable) to retrieve the character with the maximum height:

sw_sg %>% 
  select(name, species, gender, height) %>% # columns selection just to make the output more readable
  slice_max(height)
## # A tibble: 43 x 4
## # Groups:   species, gender [42]
##    name                  species   gender    height
##    <chr>                 <chr>     <chr>      <int>
##  1 Ratts Tyerell         Aleena    masculine     79
##  2 Dexter Jettster       Besalisk  masculine    198
##  3 Ki-Adi-Mundi          Cerean    masculine    198
##  4 Mas Amedda            Chagrian  masculine    196
##  5 Zam Wesell            Clawdite  feminine     168
##  6 R4-P17                Droid     feminine      96
##  7 IG-88                 Droid     masculine    200
##  8 Sebulba               Dug       masculine    112
##  9 Wicket Systri Warrick Ewok      masculine     88
## 10 Poggle the Lesser     Geonosian masculine    183
## # … with 33 more rows

We get one entry for each unique combination of species and gender.
If you query the maximum height on an non-grouped tibble, you will get only one row (maximum height overall):

starwars %>% 
  select(name, species, gender, height) %>%
  slice_max(height)
## # A tibble: 1 x 4
##   name        species  gender    height
##   <chr>       <chr>    <chr>      <int>
## 1 Yarael Poof Quermian masculine    264
  • Another example is that you can create new columns (with mutate) based on the grouping.

    • Here we are grouping the data by species and we add a column average_height_species that describes the average height.
    • As the data is grouped by species, we will get the average height PER SPECIES:
starwars %>% group_by(species) %>% 
  select(name, species, height, mass) %>% 
  mutate(average_height_species=mean(height, na.rm=TRUE))
## # A tibble: 87 x 5
## # Groups:   species [38]
##    name               species height  mass average_height_species
##    <chr>              <chr>    <int> <dbl>                  <dbl>
##  1 Luke Skywalker     Human      172    77                   177.
##  2 C-3PO              Droid      167    75                   131.
##  3 R2-D2              Droid       96    32                   131.
##  4 Darth Vader        Human      202   136                   177.
##  5 Leia Organa        Human      150    49                   177.
##  6 Owen Lars          Human      178   120                   177.
##  7 Beru Whitesun lars Human      165    75                   177.
##  8 R5-D4              Droid       97    32                   131.
##  9 Biggs Darklighter  Human      183    84                   177.
## 10 Obi-Wan Kenobi     Human      182    77                   177.
## # … with 77 more rows

HANDS-ON

Back to our starwarsBMI tibble:

  • Count the number of occurrences of each hair color per gender.
  • Count the average BMI per species. Add a count of the number of individuals per species.
Answer
# Count the number of occurrences of each `hair color` per `gender`
starwarsBMI %>% 
  group_by(gender, hair_color) %>% 
  summarise(mycounts=n())

# Count the average **BMI** per `species`.  Add a count of the number of individuals per species.
starwarsBMI %>% 
  group_by(species) %>% 
  summarise(average_bmi=mean(BMI, na.rm=TRUE), count_individuals=n())

# Also report the number of individuals for which BMI is NOT NA (i.e. the actual number of individuals for which the average was computed)
starwarsBMI %>%
  group_by(species) %>% 
  summarise(average_bmi=mean(BMI, na.rm=TRUE), count_all_individuals=n(), count_individuals_non_NA=sum(!is.na(BMI)))

7.4.1 ungroup

When you are grouping variables with group_by, the tibble will keep the grouping until you ungroup!


While this is not an issue when you are summarizing (you get a summary table), it can be useful in case you are using the grouping - for example - to create a new column.

In the example stated before, we created a new column average_height_species that contains the average height per species:

starwars %>% group_by(species) %>% 
  select(name, species, height, mass) %>% 
  mutate(average_height_species=mean(height, na.rm=TRUE)) %>% head(n=2)
## # A tibble: 2 x 5
## # Groups:   species [2]
##   name           species height  mass average_height_species
##   <chr>          <chr>    <int> <dbl>                  <dbl>
## 1 Luke Skywalker Human      172    77                   177.
## 2 C-3PO          Droid      167    75                   131.

What if we also want to add a column that describes the average mass of ALL individuals (regardless of the species)?

# With the current grouping (species), you get the average mass calculated per species
starwars %>% group_by(species) %>% 
  select(name, species, height, mass) %>% 
  mutate(average_height_species=mean(height, na.rm=TRUE)) %>%
  mutate(average_mass=mean(mass, na.rm=TRUE)) %>% head(n=2)
## # A tibble: 2 x 6
## # Groups:   species [2]
##   name           species height  mass average_height_species average_mass
##   <chr>          <chr>    <int> <dbl>                  <dbl>        <dbl>
## 1 Luke Skywalker Human      172    77                   177.         82.8
## 2 C-3PO          Droid      167    75                   131.         69.8

Column average_mass still contains the average per species!


* We need to ungroup the tibble before creating this new column!!

# ungroup first, and you get the average mass calculated for the whole tibble
starwars %>% group_by(species) %>% 
  select(name, species, height, mass) %>% 
  mutate(average_height_species=mean(height, na.rm=TRUE)) %>%
  ungroup %>%
  mutate(average_mass=mean(mass, na.rm=TRUE)) %>% head(n=2)
## # A tibble: 2 x 6
##   name           species height  mass average_height_species average_mass
##   <chr>          <chr>    <int> <dbl>                  <dbl>        <dbl>
## 1 Luke Skywalker Human      172    77                   177.         97.3
## 2 C-3PO          Droid      167    75                   131.         97.3