7.4 summarise & group_by
summarise
collapses a data frame to a 1-row tibble (base R equivalent of aggregate())
- Get average length of terms:
summarise(presidential2,
mean(duration_days))
## # A tibble: 1 x 1
## `mean(duration_days)`
## <drtn>
## 1 2125.091 days
- Get average length of terms + count of the total of entries:
summarise(presidential2,
mean(duration_days),
n())
## # A tibble: 1 x 2
## `mean(duration_days)` `n()`
## <drtn> <int>
## 1 2125.091 days 11
- You can also give a name to each of the calculations you produce with
summarise
(and add more calculations!):
summarise(presidential2,
mean_term=mean(duration_days),
min_term=min(duration_days),
max_term=max(duration_days),
count_presidents=n())
## # A tibble: 1 x 4
## mean_term min_term max_term count_presidents
## <drtn> <drtn> <drtn> <int>
## 1 2125.091 days 895 days 2922 days 11
You can combine
summarise
withgroup_by
to get, for example, the number of presidents per political party:group_by
defines a grouping based on existing variables.summarise
then processes the command based on the grouping
# group based on the "party" column (that contains "Democratic" or "Republican")
<- group_by(presidential2,
groups
party)
# count the number of presidents per party:
summarise(groups,
n())
## # A tibble: 2 x 2
## party `n()`
## * <chr> <int>
## 1 Democratic 5
## 2 Republican 6
# One line, using the %>% operator:
group_by(presidential2,
%>%
party) summarise(n())
## # A tibble: 2 x 2
## party `n()`
## * <chr> <int>
## 1 Democratic 5
## 2 Republican 6
Note: the row above is equivalent to using
count
(a wrapper):count(presidential, party)
count(presidential2, party)
## # A tibble: 2 x 2
## party n
## * <chr> <int>
## 1 Democratic 5
## 2 Republican 6
- Use the same structure to calculate the average length of terms per political party:
group_by(presidential2,
%>%
party) summarise(mean(duration_days))
## # A tibble: 2 x 2
## party `mean(duration_days)`
## * <chr> <drtn>
## 1 Democratic 2045.4 days
## 2 Republican 2191.5 days
- Note that you can group using more than one variable:
group_by(starwars,
%>%
species, hair_color, gender) summarise(n())
- Grouping variables also influences how other
dplyr
functions work!
For example, let’s group our starwars
characters by both species and gender variables:
<- group_by(starwars,
sw_sg species, gender)
We can then use slice_max
(function that retrieves the row that contains the maximum value in a selected variable) to retrieve the character with the maximum height
:
%>%
sw_sg select(name, species, gender, height) %>% # columns selection just to make the output more readable
slice_max(height)
## # A tibble: 43 x 4
## # Groups: species, gender [42]
## name species gender height
## <chr> <chr> <chr> <int>
## 1 Ratts Tyerell Aleena masculine 79
## 2 Dexter Jettster Besalisk masculine 198
## 3 Ki-Adi-Mundi Cerean masculine 198
## 4 Mas Amedda Chagrian masculine 196
## 5 Zam Wesell Clawdite feminine 168
## 6 R4-P17 Droid feminine 96
## 7 IG-88 Droid masculine 200
## 8 Sebulba Dug masculine 112
## 9 Wicket Systri Warrick Ewok masculine 88
## 10 Poggle the Lesser Geonosian masculine 183
## # … with 33 more rows
We get one entry for each unique combination of species and gender.
If you query the maximum height on an non-grouped tibble, you will get only one row (maximum height overall):
%>%
starwars select(name, species, gender, height) %>%
slice_max(height)
## # A tibble: 1 x 4
## name species gender height
## <chr> <chr> <chr> <int>
## 1 Yarael Poof Quermian masculine 264
Another example is that you can create new columns (with
mutate
) based on the grouping.- Here we are grouping the data by
species
and we add a columnaverage_height_species
that describes the average height. - As the data is grouped by
species
, we will get the average height PER SPECIES:
- Here we are grouping the data by
%>% group_by(species) %>%
starwars select(name, species, height, mass) %>%
mutate(average_height_species=mean(height, na.rm=TRUE))
## # A tibble: 87 x 5
## # Groups: species [38]
## name species height mass average_height_species
## <chr> <chr> <int> <dbl> <dbl>
## 1 Luke Skywalker Human 172 77 177.
## 2 C-3PO Droid 167 75 131.
## 3 R2-D2 Droid 96 32 131.
## 4 Darth Vader Human 202 136 177.
## 5 Leia Organa Human 150 49 177.
## 6 Owen Lars Human 178 120 177.
## 7 Beru Whitesun lars Human 165 75 177.
## 8 R5-D4 Droid 97 32 131.
## 9 Biggs Darklighter Human 183 84 177.
## 10 Obi-Wan Kenobi Human 182 77 177.
## # … with 77 more rows
HANDS-ON
Back to our starwarsBMI
tibble:
- Count the number of occurrences of each
hair color
pergender
. - Count the average BMI per
species
. Add a count of the number of individuals per species.
Answer
# Count the number of occurrences of each `hair color` per `gender`
%>%
starwarsBMI group_by(gender, hair_color) %>%
summarise(mycounts=n())
# Count the average **BMI** per `species`. Add a count of the number of individuals per species.
%>%
starwarsBMI group_by(species) %>%
summarise(average_bmi=mean(BMI, na.rm=TRUE), count_individuals=n())
# Also report the number of individuals for which BMI is NOT NA (i.e. the actual number of individuals for which the average was computed)
%>%
starwarsBMI group_by(species) %>%
summarise(average_bmi=mean(BMI, na.rm=TRUE), count_all_individuals=n(), count_individuals_non_NA=sum(!is.na(BMI)))
7.4.1 ungroup
When you are grouping variables with group_by
, the tibble will keep the grouping until you ungroup!
While this is not an issue when you are summarizing (you get a summary table), it can be useful in case you are using the grouping - for example - to create a new column.
In the example stated before, we created a new column average_height_species
that contains the average height per species:
%>% group_by(species) %>%
starwars select(name, species, height, mass) %>%
mutate(average_height_species=mean(height, na.rm=TRUE)) %>% head(n=2)
## # A tibble: 2 x 5
## # Groups: species [2]
## name species height mass average_height_species
## <chr> <chr> <int> <dbl> <dbl>
## 1 Luke Skywalker Human 172 77 177.
## 2 C-3PO Droid 167 75 131.
What if we also want to add a column that describes the average mass of ALL individuals (regardless of the species)?
# With the current grouping (species), you get the average mass calculated per species
%>% group_by(species) %>%
starwars select(name, species, height, mass) %>%
mutate(average_height_species=mean(height, na.rm=TRUE)) %>%
mutate(average_mass=mean(mass, na.rm=TRUE)) %>% head(n=2)
## # A tibble: 2 x 6
## # Groups: species [2]
## name species height mass average_height_species average_mass
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Luke Skywalker Human 172 77 177. 82.8
## 2 C-3PO Droid 167 75 131. 69.8
Column average_mass
still contains the average per species!
* We need to ungroup the tibble before creating this new column!!
# ungroup first, and you get the average mass calculated for the whole tibble
%>% group_by(species) %>%
starwars select(name, species, height, mass) %>%
mutate(average_height_species=mean(height, na.rm=TRUE)) %>%
%>%
ungroup mutate(average_mass=mean(mass, na.rm=TRUE)) %>% head(n=2)
## # A tibble: 2 x 6
## name species height mass average_height_species average_mass
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Luke Skywalker Human 172 77 177. 97.3
## 2 C-3PO Droid 167 75 131. 97.3