Part 7 Data manipulation

The {dplyr} package introduces a grammar of data manipulation. See the nice cheat sheet.

We will first introduce the 5 intuitively-named key functions from {dplyr}:

Table 7.1: the 5 core dplyr functions
mutate     adds new variables (columns) that are functions of existing variables.
select     picks variables (columns) based on their names.
filter     picks observations (rows) based on their values.
summarise  collapses multiple values down to a single summary.
arrange    changes the ordering of the rows.
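As a minimal sketch of what each verb in Table 7.1 does, the calls below apply the five functions to the presidential data set from {ggplot2}, which is described later in this part. The column names `days` and `mean_days`, the object names `pres` and `pres_days`, and the filter condition are illustrative choices, not part of the original text.

```r
library(dplyr)

# presidential ships with ggplot2; :: accesses it without attaching the package
pres <- ggplot2::presidential

# mutate: add a column computed from existing ones (length in office, in days)
pres_days <- mutate(pres, days = as.numeric(end - start))

# select: keep columns by name
select(pres, name, party)

# filter: keep the rows that satisfy a condition
filter(pres, party == "Democratic")

# summarise: collapse all rows into a single summary row
summarise(pres_days, mean_days = mean(days))

# arrange: reorder the rows, here by increasing length in office
arrange(pres_days, days)
```

Each call takes the data as its first argument and refers to columns by their bare names.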

All 5 functions work in a similar and consistent way:

  • The first argument is the input: a data frame or a tibble.
  • The output is a new object of the same kind as the input: a tibble if the input was a tibble.

Note that {dplyr} never modifies the input: to keep the result, you need to assign the output to a new object or back to the same one.
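The point about assignment can be sketched as follows, using the presidential data set from {ggplot2}. The object names `pres` and `pres_new` and the column `start_year` are hypothetical, chosen only for illustration.

```r
library(dplyr)

pres <- ggplot2::presidential

# Without an assignment the result is only printed; pres itself is unchanged
mutate(pres, start_year = as.integer(format(start, "%Y")))
"start_year" %in% names(pres)   # FALSE

# Assign to a new object to keep the result ...
pres_new <- mutate(pres, start_year = as.integer(format(start, "%Y")))

# ... or assign back to the same name to overwrite the input
pres <- mutate(pres, start_year = as.integer(format(start, "%Y")))
"start_year" %in% names(pres)   # TRUE
```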

We will use the presidential data set from the {ggplot2} package. It describes the terms in office of US presidents, from Eisenhower to Obama, using four variables:

  • Name
  • Term starting date
  • Term ending date
  • Political party

Table 7.2: presidential data set
name        start       end         party
Eisenhower  1953-01-20  1961-01-20  Republican
Kennedy     1961-01-20  1963-11-22  Democratic
Johnson     1963-11-22  1969-01-20  Democratic
Nixon       1969-01-20  1974-08-09  Republican
Ford        1974-08-09  1977-01-20  Republican
Carter      1977-01-20  1981-01-20  Democratic
Reagan      1981-01-20  1989-01-20  Republican
Bush        1989-01-20  1993-01-20  Republican
Clinton     1993-01-20  2001-01-20  Democratic
Bush        2001-01-20  2009-01-20  Republican
Obama       2009-01-20  2017-01-20  Democratic
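To follow along, the data set can be loaded and inspected roughly as below. This is a sketch that assumes {dplyr} and {ggplot2} are installed; the object name `pres` is arbitrary, and depending on the installed {ggplot2} version the data set may contain more rows than Table 7.2.

```r
library(dplyr)

# The data set ships with ggplot2; :: accesses it without attaching the package
pres <- ggplot2::presidential

# glimpse() lists every column with its type and first values
glimpse(pres)

# The column names match the header of Table 7.2
names(pres)
```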