Part 7 Data manipulation
The {dplyr} package introduces a grammar of data manipulation. See the {dplyr} cheat sheet for a quick reference.
We will first introduce the 5 intuitively-named key functions from {dplyr}:
| name | what it does |
|---|---|
| `mutate` | adds new variables (columns) that are functions of existing variables. |
| `select` | picks variables (columns) based on their names. |
| `filter` | picks observations (rows) based on their values. |
| `summarise` | collapses multiple values down to a single summary. |
| `arrange` | changes the ordering of the rows. |
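As a rough illustration of the five verbs listed above, here is a minimal sketch on a small made-up table. The `scores` data and its column names are hypothetical and serve only to show the shape of each call.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
scores <- tibble(
  student = c("Ann", "Bob", "Cid", "Dee"),
  group   = c("a", "a", "b", "b"),
  points  = c(12, 18, 9, 15)
)

# mutate: add a column computed from an existing one
with_pct <- mutate(scores, pct = points / 20 * 100)

# select: keep only the named columns
two_cols <- select(scores, student, points)

# filter: keep only the rows that satisfy a condition
high <- filter(scores, points >= 15)

# summarise: collapse all rows into a single summary row
overall <- summarise(scores, mean_points = mean(points))

# arrange: reorder the rows, here by decreasing points
ranked <- arrange(scores, desc(points))
```

In every call the data comes first and the remaining arguments describe what to do with its columns.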
All 5 functions work in a similar and consistent way:
- The first argument is the input: a `data frame` or a `tibble`.
- The output is a new object of the same type as the input: a `tibble` if the input is a `tibble`.
Note that {dplyr} never modifies the input: to keep the result, you need to assign the output to a new object or back to the same one.
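The point about the input being left untouched can be sketched as follows. The data frame `df` and its column `x` are hypothetical, chosen only to keep the example short.

```r
library(dplyr)

# Hypothetical toy data frame
df <- data.frame(x = 1:3)

# The result is printed but not saved anywhere
mutate(df, y = x * 2)

# df itself still has only the column x
unchanged <- names(df)

# Save the result in a new object ...
df2 <- mutate(df, y = x * 2)

# ... or overwrite the input with the result
df <- mutate(df, y = x * 2)
```

Overwriting the input is convenient but loses the original, so a new name is often the safer choice while exploring.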
We will use the presidential data set from the ggplot2 package.
It contains one row per presidential term in the USA, from Eisenhower to Obama, with four variables:
- Name (`name`)
- Term starting date (`start`)
- Term ending date (`end`)
- Political party (`party`)
| name | start | end | party |
|---|---|---|---|
| Eisenhower | 1953-01-20 | 1961-01-20 | Republican |
| Kennedy | 1961-01-20 | 1963-11-22 | Democratic |
| Johnson | 1963-11-22 | 1969-01-20 | Democratic |
| Nixon | 1969-01-20 | 1974-08-09 | Republican |
| Ford | 1974-08-09 | 1977-01-20 | Republican |
| Carter | 1977-01-20 | 1981-01-20 | Democratic |
| Reagan | 1981-01-20 | 1989-01-20 | Republican |
| Bush | 1989-01-20 | 1993-01-20 | Republican |
| Clinton | 1993-01-20 | 2001-01-20 | Democratic |
| Bush | 2001-01-20 | 2009-01-20 | Republican |
| Obama | 2009-01-20 | 2017-01-20 | Democratic |
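A minimal sketch of how the table above could be loaded and inspected, assuming the {ggplot2} package is installed. Note that newer {ggplot2} releases may list more terms than the rows shown here.

```r
library(dplyr)

# Copy the data set shipped with ggplot2 into the workspace
presidential <- ggplot2::presidential

# Compact overview: one line per column, with its type and first values
glimpse(presidential)

# Column names, which should match the table header
names(presidential)
```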