5.2 Separate & Unite
Let’s work on the untidy tibble table5
, a built-in data set from the tidyr
package (table5 displays the number of tuberculosis cases documented by the WHO in selected countries):
country | century | year | rate |
---|---|---|---|
Afghanistan | 19 | 99 | 745/19987071 |
Afghanistan | 20 | 00 | 2666/20595360 |
Brazil | 19 | 99 | 37737/172006362 |
Brazil | 20 | 00 | 80488/174504898 |
China | 19 | 99 | 212258/1272915272 |
China | 20 | 00 | 213766/1280428583 |
5.2.1 Separate
Function separate
separates (indeed!) a column into 2 or more columns, given a specified field separator: separate(data, col, into, sep)
# separate column "rate" into 2 columns that we name cases and population, based on the "/" separator.
<- separate(data=table5,
table5_1 col=rate,
into=c("cases", "population"),
sep="/"
)
table5_1
## # A tibble: 6 x 5
## country century year cases population
## <chr> <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745 19987071
## 2 Afghanistan 20 00 2666 20595360
## 3 Brazil 19 99 37737 172006362
## 4 Brazil 20 00 80488 174504898
## 5 China 19 99 212258 1272915272
## 6 China 20 00 213766 1280428583
The default field separator is a regular expression that matches any sequence of non-alphanumeric values.
So this would work well in our example:
separate(data=table5,
col=rate,
into=c("cases", "population")
)
## # A tibble: 6 x 5
## country century year cases population
## <chr> <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745 19987071
## 2 Afghanistan 20 00 2666 20595360
## 3 Brazil 19 99 37737 172006362
## 4 Brazil 20 00 80488 174504898
## 5 China 19 99 212258 1272915272
## 6 China 20 00 213766 1280428583
It is good practice to explicitly specify the field separator, so as not to get any bad surprise.
5.2.2 Unite
Function unite
does just the opposite: it will unite/stick together 2 or (more) columns: unite(data, col, column names, sep)
.
# unite columns "century" and "year"
<- unite(data=table5_1,
table5a_2 col=year,
c("century", "year"),
sep="")
table5a_2
## # A tibble: 6 x 4
## country year cases population
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
HANDS-ON
Fix the following untidy tibble
:
<- tibble(
patients ID1=c("A", "A", "B", "B"),
ID2=c(12, 13, 14, 15),
age=c( 21, 32, 25, 43),
c("188_93", "167_55", "155_51", "175_72")
)
- Separate the two variables that are found in the last column into columns height and weight: save in object
patients2
. - Unite columns ID1 and ID2 so as to create a unique ID per patient: save in
patients3
.
Answer
# Separate the two variables that are found in the last column.
<- separate(data=patients,
patients2 col=4,
into=c("height", "weight"),
sep="_")
# Unite columns ID1 and ID2 so as to create a unique ID per patient.
<- unite(data=patients2,
patients3 col="ID",
c("ID1", "ID2"),
sep="")
- Is there a problem with
patients3
? Display the tibble in the R console…
Columns height
and weight
are treated as characters (because they originally contained underscores "_" !).
We can convert them into numbers with the convert=TRUE
parameter in separate
:
# convert=TRUE: runs `type.convert` fonction (that converts data to appropriate type)
<- separate(data=patients,
patients2 col=4,
into=c("height", "weight"),
sep="_",
convert=TRUE)