5.2 Separate & Unite

Let’s work on the untidy tibble table5, a built-in data set from the tidyr package (table5 displays the number of tuberculosis cases documented by the WHO in selected countries):

Table 5.1: table5
country century year rate
Afghanistan 19 99 745/19987071
Afghanistan 20 00 2666/20595360
Brazil 19 99 37737/172006362
Brazil 20 00 80488/174504898
China 19 99 212258/1272915272
China 20 00 213766/1280428583

5.2.1 Separate

Function separate separates (indeed!) a column into 2 or more columns, given a specified field separator: separate(data, col, into, sep)

# separate column "rate" into 2 columns that we name cases and population, based on the "/" separator.
table5_1 <- separate(data=table5, 
         col=rate, 
         into=c("cases", "population"),
         sep="/"
         )

table5_1
## # A tibble: 6 x 5
##   country     century year  cases  population
##   <chr>       <chr>   <chr> <chr>  <chr>     
## 1 Afghanistan 19      99    745    19987071  
## 2 Afghanistan 20      00    2666   20595360  
## 3 Brazil      19      99    37737  172006362 
## 4 Brazil      20      00    80488  174504898 
## 5 China       19      99    212258 1272915272
## 6 China       20      00    213766 1280428583

The default field separator is a regular expression that matches any sequence of non-alphanumeric values.
So this would work well in our example:

separate(data=table5, 
         col=rate, 
         into=c("cases", "population")
         )
## # A tibble: 6 x 5
##   country     century year  cases  population
##   <chr>       <chr>   <chr> <chr>  <chr>     
## 1 Afghanistan 19      99    745    19987071  
## 2 Afghanistan 20      00    2666   20595360  
## 3 Brazil      19      99    37737  172006362 
## 4 Brazil      20      00    80488  174504898 
## 5 China       19      99    212258 1272915272
## 6 China       20      00    213766 1280428583

It is good practice to explicitly specify the field separator, so as not to get any bad surprise.

5.2.2 Unite

Function unite does just the opposite: it will unite/stick together 2 or (more) columns: unite(data, col, column names, sep).

# unite columns "century" and "year"
table5a_2 <- unite(data=table5_1,
                col=year,
                c("century", "year"), 
                sep="")

table5a_2
## # A tibble: 6 x 4
##   country     year  cases  population
##   <chr>       <chr> <chr>  <chr>     
## 1 Afghanistan 1999  745    19987071  
## 2 Afghanistan 2000  2666   20595360  
## 3 Brazil      1999  37737  172006362 
## 4 Brazil      2000  80488  174504898 
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583

HANDS-ON

Fix the following untidy tibble:

patients <- tibble(
  ID1=c("A", "A", "B", "B"),
  ID2=c(12, 13, 14, 15),
  age=c( 21,  32, 25, 43),
  c("188_93", "167_55", "155_51", "175_72")
)
  • Separate the two variables that are found in the last column into columns height and weight: save in object patients2.
  • Unite columns ID1 and ID2 so as to create a unique ID per patient: save in patients3.
Answer
# Separate the two variables that are found in the last column.
patients2 <- separate(data=patients, 
         col=4,
         into=c("height", "weight"),
         sep="_")

# Unite columns ID1 and ID2 so as to create a unique ID per patient.
patients3 <- unite(data=patients2,
      col="ID",
      c("ID1", "ID2"),
      sep="")
  • Is there a problem with patients3? Display the tibble in the R console…

Columns height and weight are treated as characters (because they originally contained underscores "_" !).
We can convert them into numbers with the convert=TRUE parameter in separate:

# convert=TRUE: runs `type.convert` fonction (that converts data to appropriate type)
patients2 <- separate(data=patients, 
                      col=4,
                      into=c("height", "weight"),
                      sep="_", 
                      convert=TRUE)