5.3 Pivoting data
The examples given before were obviously untidy: two values in one cell, date split in two different columns.
It will happen that the format of the data you are usually working with is NOT tidy.
Example: here we create a tibble that contains the expression of 4 genes in 4 samples: the gene expression is measured at two time points and for two experimental conditions:
<- tibble(gene=c("Fgfrl1", "Usp34", "Mlf2", "Gga1"),
df_expr KO_t0=1:4,
KO_t10=5:2,
WT_t0=6:3,
WT_t10=0:3
)
Why this tibble can’t be considered tidy
?
* One variable (expression) is spread across 4 columns.
In order to fix this, we can use the pivot_longer
function: we will increase the number of rows and decrease the number of columns:
pivot_longer(df_expr,
cols=c("KO_t0", "KO_t10", "WT_t0", "WT_t10")
)
## # A tibble: 16 x 3
## gene name value
## <chr> <chr> <int>
## 1 Fgfrl1 KO_t0 1
## 2 Fgfrl1 KO_t10 5
## 3 Fgfrl1 WT_t0 6
## 4 Fgfrl1 WT_t10 0
## 5 Usp34 KO_t0 2
## 6 Usp34 KO_t10 4
## 7 Usp34 WT_t0 5
## 8 Usp34 WT_t10 1
## 9 Mlf2 KO_t0 3
## 10 Mlf2 KO_t10 3
## 11 Mlf2 WT_t0 4
## 12 Mlf2 WT_t10 2
## 13 Gga1 KO_t0 4
## 14 Gga1 KO_t10 2
## 15 Gga1 WT_t0 3
## 16 Gga1 WT_t10 3
Now only one column - value
- contains the expression variable.
We can change the name of the newly created columns with parameters values_to
(where the values are stored) and names_to
(where the former column names are stored):
<- pivot_longer(df_expr,
expr_long cols=c("KO_t0", "KO_t10", "WT_t0", "WT_t10"),
values_to="expression",
names_to="sample"
)
Is there anything else that we could do here, to make the data even more tidy?
- There are 2 variables in column
sample
: samples experimental group (KO or WT) and time point (t0 and t10): we canseparate
them.
# we can add "remove=FALSE" so as to keep the original column (sample name), that can be useful in this case:
<- separate(expr_long,
expr_tidy col=sample,
into=c("group", "time"),
sep="_",
remove=FALSE
)
The pivot_wider
function does just the opposite: when an observation is scattered across multiple rows, you can move it back to a column.
Let’s take the following example, where expression of 2 genes is measured under two conditions (KO and WT) using two techniques (RNA-seq and qPCR):
<- tibble(gene=c(rep("Fgfrl1", 4), rep("Mlf2", 4)),
df_expr2 group=rep(rep(c("KO", "WT"), each=2), 2),
type=rep(c("counts_rnaseq", "Ct_qpcr"), 4),
value=c(10, 25, 8, 22, 7, 23, 11, 28)
)
Here, it would make sense to have a column counts_rnaseq
and a column Ct_qpcr
, which would contain their corresponding values:
<- pivot_wider(df_expr2,
df_expr2_wider names_from=type,
values_from=value)
HANDS-ON
Make the following data frame tidy!
<- tibble(Name=c("Anna", "Jordi", "Laura", "Miquel"),
df_w Weight_day0=c(67, 72, 62, 85),
Weight_day10=c(65, 73, 66, 80),
)
Answer
# `pivot_longer` on "Weight" columns:
<- pivot_longer(df_w,
df_w2 cols=c("Weight_day0", "Weight_day10"),
values_to="weight",
names_to="day"
)
# `separate` on the "day" column
# into=c(NA, "day"): the first output is ignored (in the), the "Weight" strings
<- separate(df_w2, col = "day",
df_w3 sep="_",
into=c(NA, "day")
)# If you want to keep only "0" and "10", you can use "day" as a field separator (optionally set convert=TRUE if you want to convert the newly created column to numeric values)
<- separate(df_w2, col = "day",
df_w4 sep="day",
into=c(NA, "day"),
convert=TRUE
)# If you want to remove "day" from an existing tibble, you can also use the str_remove function from the stringr package, and optionally add a conversion step:
$day <- as.numeric(str_remove(df_w3$day, "day")) df_w3