5.3 Pivoting data

The examples given before were obviously untidy: two values in one cell, date split in two different columns.

It will happen that the format of the data you are usually working with is NOT tidy.


Example: here we create a tibble that contains the expression of 4 genes in 4 samples: the gene expression is measured at two time points and for two experimental conditions:

df_expr <- tibble(gene=c("Fgfrl1", "Usp34", "Mlf2", "Gga1"),
                KO_t0=1:4,
                KO_t10=5:2,
                WT_t0=6:3,
                WT_t10=0:3
                )

Why this tibble can’t be considered tidy?
* One variable (expression) is spread across 4 columns.

In order to fix this, we can use the pivot_longer function: we will increase the number of rows and decrease the number of columns:

pivot_longer(df_expr,
             cols=c("KO_t0", "KO_t10", "WT_t0", "WT_t10")
)
## # A tibble: 16 x 3
##    gene   name   value
##    <chr>  <chr>  <int>
##  1 Fgfrl1 KO_t0      1
##  2 Fgfrl1 KO_t10     5
##  3 Fgfrl1 WT_t0      6
##  4 Fgfrl1 WT_t10     0
##  5 Usp34  KO_t0      2
##  6 Usp34  KO_t10     4
##  7 Usp34  WT_t0      5
##  8 Usp34  WT_t10     1
##  9 Mlf2   KO_t0      3
## 10 Mlf2   KO_t10     3
## 11 Mlf2   WT_t0      4
## 12 Mlf2   WT_t10     2
## 13 Gga1   KO_t0      4
## 14 Gga1   KO_t10     2
## 15 Gga1   WT_t0      3
## 16 Gga1   WT_t10     3

Now only one column - value - contains the expression variable.

We can change the name of the newly created columns with parameters values_to (where the values are stored) and names_to (where the former column names are stored):

expr_long <- pivot_longer(df_expr,
             cols=c("KO_t0", "KO_t10", "WT_t0", "WT_t10"),
             values_to="expression",
             names_to="sample"
)

Is there anything else that we could do here, to make the data even more tidy?

  • There are 2 variables in column sample: samples experimental group (KO or WT) and time point (t0 and t10): we can separate them.
# we can add "remove=FALSE" so as to keep the original column (sample name), that can be useful in this case:
expr_tidy <- separate(expr_long,
              col=sample, 
              into=c("group", "time"),
              sep="_",
              remove=FALSE
         )    

The pivot_wider function does just the opposite: when an observation is scattered across multiple rows, you can move it back to a column.

Let’s take the following example, where expression of 2 genes is measured under two conditions (KO and WT) using two techniques (RNA-seq and qPCR):

df_expr2 <- tibble(gene=c(rep("Fgfrl1", 4), rep("Mlf2", 4)),
                group=rep(rep(c("KO", "WT"), each=2), 2),
                type=rep(c("counts_rnaseq", "Ct_qpcr"), 4),
                value=c(10, 25, 8, 22, 7, 23, 11, 28)
                )

Here, it would make sense to have a column counts_rnaseq and a column Ct_qpcr, which would contain their corresponding values:

df_expr2_wider <- pivot_wider(df_expr2,
                        names_from=type, 
                        values_from=value)

HANDS-ON

Make the following data frame tidy!

df_w <- tibble(Name=c("Anna", "Jordi", "Laura", "Miquel"),
                Weight_day0=c(67, 72, 62, 85),
                Weight_day10=c(65, 73, 66, 80),
                )
Answer
# `pivot_longer` on "Weight" columns:
df_w2 <- pivot_longer(df_w, 
                    cols=c("Weight_day0", "Weight_day10"),
                    values_to="weight", 
                    names_to="day"
                    )

# `separate` on the "day" column
  # into=c(NA, "day"): the first output is ignored (in the), the "Weight" strings
df_w3 <- separate(df_w2, col = "day", 
                  sep="_", 
                  into=c(NA, "day")
                  )
# If you want to keep only "0" and "10", you can use "day" as a field separator (optionally set convert=TRUE if you want to convert the newly created column to numeric values)
df_w4 <- separate(df_w2, col = "day", 
                  sep="day", 
                  into=c(NA, "day"),
                  convert=TRUE
                  )
# If you want to remove "day" from an existing tibble, you can also use the str_remove function from the stringr package, and optionally add a conversion step:
df_w3$day <- as.numeric(str_remove(df_w3$day, "day"))