Part 10 Mini project

We will work with two input files:

annotation.tsv: contains the annotation (from Gencode) of 29244 Human genes.
normalized_counts.csv: contains the log2-transformed normalized counts from an RNA-seq project. The expression of 14769 Human genes in 8 samples from 4 experimental groups is assessed:
- Treatment 1, time 0 (2 samples)
- Treatment 1, time 20 (2 samples)
- Treatment 2, time 0 (2 samples)
- Treatment 2, time 20 (2 samples)

Download / read in the two files that are found here into two tibbles.

# Download and read in the files
annotation <- read_tsv("https://public-docs.crg.es/biocore/projects/training/R_tidyverse_2021/mini_project/annotation.tsv")

norm_data <- read_csv("https://public-docs.crg.es/biocore/projects/training/R_tidyverse_2021/mini_project/normalized_counts.csv")

Tidy each tibble individually.

Some tips!

For the annotation: something needs to be separated.
For the normalized data: it’s important to pivot some columns… And to separate one created by the pivoting!

# Separate the "gene_name_gene_type" column into gene_name and gene_type
annotation_tidy <- annotation %>% 
                        separate(col=gene_name_gene_type, 
                                 into=c("gene_name", "gene_type"), 
                                 sep="/")

# Pivot the data so there is a single column for the expression value
# Use separate to create 3 new columns that will contain information, respectively on: Treatment, replicate, time point.
# (optional) mutate the "time" column to get a numeric column that contains only the actual time point (without the "t")
norm_data_tidy <- norm_data %>% 
                      pivot_longer(cols=contains("Treatment"), 
                                   values_to="expression", 
                                   names_to="samples") %>% 
                      separate(col=samples, 
                               into=c("Treatment", "replicate", "time"), 
                               sep="_") %>% 
                      mutate(time=as.numeric(str_remove(time, "t")))

Join both datasets so as to obtain one tibble (keep the intersection).

Don’t know which columns to use for the joining? Click here for help!

Work on the gencode_id column of the normalized data: Gencode IDs only differ from the Ensembl IDs by the suffix (point + numbers). e.g. ENSG00000140853.15 in Gencode is ENSG00000140853 in Ensembl. Perhaps str_sub can help?

# First, we need a common ID: remove the suffix with the point + numbers from "gencode_id" in order to obtain "ensembl" IDs
norm_data_tidy <- norm_data_tidy %>% 
                      mutate(ensembl_gene_id=str_sub(gencode_id, 1, 15), 
                             .before=Treatment) # not important, just more convenient

# Use inner join in order to keep only common data
all_data <- inner_join(x=annotation_tidy, 
                       y=norm_data_tidy, 
                       by="ensembl_gene_id")

What is the average expression of the different types of genes (gene_type)?
- According to this data, which 2 gene types have the highest average expression?
- Remove all rows which correspond to these 2 gene types from the dataset.
- What is now the size of our dataset?

A couple of tips…

Remember slice_max?
pull could also be useful!

highest_mean_2types <- all_data %>% 
                          group_by(gene_type) %>% 
                          summarise(average_mean_gene_type=mean(expression)) %>% 
                          slice_max(order_by=average_mean_gene_type, 
                                    n=2) %>% 
                          pull(gene_type)

data_no_highest_2types <- all_data %>% 
                            filter(!gene_type %in% highest_mean_2types)

dim(data_no_highest_2types)

Create a new column that contains the median expression per gene, per experimental group and per gene type.
By experimental group, we mean Treatment + time (for example, samples “Treatment1_rep1_t0” and “Treatment1_rep2_t0” are part of the same experimental group: Treatment1_t0)

Help!

If you don’t have it already, create a column experimental_group.
The grouping should be done using 3 variables.
Remember that, if the data is grouped, the newly created columns will take into account the groups…

# New column for the actual experimental group (no "rep1" and "rep2")
# group_by gene_name, gene_type and experimental_group
data_no_highest_2_types <- data_no_highest_2types %>% 
                                unite(col="experimental_group", 
                                      Treatment, time, 
                                      sep="_") %>% 
                                group_by(experimental_group, gene_type, gene_name) %>% 
                                mutate(median_expression_experimental_group=median(expression))

For each experimental group, retrieve the lincRNA that has the highest median expression.
- Is it the same lincRNA gene for all 4 experimental groups?

Stuck? Click here.

Check this stackoverflow post for inspiration.

data_no_highest_2_types %>% 
                      ungroup() %>% 
                      group_by(experimental_group, gene_type) %>% 
                      filter(expression == max(expression) & gene_type=="lincRNA")