Part 10 Mini project
We will work with two input files:
- annotation.tsv: contains the annotation (from Gencode) of 29244 Human genes.
- normalized_counts.csv: contains the log2-transformed normalized counts from an RNA-seq project. The expression of 14769 Human genes in 8 samples from 4 experimental groups is assessed:
- Treatment 1, time 0 (2 samples)
- Treatment 1, time 20 (2 samples)
- Treatment 2, time 0 (2 samples)
- Treatment 2, time 20 (2 samples)
- Download / read in the two files that are found here into two tibbles.
# Download and read in the files
<- read_tsv("https://public-docs.crg.es/biocore/projects/training/R_tidyverse_2021/mini_project/annotation.tsv")
annotation
<- read_csv("https://public-docs.crg.es/biocore/projects/training/R_tidyverse_2021/mini_project/normalized_counts.csv") norm_data
- Tidy each tibble individually.
Some tips!
- For the annotation: something needs to be separated.
- For the normalized data: it’s important to pivot some columns… And to separate one created by the pivoting!
# Separate the "gene_name_gene_type" column into gene_name and gene_type
<- annotation %>%
annotation_tidy separate(col=gene_name_gene_type,
into=c("gene_name", "gene_type"),
sep="/")
# Pivot the data so there is a single column for the expression value
# Use separate to create 3 new columns that will contain information, respectively on: Treatment, replicate, time point.
# (optional) mutate the "time" column to get a numeric column that contains only the actual time point (without the "t")
<- norm_data %>%
norm_data_tidy pivot_longer(cols=contains("Treatment"),
values_to="expression",
names_to="samples") %>%
separate(col=samples,
into=c("Treatment", "replicate", "time"),
sep="_") %>%
mutate(time=as.numeric(str_remove(time, "t")))
- Join both datasets so as to obtain one tibble (keep the intersection).
Don’t know which columns to use for the joining? Click here for help!
Work on the gencode_id
column of the normalized data: Gencode IDs only differ from the Ensembl IDs by the suffix (point + numbers).
e.g. ENSG00000140853.15 in Gencode is ENSG00000140853 in Ensembl. Perhaps str_sub
can help?
# First, we need a common ID: remove the suffix with the point + numbers from "gencode_id" in order to obtain "ensembl" IDs
<- norm_data_tidy %>%
norm_data_tidy mutate(ensembl_gene_id=str_sub(gencode_id, 1, 15),
.before=Treatment) # not important, just more convenient
# Use inner join in order to keep only common data
<- inner_join(x=annotation_tidy,
all_data y=norm_data_tidy,
by="ensembl_gene_id")
- What is the average expression of the different types of genes (
gene_type
)?- According to this data, which 2 gene types have the highest average expression?
- Remove all rows which correspond to these 2 gene types from the dataset.
- What is now the size of our dataset?
A couple of tips…
- Remember
slice_max
? pull
could also be useful!
<- all_data %>%
highest_mean_2types group_by(gene_type) %>%
summarise(average_mean_gene_type=mean(expression)) %>%
slice_max(order_by=average_mean_gene_type,
n=2) %>%
pull(gene_type)
<- all_data %>%
data_no_highest_2types filter(!gene_type %in% highest_mean_2types)
dim(data_no_highest_2types)
- Create a new column that contains the median expression per gene, per experimental group and per gene type.
By experimental group, we mean Treatment + time (for example, samples “Treatment1_rep1_t0” and “Treatment1_rep2_t0” are part of the same experimental group: Treatment1_t0)
Help!
- If you don’t have it already, create a column
experimental_group
. - The grouping should be done using 3 variables.
- Remember that, if the data is grouped, the newly created columns will take into account the groups…
# New column for the actual experimental group (no "rep1" and "rep2")
# group_by gene_name, gene_type and experimental_group
<- data_no_highest_2types %>%
data_no_highest_2_types unite(col="experimental_group",
Treatment, time, sep="_") %>%
group_by(experimental_group, gene_type, gene_name) %>%
mutate(median_expression_experimental_group=median(expression))
- For each experimental group, retrieve the lincRNA that has the highest median expression.
- Is it the same lincRNA gene for all 4 experimental groups?
Stuck? Click here.
Check this stackoverflow post for inspiration.
%>%
data_no_highest_2_types ungroup() %>%
group_by(experimental_group, gene_type) %>%
filter(expression == max(expression) & gene_type=="lincRNA")