Functional analysis

First, get the files from the undifferentiated only DESeq2 analysis (in case we did not have time to do it):

# get file
wget https://public-docs.crg.es/biocore/projects/training/PHINDaccess2020/undiff.tar.gz

# extract archive
tar -xvzf undiff.tar.gz

# remove archive
rm undiff.tar.gz

Data bases

Gene Ontology

The Gene Ontology (GO) describes our knowledge of the biological domain with respect to three aspects:

GO domains / root terms	Description
Molecular Function	Molecular-level activities performed by gene products. e.g. catalysis, binding.
Biological Process	Larger processes accomplished by multiple molecular activities. e.g. apoptosis, DNA repair.
Cellular Component	The locations where a gene product performs a function. e.g. cell membrane, ribosome.

Example of GO annotation: the gene product “cytochrome c” can be described by the molecular function oxidoreductase activity, the biological process oxidative phosphorylation, and the cellular component mitochondrial matrix.

The structure of GO can be described as a graph: each GO term is a node, each edge represents the relationships between the nodes. For example:

GO:0019319 (hexose biosynthetic process) is part of GO:0019318 (hexose metabolic process) and also part of GO:0046364 (monosaccharide biosynthetic process). They all share common parent nodes, for example GO:0008152 (metabolic process), and eventually a root node that is here biological process.

KEGG pathways

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database for understanding high-level functions and utilities of the biological system.

It provides comprehensible manually-drawn pathways representing biological processes or disease-specific pathways.
Example of the Homo sapiens melanoma pathway:

Molecular Signatures Database (MSigDB)

The Molecular Signatures Database (MSigDB) is a collection of 17810 annotated gene sets (as of May 2019) created to be used with the GSEA software (but not only).

It is divided into 8 major collections (that include the previously described Gene Ontologies and KEGG pathways):

Enrichment analysis based on gene selection

Tools based on a user-selection of genes usually require 2 inputs:

Gene Universe: in our example: all genes used in our analysis (after filtering out low counts in our case).
List of genes selected from the universe: our selection of genes, give the criteria we previously used: padj < 0.05.

They are often based on the Hypergeometric test or on the Fisher’s exact test. You can have a look at this page for some explanation of both tests.

Let’s prepare this list from the file we saved before:

cd ~/rnaseq_course/functional_analysis

# The gene symbol is in the 12th column
cut -f 12 ~/rnaseq_course/differential_expression/undiff/deseq2_selection_padj005_undiff.txt | sed '1d' > deseq2_selection_padj005_symbols.txt

enrichR

EnrichR is a gene-list enrichment tool developped at the Icahn Schoold of Medicine (Mount Sinai).

It does not require the input of a gene universe: only a selection of genes or a BED file.

The default EnrichR interface works for Homo sapiens and Mus musculus.
However, EnrichR also provides a set of tools for ortholog conversion and enrichment analysis of more organisms:

In the main page, paste our list of selected gene symbols (deseq2_selection_padj005_symbols.txt) and Submit !

KEGG Human pathway bar graph vizualization:

KEGG Human pathway table vizualization:

KEGG Human pathway clustergram vizualization:

For Cell Types, you can also visualize networks, for example Human gene Atlas:

You can also export some graphs as PNG, JPEG or SVG.

GO / Panther tool

The main page of GO provides a tool to test the enrichment of gene ontologies or Panther/Reactome pathways in pre-selected gene lists.
The tool needs a selection of differentially expressed genes (supported IDs are: gene symbols, ENSEMBL IDs, HUGO IDs, UniGene, ..) and a gene universe.

Prepare files using s time the ENSEMBL IDs:

cd ~/rnaseq_course/functional_analysis

# Extract all gene IDs used in our analysis
cut -f 1 ~/rnaseq_course/differential_expression/undiff/normalized_counts_log2_star_undiff.txt | sed '1d' > deseq2_UNIVERSE_ENSEMBL.txt

# Extract significant gene symbols only
cut -f 1 ~/rnaseq_course/differential_expression/undiff/deseq2_selection_padj005_undiff.txt | sed '1d' > deseq2_selection_padj005_ENSEMBL.txt

Paste our selection, and select biological process and Homo sapiens:

Launch !

Analyzed List is what we just uploaded (deseq2_selection_padj005_ENSEMBL.txt).
In Reference List, we need to upload a file containg the universe (deseq2_UNIVERSE_ENSEMBL.txt): Change -> Browse -> (select deseq2_UNIVERSE_ENSEMBL.txt) -> Upload list

Launch analysis

Try the same analysis using the gene symbols instead of ENSEMBL IDs

# Get universe with gene symbols (we already have the gene selection in deseq2_selection_padj005_symbol.txt
cut -f11 ~/rnaseq_course/differential_expression/undiff/normalized_counts_log2_star_undiff.txt | sed '1d' > deseq2_UNIVERSE_symbols.txt

Launch !

with R: GOstats

In RStudio, load the “GOstats” package

setwd("~/rnaseq_course/functional_analysis")

library("GOstats")

Read in differentially expressed genes

de_select <- read.table("deseq2_selection_padj005_ENSEMBL.txt", header=T, as.is=T, sep="\t")

The gene universe can be the list of genes after filtering for low counts: we prepared it already: deseq2_UNIVERSE_ENSEMBL.txt

de_univ <- read.table("deseq2_UNIVERSE_ENSEMBL.txt", header=T, as.is=T, sep="\t")

GOstats works only with Entrez IDs: we can get them with the biomaRt package, as we did for the differential expression analysis.

library(biomaRt)


# load database
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="mar2017.archive.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")

# ENSEMBL IDs for differentially expressed genes
ids <- de_select

# get Entrez IDs
entrez <- getBM(attributes=c('entrezgene', 'ensembl_gene_id'), filters ='ensembl_gene_id', values = ids, mart = mart)

# get Entrez IDs for the universe
ids_univ <- de_univ
entrez_univ <- getBM(attributes=c('entrezgene', 'ensembl_gene_id'), filters ='ensembl_gene_id', values = ids_univ, mart = mart)

As biomaRt is causing trouble, you can get the files the following way

download.file("https://public-docs.crg.es/biocore/projects/training/PHINDaccess2020/deseq2_UNIVERSE_entrez.txt")
download.file("https://public-docs.crg.es/biocore/projects/training/PHINDaccess2020/deseq2_selection_padj005_entrez.txt")

entrez <- scan("deseq2_selection_padj005_entrez.txt")
entrez_univ <- scan("deseq2_UNIVERSE_entrez.txt")

We can proceed with the hypergeometric test (enrichment) for the Biological Process ontologies:

# install annotation package
BiocManager::install("org.Hs.eg.db")

# Set p-value cutoff
hgCutoff <- 0.001

# Set parameters
params <- new("GOHyperGParams",
	geneIds=na.omit(unique(entrez)),
	universeGeneIds=na.omit(unique(entrez_univ)),
	ontology="BP",
	annotation="org.Hs.eg",
	pvalueCutoff=hgCutoff,
	conditional=FALSE,
	testDirection="over")

# Run enrichment test
hgOver <- hyperGTest(params)

# Get a summary table
df <- summary(hgOver)

GOBPID	Pvalue	OddsRatio	ExpCounts	Counts	Size	Term
GO:0016477	4.322880e-05	3.594820	5.82993675	17	1098	cell migration
GO:0010884	7.630680e-05	45.181065	0.08495354	3	16	positive regulation of lipid storage
GO:0097755	8.408726e-05	19.842188	0.23362224	4	44	positive regulation of blood vessel diameter
GO:0060485	9.996919e-05	7.305817	1.08315765	7	204	mesenchyme development
GO:0035150	1.250790e-04	11.541689	0.48848286	5	92	regulation of tube size
GO:0035296	1.250790e-04	11.541689	0.48848286	5	92	regulation of tube diameter

GOstats also provides an HTML report:

# Produce HTML report
htmlReport(hgOver, file="GOstats_BP.html")

We can open the report in a web browser.

With R: KEGGprofile

Load KEGGprofile package:

library(KEGGprofile)

KEGGprofile also works with Entrez ID (we got them for the GOstats analysis).

# KEGG pathway enrichment
KEGGresult <- find_enriched_pathway(na.omit(unique(entrez$entrezgene)), 
	returned_genenumber=10, 
	species='hsa', 
	download_latest = TRUE)

# Format file file
kegg_final <- data.frame(KEGGresult$stastic, 
	genes_entrezid=unlist(lapply(KEGGresult$detail, function(x)paste(x, collapse=",")), use.names=F), 
	stringsAsFactors=F)

# Write table to file
write.table(kegg_final, "KEGGprofile_results.txt", sep="\t", row.names=F, col.names=T, quote=F)

Results table:

Pathway_Name	Gene_Found	Gene_Pathway	Percentage	pvalue	pvalueAdj
Metabolic pathways	88	1489	0.06	4.644397e-08	1.565162e-05
Cytokine-cytokine receptor interaction	26	294	0.09	3.187192e-05	3.580279e-03
Viral protein interaction with cytokine and cytokine receptor	12	100	0.12	1.671507e-04	8.047114e-03
NF-kappa B signaling pathway	10	102	0.10	2.490180e-03	3.356763e-02
Mitophagy - animal	11	65	0.17	8.801457e-06	1.483046e-03
C-type lectin receptor signaling pathway	10	104	0.10	2.898658e-03	3.757106e-02

Enrichment based on ranked lists of genes using GSEA

GSEA (Gene Set Enrichment Analysis)

GSEA is available as a Java-based tool.

Algorithm

GSEA doesn’t require a threshold: the whole set of genes is considered.

GSEA checks whether a particular gene set (for example, a gene ontology) is randomly distributed across a list of ranked genes.
The algorithm consists of 3 key elements:

Calculation of the Enrichment Score The Enrichment Score (ES) reflects the degree to which a gene set is overrepresented at the extremes (top or bottom) of the entire ranked gene list.
Estimation of Significance Level of ES The statistical significant (nominal p-value) of the Enrichment Score (ES) is estimated by using an empirical phenotype-based permutation test procedure. The Normalized Enrichment Score (NES) is obtained by normalizing the ES for each gene set to account for the size of the set.
Adjustment for Multiple Hypothesis Testing Calculation of the FDR ti control the proportion of falses positives.

See the GSEA Paper for more details on the algorithm.

The main GSEA algorithm requires 3 inputs:

Gene expression data
Phenotype labels
Gene sets

Gene expression data in TXT format

The input should be normalized read counts filtered out for low counts (-> we created it in the DESeq2 tutorial -> normalized_counts.txt !).
The first column contains the gene ID (HUGO symbols for Homo sapiens).
The second column contains any description or symbol, and will be ignoreed by the algorithm.
The remaining columns contains normalized expressions: one column per sample.

NAME	DESCRIPTION	5p4_25c	5p4_27c	5p4_28c	5p4_29c	5p4_30c	5p4_31cfoxc1	…
DKK1	NA	0	0	0	0	0	0	…
HGT	NA	0	0	0	0	0	0	…

Exercise
Adjust the file normalized_counts_log2_star.txt so the first column is the gene symbol, the second is the gene ID (or anything else), and the remaining ones are the expression columns. You can save that new file as gsea_normalized_counts.txt.

cd ~/rnaseq_course/functional_analysis

awk -F "\t" 'BEGIN{OFS="\t"}{print $16,$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11}' ~/rnaseq_course/differential_expression/normalized_counts_log2_star.txt > gsea_normalized_counts.txt

Phenotype labels in CLS format

A phenotype label file defines phenotype labels (experimental groups) and assigns those labels to the samples in the corresponding expression data file.

Let’s create it for our experiment:

10	2	1
#	WT	KO
WT	WT	WT	WT	WT	KO	KO	KO	KO	KO

NOTE: the first label used is assigned to the first class named on the second line; the second unique label is assigned to the second class named; and so on.
So the phenotype file could also be:

10	2	1
#	WT	KO
0	0	0	0	0	1	1	1	1	1

The first label WT in the second line is associated to the first label 0 on the third line.

Exercise
Create the phenotype labels file and save it as gsea_phenotypes.cls.

Download and run GSEA

Download Java application:

Enter the registration page, enter your email and organization, then to download page, enter your Email and login:
Click on download gsea-3.0.jar link and save file locally to your home directory.

Launch the GSEA application

GSEA is Java-based. Launch it from a terminal window:

$RUN java -Xmx1024m -jar gsea-3.0.jar

In Steps in GSEA analysis (upper left corner):

Go to Load data: select gsea_normalized_counts.txt and gsea_phenotypes.cls and load.

Go to Run GSEA

Results: index.html

Enrichments results in html

Details for one gene set
- Summary of enrichment: phenotype, stats (Nominal p-value, FDR q-value, FWER p-value), enrichment scores (ES and Normalized ES)
- Enrichment plot

Table of genes: ranking, individual enrichment scores, core enrichment Yes/No.

From GSEA documentation, regarding core enrichment genes: “Genes with a Yes value in this column contribute to the leading-edge subset within the gene set. This is the subset of genes that contributes most to the enrichment result.”

Heatmap of all genes from that gene set (ranked by GSEA) for each sample:

Suggestions for GSEA:

Selection of pathways / gene sets: select the lowest FDR first.
If you are looking for genes to validate on certain pathways:
- It is better if those genes belong to the core enrichment.
- It is also good to go back to the differential expression analysis table and make sure that their adjusted-value is low.
You can also upload your own gene sets (for example a gene signature taken from a specific paper) to test against your list of genes, using one of the GSEA gene set database formats.