Functional analysis

Data bases

Gene Ontology

The Gene Ontology (GO) describes our knowledge of the biological domain with respect to three aspects:

GO domains / root terms Description
Molecular Function Molecular-level activities performed by gene products. e.g. catalysis, binding.
Biological Process Larger processes accomplished by multiple molecular activities. e.g. apoptosis, DNA repair.
Cellular Component The locations where a gene product performs a function. e.g. cell membrane, ribosome.

Example of GO annotation: the gene product “cytochrome c” can be described by the molecular function oxidoreductase activity, the biological process oxidative phosphorylation, and the cellular component mitochondrial matrix.

The structure of GO can be described as a graph: each GO term is a node, each edge represents the relationships between the nodes. For example:

GO:0019319 (hexose biosynthetic process) is part of GO:0019318 (hexose metabolic process) and also part of GO:0046364 (monosaccharide biosynthetic process). They all share common parent nodes, for example GO:0008152 (metabolic process), and eventually a root node that is here biological process.

KEGG pathways

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database for understanding high-level functions and utilities of the biological system.

It provides comprehensible manually-drawn pathways representing biological processes or disease-specific pathways.
Example of the Homo sapiens melanoma pathway:

Molecular Signatures Database (MSigDB)

The Molecular Signatures Database (MSigDB) is a collection of 17810 annotated gene sets (as of May 2019) created to be used with the GSEA software (but not only).

It is divided into 8 major collections (that include the previously described Gene Ontologies and KEGG pathways):

Enrichment analysis based on gene selection

Tools based on a user-selection of genes usually require 2 inputs:

  • Gene Universe: in our example: all genes used in our analysis (after filtering out low counts in our case).
  • List of genes selected from the universe: our selection of genes, give the criteria we previously used: padj < 0.05, |log2FoldChange| >= 0.5.

They are often based on the Hypergeometric test or on the Fisher’s exact test. You can have a look at this page for some explanation of both tests.

enrichR

EnrichR is a gene-list enrichment tool developped at the Icahn Schoold of Medicine (Mount Sinai).

It does not require the input of a gene universe: only a selection of genes or a BED file.

The default EnrichR interface works for Homo sapiens and Mus musculus.
However, EnrichR also provides a set of tools for ortholog conversion and enrichment analysis of more organisms:

In the main page, paste our list of selected gene symbols (deseq2_results_padj0.05_log2fc0.5_symbols.txt) and Submit !

KEGG Human pathway bar graph vizualization:

KEGG Human pathway table vizualization:

KEGG Human pathway clustergram vizualization:

For Cell Types, you can also visualize networks, for example Human gene Atlas:

You can also export some graphs as PNG, JPEG or SVG.

GO / Panther tool

The main page of GO provides a tool to test the enrichment of gene ontologies or Panther/Reactome pathways in pre-selected gene lists.
The tool needs a selection of differentially expressed genes (supported IDs are: gene symbols, ENSEMBL IDs, HUGO IDs, UniGene, ..) and a gene universe.

Prepare files using the ENSEMBL IDs:

# Extract all gene IDs used in our analysis and convert from Gencode (e.g ENSG00000057657.16) to ENSEMBL (e.g. ENSG00000057657) IDs
cut -f1 deseq2_results.txt | sed '1d' | sed 's/\..//g' > deseq2_universe_ensemblIDs.txt

# Convert from Gencode to ENSEMBL IDs from selected gene list
sed 's/\..//g' deseq2_results_padj0.05_log2fc0.5_IDs.txt > deseq2_results_padj0.05_log2fc0.5_ensemblIDs.txt

Paste our selection, and select biological process and Homo sapiens(file deseq2_results_padj0.05_log2fc0.5_IDs.txt):

Launch !

Analyzed List is what we just uploaded (deseq2_results_padj0.05_log2fc0.5_ensemblIDs.txt).
In Reference List, we need to upload a file containg the universe (deseq2_universe_ensemblIDs.txt): Change -> Browse -> (select deseq2_universe_ensemblIDs.txt) -> Upload list

  • Launch analysis
  • Try the same analysis using the gene symbols instead of ENSEMBL IDs
    # Get universe with gene symbols (we already have the gene selection in deseq2_results_padj0.05_log2fc0.5_symbols.txt)
    cut -f2 deseq2_results.txt | sed '1d' > deseq2_universe_symbols.txt
    
  • Launch !

Enrichment based on ranked lists of genes using GSEA

GSEA (Gene Set Enrichment Analysis)

GSEA is available as a Java-based tool.

Algorithm

GSEA doesn’t require a threshold: the whole set of genes is considered.

GSEA checks whether a particular gene set (for example, a gene ontology) is randomly distributed across a list of ranked genes.
The algorithm consists of 3 key elements:

  1. Calculation of the Enrichment Score The Enrichment Score (ES) reflects the degree to which a gene set is overrepresented at the extremes (top or bottom) of the entire ranked gene list.
  2. Estimation of Significance Level of ES The statistical significant (nominal p-value) of the Enrichment Score (ES) is estimated by using an empirical phenotype-based permutation test procedure. The Normalized Enrichment Score (NES) is obtained by normalizing the ES for each gene set to account for the size of the set.
  3. Adjustment for Multiple Hypothesis Testing Calculation of the FDR ti control the proportion of falses positives.

See the GSEA Paper for more details on the algorithm.

The main GSEA algorithm requires 3 inputs:

  • Gene expression data
  • Phenotype labels
  • Gene sets

Gene expression data in TXT format

The input should be normalized read counts filtered out for low counts (-> we created it in the DESeq2 tutorial -> normalized_counts.txt !).
The first column contains the gene ID (HUGO symbols for Homo sapiens).
The second column contains any description or symbol, and will be ignoreed by the algorithm.
The remaining columns contains normalized expressions: one column per sample.

NAME DESCRIPTION A549_0_1 A549_0_2 A549_0_3 A549_25_1 A549_25_2 A549_25_3
DKK1 NA 0 0 0 0 0 0
HGT NA 0 0 0 0 0 0

Exercise
Adjust the file normalized_counts.txt so the first column is the gene symbol, the second is the gene ID (or anything else), and the remaining ones are the expression columns. You can save that new file as gsea_normalized_counts.txt.

awk 'BEGIN{OFS="\t"}{print $2,$1,$3,$4,$5,$6,$7,$8}' normalized_counts.txt > gsea_normalized_counts.txt

Phenotype labels in CLS format

A phenotype label file defines phenotype labels (experimental groups) and assigns those labels to the samples in the corresponding expression data file.

Let’s create it for our experiment:

6 2 1      
# t0 t25      
t0 t0 t0 t25 t25 t25

NOTE: the first label used is assigned to the first class named on the second line; the second unique label is assigned to the second class named; and so on.
So the phenotype file could also be:

6 2 1      
# t0 t25      
0 0 0 1 1 1

The first label t0 in the second line is associated to the first label 0 on the third line.

Exercise
Create the phenotype labels file and save it as gsea_phenotypes.cls.

Download and run GSEA

Download Java application:

Enter the download page, enter your Email and login:
Click on download gsea-3.0.jar link and save file locally to your home directory.

Launch the GSEA application

GSEA is Java-based. Launch it from a terminal window:

$RUN java -Xmx1024m -jar gsea-3.0.jar

In Steps in GSEA analysis (upper left corner):

  • Go to Load data: select gsea_normalized_counts.txt and gsea_phenotypes.cls and load.

  • Go to Run GSEA

  • Results: index.html

  • Enrichments results in html

  • Details for one gene set
    • Summary of enrichment: phenotype, stats (Nominal p-value, FDR q-value, FWER p-value), enrichment scores (ES and Normalized ES)
    • Enrichment plot

Table of genes: ranking, individual enrichment scores, core enrichment Yes/No.

From GSEA documentation, regarding core enrichment genes: “Genes with a Yes value in this column contribute to the leading-edge subset within the gene set. This is the subset of genes that contributes most to the enrichment result.

Heatmap of all genes from that gene set (ranked by GSEA) for each sample:

Suggestions for GSEA:

  • Selection of pathways / gene sets: select the lowest FDR first.
  • If you are looking for genes to validate on certain pathways:
    • It is better if those genes belong to the core enrichment.
    • It is also good to go back to the differential expression analysis table and make sure that their adjusted-value is low.
  • You can also upload your own gene sets (for example a gene signature taken from a specific paper) to test against your list of genes, using one of the GSEA gene set database formats.