Hands-on: Functional analysis¶

First, get the files from the undifferentiated only DESeq2 analysis (in case we did not have time to do it):

cd ~/rnaseq_course/differential_expression
# get file
wget https://biocorecrg.github.io/RNAseq_coursesCRG_2026/latest/data/differential_expression/undiff.tar.gz

# extract 
tar -xvzf undiff.tar.gz
rm undiff.tar.gz

Databases¶

Gene enrichment analysis use gene sets and pathways to look for over-represented biological themes in our study list of genes.

Gene Ontology¶

The Gene Ontology (GO) describes our knowledge of the biological domain with respect to three aspects (domains):

GO domain	Description	Biological Level
Biological Process (BP)	Coordinated series of molecular activities that contribute to a biological goal.	Integrative (e.g., mitosis, DNA repair)
Molecular Function (MF)	Elemental activities performed by a gene product at the molecular level.	Elemental (e.g., catalysis, binding)
Cellular Component (CC)	The physical locations or complexes where a gene product is active.	Spatial (e.g., nucleus, ribosome)

Differences in number and granularity

Biological Process is typically the largest and most complex domain, often yielding the most specific biological insights in RNA-seq studies.
The structure is a Directed Acyclic Graph (DAG), meaning a term can have multiple parents.
True Path Rule: If a gene is associated with a specific child term, it is automatically associated with all its parent terms up to the root. This is why “higher” terms in the hierarchy (closer to the root) contain many more genes than “lower”, more specific terms.

Example: the gene product “cytochrome c” is associated with the molecular function oxidoreductase activity, the biological process oxidative phosphorylation, and the cellular component mitochondrial matrix.

The structure of GO can be described as a graph: each GO term is a node, and each edge represents the relationship (e.g., is_a, part_of) between nodes.

Source: Isoform Function Prediction Using Deep Neural Network - Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/GO-Terms-Graph-The-GO-terms-have-a-graph-structure-nodes-represent-the-functions-and_fig2_362567903 [accessed 14 Mar 2026]

GO:0019319 (hexose biosynthetic process) is part of GO:0019318 (hexose metabolic process) and also part of GO:0046364 (monosaccharide biosynthetic process). They all share common parent nodes, for example GO:0008152 (metabolic process), and eventually a root node that is here biological process.

KEGG pathways¶

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database for understanding high-level functions and utilities of the biological system.

It provides comprehensible manually-drawn pathways representing biological processes or disease-specific pathways. Example of the Homo sapiens melanoma pathway:

Source: https://link.springer.com/article/10.1186/s12859-019-2881-7/figures/14

Molecular Signatures Database (MSigDB)¶

The Molecular Signatures Database (MSigDB) is a collection of 35361 gene sets in the Human Molecular Signatures Database (MSigDB) (as of March 2026) created to be used with the GSEA software (but not only).

It is divided into 9 major collections (that include the previously described Gene Ontologies and KEGG pathways):

Mouse MSigDB Collections

A collection of gene sets is also available for mouse: Mouse Molecular Signatures Database (MSigDB).

Enrichment analysis based on gene selection¶

Tools based on a user-selection of genes usually require 2 inputs:

Gene Universe: in our example, all genes used in our analysis (after filtering out low counts).
List of genes selected from the universe: our selection of genes given the criteria we previously used: padj < 0.05.

They are often based on the Hypergeometric test or on the Fisher’s exact test.

Let’s prepare this list from the file we saved before:

mkdir -p ~/rnaseq_course/functional_analysis
cd ~/rnaseq_course/functional_analysis

# The gene symbol is in the 12th column, sed '1d' removes the header
cut -f 12 ~/rnaseq_course/differential_expression/undiff/deseq2_selection_padj005_undiff.txt | sed '1d' > deseq2_selection_padj005_symbols.txt

enrichR¶

EnrichR is a gene-list enrichment tool developed at the Icahn School of Medicine (Mount Sinai).

It does not require the input of a gene universe: only a selection of genes or a BED file.

Source: https://pmc.ncbi.nlm.nih.gov/articles/PMC3637064/

The default EnrichR interface works for Homo sapiens and Mus musculus. However, EnrichR also provides a set of tools for ortholog conversion and enrichment analysis of more organisms:

In the main page, paste our list of selected gene symbols (deseq2_selection_padj005_symbols.txt) and Submit!

KEGG Human pathway bar graph visualization:

KEGG Human pathway table visualization:

KEGG Human pathway clustergram visualization:

For Cell Types, you can also visualize networks, for example Human gene Atlas:

You can also export some graphs as PNG, JPEG or SVG.

GO / Panther tool¶

The main page of GO provides a tool to test the enrichment of gene ontologies or Panther/Reactome pathways in pre-selected gene lists.

The tool needs a selection of differentially expressed genes (supported IDs are: gene symbols, ENSEMBL IDs, HUGO IDs, UniGene, ..) and a gene universe.

Prepare files using this time the ENSEMBL IDs:

cd ~/rnaseq_course/functional_analysis

# Extract all gene IDs used in our analysis
cut -f 1 ~/rnaseq_course/differential_expression/undiff/normalized_counts_log2_star_undiff.txt | sed '1d' > deseq2_UNIVERSE_ENSEMBL.txt

# Extract significant gene symbols only
cut -f 1 ~/rnaseq_course/differential_expression/undiff/deseq2_selection_padj005_undiff.txt | sed '1d' > deseq2_selection_padj005_ENSEMBL.txt

Paste our selection, and select biological process and Homo sapiens:

Launch!

Analyzed List is what we just uploaded (deseq2_selection_padj005_ENSEMBL.txt). In Reference List, we need to upload a file containing the universe (deseq2_UNIVERSE_ENSEMBL.txt): Change → Browse → (select deseq2_UNIVERSE_ENSEMBL.txt) → Upload list

Launch analysis

Try the same analysis using the gene symbols instead of ENSEMBL IDs

# Get universe with gene symbols (we already have the gene selection in deseq2_selection_padj005_symbol.txt)
cut -f11 ~/rnaseq_course/differential_expression/undiff/normalized_counts_log2_star_undiff.txt | sed '1d' > deseq2_UNIVERSE_symbols.txt

Launch!

With R: GOstats¶

In RStudio, load the GOstats package:

setwd("~/rnaseq_course/functional_analysis")

library("GOstats")

Read in differentially expressed genes:

de_select <- read.table("deseq2_selection_padj005_ENSEMBL.txt", header=T, as.is=T, sep="\t")

The gene universe can be the list of genes after filtering for low counts: we prepared it already: deseq2_UNIVERSE_ENSEMBL.txt

de_univ <- read.table("deseq2_UNIVERSE_ENSEMBL.txt", header=T, as.is=T, sep="\t")

GOstats works only with Entrez IDs: we can get them with the biomaRt package, as we did for the differential expression analysis.

library(biomaRt)

# load database
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="https://sep2025.archive.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")

# ENSEMBL IDs for differentially expressed genes
ids <- de_select

# get Entrez IDs
entrez <- getBM(attributes=c('entrezgene', 'ensembl_gene_id'), filters ='ensembl_gene_id', values = ids, mart = mart)

# get Entrez IDs for the universe
ids_univ <- de_univ
entrez_univ <- getBM(attributes=c('entrezgene', 'ensembl_gene_id'), filters ='ensembl_gene_id', values = ids_univ, mart = mart)

Note

As biomaRt may cause connection issues, you can download pre-computed Entrez ID files directly:

cd ~/rnaseq_course/functional_analysis

wget https://biocorecrg.github.io/RNAseq_coursesCRG_2026/latest/data/functional_analysis/deseq2_selection_padj005_entrez.txt

wget https://biocorecrg.github.io/RNAseq_coursesCRG_2026/latest/data/functional_analysis/deseq2_UNIVERSE_entrez.txt

# read files
entrez <- read.table("deseq2_selection_padj005_entrez.txt", header=T, as.is=T, sep="\t")
entrez_univ <- read.table("deseq2_UNIVERSE_entrez.txt", header=T, as.is=T, sep="\t")

We can proceed with the hypergeometric test (enrichment) for the Biological Process ontologies:

# install annotation package
BiocManager::install("org.Hs.eg.db")

# Set p-value cutoff
hgCutoff <- 0.001

# Set parameters
params <- new("GOHyperGParams",
 geneIds=na.omit(unique(entrez)),
 universeGeneIds=na.omit(unique(entrez_univ)),
 ontology="BP",
 annotation="org.Hs.eg",
 pvalueCutoff=hgCutoff,
 conditional=FALSE,
 testDirection="over")

# Run enrichment test
hgOver <- hyperGTest(params)

# Get a summary table
df <- summary(hgOver)

GOBPID	Pvalue	OddsRatio	ExpCounts	Counts	Size	Term
GO:0016477	4.322880e-05	3.594820	5.82993675	17	1098	cell migration
GO:0010884	7.630680e-05	45.181065	0.08495354	3	16	positive regulation of lipid storage
GO:0097755	8.408726e-05	19.842188	0.23362224	4	44	positive regulation of blood vessel diameter
GO:0060485	9.996919e-05	7.305817	1.08315765	7	204	mesenchyme development
GO:0035150	1.250790e-04	11.541689	0.48848286	5	92	regulation of tube size
GO:0035296	1.250790e-04	11.541689	0.48848286	5	92	regulation of tube diameter

GOstats also provides an HTML report:

# Produce HTML report
htmlReport(hgOver, file="GOstats_BP.html")

We can open the report in a web browser.

Enrichment based on ranked lists of genes using GSEA¶

GSEA (Gene Set Enrichment Analysis)¶

GSEA is available as a Java-based tool.

Algorithm¶

GSEA doesn’t require a threshold: the whole set of genes is considered.

GSEA checks whether a particular gene set (for example, a gene ontology) is randomly distributed across a list of ranked genes.

The algorithm consists of 3 key elements:

Calculation of the Enrichment Score The Enrichment Score (ES) reflects the degree to which a gene set is overrepresented at the extremes (top or bottom) of the entire ranked gene list.
Estimation of Significance Level of ES The statistical significance (nominal p-value) of the Enrichment Score (ES) is estimated by using an empirical phenotype-based permutation test procedure. The Normalized Enrichment Score (NES) is obtained by normalizing the ES for each gene set to account for the size of the set.
Adjustment for Multiple Hypothesis Testing Calculation of the FDR to control the proportion of false positives.

Gene expression data in TXT format¶

The input should be normalized read counts filtered for low counts (we created it in the DESeq2 tutorial → normalized_counts.txt!).

The first column contains the gene ID (HUGO symbols for Homo sapiens). The second column contains any description or symbol, and will be ignored by the algorithm. The remaining columns contain normalized expressions: one column per sample.

NAME	DESCRIPTION	5p4_25c	5p4_27c	5p4_28c	5p4_29c	5p4_30c	5p4_31cfoxc1	…
DKK1	NA	0	0	0	0	0	0	…
HGT	NA	0	0	0	0	0	0	…

Exercise

Adjust the file normalized_counts_log2_star.txt so the first column is the gene symbol, the second is the gene ID (or anything else), and the remaining ones are the expression columns. Save the new file as gsea_normalized_counts.txt.

setwd("~/rnaseq_course/differential_expression/undiff/")

## Read undiff normalized counts
norm_counts <- read.table("normalized_counts_log2_star_undiff.txt",as.is = TRUE, sep="\t", header = TRUE)

## Check our sample columns and gene id and gene name columns
colnames(norm_counts)
# write normalized counts for GSEA
x <- norm_counts
sample_val<-norm_counts[,2:6] ## Selecting the sample columns with the normalized values. 

#Select column names and ensembl gene ids
y <- x[,c(1,11)] ## Select the gene_id and gene_name columns
colnames(y)
x <- data.frame(NAME=y[,1], DESCRIPTION=y[,2])
x <- cbind(x,sample_val)
colnames(x)

setwd("~/rnaseq_course/functional_analysis")

write.table(x, "R_norm_counts_for_GSEA.txt", quote=F, col.names=T, row.names=F, sep="\t")

Phenotype labels in CLS format¶

A phenotype label file defines phenotype labels (experimental groups) and assigns those labels to the samples in the corresponding expression data file.

Let’s create it for our experiment:

10 2 1
# WT KO
WT WT WT WT WT KO KO KO KO KO

Note

The first label used is assigned to the first class named on the second line; the second unique label is assigned to the second class named; and so on.

So the phenotype file could also be:

10 2 1
# WT KO
0 0 0 0 0 1 1 1 1 1

The first label WT in the second line is associated to the first label 0 on the third line.

Exercise

Create the phenotype labels file and save it as gsea_phenotypes.cls.

BACKUP

### Downloading normalized counts 
wget  https://biocorecrg.github.io/RNAseq_coursesCRG_2026/latest/data/functional_analysis/gsea_normalized_counts.txt

### Downloading phenotype labels
wget  https://biocorecrg.github.io/RNAseq_coursesCRG_2026/latest/data/functional_analysis/gsea_phenotypes.cls

Download and run GSEA¶

Download Java application¶

GSEA is Java-based. Launch it from a terminal window:

#Download GSEA
wget https://data.broadinstitute.org/gsea-msigdb/.test/gsea/software/desktop/4.0/GSEA_Linux_4.0.3.zip
unzip GSEA_Linux_4.0.3.zip
cd GSEA_Linux_4.0.3
bash gsea.sh 

In Steps in GSEA analysis (upper left corner):

Go to Load data: select gsea_normalized_counts.txt and gsea_phenotypes.cls and load.

Go to Run GSEA

Results: index.html

Enrichment results in HTML

Details for one gene set:
- Summary of enrichment: phenotype, stats (Nominal p-value, FDR q-value, FWER p-value), enrichment scores (ES and Normalized ES)
- Enrichment plot

Table of genes: ranking, individual enrichment scores, core enrichment Yes/No.

Note on core enrichment genes

From GSEA documentation: “Genes with a Yes value in this column contribute to the leading-edge subset within the gene set. This is the subset of genes that contributes most to the enrichment result.”

Heatmap of all genes from that gene set (ranked by GSEA) for each sample:

Suggestions for GSEA

Selection of pathways / gene sets: select the lowest FDR first.
If you are looking for genes to validate on certain pathways:
- It is better if those genes belong to the core enrichment.
- It is also good to go back to the differential expression analysis table and make sure that their adjusted-value is low.
You can also upload your own gene sets (for example a gene signature taken from a specific paper) to test against your list of genes, using one of the GSEA gene set database formats.