Hands-on: Differential expression analysis

Once we have the gene counts, we can perform differential expression analysis. The goal of differential expression analysis is to perform statistical analysis to try and discover changes in expression levels of defined features (genes, transcripts, exons) between experimental groups with replicated samples.

Flowchart with the steps in RNAseq differential expression analysis.

Read quantification
 ┌───────────────────────────────┐
 │ STAR alignment → gene counts  │
 │        or                     │
 │ Salmon → transcript counts    │
 └───────────────────────────────┘
            │
            │
            ▼
    Import counts into R
    (counts + sample metadata)
            │
            │
            ▼
    Create DESeq2 dataset
    DESeqDataSetFromMatrix()
    or
    tximport (for Salmon)
            │
            │
            ▼
    Filtering
    (remove very low count genes)
            │
            │
            ▼
    Normalization
    (size factor estimation)
            │
            │
            ▼
    Exploratory analysis
    VST / rlog transformation
    PCA / sample correlation
    (outlier detection)
            │
            │
            ▼
    Differential expression model
    Run DESeq()
    • dispersion estimation
    • GLM fitting
    • statistical testing
            │
            │
            ▼
    Results extraction
    results()
    (log2FoldChange, p-value, padj)
            │
            │
            ▼
    Identify DE genes
    (padj < 0.05, |log2FC| threshold)
            │
            │
            ▼
    Downstream analysis
    • heatmaps
    • volcano plots
    • pathway enrichment
    • GO analysis

Tutorial on basic DESeq2 usage for differential analysis of gene expression

  • In this tutorial, we will use the counts calculated from the mapping on all chromosomes (we practiced so far QC and mapping for data of only one chromosome but here we consider all chromosomes), for the 10 samples previously selected from GEO:

GEO ID

SRA ID

Sample name

Differentiation

Condition

GSM2031982

SRR3091420

5p4_25c

undiff

WT

GSM2031983

SRR3091421

5p4_27c

undiff

WT

GSM2031984

SRR3091422

5p4_28c

diff 5 days

WT

GSM2031985

SRR3091423

5p4_29c

diff 5 days

WT

GSM2031986

SRR3091424

5p4_30c

diff 5 days

WT

GSM2031987

SRR3091425

5p4_31cfoxc1

undiff

KO

GSM2031988

SRR3091426

5p4_32cfoxc1

undiff

KO

GSM2031989

SRR3091427

5p4_33cfoxc1

undiff

KO

GSM2031990

SRR3091428

5p4_34cfoxc1

diff 5 days

KO

GSM2031991

SRR3091429

5p4_35cfoxc1

diff 5 days

KO

The FOXC1 protein plays a critical role in early development, particularly in the formation of structures in the front part of the eye (the anterior segment). These structures include the colored part of the eye (the iris), the lens of the eye, and the clear front covering of the eye (the cornea). Studies suggest that the FOXC1 protein may also have functions in the adult eye, such as helping cells respond to oxidative stress. Oxidative stress occurs when unstable molecules called free radicals accumulate to levels that can damage or kill cells. The FOXC1 protein is also involved in the normal development of other parts of the body, including the heart, kidneys, and brain. Source.

Get the count data for the full data set, output of both STAR and Salmon:

# Navigate to your course directory
mkdir -p ~/rnaseq_course/differential_expression
cd ~/rnaseq_course/differential_expression

# Download the full count data folder from the course repository
wget https://biocorecrg.github.io/RNAseq_coursesCRG_2026/latest/data/differential_expression/full_data_counts.tar.gz

# Untar the data -x extract folder -z gzip decompression -v verbose -f file
tar -xzvf full_data_counts.tar.gz
rm full_data_counts.tar.gz

Raw count matrices

DESeq2 takes as an input raw (non normalized) counts, in various forms:

  • A matrix for all sample, very typical in microarrays: DESeqDataSetFromMatrix()

  • One file per sample (our option for STAR): DESeqDataSetFromMatrix()

  • A txi object (our option for Salmon): DESeqDataSetFromTximport()

Prepare data from STAR

We need to create one file per sample, each file containing the raw counts of all genes:

File SRR3091420_1_chr6_counts.txt:

gene_id

count

ENSG00000260370.1

0

ENSG00000237297.1

10

ENSG00000261456.5

210

File SRR3091421_1_chr6_counts.txt:

gene_id

count

ENSG00000260370.1

0

ENSG00000237297.1

8

ENSG00000261456.5

320

and so on…

Remember that the STAR count file contains 4 columns depending on the library preparation protocol!

Exercise

  • Prepare the 10 files needed for our analysis, from the STAR output, and save them in the counts_selected directory: knowing that our libraries are unstranded, which column will you pick?

  • Create the sub-directory counts_STAR_selected inside the deseq2 directory:

cd ~/rnaseq_course/differential_expression
mkdir counts_STAR_selected
  • Loop around the 10 ReadsPerGene.out.tab files and extract the gene ID (1st column) and the correct counts (2nd column).

for i in full_data_counts/counts_STAR/*ReadsPerGene.out.tab
do echo $i 
# retrieve the first (gene name) and second column (raw reads for unstranded protocol)
cut -f1,2 $i | grep -v "_" > counts_STAR_selected/$(basename $i .ReadsPerGene.out.tab)_counts.txt
done

Sample sheet

Additionally, DESeq2 needs a sample sheet that describes the samples characteristics: treatment, knock-out / wild type, replicates, time points, etc. in the form:

SampleName

FileName

Differentiation

Condition

5p4_25c

SRR3091420_counts.txt

undiff

WT

5p4_27c

SRR3091421_counts.txt

undiff

WT

5p4_28c

SRR3091422_counts.txt

diff5days

WT

5p4_29c

SRR3091423_counts.txt

diff5days

WT

5p4_30c

SRR3091424_counts.txt

diff5days

WT

5p4_31cfoxc1

SRR3091425_counts.txt

undiff

KO

5p4_32cfoxc1

SRR3091426_counts.txt

undiff

KO

5p4_33cfoxc1

SRR3091427_counts.txt

undiff

KO

5p4_34cfoxc1

SRR3091428_counts.txt

diff5days

KO

5p4_35cfoxc1

SRR3091429_counts.txt

diff5days

KO

The first column is the sample name, the second column the file name of the count file generated by STAR (after selection of the appropriate column as we just did), and the remaining columns are description of the samples, some of which will be used in the statistical design.

The design indicates how to model the samples: in the model we need to specify what we want to measure and what we want to control.

Exercise

  • Prepare this file (tab-separated columns) in a text editor: save it as sample_sheet_foxc1.txt in the differential_analysis directory: you can do it “manually” using a text editor, or you can try using the command line.

Note

The same sample sheet will be used for both the STAR and the Salmon DESeq2 analysis. (with a slight modification that we will see later on)

BACK UP

You can download it the following way, in R:

wget https://biocorecrg.github.io/RNAseq_coursesCRG_2026/latest/data/differential_expression/sample_sheet_foxc1.txt

Analysis

The analysis is done in R!

We will use a local RStudio server running a Singularity container as we did in R introduction section.

From RStudio, open a new R script file.

Note that in the R code boxes below # is followed by comments, i.e. words not interpreted by R, as in Linux.

  • Go to the differential_expression working directory and install and load the required packages (loading a package in R allows to use specific sets of functions developed as part of this package).

The main libraries we will use are belong to the Bioconductor project, a repository of R packages for the analysis of omics data.

library(BiocManager)  ## Library to install and manage packages from Bioconductor. 

BiocManager::install("DESeq2") ## Installation of the package for differential expression analysis, only needed the first time in RStudio local installation.
library("DESeq2") ## Loading the package.

BiocManager::install("EnhancedVolcano")
BiocManager::install("tximport")
BiocManager::install("biomaRt")
BiocManager::install("pheatmap")
install.packages("reshape2")

# setwd = set working directory; equivalent to the Linux "cd". All the files you will write will be in this directory
# the R equivalent to the Linux pwd is getwd() = get working directory
# unless you change or specify another address, all the written files will be in this directory. 
setwd("~/rnaseq_course/differential_expression")

Import STAR counts

  • Read in the sample table that we prepared:

# read in the sample sheet
# header = TRUE: the first row is the "header", i.e. it contains the column names
# sep = "\t": the columns/fields are separated with tabs
sampletable <- read.table("sample_sheet_foxc1.txt", header=T, sep="\t")

# add the SRA codes as row names (it is needed for some of the DESeq functions)
rownames(sampletable) <- gsub("_counts.txt", "", sampletable$FileName)

# display the first 6 rows
head(sampletable)

# check the number of rows and the number of columns
nrow(sampletable) # if this is not 10, please raise your hand !
ncol(sampletable) # if this is not 4, also raise your hand !

Warning

DESeq will process only the counts for the files listed in the sample table (keep in mind for future exercises and when you want to exclude some samples from the analysis).

  • Load the count data from STAR into a DESeq object:

    • sampleTable: the sample sheet / metadata we created

    • directory: path to the directory where the counts are stored (one file per sample)

    • design: design formula describing which variables will be used to model the data. Here we want to compare the experimental groups in “Condition”.

# Import STAR counts
se_star <- DESeqDataSetFromHTSeqCount(sampleTable = sampletable,
                        directory = "counts_STAR_selected",
                        design = ~ Condition)

Design formula

The design formula is used to estimate the dispersions and to estimate the log2 fold changes of the model! For more information on how to build a design formula, see here.

  • We will focus the rest of the analysis on the se_star.

Filtering out lowly expressed genes

From DESeq2 vignette: While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2.

Let’s filter:

# Number of genes before filtering
nrow(se_star)

# Filter, we keep only the genes that have at least 10 counts in total (across all samples)
se_star <- se_star[rowSums(counts(se_star)) > 10, ]

# Number of genes left after low-count filtering
nrow(se_star)

Prepare annotation

The biomaRt package is used for adding a more detailed annotation to our data sets.

Tip

When we don’t perform the mapping, this tool is very useful when we receive directly the counts matrix and we lack the annotation file information.

Additionally to the ENSEMBL gene IDs we want (for example):

  • The external gene name (symbol)

  • A more thorough gene description

  • Chromosome and coordinates

# load library
library(biomaRt)

# list ENSEMBL archives
listEnsemblArchives()

# We used release 115 of ENSEMBL, which corresponds to URL <https://sep2025.archive.ensembl.org>
# we can load the corresponding database
mart <- useMart(
  biomart="ENSEMBL_MART_ENSEMBL",               # database name
  host="https://sep2025.archive.ensembl.org", # URL
  path="/biomart/martservice",                  # path after the host to get access to the web service URL 
  dataset="hsapiens_gene_ensembl")              # specie dataset (human in this case)

# "filters" correspond to the input we want to retrieve more annotation for
# a list of available filters can be obtained with listFilters(mart)
head(listFilters(mart))

# let's see what is available in terms of ensembl ID
grep("ensembl", listFilters(mart)[,1], value=TRUE)

# ensembl_gene_id is what we want
# "attributes" correspond to the kind of annotation you want to retrieve
# a list of available attributes can be obtained with listAttributes(mart)

head(listAttributes(mart))

# you can browse and decide what is interesting for you. For this exercise, we will use 'ensembl_gene_id', 'chromosome_name', 'start_position', 'end_position', 'description', 'external_gene_name'

# list of ENSEMBL IDs we want to annotate
gene_ids <- rownames(se_star)
# 18084 IDs

# annotate
annot <- getBM(attributes=c('ensembl_gene_id', 'chromosome_name', 'start_position', 'end_position', 'description', 'external_gene_name'), filters ='ensembl_gene_id', values = gene_ids, mart = mart)
dim(annot)
# 18084 rows

head(annot)

See also

For more information on how to use biomaRt, see here

Fit statistical model

All steps are wrapped up in a single call to DESeq(), which runs three phases: normalization, dispersion estimation, and statistical testing.

se_star2 <- DESeq(se_star)
  • Estimating size factors: corrects for differences in sequencing depth between samples using the median-of-ratios method. Each sample gets a size factor; dividing its raw counts by that factor normalizes it (as wee see in the previously)

  • Gene-wise dispersion estimates: RNA-seq counts are overdispersed (variance > mean). DESeq2 uses a Negative Binomial model with a gene-specific dispersion parameter (α), estimated independently for each gene via maximum likelihood.

  • Mean-dispersion relationship: lowly expressed genes tend to be more variable. DESeq2 fits a smooth trend curve of dispersion vs. mean expression across all genes.

  • Final dispersion estimates: combines gene-wise estimates with the global trend via empirical Bayes shrinkage — noisy outlier estimates are pulled toward the trend, improving reliability with few replicates.

  • Fitting model and testing: fits a Negative Binomial GLM per gene using the design formula, then applies a Wald test to assess whether the log2 fold change is significantly different from zero. P-values are corrected for multiple testing with Benjamini-Hochbergpadj.

Normalized counts

We will extract the DESeq normalized counts, which are calculated by dividing the raw counts by the size factor (that represents the sequencing depth). Then we will log2-transform these normalized counts in order to compress large values and expand small values, thus making distributions more symmetric.

# compute normalized counts (log2 transformed); + 1 is a count added to avoid errors during the log2 transformation: log2(0) gives an infinite number, but log2(1) is 0

# normalized = TRUE: divide the counts by the size factors calculated by the DESeq function
norm_counts <- log2(counts(se_star2, normalized = TRUE)+1)

# add annotation
norm_counts_symbols <- merge(data.frame(ID=rownames(norm_counts), norm_counts, check.names=FALSE), annot, by.x="ID", by.y="ensembl_gene_id", all=F)

# write normalized counts to text file
write.table(norm_counts_symbols, "normalized_counts_log2_star.txt", quote=F, col.names=T, row.names=F, sep="\t")

Why log2-transform the normalized counts?

RNA-seq count data is heavily right-skewed — a handful of housekeeping genes have counts in the tens of thousands while most genes sit near zero. Log transformation compresses that skew and makes the distribution more symmetric. The base-2 choice is purely for interpretability: a difference of 1 on the log2 scale equals exactly a 2-fold change in expression, which is the natural unit biologists think in

Exercise

  • What are the normalized counts corresponding to genes “ENSG00000054598” and “ENSG00000111640”?

  • Calculate the average and the standard deviation of these genes’ normalized counts. How do they differ? What can you tell about them?

Sample QC and Data exploration

  • Transform raw counts to be able to visualize the data

DESeq2 developers advise using transformed counts, rather than normalized counts, for anything involving a distance (e.g. visualization).

Because RNA-seq data typically contain many genes with very low counts and a small number of genes with extremely high counts, the variance of the data depends strongly on the mean expression level. To avoid that PCA or heatmaps are dominated by highly expressed genes, a variance stabilizing transformation (VST) is applied to the normalized counts.

They offer two transformation methods, both of which stabilize the variance across the mean:

  • rlog (Regularized log)

  • VST (Variance Stabilizing Transformation)

Both options produce log2 scale data which has been normalized by the DESeq2 method with respect to library size.

Warning

The values are not on a natural count scale and are not directly interpretable as counts or as fold-changes in the way log2(x+1) is. A difference of 1 between two VST values is not guaranteed to mean a 2-fold change.

From this tutorial: The VST is much faster to compute and is less sensitive to high count outliers than the rlog. The rlog tends to work well on small datasets (n < 30), potentially outperforming the VST when there is a wide range of sequencing depth across samples (an order of magnitude difference). However, authors of DESeq2 generally recommend vst() for most analysis, because it gives almost identical results to rlog and is much faster, on the other hand rlog becomes impractical when we have more than 30 samples.

So let’s use the vst transformation.

As a homework, you can try and use the rlog transformation (function rlog).

# Try with the vst transformation

se_vst <- vst(se_star2)

Samples correlation

The aim of this plot is to compare the expression of all genes for each pair of samples, so if gene expression behaves similarly between two samples, their correlation will be high. The most common measure is Pearson correlation, but you can use others such as Euclidean distance or Spearman correlation.

We expect that replicates show a high correlation, and they will cluster together.

Tip

Good replicate concordance has a Pearson correlation value > 0.9.

Calculate the sample-to-sample distances:

# load libraries pheatmap to create the heatmap plot
library(pheatmap)

# Retrieves the vst counts table
vst_counts <- assay(se_vst)

# Correlation matrix
sampleDists  <- cor(vst_counts, method = "pearson") # method = "spearman" if you want to use Spearman correlation
sampleDistMatrix <- as.matrix(sampleDists )

# prepare a "metadata" object to add a colored bar with the differentiation and condition information
metadata <- sampletable[,c("Differentiation", "Condition")]
rownames(metadata) <- sampletable$SampleName

# create figure in PNG format
png("sample_distance_heatmap_star.png")
pheatmap(sampleDistMatrix, annotation_col=metadata)
# close PNG file after writing figure in it
dev.off()

Sample distance heatmap

Are the samples clustering as expected? Are they clustering better by differentiation or by condition?

Principal Component Analysis (PCA)

Because we can not represent each sample in 20.000 dimensions (number of genes).PCA is used to reduce the dimensionality where each dot is a sample. The axes (PC1, PC2) are not individual genes — they are linear combinations of thousands of genes, weighted by how much they contribute to that PC’s direction. Samples that are close together have similar overall expression profiles; samples far apart are very different.

png("PCA_star.png")
plotPCA(object = se_vst,
  intgroup = c("Condition", "Differentiation"))
dev.off()

PCA plot

The horizontal axis (PC1 = Principal Component 1) represents the highest variation between the samples. Differences along PC1 are more important than differences along PC2.

For the PC1 axis, do samples separate by differentiation or by condition?

Interpreting the PCA plot

  • At this stage, the PCA plot allows us to evaluate whether samples belonging to the same experimental condition cluster together. Ideally, biological replicates should appear close to each other in the plot, indicating similar global gene expression profiles.

  • Do samples cluster by the biological condition you care about (treatment vs control, WT vs KO)?

  • Or do any samples cluster by other factors such as batch, RNA quality, sex, differentiation?

  • If an individual sample does not cluster with the other replicates of its condition, it may indicate a potential technical problem, such as low library complexity, RNA degradation, contamination, or a sample labeling error. In such cases, the sample should be carefully evaluated by reviewing quality control metrics, and sample annotation before deciding whether it should be retained or excluded from further analysis.

Gene expression plots

We can also plot the normalized counts of a gene per sample / experimental group:

# FOXC1 is ENSG00000054598
plotCounts(se_star2, gene="ENSG00000054598", intgroup="Condition")

FOXC1 counts

Let’s produce a more comprehensive plot: we can add the sample names and the differentiation status. To do so, we can use the ggplot2 package.

library(ggplot2)
library(ggrepel)
library(reshape2)

# Retrieve the normalized counts per sample for FOXC1 / ENSG00000054598
tmp <- norm_counts[rownames(norm_counts)=="ENSG00000054598",]

# convert to "long" format
mygenelong <- melt(tmp)

# sample name
mygenelong$name <- rownames(mygenelong)

# sample Condition and Differentiation: merge with sample table
mygenelong <- merge(mygenelong, sampletable, by.x="name", by.y="SampleName", all=F)

# Dot plot
pdot <- ggplot(data=mygenelong, mapping=aes(x=Condition, y=value, col=Differentiation, shape=Condition, label=name)) + 
  geom_point() +
  geom_text_repel(nudge_x=0.2) +  
  xlab(label="Experimental group") +
  ylab(label="Normalized expression (log2)") +
  labs(title = "FOXC1") +
  theme_bw()

ggsave("counts_foxc1_nice.png", pdot)

FOXC1 counts detailed

We can represent it as a boxplot:

# Boxplot
pbox <- ggplot(data=mygenelong,
               mapping=aes(x=Condition, y=value, fill=Differentiation, label=name)) +
  geom_boxplot(alpha = 0.3) +
  geom_jitter(aes(color=Differentiation), width=0.2, size=2)+
  geom_text_repel() +
  xlab("Experimental group") +
  ylab("Normalized expression (log2)") +
  labs(title = "FOXC1") +
  theme_bw()

ggsave("counts_foxc1_nice_boxplot.png", pbox)

Here we can see clearly that in KO this gene was expressed 2^1.5 times higher in 5 days, and same for WT.

FOXC1 counts detailed

  • Comparing FOXC1 and GAPDH expression

Also, we can compare the expression of our study gene with a control gene (GADPH). GAPDH ensembl id ENSG00000111640

# Retrieve the normalized counts per sample for FOXC1 (ENSG00000054598) and GAPDH (ENSG00000111640) genes
tmp<-norm_counts[c("ENSG00000054598","ENSG00000111640"),]

# convert to "long" format
mygenelong <- melt(tmp)
mygenelong

# sample name
colnames(mygenelong) <- c("gene","name","value")

# sample Condition and Differentiation: merge with sample table
mygenelong <- merge(mygenelong, sampletable, by.x="name", by.y="SampleName", all=F)
mygenelong

# Dot plot
pdot_comp <- ggplot(data=mygenelong, mapping=aes(x=Condition, y=value, col=Differentiation, shape=Condition, label=name)) +
  geom_point() +
  geom_text(nudge_x=0.2) +  
  xlab(label="Experimental group") +
  ylab(label="Normalized expression (log2)") +
  facet_wrap(~ gene) +
  theme_bw()

ggsave("counts_foxc1_gapdh.png", pdot_comp)

FOXC1 and GAPDH counts

Differential expression analysis

Now is the moment to retrieve the results of the differential expression analysis for the constrast we are interested in. In this case, it is the comparison between WT and KO.

From results we will obtain the following columns each one with a value for each gene:

|baseMean|log2FoldChange|lfcSE|stat|pvalue|padj|

# check results names: depends on what was modeled. Here it was the "Condition"
resultsNames(se_star2)

# extract results for WT vs KO
 # contrast: the column from the metadata that is used for the grouping of the samples (Condition), then WT is compared to the KO -> results will be as "WT vs KO"
de <- results(object = se_star2, 
  name="Condition_WT_vs_KO")
# This is equivalent to:
de <- results(object = se_star2, contrast=c("Condition", "WT", "KO"))

# If you want the results to be expressed as "KO vs WT", you can run:
# de <- results(object = se_star2, contrast=c("Condition", "KO", "WT"))

# check first rows
head(de)

# add more annotation to "de"
de_symbols <- merge(data.frame(ID=rownames(de), de, check.names=FALSE), annot, by.x="ID", by.y="ensembl_gene_id", all=F)

# write differential expression analysis result to a text file
write.table(de_symbols, "deseq2_results.txt", quote=F, col.names=T, row.names=F, sep="\t")

DESeq2 output

  • log2 fold change: A positive fold change indicates an increase of expression while a negative fold change indicates a decrease in expression for a given comparison. This value is reported in a logarithmic scale (base 2): for example, a log2 fold change of 1.5 in the “WT vs KO comparison” means that the expression of that gene is increased, in the WT relative to the KO, by a multiplicative factor of 2^1.5 ≈ 2.82.

log2 fold change

Fold change (condition B vs A)

Interpretation

log2 = 1

2-fold change

gene is 2× higher in condition B

log2 = 2

4-fold change

gene is 4× higher in condition B

log2 = 3

8-fold change

gene is 8× higher in condition B

log2 = -1

2-fold lower

gene is 2× lower in condition B

log2 = -2

4-fold lower

gene is 4× lower in condition B

log2 = -3

8-fold lower

gene is 8× lower in condition B

  • pvalue: Wald test p-value: Indicates whether the gene analysed is likely to be differentially expressed in that comparison. The lower the more significant.

  • padj: Bonferroni-Hochberg adjusted p-values (FDR): the lower the more significant. More robust than the regular p-value because it controls for the occurrence of false positives.

  • baseMean: Mean of normalized counts for all samples.

  • lfcSE: Standard error of the log2FoldChange.

  • stat: Wald statistic: the log2FoldChange divided by its standard error.

Note on p-values set to NA

Some values in the results table can be set to NA for one of the following reasons (from Analyzing RNA-seq data with DESeq2 by M. Love et al., 2017):

  • If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p value and adjusted p value will all be set to NA.

  • If a row contains a sample with an extreme count outlier then the p value and adjusted p value will be set to NA. These outlier counts are detected by Cook’s distance. If there are very many outliers (e.g. many hundreds or thousands) reported by summary(res), one might consider further exploration to see if a single sample or a few samples should be removed due to low quality.

  • If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p value will be set to NA. This independent filtering can be customized or turned off in the DESeq2 function results(dds, independentFiltering=FALSE).

Volcano plot

A volcano plot combines effect size and statistical significance into a single view, making it one of the most widely used plots in differential expression analysis.

library(EnhancedVolcano)

## Let's select the columns with the gene.name, Log2 foldchange and padjusted value information. 
colnames(de_symbols)
res_for_volc <- de_symbols[, c("external_gene_name","log2FoldChange","padj")]

volcano_plot <- EnhancedVolcano(res_for_volc,
                         lab = res_for_volc$external_gene_name, #column with the gene names for the points out of the defined thresholds for Log2fc and pvalue
                         x = 'log2FoldChange',
                         pCutoff = 0.05, ## padjusted threshold 
                         FCcutoff = 2,   ## Log2 fold change threshold to select upregulated and downregulated genes. 
                         y = 'padj')

pdf("Volcano_plot_of_WT_vs_KO.pdf", width = 10, height = 8)
print(volcano_plot)
dev.off()
  • X-axis: log2 fold change (log2FC) — how much a gene’s expression changes between conditions. Positive values = higher in the first group; negative = lower.

  • Y-axis: −log10(padj) — the adjusted p-value on a negative log scale. The higher a point is on the Y-axis, the more statistically significant the difference.

Each dot is a gene. The plot naturally splits into four regions:

Region

Meaning

Top-right

Significantly upregulated (high FC, low padj)

Top-left

Significantly downregulated (low FC, low padj)

Bottom-center

Not significant (low FC or high padj)

Top-center

Large statistical significance, but small fold change

The name “volcano” comes from the shape: most genes cluster at the bottom (non-significant), with two “plumes” of significant genes rising on the left and right flanks.

How to read it

Focus on the top corners — genes that are both far from zero on the X-axis (large effect) and high on the Y-axis (highly significant). Those are the most biologically meaningful candidates.

Volcano plot

Do we see FOXC1 in the top corners?

Gene selection

  • padj (p-value corrected for multiple testing)

  • log2FC (log2 Fold Change)

The log2FoldChange gives a quantitative information about the expression changes, but does not give information on the within-group variability, hence the reliability of the information:

In the picture below, fold changes for gene A and for gene B between groups t25 and t0 (from another data set) are the same, however the variability between the replicated samples in gene B is higher, so the result for gene A will be more reliable (i.e. the p-value will be smaller).

RNA-seq dispersion

DESeq2 also takes into account the library size, sufficient coverage of a gene, …

We need to take into account the p-value or, better the adjusted p-value (padj).

Setting a p-value threshold of 0.05 means that there is a 5% chance that the observed result is a false positive. For thousands of simultaneous tests (as in RNA-seq, there are thousands of genes tested at the same time), 5% can result in a large number of false positives.

For example, if we test 20000 genes, and we use a p-value threshold of 0.05, we expect to have 1000 false positives.

The Benjamini-Hochberg procedure controls the False Discovery Rate (FDR) (it is one of many methods to adjust p-values for multiple testing).

A FDR adjusted p-value of 0.05 implies that 5% of significant tests according to the “raw” p-value will result in false positives.

So for 1000 significant genes at raw p-value < 0.05, we expect to have 50 false positives.

  • Selection of differentially expressed genes between WT and KO based on padj < 0.05.

# how many genes are differentially expressed, taking into account "padj < 0.05"?
# contains NAs... Filter them out
de_select <- de_symbols[de_symbols$padj < 0.05 & !is.na(de_symbols$padj),]
nrow(de_select)
# 85 genes

# save results in file for further usage
write.table(de_select, "deseq2_selection_padj005.txt", quote=F, col.names=T, row.names=F, sep="\t")
  • Selection of differentially expressed genes between WT and KO based on padj < 0.05 AND log2FC > 0.5 or log2FC < -0.5 (However, note that selecting by log2FoldChange is not required if the selection is done using the padj).

# how many genes are differentially expressed, taking into account "padj < 0.05" and log2FoldChange < -0.5 or > 0.5?
# contains NAs... Filter them out
de_select <- de_symbols[de_symbols$padj < 0.05 & !is.na(de_symbols$padj) & abs(de_symbols$log2FoldChange) > 0.5,]
nrow(de_select)
# 83 genes

Exercise 1

  • Is FOXC1 differentially expressed? What are the corresponding adjusted-value and log2FoldChanges?

  • How many genes are found differentially expressed if you change the log2FoldChange threshold to 0.8 / -0.8 and the padj threshold to 0.01?

Exercise 2

  • Repeat the analysis comparing WT vs KO for the undifferentiated samples only!

  • Steps are:

    • Modify the “sampletable” so that it contains only samples corresponding to “undiff” Differentiation state.

SampleName

FileName

Differentiation

Condition

5p4_25c

SRR3091420_1_counts.txt

undiff

WT

5p4_27c

SRR3091421_1_counts.txt

undiff

WT

5p4_31cfoxc1

SRR3091425_1_counts.txt

undiff

KO

5p4_32cfoxc1

SRR3091426_1_counts.txt

undiff

KO

5p4_33cfoxc1

SRR3091427_1_counts.txt

undiff

KO

  • Read in data DESeqDataSetFromHTSeqCount()

  • Filter out low counts (keep high counts)

  • Fit statistical model DESeq()

  • VST-transform counts vst()

    • Plot PCA and sample-to-sample distances heatmap

  • Check differential expression resultsNames()

    • How many genes are differentially expressed, when considering padj < 0.05?

DON’T FORGET TO WRITE FILES DOWN AT EACH STEP!!

Exercise 3

Control for “Differentiation”

While in Exercise 2 we tested WT vs KO on undifferentiated samples only, we can also use a more complex design formula. If we specify:

~ Differentiation + Condition

it means that we want to test for the effect of the FOXC1 knock out, while controlling for the effect of differentiation. In a way, we “discard” the expected changes due to differentiation to focus on the changes specifically driven by the KO.

  • Repeat the first analysis, changing the design ~ Condition to ~ Differentiation + Condition.

  • How many genes are now found differentially expressed, when filtering for padj < 0.05?

Homework

Do the same using the Salmon counts (object se_salmon): how many genes are found differentially expressed when using the Salmon counts? How do results overlap between STAR and Salmon?

Other cases

Batch effect in the data

Sometimes, we perform RNA-seq experiments one year, and then another year later with the same study, making us wonder if we can merge data from the previous year in order to have more replicates and a better statistical power.

Other times, some samples are processed at different time points, by different operators, or using slightly different protocols (e.g. different RNA extraction kits, different library preparation lots, or different sequencing runs).

In all these cases, unintended technical variation can be introduced into the data — something known as a batch effect.

What is a batch effect?

A batch effect is systematic, non-biological variation in gene expression data introduced by technical factors such as:

  • Different sequencing runs or flow cells

  • Samples processed on different dates or by different operators

  • Different reagent lots or library prep kits

  • Samples stored under different conditions before RNA extraction

    Batch effects can be just as large as (or larger than) the biological signal you are trying to detect, and will confound your differential expression results if not properly accounted for.

    In the worst case scenario, batch effects becomes completely inseparable from the biological signal of interest. For example, if all your control samples were processed in batch 1 and all your treatment samples were processed in batch 2, you will not be able to distinguish between the effect of the treatment and the effect of the batch.

Detect batch effects using PCA

The first step is always to visualize the data and check whether samples cluster by their batch rather than by their biological condition of interest.

On the following example, we have samples that were processed at different weeks but RNA extraction was performed on the same day for all samples. A batch column indicating if samples were processed on week 1 (values = 1) or week 2 (values = 2) is included in the sample table.

Download batch raw counts and sample table data:

wget https://biocorecrg.github.io/RNAseq_coursesCRG_2026/latest/data/differential_expression/batch_example.tar.gz

tar -xvzf batch_example.tar.gz
rm batch_example.tar.gz
setwd("~/rnaseq_course/differential_expression/batch_example")
# Batch Effect Correction

batch_counts <- read.csv("rnaseq_batch_example_raw_counts.txt",header = TRUE, sep="\t")
head(batch_counts) ## this are annotated raw counts.

## Let's set the gene_id as row names
rownames(batch_counts) <- batch_counts$gene_id
colnames(batch_counts)

## Let's prepare matrix with only raw counts for DESeq
matrix_counts <- batch_counts[,9:length((batch_counts))]
colnames(matrix_counts)

## Here we separate the annotation from the counts
annot <- batch_counts[,1:8]
colnames(annot)

## Let's read the sample table
sampletable <- read.csv("batch_sample_table.txt", sep="\t", header=TRUE)
head(sampletable)  

## Creating deseq object from counts matrix
se_matrix <- DESeqDataSetFromMatrix(countData = matrix_counts, colData = sampletable, design = ~treatment)
se_matrix

## Run DESeq
se_2 <- DESeq(se_matrix)

## VST, in this case we use variance stabilizing transformation to normalize the data because we have a small dataset. 
se_vst <- varianceStabilizingTransformation(se_2)

### Let's extract PCA data into a table to create a plot with sample labels and detect samples in one batch and another.

pcaData <- plotPCA(se_vst, intgroup="treatment", returnData=TRUE) ## ReturnData=TRUE  return the data in a dataframe 
percentVar <- round(100 * attr(pcaData, "percentVar")) ## Extracting explained variance (%) for PC1 and PC2

PCA_batch <- ggplot(pcaData, aes(PC1, PC2, color=treatment)) +
                geom_point(size=3) +
                geom_text_repel(aes(label=name), vjust=-1) +
                xlab(paste0("PC1: ", percentVar[1], "% variance")) +
                ylab(paste0("PC2: ", percentVar[2], "% variance")) 

ggsave("PCA_batch.png",PCA_batch,  width = 8, height = 6)

## Batch correction with Combatseq

BiocManager::install("sva")
library(sva)

rownames(sampletable) <- sampletable$sampleName
batch <- sampletable$batch


mat_corrected <- ComBat_seq(
  counts = counts(se_matrix),
  batch = batch)

mat_corrected
head(mat_corrected)

se_correc <- DESeqDataSetFromMatrix(countData = mat_corrected, colData = sampletable, design = ~treatment)
vst_correct <- varianceStabilizingTransformation(se_correc)

png("PCA_batch_corrected.png")
plotPCA(object = vst_correct,
        intgroup = c("treatment"))
dev.off()

library(ggrepel)

pcaData <- plotPCA(vst_correct, intgroup="treatment", returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))

pca_batch_correc <- ggplot(pcaData, aes(PC1, PC2, color=treatment)) +
                      geom_point(size=3) +
                      geom_text_repel(aes(label=name), vjust=-1) +
                      xlab(paste0("PC1: ", percentVar[1], "% variance")) +
                      ylab(paste0("PC2: ", percentVar[2], "% variance")) 

ggsave("PCA_batch_corrected.png",pca_batch_correc,  width = 8, height = 6)

PCA before correction

PCA before correction

PCA after correction

PCA after correction

Homework

Another approach to take into account batch effect is to include the Batch variable in the DESeq2 design formula. This tells DESeq2 to model and account for the batch effect when estimating fold changes and p-values, without modifying the raw count data.

Note

More information https://www.biostars.org/p/403053/

Using the previous sample table and the matrix counts, create a DESeq2 object accounting for batch effect in the design formula.

# Re-create the DESeq2 object with Batch in the design
se_correc <- DESeqDataSetFromMatrix(
  countData = matrix_counts, 
  colData = sampletable, 
  design = ~batch + treatment  # Batch is controlled and treatment is tested
  ) 

Create a vst object and visualize your data using PCA. Does this approach corrects for batch effect?

Outliers detection

In some cases we can find samples that are not clustered with the others. This could be due to several reasons, such as sample preparation errors, technical issues, or biological differences between samples. In this case, we need a review of the different sequencing Quality Control parameters we studied previously to identify possible sample degradation, contamination, or other technical issues. If we find such issues, we can remove the outlier samples from the dataset and re-run the analysis.

PCA with outlier

source: https://www.biostars.org/p/249493/

Sample mislabeling

Sometimes during sample preparation or sample table creation, samples are mislabeled. This can be detected by visualizing the data and checking if samples cluster by their biological condition of interest.

PCA mislabeled sample

Source: Image generated by Anthropic Claude (4.6 Sonnet), 2026, https://claud.ai/

Using the following sample table and matrix counts:

cd ~/rnaseq_course/differential_expression/

wget https://biocorecrg.github.io/RNAseq_coursesCRG_2026/latest/data/differential_expression/mislabeled_sample_example.tar.gz

tar -xvzf mislabeled_sample_example.tar.gz
rm mislabeled_sample_example.tar.gz
  • Create a DESeq2 object using the condition column as the design formula.

  • Generate vst normalized counts.

  • Visualize it using PCA.

¿Do you find any sample that is not in the cluster is should be in?

PCA mislabeled sample

Further reading materials