Talk: Experimental design of RNA-seq experiments

Side note: scope of this course

This course focuses on bulk mRNA-seq experimental design and differential expression analysis. Many principles apply equally to total RNA-seq, small RNA-seq, scRNA-seq and long-read RNA-seq, but these technologies introduce additional considerations that are briefly noted throughout the material.


Reproducibility in RNA-seq

RNA-seq measures thousands of genes and relies on statistical models to detect differences between conditions.

Reproducibility is affected by:

  • sample size,

  • experimental variability,

  • technical bias.

Careful experimental design is required to:

  • reduce bias,

  • control variation,

  • estimate biological variability.

Important

Design determines whether RNA-seq results are interpretable.



Experimental design begins with the biological question

Think

biological question
→ experimental design
→ sample preparation
→ sequencing
→ analysis

Important

The biological question determines:

  • what to measure,

  • how to design the experiment,

  • and how to analyze the data.

Examples:

  • Which genes change expression between treated and control samples?

  • Do alternative isoforms differ between conditions?

  • Which cell types respond to treatment?


Core principles of RNA-seq experimental design

  • Replication → estimate biological variability

  • Randomization → avoid systematic bias

  • Blocking → control known sources of variation


In RNA-seq experiments:
  • replication → independent biological samples

  • randomization → distribute samples across batches and lanes

  • blocking → include batch in the model (design = ~ batch + condition)


Replication or Why independent samples are required

Replication allows estimation of biological variability, which is needed to judge whether an observed difference between conditions exceeds what random variation alone would produce.


Golden rule of replication

Replicates must be independent biological units, not repeated measurements of the same sample.


Experimental unit

The experimental unit is the smallest entity that independently receives the treatment.

It defines what counts as a biological replicate.

| Term | Meaning |
|---|---|
| Experimental unit | entity receiving the treatment |
| Sample | material measured (e.g. RNA) |


Example:
  • experimental unit = mouse

  • sample = RNA extracted from each mouse
    → each mouse = one biological replicate


Biological vs technical replicates

Biological replicates are samples derived from independent biological sources.
They capture natural biological variability.

Examples:

  • different individuals

  • independent cell cultures

  • independent tissue samples


Technical replicates are repeated measurements of the same biological sample.

Examples:

  • different library preparations from the same RNA sample

  • sequencing the same library in multiple lanes

  • repeated sequencing runs of the same library


Simple rule:

  • 3 different mice → biological replicates

  • 3 libraries from the same RNA → technical replicates

Figure: technical vs. biological replicates (adapted from Bernd Klaus, EMBO 2015)


Side note: scRNA-seq

In single-cell RNA-seq experiments, the biological replicate is usually the sample or donor, not the individual cell. Thousands of cells from the same sample do not replace biological replication. Treating cells as replicates leads to pseudoreplication. In differential expression analysis, cells are typically aggregated per sample (“pseudobulk”) to restore the correct unit of replication.
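
The pseudobulk idea can be sketched in a few lines of Python. This is a minimal illustration only: the function name `pseudobulk`, the donor labels and the gene names are hypothetical, and real workflows usually aggregate per sample and per cell type with dedicated single-cell tooling.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum per-cell gene counts into one count profile per sample."""
    totals = defaultdict(lambda: defaultdict(int))
    for sample_id, counts in cells:
        for gene, count in counts.items():
            totals[sample_id][gene] += count
    return {sample: dict(genes) for sample, genes in totals.items()}


# Hypothetical toy data: one (sample_id, {gene: count}) entry per cell.
cells = [
    ("donor1", {"GeneA": 2, "GeneB": 0}),
    ("donor1", {"GeneA": 1, "GeneB": 3}),
    ("donor2", {"GeneA": 0, "GeneB": 5}),
]

profiles = pseudobulk(cells)
print(profiles)
```

After aggregation there is one profile per donor, so the number of replicates equals the number of donors rather than the number of cells.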


Pseudoreplication

Treating non-independent samples as replicates leads to pseudoreplication.

Example:

  • one mouse → 3 libraries

  • treated as 3 samples → incorrect


Important

The number of libraries does not determine replication.
The number of independent experimental units does.
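
As a quick check on a sample sheet, the number of biological replicates can be counted from the experimental units rather than from the libraries. The sketch below assumes a hypothetical sample sheet given as `(library_id, unit_id, condition)` tuples; the function name and all identifiers are made up for illustration.

```python
def count_biological_replicates(libraries):
    """Count independent experimental units per condition.

    `libraries` is a list of (library_id, unit_id, condition) tuples.
    """
    units = {}
    for _library_id, unit_id, condition in libraries:
        units.setdefault(condition, set()).add(unit_id)
    return {condition: len(ids) for condition, ids in units.items()}


# Hypothetical sample sheet: three libraries from one treated mouse,
# one library each from two control mice.
libraries = [
    ("lib1", "mouse1", "treated"),
    ("lib2", "mouse1", "treated"),
    ("lib3", "mouse1", "treated"),
    ("lib4", "mouse2", "control"),
    ("lib5", "mouse3", "control"),
]

replicates = count_biological_replicates(libraries)
print(replicates)
```

Here the treated group has three libraries but only one biological replicate.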


Questions: Are these samples technical replicates, biological replicates, or pseudoreplicates?

  1. Three samples from the same patient at different time points (treated as independent samples)

  2. 1000 cells were isolated from a tumor sample. The cells were divided into three batches, and three RNA-seq libraries were prepared, sequenced, and analyzed independently.

  3. RNA was extracted from a single mouse liver sample. Three independent RNA-seq libraries were prepared from this RNA extraction by 3 different people and sequenced separately.

  4. Bone marrow pooled from 6 mice per group to form two samples.


Technical vs biological variability in RNA-seq

Modern RNA-seq protocols have low technical variability (SEQC Consortium 2014).
In contrast, biological variability is substantial and differs between biological systems, so the statistical power of differential expression analysis depends strongly on the number of biological replicates (Schurch et al., RNA 2016).


Important

Biological variability is typically much larger than technical variability.

→ Biological replication is more important than technical replication.

Side note: scRNA-seq

In single-cell experiments, technical variability can be large due to capture efficiency, dropout events, and amplification bias. This is why scRNA-seq analysis typically includes additional steps such as cell filtering, normalization methods designed for sparse data, and sometimes pseudobulk aggregation.

Side note: total RNA-seq

Total RNA-seq often relies on rRNA depletion rather than poly(A) selection. The efficiency of rRNA removal can vary between samples and may influence the proportion of informative reads.


Replication and statistical power

Statistical power is the probability of detecting a true difference in gene expression between conditions.


Note

  • The desired power of the research experiment is usually above 80%.

  • In clinical studies, power might be required to be above 90%.


The statistical power of an RNA-seq experiment depends on:

  • number of biological replicates: 2 replicates → typically insufficient, 5-6 replicates → good power for many studies

  • effect size (magnitude of expression change): log2FC = 3 is easy to detect; log2FC = 0.5 requires more replicates

  • biological variability within each condition: higher variability → lower power

  • sequencing depth: more reads → lower-expressed genes are detected
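
How replicate number, effect size and variability interact can be illustrated with a textbook normal approximation for a single gene. This is a rough sketch under strong assumptions, not how RNA-seq tools compute power: it treats log2 expression as normally distributed with a known standard deviation `sd`, uses a two-sample z-test, and ignores count modelling and multiple testing. The function name and the example values are hypothetical.

```python
from statistics import NormalDist


def approx_power(n_per_group, log2fc, sd, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test on log2 expression.

    Assumes normally distributed log2 expression with the same known
    standard deviation `sd` in both groups; the negligible contribution
    of the opposite tail is ignored.
    """
    normal = NormalDist()
    standard_error = sd * (2.0 / n_per_group) ** 0.5
    z_crit = normal.inv_cdf(1.0 - alpha / 2.0)
    return normal.cdf(abs(log2fc) / standard_error - z_crit)


# Hypothetical values: within-group sd of 0.5 on the log2 scale.
for n in (2, 3, 6, 10):
    small = approx_power(n, log2fc=0.5, sd=0.5)
    large = approx_power(n, log2fc=3.0, sd=0.5)
    print(f"n = {n:2d}: power(log2FC = 0.5) = {small:.2f}, power(log2FC = 3) = {large:.2f}")
```

In a genome-wide analysis the effective significance threshold is much stricter after multiple-testing correction, so real power is lower than these single-gene numbers suggest.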


Important

Increasing the number of biological replicates improves power more than increasing sequencing depth.



(a) Increase in biological replication significantly increases the number of DE genes identified. Numbers of sequencing reads have a diminishing return after 10 M reads. Line thickness indicates depth of replication, with 2 replicates the darkest and 7 replicates the lightest. The lines are smoothed averages for each replication level, with the shaded regions corresponding to the 95% confidence intervals.
(b) Power of detecting DE genes increases with both sequencing depth and biological replication. Similar to the trends in (a), increases in the power showed diminishing returns after 10 M reads.
Figure is adapted from Liu et al., 2014


Overpowered vs underpowered experiment

If all other parameters remain the same, a larger experiment will have more power than a smaller experiment.

However, if an experiment is too large and a smaller experiment would have achieved the same statistical result, the experiment is overpowered.

On the other hand, if an experiment is too small, it may be underpowered.

Warning

Both underpowered and overpowered experiments waste subjects, money, time and effort, and are potentially unethical.


Side note: scRNA-seq

Increasing the number of cells per sample improves cell-type resolution but does not increase statistical power for detecting differential expression between conditions. Power depends primarily on the number of biological samples (donors or experimental units).


Side note: ONT RNA-seq

For isoform-level analyses using long reads, statistical power depends strongly on read depth per transcript, because many isoforms are rare and require sufficient long-read coverage to be detected reliably.


Number of replicates in RNA-seq experiments


Statistical properties of edgeR (exact) as a function of log2FC threshold, T, and the number of replicates, nr. (A) The fraction of all (7126) genes called as SDE as a function of the number of replicates nr. (B) Mean true positive rate (TPR) as a function of nr for four thresholds (solid curves). (C) Mean TPR as a function of T for nr (solid curves). (D) The number of genes called as TP, FP, TN, and FN as a function of nr.
Figure is adapted from Schurch, et al., RNA, 2016

The study showed that:

  • With 3 replicates: many differentially expressed genes were missed (Fig. A: <30% of SDE genes identified); genes with small fold changes were missed most often, while genes with large changes were still detected (Fig. B: TPR = 0.8 for log2FC > 1).

  • With 6 replicates: most strong signals were detected (Fig. B: TPR = 0.87 for log2FC > 1; TPR = 0.8 for log2FC > 0.5).

  • More than 10 replicates might be required to detect small expression changes (Fig. C: TPR = 0.9 for log2FC > 0.3, i.e. FC ≈ 1.23).
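
The thresholds above are given on the log2 scale. A small helper (hypothetical name `fold_change`) converts them to linear fold changes, which are often easier to interpret:

```python
def fold_change(log2fc):
    """Convert a log2 fold change to a linear fold change."""
    return 2.0 ** log2fc


for threshold in (0.3, 0.5, 1.0, 2.0):
    print(f"log2FC = {threshold} -> FC = {fold_change(threshold):.2f}")
```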


Side note: scRNA-seq

In many scRNA-seq studies, only a few donors are sequenced but thousands of cells are obtained per donor. While this enables detailed cell-type characterization, robust differential expression between biological conditions still requires multiple independent samples.


Side note: ONT RNA-seq

Long-read RNA sequencing experiments sometimes include fewer biological replicates due to cost and sequencing throughput. In such cases, results should often be interpreted as exploratory, especially for differential expression analyses.


When the number of replicates is limited

In practice, RNA-seq experiments often use a small number of biological replicates due to cost or sample availability.

A large-scale analysis (Degen & Medo, 2025) showed that small-cohort experiments frequently produce results that do not replicate well.


DEG performance metrics as a function of the cohort size. Each symbol summarizes the median of 100 cohorts. All panels show results using the DESeq2 Wald test with abs(log2FC) above 1.
Figure is adapted from Degen & Medo, 2025


Key observations:

  • Small cohorts (N ≤ 3) show low replicability (Fig A, except SNF2 dataset)

  • Precision can be high even with a few samples (Fig C & D)

  • Recall is low, so many true differences are missed (for all data sets except SNF2, recall is below 0.5 for N<7)

→ Small experiments detect mainly strong effects
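
Precision and recall, as used in the observations above, can be computed by comparing the genes called in a small cohort with a reference set. The sketch below uses the standard definitions; the gene identifiers and the function name are hypothetical and do not come from the cited study.

```python
def precision_recall(called, truth):
    """Precision and recall of a called DEG set against a reference set."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else float("nan")
    recall = true_positives / len(truth) if truth else float("nan")
    return precision, recall


# Hypothetical example: 8 reference DEGs, of which a small cohort finds 2.
truth = {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"}
small_cohort_calls = {"g1", "g2"}

precision, recall = precision_recall(small_cohort_calls, truth)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy case every called gene is correct (high precision), but most true differences are missed (low recall).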



Heat maps and fold change estimates for SNF2 and LMAB data sets.
Left column: Heat maps showing the logCPM correlation of samples for the SNF2 and LMAB data sets. Heat map rows and columns were ordered using hierarchical Ward clustering.
Right column: Fold change estimates of expressed genes in the SNF2 and LMAB data sets. Blue dots represent the ground truth estimate from the full data set. Gray (red) bars represent the interquartile range of estimates obtained from 100 subsampled cohorts of size N = 3 (N = 15). The horizontal dashed line shows the logFC threshold used to define DEGs.
Figure is adapted from Degen & Medo, 2025

A second analysis shows that effect size estimates become unstable when replication is low.

Key observations:

  • In heterogeneous datasets, fold-change estimates vary widely (LMAB dataset: samples were derived from heterogeneous tumor tissues) → effect sizes can be over- or underestimated

  • Homogeneous systems behave more consistently (SNF2 data set: cell colonies) → variability depends on the biological system


Practical implications

When replication is limited:

  • treat results as hypothesis-generating

  • interpret results cautiously (especially in heterogeneous samples)

  • focus on genes with larger effect sizes

  • validate key findings independently

What does “focus on larger effect sizes” mean?
Instead of testing the null hypothesis log2FC = 0, test whether abs(log2FC) exceeds a threshold, lfcThreshold (e.g. 1).

DESeq2 implementation:

results(dds, lfcThreshold = 1)

Thus, instead of asking

Is the gene differentially expressed?

you ask

Is the fold change larger than a biologically meaningful threshold?
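
The difference between the two questions can be illustrated with a simplified Wald-style calculation. This is a sketch of the general idea only and is not DESeq2's implementation; the function name, the estimate `lfc` and its standard error `se` are hypothetical.

```python
from statistics import NormalDist


def wald_p_value(lfc, se, threshold=0.0):
    """Two-sided Wald-style p-value for H0: abs(log2FC) <= threshold.

    With threshold = 0 this reduces to the usual test of log2FC = 0.
    """
    z = (abs(lfc) - threshold) / se
    return min(1.0, 2.0 * (1.0 - NormalDist().cdf(z)))


# Hypothetical gene: estimated log2FC of 1.5 with standard error 0.3.
p_any_change = wald_p_value(1.5, 0.3, threshold=0.0)
p_above_one = wald_p_value(1.5, 0.3, threshold=1.0)
print(f"H0: log2FC = 0       -> p = {p_any_change:.2e}")
print(f"H0: abs(log2FC) <= 1 -> p = {p_above_one:.3f}")
```

The same estimate gives strong evidence that the gene changes at all, but much weaker evidence that the change exceeds a two-fold threshold.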

Rule of thumb for RNA-seq experiments

  • 2 replicates → insufficient

  • 3 replicates → minimum

  • 5–6 replicates → good / recommended

  • 10+ replicates → robust

Small experiments detect only large expression changes;
subtle effects require more replication.


Warning on replicate correlation

High replicate concordance (e.g. high correlation) does not replace biological replication.
Similarity reflects consistency, but not the variability required for statistical inference.

Sequencing depth

For bulk RNA-seq differential expression (human/mouse):

  • ~30 million mapped reads per sample (ENCODE guideline)
    → typically ~35–40 million sequenced reads
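
The step from mapped to sequenced reads is a division by the expected mapping rate. The sketch below assumes mapping rates between 75% and 85%, which are illustrative values rather than figures from the guideline; the function name is hypothetical.

```python
import math


def sequenced_reads_needed(target_mapped_reads, mapping_rate):
    """Reads to sequence so that `target_mapped_reads` are expected to map."""
    if not 0.0 < mapping_rate <= 1.0:
        raise ValueError("mapping_rate must be in (0, 1]")
    return math.ceil(target_mapped_reads / mapping_rate)


# Assumed mapping rates, for illustration only.
for rate in (0.75, 0.80, 0.85):
    needed = sequenced_reads_needed(30_000_000, rate)
    print(f"mapping rate {rate:.0%}: about {needed / 1e6:.1f} million reads")
```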

Typical recommendations:

| Application | Reads per sample |
|---|---|
| Gene-level differential expression | 35–40 million |
| Isoform / splicing analysis | 50–100 million |
| Low-abundance transcripts | >100 million |

Sequencing depth affects:

  • detection of low-expressed genes

  • precision of expression estimates

  • isoform resolution


Important

Increasing sequencing depth improves sensitivity,
but does not replace biological replication.

More reads help detect genes; more samples help detect differences.

Side note: scRNA-seq

Sequencing depth is typically expressed per cell rather than per sample. Typical recommendations for droplet-based scRNA-seq range from 20,000 to 50,000 reads per cell for gene-level analysis, although deeper sequencing may be required for rare transcripts.


Randomization and batch effects

RNA-seq experiments include both:

  • biological factors (e.g. treatment, sex, genotype)

  • technical factors (e.g. batch, sample processing order, library prep, sequencing lane, flowcell)

If technical factors are aligned with biological conditions,
they become confounded, and biological effects cannot be separated from technical variation.

“Confounding” definition

  • Confounding (Adjective): Used to describe something that causes confusion or bewilderment (e.g., “the confounding details”) (Merriam-Webster dictionary).

  • In statistics, confounding means mixing two effects so they cannot be separated.

In RNA-seq: if a technical factor (e.g. batch) is aligned with the biological condition, you cannot tell which one caused the difference.


Example of confounding (bad design)

| Lane | Samples |
|---|---|
| Lane 1 | control, control, control |
| Lane 2 | treated, treated, treated |

→ lane effect = treatment effect
→ impossible to distinguish


Proper randomization (good design)

| Lane | Samples |
|---|---|
| Lane 1 | control, treated, control |
| Lane 2 | treated, control, treated |

→ technical variation is distributed across conditions


Important

Randomize samples across all experimental steps:

  • RNA extraction

  • library preparation

  • sequencing
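
One simple way to randomize while keeping conditions spread across lanes is to shuffle the samples within each condition and then deal them out in turn. The sketch below is one possible scheme, not a prescribed protocol; the function name, sample identifiers and seed are hypothetical. Recording the seed makes the assignment reproducible.

```python
import random
from collections import defaultdict


def assign_to_lanes(samples, n_lanes, seed=0):
    """Randomly assign samples to lanes, spreading each condition across lanes.

    `samples` is a list of (sample_id, condition) tuples.
    """
    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)

    lanes = defaultdict(list)
    slot = 0
    for condition in sorted(by_condition):
        ids = by_condition[condition]
        rng.shuffle(ids)  # random order within the condition
        for sample_id in ids:
            lanes[slot % n_lanes + 1].append((sample_id, condition))
            slot += 1  # deal samples to lanes in turn
    return dict(lanes)


# Hypothetical experiment: 3 control and 3 treated samples, 2 lanes.
samples = [(f"c{i}", "control") for i in range(1, 4)]
samples += [(f"t{i}", "treated") for i in range(1, 4)]

lanes = assign_to_lanes(samples, n_lanes=2, seed=1)
for lane, members in sorted(lanes.items()):
    print(f"Lane {lane}: {members}")
```

The same function can be reused for other steps, such as RNA extraction or library preparation batches, by treating each batch as a "lane".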


Batch effects in RNA-seq

A batch is a group of samples processed together under the same technical conditions during an experimental step.

How to detect potential batches?

Ask whether technical conditions differed across samples:

  • Were all RNA isolations performed on the same day?

  • Were all library preparations performed on the same day?

  • Did the same person perform the RNA isolation/library preparation for all samples?

  • Did you use the same reagents for all samples?

  • Did you perform the RNA isolation/library preparation in the same location?

If any answer is “No”, then technical batches may exist.

Having batches does not automatically mean there is a batch effect problem.

Batches are common and often unavoidable in real experiments.


Batches exist but are randomized (good design)

| Batch | Samples |
|---|---|
| Batch 1 | control, treated |
| Batch 2 | control, treated |
| Batch 3 | control, treated |
| Batch 4 | control, treated |

Here, batch and condition are independent.

Batch effects may exist but can be separated from the biological effect.

Solution: batch can be ignored if its effect is small, or included in the model (in DESeq2, design = ~ batch + condition)


Batches exist and are partially correlated (risky design)

| Batch | Samples |
|---|---|
| Batch 1 | treated, treated |
| Batch 2 | control, treated |
| Batch 3 | control, treated |
| Batch 4 | control, control |

Here, batch and condition are partially correlated.

Batch effects can still be modeled statistically, but the design is suboptimal and may reduce statistical power.

Solution: include batch in the model (in DESeq2, design = ~ batch + condition)


Batch is fully confounded with condition (fatal design error)

| Batch | Samples |
|---|---|
| Batch 1 | treated, treated |
| Batch 2 | control, control |
| Batch 3 | treated, treated |
| Batch 4 | control, control |

treatment effect = batch effect

The biological condition is perfectly confounded with the batch variable.

No statistical method can separate these effects.

Batch effect rule

Batch effects are problematic only when they are correlated with the biological condition.
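
The three layouts shown above can be told apart by cross-tabulating batch against condition before any sample is processed. The following sketch assumes a two-condition experiment; the function name and the category labels it returns are hypothetical.

```python
from collections import Counter, defaultdict


def classify_design(assignments):
    """Classify a batch layout for a two-condition experiment.

    `assignments` is a list of (batch, condition) pairs, one per sample.
    """
    per_batch = defaultdict(Counter)
    for batch, condition in assignments:
        per_batch[batch][condition] += 1
    conditions = {condition for _, condition in assignments}

    # No batch contains both conditions: effects cannot be separated.
    if all(len(counts) == 1 for counts in per_batch.values()):
        return "fully confounded"
    # Every batch holds the same number of samples from each condition.
    balanced = all(
        len({counts[c] for c in conditions}) == 1 for counts in per_batch.values()
    )
    return "balanced" if balanced else "partially confounded"


def layout(*batches):
    """Expand per-batch condition lists into (batch, condition) pairs."""
    return [(f"Batch {i}", c) for i, conds in enumerate(batches, start=1) for c in conds]


C, T = "control", "treated"
good = layout([C, T], [C, T], [C, T], [C, T])
risky = layout([T, T], [C, T], [C, T], [C, C])
fatal = layout([T, T], [C, C], [T, T], [C, C])

for name, design in (("good", good), ("risky", risky), ("fatal", fatal)):
    print(name, "->", classify_design(design))
```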


Blocking or How to control known variation

While randomization protects against factors that are not modeled or cannot be modeled,
blocking is used when a known source of variation exists (for example, library preparation batches or patients).
Blocking explicitly models batch as a variable.

Even if batches contain the same number of samples per condition, processing order can introduce bias (e.g., reagent degradation, operator fatigue - technical factors that cannot be modeled).

Example: systematic bias caused by processing order
Batch 1

| Sample | treated | treated | treated | control | control | control |
|---|---|---|---|---|---|---|
| Processing order | 1 | 2 | 3 | 4 | 5 | 6 |


Batch 2

| Sample | control | control | control | treated | treated | treated |
|---|---|---|---|---|---|---|
| Processing order | 1 | 2 | 3 | 4 | 5 | 6 |

Since the processing order can introduce unknown bias that cannot be modeled, randomization has to be applied.

Randomized processing order
Batch 1

| Sample | control | treated | control | treated | treated | control |
|---|---|---|---|---|---|---|
| Processing order | 1 | 2 | 3 | 4 | 5 | 6 |


Batch 2

| Sample | treated | control | treated | control | control | treated |
|---|---|---|---|---|---|---|
| Processing order | 1 | 2 | 3 | 4 | 5 | 6 |


Randomizing the processing order distributes potential technical variation (e.g. reagent degradation or operator effects) across biological conditions.
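
A processing order can be drawn with a seeded shuffle within each batch. This is a minimal sketch with hypothetical sample names and a hypothetical function name; a fixed seed keeps the order reproducible for the lab notebook.

```python
import random


def randomize_processing_order(batches, seed=0):
    """Return a shuffled processing order for the samples in each batch."""
    rng = random.Random(seed)
    randomized = {}
    for batch, samples in batches.items():
        order = list(samples)  # copy so the input is left unchanged
        rng.shuffle(order)
        randomized[batch] = order
    return randomized


# Hypothetical batches with 3 treated and 3 control samples each.
batches = {
    "Batch 1": ["treated_1", "treated_2", "treated_3", "control_1", "control_2", "control_3"],
    "Batch 2": ["control_4", "control_5", "control_6", "treated_4", "treated_5", "treated_6"],
}

orders = randomize_processing_order(batches, seed=42)
for batch, order in orders.items():
    for position, sample in enumerate(order, start=1):
        print(batch, position, sample)
```

A random shuffle can occasionally produce an order that still groups one condition together, so it is worth inspecting the result and redrawing with a new recorded seed if that happens.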

Key message

Randomize what you cannot control,
block what you can control (using batch in statistical model: design = ~ batch + condition).


Factorial designs and interactions

Many RNA-seq experiments involve more than one biological factor.

Example:

| Treatment | Sex |
|---|---|
| control | male |
| control | female |
| drug | male |
| drug | female |

Each factor has two levels, producing four experimental groups.
This is a 2 × 2 factorial design.

Such a design allows testing:

  • Main effect of treatment - Does the treatment change gene expression?

  • Main effect of sex - Do males and females differ in gene expression?

  • Interaction between treatment and sex - Does the treatment affect males and females differently?

An interaction occurs when the effect of one factor depends on the level of another factor.

Interaction = difference of differences:

(treated_males − control_males) − (treated_females − control_females)

Example for a specific gene in an RNA-seq experiment:

| Sex | Control | Drug |
|---|---|---|
| Male | low expression | high expression |
| Female | low expression | unchanged |

→ treatment effect exists only in males
→ suggests treatment × sex interaction

That is,

(treated_male − control_male) ≠ (treated_female − control_female)
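
The difference of differences can be computed directly from group means. The values below are hypothetical log2 expression values for a single gene, chosen to mimic the pattern in the table (three replicates per group).

```python
from statistics import mean

# Hypothetical log2 expression values for one gene.
log2_expression = {
    ("male", "control"): [2.1, 1.9, 2.0],
    ("male", "drug"): [6.0, 5.8, 6.2],
    ("female", "control"): [2.0, 2.2, 1.8],
    ("female", "drug"): [2.1, 1.9, 2.0],
}
group_mean = {group: mean(values) for group, values in log2_expression.items()}

# Treatment effect within each sex, then the difference of those effects.
effect_male = group_mean[("male", "drug")] - group_mean[("male", "control")]
effect_female = group_mean[("female", "drug")] - group_mean[("female", "control")]
interaction = effect_male - effect_female

print(f"effect in males:   {effect_male:.2f}")
print(f"effect in females: {effect_female:.2f}")
print(f"interaction:       {interaction:.2f}")
```

A value near zero would mean the treatment acts similarly in both sexes; here the interaction is as large as the male treatment effect itself.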

In RNA-seq analysis, for each gene, a model such as

design = ~ sex + treatment + sex:treatment

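estimates three terms. With R's default treatment coding, they are interpreted as follows:

| Term | Interpretation |
|---|---|
| sex | difference between the sexes in the reference treatment group (control) |
| treatment | treatment effect in the reference sex |
| sex:treatment | whether the treatment effect differs between sexes |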

The pattern in the example above suggests a potential treatment × sex interaction, because the treatment increases expression in males but not in females. However, whether this interaction is statistically significant depends on the variability between replicates and the number of samples.
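
How the model terms relate to the four group means depends on how the factors are coded. The sketch below assumes treatment (dummy) coding, which is R's default, with reference levels female and control (the alphabetically first levels). The coefficient labels are illustrative and are not the names a particular package reports; the group means are hypothetical.

```python
def treatment_coded_coefficients(group_mean):
    """Map 2 x 2 group means to coefficients of ~ sex + treatment + sex:treatment.

    Assumes treatment (dummy) coding with reference levels
    sex = "female" and treatment = "control".
    """
    reference = group_mean[("female", "control")]
    sex_male = group_mean[("male", "control")] - reference
    treatment_drug = group_mean[("female", "drug")] - reference
    effect_in_males = group_mean[("male", "drug")] - group_mean[("male", "control")]
    return {
        "Intercept": reference,
        "sex_male": sex_male,  # sex difference within the control group
        "treatment_drug": treatment_drug,  # drug effect in females only
        "sex_male:treatment_drug": effect_in_males - treatment_drug,
    }


# Hypothetical mean log2 expression per group for one gene.
group_mean = {
    ("male", "control"): 2.0,
    ("male", "drug"): 6.0,
    ("female", "control"): 2.0,
    ("female", "drug"): 2.0,
}

coefficients = treatment_coded_coefficients(group_mean)
for name, value in coefficients.items():
    print(f"{name}: {value}")

# Drug effect in males = treatment term + interaction term.
effect_in_males = coefficients["treatment_drug"] + coefficients["sex_male:treatment_drug"]
print("drug effect in males:", effect_in_males)
```

Under this coding the treatment term alone is zero for this gene even though the drug has a strong effect in males, which is why the reference levels must be known before interpreting individual coefficients.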


Practical implications

Factorial designs allow researchers to:

  • test multiple biological hypotheses in a single experiment

  • increase experimental efficiency

  • avoid performing multiple separate experiments

However, factorial designs require sufficient replication in each experimental group to reliably estimate effects and interactions.

When multiple factors are studied, the number of experimental groups increases because each combination of factor levels must be measured.

If each group contains N biological replicates, the total number of samples becomes:

total samples = number of groups × N

For example, for a 2 × 2 factorial design (4 experimental groups):

| Replicates per group | Total samples |
|---|---|
| 3 | 12 |
| 5 | 20 |
| 8 | 32 |
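
The same arithmetic extends to designs with more factors or more levels, since the number of groups is the product of the levels of each factor. A minimal sketch (hypothetical function name):

```python
from math import prod


def total_samples(levels_per_factor, replicates_per_group):
    """Total samples for a full factorial design with equal replication."""
    return prod(levels_per_factor) * replicates_per_group


# 2 x 2 design (e.g. treatment x sex) at different replication levels.
for n in (3, 5, 8):
    print(f"{n} replicates per group -> {total_samples([2, 2], n)} samples")

# Adding a third factor with 3 levels (e.g. three time points).
print("2 x 2 x 3 design, 3 replicates ->", total_samples([2, 2, 3], 3), "samples")
```

Each added factor multiplies the sample count, which is why replication per group is often the first thing to suffer in large factorial designs.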

In RNA-seq experiments, factorial designs are common when studying treatment effects across sex, genotype, time points, or environmental conditions.

Practical rule of thumb for factorial design

  • 3 biological replicates per group → minimum (exploratory; sufficient mainly for homogeneous systems)

  • 5–6 per group → reasonable for main effects and strong interactions

  • 8+ per group → more reliable detection of moderate to small interactions


Key design rules for RNA-seq experiments

  • Define the experimental unit correctly

  • Use sufficient biological replication (3 minimum, preferably 5–6)

  • Randomize samples across all experimental steps

  • Record and model batch effects when present

  • Ensure adequate sequencing depth (~30M mapped reads)

  • Use factorial designs when studying multiple factors


Important

Replication determines power.
Design determines whether results are interpretable.

