Hands-on: Pre-processing and Quality Control of Raw Sequencing Data

Data for this course and SRA download

The Sequence Read Archive (SRA) at NCBI is a public repository of raw sequencing data. It is the primary source for publicly available datasets to use for practice, benchmarking, or reproduction of published results.

SRA Accession Types:

Prefix

Level

Example

PRJNA

BioProject — the overall study

PRJNA257197

SAMN / SRS

BioSample — a biological specimen

SAMN03254300

SRX

Experiment — library preparation metadata

SRX123456

SRR / ERR

Run — the actual sequencing reads

SRR1553607

The SRR Run accession is what you need to download reads. A single experiment (SRX) may have multiple runs (SRR). The data that will be using for the hands-on exercises is from GEO data set GSE76647. Specifically, it contains Homo sapiens samples from differentiated (5-day differentiation) and undifferentiated primary keratinocytes. Additionally, some samples underwent a knock-down of the FOXC1 gene.

We could try to download these datasets from SRA directly but due to time constrains, we have already prepared fastq files that correspond to chromosome 6 only. You will find them in the data directory (* ~/RNAseq_coursesCRG_2026/docs/data/reads/*) within the repository of the course.

Raw Data QC

Before any downstream analysis (alignment, read count, differential expression, functional enrichment), it is critical to assess the quality of raw sequencing reads and pre-process them accordingly. Poor-quality data can introduce errors that propagate through the entire pipeline.

Pre-processing includes:

  • Raw data QC:

    • Quality control of initial reads

    • Contamination checks

  • Read pre-processing:

    • Adapter trimming

    • Filtering out low-quality reads/base positions

    • rRNA removal (where applicable)

Tools Overview

Tool

Purpose

FastQC

Per-read quality metrics

FastQ Screen

Contamination screening against reference genomes

Kraken2

Taxonomic classification to detect biological contamination

MultiQC

Aggregates reports from all tools into one summary

FastQC

FastQC is the standard first-pass QC tool for sequencing reads. It reads FASTQ files (also in gzipped format) and produces an HTML report with multiple modules, each assessing a different aspect of read quality.

Key metrics assessed:

  • Per-base sequence quality (Phred scores)

  • Per-sequence quality scores

  • Per-base sequence content (A/T/G/C composition)

  • Per-sequence GC content

  • Per-base N content

  • Sequence length distribution

  • Sequence duplication levels

  • Overrepresented sequences

  • Adapter content

Reminder — Phred quality scores

Phred scores (Q scores) encode the probability of a base-calling error:

Phred Score

Error Probability

Accuracy

Q10

1 in 10

90%

Q20

1 in 100

99%

Q30

1 in 1,000

99.9%

Q40

1 in 10,000

99.99%

Reads are generally considered good quality if the median Phred score is ≥ Q30 across most bases.

Running FastQC

# Run FastQC on a single file
fastqc sample.fastq.gz

# Run on multiple files
fastqc sample1.fastq.gz sample2.fastq.gz

# Specify output directory
fastqc -o ./fastqc_results/ sample.fastq.gz

# Use multiple threads (faster for large files)
fastqc -t 4 -o ./fastqc_results/ *.fastq.gz

Output files

For each sample, two files are excepted:

sample_fastqc.html     # Visual HTML report with bar charts
sample_fastqc.zip      # Zip file with graphs and results in txt

Hands-on: FastQC

# Go to the "quality_control" folder
mkdir ~/RNAseq_coursesCRG_2026/raw_qc
cd ~/RNAseq_coursesCRG_2026/raw_qc

# Run FastQC for one sample
$RUN fastqc ~/RNAseq_coursesCRG_2026/docs/data/reads/SRR3091420_1_chr6.fastq.gz -o .

# Run for all samples
$RUN fastqc ~/RNAseq_coursesCRG_2026/docs/data/reads/*.fastq.gz -o .

# Check output
ls .

Display the results in a browser:

firefox SRR3091420_1_chr6_fastqc.html &

You can also extract the zip archive to access results in text format:

# Extract
unzip SRR3091420_1_chr6_fastqc.zip

# Remove remaining .zip file
rm SRR3091420_1_chr6_fastqc.zip

# Display content of directory
ls SRR3091420_1_chr6_fastqc

# File fastqc_data.txt contains the results in text format
less SRR3091420_1_chr6_fastqc/fastqc_data.txt

Interpreting Key Plots

1. Per Base Sequence Quality

  • Good: Box plots remain in the green zone (Q > 28)

  • Warning: Boxes drop into yellow (Q 20–28)

  • Fail: Boxes drop into red (Q < 20)

  • Common pattern: quality drops at the 3’ end — this is normal for Illumina data

Below is an example of a good quality dataset (top) and a poor quality dataset (bottom). In the latter, the average quality drops dramatically towards the 3’-end.

Good quality per base

Figure 1: Good quality per base sequence

Bad quality per base

Figure 2: Bad quality per base sequence

2. Per Sequence GC Content

  • Should follow a normal bell-shaped curve

  • A shifted peak may indicate contamination or adapter dimers

  • Bimodal distribution = likely contamination from another organism

3. Adapter Content

  • Should be near 0% in well-prepared libraries

  • High adapter content indicates insufficient insert size or failed library prep

  • Adapters must be trimmed before alignment

This problem arises when the sequenced DNA fragment is shorter than the read length — the sequencer runs out of insert and reads into the adapter:

Ideal case (insert > read length):
  5'──────────────── INSERT ────────────────3'
  |←────── Read (150 bp) ──────→|

Problem case (insert < read length):
  5'──── INSERT ────3'
  |←──── Read ─────────────── ADAPTER →|
                   ↑
         Adapter sequence contaminates the read

4. Overrepresented Sequences

  • FastQC will BLAST-match overrepresented sequences

  • Common hits: TruSeq adapters, polyA tails, rRNA

  • Any unknown overrepresented sequence warrants investigation

Important — General rules do not apply to all applications

:class: important

The guides above apply to regular RNA-seq datasets. In other applications, these rules do not hold:

  • Small RNA-seq libraries: insert size is smaller and adapter content is expected to be higher (see example below).

  • Single-cell experiments: read 1 contains the cell barcode and UMI, so its GC content profile may be altered.

FastQC report for a small RNA-seq sample

Figure 3: FastQC report for a small RNA-seq sample showing elevated adapter content

Exercises I

The first part of this exercise is to download publicly available data from SRA, using the SRA toolkit. Below you can find the basic commands to use this tool:

# Download a single run (compressed .sra format)
prefetch SRR1234567

# Convert to FASTQ (single-end)
fasterq-dump SRR1234567 --outdir ./fastq/

# For paired-end data, split into R1 and R2
fasterq-dump SRR1234567 --split-files --outdir ./fastq/

# Compress the output (fasterq-dump does not gzip by default)
gzip ./fastq/SRR1234567.fastq

Now, download the data (accession number SRR36179215) using the SRA toolkit, run FastQC, and answer the questions below.

# Add command:
export RUN_SRA="singularity exec -e /home/training/sra-tools_3.3.0.sif"

cd ~/RNAseq_coursesCRG_2026/
mkdir exercise1_fastqc

# Download from SRA:
$RUN_SRA prefetch SRR36179215
$RUN_SRA fasterq-dump SRR36179215 --outdir ./exercise1/
gzip ./exercise1/SRR36179215.fastq

# Run FastQC
$RUN fastqc ./exercise1/SRR36179215.fastq.gz -o ./exercise1_fastqc/
firefox ./exercise1_fastqc/SRR36179215_fastqc.html &

Discussion questions:

  1. Is this adapter content a sign of a failed experiment? Why or why not? Which type of dataset this might be?

  2. Would you apply the same quality interpretation to a standard mRNA-seq dataset with this level of adapter content?

Note

Reads are shorter than what we can normally expect nowadays, because it is old data and also because it is convenient to run analyses faster for the course.

FastQ Screen

FastQ Screen screens reads to detect cross-species contamination or unexpected biological sequences. It is particularly useful for:

  • Detecting human contamination in non-human samples (and vice versa)

  • Identifying mycoplasma, PhiX, or E. coli contamination

  • Verifying the species of origin of a sample

The tool works by:

  1. Taking a subsample (~100,000 reads by default, adjustable with --subset)

  2. Aligning reads to each reference genome using Bowtie2

  3. Reporting the percentage of reads mapping to each reference

  4. Generating a stacked bar chart showing mapping distribution

Setting Up Databases

Warning

Do not run the following command in class — it will take too much time and resources.

# Download default databases
$RUN fastq_screen --get_genomes

This downloads 14 Bowtie2 indexes (model organisms and known contaminants):

  • Arabidopsis thaliana, Drosophila melanogaster, Escherichia coli

  • Homo sapiens, Mus musculus, Rattus norvegicus

  • Caenorhabditis elegans, Saccharomyces cerevisiae

  • Lambda, Mitochondria, PhiX, Adapters, Vectors, rRNA

Upon download, the FastQ_Screen_Genomes folder is created. You must provide a fastq_screen.conf configuration file with paths to the Bowtie2 binary and genome indices:

# This is a configuration file for fastq_screen

BOWTIE2 /usr/local/bin/bowtie2

## Human
DATABASE Human /path/to/FastQ_Screen_Genomes/Human/Homo_sapiens.GRCh38

## Mouse
DATABASE Mouse /path/to/FastQ_Screen_Genomes/Mouse/Mus_musculus.GRCm38

## E. coli
DATABASE Ecoli /path/to/FastQ_Screen_Genomes/E_coli/Ecoli

## PhiX
DATABASE PhiX /path/to/FastQ_Screen_Genomes/PhiX/phi_plus_SNPs

## Adapters
DATABASE Adapters /path/to/FastQ_Screen_Genomes/Adapters/contaminants

Running FastQ Screen

In the command below, you can see how to run fastq_screen in a single sample. IMPORTANT: due to time constrains, please do not run fastq_screen. If there is time, feel free to try it at the end of the hands-on session.

$RUN fastq_screen --conf FastQ_Screen_Genomes/fastq_screen.conf \
           sample.fq.gz \
           --outdir ~/RNAseq_coursesCRG_2026/raw_qc/

To view example results for SRR3091420_1_chr6.fastq.gz:

FastQ Screen results

Figure 4: Fastq_screen results

Output Files

sample_screen.html     # Visual HTML report with bar charts
sample_screen.txt      # Tab-delimited mapping statistics

Example _screen.txt output:

#Fastq_screen version: 0.15.3
Genome    #Reads_processed  #Unmapped  %Unmapped  #One_hit_one_genome  %One_hit  #Multiple_hits  %Multiple
Human     100000            95020      95.02      4830                 4.83      150             0.15
Mouse     100000            99780      99.78      180                  0.18      40              0.04
Ecoli     100000            99950      99.95      45                   0.05      5               0.00
PhiX      100000            100000     100.00     0                    0.00      0               0.00

Interpreting Results

FastQ Screen results require careful interpretation — context matters depending on which databases are in your config.

Column

Meaning

One hit, one genome

Read maps uniquely to this genome only

Multiple hits, one genome

Read maps to this genome in multiple locations, but nowhere else

One hit, multiple genomes

Read maps uniquely within this genome, but also maps to other genomes

Multiple hits, multiple genomes

Read maps to multiple locations in this genome and in other genomes

Exercises II

  1. Check both fastq_screen outputs below. Inspect them carefully and justify which one reflects a contamination and which one is expected.

Good quality per base

Figure 5: Fastq screen output 1. Image taken from web.

Bad quality per base

Figure 6: Fastq screen output 2. Image taken from web.

  1. Inspect the fastq_screen output. What do you think is happening to the sample? Would you need to ask more information about it before reaching any conclusion?

Good quality per base

Figure 7: Fastq screen output 3.

Kraken2

Kraken2 performs k-mer based taxonomic classification of sequencing reads. Unlike FastQ Screen — which aligns to specific genomes — Kraken2 classifies reads against a broad taxonomic database, enabling detection of unexpected organisms at genus/species level.

Use cases:

  • Detecting microbial contamination in eukaryotic samples

  • Metagenomics community profiling

  • Validating microbial sequencing samples

  • Identifying viral contamination

How Kraken2 Works

  1. Breaks each read into k-mers (default k=35)

  2. Looks up each k-mer in a pre-built hash table of taxonomic assignments

  3. Uses a lowest common ancestor (LCA) algorithm to assign a taxonomy

  4. Reports classification at each taxonomic level (phylum → species)

Basic Usage

# Classify single-end reads
kraken2 \
    --db kraken2_db/ \
    --threads 8 \
    --report sample_kraken2_report.txt \
    --output sample_kraken2_output.txt \
    sample.fastq.gz

# Classify paired-end reads
kraken2 \
    --db kraken2_db/ \
    --threads 8 \
    --paired \
    --report sample_kraken2_report.txt \
    --output sample_kraken2_output.txt \
    sample_R1.fastq.gz sample_R2.fastq.gz

Understanding the Output

The report file (--report) follows this tab-delimited format:

% reads   # reads  # reads at taxon  rank  taxID   name
 78.50    78500    420               D     2       Bacteria
 15.20    15200    15200             S     9606    Homo sapiens
  5.10    5100     200               D     10239   Viruses
  1.20    1200     1200              U     0       unclassified

Column

Description

% reads

Percentage of reads assigned to this taxon

# reads

Number of reads under this taxon (including children)

# reads at taxon

Reads assigned directly (not to children)

rank

D=Domain, P=Phylum, C=Class, O=Order, F=Family, G=Genus, S=Species

taxID

NCBI taxonomy ID

name

Taxonomic name

Hands-on: Kraken2

# Uncompress the kraken2 database:
cd ~/RNAseq_indices/
tar -zxf minikraken2_v2_8GB_201904_UPDATE.tar.gz

# Classify single-end reads
cd ~/RNAseq_coursesCRG_2026

$RUN kraken2 \
    --db ~/RNAseq_indices/minikraken2_v2_8GB_201904_UPDATE \
    --output ~/RNAseq_coursesCRG_2026/raw_qc/SRR3091420_1_chr6_kraken_output.txt \
    --report ~/RNAseq_coursesCRG_2026/raw_qc/SRR3091420_1_chr6_kraken_report.txt \
    docs/data/reads/SRR3091420_1_chr6.fastq.gz
    

Exercises III:

Now, run Kraken2 for all of our samples and check the results. Are what do you expected? Do they match with those from fastq_screen?

# Go to the repository directory:
cd ~/RNAseq_coursesCRG_2026

# Run Kraken2 in all samples:

for fastq_file in ~/RNAseq_coursesCRG_2026/docs/data/reads/*.fastq.gz; do
  
    filename=$(basename "$fastq_file")

    # Strip extension(s): handles .fastq.gz, .fq.gz, .fastq, .fq
    sample_name="${filename%%.*}"

    # Set output and report names:
    output_path="$HOME/RNAseq_coursesCRG_2026/raw_qc/${sample_name}_kraken_output.txt"
    report_path="$HOME/RNAseq_coursesCRG_2026/raw_qc/${sample_name}_kraken_report.txt"


    $RUN kraken2 \
        --db ~/RNAseq_indices/minikraken2_v2_8GB_201904_UPDATE \
        --output $output_path \
        --report $report_path \
        $fastq_file
; done

MultiQC

MultiQC aggregates QC reports from multiple tools and multiple samples into a single interactive HTML report. Instead of reviewing dozens of individual FastQC HTML files, MultiQC produces one unified dashboard.

Supported tools include: FastQC, FastQ Screen, Kraken2, Trimmomatic, STAR, HISAT2, Salmon, featureCounts, Picard, samtools, and many more.

Running MultiQC

cd ~/RNAseq_coursesCRG_2026/

# Create a folder for the MultiQC result
mkdir multiqc_report
cd ~/RNAseq_coursesCRG_2026/multiqc_report

# Link QC, trimming and mapping data
ln -s ~/RNAseq_coursesCRG_2026/raw_qc .

# Run MultiQC on the directory
$RUN multiqc .

# Visualize in browser
firefox multiqc_report.html &

MultiQC report

Figure 8: Example of multiQC report (html).


Read Pre-processing: adapter trimming and rRNA removal

After running raw QC, two issues commonly stand out in the MultiQC report that must be addressed before alignment:

Problem

Consequence if not addressed

Adapter sequences

Misalignments, spurious variant calls, reduced mapping rate

Low quality 3’ bases

Increased error rate in alignments and assemblies

rRNA reads

Wasted sequencing depth; biased expression estimates

Adapter Trimming

The Landscape of Tools

Several mature tools exist for adapter trimming and quality filtering. They share the same core goal but differ in speed, flexibility, and default behaviour.

Tool

Language

Strengths

Typical Use Case

Trimmomatic

Java

Highly configurable, widely used

General Illumina trimming

Cutadapt

Python

Precise adapter specification, great docs

When adapters are known exactly

TrimGalore

Perl wrapper

Auto-detects adapters, wraps Cutadapt + FastQC

Most RNA-seq / WGS workflows

fastp

C++

Very fast, all-in-one QC + trim

Large datasets, speed-critical

BBDuk

Java (BBTools)

Flexible k-mer trimming, contaminant removal

Complex filtering needs

PRINSEQ

Perl

Broad filtering options including complexity

Metagenomics, older workflows

How Trimming Works

All tools follow the same general logic:

Step 1 — Adapter Sequence Matching: The tool is given (or detects) the known adapter sequence and searches for it within each read. When found, the adapter and everything 3’ of it is clipped. Most tools allow a configurable error rate to account for sequencing errors within the adapter.

Read:   ACGTACGTACGTACGT[AGATCGGAAGAGC...]
                         ↑ adapter match
Result: ACGTACGTACGTACGT

Step 2 — Quality-Based Trimming: Separately from adapter removal, tools can trim bases from the 3’ end below a Phred quality threshold. The most common algorithm is a sliding window approach: start from the 3’ end, calculate average quality in a window (e.g., 4 bp), remove it if below threshold, and repeat moving inward until the window passes.

Step 3 — Minimum Length Filtering: After trimming, reads that are too short to align reliably are discarded entirely (common threshold: 20–36 bp).

Important: how paired-end reads are handled

In paired-end mode, both reads of a pair must be handled together. If one read is discarded (too short after trimming), its partner must also be removed or placed in an “orphan” file to maintain pairing integrity — aligners require paired reads to be in the same order.

Trimming with TrimGalore

TrimGalore is a wrapper around Cutadapt and FastQC that adds automatic adapter detection, integrated FastQC re-running, and sensible defaults — making it the most practical choice for standard Illumina RNA-seq and WGS data.

# Single-end reads
trim_galore sample.fastq.gz

# Specify output directory and run FastQC on output
trim_galore --fastqc -o trimmed/ sample.fastq.gz

Key options:

trim_galore \
    --quality 20 \              # Trim 3' bases below Q20 (default: 20)
    --length 36 \               # Discard reads shorter than 36 bp after trimming
    --stringency 3 \            # Minimum adapter overlap (default 1; increase to avoid over-trimming)
    --fastqc \                  # Run FastQC on output
    --cores 4 \                 # Parallel processing
    -o trimmed_output/ \
    sample.fastq.gz

Specifying adapters manually:

# Illumina TruSeq adapter (most common for single-end)
trim_galore \
    --adapter AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
    sample.fastq.gz

# Nextera adapter
trim_galore --nextera sample.fastq.gz

# Small RNA (miRNA) adapter
trim_galore --small_rna sample.fastq.gz

Hands-on: Trim Galore

cd ~/RNAseq_coursesCRG_2026/

# Create a folder for the trimmed data:
mkdir trimmed_data

$RUN trim_galore \
    --quality 20 \
    -o trimmed_data/ \
    ~/RNAseq_coursesCRG_2026/docs/data/reads/SRR3091420_1_chr6.fastq.gz

Output files:

trimmed_data/
├── SRR3091420_1_chr6_trimmed.fq.gz                      # Trimmed reads
└── SRR3091420_1_chr6.fastq.gz_trimming_report.txt       # Trimming stats

Reading the trimming report:

Total reads processed:                 835,169
Reads with adapters:                   277,510 (33.2%)
Reads written (passing filters):       835,169 (100.0%)

Total basepairs processed:    40,923,281 bp
Quality-trimmed:                       0 bp (0.0%)
Total written (filtered):     40,540,114 bp (99.1%)

Key numbers to check:

  • Reads with adapters: confirms the adapter contamination seen in FastQC

  • Reads written (passing filters): should remain high (>95%); low values suggest aggressive trimming or very short inserts

Exercises IV

Adapt the code provided in exercise III (Kraken2) and, together with the trim-galore code above, modify it to run trimming in all of our samples. Afterwards, check the results and see if all samples behave similarly.

rRNA Removal with RiboDetector

Even after trimming, RNA-seq libraries often retain a proportion of ribosomal RNA reads from incomplete ribo-depletion (e.g., RiboZero, RNase H) or polyA selection inefficiency. These reads:

  • Consume sequencing depth without contributing to gene expression estimates

  • Inflate total read counts relative to informative mRNA reads

RiboDetector is a deep learning-based tool that accurately identifies and removes rRNA reads without relying on a reference database. It uses a neural network trained on a broad set of rRNA sequences from bacteria, archaea, and eukaryotes, enabling robust detection even for divergent or novel rRNA sequences.

Advantages over alignment-based tools:

  • No reference database download or maintenance required

  • Faster and more memory-efficient than alignment-based approaches

  • Handles partial rRNA matches and divergent sequences well

  • GPU acceleration available for large datasets

Basic Usage

# Single-end reads
ribodetector_cpu \
    -t 8 \
    -l 49 \
    -i sample_R1_trimmed.fq.gz \
    -e rrna \
    --rrna rrna.fq.gz \
    -o nonrrna.fq.gz

# Paired-end reads
ribodetector_cpu \
    -t 8 \
    -l 49 \
    -i sample_R1_trimmed.fq.gz sample_R2_trimmed.fq.gz \
    -e rrna \
    --rrna rrna_R1.fq.gz rrna_R2.fq.gz \
    -o nonrrna_R1.fq.gz nonrrna_R2.fq.gz

Key Parameters

Parameter

Description

-t

Number of threads

-l

Read length (bp) — should match your actual read length after trimming

-i

Input FASTQ file(s)

-e rrna

Ensure rRNA reads are written to the --rrna output (use norrna to keep only clean reads)

--rrna

Output file for detected rRNA reads

-o

Output file for non-rRNA reads (use downstream)

--chunk_size

Number of reads processed per batch (default: 256; increase for speed if RAM allows)

Hands-on: Ribodetector

cd ~/RNAseq_coursesCRG_2026/

# Create a folder for the trimmed data:
mkdir ribodetector
cd ribodetector

# Ribodetector does not give us a % of rRNA reads per se, so we will have to calculate it apart of running the tool:

tot=$(( $(zcat ~/RNAseq_coursesCRG_2026/trimmed_data/SRR3091420_1_chr6_trimmed.fq.gz | wc -l) / 4 ))

length=` zcat ~/RNAseq_coursesCRG_2026/trimmed_data/SRR3091420_1_chr6_trimmed.fq.gz | awk '{num++}{if(num%4==2) {seq++; len+=length(\$0)}}END{print int(len/seq)}' `

$RUN ribodetector_cpu \
-l $length \
-o SRR3091420_1_chr6_nonrrna.fq.gz -e rrna -r SRR3091420_1_chr6_rrna.fq.gz \
-i ~/RNAseq_coursesCRG_2026/trimmed_data/SRR3091420_1_chr6_trimmed.fq.gz

awk -v tot=$tot \
    '{num++} END{print id" "num/4/tot*100}' \
    <(zcat SRR3091420_1_chr6_rrna.fq.gz) > SRR3091420_1_chr6_trimmed_rna_perc.txt

Output Files

ribodetector_output/
├── SRR3091420_1_chr6_rrna.fq.gz       # rRNA reads (removed from analysis)
└── SRR3091420_1_chr6_nonrrna.fq.gz    # Clean reads
└── SRR3091420_1_chr6_trimmed_rna_perc.txt # Percentage of %rRNA reads

Interpreting the Output

A rRNA percentage of 1–10% is typical for well-depleted libraries. Values >20–30% suggest ribo-depletion failure and should be flagged, though the reads can still be usable after removal.

MultiQC - Post-Processing QC

Now, once we have finished the preprocessing, it is time to colapse all results into a multiQC report, which will contain pre- and post-trimming results, enabling direct side-by-side comparison.

Hands-on: multiQC

# Go to the working directory: 
cd ~/RNAseq_coursesCRG_2026/multiqc_report

# Link trimming data:
ln -s ~/RNAseq_coursesCRG_2026/trimmed_data .

# Run multiQC:
$RUN multiqc raw_qc/ trimmed_data/ -n 'Raw_trimmed_qc' -d --dirs-depth 1 --force

firefox ~/RNAseq_coursesCRG_2026/multiqc_report/Raw_trimmed_qc.html &

Final Exercise

The file ~/RNAseq_coursesCRG_2026/docs/data/reads/example2_reads.fq.gz contains a set of single-end reads from an unknown sample. Apply the full pre-processing pipeline we covered in this session:

  1. Run FastQC on the raw reads and inspect the HTML report.

  2. Run FastQ Screen to check for cross-species contamination.

  3. Run Kraken2 to perform taxonomic classification.

  4. Run TrimGalore to remove adapters and low-quality bases.

  5. Run RiboDetector to remove rRNA reads from the trimmed output.

  6. Run FastQC again on both the processed reads.

  7. Aggregate all results with MultiQC and compare the pre- and post-processing reports.

Once you have inspected all the outputs, consider the following questions:

  • Which results from the quality control are unexpected for a standard RNA-seq dataset? Describe what you observe and why it deviates from what would normally be considered “good” quality.

  • Can you think of a specific sequencing application where those results would actually be expected and perfectly normal? Explain your reasoning.


Further Reading

Documentation

Key Articles