Pre-processing of sequencing reads
The raw data obtained from the sequencing machine need to be pre-processed.
Pre-processing can include the quality control of initial reads and read trimming that includes removing adapter sequences, filtering out low quality reads and trimming reads off low quality base pairs.
Quality control of sequencing reads
FastQC calculates statistics about the composition and quality of raw sequences, while Fastq Screen looks for possible contaminations.
# Go to the "quality_control" folder cd ~/rnaseq_course/quality_control # Run FastQC for one sample $RUN fastqc ~/rnaseq_course/raw_data/fastq_chr6/SRR3091420_1_chr6.fastq.gz -o . # Run for all samples $RUN fastqc ~/rnaseq_course/raw_data/fastq_chr6/*fastq.gz -o .
The output files are a .zip archive and an .html file
We can display the results (.html file) with an Internet browser; e.g. Firefox:
firefox SRR3091420_1_chr6_fastqc.html &
Below is an example of a poor quality dataset. As you can see, the average quality drops dramatically towards the 3’-end.
You can extract the files from the .zip archive:
# extract unzip SRR3091420_1_chr6_fastqc.zip # remove remaining .zip file rm SRR3091420_1_chr6_fastqc.zip # display content of directory ls SRR3091420_1_chr6_fastqc
File fastqc_data.txt contains the results in text format, for easier parsing of the results:
FastQ Screen is a tool that allows to screen libraries for potential contaminations.
It requires to download genome indices (data bases) from a variety of organisms, a lot of which can be downloaded by default.
Note that fastq_screen can be found in the singularity image, but not the data bases!
WARNING: do not run the following command in class: it will take too much time and resources!
# Download default data bases $RUN fastq_screen --get_genomes
The latter commands download 14 indexes from Bowtie2 mapper (from model organisms or known contaminants):
- Arabidopsis thaliana
- Drosophila melanogaster
- Escherichia coli
- Homo sapiens
- Mus musculus
- Rattus norvegicus
- Caenorhabditis elegans
- Saccharomyces cerevisiae
Upon download, the FastQ_Screen_Genomes folder is created, containing all indexes.
The file fastq_screen.conf will be also downloaded in this folder: in order to use the tool, you will have to modify the fastq_screen.conf by providing the full path to the Bowtie2 executable (/usr/local/bin/bowtie2 if you use our singularity image) and full paths to the downloaded folders with genome index files.
Here we show where to change the executables:
# This is a configuration file for fastq_screen ########### ## Bowtie # ########### ## If the bowtie binary is not in your PATH then you can ## set this value to tell the program where to find it. ## Uncomment the line below and set the appropriate location ## #BOWTIE /usr/local/bin/bowtie/bowtie BOWTIE2 /usr/local/bin/bowtie2 ...
FastQ Screen runs checks on a random subset of 100,000 reads (that can be changed using option –subset).
You can execute FastQ Screen this way:
$RUN fastq_screen --conf FastQ_Screen_Genomes/fastq_screen.conf \ ~/rnaseq_course/raw_data/SRR3091420_1.fastq.gz \ --outdir ~/rnaseq_course/quality_control/
Below is an example of the FastQ Screen results for SRR3091420_1.fastq.gz.
cd ~/rnaseq_course/quality_control/ # get files from fastq_screen run wget https://public-docs.crg.es/biocore/projects/training/PHINDaccess2020/SRR3091420_1_fastq_screen.tar.gz # extract tar -zvxf SRR3091420_1_fastq_screen.tar.gz firefox SRR3091420_1_screen.html
Initial processing of sequencing reads
Before mapping reads to the genome/transcriptome or performing a de novo assembly, the reads have to be pre-processed, if needed, as follows:
- Demultiplex by index or barcode (usually done in the sequencing facility)
- Remove/Trim adapter sequences
- Trim reads by quality
- Discard reads by quality/ambiguity
- Filter reads by k-mer coverage (recommended for the de novo assembly)
- Normalize k-mer coverage (recommended for the de novo assembly)
As shown before, both the presence of low quality reads and adapters are reported in the fastqc output.
Adapters are usually expected in small RNA-Seq because the molecules are typically shorter than the reads, and that makes an adapter to be present at 3’-end.
Below is an example of the FastQC report for a small RNA-seq sample:
Let’s use skewer to trim the Illumina 3’ adapter.
cd ~/rnaseq_course/trimming $RUN skewer ~/rnaseq_course/raw_data/fastq_chr6/SRR3091420_1_chr6.fastq.gz \ -x TGGAATTCTCGGGTGCCAAGG \ -o SRR3091420_1_chr6 \ -z
.--. .-. : .--': :.-. `. `. : `'.' .--. .-..-..-. .--. .--. _`, :: . `.' '_.': `; `; :' '_.': ..' `.__.':_;:_;`.__.'`.__.__.'`.__.':_; skewer v0.2.2 [April 4, 2016] Parameters used: -- 3' end adapter sequence (-x): TGGAATTCTCGGGTGCCAAGG -- maximum error ratio allowed (-r): 0.100 -- maximum indel error ratio allowed (-d): 0.030 -- minimum read length allowed after trimming (-l): 18 -- file format (-f): Solexa/Illumina 1.3+/Illumina 1.5+ FASTQ (auto detected) -- minimum overlap length for adapter detection (-k): 3 Tue Feb 4 11:00:56 2020 >> started |=================================================>| (100.00%) Tue Feb 4 11:01:40 2020 >> done (43.669s) 12157169 reads processed; of these: 0 ( 0.00%) short reads filtered out after trimming by size control 0 ( 0.00%) empty reads filtered out after trimming by size control 12157169 (100.00%) reads available; of these: 480367 ( 3.95%) trimmed reads available after processing 11676802 (96.05%) untrimmed reads available after processing log has been saved to "SRR3091420_1-trimmed.log".
We can look at the read distribution after the trimming of the adapter by inspecting the log-file or re-launching FastQC.
cd ~/rnaseq_course/quality_control/ $RUN fastqc ~/rnaseq_course/trimming/SRR3091420_1_chr6-trimmed.fastq.gz -o . firefox SRR3091420_1_chr6-trimmed_fastqc.html &
Example of a FastQC report for a trimmed small-RNA sample:
Now run skewer for all samples
for fastq in ~/rnaseq_course/raw_data/fastq_chr6/*fastq.gz do echo $fastq $RUN skewer $fastq \ -x TGGAATTCTCGGGTGCCAAGG \ -o $(basename $fastq .fastq.gz) \ -z done
Let’s explore the tool skewer in more detail: use “skewer –help” for a description of the parameters.
- Which parameter indicates the minimum read length allowed after trimming? What is its default value?
- Which parameter indicates the threshold on the average read quality to be filtered out?
- Using skewer, filter out reads in “SRR3091420_1-trimmed.fastq” that have average quality below 30 and trim them on 3’-end until the base quality has reached 30. How many reads were filtered out and how many remain?