Mapping using Salmon
Salmon is a tool for quantifying the expression of transcripts using RNA-seq data.
It is a quasi-mapper: it doesn’t produce the read alignments (and doesn’t output BAM/SAM files). Salmon “quasi-maps” reads to the transcriptome rather than to the genome.
Salmon can also make use of pre-computed alignments (in the form of a SAM/BAM file) instead of the FASTQ files.
Building the Salmon index
To make an index for Salmon, we need transcript sequences in the FASTA format.
This can be found easily in GENCODE.
The transcript sequences corresponding to chromosome 6 was prepared and already downloaded in ~/rnaseq_course/reference_genome/
Salmon does not need any decompression of the input, so we can index by using this command:
cd ~/rnaseq_course/mapping
# index and store the index files in index_salmon folder
$RUN salmon index -t ~/rnaseq_course/reference_genome/reference_chr6/gencode.v33.transcripts.chr6.fa.gz \
-i index_salmon \
--gencode
We add the parameter –gencode as our data come from Gencode version 33 and their header contains several identifiers separated by the character |. This parameter allows the program to parse the header and keep only the transcript identifier.
Quantifying transcript expression
To quantify reads with Salmon, we need to specify the type sequencing library, aka Fragment Library Types in Salmon, using three letters:
The first:
Symbol | Meaning | Reads |
---|---|---|
I | inward | -> … <- |
O | outward | <- … -> |
M | matching | -> … -> |
The second:
Symbol | Meaning |
---|---|
S | stranded |
U | unstranded |
The third:
Symbol | Meaning |
---|---|
F | read 1 (or single-end read) comes from the forward strand |
R | read 1 (or single-end read) comes from the reverse strand |
From the STAR output for read counts we already know that for the analyzed experiment the U (Unstranded) library was used.
If the library was paired-end and sequenced with a stranded reverse library, we would set the parameter to ISR.
If we want to assign the reads to the genes (option -g) in addition to transcripts we have to provide a GTF file corresponding to the transcript version which was used to build the Salmon index.
We have it already for chromosome 6, in ~/rnaseq_course/reference_genome/
We can proceed with the mapping.
cd ~/rnaseq_course/mapping
# create folder for salmon's output files
mkdir alignments_salmon
$RUN salmon quant -i index_salmon -l U \
-r ~/rnaseq_course/raw_data/fastq_chr6/SRR3091420_1_chr6.fastq.gz \
-o alignments_salmon/SRR3091420_1_chr6_salmon \
-g ~/rnaseq_course/reference_genome/reference_chr6/gencode.v33.annotation_chr6.gtf.gz \
--seqBias \
--validateMappings
We can check the results inside the folder “alignments”.
ls alignments_salmon/SRR3091420_1_chr6_salmon/
For explanation of all output files, see the Salmon documentation.
The most interesting to us in this course is the file quant.genes.sf, that is a tab-separated file containing the read counts for genes:
Column | Meaning |
---|---|
Name | Gene name |
Length | Gene length |
EffectiveLength | Effective length after considering biases |
TPM | Transcripts Per Million |
NumReads | Estimated number of reads considering both univocally and multimapping reads |
head -n 5 alignments_salmon/SRR3091420_1_chr6_salmon/quant.genes.sf
Name Length EffectiveLength TPM NumReads
ENSG00000285803.1 1152 1116.21 7.96961 15.764
ENSG00000285712.1 1590 1545.58 2.19064 6
ENSG00000285824.1 1120 860.855 6.91601 10.551
ENSG00000285884.1 790 515.683 3.28285 3
There is a similarly formatted file quant.sf that provides read counts for transcript:
head -n 5 alignments_salmon/SRR3091420_1_chr6_salmon/quant.sf
Name Length EffectiveLength TPM NumReads
ENST00000016171.5 2356 1970.742 659.861626 2304.468
ENST00000020673.5 4183 5925.497 0.000000 0.000
ENST00000173785.4 925 868.802 0.000000 0.000
ENST00000181796.6 3785 3216.057 0.000000 0.000
We will use information on read counts for genes from quant.sf files for the differential expression (DE) analysis.
Now if time and resources allows, proceed with the mapping of the 9 remaining samples:
cd ~/rnaseq_course/mapping
for fastq in ~/rnaseq_course/raw_data/fastq_chr6/SRR309142{1,2,3,4,5,6,7,8,9}_1_chr6.fastq.gz
do echo $fastq
$RUN salmon quant -i index_salmon -l U \
-r ${fastq} \
-o alignments_salmon/$(basename $fastq .fastq.gz)_salmon \
-g ~/rnaseq_course/reference_genome/reference_chr6/gencode.v33.annotation_chr6.gtf.gz \
--seqBias \
--validateMappings
done