Mapping using STAR
For the STAR running options, see STAR Manual.
Building the STAR index
To make an index for STAR, we need both the genome sequence in FASTA format and the annotation in GTF format. We will be building an index only for chromosome 10.
Let’s look at the files we will need in the directory “annotations”:
ls -alht annotations
total 136M
drwxr-xr-x 9 lcozzuto Bioinformatics_Unit 1.6K Apr 30 17:37 ..
drwxr-xr-x 2 lcozzuto Bioinformatics_Unit 253 Apr 30 17:37 .
-rw-r--r-- 1 lcozzuto Bioinformatics_Unit 63M Apr 30 16:56 gencode.v29.transcripts.fa.gz
-rw-r--r-- 1 lcozzuto Bioinformatics_Unit 39M Apr 17 16:25 Homo_sapiens.GRCh38.dna.chromosome.10.fa.gz
-rw-r--r-- 1 lcozzuto Bioinformatics_Unit 1.5M Apr 17 16:08 gencode.v29.annotation_chr10.gtf.gz
STAR requires unzipped .fa and .gtf files. Let’s unzip them, using the option -k which allows to keep .gz files. This works on Mac, but dpending on the version might not work on the cluster:
gunzip -k gencode.v29.annotation_chr10.gtf.gz
gunzip -k Homo_sapiens.GRCh38.dna.chromosome.10.fa.gz
This works on any Linux:
zcat gencode.v29.annotation_chr10.gtf.gz > gencode.v29.annotation_chr10.gtf
zcat Homo_sapiens.GRCh38.dna.chromosome.10.fa.gz > Homo_sapiens.GRCh38.dna.chromosome.10.fa
Q. How much (in percentage) disk space is saved when those two files are kept zipped vs unzipped?
Once index is built, we have to not forget to remove those unzipped files!
To index the genome with STAR for RNA-seq analysis, the sjdbOverhang option needs to be specified for detecting possible splicing sites. It usually equals to the minimum read size minus 1; it tells STAR what is the maximum possible stretch of sequence that can be found on one side of a spicing site. In our case, since the read size is 51 bases, we can accept maximum 50 bases on one side and one base on the other of a splicing site; that is, to set up this parameter to 50. This also means that for every different read-length to be aligned a new STAR index needs to be generated. Otherwise a drop in aligned reads can be experienced.
Building the STAR index (–runMode genomeGenerate):
cd ..
mkdir indexes
mkdir indexes/chr10
$RUN STAR --runMode genomeGenerate --genomeDir indexes/chr10 \
--genomeFastaFiles annotations/Homo_sapiens.GRCh38.dna.chromosome.10.fa \
--sjdbGTFfile annotations/gencode.v29.annotation_chr10.gtf \
--sjdbOverhang 50 --outFileNamePrefix chr10
Apr 30 18:21:00 ..... started STAR run
Apr 30 18:21:00 ... starting to generate Genome files
Apr 30 18:21:05 ... starting to sort Suffix Array. This may take a long time...
Apr 30 18:21:06 ... sorting Suffix Array chunks and saving them to disk...
Apr 30 18:25:18 ... loading chunks from disk, packing SA...
Apr 30 18:25:42 ... finished generating suffix array
Apr 30 18:25:42 ... generating Suffix Array index
Apr 30 18:26:37 ... completed Suffix Array index
Apr 30 18:26:37 ..... processing annotations GTF
Apr 30 18:26:38 ..... inserting junctions into the genome indices
Apr 30 18:27:08 ... writing Genome to disk ...
Apr 30 18:27:09 ... writing Suffix Array to disk ...
Apr 30 18:27:16 ... writing SAindex to disk
Apr 30 18:27:24 ..... finished successfully
Remove unzipped files:
rm annotations/gencode.v29.annotation_chr10.gtf
rm annotations/Homo_sapiens.GRCh38.dna.chromosome.10.fa
Aligning reads to the genome (and counting them at the same time!)
To use STAR for the read alignment (default –runMode option), we have to specify the following options:
- the index directory (–genomeDir)
- the read files (–readFilesIn)
- if reads are compressed or not (–readFilesCommand)
The following options are optional:
- type of output (–outSAMtype). Defaul is “BAM Unsorted”; STAR outputs unsorted Aligned.out.bam file(s). “The paired ends of an alignment are always adjacent, and multiple alignments of a read are adjacent as well. This ”unsorted” file cannot be directly used with downstream software such as HTseq, without the need of name sorting.”
- the path for the output directory and prefix of all output files prefix (–outFileNamePrefix). By default, this parameter is ./, i.e. all output files are written in the current directory.
- (–quantMode). With –quantMode GeneCounts option STAR will count the number of reads per gene while mapping. A read is counted if it overlaps (1nt or more) one and only one gene. Both ends of the paired- end read are checked for overlaps. The counts coincide with those produced by htseq-count with default parameters. This option requires annotations (GTF or GFF with –sjdbGTFfile option) used at the genome generation step, or at the mapping step.” (from STAR Manual)
mkdir alignments
$RUN STAR --genomeDir indexes/chr10 \
--readFilesIn resources/A549_0_1chr10_1.fastq.gz resources/A549_0_1chr10_2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--outFileNamePrefix alignments/A549_0_1
Apr 30 18:50:13 ..... started STAR run
Apr 30 18:50:13 ..... loading genome
Apr 30 18:50:52 ..... started mapping
Apr 30 18:54:41 ..... started sorting BAM
Apr 30 18:55:30 ..... finished successfully
Let’s explore the output directory “alignments”.
ln -lh alignments
Read counts
STAR outputs read counts per gene into PREFIXReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options: column 1: gene ID column 2: counts for unstranded RNA-seq column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes) column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
gene id | read counts per gene (unstranded) | read counts per gene (read 1) | read counts per gene (read 2) |
N_unmapped | 26098 | 26098 | 26098 |
… | … | … | … |
ENSG00000261456.5 | 13 | 0 | 13 |
… | … | … | … |
ENSG00000151240.16 | 1652 | 5 | 1664 |
… | … | … | … |
Select the output according to the strandedness of your data. Note, if you have stranded data and choose one of the columns 3 or 4, the other column (4 or 3) will give you the count of antisense reads.
For example, in the stranded protocol shown in “Library preparation”, Read 1 is mapped to the antisense strand (this is also true for single-end reads), while Read 2, to the sense strand.
Which protocol, stranded or unstranded, was used for this RNA-seq data?
more alignments/A549_0_1ReadsPerGene.out.tab
N_unmapped 26098 26098 26098
N_multimapping 337723 337723 337723
N_noFeature 211089 2547710 246034
N_ambiguous 151810 4514 32331
ENSG00000260370.1 0 0 0
ENSG00000237297.1 0 0 0
ENSG00000261456.5 13 0 13
ENSG00000232420.2 0 0 0
ENSG00000015171.19 6874 131 6743
ENSG00000276662.1 6 6 0
ENSG00000212331.1 0 0 0
ENSG00000151240.16 1652 5 1664
...
We can count the number of reads mapped to each strand by using a simple awk script:
grep -v "N_" alignments/A549_0_1ReadsPerGene.out.tab | awk '{unst+=$2;forw+=$3;rev+=$4}END{print unst,forw,rev}'
2343002 153677 2427536
It can be seen that only 153,677 Reads 1 (forward) were mapped to known genes; while 24,27,536 Reads 2 (reverse) were mapped to known genes. This means that the protocol used for this mRNA-Seq experiment was stranded; and the reverse complement of mRNA was sequenced.
EXERCISE
- Scroll the file alignments/A549_0_1ReadsPerGene.out.tab further to find genes for which read 1 count was much more than read 2 count. Why is it so? (look at the gene in Ensemble)
- Look up also a gene for which read 1 count was comparable with read 2 count. What can you tell about this gene?
BAM/SAM/CRAM format
The BAM format is a compressed version of the SAM format (which is a plain text) and cannot thus being seen as a text. To explore the BAM file, we have to convert it to the SAM format by using samtools. Note that we use the parameter -h to show also the header that is hidden by default.
$RUN samtools view -h alignments/A549_0_1Aligned.sortedByCoord.out.bam | head -n 10
@HD VN:1.4 SO:coordinate
@SQ SN:chr10 LN:133797422
@PG ID:STAR PN:STAR VN:2.7.0f CL:STAR --genomeDir indexes/chr10 --readFilesIn resources/A549_0_1chr10_1.fastq.gz resources/A549_0_1chr10_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix alignments/A549_0_1 --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts
@CO user command line: STAR --genomeDir indexes/chr10 --readFilesIn resources/A549_0_1chr10_1.fastq.gz resources/A549_0_1chr10_2.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFileNamePrefix alignments/A549_0_1
D00137:453:HLFY2BCXY:2:1115:10428:98737 419 chr10 35442 3 13M174562N32M6S = 236864 201473 AGCTGTTATTGAACAAGAAGGGATTGGTTGCCAGGAGATGAGATTAGCATT @DDD?<1D1CGEE11<C@GHIIH?@E?G@CHH?FH@0GH@C<<<@1DFCHH NH:i:2 HI:i:2 AS:i:74 nM:i:9
D00137:453:HLFY2BCXY:2:2208:10640:12788 419 chr10 37161 0 51M = 37323 213 CCAATATTCAACATTCTTAAAGAAAAGAATGTTCAACCCAGAATTTCATAT DABDDFHHIIIHCFHEHIEEHHHFFEHHIIH@GHHIIIIIIIIHIIHIIIH NH:i:5 HI:i:5 AS:i:98 nM:i:1
D00137:453:HLFY2BCXY:2:2208:10640:12788 419 chr10 37161 0 51M = 37323 213 CCAATATTCAACATTCTTAAAGAAAAGAATGTTCAACCCAGAATTTCATAT DABDDFHHIIIHCFHEHIEEHHHFFEHHIIH@GHHIIIIIIIIHIIHIIIH NH:i:5 HI:i:5 AS:i:98 nM:i:1
D00137:453:HLFY2BCXY:2:2208:10640:12788 339 chr10 37323 0 51M = 37161 -213 GCACTAAACATGGAAAGGAACAACCGGTACCAGCCACTGCAAAATCATGCC CHIHEIIHHHHHIIIHHGFCIIIHIIHIIHIIHGHIIGHIHHIIHHDDADD NH:i:5 HI:i:5 AS:i:98 nM:i:1
D00137:453:HLFY2BCXY:2:2208:10640:12788 339 chr10 37323 0 51M = 37161 -213 GCACTAAACATGGAAAGGAACAACCGGTACCAGCCACTGCAAAATCATGCC CHIHEIIHHHHHIIIHHGFCIIIHIIHIIHIIHGHIIGHIHHIIHHDDADD NH:i:5 HI:i:5 AS:i:98 nM:i:1
D00137:453:HLFY2BCXY:1:1214:7640:70489 419 chr10 37872 3 51M = 37978 156 TGGGAGACTTTAACACCCCACTGTCAACATTAGACAGCTCAACAAGACAGA DDDDDHDHIIIIIIIIIIIIIIHIIFHGIIIIIIIIIIIIGHIIIIIIIHI NH:i:2 HI:i:1 AS:i:99 nM:i:0
The first part indicated by the first character @ in each row is the header:
Symbol | ||
---|---|---|
@HD header line | VN:1.4 version of the SAM format | SO:coordinate sorting order |
@SQ reference sequence dictionary | SN:chr10 sequence name | LN:133797422 sequence length |
@PG program used | ID:STAR PN:STAR VN:2.7.0f version | CL:STAR –genomeDir indexes/chr10 –readFilesIn resources/A549_0_1chr10_1.fastq.gz resources/A549_0_1chr10_2.fastq.gz –readFilesCommand zcat –outFileNamePrefix alignments/A549_0_1 –outSAMtype BAM SortedByCoordinate –quantMode GeneCounts command line |
@CO One-line text comment | user command line: STAR –genomeDir indexes/chr10 –readFilesIn resources/A549_0_1chr10_1.fastq.gz resources/A549_0_1chr10_2.fastq.gz –readFilesCommand zcat –outSAMtype BAM SortedByCoordinate –quantMode GeneCounts –outFileNamePrefix alignments/A549_0_1 |
The rest is a read alignment.
Field | Value |
---|---|
Query name | D00137:453:HLFY2BCXY:2:1115:10428:98737 |
FLAG | 419 * |
Reference name | chr10 |
Leftmost mapping position (1-based) | 35442 |
Mapping quality | 3 (p=0.5) |
CIGAR string | 13M174562N32M6S * |
Reference sequence name of the primary alignment of the mate | = same chromosome |
Position of the primary alignment of the mate | 236864 |
observed fragment length | 201473 |
Sequence | AGCTGTTATTGAACAAGAAGGGATTGGTTGCCAGGAGATGAGATTAGCATT |
Quality | @DDD?<1D1CGEE11<C@GHIIH?@E?G@CHH?FH@0GH@C«<@1DFCHH |
* FLAG 419 means: read paired, read mapped in proper pair, mate on the reverse strand, second in pair, not primary alignment. SIGAR string 13M174562N32M6S means 13 bases equal to the reference (M), 174562 bases were not mapping (that is, 1 insertion) (N), 32 bases were mapped (M), and 6 bases were soft clipped (S). You can use this website for the translation of FLAG and SIGAR values into plain English.
Extra fields are often present and differ among aligner tools https://samtools.github.io/hts-specs/SAMtags.pdf. In our case we have:
Field | Meaning |
---|---|
NH:i:2 | number of mapping to the reference |
HI:i:2 | which alignment is the reported one (in this case is the second one) |
AS:i:74 | Alignment score calculate by the aligner |
nM:i:9 | number of difference with the reference* |
* Note that historically this has been ill-defined and both data and tools exist that disagree with this definition.
Let’s convert BAM to SAM:
$RUN samtools view -h alignments/A549_0_1Aligned.sortedByCoord.out.bam > alignments/A549_0_1.sam
ls -alht alignments/A549_0_1*[sb]am
-rw-r--r-- 1 lcozzuto Bioinformatics_Unit 1.5G May 2 18:34 A549_0_1.sam
-rw-r--r-- 1 lcozzuto Bioinformatics_Unit 320M Apr 30 18:55 A549_0_1Aligned.sortedByCoord.out.bam
You can see that the SAM file is 5 times larger than the BAM file. Yet, the more efficient way to store the alignment is to use the CRAM format. CRAM is fully compatible with BAM, and main repositories, such as GEO and SRA, accept alignments in the CRAM format. UCSC Genome Browser can visualize both BAM and CRAM files. It is now a widly recommended format for storing alignments. To convert BAM to CRAM, we have to provide unzipped and indexed version of the genome.
$RUN samtools faidx annotations/Homo_sapiens.GRCh38.dna.chromosome.10.fa
$RUN samtools view -C alignments/A549_0_1Aligned.sortedByCoord.out.bam -T annotations/Homo_sapiens.GRCh38.dna.chromosome.10.fa > alignments/A549_0_1.cram
ls -alht A549_0_1*.*am
-rw-r--r-- 1 lcozzuto Bioinformatics_Unit 1.5G May 2 18:59 A549_0_1.sam
-rw-r--r-- 1 lcozzuto Bioinformatics_Unit 167M May 2 18:58 A549_0_1.cram
-rw-r--r-- 1 lcozzuto Bioinformatics_Unit 320M Apr 30 18:55 A549_0_1Aligned.sortedByCoord.out.bam
You can see that a .cram file is as twice as smaller than a .bam file. Let’s remove the .sam file:
rm alignments/*.sam
Alignment QC
The quality of the resulting alignment can be checked using the tool QualiMap. To run QualiMap, we specify the kind of analysis (rnaseq), the presence of paired-end reads within the bam file (-pe) and the strand of the library (-p strand-specific-reverse).
IMPORTANT: before running QualiMap ensure enough disk space for a temporary directory ./tmp that the program is required, running the following command:
export _JAVA_OPTIONS="-Djava.io.tmpdir=./tmp -Xmx6G"
$RUN qualimap rnaseq -pe -bam alignments/A549_0_1Aligned.sortedByCoord.out.bam \
-gtf annotations/gencode.v29.annotation_chr10.gtf \
-outdir QC -p strand-specific-reverse
Java memory size is set to 1200M
Launching application...
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
QualiMap v.2.2.1
Built on 2016-10-03 18:14
....
We can check the final report in a browser:
firefox QC/qualimapReport.html
The report gives a lot of useful information, such as the total number of mapped reads, the amount of reads mapped to exons, introns or intergenic regions, and the bias towards one of the ends of mRNA (that can give information about RNA integrity or a protocol used).
Looking at the gene coverage we can see a bias towards 5’-end that is compatible with the kind of stranded protocol used.
Finally, we can see that the majority of reads map to the exons.
IMPORTANT for running QualiMap on many samples (for detail, see QualiMap documentation
- Make sure to give to the output folder the name corresponding to a running sample; e.g., ./QC/A549_0_1; otherwise you will end up with files and folders for the last running sample only.
- If you run QualiMap in parallel for many samples, make sure to create a different tmp-folder for each sample; e.g., ./tmp/A549_0_1.
- QualiMap sorts BAM files by read names. To speed up this part of the program execution, you can use samtools to sort the BAM files in parallel and using multiple CPUs and then to give to QualiMap a BAM file sorted by read names and provide an option –sorted.