Read alignment and SAM format
For mapping of 20-30 million reads to a genome that is around 3 billion bases long, we need a ultra fast aligner (that is, an algorithm for the read mapping) to perform the mapping.
Before doing the mapping, we have to calculate an index for the reference DNA sequence. Each algorithm/program has its own specific method to produce the index.
Like the index at the end of a book, an index of a large DNA sequence allows one to rapidly find shorter sequences embedded within it. Different tools use different approaches at genome/transcriptome indexing.
Fast aligners
These tools can be used for aligning short reads to a reference genome or transcriptome. They can be much faster than traditional aligners like Blast but less sensitive and may have limitations about the read size.
- Bowtie is an ultrafast, memory-efficient short read aligner geared toward quickly aligning large sets of short DNA sequences (reads) to large genomes/transcriptomes. Bowtie uses a Burrows-Wheeler index.
- Bowtie2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of length 50 up to 100s or 1,000s to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the transcriptome with an FM Index.
- BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. BWA indexes the genome with an FM Index.
- GEM is a high-performance mapping tool for aligning sequenced reads against large reference genomes. In particular, it is designed to obtain best results when mapping sequences up to 1K bases long. GEM3 indexes the reference genome using a custom FM-Index design and performs an adaptive gapped search based on the characteristics of the input and the user settings.
There is a standard format for the alignment called SAM format.
BAM/SAM format
The format Sequence Alignment/Map Format SAM format is plain text and can be displayed like a normal file:
pwd # check where you are located in the folder structure
cd .. # go up one level to the root of the project, if needed
mkdir alignment
cd alignment
wget https://biocorecrg.github.io/PhD_course_genomics_format_2023/data/H3K4me1.sam.gz
gunzip H3K4me1.sam.gz
more H3K4me1.sam
@HD VN:1.0 SO:unsorted
@SQ SN:chr21 LN:46709983
@PG ID:Bowtie VN:1.2.3 CL:"/usr/local/bin/bowtie-align-s --wrapper basic-0 -S genome H3K4me1_chr21.fq.gz H3K4me
1.sam"
D00733:297:CBT52ANXX:6:2204:16979:89185 0 chr21 5010030 255 50M * 0 0 TGGGATTACAGGTGTTAGCCACCA
CGTCCAGCTGTTAATTTTTATTTAAT AAB@@1FC1DGGFEGGGGGGGGGE>FGGFDGGGGGEEDGGCGGGFGGGGF XA:i:0 MD:Z:50 NM:i:0 XM:i:2
D00733:297:CBT52ANXX:6:1302:12939:54715 0 chr21 5010030 255 50M * 0 0 TGGGATTACAGGTGTTAGCCACCA
CGTCCAGCTGTTAATTTTTATTTAAT BBBBAF@FGGGGGGGGGGEGEGFFGGGGFBCFGCFGGEGCDBDFGFGGGF XA:i:0 MD:Z:50 NM:i:0 XM:i:2
D00733:297:CBT52ANXX:6:1211:9390:70340 0 chr21 5010030 255 50M * 0 0 TGGGATTACAGGTGTTAGCCACCA
CGTCCAGCTGTTAATTTTTATTTAAT CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGCGGGG XA:i:0 MD:Z:50 NM:i:0 XM:i:2
D00733:297:CBT52ANXX:7:2310:12330:36449 16 chr21 44253418 255 50M * 0 0 ATTAAATAAAAATTAA
CAGCTGGACGTGGTGTCTAACACCTGTAATCCCA GGGGGGGGGGGGGGGGGGGGGGGGGGGGGE=1GGGGGGGGGGGGGCCCCC XA:i:1 MD:Z:31G18 NM:i:1
XM:i:2
The first part (rows starting with the character @) is the header:
Symbol | ||
---|---|---|
@HD header line | VN:1.0 version of the SAM format | SO:unsorted sorting order |
@SQ reference sequence dictionary | SN:chr21 sequence name | LN:46709983 sequence length |
@PG program used | ID:Bowtie VN:1.2.3 version | CL:”/usr/local/bin/bowtie-align-s –wrapper basic-0 -S genome H3K4me1_chr21.fq.gz H3K4me1.sam” command line |
The rest is a read alignment.
Field | Value |
---|---|
Query name | D00733:297:CBT52ANXX:6:2204:16979:89185 |
FLAG | 0 * |
Reference name | chr21 |
Leftmost mapping position (1-based) | 5010030 |
Mapping quality | 255 range from 0 to 255 |
CIGAR string | 50M * |
Reference sequence name of the primary alignment of the mate | = same chromosome |
Position of the primary alignment of the mate | 0 |
observed fragment length | 0 |
Sequence | TGGGATTACAGGTGTTAGCCACCACGTCCAGCTGTTAATTTTTATTTAAT |
Quality | AAB@@1FC1DGGFEGGGGGGGGGE>FGGFDGGGGGEEDGGCGGGFGGGGF |
* FLAG 0 means: read unpaired, mapped to the reference and mapped to the forward strand.
The CIGAR string 50M means all the 50 bases are equal to the reference (M).
The BAM format is a compressed version of the SAM format (plain text) and cannot thus be seen as a text.
Converting SAM to BAM is really useful to save disk space and to quickly access the data: indeed, the BAM format can be sorted and indexed.
To convert SAM to BAM, sort, index, etc. you can use the program: samtools.
wget https://biocorecrg.github.io/PhD_course_genomics_format_2023/data/H3K4me1.bam
Let’s see how much space can be saved using BAM format instead of SAM using ls -lh command
EXERCISE
- In the file H3K4me1.sam, count the number of aligned reads containing exactly 40 A’s .
TIPs:
- Using cat > test2 create a small file containing a few lines with 2, 3, 4, and 5 A’s. Try your commands on this small file first to understands what is going on.
- Look up man grep for options to use grep with extended regular expressions.
- Look up man grep for basic regular expressions or this resource https://www.codingame.com/playgrounds/218/regular-expressions-basics/repetitions