General Transfer Format
The file we have just downloaded provides information on the genome annotation.
The genomic annotation is stored in General Transfer Format (GTF) format (which is an extension of the older GFF format): a tabular format that has a header (rows starting with ”#”, following by one line per genome feature, each one containing 9 columns of data:
Column number | Column name | Details |
---|---|---|
1 | seqname | name of the chromosome or scaffold; chromosome names can be given with or without the ‘chr’ prefix. |
2 | source | name of the program that generated this feature, or the data source (database or project name) |
3 | feature | feature type name, e.g. Gene, Variation, Similarity |
4 | start | Start position of the feature, with sequence numbering starting at 1. |
5 | end | End position of the feature, with sequence numbering starting at 1. |
6 | score | A floating point value. |
7 | strand | defined as + (forward) or - (reverse). |
8 | frame | One of ‘0’, ‘1’ or ‘2’. ‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on.. |
9 | attribute | A semicolon-separated list of tag-value pairs, providing additional information about each feature. |
Check the first rows of the annotation file:
zcat annotation.gtf.gz | head
##description: evidence-based annotation of the human genome (GRCh38), version 32 (Ensembl 98)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2019-09-05
chr21 HAVANA gene 5011799 5017145 . + . gene_id "ENSG00000279493.1"; gene_type "protein_coding"; gene_name "FP565260.4"; level 2; havana_gene "OTTHUMG00000189354.1";
chr21 HAVANA transcript 5011799 5017145 . + . gene_id "ENSG00000279493.1"; transcript_id "ENST00000624081.1"; gene_type "protein_coding"; gene_name "FP565260.4"; transcript_type "protein_coding"; transcript_name "FP565260.4-201"; level 2; protein_id "ENSP00000485664.1"; transcript_support_level "5"; tag "mRNA_start_NF"; tag "cds_start_NF"; tag "basic"; tag "appris_principal_1"; havana_gene "OTTHUMG00000189354.1"; havana_transcript "OTTHUMT00000479422.1";
chr21 HAVANA exon 5011799 5011874 . + . gene_id "ENSG00000279493.1"; transcript_id "ENST00000624081.1"; gene_type "protein_coding"; gene_name "FP565260.4"; transcript_type "protein_coding"; transcript_name "FP565260.4-201"; exon_number 1; exon_id "ENSE00003760288.1"; level 2; protein_id "ENSP00000485664.1"; transcript_support_level "5"; tag "mRNA_start_NF"; tag "cds_start_NF"; tag "basic"; tag "appris_principal_1"; havana_gene "OTTHUMG00000189354.1"; havana_transcript "OTTHUMT00000479422.1";
chr21 HAVANA CDS 5011799 5011874 . + 0 gene_id "ENSG00000279493.1"; transcript_id "ENST00000624081.1"; gene_type "protein_coding"; gene_name "FP565260.4"; transcript_type "protein_coding"; transcript_name "FP565260.4-201"; exon_number 1; exon_id "ENSE00003760288.1"; level 2; protein_id "ENSP00000485664.1"; transcript_support_level "5"; tag "mRNA_start_NF"; tag "cds_start_NF"; tag "basic"; tag "appris_principal_1"; havana_gene "OTTHUMG00000189354.1"; havana_transcript "OTTHUMT00000479422.1";
chr21 HAVANA exon 5012548 5012687 . + . gene_id "ENSG00000279493.1"; transcript_id "ENST00000624081.1"; gene_type "protein_coding"; gene_name "FP565260.4"; transcript_type "protein_coding"; transcript_name "FP565260.4-201"; exon_number 2; exon_id "ENSE00003758404.1"; level 2; protein_id "ENSP00000485664.1"; transcript_support_level "5"; tag "mRNA_start_NF"; tag "cds_star
Let’s check how many genes are in the annotation file:
zcat annotation.gtf.gz | grep -v "#" | awk '$3=="gene"' | wc -l
872
And get a final counts of every feature:
zcat annotation.gtf.gz | grep -v "#" | cut -f3 | sort | uniq -c
7709 CDS
16659 exon
872 gene
857 start_codon
813 stop_codon
2925 transcript
2896 UTR
How many protein coding genes are there?
zcat annotation.gtf.gz | grep -v "#" | awk '$3=="gene"' | grep "protein_coding" | wc -l
232
Retrieve all unique gene IDs (let’s look up for the options of commands cut and sort using man):
zcat annotation.gtf.gz | grep -v "#" | cut -d"\"" -f2 | sort -u > annotation_geneIDs.txt
Here, we use the fact that gene ID appears after the first and before the second occurance of doublequote (“) ! That is why we used the command cut -d”"“ where backslash () is used as an escape character for special characters (“ in this case).
EXERCISE
- How many lncRNA genes are in the file annotation.gtf.gz?
TIP: Command cut can be used with different one-character separators and applied to different columns many times in a sequence via pipe.