Nanopore analysis pipeline
Nextflow pipeline for the analysis of Nanopore direct RNA sequencing data. This is a joint project between the CRG Bioinformatics Core and the Epitranscriptomics and RNA Dynamics research group.
The direct RNA sequencing platform offered by Oxford Nanopore Technologies measures RNA molecules directly, without conversion to complementary DNA (cDNA), and as such can in principle detect virtually any RNA modification present in the molecule being sequenced.
Although the technology has been publicly available since 2017, the complexity of the raw current intensity data generated by nanopore sequencing, together with the lack of systematic and reproducible pipelines for analyzing direct RNA sequencing datasets, has greatly limited general users' access to this technology. Here we provide a scalable and parallelizable in silico workflow for the analysis of direct RNA sequencing reads. It converts raw current intensities into multiple types of processed data, providing run quality metrics, per-gene counts, RNA modification predictions and polyA tail length predictions.
The workflow, named MasterOfPores, is built using the Nextflow framework and distributed with Docker and Singularity containers. It can be executed on any Unix-compatible OS on a computer, cluster or cloud without installing any additional software or dependencies. The workflow is easily scalable, as it can incorporate updated software versions or algorithms that may be released in the future in a modular manner. We expect that our pipeline will make the analysis of direct RNA sequencing datasets simpler and more accessible to non-bioinformatics experts, and thus boost our understanding of the epitranscriptome with single-molecule resolution.
The MasterOfPores workflow includes all steps needed to process raw FAST5 files produced by Nanopore direct RNA sequencing, and lets users choose among different algorithms at several steps. The pipeline consists of 3 modules:
Module 1: NanoPreprocess
This module takes as input the raw FAST5 reads and produces as output base-called FASTQ and BAM files. The pre-processing module performs base-calling, demultiplexing, filtering, quality control, mapping and read counting, and generates a final report of the performance and results of each step. It automatically detects the kind of input FAST5 file (single- or multi-sequence).
The NanoPreprocess module comprises 8 main steps:
- Read base-calling with the algorithm of choice, using Albacore (https://nanoporetech.com) or Guppy (https://nanoporetech.com). This step can be run in parallel, and the user can set the number of files processed in a single job with the --granularity parameter. When a GPU is used, the granularity is ignored and all the files are analyzed sequentially.
- Filtering of the resulting fastq files using Nanofilt (https://github.com/wdecoster/nanofilt). This step is optional and can be run in parallel.
- Demultiplexing of the fastq files using DeePlexiCon (https://github.com/Psy-Fer/deeplexicon). This step is optional, and can only be used if the libraries have been barcoded using the oligonucleotides used to train the deep neural classifier. The model must be given as an option, as indicated in params.config.
- Quality control of the base-called data using MinIONQC (https://github.com/roblanf/minion_qc) and FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc).
- Read mapping to the reference genome or transcriptome using minimap2 (https://github.com/lh3/minimap2) or Graphmap2 (https://github.com/lbcb-sci/graphmap2).
- Quality control on the alignment using NanoPlot (https://github.com/wdecoster/NanoPlot) and bam2stats (https://github.com/lpryszcz/bin).
- Gene or isoform quantification using HTSeq (https://htseq.readthedocs.io/) or NanoCount (https://github.com/a-slide/NanoCount), which estimates transcript abundance using an expectation-maximization algorithm. Of note, NanoCount is run when the reads have been mapped to the transcriptome (set with the flag --reference_type transcriptome), while HTSeq is used when mapping to the genome. By default, reads are mapped to the genome and HTSeq is used to quantify per-gene counts.
- Final report of the data processing using MultiQC (https://github.com/ewels/MultiQC), which combines the individual quality-control results from the previous steps with global run statistics.
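Two of the rules described in the steps above, the granularity of base-calling jobs and the choice of quantification tool, can be illustrated with a minimal Python sketch. This is not the pipeline's own code (the pipeline is written in Nextflow); the function names, the use_gpu argument and the file names are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def batch_files(files: Sequence[str], granularity: int,
                use_gpu: bool = False) -> List[List[str]]:
    """Group input files into per-job batches (hypothetical helper).

    With a GPU, granularity is ignored and all files form one
    sequential batch, mirroring the behaviour described above.
    """
    if use_gpu:
        return [list(files)]
    if granularity < 1:
        raise ValueError("granularity must be a positive integer")
    return [list(files[i:i + granularity])
            for i in range(0, len(files), granularity)]


def choose_quantifier(reference_type: str = "genome") -> str:
    """Return the quantification tool implied by the reference type."""
    if reference_type == "transcriptome":
        return "NanoCount"
    if reference_type == "genome":
        return "HTSeq"
    raise ValueError(f"unknown reference type: {reference_type!r}")


if __name__ == "__main__":
    fast5 = [f"batch_{i}.fast5" for i in range(5)]  # made-up file names
    for job_id, batch in enumerate(batch_files(fast5, granularity=2)):
        print(job_id, batch)
    print(choose_quantifier("transcriptome"))
```

In this sketch, a smaller granularity yields more, shorter jobs, while the default reference type selects HTSeq, matching the default described above.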
Module 2: NanoTail
This module takes as input the output produced by the NanoPreprocess module and produces polyA tail estimations.
The NanoTail module estimates polyA tail lengths using Nanopolish (https://github.com/jts/nanopolish) and Tailfindr (https://github.com/adnaniazi/tailfindr), producing a plain text file that includes polyA tail length estimates for each read, computed using both algorithms. The correlation between the two algorithms is also reported as a plot.
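To make the comparison between the two algorithms concrete, the sketch below computes a Pearson correlation over the reads for which both tools report an estimate. It rests on assumptions: the source does not state which correlation measure the module uses, and the dictionaries keyed by read ID are a hypothetical in-memory representation, not the module's actual output format.

```python
import math
from typing import Dict


def pearson_on_shared_reads(est_a: Dict[str, float],
                            est_b: Dict[str, float]) -> float:
    """Pearson correlation of tail lengths over reads present in both inputs."""
    shared = sorted(est_a.keys() & est_b.keys())
    if len(shared) < 2:
        raise ValueError("need at least two reads estimated by both tools")
    xs = [est_a[read_id] for read_id in shared]
    ys = [est_b[read_id] for read_id in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant estimates")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up tail lengths, for illustration only.
    tool_a = {"read1": 52.0, "read2": 110.5, "read3": 78.0}
    tool_b = {"read1": 49.0, "read2": 120.0, "read3": 80.5, "read4": 33.0}
    print(round(pearson_on_shared_reads(tool_a, tool_b), 3))
```

Reads reported by only one tool (read4 here) are left out of the comparison, since a correlation needs paired values.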
Module 3: NanoMod
This module takes as input the output produced by the NanoPreprocess module and produces a flat text file which includes the predicted RNA modifications.
The NanoMod module predicts RNA modifications using Tombo (https://github.com/nanoporetech/tombo) and EpiNano (https://github.com/enovoa/EpiNano), producing a plain text file that contains the intersection of the sites predicted by both algorithms, to reduce the number of false positives.
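The intersection step can be sketched in Python as a set intersection. This is an illustration only, not the module's implementation: it assumes each prediction has already been reduced to a (reference name, position) pair in the same coordinate convention, and the function and variable names are hypothetical.

```python
from typing import Iterable, List, Tuple

Site = Tuple[str, int]  # (reference name, position); assumed representation


def consensus_sites(tombo_sites: Iterable[Site],
                    epinano_sites: Iterable[Site]) -> List[Site]:
    """Keep only the sites reported by both predictors, sorted for stable output."""
    return sorted(set(tombo_sites) & set(epinano_sites))


if __name__ == "__main__":
    # Made-up sites, for illustration only.
    tombo = [("chr1", 1050), ("chr1", 2301), ("chr2", 88)]
    epinano = [("chr1", 2301), ("chr2", 88), ("chr3", 17)]
    for ref, pos in consensus_sites(tombo, epinano):
        print(f"{ref}\t{pos}")
```

Requiring agreement between two independent predictors trades some sensitivity for fewer false positives, which is the rationale given above.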
Citing this work
If you use this tool, please cite our paper:
“MasterOfPores: A Workflow for the Analysis of Oxford Nanopore Direct RNA Sequencing Datasets” Luca Cozzuto, Huanle Liu, Leszek P. Pryszcz, Toni Hermoso Pulido, Anna Delgado-Tejedor, Julia Ponomarenko, Eva Maria Novoa. Front. Genet., 17 March 2020.