NanoPreprocess

This module takes raw fast5 reads as input and produces several outputs (basecalled fast5 files, sequences in fastq format, aligned reads in BAM format, etc.). The pre-processing module performs base-calling, demultiplexing (optional), filtering, quality control, mapping to a genome / transcriptome reference and read counting, and it generates a final report on the performance and results of each step. It automatically detects the kind of input fast5 file (single or multi sequence).

Workflow

Process name                     Description
testInput                        Detection of the kind of fast5 file (multi or single).
baseCalling                      Basecalling with Albacore or Guppy.
demultiplexing_with_deeplexicon  Demultiplexing (optional) with DeePlexiCon.
concatenateFastQFiles            Concatenates the fastq files produced by each single basecalling.
QC                               Quality control performed with MinIONQC.
fastQC                           FastQC run on the fastq files.
mapping                          Mapping to the genome / transcriptome with either minimap2 or graphmap2.
counting                         Counts per gene with htseq-count if mapping to the genome, or per transcript with NanoCount if mapping to the transcriptome. Reads are also assigned to a gene or to a transcript if they map uniquely. A report file is also generated.
alnQC                            QC of aligned reads with bam2stats.
joinCountQCs                     Merges the report files generated by the counting step.
joinAlnQCs                       Merges the QC files generated by the alnQC step.
alnQC2                           QC of aligned reads with NanoPlot. The plots PercentIdentityvsAverageBaseQuality_kde, LengthvsQualityScatterPlot_dot, HistogramReadlength and Weighted_HistogramReadlength are then merged into a single picture.
multiQC                          Final report, optionally also sent by email.

You can launch the pipeline with either the -with-singularity or the -with-docker parameter, depending on which containers you want to use.
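As a minimal sketch of choosing between the two flags, the snippet below checks which container engine is installed and prints the matching launch command. The pick_container_flag helper and the preference for Singularity over Docker are illustrative assumptions, not part of the pipeline.

```shell
#!/bin/sh
# Hypothetical helper: print the Nextflow container flag for whichever
# engine is found on PATH (Singularity is tried first, an arbitrary choice).
pick_container_flag() {
    if command -v singularity >/dev/null 2>&1; then
        echo "-with-singularity"
    elif command -v docker >/dev/null 2>&1; then
        echo "-with-docker"
    else
        # Neither engine was found: print an empty line.
        echo ""
    fi
}

CONTAINER_FLAG=$(pick_container_flag)
if [ -z "$CONTAINER_FLAG" ]; then
    echo "No container engine found: install Singularity or Docker first." >&2
else
    # Only print the command; copy it or replace echo to actually launch.
    echo "nextflow run nanopreprocess.nf $CONTAINER_FLAG"
fi
```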

Input Parameters

Parameter name      Description
fast5 files         Path to fast5 input files. They can contain either a single sequence or multiple ones. They should be inside a folder, whose name will be used as the sample name.
reference           File in fasta format. It can be either a genome or a transcriptome; which one must be specified via the ref_type parameter.
kit and flowcell    Parameters needed for basecalling.
annotation          Annotation in GTF format. It is optional and needed only when mapping to the genome and when gene counts are of interest.
seq_type            It can be either RNA or DNA.
output              Output folder name.
granularity         Number of input fast5 files analyzed in a single process. It is by default 4000 for single-sequence fast5 files and 1 for multi-sequence fast5 files. When the GPU option is turned on, this value is not needed, since every file is analyzed sequentially.
basecaller          Basecalling program. guppy and albacore are supported.
basecaller_opt      Command line options for the basecaller program.
GPU                 Enables or disables the use of the GPU. It can be either ON or OFF.
demultiplexing      Demultiplexing program. Only deeplexicon is supported. It can be turned off by specifying “OFF”.
demultiplexing_opt  Options for the demultiplexing program.
filter              It can be NanoFilt, or OFF if no filtering is needed.
filter_opt          Options for the filtering program.
mapper              It can be either minimap2 or graphmap2.
mapper_opt          Options for the mapping program.
map_type            It can be either spliced or not. When aligning to a eukaryotic genome it should be spliced.
counter             This parameter can be YES for counting the number of tags per gene (when mapping to the genome) or per transcript (when mapping to the transcriptome). An annotation file is needed when mapping to the genome.
counter_opt         Options for the counter program: NanoCount for transcripts and htseq-count for genes.
email               Address for receiving an email with the final report when the pipeline has finished.
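Two of the conventions above can be illustrated with a little arithmetic: the sample name is taken from the folder holding the fast5 files, and with N input files and granularity G roughly ceil(N / G) processes are needed. The sketch below shows both; the helper names and the example path are hypothetical, and the pipeline's own batching logic may differ.

```shell
#!/bin/sh
# Hypothetical helper: sample name = name of the folder holding the fast5 file.
sample_name() {
    basename "$(dirname "$1")"
}

# Hypothetical helper: number of batches for N files at a given granularity,
# computed as a ceiling division with integer arithmetic.
batch_count() {
    n_files=$1
    granularity=$2
    echo $(( (n_files + granularity - 1) / granularity ))
}

# Prints RNA081120181_1
sample_name "/data/RNA081120181_1/read_001.fast5"
# 10000 single-sequence files at the default granularity of 4000: prints 3
batch_count 10000 4000
# 12 multi-sequence files at the default granularity of 1: prints 12
batch_count 12 1
```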

You can change these parameters by editing the params.config file or on the command line (each parameter name must be preceded by two dashes, --):

nextflow run nanopreprocess.nf -with-singularity -bg --output test2 > log.txt
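Several parameters can be overridden in the same call. The sketch below assembles such a command and prints it for inspection before running it. The values and the DRY_RUN switch are hypothetical, while output, mapper and seq_type are parameter names taken from the table above.

```shell
#!/bin/sh
# Hypothetical values: adjust them to your own run.
OUTPUT_DIR="test_output"
MAPPER="minimap2"
SEQ_TYPE="RNA"

# Single-dash options belong to Nextflow, double-dash ones to the pipeline.
CMD="nextflow run nanopreprocess.nf -with-singularity -bg --output ${OUTPUT_DIR} --mapper ${MAPPER} --seq_type ${SEQ_TYPE}"

# By default only print the command; set DRY_RUN=0 to execute it.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "${CMD}"
else
    # Unquoted on purpose so the string splits into words; this only
    # works as long as none of the values contains spaces.
    ${CMD} > log.txt
fi
```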

To resume a previous execution that failed at a certain step, or to rerun after changing a parameter that affects only some steps, you can use the Nextflow parameter -resume (only one dash!):

nextflow run nanopreprocess.nf -with-singularity -bg -resume > log.txt

...

[warm up] executor > crg
[e8/2e64bd] Cached process > baseCalling (RNA081120181_1)
[b2/21f680] Cached process > QC (RNA081120181_1)
[c8/3f5d17] Cached process > mapping (RNA081120181_1)
...
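To check how much work -resume reused, the cached steps can be counted in the log. This is a minimal sketch that assumes the log lines look like the excerpt above and that the output was redirected to log.txt; count_cached is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count the lines marked as cached in a log file.
count_cached() {
    # grep -c prints the number of matching lines (0 when there are none).
    grep -c 'Cached process' "$1"
}

# Example using the log excerpt above, written to a temporary file.
LOG_SAMPLE=$(mktemp)
cat > "$LOG_SAMPLE" <<'EOF'
[warm up] executor > crg
[e8/2e64bd] Cached process > baseCalling (RNA081120181_1)
[b2/21f680] Cached process > QC (RNA081120181_1)
[c8/3f5d17] Cached process > mapping (RNA081120181_1)
EOF

CACHED=$(count_cached "$LOG_SAMPLE")
echo "Cached processes: $CACHED"
rm -f "$LOG_SAMPLE"
```

On a real run the same helper would be called as count_cached log.txt.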

Results:

Seven folders are created by the pipeline within the output folder specified by the output parameter:

  • fast5_files: contains the basecalled multisequence fast5 files. Each batch contains 4000 sequences.
  • fastq_files: contains one or, in case of demultiplexing, more fastq files.
  • QC_files: contains each single QC produced by the pipeline.
  • alignment: contains the bam file(s)
  • counts: contains read counts per gene / transcript. It is optional.
  • assigned: contains assignment of each read to a given gene / transcript. It is optional.
  • report: contains the final multiqc report.
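A quick way to confirm that a run produced the folders listed above is to test for each of them. The following is a sketch only: check_output_dirs is a hypothetical helper, and it treats counts and assigned as optional, as described in the list.

```shell
#!/bin/sh
# Hypothetical helper: report which of the expected result folders are
# missing from the pipeline output folder given as first argument.
# The exit status is the number of missing mandatory folders (0 = all found).
check_output_dirs() {
    out_dir=$1
    missing=0
    for d in fast5_files fastq_files QC_files alignment report; do
        if [ ! -d "$out_dir/$d" ]; then
            echo "missing: $d"
            missing=$((missing + 1))
        fi
    done
    # counts and assigned are optional, so their absence is only noted.
    for d in counts assigned; do
        if [ ! -d "$out_dir/$d" ]; then
            echo "optional folder not present: $d"
        fi
    done
    return "$missing"
}
```

For the run shown earlier, the call would be check_output_dirs test2.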