NanoPreprocess

This module takes raw fast5 reads as input and produces several outputs (basecalled fast5, sequences in fastq format, aligned reads in BAM format, etc.). The pre-processing module performs basecalling, optional demultiplexing, filtering, quality control, mapping to a genome / transcriptome reference, and read counting, and it generates a final report on the performance and results of each step. It automatically detects the kind of input fast5 file (single or multi sequence).

Workflow

Process name	Description
testInput	Detection of kind of fast5 (multi or single)
baseCalling	Basecalling with Albacore or Guppy
demultiplexing_with_deeplexicon	Demultiplexing (optional) with DeePlexiCon
concatenateFastQFiles	This process concatenates the fastq files produced for each single basecalling
QC	performed with MinIONQC
fastQC	on fastq files
mapping	to the genome / transcriptome with either minimap2 or graphmap2
counting	counts per gene with htseq-count if mapping to the genome, or per transcript with NanoCount if mapping to the transcriptome. Reads are also assigned to a gene or to a transcript if they map uniquely. A report file is also generated.
alnQC	QC of aligned reads with bam2stats.
joinCountQCs	This process is for merging the report files generated by the counting step.
joinAlnQCs	This process is for merging the QC files generated by the alnQC step.
alnQC2	QC of aligned reads with NanoPlot. The plots PercentIdentityvsAverageBaseQuality_kde, LengthvsQualityScatterPlot_dot, HistogramReadlength and Weighted_HistogramReadlength are then merged together in a single picture.
multiQC	Final report, optionally also sent by e-mail.

You can launch the pipeline with either the -with-singularity or the -with-docker parameter, depending on which containers you want to use.
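
As a minimal sketch of the two launch variants, the helper below prints the command line for a given container engine instead of running it. The function name launch_cmd is hypothetical and not part of the pipeline; -with-singularity and -with-docker are standard Nextflow options.

```shell
# Hypothetical helper (not part of the pipeline): print the launch
# command for the chosen container engine instead of executing it.
launch_cmd() {
  engine="$1"
  case "$engine" in
    singularity|docker)
      echo "nextflow run nanopreprocess.nf -with-${engine}"
      ;;
    *)
      echo "unsupported container engine: ${engine}" >&2
      return 1
      ;;
  esac
}

launch_cmd singularity
launch_cmd docker
```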

Input Parameters

Parameter name	Description
fast5 files	Path to fast5 input files. They can contain either a single sequence or multiple ones. They should be inside a folder that will be used as sample name.
reference	file in fasta format. It can be either a genome or a transcriptome; which one must be specified via the ref_type parameter.
kit	and flowcell parameters needed for basecalling.
annotation	in GTF format. It is optional and needed only in case of mapping to the genome and when interested in gene counts.
seq_type	It can be either RNA or DNA.
output	output folder name
granularity	indicates the number of input fast5 files analyzed in a single process. The default is 4000 for single-sequence fast5 files and 1 for multi-sequence fast5 files. When the GPU option is turned on this value is not needed, since every file is analyzed sequentially.
basecaller	program. guppy or albacore are supported.
basecaller_opt	command line options for basecaller program
GPU	it allows using GPU or not. It can be either ON or OFF
demultiplexing	program. Only deeplexicon is supported. It can be turned off by specifying “OFF”
demultiplexing_opt	options for the demultiplexing program.
filter	it can be NanoFilt, or OFF if no filtering is needed.
filter_opt	options of the filtering program.
mapper	it can be either minimap2 or graphmap2
mapper_opt	options of the mapping program.
map_type	it can be either spliced or not. In case the alignment is to a eukaryotic genome it should be spliced.
counter	this parameter can be YES for counting the number of tags per gene (in case of mapping to the genome) or per transcript (in case of mapping to the transcriptome). An annotation file is needed in case of mapping to the genome.
counter_opt	options of the counter program: NanoCount for transcripts and htseq-count for genes.
email	for receiving a mail with the final report when the pipeline is finished
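
To get a feel for the granularity parameter, the number of processes needed for a run can be estimated as the number of input files divided by the granularity, rounded up. This is only a sketch: it assumes the files are split into consecutive batches of at most granularity files, and the function name and file counts are illustrative.

```shell
# Estimate how many batches (processes) a run is split into,
# assuming consecutive batches of at most <granularity> files.
n_batches() {
  n_files="$1"
  granularity="$2"
  # Integer ceiling division: (a + b - 1) / b
  echo $(( (n_files + granularity - 1) / granularity ))
}

n_batches 10000 4000   # single-sequence files, default granularity: prints 3
n_batches 12 1         # multi-sequence files, default granularity: prints 12
```

A smaller granularity gives more, smaller processes. It is set like any of the other parameters in the table above.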

You can change these parameters by editing the params.config file or on the command line (each parameter name must be preceded by two dashes, --):

nextflow run nanopreprocess.nf -with-singularity -bg --output test2 > log.txt
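
Several parameters can be overridden in the same command. The sketch below only assembles and prints such a command so it can be checked before running. The parameter names come from the table above; the values chosen here are examples, not recommendations.

```shell
# Example values (assumptions for illustration only).
OUTPUT="test2"
MAPPER="minimap2"
MAP_TYPE="spliced"

# Pipeline parameters take two dashes; Nextflow options
# such as -with-singularity and -bg take one.
CMD="nextflow run nanopreprocess.nf -with-singularity -bg"
CMD="${CMD} --output ${OUTPUT} --mapper ${MAPPER} --map_type ${MAP_TYPE}"

# Print the full command, including the redirection of the log.
echo "${CMD} > log.txt"
```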

To resume a previous execution that failed at a certain step, or to rerun after changing a parameter that affects only some steps, you can use the Nextflow parameter -resume (only one dash!):

nextflow run nanopreprocess.nf -with-singularity -bg -resume > log.txt

...

[warm up] executor > crg
[e8/2e64bd] Cached process > baseCalling (RNA081120181_1)
[b2/21f680] Cached process > QC (RNA081120181_1)
[c8/3f5d17] Cached process > mapping (RNA081120181_1)
...
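
One quick way to see how much work -resume saved is to count the "Cached process" lines in the log. The sketch below assumes the log was redirected to a file as in the command above and that cached steps are reported in the format shown in the excerpt; count_cached is a hypothetical helper name.

```shell
# Count the steps that were taken from the cache in a Nextflow log file.
count_cached() {
  # grep -c prints the number of matching lines (0 if there are none);
  # "|| true" keeps a zero count from being treated as a failure.
  grep -c "Cached process" "$1" || true
}

# Only run the check when the log file exists.
if [ -f log.txt ]; then
  count_cached log.txt
fi
```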

Results:

Seven folders are created by the pipeline within the output folder specified by the output parameter:

fast5_files: contains the basecalled multisequence fast5 files. Each batch contains 4000 sequences.
fastq_files: contains one or, in case of demultiplexing, more fastq files.
QC_files: contains each single QC produced by the pipeline.
alignment: contains the bam file(s)
counts: contains read counts per gene / transcript. It is optional.
assigned: contains assignment of each read to a given gene / transcript. It is optional.
report: contains the final multiqc report.
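
After a run, the folder layout described above can be checked with a few lines of shell. This is a sketch rather than part of the pipeline: check_outputs is a hypothetical helper, and it assumes that every folder not marked as optional above is always created.

```shell
# Hypothetical helper: verify the folders expected inside the output folder.
check_outputs() {
  outdir="$1"
  rc=0
  # Folders assumed to be always present after a successful run.
  for d in fast5_files fastq_files QC_files alignment report; do
    if [ ! -d "${outdir}/${d}" ]; then
      echo "missing required folder: ${d}"
      rc=1
    fi
  done
  # Folders described as optional.
  for d in counts assigned; do
    [ -d "${outdir}/${d}" ] || echo "optional folder not present: ${d}"
  done
  return "$rc"
}

# "test2" is the example output folder name used earlier.
check_outputs "test2" || echo "some required folders are missing"
```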