NanoPreprocess

This module takes the raw fast5 reads (single- or multi-read) as input and produces a number of outputs (basecalled fast5 files, sequences in fastq format, aligned reads in BAM format, etc.). The pre-processing module performs basecalling, demultiplexing (optional), filtering, quality control, mapping to a genome / transcriptome reference, and feature counting, and it generates a final report of the performance and results of each step performed. It automatically detects the kind of input fast5 file (single or multi sequence).

Workflow

Process name Description
testInput Detection of kind of fast5 (multi or single)
baseCalling Basecalling with Albacore or Guppy (up to guppy 4.0)
demultiplexing Demultiplexing (optional)
concatenateFastQFiles This process concatenates the fastq files produced by each single basecalling
QC Performed with MinIONQC
fastQC Executed on fastq files
mapping Mapping to genome / transcriptome with either minimap2, graphmap or graphmap2
counting If mapping to the genome, it obtains counts per gene with htseq-count. Otherwise, if mapping to the transcriptome, transcript counts are generated with NanoCount. Reads are also assigned to a gene or to a transcript if they are uniquely mapping. A report file is also generated.
alnQC2 QC of aligned reads with NanoPlot. The plots PercentIdentityvsAverageBaseQuality_kde, LengthvsQualityScatterPlot_dot, HistogramReadlength and Weighted_HistogramReadlength are then merged into a single image.
alnQC QC of aligned reads with bam2stats.
cram_conversion Generating cram file from alignment.
joinAlnQCs Merging the QC files generated by the alnQC step.
joinCountQCs Merging the report files generated by the counting step.
multiQC Final report generation - optionally sent to the user by email.

Input Parameters

Parameter name Description
fast5 files Path to fast5 input files (single or multi-fast5 files). They should be inside a folder whose name will be used as the sample name. [/Path/sample_name/*.fast5]
reference File in fasta format. [Reference_file.fa]
ref_type Specify if the reference is a genome or a transcriptome. [genome / transcriptome]
kit Kit used in library prep - required for basecalling.
flowcell Flowcell used in sequencing - required for basecalling.
annotation Annotation file in GTF format. It is optional and needed only in case of mapping to the genome and when interested in gene counts. [Annotation_file.gtf]
seq_type Sequence type. [RNA / DNA]
output Output folder name. [/Path/to_output_folder]
granularity Number of input fast5 files analyzed in a single process. The default is 4000 for single-sequence fast5 files and 1 for multi-sequence fast5 files. If the GPU option is turned on, this value is not needed, since every file is analyzed sequentially.
basecaller Algorithm to perform the basecalling. guppy or albacore are supported. [albacore / guppy]
basecaller_opt Command line options for basecaller program. Check available options in respective basecaller repository.
GPU Allow the pipeline to run with GPU. [OFF / ON]
demultiplexing Demultiplexing algorithm to be used. [OFF / deeplexicon / guppy / guppy-readucks]
demultiplexing_opt Command line options for the demultiplexing software.
demulti_fast5 If performing demultiplexing, also generate demultiplexed multifast5 files. [OFF / ON]
filter Program to filter fastq files. [nanofilt / OFF]
filter_opt Command line options of the filtering program.
mapper Mapping algorithm. [minimap2 / graphmap / graphmap2]
mapper_opt Command line options of the mapping algorithm.
map_type Spliced - recommended for genome mapping - or unspliced - recommended for transcriptome mapping. [spliced / unspliced]
counter Whether to generate gene counts (genome mapping) or transcript counts (transcriptome mapping). [YES / “”]
counter_opt Command line options of the counter program: NanoCount for transcripts and Htseq-count for genes.
email User's email for receiving the final report when the pipeline is finished. [user_email]

You can change these parameters by editing the params.config file or by overriding them on the command line - please see the next section.
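For instance, a few entries in params.config might look like the snippet below. This is only a minimal sketch: the parameter names follow the table above, while all paths and values (including kit and flowcell) are placeholders to be adapted to your own data and run.

// example values only - adjust paths, kit and flowcell to your run
fast5       = "/Path/sample_name/*.fast5"
reference   = "/Path/Reference_file.fa"
ref_type    = "genome"
kit         = "SQK-RNA002"
flowcell    = "FLO-MIN106"
seq_type    = "RNA"
output      = "/Path/to_output_folder"
basecaller  = "guppy"
GPU         = "OFF"
mapper      = "minimap2"
map_type    = "spliced"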

How to run the pipeline

Before launching the pipeline, the user should decide which container engine to use - either Docker or Singularity [-with-docker / -with-singularity]. Then, to launch the pipeline, please use the following command:

nextflow run nanopreprocess.nf -with-singularity > log.txt
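If Docker is installed instead of Singularity, the same command can be run with the other container flag mentioned above:

nextflow run nanopreprocess.nf -with-docker > log.txt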


  • Run the pipeline in the background:
    nextflow run nanopreprocess.nf -with-singularity -bg > log.txt
    


  • Run the pipeline while changing params.config file:
    nextflow run nanopreprocess.nf -with-singularity -bg --output test2 > log.txt
    


  • Specify a directory for the working directory (temporary files location):
    nextflow run nanopreprocess.nf -with-singularity -bg -w /path/working_directory > log.txt
    


  • Run the pipeline with GPU - CRG GPU cluster users:
    nextflow run nanopreprocess.nf -with-singularity -bg -w /path/working_directory -profile cluster > log.txt
    


  • Run the pipeline with GPU - local GPU:
    nextflow run nanopreprocess.nf -with-singularity -bg -w /path/working_directory -profile local > log.txt
    

Troubleshooting

  • Checking what has gone wrong:
    If there is an error, please see the log file (log.txt) for more details. If more information is needed, the log file also reports the working directory of the failed process: go to that directory and check both the .command.log and .command.err files.
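    As an illustration (the hash-named directory below is hypothetical - use the path printed in your own log file):

    cd /path/working_directory/c8/3f5d17xxxxxx   # hypothetical work directory of the failed process
    cat .command.log                              # standard output of the process
    cat .command.err                              # error messages of the process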

  • Resume an execution:
    Once the error has been solved, or if you change a specific parameter, you can resume the execution with the Nextflow parameter -resume (note: only one dash!). If there was an error, the pipeline will resume from the process that failed and proceed with the rest. If a parameter was changed, only the processes affected by that parameter will be re-run.

nextflow run nanopreprocess.nf -with-singularity -bg -resume > log_resumed.txt

To check whether the pipeline has been resumed properly, please check the log file. If the previously executed processes are reported as Cached, the resume worked!

...

[warm up] executor > crg
[e8/2e64bd] Cached process > baseCalling (RNA081120181_1)
[b2/21f680] Cached process > QC (RNA081120181_1)
[c8/3f5d17] Cached process > mapping (RNA081120181_1)
...

IMPORTANT: To resume the execution, the temporary files previously generated by the pipeline must be kept. Otherwise, the pipeline will restart from the beginning.

Results

Several folders are created by the pipeline within the output directory specified by the output parameter; the name of the input folder is used as the sample name.

  • fast5_files: Contains the basecalled multifast5 files. Each batch contains 4000 sequences.
  • fastq_files: Contains one or, in case of demultiplexing, more fastq files.
  • QC_files: Contains each single QC produced by the pipeline.
  • alignment: Contains the bam file(s).
  • cram_files: Contains the cram file(s).
  • counts (OPTIONAL): Contains read counts per gene / transcript if counting was performed.
  • assigned (OPTIONAL): Contains assignment of each read to a given gene / transcript if counting was performed.
  • report: Contains the final multiqc report.
  • variants (OPTIONAL): Contains variant calls (still experimental).
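As an illustration, the output folder could then look roughly as follows (optional folders appear only when the corresponding steps were run):

/Path/to_output_folder/
├── fast5_files/
├── fastq_files/
├── QC_files/
├── alignment/
├── cram_files/
├── counts/      (optional)
├── assigned/    (optional)
├── report/
└── variants/    (optional, experimental)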

NanoPreprocessSimple

This is a light version of NanoPreprocess that does not perform the basecalling step. It allows the same analysis to be performed starting from basecalled reads in fastq format. You can also provide fast5 files if you need to demultiplex with DeepLexiCon.

This module also allows the pipeline to be run on multiple input samples by using this syntax in the params file:

fastq               = "$baseDir/../../../org_data/**/*.fastq.gz"

In this way it will produce one set of output files per sample, with the sample name taken from the folder matched by the two asterisks.
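For example, with the hypothetical input layout below, each sample would be processed separately and its results labelled after its folder:

org_data/sample_A/reads.fastq.gz   ->  outputs named sample_A
org_data/sample_B/reads.fastq.gz   ->  outputs named sample_B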