NanoPreprocess

This module takes the raw fast5 reads (single- or multi-read) as input and produces a number of outputs (basecalled fast5 files, sequences in fastq format, aligned reads in BAM format, etc.). The pre-processing module performs basecalling, demultiplexing (optional), filtering, quality control, mapping to a genome / transcriptome reference, and feature counting, and it generates a final report of the performance and results of each step performed. It automatically detects the kind of input fast5 file (single or multi sequence).

Workflow

Process name Description
testInput Detection of kind of fast5 (multi or single)
baseCalling Basecalling with Albacore or Guppy (up to guppy 4.0)
demultiplexing Demultiplexing (optional)
concatenateFastQFiles This process concatenates the fastq files produced by each single basecalling
QC Performed with MinIONQC
fastQC Executed on fastq files
mapping Mapping to genome / transcriptome with either minimap2, graphmap or graphmap2
counting If mapping to the genome, it obtains counts per gene with htseq-count. Otherwise, if mapping to the transcriptome, transcript counts are generated with NanoCount. Reads are also assigned to a gene or to a transcript if they are uniquely mapping. A report file is also generated.
alnQC2 QC of aligned reads with NanoPlot. The plots PercentIdentityvsAverageBaseQuality_kde, LengthvsQualityScatterPlot_dot, HistogramReadlength and Weighted_HistogramReadlength are then merged into a single image.
alnQC QC of aligned reads with bam2stats.
cram_conversion Generating cram file from alignment.
joinAlnQCs Merging the QC files generated by the alnQC step.
joinCountQCs Merging the report files generated by the counting step.
multiQC Final report generation - optionally sent to the user by email.

Input Parameters

Parameter name Description
fast5 files Path to fast5 input files (single or multi-fast5 files). They should be inside a folder whose name will be used as the sample name. [/Path/sample_name/*.fast5]
reference File in fasta format. [Reference_file.fa]
ref_type Specify if the reference is a genome or a transcriptome. [genome / transcriptome]
kit Kit used in library prep - required for basecalling.
flowcell Flowcell used in sequencing - required for basecalling.
annotation Annotation file in GTF format. It is optional and needed only in case of mapping to the genome and when interested in gene counts. [Annotation_file.gtf]
seq_type Sequence type. [RNA / DNA]
output Output folder name. [/Path/to_output_folder]
granularity Number of input fast5 files analyzed in a single process. The default is 4000 for single-sequence fast5 files and 1 for multi-sequence fast5 files. If the GPU option is turned on, this value is not needed, since every file is analyzed sequentially.
basecaller Algorithm to perform the basecalling. guppy or albacore are supported. [albacore / guppy]
basecaller_opt Command line options for basecaller program. Check available options in respective basecaller repository.
GPU Allow the pipeline to run with GPU. [OFF / ON]
demultiplexing Demultiplexing algorithm to be used. [OFF / deeplexicon / guppy / guppy-readucks]
demultiplexing_opt Command line options for the demultiplexing software.
demulti_fast5 If performing demultiplexing, also generate demultiplexed multifast5 files. [OFF / ON]
filter Program to filter fastq files. [nanofilt / OFF]
filter_opt Command line options of the filtering program.
mapper Mapping algorithm. [minimap2 / graphmap / graphmap2]
mapper_opt Command line options of the mapping algorithm.
map_type Spliced - recommended for genome mapping - or unspliced - recommended for transcriptome mapping. [spliced / unspliced]
counter Whether to generate gene counts (genome mapping) or transcript counts (transcriptome mapping). [YES / “”]
counter_opt Command line options of the counter program: NanoCount for transcripts and Htseq-count for genes.
email User's email for receiving the final report when the pipeline is finished. [user_email]

You can change these parameters by editing the params.config file or by overriding them on the command line - please see the next section.
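For instance, a few entries in params.config might look like the snippet below. This is only a minimal sketch: the parameter names follow the table above, while all paths and values (including kit and flowcell) are placeholders to be adapted to your own data and run.

// example values only - adjust paths, kit and flowcell to your run
fast5       = "/Path/sample_name/*.fast5"
reference   = "/Path/Reference_file.fa"
ref_type    = "genome"
kit         = "SQK-RNA002"
flowcell    = "FLO-MIN106"
seq_type    = "RNA"
output      = "/Path/to_output_folder"
basecaller  = "guppy"
GPU         = "OFF"
mapper      = "minimap2"
map_type    = "spliced"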

How to run the pipeline

Before launching the pipeline, the user should decide which container engine to use - either Docker or Singularity [-with-docker / -with-singularity]. Then, to launch the pipeline, please use the following command:

nextflow run nanopreprocess.nf -with-singularity > log.txt
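If Docker is installed instead of Singularity, the same command can be run with the other container flag mentioned above:

nextflow run nanopreprocess.nf -with-docker > log.txt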


  • Run the pipeline in the background:
    nextflow run nanopreprocess.nf -with-singularity -bg > log.txt
    


  • Run the pipeline while changing params.config file:
    nextflow run nanopreprocess.nf -with-singularity -bg --output test2 > log.txt
    


  • Specify a directory for the working directory (temporary files location):
    nextflow run nanopreprocess.nf -with-singularity -bg -w /path/working_directory > log.txt
    


  • Run the pipeline with GPU - CRG GPU cluster users:
    nextflow run nanopreprocess.nf -with-singularity -bg -w /path/working_directory -profile cluster > log.txt
    


  • Run the pipeline with GPU - local GPU:
    nextflow run nanopreprocess.nf -with-singularity -bg -w /path/working_directory -profile local > log.txt
    

Troubleshooting

  • Checking what has gone wrong:
    If there is an error, please see the log file (log.txt) for more details. If more information is needed, the log file also reports the working directory of the failed process: go to that directory and check both the .command.log and .command.err files.
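    As an illustration (the hash-named directory below is hypothetical - use the path printed in your own log file):

    cd /path/working_directory/c8/3f5d17xxxxxx   # hypothetical work directory of the failed process
    cat .command.log                              # standard output of the process
    cat .command.err                              # error messages of the process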

  • Resume an execution:
    Once the error has been solved, or if you change a specific parameter, you can resume the execution with the Nextflow parameter -resume (note: only one dash!). If there was an error, the pipeline will resume from the process that failed and proceed with the rest. If a parameter was changed, only the processes affected by that parameter will be re-run.

nextflow run nanopreprocess.nf -with-singularity -bg -resume > log_resumed.txt

To check whether the pipeline has been resumed properly, please check the log file. If the previously executed processes are reported as Cached, the resume worked!

...

[warm up] executor > crg
[e8/2e64bd] Cached process > baseCalling (RNA081120181_1)
[b2/21f680] Cached process > QC (RNA081120181_1)
[c8/3f5d17] Cached process > mapping (RNA081120181_1)
...

IMPORTANT: To resume the execution, the temporary files previously generated by the pipeline must be kept. Otherwise, the pipeline will restart from the beginning.

Results

Several folders are created by the pipeline within the output directory specified by the output parameter; the name of the input folder is used as the sample name.

  • fast5_files: Contains the basecalled multifast5 files. Each batch contains 4000 sequences.
  • fastq_files: Contains one or, in case of demultiplexing, more fastq files.
  • QC_files: Contains each single QC produced by the pipeline.
  • alignment: Contains the bam file(s).
  • cram_files: Contains the cram file(s).
  • counts (OPTIONAL): Contains read counts per gene / transcript if counting was performed.
  • assigned (OPTIONAL): Contains assignment of each read to a given gene / transcript if counting was performed.
  • report: Contains the final multiqc report.
  • variants (OPTIONAL): Contains variant calls (still experimental).
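As an illustration, the output folder could then look roughly as follows (optional folders appear only when the corresponding steps were run):

/Path/to_output_folder/
├── fast5_files/
├── fastq_files/
├── QC_files/
├── alignment/
├── cram_files/
├── counts/      (optional)
├── assigned/    (optional)
├── report/
└── variants/    (optional, experimental)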

NanoPreprocessSimple

This is a light version of NanoPreprocess that does not perform the basecalling step. It allows the same analysis to be performed starting from basecalled reads in fastq format. You can also provide fast5 files if you need to demultiplex with DeepLexiCon.

This module also allows the pipeline to be run on multiple input samples by using this syntax in the params file:

fastq               = "$baseDir/../../../org_data/**/*.fastq.gz"

In this way it will produce one set of output files per sample, with the sample name taken from the folder matched by the two asterisks.
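For example, with the hypothetical input layout below, each sample would be processed separately and its results labelled after its folder:

org_data/sample_A/reads.fastq.gz   ->  outputs named sample_A
org_data/sample_B/reads.fastq.gz   ->  outputs named sample_B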