NanoPreprocess
This module takes raw fast5 reads as input and produces several outputs (basecalled fast5 files, sequences in fastq format, aligned reads in BAM format, etc.). The pre-processing module performs basecalling, optional demultiplexing, filtering, quality control, mapping to a genome or transcriptome reference, and read counting, and it generates a final report on the performance and results of each step. It automatically detects the kind of input fast5 file (single- or multi-sequence).
Workflow
Process name | Description |
---|---|
testInput | Detection of kind of fast5 (multi or single) |
baseCalling | Basecalling with Albacore or Guppy |
demultiplexing_with_deeplexicon | Demultiplexing (optional) with DeePlexiCon |
concatenateFastQFiles | This process concatenates the fastq files produced by each single basecalling |
QC | performed with MinIONQC |
fastQC | on fastq files |
mapping | to the genome / transcriptome with either minimap2 or graphmap2 |
counting | counts per gene with htseq-count when mapping to the genome, or per transcript with NanoCount when mapping to the transcriptome. Uniquely mapping reads are also assigned to a gene or to a transcript. A report file is also generated |
alnQC | QC of aligned reads with bam2stats. |
joinCountQCs | This process is for merging the report files generated by the counting step. |
joinAlnQCs | This process is for merging the QC files generated by the alnQC step. |
alnQC2 | QC of aligned reads with NanoPlot. The plots PercentIdentityvsAverageBaseQuality_kde, LengthvsQualityScatterPlot_dot, HistogramReadlength and Weighted_HistogramReadlength are then merged together in a single picture. |
multiQC | Final report, which can also be sent by email. |
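
The tool choice described for the counting step can be summarised with a small shell sketch. The helper name `counter_for` and the spellings `genome` and `transcriptome` are assumptions made for illustration; the pipeline makes this choice internally.

```shell
# Hypothetical helper: report which counting tool the table above
# associates with each kind of reference.
counter_for() {
  case "$1" in
    genome)        echo "htseq-count" ;;  # counts per gene
    transcriptome) echo "NanoCount" ;;    # counts per transcript
    *)             echo "unknown reference type: $1" >&2; return 1 ;;
  esac
}

counter_for genome
counter_for transcriptome
```
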
You can launch the pipeline with either the -with-singularity or the -with-docker parameter, depending on which containers you want to use.
Input Parameters
Parameter name | Description |
---|---|
fast5 files | Path to fast5 input files. They can contain either a single sequence or multiple ones. They should be inside a folder that will be used as sample name. |
reference | file in fasta format. It can be either a genome or a transcriptome; which one must be specified via the ref_type parameter. |
kit | and flowcell parameters needed for basecalling. |
annotation | in GTF format. It is optional and needed only in case of mapping to the genome and when interested in gene counts. |
seq_type | It can be either RNA or DNA. |
output | output folder name |
granularity | indicates the number of input fast5 files analyzed in a single process. By default it is 4000 for single-sequence fast5 files and 1 for multi-sequence fast5 files. If the GPU option is turned on, this value is not needed since every file will be analyzed sequentially. |
basecaller | program. guppy or albacore are supported. |
basecaller_opt | command line options for basecaller program |
GPU | it allows using GPU or not. It can be either OFF or ON |
demultiplexing | program. Only deeplexicon is supported. It can be turned off by specifying "OFF" |
demultiplexing_opt | options for the demultiplexing program. |
filter | it can be NanoFilt, or OFF if no filtering is needed. |
filter_opt | options of the filtering program. |
mapper | it can be either minimap2 or graphmap2 |
mapper_opt | options of the mapping program. |
map_type | it can be either spliced or not. In case the alignment is to a eukaryotic genome it should be spliced. |
counter | this parameter can be YES for counting the number of tags per gene (in case of mapping to the genome) or per transcript (in case of mapping to the transcriptome). An annotation file is needed in case of mapping to the genome. |
counter_opt | options of the counter program: NanoCount for transcripts and htseq-count for genes. |
for receiving a mail with the final report when the pipeline is finished |
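
Two details from the table above can be illustrated in shell: how a sample name follows from the folder holding the fast5 files, and how many basecalling processes a given granularity implies. This is only a sketch. The path and the file count are made up, and the batch arithmetic is an assumption based on the description of granularity, not taken from the pipeline code.

```shell
# Hypothetical input file; the enclosing folder acts as the sample name.
fast5_file="data/RNA081120181_1/read_0001.fast5"
sample_name="$(basename "$(dirname "$fast5_file")")"
echo "sample: $sample_name"

# Assumed: files are split into batches of at most $granularity each.
n_files=10000        # made-up number of single-sequence fast5 files
granularity=4000     # default for single-sequence fast5 files
n_batches=$(( (n_files + granularity - 1) / granularity ))   # ceiling division
echo "batches: $n_batches"
```
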
You can change these parameters by editing the params.config file or on the command line (each parameter name must be prefixed with two dashes, --):
nextflow run nanopreprocess.nf -with-singularity -bg --output test2 > log.txt
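Several parameters can be overridden in one call. The sketch below only prints the command, so it can be checked before it is run. The parameter names come from the table above, while the values are illustrative.

```shell
# Nextflow's own options take one dash, pipeline parameters take two.
cmd="nextflow run nanopreprocess.nf -with-singularity -bg"
cmd="$cmd --seq_type RNA --mapper minimap2 --output test2"

# Print instead of executing; copy the line to a terminal to launch it.
echo "$cmd > log.txt"
```
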
To resume a previous execution that failed at a certain step, or to rerun after changing a parameter that affects only some steps, you can use the Nextflow parameter -resume (only one dash!):
nextflow run nanopreprocess.nf -with-singularity -bg -resume > log.txt
...
[warm up] executor > crg
[e8/2e64bd] Cached process > baseCalling (RNA081120181_1)
[b2/21f680] Cached process > QC (RNA081120181_1)
[c8/3f5d17] Cached process > mapping (RNA081120181_1)
...
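To see how many steps were reused after a resume, the cached entries in the log can be counted. This is a minimal sketch that assumes the log format shown above; it writes those lines to a temporary file as a stand-in for a real log.txt.

```shell
# Stand-in for log.txt, using the log lines shown above.
log_file="$(mktemp)"
cat > "$log_file" <<'EOF'
[warm up] executor > crg
[e8/2e64bd] Cached process > baseCalling (RNA081120181_1)
[b2/21f680] Cached process > QC (RNA081120181_1)
[c8/3f5d17] Cached process > mapping (RNA081120181_1)
EOF

# Count the processes restored from cache rather than re-run.
cached=$(grep -c "Cached process" "$log_file")
echo "cached processes: $cached"
rm -f "$log_file"
```

With a real run, point `grep` at the log.txt file written by the command above.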
Results:
Seven folders are created by the pipeline within the output folder specified by the output parameter:
- fast5_files: contains the basecalled multisequence fast5 files. Each batch contains 4000 sequences.
- fastq_files: contains one or, in case of demultiplexing, more fastq files.
- QC_files: contains each single QC produced by the pipeline.
- alignment: contains the bam file(s)
- counts: contains read counts per gene / transcript. It is optional.
- assigned: contains assignment of each read to a given gene / transcript. It is optional.
- report: contains the final multiqc report.
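
A quick way to check a finished run is to look for the folders listed above. The sketch below is an illustration, not part of the pipeline: it simulates an output folder in a temporary directory for a run without counting, so the two optional folders are reported as missing. For a real run, set `out_dir` to the value of the output parameter and drop the simulation step.

```shell
# Hypothetical output folder created only for this demonstration.
out_dir="$(mktemp -d)"

# Simulate a run without counting: create the five non-optional folders.
for d in fast5_files fastq_files QC_files alignment report; do
  mkdir -p "$out_dir/$d"
done

# Check all seven folder names and collect the ones that are absent.
missing=""
for d in fast5_files fastq_files QC_files alignment counts assigned report; do
  [ -d "$out_dir/$d" ] || missing="$missing $d"
done
echo "missing:$missing"
rm -rf "$out_dir"
```
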