6.5 Adding more steps
We can make a pipeline incrementally more complex by adding more processes.
Nextflow takes care of the dependencies between inputs and outputs and of the parallelization.
Within the test3 folder we have two more steps to add: the reference indexing and the read alignments with bowtie (http://bowtie-bio.sourceforge.net/index.shtml).
We add a new input for the reference sequence:
log.info """
BIOCORE@CRG - N F TESTPIPE ~ version ${version}
=============================================
reads : ${params.reads}
reference : ${params.reference}
outdir : ${params.outdir}
"""
reference = file(params.reference)
The variable reference now holds the reference file. When it is passed to a process, Nextflow treats it as a singleton (value) channel: its content is never consumed and can be reused indefinitely. We also add paths specifying where to place the output files.
/*
* Defining the output folders.
*/
fastqcOutputFolder = "${params.outdir}/output_fastqc"
alnOutputFolder = "${params.outdir}/output_aln"
multiqcOutputFolder = "${params.outdir}/output_multiQC"
We add two more processes. The first one indexes the reference genome (with bowtie-build):
/*
 * Process 2. Bowtie index
 */
process bowtieIdx {
    tag { "${ref}" }

    input:
    path ref

    output:
    tuple val("${ref}"), path ("${ref}*.ebwt")

    script:
    """
    gunzip -c ${ref} > reference.fa
    bowtie-build reference.fa ${ref}
    rm reference.fa
    """
}
Since bowtie-build requires an uncompressed reference FASTA file, we first gunzip it, then build the reference index, and finally remove the uncompressed copy.
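The decompress, use, remove pattern of this script block can be tried outside Nextflow. The following is a minimal sketch, assuming gzip and gunzip are installed; the toy file genome.fa and its content are hypothetical and not part of the course data. It only shows that gunzip -c writes to standard output, so the compressed input is left in place.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)            # scratch directory for the demo
cd "$workdir"

# Hypothetical toy reference, compressed like the real input would be.
printf '>chr_toy\nACGTACGT\n' > genome.fa
gzip genome.fa                  # leaves only genome.fa.gz

# Same pattern as in bowtieIdx: -c writes to stdout, so the .gz is kept.
gunzip -c genome.fa.gz > reference.fa

first_header=$(head -n 1 reference.fa)
gz_kept=no
[ -f genome.fa.gz ] && gz_kept=yes

rm reference.fa                 # remove the temporary uncompressed copy
echo "header=${first_header} gz_kept=${gz_kept}"
```

Keeping the compressed file untouched matters here because the input is only a link staged by Nextflow, and the same reference may be needed elsewhere.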
The output channel generated is organized as a tuple, i.e. a list of elements.
The first element is the name of the index, passed as a value; the second is the list of files that make up the index.
The former is needed to build the command line of the alignment step; the latter are the files that the alignment actually reads.
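Why the tuple needs both a name and a file list becomes clearer when looking at how a bowtie index is laid out on disk. The sketch below does not run bowtie-build; it creates empty placeholder files named the way bowtie-build names the six files of a small index, under the assumption that the input is called genome.fa.gz (a hypothetical name). It then shows that the glob from the output declaration matches all of them, while the prefix itself is not a file.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"

ref="genome.fa.gz"   # hypothetical input name; plays the role of ${ref}

# Empty placeholders named the way bowtie-build names its six index files.
for suffix in 1 2 3 4 rev.1 rev.2; do
    : > "${ref}.${suffix}.ebwt"
done

# The output glob "${ref}*.ebwt" picks up every index file...
set -- "${ref}"*.ebwt
n_index_files=$#

# ...while the aligner is given only the shared prefix, never a file path.
echo "index prefix: ${ref}"
echo "index files:  ${n_index_files}"
```

Because the prefix does not exist as a file in the directory, it cannot be declared with the path qualifier, which is why it travels through the channel as a val.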
The second process, bowtieAln, is the alignment step:
/*
 * Process 3. Bowtie alignment
 */
process bowtieAln {
    publishDir alnOutputFolder, pattern: '*.sam'

    tag { "${reads}" }
    label 'twocpus'

    input:
    tuple val(refname), path (ref_files)
    path reads

    output:
    path "${reads}.sam", emit: samples_sam
    path "${reads}.log", emit: samples_log

    script:
    """
    bowtie -p ${task.cpus} ${refname} -q ${reads} -S > ${reads}.sam 2> ${reads}.log
    """
}
There are two different input channels: the index and the reads.
The index name, refname, is used to build the command line, while the index files, ref_files, are simply linked into the task's working directory by the path qualifier.
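The two redirections at the end of the bowtie command decide which stream ends up in which file: standard output goes to the .sam file and standard error to the .log file. The sketch below reproduces only that mechanism. The function fake_aligner and the file name sample.fastq.gz are hypothetical stand-ins, and the text they print is invented for illustration; it is not real bowtie output.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in for the aligner: records on stdout, summary on stderr.
fake_aligner() {
    echo "read1 aligned-record"
    echo "# reads processed: 1" >&2
}

reads="sample.fastq.gz"   # hypothetical read file name

# Same redirections as in bowtieAln: stdout -> .sam, stderr -> .log
fake_aligner > "${reads}.sam" 2> "${reads}.log"

sam_content=$(cat "${reads}.sam")
log_content=$(cat "${reads}.log")
echo "sam: ${sam_content}"
echo "log: ${log_content}"
```

Naming both files after the read file keeps the outputs of parallel tasks distinct once they are gathered in one place.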
The process also produces two kinds of output: the alignments and the logs.
The alignments are what we want to keep as a final result, so we select them with the pattern parameter of publishDir.
The logs are only passed on to the multiQC process. To distinguish the two outputs, we assign them different names with emit.
These names let us connect each output directly to other processes when we call them in the workflow section:
workflow {
    fastqc_out = fastQC(reads)
    bowtie_index = bowtieIdx(reference)
    bowtieAln(bowtie_index, reads)
    multiQC(fastqc_out.mix(bowtieAln.out.samples_log).collect())
}
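Here we pass the samples_log output to the multiQC process after mixing it with the output channel of the fastQC process. The collect operator then gathers all the mixed items into a single list, so multiQC runs once on all the files together.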