5.9 More complex scripts

We can feed the channel generated by one process to another process in the workflow definition. In this way we have a proper pipeline. Note that the variables used by AWK must be escaped, otherwise Nextflow treats them as its own variables and raises an error. Every special character like $ therefore needs to be escaped (\$).

For long and difficult one-liners you might prefer to write a small shell script and call it as an executable. Place it in a folder named bin inside the pipeline folder and make it executable. Nextflow automatically adds this folder to the PATH, so the script can be called like any other tool.
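As a rough sketch of the bin approach, the reversing one-liner used later in this section could be moved into a small script. The name reverse_fasta.sh and the file example.fa are hypothetical and not part of the course material, and the scratch directory only stands in for the pipeline folder. Because the AWK code no longer sits inside a Nextflow script string, the $ signs are written without a backslash.

```shell
# Stand-in for the pipeline folder (assumption: any folder containing the .nf script)
cd "$(mktemp -d)"

# Create the hypothetical helper inside the bin folder
mkdir -p bin
cat > bin/reverse_fasta.sh <<'EOF'
#!/usr/bin/env bash
# Usage: reverse_fasta.sh FILE [FILE...]
# Header lines (containing ">") are printed unchanged, every other line is reversed.
awk '{if ($1~">") {print $0} else system("echo " $0 " |rev")}' "$@"
EOF

# The script must be executable to be called by name
chmod +x bin/reverse_fasta.sh

# Quick check outside Nextflow on a made-up record
printf '>seq1\nACGT\n' > example.fa
./bin/reverse_fasta.sh example.fa
```

Inside a process, the script block could then contain just reverse_fasta.sh ${seq} > all.rev, with no escaped characters.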

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

// the default "$baseDir/testdata/test.fa" can be overridden by using --inputfile OTHERFILENAME
params.inputfile = "$baseDir/testdata/test.fa"

// the "file method" returns a file system object given a file path string  
sequences_file = file(params.inputfile)             

// check if the file exists
if( !sequences_file.exists() ) exit 1, "Missing input file: ${params.inputfile}"


/*
 * Process 1 for splitting a fasta file in multiple files
 */

process splitSequences {

    input:
    path sequencesFile

    output:
    path ('seq_*')    

    // simple awk command
    script:
    """
    awk '/^>/{f="seq_"++d} {print > f}' < ${sequencesFile}
    """
}


/*
 * Process 2 for reversing the sequences. Note the escaped AWK variables \$
 */

process reverseSequence {
    tag { "${seq}" }                

    input:
    path seq

    output:
    path "all.rev"

    script:
    """
    cat ${seq} | awk '{if (\$1~">") {print \$0} else system("echo " \$0 " |rev")}' > all.rev
    """
}

workflow {
    splitted_seq    = splitSequences(sequences_file)

    // Here you have the output channel as a collection
    splitted_seq.view()

    // Here you have the same channel reshaped to send separately each value
    splitted_seq.flatten().view()

    // DSL2 allows you to reuse the channels! In the past you had to create many identical
    // channels for sending the same kind of data to different processes

    rev_single_seq  = reverseSequence(splitted_seq)
}

Here we have two simple processes:

the former splits the input fasta file into single sequences.
the latter reverses each sequence, leaving the header lines unchanged.

The input path is passed to the script through the parameter:

params.inputfile

Note: you can get the file “test.fa” from the GitHub repository of the course

This value can be overridden when calling the script:

nextflow run test1.nf --inputfile another_input.fa

The workflow part connects the two processes so that the output of the first process is fed as an input to the second one.
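To get a feeling for what each step produces, the two AWK commands can be tried by hand in the same order in a plain shell. This is only a sketch: the file toy.fa and its content are made up for illustration (it is not the course's test.fa), and outside a Nextflow script block the $ signs are written without a backslash.

```shell
# Work in a scratch directory so the seq_* files do not clutter anything
cd "$(mktemp -d)"

# A made-up FASTA file with two records
printf '>one\nACGT\n>two\nGGCC\nTTAA\n' > toy.fa

# Step 1: same command as in splitSequences.
# Every header line (starting with ">") opens a new file seq_1, seq_2, ...
awk '/^>/{f="seq_"++d} {print > f}' < toy.fa
ls seq_*

# Step 2: same command as in reverseSequence, without the Nextflow escaping.
# Headers are kept, all other lines are reversed with rev.
cat seq_* | awk '{if ($1~">") {print $0} else system("echo " $0 " |rev")}' > all.rev
cat all.rev
```

With this toy input, all.rev keeps the two headers and contains the lines TGCA, CCGG and AATT in place of the original sequences.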

During the execution Nextflow creates a number of temporary folders, and will this time also create a soft link to the original input file. It will then store output files locally.

The output files are then linked into the work folders of the other processes that use them as input.
This avoids clashes and keeps each process nicely isolated from the others.
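The linking mechanism can be imitated by hand in a shell. This is only an illustration with made-up folder names (task_a, task_b); Nextflow creates and names the real work folders itself and may stage files differently depending on its configuration.

```shell
cd "$(mktemp -d)"
mkdir -p work/task_a work/task_b

# task_a writes its output locally, inside its own folder
echo "ACGT" > work/task_a/seq_1

# task_b receives that file as a soft link to the absolute path
ln -s "$PWD/work/task_a/seq_1" work/task_b/seq_1

# task_b reads through the link and stores its own output locally
rev work/task_b/seq_1 > work/task_b/all.rev

# The listing shows one regular file (all.rev) and one soft link (seq_1)
ls -l work/task_b
```

Each folder holds only its own outputs plus links to its inputs, so two tasks never write to the same place.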

nextflow run test1.nf -bg

N E X T F L O W  ~  version 20.07.1
Launching `test1.nf` [sad_newton] - revision: 82e66714e4
[09/53e071] Submitted process > splitSequences
[/home/ec2-user/git/CoursesCRG_Containers_Nextflow_May_2021/nextflow/nextflow/work/09/53e071d286ed66f4020869c8977b59/seq_1, /home/ec2-user/git/CoursesCRG_Containers_Nextflow_May_2021/nextflow/nextflow/work/09/53e071d286ed66f4020869c8977b59/seq_2, /home/ec2-user/git/CoursesCRG_Containers_Nextflow_May_2021/nextflow/nextflow/work/09/53e071d286ed66f4020869c8977b59/seq_3]
/home/ec2-user/git/CoursesCRG_Containers_Nextflow_May_2021/nextflow/nextflow/work/09/53e071d286ed66f4020869c8977b59/seq_1
/home/ec2-user/git/CoursesCRG_Containers_Nextflow_May_2021/nextflow/nextflow/work/09/53e071d286ed66f4020869c8977b59/seq_2
/home/ec2-user/git/CoursesCRG_Containers_Nextflow_May_2021/nextflow/nextflow/work/09/53e071d286ed66f4020869c8977b59/seq_3
[fe/0a8640] Submitted process > reverseSequence ([seq_1, seq_2, seq_3])

We can inspect the content of work/09/53e071* generated by the process splitSequences:

ls -l work/09/53e071*

total 24
-rw-r--r--  1 lcozzuto  staff  29 Oct  8 19:16 seq_1
-rw-r--r--  1 lcozzuto  staff  33 Oct  8 19:16 seq_2
-rw-r--r--  1 lcozzuto  staff  27 Oct  8 19:16 seq_3
lrwxr-xr-x  1 lcozzuto  staff  69 Oct  8 19:16 test.fa -> /home/ec2-user/git/CoursesCRG_Containers_Nextflow_May_2021/nextflow/nextflow/testdata/test.fa

The file test.fa is a soft link to the original input.
If we now inspect work/fe/0a8640*, generated by the process reverseSequence, we see that the files produced by splitSequences are linked there as input.

ls -l work/fe/0a8640*

total 8
-rw-r--r--  1 lcozzuto  staff  89 Oct  8 19:16 all.rev
lrwxr-xr-x  1 lcozzuto  staff  97 Oct  8 19:16 seq_1 -> /home/ec2-user/git/CoursesCRG_Containers_Nextflow_May_2021/nextflow/nextflow/work/09/53e071d286ed66f4020869c8977b59/seq_1
lrwxr-xr-x  1 lcozzuto  staff  97 Oct  8 19:16 seq_2 -> /home/ec2-user/git/CoursesCRG_Containers_Nextflow_May_2021/nextflow/nextflow/work/09/53e071d286ed66f4020869c8977b59/seq_2
lrwxr-xr-x  1 lcozzuto  staff  97 Oct  8 19:16 seq_3 -> /home/ec2-user/git/CoursesCRG_Containers_Nextflow_May_2021/nextflow/nextflow/work/09/53e071d286ed66f4020869c8977b59/seq_3

At this point we can write two different workflows to demonstrate how the new DSL allows code to be reused.

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

// this can be overridden by using --inputfile OTHERFILENAME
params.inputfile = "$baseDir/testdata/test.fa"

// the "file method" returns a file system object given a file path string  
sequences_file = file(params.inputfile)             

// check if the file exists
if( !sequences_file.exists() ) exit 1, "Missing input file: ${params.inputfile}"


/*
 * Process 1 for splitting a fasta file in multiple files
 */

process splitSequences {

    input:
    path sequencesFile

    output:
    path ('seq_*')    

    // simple awk command

    script:
    """
    awk '/^>/{f="seq_"++d} {print > f}' < ${sequencesFile}
    """
}


/*
 * Process 2 for reversing the sequences
 */

process reverseSequence {
    tag { "${seq}" }                

    input:
    path seq

    output:
    path "all.rev"

    script:
    """
    cat ${seq} | awk '{if (\$1~">") {print \$0} else system("echo " \$0 " |rev")}' > all.rev
    """
}

workflow flow1 {
    take: sequences
    main:
    splitted_seq        = splitSequences(sequences)
    rev_single_seq      = reverseSequence(splitted_seq)
}

workflow flow2 {
    take: sequences
    main:
    splitted_seq        = splitSequences(sequences).flatten()
    rev_single_seq      = reverseSequence(splitted_seq)

}

workflow {
   flow1(sequences_file)
   flow2(sequences_file)
}

The first workflow will just run like the previous script, while the second will “flatten” the output of the first process and will launch the second process on each single sequence.

The reverseSequence process of the second workflow will run in parallel if you have enough processors, or if you are running the script in a cluster environment with a scheduler supported by Nextflow.
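The difference between the two workflows can be pictured with a plain shell analogy. It does not show how Nextflow schedules tasks; it is only a sketch in which flow1 resembles a single command that receives all the files at once, and flow2 resembles one command per file, started here as background jobs. The helper function reverse and the three small input files are made up for illustration.

```shell
cd "$(mktemp -d)"
printf '>a\nACGT\n' > seq_1
printf '>b\nGGCC\n' > seq_2
printf '>c\nTTAA\n' > seq_3

# Hypothetical helper wrapping the AWK command of reverseSequence
reverse() {
    awk '{if ($1~">") {print $0} else system("echo " $0 " |rev")}' "$@"
}

# flow1-like: a single task gets the whole collection [seq_1, seq_2, seq_3]
reverse seq_1 seq_2 seq_3 > all.rev

# flow2-like: after "flatten", one task per file, all started concurrently
for f in seq_1 seq_2 seq_3; do
    reverse "$f" > "$f.rev" &
done
wait    # block until every background task has finished

ls *.rev
```

The results are the same in both cases. Only the number of tasks changes, which is what gives a scheduler room to run them in parallel.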

nextflow run test1.nf -bg

N E X T F L O W  ~  version 20.07.1
Launching `test1.nf` [insane_plateau] - revision: d33befe154
[bd/f4e9a6] Submitted process > flow1:splitSequences
[37/d790ab] Submitted process > flow2:splitSequences
[33/a6fc72] Submitted process > flow1:reverseSequence ([seq_1, seq_2, seq_3])
[87/54bfe8] Submitted process > flow2:reverseSequence (seq_2)
[45/86dd83] Submitted process > flow2:reverseSequence (seq_1)
[93/c7b1c6] Submitted process > flow2:reverseSequence (seq_3)