Nextflow Features
In lecture, we will have covered the basics of how nextflow works conceptually. In this guide, we’ll cover a few extra features of nextflow that are useful for workflow development and execution.
Basic Nextflow Directory for BF528
The basic nextflow directory will look something like this:
project/
├── bin/
│   └── python_script.py
├── envs/
│   └── tool_env.yml
├── modules/
│   └── module1/
│       └── main.nf
├── main.nf
├── data/
│   └── test_data.csv
├── nextflow.config
└── samplesheet.csv
bin/
The bin directory is where we will place any accessory scripts that we will use in our nextflow pipeline. Nextflow automatically includes this directory in the PATH environment variable, so you can refer to these scripts using their names without having to provide the full path or provide the files directly. You will need to make the script executable using the following command:
chmod +x <name-of-script>
You will then be able to call a script by using the following syntax in any nextflow process:
shell:
"""
<name-of-script> <arguments>
"""
or more descriptively:
shell:
"""
parse_gtf.py -i <arg> -o <arg>
"""
Ensure that the script has the correct shebang line at the top of the file. For example:
#!/usr/bin/env python
We will also be using an argument parser to parse any arguments passed to the script. In python, you can use the argparse module to do this, or in R, you can use the argparser package.
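For example, a minimal bin/ script built around argparse might look like the sketch below. The script name parse_gtf.py and its -i/-o arguments match the earlier example, but the body is only a placeholder:

#!/usr/bin/env python
# parse_gtf.py -- illustrative sketch of an argparse-based accessory script for bin/

import argparse


def main():
    parser = argparse.ArgumentParser(description="Example accessory script")
    parser.add_argument("-i", "--input", required=True, help="path to the input file")
    parser.add_argument("-o", "--output", required=True, help="path to the output file")
    args = parser.parse_args()

    # placeholder logic: copy the input to the output line by line
    with open(args.input) as fin, open(args.output, "w") as fout:
        for line in fin:
            fout.write(line)


if __name__ == "__main__":
    main()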
envs/
The envs directory is where we will place any conda environments that we will use in our nextflow pipeline. Per our class guidelines, we will make conda environments using YML files according to the following conventions in the computational environments guide found here.
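The exact conventions are described in that guide, but as a rough sketch, a tool environment file (for example the envs/star_env.yml referenced later in this guide) might look something like this:

# envs/star_env.yml -- illustrative sketch; follow the class guide for naming and version pinning
name: star_env
channels:
  - conda-forge
  - bioconda
dependencies:
  - star=2.7.11b   # pin whichever version you intend to use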
modules/
Eventually we will develop our nextflow processes in the modules directory. This practice will allow us to keep our modules separate and reuse them across different workflows by simply copying them into a new project directory.
When using modules this way, in your nextflow workflow main.nf you will need to include the module's process with the following line:

include { <module_name> } from './modules/<module_name>/main.nf'

workflow {
    <module_name>()
}
As you can see, this is conceptually similar to the way we call a function or import a module in python.
Nextflow Working Directory
By default, nextflow assumes the working directory is the directory where the main.nf file is located. This means that you can refer to files in the working directory using relative paths.
For example, in our main.nf, when we are using include to import a module, we can refer to the module file using a relative path:

include { <module_name> } from './modules/<module_name>/main.nf'

This assumes the working directory structure we set up above.
In each module, we can also use relative paths when referring to other files. For example, if we are using a conda environment, we can refer to the environment YML file using the following path:
process INDEX {
    label 'process_high'
    conda 'envs/star_env.yml'
    ...
}
Note that the path is a relative path as nextflow resolves against the workflow
launching directory (where the main.nf file is located). If we wanted to be more
careful, we could encode the path to the environment YML file using the $projectDir
variable in the nextflow.config file.
$projectDir
As a good practice, we will use the $projectDir
variable to refer to the
root directory of the project. This is a variable that is automatically
assigned by nextflow and points to the directory where the main.nf file is
located.
For example, if your project directory looks like this:
project/
├── main.nf
├── data/
│   └── test_data.csv
└── nextflow.config
You can refer to the test data file using the following path:
test_data = "$projectDir/data/test_data.csv"
If our full path was /home/user/project/data/test_data.csv, then the variable $projectDir would resolve to /home/user/project and the variable $projectDir/data/test_data.csv would resolve to /home/user/project/data/test_data.csv.
This allows us to avoid hardcoding the full path to the file and makes it easier to
move the project to a different location without having to update the path in the config file.
Nextflow Config
params
We will be using the config file to store static values, files and variables that will be used in the workflow. This is important because it allows us to specify these values once and use them throughout the workflow without having to change each instance if we update one.
For example, you can see a sample config file below:
params {
test_data = "$projectDir/data/test_data.csv"
ref_genome = "$projectDir/data/ref_genome.fa"
}
This config file above stores the paths to the test data and reference genome
as variables that can be used throughout the workflow. We can reference these
variables elsewhere using the params.variable_name syntax.
For example, we could refer to the test data file anywhere in the workflow or in a process as params.test_data.
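As a minimal sketch, we might turn that parameter into a channel in the workflow block like this (the channel name test_ch is just illustrative):

workflow {
    // create a channel from the path stored in params.test_data
    test_ch = Channel.fromPath(params.test_data)
    test_ch.view()
}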
Profiles
If you look in the nextflow.config file at the top-level of the directory, you will see a number of profiles defined. These profiles correspond to preset options that will be automatically applied when you run the workflow.
For example, if you run the workflow using the following command:
nextflow run main.nf -profile test
The test profile will be applied and the workflow will be run with the preset options defined in that profile.
For a more realistic test case, you will often see the following command:
nextflow run main.nf -profile conda,cluster
This will instruct nextflow to use the pre-defined options in the conda and cluster profiles. These profiles specify a number of important options for nextflow to use conda and submit jobs appropriately to the SCC.
You can see the options defined in each profile by looking at the nextflow.config file.
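The exact contents live in the provided nextflow.config, but a profiles block is typically structured roughly like the sketch below; the specific settings shown here are illustrative, not necessarily the ones in our config:

profiles {
    conda {
        conda.enabled = true            // build and use conda environments for processes
    }
    cluster {
        process.executor = 'sge'        // submit each task as a job to the SCC scheduler
    }
    test {
        params.test_data = "$projectDir/data/test_data.csv"   // hypothetical small test inputs
    }
}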
Nextflow Labels
Nextflow labels are a way to assign labels to processes in the workflow. These labels can be used to pre-assign any default values to specific processes. We will primarily use labels to assign processes a different set of computational resources to request when submitting to the SCC.
This requires the label to be defined in two places:
- In the nextflow.config file
- In the process definition in the module main.nf file
In the nextflow.config, these labels are defined within the process scope and will look like below:

process {
    withLabel: process_high {
        cpus = 16
    }
}
If we had a sample process with this label, it would look like below:

process INDEX {
    label 'process_high'
    conda 'envs/star_env.yml'
    ...
}
Whenever we submit a job to the SCC, nextflow will automatically assign the values defined in the label to the process. For this process INDEX, it will appropriately request 16 CPUs in the qsub command. These labels allow us to dynamically request different amounts of resources for different processes without having to manually specify them every time. This has the benefit of ensuring that we are not requesting more resources than we need for simple tasks and that we are requesting enough to make complex tasks run faster or at all. Remember that the SCC is a shared resource and we want to be respectful of the other users. Also, keep in mind that requesting nodes with more resources will likely increase the amount of time our jobs spend in the queue before they begin running, as there are fewer nodes available with these resources and they are in high demand.
$task.cpus
Remember that we have several layers to our resource requests:
- We must request the appropriate amount of resources in nextflow.
  - We accomplish this with the label found in both the process and the nextflow.config file.
  - Each process should have a label and the nextflow.config file should define the resources for that label.
- We must ensure that the tool is instructed to use the appropriate amount of resources.
  - We accomplish this with the $task.cpus variable in the shell script.
  - Check each tool's documentation for how to specify how many resources the process should use.
We will most often be using multiple cores or CPUs to accelerate our computational processes.
Remember that you need to match the amount you request via qsub with the amount you instruct the program to utilize in the shell script.
You may use the $task.cpus
variable in the shell script to access the values
assigned in the nextflow.config file. Note in the following example that several
directives are missing for clarity.
#!/usr/bin/env nextflow

process ALIGN {
    label 'process_high'

    script:
    """
    STAR --runThreadN $task.cpus <other directives>
    """
}
At runtime, nextflow will replace the $task.cpus
variable with the value assigned
in the nextflow.config file, specifically the value in the process_high label under
the “cpus” directive.
You can specify memory in a similar way by setting the memory directive under a label in the nextflow.config file.
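For example, a single label in the nextflow.config could set both CPUs and memory; the values below are only illustrative:

process {
    withLabel: process_high {
        cpus = 16
        memory = '32 GB'   // illustrative value; match what the tool actually needs
    }
}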
String interpolation
Nextflow allows for string interpolation using the ${}
syntax. This allows us to
include variables in strings dynamically. For example:
#!/usr/bin/env nextflow

process ALIGN {
    label 'process_high'
    conda 'envs/star_env.yml'
    publishDir params.outdir, pattern: "*.Log.final.out"

    input:
    tuple val(meta), path(reads)
    path(index)

    output:
    tuple val(meta), path("${meta}.Aligned.out.bam"), emit: bam
    tuple val(meta), path("${meta}.Log.final.out"), emit: log

    script:
    """
    STAR --runThreadN $task.cpus --genomeDir $index --readFilesIn $reads --readFilesCommand zcat --outFileNamePrefix $meta. --outSAMtype BAM Unsorted
    """
}
You can see that we are using the ${meta} variable to specify the sample name in the output file name. The file generated will have the value of meta substituted into the file name and will end with .Aligned.out.bam. This value is the first element of the tuple passed in the input channel and is typically the name or identifier of the sample. This is a common pattern in nextflow and allows us to dynamically generate file names based on the name passed in the input channel tuple.
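For instance, with a hypothetical input tuple like the one sketched below (the sample name and file path are made up), the output files would be named sampleA.Aligned.out.bam and sampleA.Log.final.out:

workflow {
    // hypothetical input: meta = 'sampleA', reads = a single FASTQ file
    reads_ch = Channel.of( ['sampleA', file("$projectDir/data/sampleA.fastq.gz")] )
    // given this tuple, ${meta}.Aligned.out.bam resolves to sampleA.Aligned.out.bam
}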
String functions (.baseName)
Groovy also has a few built-in functions that we can use to generate file names. For example, we can use the baseName function to extract the base name of a file path. This is useful when we want to remove the file extension from a file path.
#!/usr/bin/env nextflow

process BOWTIE2_BUILD {
    label 'process_high'
    container 'ghcr.io/bf528/bowtie2:latest'

    input:
    path(genome)

    output:
    path('bowtie2_index'), emit: index
    val genome.baseName, emit: name

    shell:
    """
    mkdir bowtie2_index
    bowtie2-build --threads $task.cpus $genome bowtie2_index/${genome.baseName}
    """
}
In this example, we are using the baseName function to extract the base name of the genome file path. Bowtie2 requires us to provide the base name of the index files as an argument without the extensions. This is a common pattern in nextflow and is another method that allows us to dynamically generate file names based on the files passed in the input channel.
(*) and (**) patterns
Nextflow also supports the bash * glob pattern to match any number of files in a directory. For example, we could use the following output declaration:

output:
path('*.fastq.gz')

The above line would instruct nextflow to match any file ending in .fastq.gz in the current directory and emit them in the output channel.
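As a sketch of how this might appear in a process, a hypothetical trimming step could collect all of its gzipped FASTQ outputs this way (the command itself is just a placeholder):

process TRIM {
    input:
    tuple val(meta), path(reads)

    output:
    path('*.fastq.gz'), emit: trimmed

    shell:
    """
    <trimming-command> $reads
    """
}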
You can also use the **
to recurse through directories. For example, we could use the following:
#!/usr/bin/env nextflow

process NCBI_DATASETS_CLI {
    label 'process_single'
    conda "envs/ncbidatasets_env.yml"

    input:
    tuple val(name), val(GCF)

    output:
    tuple val(name), path('dataset/**/*.fna')

    shell:
    """
    datasets download genome accession $GCF --include genome
    unzip ncbi_dataset.zip -d dataset/
    """
}
The above output declaration would instruct nextflow to match any file ending in .fna in the dataset directory and any of its subdirectories and emit them in the output channel. This is useful in cases where processes create multiple output files in different directories or deeply nested directories. The above example demonstrates this by downloading a genome with the NCBI datasets CLI and emitting the FASTA file in the dataset directory. By default, the NCBI datasets CLI will download an ncbi_dataset.zip file with the requested files, which we unzip into the dataset directory. The files provided are in a nested directory structure, so we use the ** glob pattern to match any file ending in .fna in the dataset directory and any subdirectories while avoiding hardcoding the exact path to the file.
Nextflow work/ directory
Nextflow automatically creates a work directory where it stores the output of each process. Nextflow creates a hash of the process name and the input files to create a unique directory for each process. When a task is run, nextflow will stage any input files into these directories and each process will run in its own directory. These directories are located in the work/ directory and if we ran a single task the directory structure would look something like this:
work/
└── 03/
    └── c746ec9f000f7b3f9bbccebd1dca3d/
        ├── .command.begin
        ├── .command.err
        ├── .command.log
        ├── .command.out
        ├── .command.run
        ├── .command.sh
        ├── .command.trace
        ├── .exitcode
        ├── staged files
        └── output files
This work directory will contain a number of hidden dot-files that record information about the execution of each process. These files include:
- .command.begin: A script that is run before the process starts
- .command.err: The standard error output of the process
- .command.log: The log output of the process
- .command.out: The standard output of the process
- .command.run: The script that is run to execute the process
- .command.sh: The shell script that is run to execute the process
- .command.trace: A trace of the process execution
- .exitcode: The exit code of the process
Any files that are staged into the process directory will be copied into the work directory and all output files will be generated in this directory as well. One of the advantages of this strategy is that each process is completely self-contained and cannot be affected by other processes. This also means that we can refer to any files generated in a process with relative paths. For instance, we can create directories in each process and refer to them with relative paths. For example, we can create a directory and then run a command that generates a file in that directory.
process INDEX {
    label 'process_high'
    conda 'envs/star_env.yml'

    input:
    path genome
    path gtf

    output:
    path "star", emit: index

    script:
    """
    mkdir star
    STAR --runThreadN $task.cpus --runMode genomeGenerate --genomeDir star --genomeFastaFiles $genome --sjdbGTFfile $gtf --genomeSAindexNbases 11
    """
}
In this example, we create a directory called star and then run a command that generates files in that directory. Specifically, we create the star directory and then point the --genomeDir option at that directory. We can then refer to the directory and the files inside it with relative paths.
Nextflow Log
The nextflow log command shows information about executed pipelines. This is helpful for showing you the various exit statuses of the processes in your pipeline. A sample command may look like the following:
nextflow log <run_name> -f hash,name,exit,status
This will show you the hash, the name, the exit code, and the completion status of each process, which will look something like this:

hash       name   exit  status
03/c746ec  INDEX  0     COMPLETED
Although the work/ directory strategy has a number of advantages, it can be a bit of a pain to navigate and manage. This nextflow log command is an easy way to see where each process is located and what its exit status was. If you need to manually inspect the output of a process, you can use the hash value to navigate to the directory where you can view all of the running information for that process, input files, and output files.
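For example, using the sample hash above (your hash will differ), you could inspect that task's directory like this:

nextflow log                                     # list previous runs and their names
nextflow log <run_name> -f hash,name,exit,status
cd work/03/c746ec*                               # the short hash expands to the full task directory
cat .command.sh                                  # the exact shell script that was run
cat .command.err                                 # the standard error output of the process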
Nextflow PublishDir
Nextflow also has a publishDir option that allows you to specify a directory where you want to publish the output of a process. This is helpful for gathering final output files from a process and storing them in a single location. You may also wish to use publishDir to share any QC or log output files from each process.
process ALIGN {
    label 'process_high'
    conda 'envs/star_env.yml'
    publishDir params.outdir, pattern: "*.Log.final.out"
    ...
}
This will publish the output of the ALIGN process to the directory specified in the params.outdir parameter. The pattern option allows you to specify a pattern to match the output files that you want to publish. In this case, we are only publishing files that match the pattern and end with .Log.final.out.
Typically, we will make the publishDir location, here params.outdir, a directory in this same repository called results/.
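In practice, that just means defining outdir in the params block of the nextflow.config, for example:

params {
    outdir = "$projectDir/results"
}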
Nextflow report
Nextflow can create an HTML report that summarizes the execution of a pipeline. This is helpful for documentation purposes and for examining some runtime metrics for each process.
This can be achieved by adding the following flag to the command:
nextflow run <main.nf> -with-report [file name]
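For example, to also generate a report for a full run on the SCC, you might use something like the following (the report file name is arbitrary):

nextflow run main.nf -profile conda,cluster -with-report report.html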
One of the more helpful features of this report is the Resource Usage section, where you can see the memory and CPU usage for each process. This can be helpful for determining how many resources to request for a given process.