Nextflow Basics

Setting up to run Nextflow on the SCC

We will be creating a single conda environment that contains only the nextflow package. We will activate this environment every time we want to run Nextflow on the SCC.

Remember that when you first begin working on the SCC, you will need to load the module for miniconda and either create the nextflow conda environment or activate it if it already exists.

Just like for other conda environments, we will encode the environment specification in a YAML file. This will allow us to create the environment on any system that has conda installed. It also improves the reproducibility of your workflow by making all of the environment specifications explicit.

This environment should contain only Nextflow, pinned to a recent release (25.04.6 in the example below). Other tools and packages will be specified in their own YAML files, and Nextflow will handle the installation of each.

The steps look something like this:

Create a YAML file named nextflow_env.yml that pins a recent version of Nextflow:

name: nextflow_env
channels:
- conda-forge
- bioconda

dependencies:
- nextflow=25.04.6

Then, with the miniconda module loaded (module load miniconda), create the environment using the following command:

conda env create -f nextflow_env.yml

And then activate it:

conda activate nextflow_env

If you have already completed these steps in a previous session, you only need to load the module and activate the environment:

module load miniconda
conda activate nextflow_env
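
The create-or-activate decision described above can be sketched as a small shell helper. This is only an illustration, not part of the course materials: the function names env_listed and setup_nextflow_env are hypothetical, and the sketch assumes that module and conda are available in your interactive SCC shell and that nextflow_env.yml is in the current directory.

```shell
# Sketch only: assumes `module` and `conda` work in an interactive SCC shell
# and that nextflow_env.yml is in the current directory.

# env_listed NAME: succeed if NAME is the first field of any line read from
# stdin (the layout printed by `conda env list`).
env_listed() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Hypothetical helper: load miniconda, create the environment only if it is
# missing, then activate it.
setup_nextflow_env() {
  module load miniconda
  if ! conda env list | env_listed nextflow_env; then
    conda env create -f nextflow_env.yml || return 1
  fi
  conda activate nextflow_env
}
```

After pasting or sourcing these definitions in your terminal, calling setup_nextflow_env should leave you in the nextflow_env environment whether or not it existed beforehand.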

Running Nextflow on the SCC

There are several ways to run Nextflow on the SCC using our different profiles.

The most common combinations will be:

-profile local,conda

nextflow run main.nf -profile local,conda

This set of profile options will run Nextflow on the interactive node where your VSCode terminal is running. We will use this method when troubleshooting the pipeline and when working with subsets of the data. Remember that although the VSCode terminal is running on a compute node, the VSCode session requested only a small amount of resources, so we can still only run relatively small tasks this way.

-profile cluster,conda

nextflow run main.nf -profile cluster,conda

This set of profile options will run Nextflow on the SCC compute nodes and also instruct Nextflow to build or activate, for each task, any conda environments specified in the pipeline modules. Importantly, this allows the pipeline to submit multiple tasks to the SCC compute nodes and run them in parallel. The number of jobs dispatched is controlled by the executor settings in the Nextflow config file: under the $sge entry, queueSize is set to 8 by default. This means that Nextflow will submit up to 8 jobs to the SCC compute nodes at the same time. You can adjust this number as needed, but remember that the SCC is a shared resource and that, depending on your pipeline, there are only so many jobs that can run in parallel.
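
To check which queueSize your pipeline is currently using, a quick look at the config file is enough. The helper below is a minimal sketch: the name show_queue_size is hypothetical, and it assumes the config file is called nextflow.config and that the setting is written as queueSize = N on its own line.

```shell
# show_queue_size [FILE]: print the value of each `queueSize = N` assignment
# in a Nextflow config file (defaults to nextflow.config in the current
# directory). Assumes the assignment sits on its own line.
show_queue_size() {
  sed -n -E 's/^[[:space:]]*queueSize[[:space:]]*=[[:space:]]*([0-9]+).*/\1/p' "${1:-nextflow.config}"
}
```

Running show_queue_size from your project directory prints the configured value, which you can then edit in the config file if you need fewer or more jobs in flight.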

-profile cluster,singularity

nextflow run main.nf -profile cluster,singularity

This set of profile options will run Nextflow on the SCC compute nodes and instruct Nextflow to pull, for each task, the Singularity container specified in the pipeline modules. After project 1, this is the profile you will use going forward, because it both submits jobs to the cluster with appropriate resource requests and runs each task in its specified container.

-stub-run

nextflow run main.nf <options> -stub-run

You may add the -stub-run flag to any of the above profile options to run the pipeline in stub mode. In stub mode, Nextflow executes the stub block of each process instead of its real command, so you can see what the pipeline will do without actually running it. This is ideal for troubleshooting the channel logic of your pipeline without waiting for the jobs to run on the SCC. You can also use -stub-run before you have written the actual command for each task in your pipeline.

The stub blocks in your processes have been pre-configured to mimic the output that each task should produce, so that you can develop the logic in your main.nf.
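
One way to put this into practice is to run a stub pass first and launch the real run only if the stub pass succeeds. The wrapper below is a hedged sketch rather than a required part of the workflow: the function name stub_then_run is hypothetical, and the default profile of local,conda is an assumption chosen for troubleshooting.

```shell
# stub_then_run [PROFILE]: run the pipeline in stub mode, and start the real
# run only if the stub pass exits successfully.
# PROFILE defaults to local,conda (an assumption for this sketch).
stub_then_run() {
  profile="${1:-local,conda}"
  nextflow run main.nf -profile "$profile" -stub-run &&
    nextflow run main.nf -profile "$profile"
}
```

For example, stub_then_run cluster,singularity would first check the channel logic with stubs and then submit the real jobs to the cluster.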