Lab 04 — Modules and Multi-Sample Pipelines
Key concepts and tools
- include { PROCESS } from './modules/...'
- modules/ directory structure, one process per module file
- tuple val(meta), path(file) input/output pattern
- samplesheet.csv, splitCsv(header: true), map
- ncbi-datasets-cli genome download
- Prokka genome annotation, GFF output format
- samtools faidx: FASTA indexing and coordinate extraction
- join channel operator: merging channels by key
- -stub-run: testing workflow logic without running real tools
- -with-report: HTML execution report
- resume = true in nextflow.config
- Process label directives and resource allocation
- Jupyter notebooks, conda environments as kernels, circos plots
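The samplesheet pattern listed above (read a CSV with a header row, then map each row to a tuple) can be pictured with a small Python analogy. This is only a sketch of the data flow, not Nextflow code. The column names `name` and `GCF` are assumed to match the process inputs in this lab, and the accessions are made-up placeholders.

```python
import csv
import io

# Hypothetical samplesheet contents; column names and accessions are placeholders.
SAMPLESHEET = """name,GCF
sampleA,GCF_000000001.1
sampleB,GCF_000000002.1
"""


def read_samplesheet(handle):
    """Yield one (name, GCF) tuple per row.

    csv.DictReader plays the role of splitCsv(header: true): each row becomes
    a mapping keyed by column name. Building the tuple plays the role of map.
    """
    for row in csv.DictReader(handle):
        yield (row["name"].strip(), row["GCF"].strip())


samples = list(read_samplesheet(io.StringIO(SAMPLESHEET)))
print(samples)
```

Each tuple has the shape a process declaring tuple val(name), val(GCF) expects, so every row of the samplesheet becomes one independent task.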
This lab introduces the module system and multi-sample patterns that will be standard for the rest of the course. The pipeline downloads several bacterial genomes from NCBI, annotates them with Prokka, indexes each genome with samtools faidx, and extracts a specific gene’s sequence by its GFF coordinates. The pipeline is scaffolded but incomplete — you fill in the inputs, outputs, and commands for each process, then connect them in main.nf.
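The last step described above, pulling a gene out of a genome by its GFF coordinates, comes down to turning one GFF feature line into a region string of the form seqid:start-end that samtools faidx accepts. The following Python sketch illustrates that idea only. The function name, the example GFF lines, and the choice to match on the `gene` attribute of CDS features are assumptions for illustration; the script provided in bin/ may work differently.

```python
def gff_to_region(gff_text, gene_name):
    """Return 'seqid:start-end' for the first CDS whose gene attribute matches.

    GFF3 coordinates and samtools faidx regions are both 1-based and
    inclusive, so start and end are copied through without adjustment.
    """
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # Prokka's GFF appends the sequences after this marker
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        seqid, _source, ftype, start, end, _score, _strand, _phase, attrs = fields
        attributes = dict(
            pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
        )
        if ftype == "CDS" and attributes.get("gene") == gene_name:
            return f"{seqid}:{start}-{end}"
    return None  # gene not annotated in this genome


# Hypothetical two-gene annotation, for illustration only.
EXAMPLE_GFF = "\n".join([
    "##gff-version 3",
    "contig_1\tProkka\tCDS\t100\t1500\t.\t+\t0\tID=X_00001;gene=dnaA;product=placeholder",
    "contig_1\tProkka\tCDS\t1600\t2700\t.\t-\t0\tID=X_00002;gene=dnaN;product=placeholder",
    "##FASTA",
    ">contig_1",
    "ACGT",
])

region = gff_to_region(EXAMPLE_GFF, "dnaN")
print(region)  # contig_1:1600-2700
```

The sketch ignores strand, so for a gene on the minus strand the extracted sequence would still be the forward-strand sequence of that interval.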
Process specifications
Each module is outlined below. Use -stub-run to confirm the workflow logic before running real tools.
ncbi_datasets_cli — Input: tuple val(name), val(GCF). Output: tuple val(name), path("*.fna").
prokka — Input: tuple val(name), path(fna). Output: tuple val(name), path("*.gff"), emit: gff. Use --outdir and --prefix flags set to the sample name.
extract_region — Input: tuple val(name), path(gff). Output: tuple val(name), path("region_of_interest.txt"). Uses the provided script in bin/.
samtools_faidx — Input: tuple val(name), path(fna). Output: tuple val(name), path(fna), path("*.fai").
samtools_faidx_subset — Input: tuple val(name), path(fna), path(fai), path(region). Output: tuple val(name), path("*_region.subset.fna"). You will need a channel operator to merge the samtools_faidx and extract_region outputs by sample name before passing them here.
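The merge that samtools_faidx_subset needs can be pictured as a keyed join: items from two channels are paired when their first element (the sample name) matches, and the remaining elements are concatenated. The Python sketch below mimics that default behaviour as an analogy; it is not Nextflow code, and the sample and file names are hypothetical.

```python
def join_by_key(left, right):
    """Merge two lists of tuples on their first element.

    Mirrors the default behaviour of a keyed channel join: the key appears
    once, followed by the remaining elements of the left item and then of
    the right item. Items without a partner are dropped.
    """
    right_by_key = {item[0]: item[1:] for item in right}
    merged = []
    for item in left:
        key = item[0]
        if key in right_by_key:
            merged.append(item + right_by_key[key])
    return merged


# Hypothetical outputs of samtools_faidx and extract_region.
faidx_out = [
    ("sampleA", "sampleA.fna", "sampleA.fna.fai"),
    ("sampleB", "sampleB.fna", "sampleB.fna.fai"),
]
# The two channels need not emit samples in the same order.
region_out = [
    ("sampleB", "region_of_interest.txt"),
    ("sampleA", "region_of_interest.txt"),
]

merged = join_by_key(faidx_out, region_out)
for item in merged:
    print(item)
```

Each merged item has the shape (name, fna, fai, region), which matches the input declared for samtools_faidx_subset. Matching on the sample name matters because tasks finish in an unpredictable order, so pairing the two channels by position would risk mixing one sample's index with another sample's region file.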
Running
# Test workflow logic (no real tools run)
nextflow run main.nf -profile local,conda -stub-run
# Run locally once logic is confirmed
nextflow run main.nf -profile local,conda
# Submit to cluster with report
nextflow run main.nf -profile cluster,conda -with-report
New features introduced
- -stub-run: executes the stub block of each process instead of script, creating empty placeholder files instantly. Use this to validate channel wiring before committing to a full run.
- -with-report: generates an HTML summary of resource usage per task after the run.
- resume = true: caches completed tasks so a failed run can restart from where it stopped.
- Labels: each process has a label that maps to CPU/memory settings in nextflow.config, replacing per-process qsub flags.
Jupyter notebook
Create a conda environment from the provided notebook_env.yml, select it as the notebook kernel, and use BioPython to visualize the gene annotations for all downloaded genomes as a circos plot.