After running Nextflow pipelines in production across public health labs, cloud environments, and biotech, I've accumulated a set of patterns that consistently make pipelines easier to maintain and extend. This is not a Nextflow intro; if you're here, you already know what a process is.
Module composition over monolithic pipelines
DSL2's import system is its biggest quality-of-life upgrade. Rather than writing one enormous pipeline file, each tool lives in its own module:
// modules/local/bwa_mem.nf
process BWA_MEM {
    tag "$meta.id"
    // NOTE: the script also calls samtools, so in practice the container must provide
    // both tools (e.g. a mulled biocontainers image); the single-tool bwa image is
    // shown here for brevity.
    container 'quay.io/biocontainers/bwa:0.7.17--h5bf99c6_8'

    input:
    tuple val(meta), path(reads)
    path index

    output:
    tuple val(meta), path('*.bam'), emit: bam

    script:
    """
    bwa mem -t ${task.cpus} ${index}/genome.fa ${reads} \
        | samtools sort -o ${meta.id}.bam
    """
}
Then in your workflow:
include { BWA_MEM } from './modules/local/bwa_mem'
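A minimal entry workflow that calls the module might look like the sketch below; the hardcoded read pair and params.index_dir are placeholders for illustration (in a real pipeline the reads channel comes from a samplesheet, covered later):

// main.nf — illustrative wiring only; sample paths and params.index_dir are placeholders
workflow {
    reads_ch = Channel.of(
        [ [ id: 'sample1' ], [ file('sample1_R1.fastq.gz'), file('sample1_R2.fastq.gz') ] ]
    )
    BWA_MEM(reads_ch, file(params.index_dir))
}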
The tag directive is underrated — it makes the Nextflow execution log readable when you're processing 200 samples in parallel.
Subworkflows for reusable pipeline segments
If BWA_MEM → samtools flagstat → samtools index appears in three of your pipelines, it belongs in a subworkflow:
// subworkflows/local/align_and_qc.nf
// (module paths assume the modules/local/ layout shown above)
include { BWA_MEM           } from '../../modules/local/bwa_mem'
include { SAMTOOLS_FLAGSTAT } from '../../modules/local/samtools_flagstat'
include { SAMTOOLS_INDEX    } from '../../modules/local/samtools_index'

workflow ALIGN_AND_QC {
    take:
    reads   // channel: [ meta, [ fastq_1, fastq_2 ] ]
    index   // path

    main:
    BWA_MEM(reads, index)
    SAMTOOLS_FLAGSTAT(BWA_MEM.out.bam)
    SAMTOOLS_INDEX(BWA_MEM.out.bam)

    emit:
    bam   = BWA_MEM.out.bam
    stats = SAMTOOLS_FLAGSTAT.out.stats
}
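Calling it from a top-level workflow then stays short and readable. A sketch, where the include path follows the file layout above and params.reads / params.index_dir are illustrative:

// main.nf — reusing the subworkflow; parameter names are placeholders
include { ALIGN_AND_QC } from './subworkflows/local/align_and_qc'

workflow {
    reads_ch = Channel.fromFilePairs(params.reads)   // e.g. 'data/*_R{1,2}.fastq.gz'
        .map { id, fastqs -> [ [ id: id ], fastqs ] }
    ALIGN_AND_QC(reads_ch, file(params.index_dir))
}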
Config layering for multi-environment runs
One of the most common sources of pain is hardcoded paths and resource specifications. A layered config approach fixes this:
nextflow.config ← defaults, container registries, process labels
conf/base.config ← resource labels (low/medium/high)
conf/hpc.config ← HPC executor + queue names
conf/cloud.config ← AWS Batch / Google Life Sciences settings
conf/test.config ← small test datasets for CI
Then run with -profile cloud or -profile hpc,test. The includeConfig directive in nextflow.config handles the rest.
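A sketch of how the layers plug together; the profile and label names match the list above, while the specific CPU and memory figures are illustrative:

// nextflow.config — wires the layers together
includeConfig 'conf/base.config'

profiles {
    hpc   { includeConfig 'conf/hpc.config' }
    cloud { includeConfig 'conf/cloud.config' }
    test  { includeConfig 'conf/test.config' }
}

// conf/base.config — resource labels that processes opt into via `label 'medium'`
process {
    withLabel: 'low' {
        cpus   = 2
        memory = 4.GB
    }
    withLabel: 'medium' {
        cpus   = 8
        memory = 32.GB
    }
    withLabel: 'high' {
        cpus   = 16
        memory = 64.GB
    }
}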
The samplesheet pattern
Avoid passing individual FASTQ paths as CLI arguments. Use a CSV samplesheet validated at pipeline entry:
#!/usr/bin/env python3
# bin/check_samplesheet.py — runs at pipeline start
import csv, sys, pathlib

def validate(path):
    with open(path) as f:
        reader = csv.DictReader(f)
        for row in reader:
            # every row needs a sample ID and both read files
            for col in ('sample', 'fastq_1', 'fastq_2'):
                if not row.get(col):
                    sys.exit(f"Missing or empty column: {col}")
            for fq in (row['fastq_1'], row['fastq_2']):
                if not pathlib.Path(fq).exists():
                    sys.exit(f"File not found: {fq}")

if __name__ == '__main__':
    validate(sys.argv[1])
This catches problems before the pipeline starts rather than three hours in when sample 47 of 200 fails.
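Once it passes validation, the samplesheet becomes the single entry point for the rest of the run. A minimal sketch of turning it into the [ meta, reads ] channel the modules above expect, assuming params.input points at the CSV:

// main.nf — illustrative; assumes params.input is the validated samplesheet
Channel
    .fromPath(params.input)
    .splitCsv(header: true)
    .map { row -> [ [ id: row.sample ], [ file(row.fastq_1), file(row.fastq_2) ] ] }
    .set { reads_ch }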
Closing thoughts
DSL2 encourages you to think in composable units. The best pipelines I've worked on read like a high-level description of the analysis, with the complexity hidden behind well-named subworkflows. The worst ones are a single 400-line main.nf that no one dares touch.
Invest time in the module layer. It pays dividends every time you build the next pipeline.