Tags: Nextflow · NGS · Infrastructure · Python

Nextflow DSL2 Patterns I Actually Use

Practical patterns for scalable NGS pipelines in Nextflow DSL2 — module composition, subworkflows, and config layering for multi-environment deployments.

9 min read

After running Nextflow pipelines in production across public health labs, cloud environments, and biotech — I've accumulated a set of patterns that consistently make pipelines easier to maintain and extend. This is not a Nextflow intro; if you're here you already know what a process is.

Module composition over monolithic pipelines

DSL2's import system is its biggest quality-of-life upgrade. Rather than writing one enormous pipeline file, each tool lives in its own module:

// modules/local/bwa_mem.nf
process BWA_MEM {
  tag "$meta.id"
  // NB: the pipe to samtools below needs an image that bundles both tools
  // (e.g. a BioContainers mulled image); a bwa-only image is shown for brevity
  container 'quay.io/biocontainers/bwa:0.7.17--h5bf99c6_8'

  input:
  tuple val(meta), path(reads)
  path index

  output:
  tuple val(meta), path('*.bam'), emit: bam

  script:
  """
  bwa mem -t ${task.cpus} ${index}/genome.fa ${reads} \
    | samtools sort -o ${meta.id}.bam
  """
}

Then in your workflow:

include { BWA_MEM } from './modules/local/bwa_mem'

The tag directive is underrated — it makes the Nextflow execution log readable when you're processing 200 samples in parallel.
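Wiring the module into an entry workflow is then one `include` and one call. A minimal sketch — the `params` names, the glob pattern, and the meta-map shape are illustrative, not fixed conventions:

```groovy
// main.nf — sketch; params.reads and params.bwa_index are illustrative names
include { BWA_MEM } from './modules/local/bwa_mem'

workflow {
  reads_ch = Channel
    .fromFilePairs(params.reads)                  // e.g. 'data/*_{1,2}.fastq.gz'
    .map { id, fastqs -> [ [id: id], fastqs ] }   // build the meta map the module expects
  BWA_MEM(reads_ch, file(params.bwa_index))
}
```

The `[ meta, files ]` tuple shape is what lets `tag "$meta.id"` and the downstream emits stay sample-aware.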

Subworkflows for reusable pipeline segments

If BWA_MEM → samtools flagstat → samtools index appears in three of your pipelines, it belongs in a subworkflow:

// subworkflows/local/align_and_qc.nf
include { BWA_MEM           } from '../../modules/local/bwa_mem'
include { SAMTOOLS_FLAGSTAT } from '../../modules/local/samtools_flagstat'
include { SAMTOOLS_INDEX    } from '../../modules/local/samtools_index'

workflow ALIGN_AND_QC {
  take:
  reads   // channel: [ meta, [ fastq_1, fastq_2 ] ]
  index   // path

  main:
  BWA_MEM(reads, index)
  SAMTOOLS_FLAGSTAT(BWA_MEM.out.bam)
  SAMTOOLS_INDEX(BWA_MEM.out.bam)

  emit:
  bam   = BWA_MEM.out.bam
  bai   = SAMTOOLS_INDEX.out.bai   // assumes the index module names its emit 'bai'
  stats = SAMTOOLS_FLAGSTAT.out.stats
}
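Calling it from a pipeline then reads like the analysis itself. A sketch, assuming the same `reads_ch` tuple shape as before (channel and param names are illustrative):

```groovy
// main.nf — composing the subworkflow; paths and params are illustrative
include { ALIGN_AND_QC } from './subworkflows/local/align_and_qc'

workflow {
  reads_ch = Channel
    .fromFilePairs(params.reads)
    .map { id, fastqs -> [ [id: id], fastqs ] }
  ALIGN_AND_QC(reads_ch, file(params.bwa_index))
  ALIGN_AND_QC.out.stats.view()   // downstream steps consume the named emits
}
```

Downstream consumers only ever see the named emits, so the subworkflow's internals can change without touching callers.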

Config layering for multi-environment runs

One of the most common sources of pain is hardcoded paths and resource specifications. A layered config approach fixes this:

nextflow.config          ← defaults, container registries, process labels
conf/base.config         ← resource labels (low/medium/high)
conf/hpc.config          ← HPC executor + queue names
conf/cloud.config        ← AWS Batch / Google Life Sciences settings
conf/test.config         ← small test datasets for CI

Then run with -profile cloud or -profile hpc,test. The includeConfig directive in nextflow.config handles the rest.
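A minimal wiring for this layout might look like the following. The profile names match the files above; the label names and resource values are placeholders you'd tune per site:

```groovy
// nextflow.config — sketch; resource values are placeholders
includeConfig 'conf/base.config'

profiles {
  hpc   { includeConfig 'conf/hpc.config' }
  cloud { includeConfig 'conf/cloud.config' }
  test  { includeConfig 'conf/test.config' }
}

// conf/base.config — label-based resources that processes opt into
process {
  withLabel: 'low'    { cpus = 2; memory = '4 GB' }
  withLabel: 'medium' { cpus = 8; memory = '32 GB' }
}
```

Because profiles compose, `-profile hpc,test` layers the HPC executor settings and the tiny CI datasets in one run.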

The samplesheet pattern

Avoid passing individual FASTQ paths as CLI arguments. Use a CSV samplesheet validated at pipeline entry:

# bin/check_samplesheet.py  — runs at pipeline start
import csv
import pathlib
import sys

REQUIRED = ('sample', 'fastq_1', 'fastq_2')

def validate(path):
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            # an absent column and an empty cell both fail this check
            for col in REQUIRED:
                if not row.get(col):
                    sys.exit(f"Missing value for column '{col}' in row: {row}")
            for fq in (row['fastq_1'], row['fastq_2']):
                if not pathlib.Path(fq).exists():
                    sys.exit(f"File not found: {fq}")

if __name__ == '__main__':
    validate(sys.argv[1])

This catches problems before the pipeline starts rather than three hours in when sample 47 of 200 fails.
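On the Nextflow side, the validated samplesheet maps straight onto a channel. A sketch using `splitCsv` — column names follow the samplesheet above, `params.input` is an illustrative name:

```groovy
// samplesheet rows → [ meta, [ fastq_1, fastq_2 ] ] tuples
Channel
  .fromPath(params.input)
  .splitCsv(header: true)
  .map { row -> [ [id: row.sample], [ file(row.fastq_1), file(row.fastq_2) ] ] }
  .set { reads_ch }
```

The resulting tuples have the same shape the modules and subworkflows above expect, so the samplesheet is the single entry point for sample metadata.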

Closing thoughts

DSL2 encourages you to think in composable units. The best pipelines I've worked on read like a high-level description of the analysis — with the complexity hidden behind well-named subworkflows. The worst ones are a single 400-line main.nf that no one dares touch.

Invest time in the module layer. It pays dividends every time you build the next pipeline.