Surveillance · NGS · Infrastructure · Python

Genomic Surveillance at Scale: Lessons from the COVID-19 Response

What building global pathogen surveillance infrastructure during SARS-CoV-2 taught me about reproducible pipelines, data federation, and scientific communication under pressure.

10 min read

In 2020, the Centre for Genomic Pathogen Surveillance found itself at the operational centre of a global sequencing effort. Labs in India, Nigeria, Colombia, the Philippines, and across Europe were generating SARS-CoV-2 whole-genome sequences, and the challenge shifted rapidly from "can we sequence it?" to "can we make sense of it, fast, across every lab that's contributing?"

That experience shaped how I think about bioinformatics infrastructure more than anything else in my career.

Reproducibility is not optional

When you're operating across a dozen labs in different countries, each with its own IT setup, Ubuntu version, and institutional firewall, the pipeline has to work identically everywhere. That means containers.

Nextflow + Docker (or Singularity, where Docker wasn't available on HPC clusters) became the default. Every process was pinned to a container image by a specific digest, not just a tag. A tag can change silently; a digest is immutable.
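To make that pinning discipline concrete, here's a small Python sketch (illustrative, not the tooling we actually ran) that resolves a mutable tag to its immutable repo digest via the Docker CLI, so the digest rather than the tag goes into the pipeline config:

```python
import subprocess

def resolve_digest(image_ref: str) -> str:
    """Resolve a mutable tag to its immutable repo digest using the
    local Docker daemon. Assumes the image has already been pulled."""
    out = subprocess.run(
        ["docker", "inspect", "--format", "{{index .RepoDigests 0}}", image_ref],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()  # e.g. "ubuntu@sha256:..."

if __name__ == "__main__":
    # Example image reference, for illustration only
    print(resolve_digest("ubuntu:20.04"))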

The practical consequence: a result from Lagos and a result from Bogotá, run six months apart, are directly comparable because the software stack was identical.

The data federation problem

Sequencing data is heavy and politically sensitive. Governments and health authorities are understandably reluctant to send raw patient-linked genomic data offshore. The model that worked was federated: sequence locally, run a standardised consensus pipeline locally, share the consensus FASTA and curated metadata.
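A toy sketch of what a federated bundle looks like from the lab side: consensus FASTAs plus a minimal, de-identified metadata table. The field names here are illustrative, not the actual schema any programme used.

```python
import csv
from pathlib import Path

# Illustrative, de-identified metadata fields; not an official schema
FIELDS = ["sample_id", "collection_date", "country", "region", "lineage", "pct_coverage"]

def write_submission(consensus_dir: Path, records: list[dict], out_dir: Path) -> None:
    """Bundle consensus FASTAs with a curated metadata CSV.
    Raw reads and patient-linked identifiers never leave the lab."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "metadata.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for rec in records:
            writer.writerow({k: rec[k] for k in FIELDS})
            # Copy only the consensus sequence, never the raw reads
            fasta = consensus_dir / f"{rec['sample_id']}.consensus.fa"
            (out_dir / fasta.name).write_bytes(fasta.read_bytes())
```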

Tools like Pathogenwatch and Microreact were built specifically to accept this kind of federated input and generate phylogenetic trees and maps without requiring raw reads. The design philosophy — lightweight, browser-based, shareable — was exactly right for an emergency context.

Latency kills response

In an outbreak, a variant that's 10% of sequences today might be 60% next week. The surveillance pipeline had to be fast enough that results were informative before the epidemiological picture changed.

That forced some uncomfortable trade-offs:

  • Assembly vs. mapping: mapping to the reference (Wuhan-Hu-1) was faster and good enough for variant calling; full de novo assembly added latency we couldn't afford
  • QC thresholds: strict QC that rejected marginal-quality samples was worse than permissive QC with uncertainty flags. A genome with 10% ambiguous bases is still informative about lineage if the informative positions are covered (a sketch of this flagging approach follows the list)
  • Automation over flexibility: human review of every sample is fine for 50 samples. At 50,000, you need to trust your automated QC and reserve human review for flagged outliers
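A minimal sketch of that "flag, don't reject" idea, assuming consensus genomes as plain sequences; the threshold is illustrative, not the operational value we used:

```python
def qc_flag(consensus_seq: str, max_ambiguous_frac: float = 0.10) -> dict:
    """Permissive QC: compute the fraction of ambiguous bases (N) and
    attach an uncertainty flag instead of rejecting the genome outright."""
    seq = consensus_seq.upper()
    n_frac = seq.count("N") / max(len(seq), 1)
    return {
        "ambiguous_fraction": round(n_frac, 4),
        "flag": "review" if n_frac > max_ambiguous_frac else "pass",
    }

# Flagged genomes go to human review; everything else flows straight
# through to lineage assignment and the downstream visualisations.
print(qc_flag("ACGT" * 7000 + "N" * 4000))  # 12.5% ambiguous -> "review"
```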

Communication as infrastructure

One thing I underestimated before COVID: the visualisation layer is as important as the pipeline. Decision-makers — public health officials, ministers, WHO advisors — can't read a VCF. They need maps, timelines, variant frequency curves.
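In practice that means turning per-sample lineage calls into summaries people can read at a glance. A rough pandas sketch, assuming a metadata table with collection_date and lineage columns (column names are illustrative):

```python
import pandas as pd

def weekly_lineage_frequencies(metadata_csv: str) -> pd.DataFrame:
    """Turn per-sample lineage calls into weekly frequency curves,
    the sort of summary a decision-maker can actually read."""
    df = pd.read_csv(metadata_csv, parse_dates=["collection_date"])
    df["week"] = df["collection_date"].dt.to_period("W")
    counts = df.groupby(["week", "lineage"]).size().unstack(fill_value=0)
    return counts.div(counts.sum(axis=1), axis=0)  # each row sums to 1.0
```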

Investing engineering time in the Microreact and Pathogenwatch interfaces wasn't a nice-to-have. It was what made the genomic data actionable for the people who needed to act on it.

What carries over to gene therapy

The parallels with INDUCE-seq data at scale are closer than they first appear:

  • Federated data (different labs, different cell types, different editing conditions) needs standardised pipelines
  • Results need to be interpretable by people who didn't build the pipeline
  • Speed matters: a safety report that takes three weeks to produce is less useful than one that takes three days

The infrastructure mindset transfers even when the biology is completely different.