This repository contains a Nextflow DSL2 pipeline for tumor-normal somatic variant calling and consensus reporting.
The pipeline runs three callers in parallel:
- Mutect2 (scatter by chromosome, then gather)
- MuSE2
- Strelka2
Then it performs a consensus workflow:
- Normalize/split/standardize/label per caller
- Build K-of-N consensus
- Merge event-level variants
- Generate QC metrics and plots
- Render a final Quarto HTML report
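The K-of-N step can be read as a vote across callers: a variant is kept when at least K of the N callers report it. Below is a minimal sketch of that rule, assuming each caller's normalized calls have already been reduced to one CHROM:POS:REF:ALT key per line. The helper name `k_of_n_keys` and the key-file format are illustrative and are not taken from the pipeline's scripts.

```shell
# Hypothetical helper: print variant keys reported by at least K callers.
# Usage: k_of_n_keys K caller1.keys caller2.keys [...]
# Each input file holds one CHROM:POS:REF:ALT key per line (assumed format).
k_of_n_keys() {
  local k="$1"; shift
  local f
  for f in "$@"; do
    # De-duplicate within a caller so each caller votes at most once per key.
    sort -u "$f"
  done | sort | uniq -c | awk -v k="$k" '$1 >= k { print $2 }'
}
```

For example, `k_of_n_keys 2 mutect2.keys muse2.keys strelka2.keys` (hypothetical file names) would print the keys seen by at least two of the three callers.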
Main entrypoint:
main.nf
Configuration:
conf/nextflow.config
Top-level flow:
- Parse pair manifest (pair_id,tumor_id,tumor_bam,normal_id,normal_bam,gender)
- MUTATION_CALLING subworkflow
- CONSENSUS_CALLING subworkflow
Subworkflows:
- subworkflows/mutation_calling.nf
- subworkflows/consensus_calling.nf
Process definitions:
- workflows/mutation_processes.nf
- workflows/consensus_processes.nf
Requirements:
- Nextflow with DSL2 support
- LSF cluster (default executor in config)
- Conda or Mamba (conda is enabled in config)
- Reference bundle expected by bin/lib/resolve_refs.sh
Environment files:
- envs/mutect2.yml
- envs/muse2.yml
- envs/strelka2.yml
- envs/plotting.yml
Installation:
- Clone the repository and enter it.
git clone <repo-url>
cd pipelines
- Install Nextflow (choose one method appropriate for your environment).
# Example: with Conda
conda create -n nextflow -c conda-forge -c bioconda nextflow
conda activate nextflow
- Create the pipeline Conda environments in a dedicated env root directory.
# Choose an env root directory (example)
mkdir -p /path/to/conda_envs
conda env create --prefix /path/to/conda_envs/wgs --file envs/mutect2.yml
conda env create --prefix /path/to/conda_envs/muse2 --file envs/muse2.yml
conda env create --prefix /path/to/conda_envs/strelka2 --file envs/strelka2.yml
conda env create --prefix /path/to/conda_envs/wgs-plotting --file envs/plotting.yml
- Add the Conda environment directory to Nextflow so the pipeline can find these envs.
Option A (recommended): set it in conf/nextflow.config.
params {
    condaEnvDir = "/path/to/conda_envs"
}
Option B: pass it at runtime.
nextflow run main.nf -c conf/nextflow.config --condaEnvDir /path/to/conda_envs ...
Notes:
- params.condaEnvDir is optional, but setting it explicitly avoids ambiguity.
- If not set, the pipeline probes: ${params.pipeline}/conda_envs, $HOME/.conda/envs, and $HOME/conda-envs.
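The fallback behaviour described in the notes can be pictured as a first-match search. The sketch below assumes the candidates are tried in the order listed and that the first existing directory wins; both points are assumptions, and `resolve_conda_env_dir` is a hypothetical name rather than a helper shipped in bin/lib/.

```shell
# Hypothetical helper mirroring the documented probe locations.
# Usage: resolve_conda_env_dir PIPELINE_DIR [EXPLICIT_DIR]
resolve_conda_env_dir() {
  local pipeline_dir="$1" explicit="${2:-}"
  # An explicitly configured directory always takes precedence.
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
    return 0
  fi
  local candidate
  # Assumed order: pipeline-local envs first, then the two per-user locations.
  for candidate in "$pipeline_dir/conda_envs" "$HOME/.conda/envs" "$HOME/conda-envs"; do
    if [ -d "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no conda env directory found" >&2
  return 1
}
```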
Provide --pairs_tsv as TSV (or CSV). Required columns:
- pair_id
- tumor_id
- tumor_bam
- normal_id
- normal_bam
- gender
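A quick header check before launching can catch a missing column early. This is a sketch only: `check_manifest_header` is a hypothetical function, it inspects just the first line, and it assumes column names contain no spaces.

```shell
# Hypothetical pre-flight check: verify a pair manifest header has every required column.
# Usage: check_manifest_header pairs.tsv
check_manifest_header() {
  local manifest="$1"
  local required="pair_id tumor_id tumor_bam normal_id normal_bam gender"
  local header col missing=0
  # Read the first line, drop any carriage return, and turn tabs or commas into spaces.
  header="$(head -n 1 "$manifest" | tr -d '\r' | tr '\t,' '  ')"
  for col in $required; do
    case " $header " in
      *" $col "*) ;;
      *) echo "missing column: $col" >&2; missing=1 ;;
    esac
  done
  # Exit status 0 when all columns are present, 1 otherwise.
  return "$missing"
}
```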
Notes:
- tumor_bai/normal_bai columns are no longer required.
- BAI paths are inferred from BAM paths (<bam>.bai or <bam-without-.bam>.bai).
- Legacy --samplesheet is still accepted as an alias to --pairs_tsv.
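The BAI inference rule can be sketched as a two-candidate lookup. The order of preference (`<bam>.bai` first) is an assumption, and `infer_bai` is a hypothetical name; the pipeline's own logic may differ.

```shell
# Hypothetical helper: print the index path for a BAM, or fail if neither candidate exists.
# Usage: infer_bai /path/sample.bam
infer_bai() {
  local bam="$1"
  local candidate
  # Try <bam>.bai first, then <bam-without-.bam>.bai (assumed order).
  for candidate in "${bam}.bai" "${bam%.bam}.bai"; do
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no BAI found for ${bam}" >&2
  return 1
}
```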
Example TSV:
pair_id tumor_id tumor_bam normal_id normal_bam gender
MK545-A_pair MK545-A /path/MK545-A.bam MK545-Control /path/MK545-Control.bam male
Run locally:
nextflow run main.nf \
-profile local \
-c conf/nextflow.config \
--pairs_tsv tests/samplesheet.tsv \
--outdir tests/nf_out
Run on LSF:
nextflow run main.nf \
-profile lsf \
-c conf/nextflow.config \
--pairs_tsv tests/samplesheet.tsv \
--outdir tests/nf_out
Resume a run:
nextflow run main.nf -c conf/nextflow.config -resume ...
Notes:
- Resume requires the same launch directory and work directory, and task signatures (inputs, script, and relevant settings) must be unchanged.
- Existing trace/report/timeline/dag files are configured to overwrite in conf/nextflow.config.
From conf/nextflow.config:
- params.pairs_tsv: pair manifest path
- params.samplesheet: legacy alias for pairs_tsv
- params.outdir: output directory root
- params.ref: reference key (default hg38)
- params.refdir: reference base directory
- params.seq: sequencing type (default WGS)
- params.pipeline: pipeline root directory (default projectDir)
- params.binDir: tool scripts path (${params.pipeline}/modules/somatic)
- params.resolveRefs: reference resolver script (${params.pipeline}/bin/lib/resolve_refs.sh)
- params.libDir: shared library scripts (${params.pipeline}/bin/lib)
- params.envDir: env YAML directory (${params.pipeline}/envs)
- params.threads: default thread count
- params.chroms: scatter chromosomes (default chr1-22, chrX)
- params.k_range: consensus K range
- params.k_pick: selected K for final consensus
- params.condaEnvDir: conda envs location
Published outputs are organized by sample under --outdir:
- <outdir>/<sample>/mutect2/scatter/
- <outdir>/<sample>/mutect2/
- <outdir>/<sample>/muse2/
- <outdir>/<sample>/strelka2/
- <outdir>/<sample>/consensus/
- <outdir>/<sample>/final_report/
The final HTML report is rendered and published under:
<outdir>/<sample>/final_report/somatic_report.html
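Given that layout, the finished reports for a whole run can be collected with a single `find`. A small sketch, assuming reports sit exactly three levels below the output root as shown above; `list_final_reports` is a hypothetical convenience function.

```shell
# Hypothetical helper: list every published final report under an output directory.
# Usage: list_final_reports OUTDIR
list_final_reports() {
  local outdir="$1"
  # <outdir>/<sample>/final_report/somatic_report.html is three levels below OUTDIR.
  find "$outdir" -mindepth 3 -maxdepth 3 -type f \
    -path '*/final_report/somatic_report.html' | sort
}
```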
Pipeline run metadata is written to:
- <outdir>/pipeline_info/timeline.html
- <outdir>/pipeline_info/report.html
- <outdir>/pipeline_info/trace.tsv
- <outdir>/pipeline_info/dag.svg
Executor settings:
- Default executor: lsf
- Queue: cgel
- executor.perJobMemLimit = true is enabled (LSF memory handled per job)
- Caller processes (MUTECT2_SCATTER, MUTECT2_GATHER, MUSE2_CALL, STRELKA2_CALL) use errorStrategy = 'ignore' so one caller failure does not immediately terminate the other caller branches
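Because an ignored caller failure does not stop the run, it can be worth checking afterwards which callers actually published results for a sample. The sketch below assumes a caller with no files in its output directory (ignoring Mutect2 scatter shards) did not finish; `missing_callers` is a hypothetical helper, and the exact files each caller publishes are not specified here.

```shell
# Hypothetical check: print callers with no published files for a sample.
# Usage: missing_callers OUTDIR SAMPLE
missing_callers() {
  local outdir="$1" sample="$2"
  local caller
  for caller in mutect2 muse2 strelka2; do
    # A caller counts as missing when its directory is absent or holds no regular
    # files; scatter shards are skipped so they do not mask a failed gather.
    if [ -z "$(find "$outdir/$sample/$caller" -type f ! -path '*/scatter/*' 2>/dev/null | head -n 1)" ]; then
      printf '%s\n' "$caller"
    fi
  done
}
```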
Troubleshooting:
- Missing output errors usually indicate a mismatch between the output: pattern and actual script filenames.
- If resume appears to rerun everything, ensure:
  - Same working directory context
  - Same config/parameters where possible
  - Existing work/ and .nextflow/ cache accessible
- Do not delete work/ if you need resume/debug support.
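The resume prerequisites can be checked before relaunching. A minimal sketch, assuming the default Nextflow locations (work/ and .nextflow/ in the launch directory); if a custom work directory is configured the paths will differ. `can_resume` is a hypothetical name.

```shell
# Hypothetical pre-flight check before `nextflow run ... -resume`.
# Usage: can_resume [LAUNCH_DIR]
can_resume() {
  local launch_dir="${1:-.}"
  local status=0
  # Both the task work directory and the cache directory must still exist.
  [ -d "$launch_dir/work" ] || { echo "missing $launch_dir/work" >&2; status=1; }
  [ -d "$launch_dir/.nextflow" ] || { echo "missing $launch_dir/.nextflow" >&2; status=1; }
  return "$status"
}
```

A typical use would be `can_resume . && nextflow run main.nf -c conf/nextflow.config -resume ...`.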
Repository layout:
- main.nf: entry workflow
- conf/nextflow.config: parameters, executors, defaults
- workflows/: process definitions
- subworkflows/: orchestration of process groups
- modules/somatic/: caller and utility shell scripts
- bin/lib/: shared helper scripts (including reference resolver and conda helpers)
- envs/: conda environment specs
- tests/: test data, run scripts, and local run artifacts