feat(qc): recombination: E: Multi-Ancestor by ivan-aksamentov · Pull Request #1740 · nextstrain/nextclade

ivan-aksamentov · 2026-01-13T13:11:33Z

Recombination Detection: Strategy E - Multi-Ancestor

Scientific Motivation

Recombination creates chimeric genomes where different genomic regions inherit from different parental lineages. When a sequence is analyzed against multiple potential ancestors (clades), a non-recombinant will show consistent affinity to a single ancestor across all genome regions. A recombinant, however, will show distinct patterns: some regions match ancestor A better (fewer private mutations), while other regions match ancestor B better. This "ancestor switching" pattern across the genome is a hallmark of recombination that the Multi-Ancestor strategy exploits.

Mechanism

The algorithm works as follows:

1. Genome segmentation: Divide the genome into N equal segments (configurable via numSegments)
2. Per-segment ancestor matching: For each segment, count private mutations relative to each ancestor defined in the tree's ref_nodes.search configuration
3. Best ancestor selection: For each segment, identify which ancestor has the fewest private mutations in that region
4. Switch counting: Count transitions between adjacent segments where the best ancestor changes
5. Scoring: If switches >= minSwitches, score = (switches - minSwitches + 1) * weight
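
The steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Rust code in strategy_multi_ancestor(). The function names, the input shape (a mapping from ancestor name to the positions of private mutations relative to that ancestor), the tie handling (an uninformative segment keeps the previous segment's ancestor) and the score of 0 below the threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MultiAncestorConfig:
    # Field names mirror the multiAncestor keys in pathogen.json.
    enabled: bool = True
    weight: float = 50.0
    num_segments: int = 10
    min_switches: int = 2


def segment_index(pos: int, genome_len: int, num_segments: int) -> int:
    """Map a 0-based position to one of num_segments near-equal segments."""
    return min(pos * num_segments // genome_len, num_segments - 1)


def best_ancestor_per_segment(muts_by_ancestor, genome_len, num_segments):
    """For each segment, pick the ancestor with the fewest private mutations.

    Assumption: on a tie, keep the previous segment's ancestor if it is among
    the tied ones, so that uninformative segments do not add switches.
    """
    counts = {name: [0] * num_segments for name in muts_by_ancestor}
    for name, positions in muts_by_ancestor.items():
        for pos in positions:
            counts[name][segment_index(pos, genome_len, num_segments)] += 1

    best = []
    for seg in range(num_segments):
        fewest = min(c[seg] for c in counts.values())
        tied = [name for name, c in counts.items() if c[seg] == fewest]
        best.append(best[-1] if best and best[-1] in tied else tied[0])
    return best


def count_switches(best):
    """Count adjacent segment pairs whose best ancestor differs."""
    return sum(1 for a, b in zip(best, best[1:]) if a != b)


def multi_ancestor_score(muts_by_ancestor, genome_len, cfg):
    """Apply score = (switches - minSwitches + 1) * weight at or above the threshold."""
    if not cfg.enabled or len(muts_by_ancestor) < 2:
        return 0.0
    best = best_ancestor_per_segment(muts_by_ancestor, genome_len, cfg.num_segments)
    switches = count_switches(best)
    if switches < cfg.min_switches:
        return 0.0  # assumption: no contribution below the threshold
    return (switches - cfg.min_switches + 1) * cfg.weight


cfg = MultiAncestorConfig()

# Made-up query on a 1000 nt genome: looks like A, then B (300-699), then A again.
recombinant = {
    "A": [310, 350, 420, 480, 530, 560, 610, 680],
    "B": [20, 150, 250, 720, 850, 950],
}
best_segments = best_ancestor_per_segment(recombinant, 1000, cfg.num_segments)
recombinant_score = multi_ancestor_score(recombinant, 1000, cfg)

# Made-up non-recombinant: consistently closer to A along the whole genome.
clean = {
    "A": [100, 500],
    "B": [20, 150, 250, 310, 420, 530, 610, 720, 850, 950],
}
clean_score = multi_ancestor_score(clean, 1000, cfg)

print(best_segments, recombinant_score, clean_score)
```

In the toy recombinant the best ancestor goes A, B, A, which gives two switches and, with minSwitches = 2 and weight = 50, a score of 50. The non-recombinant never switches and scores 0.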

Configuration

Strategy E requires configuration in both pathogen.json (the QC rule) and tree.json (the ancestor definitions):

pathogen.json - QC rule configuration:

{
  "qc": {
    "recombinants": {
      "enabled": true,
      "scoreWeight": 100.0,
      "multiAncestor": {
        "enabled": true,
        "weight": 50.0,
        "numSegments": 10,
        "minSwitches": 2
      }
    }
  }
}
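
To make the effect of these values concrete, here is a small Python sketch that reads the block above and tabulates the strategy score for a few switch counts. The helper names and the validation checks are hypothetical. The sketch also assumes that the strategy only runs when both enabled flags are true and that it contributes 0 below minSwitches. How scoreWeight combines with the per-strategy weight is not covered here.

```python
import json

PATHOGEN_JSON = """
{
  "qc": {
    "recombinants": {
      "enabled": true,
      "scoreWeight": 100.0,
      "multiAncestor": {
        "enabled": true,
        "weight": 50.0,
        "numSegments": 10,
        "minSwitches": 2
      }
    }
  }
}
"""


def load_multi_ancestor(pathogen: dict) -> dict:
    """Extract the multiAncestor block and sanity-check it (hypothetical helper)."""
    rule = pathogen.get("qc", {}).get("recombinants", {})
    strategy = dict(rule.get("multiAncestor", {}))
    # Assumption: the strategy runs only if the rule and the strategy are both enabled.
    strategy["active"] = bool(rule.get("enabled")) and bool(strategy.get("enabled"))
    if strategy["active"]:
        if strategy["numSegments"] < 2:
            raise ValueError("numSegments must be at least 2 to observe a switch")
        if strategy["minSwitches"] < 1:
            raise ValueError("minSwitches must be at least 1")
    return strategy


def strategy_score(switches: int, cfg: dict) -> float:
    """score = (switches - minSwitches + 1) * weight once the threshold is met."""
    if not cfg["active"] or switches < cfg["minSwitches"]:
        return 0.0  # assumption: no contribution below the threshold
    return (switches - cfg["minSwitches"] + 1) * cfg["weight"]


cfg = load_multi_ancestor(json.loads(PATHOGEN_JSON))
score_table = {s: strategy_score(s, cfg) for s in range(5)}
max_switches = cfg["numSegments"] - 1  # N segments have N - 1 adjacent pairs
print(score_table, max_switches)
```

With these values the score is 0 for 0 or 1 switches, then 50, 100, 150 for 2, 3, 4 switches. Because N segments allow at most N - 1 switches, a minSwitches value above numSegments - 1 could never trigger.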

tree.json - Ancestor definitions via ref_nodes.search:

{
  "meta": {
    "extensions": {
      "nextclade": {
        "ref_nodes": {
          "search": [
            {
              "name": "CladeA",
              "displayName": "Clade A founder",
              "criteria": [{ "node": [{ "name": ["NODE_XXX"] }] }]
            },
            {
              "name": "CladeB",
              "displayName": "Clade B founder",
              "criteria": [{ "node": [{ "name": ["NODE_YYY"] }] }]
            }
          ]
        }
      }
    }
  }
}
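
As a quick way to check which ancestors a dataset defines, the following Python sketch walks the structure shown above and maps each search entry to the node names in its criteria. The helper names are hypothetical, and the sketch only handles the criteria shape used in this example. The real ref_nodes schema may allow other criteria that are not handled here.

```python
import json

TREE_JSON = """
{
  "meta": {
    "extensions": {
      "nextclade": {
        "ref_nodes": {
          "search": [
            {
              "name": "CladeA",
              "displayName": "Clade A founder",
              "criteria": [{ "node": [{ "name": ["NODE_XXX"] }] }]
            },
            {
              "name": "CladeB",
              "displayName": "Clade B founder",
              "criteria": [{ "node": [{ "name": ["NODE_YYY"] }] }]
            }
          ]
        }
      }
    }
  }
}
"""


def ref_node_searches(tree: dict) -> list:
    """Return meta.extensions.nextclade.ref_nodes.search, or [] if absent."""
    ext = tree.get("meta", {}).get("extensions", {}).get("nextclade", {})
    return ext.get("ref_nodes", {}).get("search", [])


def ancestor_node_names(tree: dict) -> dict:
    """Map each search entry's name to the node names listed in its criteria."""
    result = {}
    for entry in ref_node_searches(tree):
        names = []
        for criterion in entry.get("criteria", []):
            for node in criterion.get("node", []):
                names.extend(node.get("name", []))
        result[entry["name"]] = names
    return result


ancestors = ancestor_node_names(json.loads(TREE_JSON))
print(ancestors)

# At least two ancestors are needed before any switch can be observed.
enough_ancestors = len(ancestors) >= 2
```

NODE_XXX and NODE_YYY are the placeholders from the example above and would be replaced by real node names from the dataset's tree.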

Advantages

Leverages existing ref_nodes infrastructure - no new data structures needed
Directly detects the biological signature of recombination (mixed ancestry)
Works with any number of potential ancestors
Configurable sensitivity via segment count and switch threshold
Provides interpretable output (which ancestors match which regions)

Limitations

Requires tree.json with ref_nodes.search ancestors defined
Effectiveness depends on quality of ancestor node selection
Segment-based approach may miss breakpoints that don't align with segment boundaries
Runtime grows with the number of ancestors, since every segment is compared against every ancestor
Requires sufficient sequence diversity between ancestors to distinguish them
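
The segment-resolution limitation can be illustrated with a toy example. The Python sketch below uses made-up mutation positions and a hypothetical count_switches helper, with the assumption that a tied segment keeps the previous segment's ancestor. It shows one way a short recombinant tract can go unnoticed when it is smaller than a segment, and how a finer segmentation picks it up.

```python
from collections import Counter


def count_switches(muts_by_ancestor, genome_len, num_segments):
    """Count best-ancestor changes between adjacent segments.

    Assumption of this sketch: on a tie, keep the previous segment's ancestor.
    """
    per_segment = {
        name: Counter(pos * num_segments // genome_len for pos in positions)
        for name, positions in muts_by_ancestor.items()
    }
    switches = 0
    previous = None
    for seg in range(num_segments):
        fewest = min(c[seg] for c in per_segment.values())
        tied = [name for name, c in per_segment.items() if c[seg] == fewest]
        current = previous if previous in tied else tied[0]
        if previous is not None and current != previous:
            switches += 1
        previous = current
    return switches


# Made-up query on a 1000 nt genome: matches A everywhere except a short
# tract around positions 440-470 that matches B.
query = {
    "A": [445, 460],
    "B": [20, 150, 250, 350, 410, 490, 530, 650, 750, 850, 950],
}

coarse = count_switches(query, 1000, 10)  # 100 nt segments
fine = count_switches(query, 1000, 40)    # 25 nt segments
print(coarse, fine)
```

With 10 segments the tract shares its segment with flanking positions that favour A, the counts tie, and no switch is recorded. With 40 segments the tract occupies segments of its own and two switches appear. Finer segments also contain fewer mutations each, so the per-segment choice rests on less evidence.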

Comparison to Other Strategies

Unlike mutation-counting strategies (A, B, C, D) that look at patterns within private mutations, Strategy E directly tests the recombination hypothesis by comparing the query against multiple reference ancestors. This is more biologically grounded but requires more dataset configuration (defining ancestors in tree.json). Strategy F (label switching) is conceptually similar but uses mutation labels rather than full ancestral comparisons.

Implementation Summary

Files modified:

packages/nextclade/src/qc/qc_config.rs - Added QcRecombConfigMultiAncestor config struct
packages/nextclade/src/qc/qc_rule_recombinants.rs - Added strategy_multi_ancestor() implementation
packages/nextclade/src/qc/qc_run.rs - Pass relative_nuc_mutations to recombinants rule
packages/nextclade/src/run/nextclade_run_one.rs - Thread through relative mutations
packages/nextclade-web/src/helpers/formatQCRecombinants.ts - Web UI formatting for multi-ancestor results

Files created:

packages/nextclade/src/qc/qc_recomb_utils.rs - Shared utilities (clustering, segmentation, CV calculation)

Test dataset:

data/recomb/enpen/enterovirus/ev-d68/ - EV-D68 dataset with multi-ancestor configuration

Unit tests:

Tests for disabled state, empty inputs, edge cases
Tests for single ancestor (no switches)
Tests for multiple ancestors with clear switching patterns
Tests for score calculation and minSwitches threshold

Future Work

Automatic ancestor discovery from tree structure
Variable segment sizing based on recombination hotspots
Confidence scoring based on mutation count per segment
Visualization of per-segment ancestry assignments
Integration with breakpoint detection for precise boundary identification

Co-Authored-By: Claude <noreply@anthropic.com>

github-actions · 2026-01-13T14:20:13Z

Preview: https://nextstrain--nextclade--pr-1740.previews.neherlab.click

(ci)

ivan-aksamentov · 2026-01-13T14:36:41Z

Test with strategy-specific dataset:

Preview with EV-D68 test dataset

ivan-aksamentov requested a review from Copilot January 13, 2026 13:13

Copilot started reviewing on behalf of ivan-aksamentov January 13, 2026 13:13

ivan-aksamentov mentioned this pull request Jan 13, 2026

QC label for recombinant sequences #1699

Open

This comment was marked as resolved.

fix: remove unnecessary deprecation attribute

3e15cd4

Co-Authored-By: Claude <noreply@anthropic.com>

ivan-aksamentov wants to merge 2 commits into master from feat/qc-recomb-strategy-e
