feat(qc): recombination: E: Multi-Ancestor #1740

Open
ivan-aksamentov wants to merge 2 commits into master from feat/qc-recomb-strategy-e
@ivan-aksamentov
Member

## Recombination Detection: Strategy E - Multi-Ancestor

### Scientific Motivation

Recombination creates chimeric genomes where different genomic regions inherit from different parental lineages. When a sequence is analyzed against multiple potential ancestors (clades), a non-recombinant will show consistent affinity to a single ancestor across all genome regions. A recombinant, however, will show distinct patterns: some regions match ancestor A better (fewer private mutations), while other regions match ancestor B better. This "ancestor switching" pattern across the genome is a hallmark of recombination that the Multi-Ancestor strategy exploits.

### Mechanism

The algorithm works as follows:

1. **Genome segmentation**: Divide the genome into N equal segments (configurable via `numSegments`)
2. **Per-segment ancestor matching**: For each segment, count private mutations relative to each ancestor defined in the tree's `ref_nodes.search` configuration
3. **Best ancestor selection**: For each segment, identify which ancestor has the fewest private mutations in that region
4. **Switch counting**: Count transitions between adjacent segments where the best ancestor changes
5. **Scoring**: If `switches >= minSwitches`, then `score = (switches - minSwitches + 1) * weight`
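
Steps 1-4 can be illustrated with a minimal Rust sketch. It is not the code from `strategy_multi_ancestor()`: the function names, the input shape (one list of 0-based private-mutation positions per ancestor), and the tie-breaking rule (earliest-listed ancestor wins) are all assumptions made for illustration.

```rust
/// Maps a 0-based genome position to one of `num_segments` near-equal segments.
/// Positions past the end of the genome are clamped into the last segment.
fn segment_of(pos: usize, genome_len: usize, num_segments: usize) -> usize {
    (pos * num_segments / genome_len).min(num_segments - 1)
}

/// For each segment, returns the index of the ancestor with the fewest
/// private mutations. `private_muts[a]` holds the positions of the query's
/// private mutations relative to ancestor `a` (hypothetical input shape).
/// Assumption: ties go to the earliest-listed ancestor.
pub fn best_ancestor_per_segment(
    private_muts: &[Vec<usize>],
    genome_len: usize,
    num_segments: usize,
) -> Vec<usize> {
    if private_muts.is_empty() || genome_len == 0 || num_segments == 0 {
        return Vec::new();
    }
    // counts[a][s] = private mutations vs ancestor `a` that fall in segment `s`
    let mut counts = vec![vec![0usize; num_segments]; private_muts.len()];
    for (a, positions) in private_muts.iter().enumerate() {
        for &pos in positions {
            counts[a][segment_of(pos, genome_len, num_segments)] += 1;
        }
    }
    (0..num_segments)
        .map(|s| {
            (0..private_muts.len())
                .min_by_key(|&a| counts[a][s])
                .expect("at least one ancestor is present")
        })
        .collect()
}

/// Counts adjacent segment pairs whose best ancestor differs.
pub fn count_switches(best: &[usize]) -> usize {
    best.windows(2).filter(|w| w[0] != w[1]).count()
}
```

For a toy 100-nt genome split into 4 segments, with private mutations at positions 60, 70, 80, 90 relative to the first ancestor and 5, 10, 30, 40 relative to the second, the best ancestors per segment are `[0, 0, 1, 1]`, which is one switch. Note that the tie-breaking rule matters: a segment with no private mutations against any ancestor defaults to the first ancestor in this sketch, which a real implementation may want to handle differently.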

### Configuration

Requires configuration in both `pathogen.json` (QC rule) and `tree.json` (ancestor definitions):

**pathogen.json** - QC rule configuration:
```json
{
  "qc": {
    "recombinants": {
      "enabled": true,
      "scoreWeight": 100.0,
      "multiAncestor": {
        "enabled": true,
        "weight": 50.0,
        "numSegments": 10,
        "minSwitches": 2
      }
    }
  }
}
```

**tree.json** - Ancestor definitions via `ref_nodes.search`:
```json
{
  "meta": {
    "extensions": {
      "nextclade": {
        "ref_nodes": {
          "search": [
            {
              "name": "CladeA",
              "displayName": "Clade A founder",
              "criteria": [{ "node": [{ "name": ["NODE_XXX"] }] }]
            },
            {
              "name": "CladeB",
              "displayName": "Clade B founder",
              "criteria": [{ "node": [{ "name": ["NODE_YYY"] }] }]
            }
          ]
        }
      }
    }
  }
}
```
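
To make the effect of `weight` and `minSwitches` concrete, here is a small Rust sketch of the scoring formula from the Mechanism section. The struct and function names are hypothetical, and returning 0 below the threshold is an assumption; the description only states the formula for `switches >= minSwitches`.

```rust
/// Hypothetical mirror of the `multiAncestor` fields used for scoring.
pub struct MultiAncestorParams {
    pub weight: f64,
    pub min_switches: usize,
}

/// Score for a given number of ancestor switches:
/// (switches - minSwitches + 1) * weight once the threshold is reached.
/// Assumption: below the threshold the strategy contributes 0.
pub fn multi_ancestor_score(switches: usize, params: &MultiAncestorParams) -> f64 {
    if switches < params.min_switches {
        return 0.0;
    }
    (switches - params.min_switches + 1) as f64 * params.weight
}
```

With the example values from `pathogen.json` (`weight` 50.0, `minSwitches` 2), two switches give 50.0 and each additional switch adds another 50.0. How this value is then combined with the rule-level `scoreWeight` is not covered by this sketch.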

### Advantages

- Leverages existing `ref_nodes` infrastructure - no new data structures needed
- Directly detects the biological signature of recombination (mixed ancestry)
- Works with any number of potential ancestors
- Configurable sensitivity via segment count and switch threshold
- Provides interpretable output (which ancestors match which regions)

### Limitations

- Requires `tree.json` with `ref_nodes.search` ancestors defined
- Effectiveness depends on quality of ancestor node selection
- Segment-based approach may miss breakpoints that don't align with segment boundaries
- Performance scales with number of ancestors (more comparisons needed)
- Requires sufficient sequence diversity between ancestors to distinguish them

### Comparison to Other Strategies

Unlike mutation-counting strategies (A, B, C, D) that look at patterns within private mutations, Strategy E directly tests the recombination hypothesis by comparing the query against multiple reference ancestors. This is more biologically grounded but requires more dataset configuration (defining ancestors in tree.json). Strategy F (label switching) is conceptually similar but uses mutation labels rather than full ancestral comparisons.

### Implementation Summary

Files modified:
- `packages/nextclade/src/qc/qc_config.rs` - Added `QcRecombConfigMultiAncestor` config struct
- `packages/nextclade/src/qc/qc_rule_recombinants.rs` - Added `strategy_multi_ancestor()` implementation
- `packages/nextclade/src/qc/qc_run.rs` - Pass `relative_nuc_mutations` to recombinants rule
- `packages/nextclade/src/run/nextclade_run_one.rs` - Thread through relative mutations
- `packages/nextclade-web/src/helpers/formatQCRecombinants.ts` - Web UI formatting for multi-ancestor results

Files created:
- `packages/nextclade/src/qc/qc_recomb_utils.rs` - Shared utilities (clustering, segmentation, CV calculation)

Test dataset:
- `data/recomb/enpen/enterovirus/ev-d68/` - EV-D68 dataset with multi-ancestor configuration

Unit tests:
- Tests for disabled state, empty inputs, edge cases
- Tests for single ancestor (no switches)
- Tests for multiple ancestors with clear switching patterns
- Tests for score calculation and `minSwitches` threshold

### Future Work

- Automatic ancestor discovery from tree structure
- Variable segment sizing based on recombination hotspots
- Confidence scoring based on mutation count per segment
- Visualization of per-segment ancestry assignments
- Integration with breakpoint detection for precise boundary identification

Co-Authored-By: Claude <noreply@anthropic.com>

@ivan-aksamentov
Member Author

Test with strategy-specific dataset:

Preview with EV-D68 test dataset
