feat(qc): recombination: E: Multi-Ancestor #1740
Open
ivan-aksamentov wants to merge 2 commits into master
## Recombination Detection: Strategy E - Multi-Ancestor
### Scientific Motivation
Recombination creates chimeric genomes where different genomic regions inherit from different parental lineages. When a sequence is analyzed against multiple potential ancestors (clades), a non-recombinant will show consistent affinity to a single ancestor across all genome regions. A recombinant, however, will show distinct patterns: some regions match ancestor A better (fewer private mutations), while other regions match ancestor B better. This "ancestor switching" pattern across the genome is a hallmark of recombination that the Multi-Ancestor strategy exploits.
### Mechanism
The algorithm works as follows:
1. **Genome segmentation**: Divide the genome into N equal segments (configurable via `numSegments`)
2. **Per-segment ancestor matching**: For each segment, count private mutations relative to each ancestor defined in the tree's `ref_nodes.search` configuration
3. **Best ancestor selection**: For each segment, identify which ancestor has the fewest private mutations in that region
4. **Switch counting**: Count transitions between adjacent segments where the best ancestor changes
5. **Scoring**: If switches >= `minSwitches`, score = (switches - minSwitches + 1) * weight
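The five steps above can be sketched roughly as follows. This is an illustrative sketch only, not the code from `strategy_multi_ancestor()`: the function names, the input shape (private mutation positions keyed by ancestor name), the treatment of ties as ambiguous segments that are skipped when counting switches, and the score of 0 below the threshold are all assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Step 1: map a 0-based genome position to one of `num_segments` near-equal
/// segments. Assumes `genome_len > 0` and `num_segments > 0`.
fn segment_of(pos: usize, genome_len: usize, num_segments: usize) -> usize {
    (pos * num_segments / genome_len).min(num_segments - 1)
}

/// Steps 2-3: for each segment, find the ancestor with the fewest private
/// mutations. A tie yields `None` (an assumption of this sketch).
fn best_ancestors<'a>(
    private_muts: &'a BTreeMap<&'a str, Vec<usize>>,
    genome_len: usize,
    num_segments: usize,
) -> Vec<Option<&'a str>> {
    (0..num_segments)
        .map(|seg| {
            // Count the mutations that fall into this segment, per ancestor.
            let mut counts: Vec<(usize, &str)> = private_muts
                .iter()
                .map(|(name, positions)| {
                    let n = positions
                        .iter()
                        .filter(|&&p| segment_of(p, genome_len, num_segments) == seg)
                        .count();
                    (n, *name)
                })
                .collect();
            counts.sort();
            match counts.as_slice() {
                [] => None,
                [(_, only)] => Some(*only),
                [(a, best), (b, _), ..] => {
                    if a < b {
                        Some(*best)
                    } else {
                        None
                    }
                }
            }
        })
        .collect()
}

/// Step 4: count changes of best ancestor between consecutive segments,
/// skipping ambiguous (`None`) segments (an assumption of this sketch).
fn count_switches(best: &[Option<&str>]) -> usize {
    let resolved: Vec<&str> = best.iter().flatten().copied().collect();
    resolved.windows(2).filter(|w| w[0] != w[1]).count()
}

/// Step 5: score = (switches - minSwitches + 1) * weight once the threshold
/// is reached. Returning 0.0 below the threshold is assumed here.
fn score(switches: usize, min_switches: usize, weight: f64) -> f64 {
    if switches >= min_switches {
        (switches - min_switches + 1) as f64 * weight
    } else {
        0.0
    }
}
```

For a query whose middle region resembles one ancestor and whose flanks resemble another, this sketch reports two switches, which with `minSwitches = 2` and `weight = 50.0` gives a score of (2 - 2 + 1) * 50 = 50.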
### Configuration
This strategy requires configuration in both `pathogen.json` (QC rule) and `tree.json` (ancestor definitions):
**pathogen.json** - QC rule configuration:
```json
{
  "qc": {
    "recombinants": {
      "enabled": true,
      "scoreWeight": 100.0,
      "multiAncestor": {
        "enabled": true,
        "weight": 50.0,
        "numSegments": 10,
        "minSwitches": 2
      }
    }
  }
}
```
**tree.json** - Ancestor definitions via ref_nodes.search:
```json
{
  "meta": {
    "extensions": {
      "nextclade": {
        "ref_nodes": {
          "search": [
            {
              "name": "CladeA",
              "displayName": "Clade A founder",
              "criteria": [{ "node": [{ "name": ["NODE_XXX"] }] }]
            },
            {
              "name": "CladeB",
              "displayName": "Clade B founder",
              "criteria": [{ "node": [{ "name": ["NODE_YYY"] }] }]
            }
          ]
        }
      }
    }
  }
}
```
### Advantages
- Leverages existing ref_nodes infrastructure - no new data structures needed
- Directly detects the biological signature of recombination (mixed ancestry)
- Works with any number of potential ancestors
- Configurable sensitivity via segment count and switch threshold
- Provides interpretable output (which ancestors match which regions)
### Limitations
- Requires tree.json with ref_nodes.search ancestors defined
- Effectiveness depends on quality of ancestor node selection
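- Segment-based approach localizes breakpoints only to segment resolution and may miss breakpoints that don't align with segment boundaries, such as recombinant tracts shorter than one segment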
- Performance scales with number of ancestors (more comparisons needed)
- Requires sufficient sequence diversity between ancestors to distinguish them
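To make the segment-boundary limitation concrete, the sketch below computes the coordinate range of each segment under an assumed equal-width split. The helpers `segment_bounds` and `breakpoint_window` are hypothetical names for illustration and do not come from `qc_recomb_utils.rs`; the 7,000 nt genome length in the usage note is likewise only an example value.

```rust
use std::ops::Range;

/// Half-open coordinate range of segment `seg` when a genome of `genome_len`
/// nucleotides is split into `num_segments` near-equal parts (assumed scheme).
fn segment_bounds(seg: usize, genome_len: usize, num_segments: usize) -> Range<usize> {
    let start = seg * genome_len / num_segments;
    let end = (seg + 1) * genome_len / num_segments;
    start..end
}

/// Positions in which a breakpoint could lie when the best ancestor changes
/// between segment `seg` and segment `seg + 1`: the union of both segments.
fn breakpoint_window(seg: usize, genome_len: usize, num_segments: usize) -> Range<usize> {
    let left = segment_bounds(seg, genome_len, num_segments);
    let right = segment_bounds(seg + 1, genome_len, num_segments);
    left.start..right.end
}
```

With a 7,000 nt genome and `numSegments = 10`, each segment spans 700 nt, so a switch observed between segments 3 and 4 only places the breakpoint somewhere in positions 2100 to 3500. Raising `numSegments` narrows this window but leaves fewer mutations per segment to decide the best ancestor.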
### Comparison to Other Strategies
Unlike mutation-counting strategies (A, B, C, D) that look at patterns within private mutations, Strategy E directly tests the recombination hypothesis by comparing the query against multiple reference ancestors. This is more biologically grounded but requires more dataset configuration (defining ancestors in tree.json). Strategy F (label switching) is conceptually similar but uses mutation labels rather than full ancestral comparisons.
### Implementation Summary
Files modified:
- `packages/nextclade/src/qc/qc_config.rs` - Added `QcRecombConfigMultiAncestor` config struct
- `packages/nextclade/src/qc/qc_rule_recombinants.rs` - Added `strategy_multi_ancestor()` implementation
- `packages/nextclade/src/qc/qc_run.rs` - Pass `relative_nuc_mutations` to recombinants rule
- `packages/nextclade/src/run/nextclade_run_one.rs` - Thread through relative mutations
- `packages/nextclade-web/src/helpers/formatQCRecombinants.ts` - Web UI formatting for multi-ancestor results
Files created:
- `packages/nextclade/src/qc/qc_recomb_utils.rs` - Shared utilities (clustering, segmentation, CV calculation)
Test dataset:
- `data/recomb/enpen/enterovirus/ev-d68/` - EV-D68 dataset with multi-ancestor configuration
Unit tests:
- Tests for disabled state, empty inputs, edge cases
- Tests for single ancestor (no switches)
- Tests for multiple ancestors with clear switching patterns
- Tests for score calculation and minSwitches threshold
### Future Work
- Automatic ancestor discovery from tree structure
- Variable segment sizing based on recombination hotspots
- Confidence scoring based on mutation count per segment
- Visualization of per-segment ancestry assignments
- Integration with breakpoint detection for precise boundary identification
Co-Authored-By: Claude <noreply@anthropic.com>
Test with strategy-specific dataset: