feat(qc): recombination: F: Label Switching #1741
Open
ivan-aksamentov wants to merge 4 commits into master from
## Recombination Detection: Strategy F - Label Switching
### Scientific Motivation
Recombination occurs when a virus incorporates genetic material from two or
more parental lineages. Each lineage accumulates characteristic mutations
over time - these "signature mutations" serve as molecular markers that
distinguish lineages from one another.
When a recombination event occurs, different genomic regions inherit
mutations from different parental lineages. This creates a distinctive
pattern: the sequence carries mutations characteristic of lineage A in one
region, and mutations characteristic of lineage B in another region.
The label switching strategy exploits this by leveraging the mutation label
map (`nucMutLabelMap`) - a curated mapping of nucleotide substitutions to
lineage labels. When private mutations are detected, they inherit labels from
this map. In a non-recombinant sequence, most labeled mutations should belong
to a single lineage (or closely related lineages). In a recombinant, mutations
from different lineages cluster in different genomic regions, creating
detectable "label switches" as you traverse the genome.
### Mechanism
The algorithm proceeds as follows:
1. **Label grouping**: Collect all labeled private substitutions from
`PrivateNucMutations.labeled_substitutions`. Group them by their primary
label (first label in the labels array), storing genomic positions for
each label.
2. **Minimum labels check**: If fewer than `minLabels` distinct labels are
present, return a zero score (insufficient signal for recombination).
3. **Centroid calculation**: For each label, compute the centroid (mean
position) of all mutations carrying that label. This represents the
"center of mass" of each lineage's contribution.
4. **Switch counting**: Sort labels by their centroid position. The number
of switches equals `numLabels - 1`, where `numLabels` is the number of
distinct primary labels. Each switch represents a transition between
lineage-dominated regions as you traverse the genome from 5' to 3'.
5. **Scoring**: `score = numSwitches * weight`
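The five steps above can be sketched as follows. This is a minimal illustration written from the description only, not the code in `qc_rule_recombinants.rs`: the `LabeledSub` and `LabelSwitchResult` types, the `sub` helper and the `label_switching_score` function are hypothetical names, and the `enabled` check is left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one labeled private substitution.
struct LabeledSub {
    pos: usize,          // genomic position of the substitution
    labels: Vec<String>, // labels inherited from the mutation label map
}

/// Hypothetical result type: the score plus the label order it was derived from.
struct LabelSwitchResult {
    score: f64,
    num_switches: usize,
    labels_by_centroid: Vec<String>,
}

/// Convenience constructor used in the usage example (assumption, for brevity).
fn sub(pos: usize, labels: &[&str]) -> LabeledSub {
    LabeledSub { pos, labels: labels.iter().map(|l| l.to_string()).collect() }
}

/// Returns `None` when no labeled substitution is present.
fn label_switching_score(
    subs: &[LabeledSub],
    min_labels: usize,
    weight: f64,
) -> Option<LabelSwitchResult> {
    // Step 1: group positions by primary (first) label.
    let mut by_label: BTreeMap<&str, Vec<usize>> = BTreeMap::new();
    for s in subs {
        if let Some(primary) = s.labels.first() {
            by_label.entry(primary.as_str()).or_default().push(s.pos);
        }
    }
    if by_label.is_empty() {
        return None;
    }

    // Step 3: centroid (mean position) per label.
    let mut centroids: Vec<(String, f64)> = by_label
        .iter()
        .map(|(label, positions)| {
            let sum: usize = positions.iter().sum();
            (label.to_string(), sum as f64 / positions.len() as f64)
        })
        .collect();

    // Step 4: order labels along the genome by centroid.
    centroids.sort_by(|a, b| a.1.total_cmp(&b.1));

    // Step 2 and step 4: below `min_labels` there are no switches to count.
    let num_switches = if centroids.len() < min_labels { 0 } else { centroids.len() - 1 };

    // Step 5: score = numSwitches * weight.
    Some(LabelSwitchResult {
        score: num_switches as f64 * weight,
        num_switches,
        labels_by_centroid: centroids.into_iter().map(|(label, _)| label).collect(),
    })
}
```

For example, `label_switching_score(&[sub(100, &["Alpha"]), sub(200, &["Alpha", "Beta"]), sub(5000, &["Beta"])], 2, 50.0)` groups the second substitution under `Alpha` (its first label), orders the labels as `Alpha`, `Beta`, and reports one switch with a score of 50.0.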
### Configuration
Required in `pathogen.json`:
```json
{
  "mutLabels": {
    "nucMutLabelMap": {
      "A123T": ["Alpha"],
      "G456C": ["Beta"],
      ...
    }
  },
  "qc": {
    "recombinants": {
      "enabled": true,
      "scoreWeight": 100.0,
      "labelSwitching": {
        "enabled": true,
        "weight": 50.0,
        "minLabels": 2
      }
    }
  }
}
```
Parameters:
- `enabled`: Activate label switching detection
- `weight`: Score contribution per label switch (default: 50.0)
- `minLabels`: Minimum distinct labels required to trigger detection
(default: 2)
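To show how the three parameters interact, here is a small sketch under stated assumptions. `LabelSwitchingParams` and its methods are hypothetical and are not the `QcRecombConfigLabelSwitching` struct from this PR. Only the documented defaults for `weight` and `minLabels` are encoded; no default is documented for `enabled`, so the caller supplies it.

```rust
/// Hypothetical mirror of the `labelSwitching` block in `pathogen.json`.
#[derive(Debug, Clone, PartialEq)]
struct LabelSwitchingParams {
    enabled: bool,
    weight: f64,
    min_labels: usize,
}

impl LabelSwitchingParams {
    /// Uses the documented defaults: weight = 50.0, minLabels = 2.
    fn with_documented_defaults(enabled: bool) -> Self {
        Self { enabled, weight: 50.0, min_labels: 2 }
    }

    /// Score contribution for a given number of distinct primary labels.
    /// `None` means the strategy produced no result (disabled, or no labels).
    fn score_for(&self, num_labels: usize) -> Option<f64> {
        if !self.enabled || num_labels == 0 {
            return None;
        }
        if num_labels < self.min_labels {
            // Too few distinct labels: insufficient signal, zero score.
            return Some(0.0);
        }
        // One switch per boundary between centroid-ordered labels.
        Some((num_labels - 1) as f64 * self.weight)
    }
}
```

With the documented defaults, three distinct labels give two switches and a contribution of 100.0. How that value is then combined with `scoreWeight` is not described here, so the sketch leaves it out.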
### Advantages
- Leverages existing lineage annotation infrastructure (mutLabels)
- Biologically interpretable - directly identifies which lineages
contributed to the recombinant
- Does not require spatial parameters or segment definitions
- Robust to mutation density variations across the genome
- Works with any pathogen that has curated lineage-defining mutations
### Limitations
- Requires a well-curated `nucMutLabelMap` with lineage-specific mutations
- Effectiveness depends on quality and completeness of label annotations
- Cannot detect recombination between unlabeled or identically-labeled
lineages
- Uses only the first label when mutations have multiple labels
- Centroid-based ordering may miss complex recombination patterns with
interleaved regions
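The last limitation can be made concrete with a small comparison. The sketch below is illustrative only: `centroid_switches` restates the `numLabels - 1` rule from the Mechanism section, while `positional_transitions` is a hypothetical alternative that counts label changes between neighbouring labeled mutations; it is not part of this PR.

```rust
use std::collections::BTreeSet;

/// Switch count as described in this PR: distinct labels minus one.
fn centroid_switches(muts: &[(usize, &str)]) -> usize {
    let distinct: BTreeSet<&str> = muts.iter().map(|(_, label)| *label).collect();
    distinct.len().saturating_sub(1)
}

/// Hypothetical alternative: count label changes between consecutive
/// labeled mutations when walking the genome from 5' to 3'.
fn positional_transitions(muts: &[(usize, &str)]) -> usize {
    let mut sorted = muts.to_vec();
    sorted.sort_by_key(|(pos, _)| *pos);
    sorted.windows(2).filter(|w| w[0].1 != w[1].1).count()
}

/// An interleaved A-B-A mosaic: lineage A at both ends, lineage B in the middle.
fn interleaved_example() -> Vec<(usize, &'static str)> {
    vec![
        (100, "A"),
        (200, "A"),
        (3000, "B"),
        (3100, "B"),
        (6000, "A"),
        (6100, "A"),
    ]
}
```

For the A-B-A mosaic, `centroid_switches` reports one switch because only two distinct labels are present, whereas walking the genome position by position crosses two lineage boundaries.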
### Comparison to Other Strategies
Unlike Strategy A (weighted threshold), which only counts mutations, label
switching considers the identity and spatial distribution of labeled
mutations. Unlike Strategy B (spatial uniformity), which measures general
non-uniformity, this strategy specifically identifies which lineages
contribute to different regions.
Choose label switching when:
- Your pathogen has well-characterized lineage-defining mutations
- You want to identify the parental lineages, not just detect recombination
- The labeled mutation set has good genome-wide coverage
Choose other strategies when:
- No mutation label map is available (A, B, C, D)
- Recombination involves unlabeled variants (A, B, C, D)
- Multiple ancestral references are available (E)
### Implementation Summary
Files modified:
- `packages/nextclade/src/qc/qc_config.rs` - Added QcRecombConfigLabelSwitching config struct
- `packages/nextclade/src/qc/qc_rule_recombinants.rs` - Implemented strategy_label_switching function
- `packages/nextclade/src/qc/qc_recomb_utils.rs` - Added shared utilities module
- `packages/nextclade/src/qc/qc_run.rs` - Integrated recombinants rule
- `packages/nextclade/src/qc/mod.rs` - Registered new modules
- `packages/nextclade-web/src/helpers/formatQCRecombinants.ts` - Added UI formatting
- `packages/nextclade-web/src/components/Results/ListOfQcIsuues.tsx` - Display integration
- `packages/nextclade-schemas/*.schema.{json,yaml}` - Updated JSON schemas
Test dataset:
- `data/recomb/enpen/enterovirus/ev-d68/` - EV-D68 dataset with label
switching configuration enabled for testing
Unit tests added for:
- Disabled config returns None
- Empty labeled mutations returns None
- Single label below minLabels returns zero score
- Two labels returns one switch
- Three labels returns two switches
- Multiple labels per mutation uses first label only
### Future Work
- Support weighted label switches based on centroid separation distance
- Consider secondary labels for mutations with multiple lineage assignments
- Add visualization of label distribution across genome
- Integrate with tree-based lineage assignment for validation
Co-Authored-By: Claude <noreply@anthropic.com>
Test with strategy-specific dataset: |
ivan-aksamentov added a commit that referenced this pull request on Jan 20, 2026

Closes #1699

Combines four recombination detection strategies:
- B: Spatial uniformity (PR #1737)
- C: Cluster gaps (PR #1738)
- D: Reversion clustering (PR #1739)
- F: Label switching (PR #1741)

Test dataset in this PR: `./data/recomb/enpen/enterovirus/ev-d68/`

Preview: https://nextstrain--nextclade--pr-1742.previews.neherlab.click

Preview with test dataset: https://nextstrain--nextclade--pr-1742.previews.neherlab.click?dataset-url=gh:nextstrain/nextclade@feat/qc-recomb-strategy-combined@/data/recomb/enpen/enterovirus/ev-d68/&input-fasta=example

CLI test:
```
nextclade run \
  --input-dataset data/recomb/enpen/enterovirus/ev-d68/ \
  --output-all output/ \
  data/recomb/enpen/enterovirus/ev-d68/sequences.fasta
```

Note: The current weighted score aggregation (simple sum of strategy scores) is a temporary solution. The scoring mechanism needs further discussion to determine optimal combination approach.