feat(qc): recombination: F: Label Switching#1741

Open
ivan-aksamentov wants to merge 4 commits into master from feat/qc-recomb-strategy-f

@ivan-aksamentov
Member

## Recombination Detection: Strategy F - Label Switching

### Scientific Motivation

Recombination occurs when a virus incorporates genetic material from two or
more parental lineages. Each lineage accumulates characteristic mutations
over time - these "signature mutations" serve as molecular markers that
distinguish lineages from one another.

When a recombination event occurs, different genomic regions inherit
mutations from different parental lineages. This creates a distinctive
pattern: the sequence carries mutations characteristic of lineage A in one
region, and mutations characteristic of lineage B in another region.

The label switching strategy exploits this by leveraging the mutation label
map (`nucMutLabelMap`) - a curated mapping from nucleotide mutations to
lineage labels. When private mutations are detected, they inherit labels
from this map. In a non-recombinant sequence, most labeled mutations should
belong to a single lineage (or closely related lineages). In a recombinant,
mutations from different lineages cluster in different genomic regions,
creating detectable "label switches" as you traverse the genome.

### Mechanism

The algorithm proceeds as follows:

1. **Label grouping**: Collect all labeled private substitutions from
   `PrivateNucMutations.labeled_substitutions`. Group them by their primary
   label (first label in the labels array), storing genomic positions for
   each label.

2. **Minimum labels check**: If fewer than `minLabels` distinct labels are
   present, return a score of zero (insufficient signal for recombination).

3. **Centroid calculation**: For each label, compute the centroid (mean
   position) of all mutations carrying that label. This represents the
   "center of mass" of each lineage's contribution.

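4. **Switch counting**: Sort labels by their centroid position. The number
   of switches equals `numLabels - 1`, representing transitions between
   lineage-dominated regions as you traverse the genome from 5' to 3'.
   Since each label contributes exactly one centroid, the count depends
   only on the number of distinct labels, not on their order.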

5. **Scoring**: `score = numSwitches * weight`
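
The five steps above can be sketched in Rust as follows. This is a minimal illustration, not the code from this PR: `LabeledSub`, `LabelSwitchResult` and `label_switching` are hypothetical names, and the `None` / zero-score behavior is an assumption modeled on the unit test list in the Implementation Summary.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one labeled private substitution.
#[derive(Debug, Clone)]
pub struct LabeledSub {
    /// Genomic position of the substitution.
    pub pos: usize,
    /// Lineage labels taken from `nucMutLabelMap`; only the first is used.
    pub labels: Vec<String>,
}

/// Hypothetical result type for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct LabelSwitchResult {
    /// (label, centroid) pairs sorted by centroid, 5' to 3'.
    pub labels_by_centroid: Vec<(String, f64)>,
    pub num_switches: usize,
    pub score: f64,
}

pub fn label_switching(
    subs: &[LabeledSub],
    weight: f64,
    min_labels: usize,
) -> Option<LabelSwitchResult> {
    // Step 1: group positions by primary (first) label.
    let mut positions: BTreeMap<&str, Vec<usize>> = BTreeMap::new();
    for sub in subs {
        if let Some(label) = sub.labels.first() {
            positions.entry(label.as_str()).or_default().push(sub.pos);
        }
    }
    if positions.is_empty() {
        return None; // no labeled mutations at all (assumed behavior)
    }

    // Step 2: too few distinct labels means no signal, so the score is zero.
    if positions.len() < min_labels {
        return Some(LabelSwitchResult {
            labels_by_centroid: Vec::new(),
            num_switches: 0,
            score: 0.0,
        });
    }

    // Step 3: centroid (mean position) per label.
    let mut centroids: Vec<(String, f64)> = positions
        .iter()
        .map(|(label, ps)| {
            let mean = ps.iter().sum::<usize>() as f64 / ps.len() as f64;
            ((*label).to_owned(), mean)
        })
        .collect();

    // Step 4: order labels along the genome and count transitions.
    centroids.sort_by(|a, b| a.1.total_cmp(&b.1));
    let num_switches = centroids.len() - 1;

    // Step 5: score = numSwitches * weight.
    Some(LabelSwitchResult {
        score: num_switches as f64 * weight,
        num_switches,
        labels_by_centroid: centroids,
    })
}
```

With `weight = 50.0` and `minLabels = 2`, a sequence carrying mutations labeled `Alpha` near the 5' end and `Beta` near the 3' end would yield one switch and a score of 50 under this sketch.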

### Configuration

Required in `pathogen.json`:

```json
{
  "mutLabels": {
    "nucMutLabelMap": {
      "A123T": ["Alpha"],
      "G456C": ["Beta"],
      ...
    }
  },
  "qc": {
    "recombinants": {
      "enabled": true,
      "scoreWeight": 100.0,
      "labelSwitching": {
        "enabled": true,
        "weight": 50.0,
        "minLabels": 2
      }
    }
  }
}
```

Parameters:
- `enabled`: Activate label switching detection
- `weight`: Score contribution per label switch (default: 50.0)
- `minLabels`: Minimum distinct labels required to trigger detection
  (default: 2)
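
To illustrate how these parameters interact, here is a small Rust sketch of a config type with the documented defaults. `LabelSwitchingConfig` and `score_for` are hypothetical names chosen for illustration (the PR's actual struct is `QcRecombConfigLabelSwitching`, whose fields are not shown here), and the default for `enabled` is an assumption.

```rust
/// Hypothetical mirror of the `labelSwitching` block in `pathogen.json`.
#[derive(Debug, Clone, PartialEq)]
pub struct LabelSwitchingConfig {
    pub enabled: bool,
    pub weight: f64,
    pub min_labels: usize,
}

impl Default for LabelSwitchingConfig {
    fn default() -> Self {
        Self {
            enabled: false, // assumption: the strategy is opt-in
            weight: 50.0,   // documented default
            min_labels: 2,  // documented default
        }
    }
}

impl LabelSwitchingConfig {
    /// Score for a sequence with `num_labels` distinct primary labels.
    /// Returns `None` when the strategy is disabled.
    pub fn score_for(&self, num_labels: usize) -> Option<f64> {
        if !self.enabled {
            return None;
        }
        if num_labels < self.min_labels {
            return Some(0.0);
        }
        Some(num_labels.saturating_sub(1) as f64 * self.weight)
    }
}
```

For example, with the strategy enabled and the default `weight` of 50.0, three distinct labels give two switches and a score of 100.0, while a single label stays below `minLabels` and scores 0.0.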

### Advantages

- Leverages existing lineage annotation infrastructure (mutLabels)
- Biologically interpretable - directly identifies which lineages
  contributed to the recombinant
- Does not require spatial parameters or segment definitions
- Robust to mutation density variations across the genome
- Works with any pathogen that has curated lineage-defining mutations

### Limitations

- Requires a well-curated `nucMutLabelMap` with lineage-specific mutations
- Effectiveness depends on quality and completeness of label annotations
- Cannot detect recombination between unlabeled or identically-labeled
  lineages
- Uses only the first label when mutations have multiple labels
- Centroid-based ordering may miss complex recombination patterns with
  interleaved regions
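
The last limitation can be made concrete with a short Rust sketch. It compares the switch count implied by the centroid scheme (distinct labels minus one) against a hypothetical alternative that counts label changes between neighboring mutations along the genome. Neither function is taken from this PR; `centroid_switches` and `adjacent_switches` are illustrative names.

```rust
use std::collections::BTreeSet;

/// Switches as counted by the centroid scheme: distinct labels minus one.
pub fn centroid_switches(muts: &[(usize, &str)]) -> usize {
    let labels: BTreeSet<&str> = muts.iter().map(|&(_, label)| label).collect();
    labels.len().saturating_sub(1)
}

/// Hypothetical alternative: sort mutations by position and count how often
/// the label changes between neighboring mutations.
pub fn adjacent_switches(muts: &[(usize, &str)]) -> usize {
    let mut sorted = muts.to_vec();
    sorted.sort_by_key(|&(pos, _)| pos);
    sorted.windows(2).filter(|w| w[0].1 != w[1].1).count()
}
```

For an interleaved pattern such as `A` at positions 100 and 200, `B` at 3000 and 3100, and `A` again at 6000 and 6100, the centroid scheme reports one switch while the adjacent count reports two. The centroids of `A` (3100) and `B` (3050) also nearly coincide here, so the centroid ordering says little about where each lineage sits.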

### Comparison to Other Strategies

Unlike Strategy A (weighted threshold), which only counts mutations, label
switching considers the identity and spatial distribution of labeled
mutations. Unlike Strategy B (spatial uniformity), which measures general
non-uniformity, this strategy specifically identifies which lineages
contribute to different regions.

Choose label switching when:
- Your pathogen has well-characterized lineage-defining mutations
- You want to identify the parental lineages, not just detect recombination
- The labeled mutation set has good genome-wide coverage

Choose other strategies when:
- No mutation label map is available (A, B, C, D)
- Recombination involves unlabeled variants (A, B, C, D)
- Multiple ancestral references are available (E)

### Implementation Summary

Files modified:
- `packages/nextclade/src/qc/qc_config.rs` - Added QcRecombConfigLabelSwitching config struct
- `packages/nextclade/src/qc/qc_rule_recombinants.rs` - Implemented strategy_label_switching function
- `packages/nextclade/src/qc/qc_recomb_utils.rs` - Added shared utilities module
- `packages/nextclade/src/qc/qc_run.rs` - Integrated recombinants rule
- `packages/nextclade/src/qc/mod.rs` - Registered new modules
- `packages/nextclade-web/src/helpers/formatQCRecombinants.ts` - Added UI formatting
- `packages/nextclade-web/src/components/Results/ListOfQcIsuues.tsx` - Display integration
- `packages/nextclade-schemas/*.schema.{json,yaml}` - Updated JSON schemas

Test dataset:
- `data/recomb/enpen/enterovirus/ev-d68/` - EV-D68 dataset with label
  switching configuration enabled for testing

Unit tests added for:
- Disabled config returns None
- Empty labeled mutations returns None
- Single label below minLabels returns zero score
- Two labels returns one switch
- Three labels returns two switches
- Multiple labels per mutation uses first label only

### Future Work

- Support weighted label switches based on centroid separation distance
- Consider secondary labels for mutations with multiple lineage assignments
- Add visualization of label distribution across genome
- Integrate with tree-based lineage assignment for validation

Co-Authored-By: Claude <noreply@anthropic.com>

ivan-aksamentov and others added 3 commits January 13, 2026 14:48
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
@ivan-aksamentov
Member Author

Test with strategy-specific dataset:

Preview with EV-D68 test dataset

ivan-aksamentov added a commit that referenced this pull request Jan 20, 2026
Closes #1699

Combines four recombination detection strategies:
- B: Spatial uniformity (PR #1737)
- C: Cluster gaps (PR #1738)
- D: Reversion clustering (PR #1739)
- F: Label switching (PR #1741)

Test dataset in this PR: `./data/recomb/enpen/enterovirus/ev-d68/`

Preview: https://nextstrain--nextclade--pr-1742.previews.neherlab.click

Preview with test dataset: https://nextstrain--nextclade--pr-1742.previews.neherlab.click?dataset-url=gh:nextstrain/nextclade@feat/qc-recomb-strategy-combined@/data/recomb/enpen/enterovirus/ev-d68/&input-fasta=example

CLI test:
```
nextclade run \
  --input-dataset data/recomb/enpen/enterovirus/ev-d68/ \
  --output-all output/ \
  data/recomb/enpen/enterovirus/ev-d68/sequences.fasta
```

Note: The current weighted score aggregation (simple sum of strategy
scores) is a temporary solution. The scoring mechanism needs further
discussion to determine the optimal combination approach.