
[magpietts] added multiple validation dataloaders and log metrics per val data #15348

Merged
XuesongYang merged 18 commits into NVIDIA-NeMo:main from XuesongYang:xueyang/pr-multi-val-dataloaders-main on Mar 9, 2026

Conversation


XuesongYang (Collaborator) commented on Feb 2, 2026

Summary

  • Cherry-picked #15189 (multi-validation dataloaders) to main.
  • Refactored media artifact logging.
  • Added MoE expert usage monitoring (per-expert scalars + layer-wise heatmaps).
  • Unified validation_step_outputs to always use list[list] structure, eliminating conditional branching for single vs. multi-dataloader paths.
  • Unified config key from dataset (singular) to datasets across non-lhotse YAML configs.
  • Fixed PO subclasses (OfflinePO, OnlinePO) to work with the new list[list] validation outputs.
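The unified list[list] contract from the summary above can be sketched as a tiny container (hypothetical names; the actual model overrides Lightning's validation_step_outputs rather than defining a new class):

```python
# Minimal sketch, assuming a list-of-lists contract: one inner list per
# validation dataloader, even when there is only one dataloader, so no
# call site needs isinstance()/conditional branching on the structure.
class ValOutputs:
    def __init__(self, num_dataloaders: int):
        self._outputs = [[] for _ in range(num_dataloaders)]

    def append(self, step_output: dict, dataloader_idx: int = 0) -> None:
        # Always append into the inner list for the given dataloader.
        self._outputs[dataloader_idx].append(step_output)

    def __iter__(self):
        # Callers always iterate dataloaders first, then per-step outputs.
        return iter(self._outputs)


outputs = ValOutputs(num_dataloaders=2)
outputs.append({"loss": 0.5}, dataloader_idx=0)
outputs.append({"loss": 0.7}, dataloader_idx=1)
flat = [len(per_dl) for per_dl in outputs]  # -> [1, 1]
```

The single-dataloader case is just `num_dataloaders=1`, which is what removes the single vs. multi branching mentioned above.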

Details

Multi-Validation Dataloaders

Added support for running validation on multiple datasets in a single validation run. Each dataset gets its own dataloader with per-dataset metric logging (Loss:<dataset_name>/<metric>), plus an averaged Loss:val_avg/<metric> across datasets.
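The metric naming scheme described above can be illustrated with a small helper (names like `build_val_metrics` and `per_dataset_losses` are assumptions, not the PR's actual function names):

```python
# Sketch of per-dataset metric keys plus the averaged Loss:val_avg/<metric>.
def build_val_metrics(per_dataset_losses: dict) -> dict:
    metrics, sums, counts = {}, {}, {}
    for ds_name, ds_metrics in per_dataset_losses.items():
        for metric, value in ds_metrics.items():
            # Per-dataset key: Loss:<dataset_name>/<metric>
            metrics[f"Loss:{ds_name}/{metric}"] = value
            sums[metric] = sums.get(metric, 0.0) + value
            counts[metric] = counts.get(metric, 0) + 1
    for metric, total in sums.items():
        # Averaged key across all validation datasets.
        metrics[f"Loss:val_avg/{metric}"] = total / counts[metric]
    return metrics


m = build_val_metrics({
    "LibriTTS_dev_clean": {"total": 1.0},
    "LibriTTS_test_clean": {"total": 3.0},
})
# m["Loss:val_avg/total"] -> 2.0
```

An averaged key like Loss:val_avg/total is also what checkpoint selection can monitor, since it does not depend on any single dataset name.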

Training command:

python examples/tts/magpietts.py \
    ...
    model.train_ds.input_cfg="/data/manifests/train_input_cfg.yaml" \
    model.validation_ds.datasets="/data/val_datasets.yaml" \
    ...

Validation datasets YAML (generalizes to multiple languages/splits):

- name: "LibriTTS_dev_clean"
  input_cfg: "/data/manifests/val_input_cfg_en.yaml"
- name: "LibriTTS_test_clean"
  input_cfg: "/data/manifests/val_input_cfg_en_testClean.yaml"

Shared settings (e.g., batch_duration, volume_norm, min_duration) live at the validation_ds level; per-dataset entries inherit and can override them.
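The inherit-and-override behavior described above amounts to a shallow merge. A sketch with plain dicts (the real configs are Hydra/OmegaConf nodes, so this is illustrative only):

```python
# Per-dataset keys win; anything not overridden is inherited from the
# shared validation_ds-level settings.
def resolve_dataset_cfg(shared: dict, per_dataset: dict) -> dict:
    cfg = dict(shared)       # start from shared settings
    cfg.update(per_dataset)  # per-dataset entries override
    return cfg


shared = {"batch_duration": 100, "volume_norm": True, "min_duration": 0.5}
ds = resolve_dataset_cfg(
    shared, {"name": "LibriTTS_dev_clean", "batch_duration": 50}
)
# ds["batch_duration"] -> 50 (overridden); ds["volume_norm"] -> True (inherited)
```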

Config Changes

  • train_ds: Fields previously nested under train_ds.dataset are now directly under train_ds for lhotse configs; under train_ds.datasets for non-lhotse configs.
  • validation_ds: Removed the dataset nesting level. Now requires a datasets key: a list for lhotse (one dataloader per entry) or a dict for non-lhotse (a single dataloader, with multiplicity handled via dataset_meta).

MoE Expert Usage Monitoring

  • Per-expert usage scalars under MoE:train/ and MoE:<dataset>/ panels.
  • Layer-wise expert usage heatmaps (deviation from ideal usage 1/num_experts) at validation intervals.
  • Aggregate moe_expert_usage_variance for training as a load-balance health indicator.
  • MoE auxiliary loss logging skipped when coefficient is 0 to avoid constant-zero noise.
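A single-number load-balance indicator like the moe_expert_usage_variance above can be sketched as the variance of per-expert routing fractions, where 0.0 means perfectly balanced routing (pure-Python illustration; the real code operates on router statistics tensors):

```python
# Variance of per-expert usage fractions around the ideal 1/num_experts.
def expert_usage_variance(tokens_per_expert: list) -> float:
    total = sum(tokens_per_expert)
    fracs = [t / total for t in tokens_per_expert]
    ideal = 1.0 / len(fracs)  # ideal usage is 1/num_experts
    return sum((f - ideal) ** 2 for f in fracs) / len(fracs)


balanced = expert_usage_variance([25, 25, 25, 25])  # -> 0.0
skewed = expert_usage_variance([70, 10, 10, 10])    # larger variance
```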

Example WandB Plots

(Six WandB screenshots: four captured Feb 2, 2026 and two captured Mar 4, 2026.)

Full WandB run: https://wandb.ai/aiapps/debug_magpieTTS_EN_2509_multiValSet/runs/5za0abz7

XuesongYang force-pushed the xueyang/pr-multi-val-dataloaders-main branch from 861e8b3 to fae5fcb on February 2, 2026 22:08
XuesongYang marked this pull request as ready for review on February 3, 2026 00:31
XuesongYang force-pushed the xueyang/pr-multi-val-dataloaders-main branch from fae5fcb to c9cc855 on February 3, 2026 01:35
Copilot AI review was requested due to automatic review settings on February 3, 2026 01:35

Copilot AI left a comment


Pull request overview

Adds support for validating MagpieTTS on multiple datasets (multiple validation dataloaders) while improving how media artifacts (audio + attention visualizations) are prepared and logged to W&B/TensorBoard, and updates the example Lhotse config to the new dataset configuration structure.

Changes:

  • Refactors validation media logging by separating data preparation (numpy arrays) from logger-specific emission (W&B/TB objects).
  • Adds multi-dataloader validation support, including per-dataloader metric aggregation and an averaged validation loss for checkpointing.
  • Updates the MagpieTTS Lhotse example config to remove the dataset: nesting and introduce a validation_ds.datasets list format.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
nemo/collections/tts/models/magpietts.py Implements multi-validation-dataloader handling, refactors media logging, and adjusts Lhotse dataloader config expectations.
examples/tts/conf/magpietts/magpietts_lhotse.yaml Updates example configuration to match the new train/validation dataset config structure and multi-val datasets list format.


blisc previously approved these changes Mar 4, 2026
XuesongYang and others added 18 commits March 6, 2026 22:45
…VIDIA-NeMo#15189)

1. added multiple validation dataloaders and log metrics per val data.
* Apply suggestion from @XuesongYang
* Apply suggestion from @Copilot
* Apply suggestion from @Copilot
* Apply suggestion from @Copilot

2. refactor wandb and tb logging: move image and audio object initialization to on_validation_epoch_end.
3. bugfix: adapt to new changes in codec inference and process_batch.
4. update docstring for on_validation_epoch_end
5. backward compatibility for non-lhotse configs when no 'datasets' key exists in val ds config
6. refactor arguments of func _log_media_to_wandb_and_tb with dataclass.
7. added unit tests on legacy and new yaml configs.
8. refactor as suggested using a dict instead of 5 lists to store losses.

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
…override

Override validation_step_outputs in MagpieTTSModel to always return
list-of-lists, removing all isinstance/conditional branching in
validation_step and on_validation_epoch_end. Drop legacy config
support (datasets key is now required).

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
.cpu() and .item() moved tensors off GPU, causing NCCL all_reduce
to fail with 'No backend type associated with device type cpu'
when self.log(sync_dist=True) tried to sync across ranks. Use
.detach() instead to break the computation graph while keeping
tensors on the original device.
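The .detach()-vs-.cpu() distinction from the commit message above can be mocked without torch (FakeTensor is a hypothetical stand-in; real code uses torch.Tensor, where .detach() likewise preserves tensor.device while .cpu() forces a device change):

```python
# Pure-Python mock of why .detach() was chosen over .cpu()/.item():
# sync_dist all-reduce needs the tensor to stay on its CUDA device.
class FakeTensor:
    def __init__(self, value, device, in_graph=True):
        self.value, self.device, self.in_graph = value, device, in_graph

    def detach(self):
        # Drops the autograd link but leaves the tensor on its device.
        return FakeTensor(self.value, self.device, in_graph=False)

    def cpu(self):
        # Moves to CPU; a later NCCL all_reduce on this tensor fails,
        # since NCCL has no backend for device type 'cpu'.
        return FakeTensor(self.value, "cpu", in_graph=False)


loss = FakeTensor(1.5, device="cuda:0")
ok = loss.detach()   # device preserved, graph broken
bad = loss.cpu()     # device lost
```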

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
…wise heatmaps

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
- Fix aliased heatmap: set DPI=150, interpolation='nearest'.
- Fix Y-axis orientation: origin='lower' so layer 0 is at bottom.
- Update labels: "Experts"/"Layers", add ideal usage value to title.
- Fix WandB step mismatch: pass step=global_step to experiment.log()
  so the slider aligns with training step instead of wandb's internal
  counter.
- Skip logging moe_load_balancing_loss and moe_router_z_loss when their
  coefficient is 0 (e.g. Sinkhorn routing) to avoid constant-zero noise.
- Remove moe_expert_usage_max/min from both train and val logging
  (redundant with per-expert scalar lines). Keep only variance for
  training as a single-number load balance health indicator.
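The heatmap data described in the commit above is the per-layer deviation of expert usage from the ideal 1/num_experts. A sketch of that computation (rendering would then use, per the commit message, matplotlib's imshow with origin='lower' and interpolation='nearest' at DPI 150, so layer 0 sits at the bottom and cells are not smoothed):

```python
# Per-layer deviation of expert usage fractions from ideal 1/num_experts.
# Rows = layers, columns = experts; 0.0 means an expert is used exactly
# at the ideal rate.
def usage_deviation(usage_per_layer: list) -> list:
    num_experts = len(usage_per_layer[0])
    ideal = 1.0 / num_experts
    return [[u - ideal for u in layer] for layer in usage_per_layer]


dev = usage_deviation([
    [0.50, 0.25, 0.25, 0.00],  # layer 0: expert 0 over-used, expert 3 starved
    [0.25, 0.25, 0.25, 0.25],  # layer 1: perfectly balanced
])
```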

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
… misleading validation balance log

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>
…extensibility

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Non-lhotse configs used dataset (singular) while lhotse configs used datasets
(list), causing schema confusion. Rename the key to datasets everywhere: for
non-lhotse it is a dict (multiplicity handled via dataset_meta), for lhotse it
remains a list of separate dataloader configs. Validation names are now derived
from dataset_meta keys (e.g., en+es) instead of hardcoded val_set_0.

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
…docs

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
MagpieTTSModelOfflinePO and MagpieTTSModelOnlinePO were appending to the
outer list instead of the per-dataloader inner list, causing a TypeError
during on_validation_epoch_end. Accept dataloader_idx in validation_step
and iterate all dataloaders in on_validation_epoch_end.
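The fix described above can be sketched with minimal hypothetical classes (the real ones subclass the MagpieTTS Lightning module): append into the inner per-dataloader list, never the outer one, and iterate every dataloader at epoch end.

```python
# Sketch of the PO-subclass fix: validation_step takes dataloader_idx and
# on_validation_epoch_end iterates all per-dataloader output lists.
class BaseModel:
    def __init__(self, num_val_dataloaders=1):
        # Outer list: one inner list per validation dataloader.
        self.validation_step_outputs = [[] for _ in range(num_val_dataloaders)]


class POModel(BaseModel):
    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        loss = sum(batch) / len(batch)  # stand-in for the real PO loss
        # Bug was appending to the outer list; append to the inner one.
        self.validation_step_outputs[dataloader_idx].append({"loss": loss})
        return loss

    def on_validation_epoch_end(self):
        # Iterate all dataloaders, not just the first.
        return [
            sum(o["loss"] for o in outputs) / len(outputs)
            for outputs in self.validation_step_outputs
        ]


model = POModel(num_val_dataloaders=2)
model.validation_step([1.0, 3.0], 0, dataloader_idx=0)
model.validation_step([4.0, 4.0], 0, dataloader_idx=1)
means = model.on_validation_epoch_end()  # -> [2.0, 4.0]
```

Appending to the outer list instead would put dicts where inner lists are expected, which is the TypeError the commit describes.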

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
@XuesongYang

@blisc addressed merge conflicts and fixed bugs in building the longform doc.
