
[magpietts] added multiple validation dataloaders and log metrics per val data #15348

Merged
XuesongYang merged 18 commits into NVIDIA-NeMo:main from XuesongYang:xueyang/pr-multi-val-dataloaders-main on Mar 9, 2026

Conversation


XuesongYang (Collaborator) commented on Feb 2, 2026

Summary

  • Cherry-picked #15189 (multi-validation dataloaders) to main.
  • Refactored media artifact logging.
  • Added MoE expert usage monitoring (per-expert scalars + layer-wise heatmaps).
  • Unified validation_step_outputs to always use list[list] structure, eliminating conditional branching for single vs. multi-dataloader paths.
  • Unified config key from dataset (singular) to datasets across non-lhotse YAML configs.
  • Fixed PO subclasses (OfflinePO, OnlinePO) to work with the new list[list] validation outputs.
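The unified list[list] contract from the summary above can be sketched as a tiny container (hypothetical names; the actual model overrides Lightning's validation_step_outputs rather than defining a new class):

```python
# Minimal sketch, assuming a list-of-lists contract: one inner list per
# validation dataloader, even when there is only one dataloader, so no
# call site needs isinstance()/conditional branching on the structure.
class ValOutputs:
    def __init__(self, num_dataloaders: int):
        self._outputs = [[] for _ in range(num_dataloaders)]

    def append(self, step_output: dict, dataloader_idx: int = 0) -> None:
        # Always append into the inner list for the given dataloader.
        self._outputs[dataloader_idx].append(step_output)

    def __iter__(self):
        # Callers always iterate dataloaders first, then per-step outputs.
        return iter(self._outputs)


outputs = ValOutputs(num_dataloaders=2)
outputs.append({"loss": 0.5}, dataloader_idx=0)
outputs.append({"loss": 0.7}, dataloader_idx=1)
flat = [len(per_dl) for per_dl in outputs]  # -> [1, 1]
```

The single-dataloader case is just `num_dataloaders=1`, which is what removes the single vs. multi branching mentioned above.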

Details

Multi-Validation Dataloaders

Added support for running validation on multiple datasets in a single validation run. Each dataset gets its own dataloader with per-dataset metric logging (Loss:<dataset_name>/<metric>), plus an averaged Loss:val_avg/<metric> across datasets.
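The metric naming scheme described above can be illustrated with a small helper (names like `build_val_metrics` and `per_dataset_losses` are assumptions, not the PR's actual function names):

```python
# Sketch of per-dataset metric keys plus the averaged Loss:val_avg/<metric>.
def build_val_metrics(per_dataset_losses: dict) -> dict:
    metrics, sums, counts = {}, {}, {}
    for ds_name, ds_metrics in per_dataset_losses.items():
        for metric, value in ds_metrics.items():
            # Per-dataset key: Loss:<dataset_name>/<metric>
            metrics[f"Loss:{ds_name}/{metric}"] = value
            sums[metric] = sums.get(metric, 0.0) + value
            counts[metric] = counts.get(metric, 0) + 1
    for metric, total in sums.items():
        # Averaged key across all validation datasets.
        metrics[f"Loss:val_avg/{metric}"] = total / counts[metric]
    return metrics


m = build_val_metrics({
    "LibriTTS_dev_clean": {"total": 1.0},
    "LibriTTS_test_clean": {"total": 3.0},
})
# m["Loss:val_avg/total"] -> 2.0
```

An averaged key like Loss:val_avg/total is also what checkpoint selection can monitor, since it does not depend on any single dataset name.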

Training command:

python examples/tts/magpietts.py \
    ...
    model.train_ds.input_cfg="/data/manifests/train_input_cfg.yaml" \
    model.validation_ds.datasets="/data/val_datasets.yaml" \
    ...

Validation datasets YAML (generalizes to multiple languages/splits):

- name: "LibriTTS_dev_clean"
  input_cfg: "/data/manifests/val_input_cfg_en.yaml"
- name: "LibriTTS_test_clean"
  input_cfg: "/data/manifests/val_input_cfg_en_testClean.yaml"

Shared settings (e.g., batch_duration, volume_norm, min_duration) live at the validation_ds level; per-dataset entries inherit and can override them.
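The inherit-and-override behavior described above amounts to a shallow merge. A sketch with plain dicts (the real configs are Hydra/OmegaConf nodes, so this is illustrative only):

```python
# Per-dataset keys win; anything not overridden is inherited from the
# shared validation_ds-level settings.
def resolve_dataset_cfg(shared: dict, per_dataset: dict) -> dict:
    cfg = dict(shared)       # start from shared settings
    cfg.update(per_dataset)  # per-dataset entries override
    return cfg


shared = {"batch_duration": 100, "volume_norm": True, "min_duration": 0.5}
ds = resolve_dataset_cfg(
    shared, {"name": "LibriTTS_dev_clean", "batch_duration": 50}
)
# ds["batch_duration"] -> 50 (overridden); ds["volume_norm"] -> True (inherited)
```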

Config Changes

  • train_ds: Fields previously nested under train_ds.dataset are now directly under train_ds for lhotse configs; under train_ds.datasets for non-lhotse configs.
  • validation_ds: Removed the dataset nesting level. Now requires a datasets key: a list for lhotse (one dataloader per entry) or a dict for non-lhotse (a single dataloader, with multiplicity handled via dataset_meta).

MoE Expert Usage Monitoring

  • Per-expert usage scalars under MoE:train/ and MoE:<dataset>/ panels.
  • Layer-wise expert usage heatmaps (deviation from ideal usage 1/num_experts) at validation intervals.
  • Aggregate moe_expert_usage_variance for training as a load-balance health indicator.
  • MoE auxiliary loss logging skipped when coefficient is 0 to avoid constant-zero noise.
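A single-number load-balance indicator like the moe_expert_usage_variance above can be sketched as the variance of per-expert routing fractions, where 0.0 means perfectly balanced routing (pure-Python illustration; the real code operates on router statistics tensors):

```python
# Variance of per-expert usage fractions around the ideal 1/num_experts.
def expert_usage_variance(tokens_per_expert: list) -> float:
    total = sum(tokens_per_expert)
    fracs = [t / total for t in tokens_per_expert]
    ideal = 1.0 / len(fracs)  # ideal usage is 1/num_experts
    return sum((f - ideal) ** 2 for f in fracs) / len(fracs)


balanced = expert_usage_variance([25, 25, 25, 25])  # -> 0.0
skewed = expert_usage_variance([70, 10, 10, 10])    # larger variance
```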

Example WandB Plots

(Six WandB screenshots: four captured Feb 2, 2026 and two captured Mar 4, 2026.)

Full WandB run: https://wandb.ai/aiapps/debug_magpieTTS_EN_2509_multiValSet/runs/5za0abz7

XuesongYang force-pushed the xueyang/pr-multi-val-dataloaders-main branch from 861e8b3 to fae5fcb on February 2, 2026 22:08
XuesongYang marked this pull request as ready for review on February 3, 2026 00:31
XuesongYang force-pushed the xueyang/pr-multi-val-dataloaders-main branch from fae5fcb to c9cc855 on February 3, 2026 01:35
Copilot AI review was requested due to automatic review settings on February 3, 2026 01:35

Copilot AI left a comment


Pull request overview

Adds support for validating MagpieTTS on multiple datasets (multiple validation dataloaders) while improving how media artifacts (audio + attention visualizations) are prepared and logged to W&B/TensorBoard, and updates the example Lhotse config to the new dataset configuration structure.

Changes:

  • Refactors validation media logging by separating data preparation (numpy arrays) from logger-specific emission (W&B/TB objects).
  • Adds multi-dataloader validation support, including per-dataloader metric aggregation and an averaged validation loss for checkpointing.
  • Updates the MagpieTTS Lhotse example config to remove the dataset: nesting and introduce a validation_ds.datasets list format.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
nemo/collections/tts/models/magpietts.py Implements multi-validation-dataloader handling, refactors media logging, and adjusts Lhotse dataloader config expectations.
examples/tts/conf/magpietts/magpietts_lhotse.yaml Updates example configuration to match the new train/validation dataset config structure and multi-val datasets list format.


blisc previously approved these changes Mar 4, 2026
XuesongYang and others added 18 commits March 6, 2026 22:45
…VIDIA-NeMo#15189)

1. added multiple validation dataloaders and log metrics per val data.
* Apply suggestion from @XuesongYang
* Apply suggestion from @Copilot
* Apply suggestion from @Copilot
* Apply suggestion from @Copilot

2. refactor wandb and tb logging: move image and audio object initialization to on_validation_epoch_end.
3. bugfix: adapt to new changes in codec inference and process_batch.
4. update docstring for on_validation_epoch_end
5. backward compatibility for non-lhotse configs when no 'datasets' key exists in val ds config
6. refactor arguments of func _log_media_to_wandb_and_tb with dataclass.
7. added unit tests on legacy and new yaml configs.
8. refactor as suggested using a dict instead of 5 lists to store losses.

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
…override

Override validation_step_outputs in MagpieTTSModel to always return
list-of-lists, removing all isinstance/conditional branching in
validation_step and on_validation_epoch_end. Drop legacy config
support (datasets key is now required).

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
.cpu() and .item() moved tensors off GPU, causing NCCL all_reduce
to fail with 'No backend type associated with device type cpu'
when self.log(sync_dist=True) tried to sync across ranks. Use
.detach() instead to break the computation graph while keeping
tensors on the original device.
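The .detach()-vs-.cpu() distinction from the commit message above can be mocked without torch (FakeTensor is a hypothetical stand-in; real code uses torch.Tensor, where .detach() likewise preserves tensor.device while .cpu() forces a device change):

```python
# Pure-Python mock of why .detach() was chosen over .cpu()/.item():
# sync_dist all-reduce needs the tensor to stay on its CUDA device.
class FakeTensor:
    def __init__(self, value, device, in_graph=True):
        self.value, self.device, self.in_graph = value, device, in_graph

    def detach(self):
        # Drops the autograd link but leaves the tensor on its device.
        return FakeTensor(self.value, self.device, in_graph=False)

    def cpu(self):
        # Moves to CPU; a later NCCL all_reduce on this tensor fails,
        # since NCCL has no backend for device type 'cpu'.
        return FakeTensor(self.value, "cpu", in_graph=False)


loss = FakeTensor(1.5, device="cuda:0")
ok = loss.detach()   # device preserved, graph broken
bad = loss.cpu()     # device lost
```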

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
…wise heatmaps

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
- Fix aliased heatmap: set DPI=150, interpolation='nearest'.
- Fix Y-axis orientation: origin='lower' so layer 0 is at bottom.
- Update labels: "Experts"/"Layers", add ideal usage value to title.
- Fix WandB step mismatch: pass step=global_step to experiment.log()
  so the slider aligns with training step instead of wandb's internal
  counter.
- Skip logging moe_load_balancing_loss and moe_router_z_loss when their
  coefficient is 0 (e.g. Sinkhorn routing) to avoid constant-zero noise.
- Remove moe_expert_usage_max/min from both train and val logging
  (redundant with per-expert scalar lines). Keep only variance for
  training as a single-number load balance health indicator.
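The heatmap data described in the commit above is the per-layer deviation of expert usage from the ideal 1/num_experts. A sketch of that computation (rendering would then use, per the commit message, matplotlib's imshow with origin='lower' and interpolation='nearest' at DPI 150, so layer 0 sits at the bottom and cells are not smoothed):

```python
# Per-layer deviation of expert usage fractions from ideal 1/num_experts.
# Rows = layers, columns = experts; 0.0 means an expert is used exactly
# at the ideal rate.
def usage_deviation(usage_per_layer: list) -> list:
    num_experts = len(usage_per_layer[0])
    ideal = 1.0 / num_experts
    return [[u - ideal for u in layer] for layer in usage_per_layer]


dev = usage_deviation([
    [0.50, 0.25, 0.25, 0.00],  # layer 0: expert 0 over-used, expert 3 starved
    [0.25, 0.25, 0.25, 0.25],  # layer 1: perfectly balanced
])
```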

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
… misleading validation balance log

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>
…extensibility

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Non-lhotse configs used dataset (singular) while lhotse configs used datasets
(list), causing schema confusion. Rename the key to datasets everywhere: for
non-lhotse it is a dict (multiplicity handled via dataset_meta), for lhotse it
remains a list of separate dataloader configs. Validation names are now derived
from dataset_meta keys (e.g., en+es) instead of hardcoded val_set_0.

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
…docs

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
MagpieTTSModelOfflinePO and MagpieTTSModelOnlinePO were appending to the
outer list instead of the per-dataloader inner list, causing a TypeError
during on_validation_epoch_end. Accept dataloader_idx in validation_step
and iterate all dataloaders in on_validation_epoch_end.
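The fix described above can be sketched with minimal hypothetical classes (the real ones subclass the MagpieTTS Lightning module): append into the inner per-dataloader list, never the outer one, and iterate every dataloader at epoch end.

```python
# Sketch of the PO-subclass fix: validation_step takes dataloader_idx and
# on_validation_epoch_end iterates all per-dataloader output lists.
class BaseModel:
    def __init__(self, num_val_dataloaders=1):
        # Outer list: one inner list per validation dataloader.
        self.validation_step_outputs = [[] for _ in range(num_val_dataloaders)]


class POModel(BaseModel):
    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        loss = sum(batch) / len(batch)  # stand-in for the real PO loss
        # Bug was appending to the outer list; append to the inner one.
        self.validation_step_outputs[dataloader_idx].append({"loss": loss})
        return loss

    def on_validation_epoch_end(self):
        # Iterate all dataloaders, not just the first.
        return [
            sum(o["loss"] for o in outputs) / len(outputs)
            for outputs in self.validation_step_outputs
        ]


model = POModel(num_val_dataloaders=2)
model.validation_step([1.0, 3.0], 0, dataloader_idx=0)
model.validation_step([4.0, 4.0], 0, dataloader_idx=1)
means = model.on_validation_epoch_end()  # -> [2.0, 4.0]
```

Appending to the outer list instead would put dicts where inner lists are expected, which is the TypeError the commit describes.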

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
@XuesongYang

@blisc addressed merge conflicts and fixed bugs in building the longform doc.
