[magpietts] added multiple validation dataloaders and log metrics per val data #15348
Merged
XuesongYang merged 18 commits into NVIDIA-NeMo:main on Mar 9, 2026
Conversation
Force-pushed from 861e8b3 to fae5fcb
Force-pushed from fae5fcb to c9cc855
Contributor
Pull request overview
Adds support for validating MagpieTTS on multiple datasets (multiple validation dataloaders) while improving how media artifacts (audio + attention visualizations) are prepared and logged to W&B/TensorBoard, and updates the example Lhotse config to the new dataset configuration structure.
Changes:
- Refactors validation media logging by separating data preparation (numpy arrays) from logger-specific emission (W&B/TB objects).
- Adds multi-dataloader validation support, including per-dataloader metric aggregation and an averaged validation loss for checkpointing.
- Updates the MagpieTTS Lhotse example config to remove the `dataset:` nesting and introduce a `validation_ds.datasets` list format.
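The per-dataloader metric aggregation with an averaged validation loss can be sketched in plain Python. This is a hypothetical helper (`aggregate_val_metrics` is not the NeMo function), only illustrating the `Loss:<dataset_name>/<metric>` plus `Loss:val_avg/<metric>` scheme described in this PR:

```python
from collections import defaultdict


def aggregate_val_metrics(outputs_per_dataloader, dataset_names):
    """Average each metric within a dataloader, then across dataloaders.

    outputs_per_dataloader: list with one entry per validation dataloader;
    each entry is a list of per-batch metric dicts, e.g. [{"loss": 0.5}, ...].
    Returns a flat dict of names ready for logging.
    """
    logged = {}
    for name, batches in zip(dataset_names, outputs_per_dataloader):
        sums = defaultdict(float)
        for batch_metrics in batches:
            for key, value in batch_metrics.items():
                sums[key] += value
        for key, total in sums.items():
            logged[f"Loss:{name}/{key}"] = total / len(batches)

    # Averaged metrics across dataloaders, usable e.g. for checkpointing.
    per_metric = defaultdict(list)
    for full_key, value in logged.items():
        per_metric[full_key.split("/", 1)[1]].append(value)
    for metric, vals in per_metric.items():
        logged[f"Loss:val_avg/{metric}"] = sum(vals) / len(vals)
    return logged
```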
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| nemo/collections/tts/models/magpietts.py | Implements multi-validation-dataloader handling, refactors media logging, and adjusts Lhotse dataloader config expectations. |
| examples/tts/conf/magpietts/magpietts_lhotse.yaml | Updates example configuration to match the new train/validation dataset config structure and multi-val datasets list format. |
blisc
requested changes
Feb 3, 2026
Force-pushed from 15d8bcd to 8692fbc
blisc
previously approved these changes
Mar 4, 2026
…(NVIDIA-NeMo#15189)
1. added multiple validation dataloaders and log metrics per val data.
   * Apply suggestion from @XuesongYang
   * Apply suggestion from @Copilot
   * Apply suggestion from @Copilot
   * Apply suggestion from @Copilot
2. refactor wandb and tb logging: move image and audio object initialization to on_validation_epoch_end.
3. bugfix: adapt new changes on codec inference and process_batch.
4. update docstring for on_validation_epoch_end.
5. backward compatibility for non-lhotse configs when no 'datasets' key exists in the val ds config.
6. refactor arguments of func _log_media_to_wandb_and_tb with a dataclass.
7. added unit tests on legacy and new yaml configs.
8. refactor as suggested, using a dict instead of 5 lists to store losses.
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
…override Override validation_step_outputs in MagpieTTSModel to always return list-of-lists, removing all isinstance/conditional branching in validation_step and on_validation_epoch_end. Drop legacy config support (datasets key is now required). Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
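The "always list-of-lists" shape described above can be sketched with a minimal container. This is a hypothetical stand-in for the Lightning module attribute, not the actual MagpieTTSModel code:

```python
class MultiValOutputs:
    """Store validation outputs as list[list]: one inner list per dataloader.

    Because the shape is uniform even with a single dataloader, downstream
    code (e.g. on_validation_epoch_end) never branches on dataloader count.
    """

    def __init__(self, num_dataloaders: int):
        self._outputs = [[] for _ in range(num_dataloaders)]

    def append(self, step_output, dataloader_idx: int = 0):
        # With one dataloader, everything lands in _outputs[0]; the outer
        # list is still there, so iteration code stays identical.
        self._outputs[dataloader_idx].append(step_output)

    @property
    def validation_step_outputs(self):
        return self._outputs
```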
.cpu() and .item() moved tensors off GPU, causing NCCL all_reduce to fail with 'No backend type associated with device type cpu' when self.log(sync_dist=True) tried to sync across ranks. Use .detach() instead to break the computation graph while keeping tensors on the original device. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
…wise heatmaps Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
- Fix aliased heatmap: set DPI=150, interpolation='nearest'.
- Fix Y-axis orientation: origin='lower' so layer 0 is at the bottom.
- Update labels: "Experts"/"Layers"; add the ideal usage value to the title.
- Fix WandB step mismatch: pass step=global_step to experiment.log() so the slider aligns with the training step instead of wandb's internal counter.
- Skip logging moe_load_balancing_loss and moe_router_z_loss when their coefficient is 0 (e.g. Sinkhorn routing) to avoid constant-zero noise.
- Remove moe_expert_usage_max/min from both train and val logging (redundant with per-expert scalar lines). Keep only variance for training as a single-number load-balance health indicator.
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
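The heatmap rendering fixes above can be sketched with matplotlib. This is an illustrative helper under assumed names (`plot_expert_usage` and the array layout are not the NeMo implementation):

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, as used when logging images
import matplotlib.pyplot as plt
import numpy as np


def plot_expert_usage(usage: np.ndarray, num_experts: int):
    """Render a layers-x-experts usage heatmap.

    usage: array of shape (num_layers, num_experts) with per-expert fractions.
    """
    fig, ax = plt.subplots(dpi=150)      # higher DPI avoids aliased cells
    im = ax.imshow(
        usage,
        interpolation="nearest",         # no smoothing between cells
        origin="lower",                  # layer 0 at the bottom
        aspect="auto",
    )
    ax.set_xlabel("Experts")
    ax.set_ylabel("Layers")
    ax.set_title(f"Expert usage (ideal = {1.0 / num_experts:.3f})")
    fig.colorbar(im, ax=ax)
    return fig
```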
… misleading validation balance log Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>
…extensibility Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Non-lhotse configs used `dataset` (singular) while lhotse configs used `datasets` (list), causing schema confusion. Rename the key to `datasets` everywhere: for non-lhotse it is a dict (multiplicity handled via dataset_meta); for lhotse it remains a list of separate dataloader configs. Validation names are now derived from dataset_meta keys (e.g., en+es) instead of the hardcoded val_set_0. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
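Deriving validation names from `dataset_meta` keys rather than hardcoded `val_set_<i>` can be sketched as follows. The config shapes assumed here (a dict with a `dataset_meta` sub-dict for non-lhotse, a list of such entries for lhotse) mirror the description above but are an illustration, not the actual NeMo schema:

```python
def derive_val_names(validation_ds_cfg: dict) -> list:
    """Return one validation name per dataloader from dataset_meta keys."""
    datasets = validation_ds_cfg["datasets"]
    if isinstance(datasets, dict):
        # non-lhotse: single dataloader, multiplicity via dataset_meta keys
        return list(datasets["dataset_meta"].keys())
    # lhotse: one dataloader per list entry; assume one meta key per entry
    return [next(iter(entry["dataset_meta"])) for entry in datasets]
```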
…docs Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
MagpieTTSModelOfflinePO and MagpieTTSModelOnlinePO were appending to the outer list instead of the per-dataloader inner list, causing a TypeError during on_validation_epoch_end. Accept dataloader_idx in validation_step and iterate all dataloaders in on_validation_epoch_end. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
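The bug and fix described above reduce to appending into the per-dataloader inner list instead of the outer one. A minimal sketch with invented function names (not the actual PO subclass code):

```python
def validation_step_fixed(outputs, step_result, dataloader_idx=0):
    """Append a batch result to the correct inner list.

    outputs: list[list], one inner list per validation dataloader.
    The buggy version did `outputs.append(step_result)`, flattening the
    structure and raising a TypeError in on_validation_epoch_end.
    """
    outputs[dataloader_idx].append(step_result)
    return outputs


def on_validation_epoch_end_fixed(outputs):
    # Iterate all dataloaders; average the per-batch losses of each.
    return [
        sum(batch["loss"] for batch in inner) / len(inner)
        for inner in outputs
        if inner
    ]
```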
Collaborator
Author
@blisc Addressed the merge conflicts and fixed the bugs in building the longform doc.
blisc
approved these changes
Mar 9, 2026
Summary
- Override `validation_step_outputs` to always use a `list[list]` structure, eliminating conditional branching for single vs. multi-dataloader paths.
- Rename `dataset` (singular) to `datasets` across non-lhotse YAML configs.
- Update the PO subclasses (`OfflinePO`, `OnlinePO`) to work with the new `list[list]` validation outputs.

Details
Multi-Validation Dataloaders
Added support for running validation on multiple datasets simultaneously. Each dataset gets its own dataloader with per-dataset metric logging (`Loss:<dataset_name>/<metric>`), plus an averaged `Loss:val_avg/<metric>`.

Training command:

    python examples/tts/magpietts.py \
        ... \
        model.train_ds.input_cfg="/data/manifests/train_input_cfg.yaml" \
        model.validation_ds.datasets="/data/val_datasets.yaml" \
        ...

Validation datasets YAML (generalizes to multiple languages/splits):
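The referenced YAML is not reproduced in this thread; a hypothetical sketch consistent with the described structure (dataset names, paths, and all field names other than `batch_duration`/`volume_norm`/`min_duration` and the `datasets` list are invented, not the actual NeMo schema):

```yaml
# val_datasets.yaml -- illustrative only
batch_duration: 100        # shared settings inherited by every entry
volume_norm: true
min_duration: 0.5
datasets:
  - name: en
    manifest_filepath: /data/manifests/val_en.json
  - name: es
    manifest_filepath: /data/manifests/val_es.json
    min_duration: 1.0      # per-dataset override of a shared setting
```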
Shared settings (e.g., `batch_duration`, `volume_norm`, `min_duration`) live at the `validation_ds` level; per-dataset entries inherit and can override them.

Config Changes
- `train_ds`: Fields previously nested under `train_ds.dataset` are now directly under `train_ds` for lhotse configs, and under `train_ds.datasets` for non-lhotse configs.
- `validation_ds`: Removed the `dataset` nesting level. Now requires a `datasets` key: a list for lhotse (one dataloader per entry) or a dict for non-lhotse (single dataloader, multiplicity via `dataset_meta`).

MoE Expert Usage Monitoring
- Logs expert usage to `MoE:train/` and `MoE:<dataset>/` panels.
- Shows the ideal usage value (`1/num_experts`) at validation intervals.
- Keeps only `moe_expert_usage_variance` for training as a load-balance health indicator.

Example WandB Plots
Full WandB run: https://wandb.ai/aiapps/debug_magpieTTS_EN_2509_multiValSet/runs/5za0abz7
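The variance indicator kept for MoE training, and the ideal usage value it is compared against, can be sketched in plain Python (hypothetical helper, not the NeMo code):

```python
def expert_usage_stats(counts):
    """Compute normalized expert usage, the ideal value, and the variance.

    counts: tokens routed to each expert over a window. Perfect load balance
    gives every expert usage 1/num_experts and variance 0; the variance is
    the single-number health indicator mentioned above.
    """
    total = sum(counts)
    usage = [c / total for c in counts]
    ideal = 1.0 / len(counts)
    mean = sum(usage) / len(usage)  # equals ideal by construction
    variance = sum((u - mean) ** 2 for u in usage) / len(usage)
    return usage, ideal, variance
```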