
End-of-Utterance metric #15462

Open

rfejgin wants to merge 23 commits into NVIDIA-NeMo:main from rfejgin:magpietts_eou_quality
Conversation


@rfejgin rfejgin commented Mar 4, 2026

This PR adds a metric that measures end-of-utterance performance.

Each utterance is classified as one of:

  • "good": natural ending
  • "cutoff": early cutoff
  • "silence": a long silence at the end
  • "noise": noise after the last word

New metrics reported by the evaluation script are eou_cutoff_rate, eou_silence_rate, eou_noise_rate, and eou_error_rate, where eou_error_rate accumulates all non-good cases (cutoff, silence, or noise).
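As a sketch of the aggregation step, the four rates can be computed from per-utterance labels like this (`eou_rates` is a hypothetical helper for illustration, not the evaluation script's actual function):

```python
from collections import Counter

def eou_rates(labels):
    """Aggregate per-utterance EoU labels into the reported rates.

    `labels` is a list of strings, each one of "good", "cutoff",
    "silence", or "noise". Hypothetical helper; the PR's actual
    aggregation lives in the evaluation script.
    """
    counts = Counter(labels)
    n = len(labels)
    return {
        "eou_cutoff_rate": counts["cutoff"] / n,
        "eou_silence_rate": counts["silence"] / n,
        "eou_noise_rate": counts["noise"] / n,
        # eou_error_rate accumulates all non-good cases
        "eou_error_rate": (n - counts["good"]) / n,
    }
```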

End-of-Utterance (EoU) Classifier Algorithm

This classifier detects whether a speech sample ends naturally or has an artifact (cutoff, trailing silence, or trailing noise). It works in two stages:

1. CTC Forced Alignment

We run the generated audio through a pretrained Wav2Vec2 CTC model (facebook/wav2vec2-base-960h) and use NeMo Forced Aligner's viterbi_decoding to force-align the audio frames to the target transcript. This produces per-character token segments with timestamps and confidence scores. The end of the last aligned token gives us an estimate of the speech boundary — the point where intelligible speech ends and any trailing audio begins.
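To illustrate where the per-character segments and the speech-boundary estimate come from, here is a toy post-processing sketch that collapses a per-frame CTC label sequence into (token, start, end, confidence) segments. This is not NeMo's viterbi_decoding (which aligns model logits against the target transcript); it is a minimal stand-in showing the shape of the output:

```python
def segments_from_frame_labels(frame_tokens, frame_probs, frame_dur=0.02, blank="-"):
    """Toy stand-in for CTC alignment post-processing: collapse a per-frame
    token sequence into (token, start_s, end_s, mean_confidence) segments.
    `frame_dur` of 20 ms matches Wav2Vec2's frame rate; everything else
    here is simplified for illustration."""
    segments = []
    i = 0
    while i < len(frame_tokens):
        tok = frame_tokens[i]
        j = i
        while j < len(frame_tokens) and frame_tokens[j] == tok:
            j += 1  # extend the run of identical frame labels
        if tok != blank:
            conf = sum(frame_probs[i:j]) / (j - i)
            segments.append((tok, i * frame_dur, j * frame_dur, conf))
        i = j
    return segments

segs = segments_from_frame_labels(
    ["-", "H", "H", "-", "I", "-", "-"],
    [0.9, 0.8, 0.7, 0.9, 0.95, 0.9, 0.9],
)
speech_end = segs[-1][2]  # end of last aligned token ~ speech boundary
```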

2. Trailing-Region Analysis & Classification

Using the speech boundary, we split the audio into a speech region and a trailing region (padding the speech region by a small margin of 100-150 ms to account for Wav2Vec2's end-of-segment inaccuracy). We then extract features from both regions:

  • Trailing duration — how much audio remains after aligned speech ends
  • Trail RMS ratio — RMS energy of the tail region relative to the full utterance
  • Last token confidence — alignment confidence of the final speech token (falls back to the average of the last two tokens if near zero, since a near-zero value was found to be an unreliable estimate of the actual confidence)
  • Last token gap — blank-frame gap between the last two aligned tokens; a large gap tends to occur when there is junk audio at the end
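A minimal sketch of the two waveform-level features, assuming 16 kHz mono samples as a plain list of floats (`tail_features` and the 120 ms default padding are illustrative names/values, not the PR's exact implementation):

```python
import math

def tail_features(samples, speech_end_s, sr=16000, pad_s=0.12):
    """Sketch of the trailing-region features. The speech boundary is
    padded by ~100-150 ms (here 120 ms) before splitting, per the PR."""
    boundary = min(len(samples), int((speech_end_s + pad_s) * sr))
    tail = samples[boundary:]

    def rms(x):
        return math.sqrt(sum(v * v for v in x) / len(x)) if x else 0.0

    full_rms = rms(samples)
    return {
        # how much audio remains after aligned speech ends
        "trailing_duration": len(tail) / sr,
        # RMS energy of the tail relative to the full utterance
        "trail_rms_ratio": rms(tail) / full_rms if full_rms > 0 else 0.0,
    }
```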

These features feed a simple decision tree:

| Condition | Label |
| --- | --- |
| Short tail (< 0.1 s) + low confidence + no large gap | cutoff |
| Long noisy tail (> 0.15 s, RMS ratio > 0.4) OR (large gap + low confidence) | noise |
| Long tail (> 1.4 s) without high energy | silence |
| Everything else | good |
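The decision tree above can be sketched as straight-line Python. The tail-length and RMS thresholds come from the table; the low-confidence and large-gap cutoffs (0.3 and 0.2 s) are assumed placeholder values, not the PR's actual constants:

```python
def classify_eou(trailing_duration, trail_rms_ratio, last_conf, last_gap,
                 short_tail=0.1, noisy_tail=0.15, long_tail=1.4,
                 rms_thresh=0.4, low_conf=0.3, large_gap=0.2):
    """Sketch of the decision tree. low_conf/large_gap thresholds are
    assumptions for illustration."""
    low = last_conf < low_conf
    gap = last_gap > large_gap
    if trailing_duration < short_tail and low and not gap:
        return "cutoff"
    if (trailing_duration > noisy_tail and trail_rms_ratio > rms_thresh) or (gap and low):
        return "noise"
    if trailing_duration > long_tail and trail_rms_ratio <= rms_thresh:
        return "silence"
    return "good"
```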

3. Additional Notes

  • Utterances whose ending includes both noise and silence can be classified as either class, but tend to be classified as silence, especially if the trailing silence is long.
  • Utterances with looping at the end are not considered a separate category, but I tested a few and their ending was classified as "noise", as desired.

Accuracy

This metric is still somewhat experimental and we don't have enough labeled data to measure its accuracy precisely. However, it was tested on both ground-truth test sets and noisy generated speech sets and worked quite well. It was found to be:

  • Good at detecting utterances with cutoff and silence tails, in the sense that when it detects them, it's usually correct (low false positives). It was able to find cutoff issues in a ground-truth test set that we were previously unaware included such issues.
  • Occasionally prone to under- or over-detecting noise tails. That said, on a test set of 122 utterances, nearly all of which likely contain a noise tail, it detected 97% of utterances as noisy. For LibriTTS ground-truth data, it flagged only 4/2275 as noisy (and one of those was indeed noisy). For studio-quality data (Riva), it classified all 100 utterances tested as good.

Limitations

Supports English only, for now. The metric is set to nan for other languages.

Speed

It's pretty fast: the metric processed 2300 utterances in about 36 seconds on my machine (roughly 64 utterances/second), aided by batching.

Testing

Included is a unit test that verifies that utterances whose ending types are known are classified correctly.

Adding an experimental module that examines the end of speech utterances and classifies
them as either good (natural ending), cutoff (abrupt ending), silence (long tail of
silence), or noise (tail with high energy).

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
@github-actions github-actions bot added the TTS label Mar 4, 2026
This gets rid of the torchaudio dependency.

@rfejgin rfejgin force-pushed the magpietts_eou_quality branch from 2dab2ed to 32265a5 Compare March 4, 2026 21:29
rfejgin added 2 commits March 5, 2026 13:04
Split Wav2Vec2 forward pass: run CNN feature extractor per-sample
(avoiding GroupNorm padding artifacts) and batch the transformer
encoder, LM head, and Viterbi decoding for throughput.

Key changes:
- Extract _build_alignment_info() and _classify_from_alignment() helpers
  to share logic between single and batch code paths
- Add _forced_align_batch() with unbatched CNN + batched transformer
- Add _forced_align_batch_naive() for comparison (fully batched including
  CNN; produces small alignment drift due to GroupNorm on padded zeros)
- Add classify_batch(items, log_timing) public API
- Use bool dtype for attention_mask (long dtype causes bitwise NOT bug
  in HuggingFace Wav2Vec2Encoder, zeroing out all hidden states)
- Add per-stage timing instrumentation behind log_timing flag
- Add test_batch_matches_unbatched confirming bit-exact parity
- Add test_batch_naive_matches_unbatched documenting GroupNorm drift
- Add test_batch_with_timing smoke test
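The GroupNorm padding drift mentioned above can be shown with a toy example: statistics computed per sample change once zero padding is included, so a fully batched CNN normalizes padded inputs slightly differently. This is a minimal pure-Python illustration (`group_stats` is a hypothetical stand-in; the real Wav2Vec2 extractor normalizes per channel group):

```python
def group_stats(x):
    """Mean/variance over the whole sequence — a stand-in for GroupNorm's
    per-sample statistics in the Wav2Vec2 CNN feature extractor."""
    m = sum(x) / len(x)
    v = sum((xi - m) ** 2 for xi in x) / len(x)
    return m, v

real = [0.5, -0.3, 0.8, 0.1]
padded = real + [0.0] * 4  # zero padding added for batching
# The stats differ, so normalized features drift for padded samples —
# which is why the PR runs the CNN per-sample and batches only the
# transformer encoder.
```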

Made-with: Cursor
... and delete an unneeded notebook.
rfejgin added 2 commits March 5, 2026 18:59
@rfejgin rfejgin marked this pull request as ready for review March 6, 2026 03:00
@rfejgin rfejgin requested a review from blisc March 6, 2026 03:00
@rfejgin rfejgin requested a review from rlangman March 6, 2026 03:01
rfejgin added 2 commits March 5, 2026 19:25
@rfejgin rfejgin changed the title End-of-utterance metric End-of-Utterance metric Mar 6, 2026

```python
from nemo.collections.asr.parts.utils.aligner_utils import viterbi_decoding

SR = 16000
```
Collaborator:
If you only use this magic number once, add it to the function definition itself rather than make a global variable

Author:
Done

Comment on lines +459 to +481
```python
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Classify end-of-utterance audio quality")
    parser.add_argument("audio", help="Path to audio file")
    parser.add_argument("text", help="Target text")
    args = parser.parse_args()

    classifier = EoUClassifier()
    result = classifier.classify(args.audio, args.text)
    print(f"eou_type: {result.eou_type}")
    print(f"speech_end: {result.speech_end:.3f}s")
    print(f"audio_duration: {result.audio_duration:.3f}s")
    print(f"trailing_duration: {result.trailing_duration:.3f}s")
    print(f"trail_rms_ratio: {result.trail_rms_ratio:.4f}")
    print(f"last_token_dur: {result.last_token_duration:.3f}s")
    print(f"last_token_conf: {result.last_token_confidence:.3f}")
    print(f"last_token_gap: {result.last_token_gap:.3f}s")
    print(f"last_2_ph_avg_conf: {result.last_two_phoneme_avg_confidence:.3f}")
    print(f"last_token: {result.last_token!r}")
    print(f"\nToken segments ({len(result.token_segments)}):")
    for seg in result.token_segments:
        print(f"  {seg.token!r:<6} {seg.start:.3f}-{seg.end:.3f}s dur={seg.duration:.3f}s conf={seg.confidence:.3f}")
```
Collaborator:
Do you want this API in this file? I would recommend removal

Author:
Yeah, makes sense - removed

```python
# EoU classification rates
eou_types = [m.get('eou_type') for m in filewise_metrics]
if eou_types[0] is not None:
    from collections import Counter
```
Collaborator:
Move import statements to the top of the file

Author:
Moved

Comment on lines +153 to +155
```python
Returns:
    Dict with information about when the speech ended, what the last token was,
    what its confidence was, and detailed per-segment information.
```
Collaborator:
This dictionary needs to be flattened out or moved to a dataclass. It is very dense

Author:
Sure - I've now wrapped this as a dataclass called AlignmentFeatures. It is also reused in EoUClassification, where it's now one of the fields, which reduces duplication.

* Reorganize data classes
* Rename some classes for clarity
* Get rid of global constant only used once
* Slightly increase one of the thresholds
