Adding an experimental module that examines the end of speech utterances and classifies them as either good (natural ending), cutoff (abrupt ending), silence (long tail of silence), or noise (tail with high energy).

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
This gets rid of the torchaudio dependency. Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Split Wav2Vec2 forward pass: run CNN feature extractor per-sample (avoiding GroupNorm padding artifacts) and batch the transformer encoder, LM head, and Viterbi decoding for throughput.

Key changes:
- Extract _build_alignment_info() and _classify_from_alignment() helpers to share logic between single and batch code paths
- Add _forced_align_batch() with unbatched CNN + batched transformer
- Add _forced_align_batch_naive() for comparison (fully batched including CNN; produces small alignment drift due to GroupNorm on padded zeros)
- Add classify_batch(items, log_timing) public API
- Use bool dtype for attention_mask (long dtype causes bitwise NOT bug in HuggingFace Wav2Vec2Encoder, zeroing out all hidden states)
- Add per-stage timing instrumentation behind log_timing flag
- Add test_batch_matches_unbatched confirming bit-exact parity
- Add test_batch_naive_matches_unbatched documenting GroupNorm drift
- Add test_batch_with_timing smoke test

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Made-with: Cursor
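The attention-mask dtype fix in the commit message above comes down to Python's `~` operator: on an integer tensor it is bitwise NOT (`~1 == -2`), while on a bool tensor it is logical negation, so a long-dtype padding mask can silently zero out every hidden state inside the encoder. A minimal sketch of building the mask with the right dtype (the function and argument names here are illustrative, not the PR's actual helpers):

```python
import torch

def make_attention_mask(lengths, max_len):
    """Build a per-sample padding mask for a batched encoder pass.

    The comparison below already yields torch.bool; the point is to
    NOT cast it to long, since downstream code that negates the mask
    with `~` would then do bitwise NOT instead of logical NOT.
    """
    idx = torch.arange(max_len).unsqueeze(0)          # shape (1, max_len)
    mask = idx < torch.tensor(lengths).unsqueeze(1)   # shape (batch, max_len), bool
    return mask

# Two samples of lengths 3 and 5, padded to 5 frames:
mask = make_attention_mask([3, 5], max_len=5)
print(mask.dtype)  # torch.bool
```

The same shape of bug applies to any mask handed to code that flips it with `~`, which is why the commit pins the dtype at construction time rather than relying on the consumer.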
... and delete an unneeded notebook. Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
blisc requested changes on Mar 9, 2026
```python
from nemo.collections.asr.parts.utils.aligner_utils import viterbi_decoding
```

```python
SR = 16000
```
Collaborator
If you only use this magic number once, add it to the function definition itself rather than make a global variable
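The reviewer's suggestion can be sketched like this: instead of a module-level `SR` constant consumed in one place, the sample rate becomes a defaulted keyword argument on the one function that needs it. The function name below is illustrative, not the PR's actual API.

```python
def samples_to_seconds(n_samples, sample_rate=16000):
    """Convert a sample count to seconds.

    The 16 kHz rate lives in the signature as a default instead of a
    module-level `SR` global, per the review comment above. Callers
    that need a different rate can still override it explicitly.
    """
    return n_samples / sample_rate

print(samples_to_seconds(8000))  # 0.5
```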
Comment on lines +459 to +481
```python
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Classify end-of-utterance audio quality")
    parser.add_argument("audio", help="Path to audio file")
    parser.add_argument("text", help="Target text")
    args = parser.parse_args()

    classifier = EoUClassifier()
    result = classifier.classify(args.audio, args.text)
    print(f"eou_type: {result.eou_type}")
    print(f"speech_end: {result.speech_end:.3f}s")
    print(f"audio_duration: {result.audio_duration:.3f}s")
    print(f"trailing_duration: {result.trailing_duration:.3f}s")
    print(f"trail_rms_ratio: {result.trail_rms_ratio:.4f}")
    print(f"last_token_dur: {result.last_token_duration:.3f}s")
    print(f"last_token_conf: {result.last_token_confidence:.3f}")
    print(f"last_token_gap: {result.last_token_gap:.3f}s")
    print(f"last_2_ph_avg_conf: {result.last_two_phoneme_avg_confidence:.3f}")
    print(f"last_token: {result.last_token!r}")
    print(f"\nToken segments ({len(result.token_segments)}):")
    for seg in result.token_segments:
        print(f"  {seg.token!r:<6} {seg.start:.3f}-{seg.end:.3f}s dur={seg.duration:.3f}s conf={seg.confidence:.3f}")
```
Collaborator
Do you want this API in this file? I would recommend removal
Collaborator
Author
Yeah, makes sense - removed
```python
# EoU classification rates
eou_types = [m.get('eou_type') for m in filewise_metrics]
if eou_types[0] is not None:
    from collections import Counter
```
Collaborator
Move import statements to the top of the file
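Applying that comment, `Counter` moves to the top of the module, and the per-class rates reduce to a frequency count over the collected `eou_type` values. This is a sketch of how the aggregate metrics described in this PR could be derived; the evaluation script's actual code may differ.

```python
from collections import Counter  # module-level, per the review comment

def eou_rates(eou_types):
    """Fraction of utterances per EoU class, plus an overall error rate.

    `eou_error_rate` accumulates all non-good cases
    (cutoff OR silence OR noise), as described in the PR summary.
    """
    counts = Counter(eou_types)
    n = len(eou_types)
    rates = {f"eou_{k}_rate": counts[k] / n for k in ("cutoff", "silence", "noise")}
    rates["eou_error_rate"] = sum(rates.values())
    return rates

print(eou_rates(["good", "good", "cutoff", "silence"]))
```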
Comment on lines +153 to +155
```
Returns:
    Dict with information about when the speech ended, what the last token was,
    what its confidence was, and detailed per-segment information.
```
Collaborator
This dictionary needs to be flattened out or moved to a dataclass. It is very dense
Collaborator
Author
Sure - I've now wrapped this as a dataclass called AlignmentFeatures. It is also reused in the EoUClassification, where it's now one of the fields, which reduces duplication.
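A rough sketch of what such a dataclass could look like, with field names inferred from the CLI printout earlier in this PR; the merged `AlignmentFeatures` may carry different or additional fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TokenSegment:
    """One force-aligned character token with timing and confidence."""
    token: str
    start: float
    end: float
    confidence: float

    @property
    def duration(self) -> float:
        return self.end - self.start

@dataclass
class AlignmentFeatures:
    """Flattened alignment summary, replacing the dense dict the review flagged."""
    speech_end: float
    last_token: Optional[str]
    last_token_confidence: float
    token_segments: List[TokenSegment] = field(default_factory=list)

seg = TokenSegment(token="o", start=1.20, end=1.35, confidence=0.97)
feats = AlignmentFeatures(speech_end=1.35, last_token="o",
                          last_token_confidence=0.97, token_segments=[seg])
print(feats.token_segments[0].duration)  # ≈ 0.15 (floating-point)
```

Typed fields with defaults make the structure self-documenting and let the classifier embed it directly as one field of the classification result, as the author describes.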
* Reorganize data classes
* Rename some classes for clarity
* Get rid of global constant only used once
* Slightly increase one of the thresholds

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
This PR adds a metric that measures end-of-utterance performance.
Each utterance is classified as one of:
- good: natural ending
- cutoff: abrupt ending
- silence: long tail of silence
- noise: tail with high energy
New metrics reported by the evaluation script are
`eou_cutoff_rate`, `eou_silence_rate`, `eou_noise_rate`, and `eou_error_rate`, where `eou_error_rate` accumulates all non-good cases (cutoff OR silence OR noise).

End-of-Utterance (EoU) Classifier Algorithm
This classifier detects whether a speech sample ends naturally or has an artifact (cutoff, trailing silence, or trailing noise). It works in two stages:
1. CTC Forced Alignment
We run the generated audio through a pretrained Wav2Vec2 CTC model (
`facebook/wav2vec2-base-960h`) and use NeMo Forced Aligner's `viterbi_decoding` to force-align the audio frames to the target transcript. This produces per-character token segments with timestamps and confidence scores. The end of the last aligned token gives us an estimate of the speech boundary: the point where intelligible speech ends and any trailing audio begins.

2. Trailing-Region Analysis & Classification
Using the speech boundary, we split the audio into a speech region and a trailing region (padding the speech region by 100–150 ms to account for Wav2Vec2's end-of-segment inaccuracy). We then extract features from both regions:
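The split and one of the tail features can be sketched as follows. The feature names (`trailing_duration`, `trail_rms_ratio`) follow the CLI output shown earlier in this PR; the padding value and the exact feature definitions are assumptions for illustration.

```python
import numpy as np

def trailing_features(audio, speech_end_s, sample_rate=16000, pad_s=0.125):
    """Split audio at the aligned speech boundary and compute tail features.

    The speech region is padded ~100-150 ms past the last aligned token
    to absorb Wav2Vec2's end-of-segment inaccuracy, then the RMS energy
    of the trailing region is compared against the speech region's RMS.
    """
    boundary = min(len(audio), int((speech_end_s + pad_s) * sample_rate))
    speech, trail = audio[:boundary], audio[boundary:]

    def rms(x):
        return float(np.sqrt(np.mean(x ** 2))) if len(x) else 0.0

    speech_rms = rms(speech)
    return {
        "trailing_duration": len(trail) / sample_rate,
        "trail_rms_ratio": rms(trail) / speech_rms if speech_rms > 0 else 0.0,
    }

# One second of speech followed by one second of digital silence:
audio = np.concatenate([0.5 * np.ones(16000), np.zeros(16000)])
print(trailing_features(audio, speech_end_s=0.875))
```

A high `trail_rms_ratio` on a long tail indicates trailing noise, while a low ratio on a long tail indicates trailing silence, matching the classification targets above.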
These features feed a simple decision tree:
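The PR does not spell out the tree's branches here, so the sketch below is purely illustrative: every threshold and every feature pairing is an assumption, not the PR's actual rules. It only shows the general shape of a hand-built decision tree over these features.

```python
def classify_eou(trailing_duration, trail_rms_ratio, last_token_confidence,
                 sil_dur_s=0.5, noise_rms=0.3, cutoff_conf=0.5):
    """Illustrative decision logic; thresholds are invented for this sketch."""
    if last_token_confidence < cutoff_conf:
        return "cutoff"   # last token poorly aligned: speech likely truncated
    if trailing_duration > sil_dur_s and trail_rms_ratio > noise_rms:
        return "noise"    # long, energetic tail after speech ends
    if trailing_duration > sil_dur_s:
        return "silence"  # long, quiet tail
    return "good"

print(classify_eou(0.1, 0.0, 0.95))  # good
```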
3. Additional Notes
Accuracy
This metric is still somewhat experimental and we don't have enough labeled data to measure its accuracy, but it was tested on both ground-truth test sets and noisy generated speech sets and worked quite well.
Limitations
Supports English only, for now. The metric is set to `nan` for other languages.

Speed
It's pretty fast – the metric processed 2300 utterances in about 36 seconds on my machine. The speed is aided by batching.
Testing
Included is a unit test that verifies that utterances whose ending types are known are classified correctly.