-
Notifications
You must be signed in to change notification settings - Fork 375
Bilingual Japanese / English zipformer recipe (multi_ja_en) and MLS English recipe (mls_english) #2015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Bilingual Japanese / English zipformer recipe (multi_ja_en) and MLS English recipe (mls_english) #2015
Conversation
…ed symlink to librispeech
…just datamodule to load from manifest files
This reverts commit ba603e0.
…ultiDatasetAsrDataModule, not tested yet
…re.sh from commit 547f5c5
…tructure, added script to update cutset paths. WIP
…the multilingual training recipe directory structure
…o make dev and test splits have matching sizes to reazonspeech
Musan mls clean final
WalkthroughAdds a complete MLS English ASR recipe and integrates it into the multi_ja_en pipeline. Introduces dataset download, feature/materialization from Hugging Face, transcripts/BPE tooling, a new MLS English data module, and Zipformer training/decoding (non-streaming/streaming). Updates multi_ja_en to use MLS English + ReazonSpeech, overhauls manifests/lang prep, and replaces legacy LibriSpeech/char-token flows. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant U as User
participant P as prepare.sh
participant HF as HuggingFace Hub
participant L as Lhotse
participant M as MUSAN
participant SPM as SentencePiece
U->>P: Run prepare.sh (stages 0..6)
P->>HF: snapshot_download mls_eng
P->>L: CutSet.from_huggingface_dataset (train/dev/test)
P->>L: save_audios + compute_and_store_features
P->>M: lhotse prepare & fbank
P->>L: generate transcript.txt
P->>SPM: train BPE (bbpe.model)
P->>P: update_cutset_paths
P-->>U: Manifests, features, lang ready
sequenceDiagram
autonumber
participant T as train.py
participant DM as MLSEnglish/MultiDataset DataModule
participant D as DataLoader
participant Z as Zipformer Model
participant CK as Checkpointer
T->>DM: build train/valid datasets (OTF feats, MUSAN, aug)
DM-->>D: DynamicBucketingSampler loaders
loop per epoch
D->>Z: batches (features, tokens)
Z-->>T: losses (transducer/CTC)
T->>CK: save checkpoints / best
end
sequenceDiagram
autonumber
participant Dec as decode.py
participant DM as DataModule
participant DL as DataLoader
participant Z as Zipformer Model
participant G as Decoding Graph/LM
participant Eval as WER
Dec->>DM: test_cuts -> DataLoader
Dec->>Z: load/avg checkpoints
Dec->>G: build (trivial/L/G/context)
DL->>Z: batch features
Z-->>Dec: hypotheses (greedy/beam/fast)
Dec->>Eval: compute WER, save results
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60–90 minutes Possibly related PRs
Suggested reviewers
Pre-merge checks (2 passed, 1 warning)❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Poem
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal). Please share your feedback with us on this Discord post. ✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 47
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (15)
egs/tedlium3/ASR/conformer_ctc2/transformer.py (1)
606-616
: Document kwargs behavior and ensure TorchScript compatibility
- In
egs/tedlium3/ASR/conformer_ctc2/transformer.py
’sTransformerDecoderLayer.forward
, append to the docstring:warmup: controls selective bypass of layers; if < 1.0, we will bypass the layer more frequently (default=1.0).
kwargs:
accepted for API compatibility; currently ignored.
- In
egs/tedlium3/ASR/conformer_ctc2/export.py
(around line 277), insert beforemodel = torch.jit.script(model)
:to avoid scripting errors.# Prevent **kwargs in TransformerDecoderLayer.forward from breaking JIT model.__class__.forward = torch.jit.ignore(model.__class__.forward)egs/mls_english/ASR/zipformer/optim.py (2)
103-116
: Fix NameError: ‘key’ is undefined when batching parameters.
stacked_params_dict[key] = p_stacked
useskey
that isn’t defined in this scope.Apply:
for batch, batch_names in zip(batches, batches_names): p = batch[0] @@ - p_stacked.grad = grad - stacked_params_dict[key] = p_stacked + p_stacked.grad = grad + # derive the batch key from a representative param + key = (str(p.dtype), *p.shape) + stacked_params_dict[key] = p_stacked tuples.append((p_stacked, state, batch_names))
551-553
: Fix undefined variable: use first_p.device instead of p.device.
p
is not defined in this scope; should referencefirst_p
.Apply:
- first_state["model_norms"] = torch.zeros( - clipping_update_period, device=p.device - ) + first_state["model_norms"] = torch.zeros( + clipping_update_period, device=first_p.device + )egs/mls_english/ASR/zipformer/my_profile.py (2)
158-163
: Fix typo in get_model_profile kwarg and avoid shadowing config “params”Typo: module_hoop_mapping -> module_hook_mapping. Also rename the returned “params” to avoid shadowing the config dict.
- flops, params = get_model_profile( + flops, param_count = get_model_profile( model=model, args=(feature, feature_lens), - module_hoop_mapping=MODULE_HOOK_MAPPING, + module_hook_mapping=MODULE_HOOK_MAPPING, ) - logging.info(f"For the encoder part, params: {params}, flops: {flops}") + logging.info(f"For the encoder part, params: {param_count}, flops: {flops}")
1-171
: Remove outdated relative-path license headers in MLS English zipformer files
Remove the line# See ../../../../LICENSE for clarification regarding multiple authors
from all Python scripts in egs/mls_english/ASR/zipformer (
my_profile.py
,train.py
,streaming_decode.py
,decode.py
,do_not_use_it_directly.py
).egs/mls_english/ASR/zipformer/zipformer.py (1)
239-251
: Fix dtype bug when building feature masks (logical_and on float tensors).
torch.logical_and
expects bool tensors; here it’s applied to float masks, which will error. Build boolean masks, then cast once for multiplication.Apply this diff:
- # mask1 shape: (1, batch_size, 1) - mask1 = ( - torch.rand(1, batch_size, 1, device=x.device) > feature_mask_dropout_prob - ).to(x.dtype) - - # mask2 has additional sequences masked, about twice the number. - mask2 = torch.logical_and( - mask1, - ( - torch.rand(1, batch_size, 1, device=x.device) - > feature_mask_dropout_prob - ).to(x.dtype), - ) + # mask1 shape: (1, batch_size, 1) + mask1_bool = torch.rand(1, batch_size, 1, device=x.device) > feature_mask_dropout_prob + # mask2 has additional sequences masked, about twice the number. + mask2_bool = torch.logical_and( + mask1_bool, + torch.rand(1, batch_size, 1, device=x.device) > feature_mask_dropout_prob, + ) - # dim: (1, batch_size, 2) - mask = torch.cat((mask1, mask2), dim=-1) + # dim: (1, batch_size, 2) as dtype of x for downstream multiplication + mask = torch.cat((mask1_bool.to(x.dtype), mask2_bool.to(x.dtype)), dim=-1)egs/multi_ja_en/ASR/local/prepare_lang_bbpe.py (1)
24-34
: Fix mismatch between CLI/docs and actual BBPE model path; wire up --vocab-size and add fallback search.Docs and help mention bbpe_<vocab_size>/bbpe.model, but the code only checks lang_dir/bbpe.model and ignores --vocab-size. This will raise FileNotFoundError in the new layout.
Apply:
@@ - parser.add_argument( - "--lang-dir", - type=str, - help="""Input and output directory. - It should contain the words.txt file and the - bbpe model in a subdirectory (e.g., bbpe_2000/bbpe.model). - """, - ) + parser.add_argument( + "--lang-dir", + type=str, + help="""Input and output directory. + It should contain words.txt and a BBPE model, either: + - {lang_dir}/bbpe_{vocab_size}/bbpe.model, or + - {lang_dir}/bbpe.model (deprecated fallback). + """, + ) @@ - parser.add_argument( - "--vocab-size", - type=int, - default=2000, # Add a default value for vocab_size for consistency - help="Vocabulary size used for BPE training (determines the bbpe model directory).", - ) + parser.add_argument( + "--vocab-size", + type=int, + default=2000, + help="Vocabulary size used for BBPE (drives bbpe_{size}/bbpe.model lookup).", + ) @@ - model_file = lang_dir / "bbpe.model" - - if not model_file.is_file(): - raise FileNotFoundError(f"BPE model not found at: {model_file}") + # Prefer new layout; fall back to legacy flat file. + candidate_new = lang_dir / f"bbpe_{args.vocab_size}" / "bbpe.model" + candidate_legacy = lang_dir / "bbpe.model" + model_file = candidate_new if candidate_new.is_file() else candidate_legacy + if not model_file.is_file(): + raise FileNotFoundError( + "BBPE model not found. Searched:\n" + f" - {candidate_new}\n" + f" - {candidate_legacy}\n" + "Please run local/train_bbpe_model.py or adjust --vocab-size/lang-dir." + ) @@ - lexicon, token_sym_table = generate_lexicon(str(model_file), words, args.oov) + lexicon, token_sym_table = generate_lexicon(str(model_file), words, args.oov)Also applies to: 170-179, 188-194, 215-219, 230-231
egs/mls_english/ASR/local/compute_fbank_mls_english.py (1)
97-104
: Require essential CLI arguments.These paths are dereferenced unconditionally; make them required to avoid None/TypeErrors.
- parser.add_argument("-m", "--manifest-dir", type=Path) - parser.add_argument("-a", "--audio-dir", type=Path) - parser.add_argument("-d", "--dl-dir", type=Path) + parser.add_argument("-m", "--manifest-dir", type=Path, required=True, help="Directory for feature manifests and done marker") + parser.add_argument("-a", "--audio-dir", type=Path, required=True, help="Directory to materialize audio per split") + parser.add_argument("-d", "--dl-dir", type=Path, required=True, help="HF dataset name or local path (e.g., parler-tts/mls_eng)")egs/multi_ja_en/ASR/local/train_bbpe_model.py (2)
83-87
: Stream input conversion; avoid loading whole file into memory and set encoding
readlines()
loads the entire transcript and encoding is unspecified.-def _convert_to_bchar(in_path: str, out_path: str): - with open(out_path, "w") as f: - for line in open(in_path, "r").readlines(): - f.write(byte_encode(tokenize_by_ja_char(line)) + "\n") +def _convert_to_bchar(in_path: str, out_path: str): + with open(in_path, "r", encoding="utf-8") as src, open( + out_path, "w", encoding="utf-8" + ) as dst: + for line in src: + dst.write(byte_encode(tokenize_by_ja_char(line)) + "\n")
114-117
: Use delete=False for NamedTemporaryFile to avoid Windows sharing issuesOpening the same path for write while
NamedTemporaryFile
is open can fail on Windows.- temp = tempfile.NamedTemporaryFile() - train_text = temp.name + temp = tempfile.NamedTemporaryFile(delete=False) + train_text = temp.name + temp.close()egs/mls_english/ASR/zipformer/asr_datamodule.py (1)
1-366
: De-duplicate: make this a thin re-export to the local/utils moduleKeep a single implementation and re-export here to avoid drift.
@@ -<remove entire file contents> +from ..local.utils.asr_datamodule import MLSEnglishHFAsrDataModule + +__all__ = ["MLSEnglishHFAsrDataModule"]egs/mls_english/ASR/zipformer/do_not_use_it_directly.py (1)
1-1262
: Replace lingeringReazonSpeechAsrDataModule
imports withMLSEnglishHFAsrDataModule
- In all scripts under
egs/mls_english/ASR/zipformer/
(e.g.do_not_use_it_directly.py
,streaming_decode.py
,decode.py
,train.py
), removeand instead importfrom asr_datamodule import ReazonSpeechAsrDataModulefrom ASR.local.utils.asr_datamodule import MLSEnglishHFAsrDataModule- Update all instantiations of
ReazonSpeechAsrDataModule
to useMLSEnglishHFAsrDataModule(args)
, ensuring consistency with the soleasr_datamodule.py
underASR/local/utils
.egs/multi_ja_en/ASR/zipformer/decode.py (3)
67-72
: Import error: asr_datamodule is not in this package path. Add sys.path shim.-import torch.nn as nn -from asr_datamodule import MultiDatasetAsrDataModule +import torch.nn as nn +import sys as _sys +from pathlib import Path as _P +_sys.path.append(str((_P(__file__).resolve().parents[1] / "local" / "utils"))) +from asr_datamodule import MultiDatasetAsrDataModule
379-392
: Bug: tokenization for oracle n-best uses a list where a string is expected.supervisions["text"] is a list. Apply normalization per-utterance.
- num_paths=params.num_paths, - ref_texts=sp.encode(byte_encode(tokenize_by_ja_char(supervisions["text"]))), + num_paths=params.num_paths, + ref_texts=sp.encode( + [byte_encode(tokenize_by_ja_char(t)) for t in supervisions["text"]] + ), nbest_scale=params.nbest_scale,
556-564
: WER summary filename uses an out-of-scope 'key'.This will reference the last loop value or fail if empty. Remove 'key' from the filename.
- errs_info = ( - params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt" - ) + errs_info = params.res_dir / f"wer-summary-{test_set_name}-{params.suffix}.txt"
🧹 Nitpick comments (74)
egs/mls_english/ASR/zipformer/scaling.py (1)
193-206
: Optional: fix odd extra_repr formatting.
extra_repr()
printsself.schedule.pairs[1:-1]
, dropping the end points. Use the full list for clarity.- return ( - f"batch_count={self.batch_count}, schedule={str(self.schedule.pairs[1:-1])}" - ) + return f"batch_count={self.batch_count}, schedule={self.schedule.pairs}"egs/mls_english/ASR/zipformer/export.py (2)
25-29
: Adjust usage/docs to reference MLS English pathsThe embedded usage/help still references “librispeech” paths. Since this file lives under mls_english, update examples to avoid confusion.
Example diff:
-Note: This is a example for librispeech dataset, if you are using different +Note: This is an example for the MLS English dataset; if you are using a different dataset, you should change the argument values according to your dataset. @@ - cd /path/to/egs/librispeech/ASR + cd /path/to/egs/mls_english/ASR @@ - cd /path/to/egs/librispeech/ASR + cd /path/to/egs/mls_english/ASRAlso applies to: 99-107, 115-129, 129-140
399-405
: Enable CUDA automatically (or add a flag)Export can benefit from GPU when available. Consider auto-selecting CUDA to avoid accidental CPU-only runs.
- device = torch.device("cpu") - # if torch.cuda.is_available(): - # device = torch.device("cuda", 0) + device = torch.device("cuda", 0) if torch.cuda.is_available() else torch.device("cpu")egs/mls_english/ASR/zipformer/optim.py (2)
463-463
: Dead variable.
batch = True
is unused; remove or document if intended for future use.- batch = True
480-503
: Guard in-place grad scaling.
p.grad.mul_(clipping_scale)
mutates grad in-place; if the same grad tensor is reused elsewhere, this could have side effects. Consider using out-of-place scaling.- grad = ( - p.grad if clipping_scale == 1.0 else p.grad.mul_(clipping_scale) - ) + grad = p.grad if clipping_scale == 1.0 else p.grad * clipping_scaleegs/mls_english/ASR/zipformer/my_profile.py (1)
111-111
: Fix misleading shape comment after permuteThe permute converts (T, N, C) -> (N, T, C), not the other way around.
- encoder_out = encoder_out.permute(1, 0, 2) # (N, T, C) -> (T, N, C) + encoder_out = encoder_out.permute(1, 0, 2) # (T, N, C) -> (N, T, C)egs/mls_english/ASR/zipformer/subsampling.py (2)
205-227
: Docstring parameters are inaccurate/outdatedSecond “layer1_channels” actually documents layer2; “bottleneck” isn’t a ctor arg; missing docs for layer3_channels and dropout.
- layer1_channels: - Number of channels in layer2 - bottleneck: - bottleneck dimension for 1d squeeze-excite + layer2_channels: + Number of channels in layer2 + layer3_channels: + Number of channels in layer3 (input to ConvNeXt) + dropout: + Dropout probability or ScheduledFloat applied after output norm.
306-309
: Stale comment: no explicit scaling of x occurs hereComment references scaling x by 0.1, but code relies on ScaleGrad modules instead.
- # scaling x by 0.1 allows us to use a larger grad-scale in fp16 "amp" (automatic mixed precision) - # training, since the weights in the first convolution are otherwise the limiting factor for getting infinite - # gradients. + # Gradient stability for mixed precision is handled above via ScaleGrad; + # no explicit scaling of x is performed here.egs/mls_english/ASR/zipformer/decoder.py (1)
79-88
: Guard grouped Conv1d with a divisibility checkgroups=decoder_dim//4 requires decoder_dim % 4 == 0; assert early for clearer errors.
assert context_size >= 1, context_size + assert decoder_dim % 4 == 0, "decoder_dim must be divisible by 4 when using groups=decoder_dim//4" self.context_size = context_size self.vocab_size = vocab_size
egs/mls_english/ASR/local/display_manifest_statistics.py (1)
34-46
: Rename helper to reflect return type and simplify usage.
get_parser()
returns parsed args, not a parser. Rename for clarity.Apply this diff:
-def get_parser(): +def get_args(): parser = argparse.ArgumentParser( description=ARGPARSE_DESCRIPTION, formatter_class=argparse.ArgumentDefaultsHelpFormatter, ) @@ - return parser.parse_args() + return parser.parse_args() @@ -def main(): - args = get_parser() +def main(): + args = get_args()egs/reazonspeech/ASR/local/utils/asr_datamodule.py (2)
189-194
: Docstring: includecuts_musan
in Args for completeness.Apply this diff:
Args: cuts_train: CutSet for training. sampler_state_dict: The state dict for the training sampler. + cuts_musan: + Optional CutSet of MUSAN noise to mix in via CutMix when enabled.
362-366
: Fix return type annotation: this returns a single CutSet, not a list.Apply this diff:
- def test_cuts(self) -> List[CutSet]: + def test_cuts(self) -> CutSet: @@ - return load_manifest_lazy( + return load_manifest_lazy( self.args.manifest_dir / "reazonspeech_cuts_test.jsonl.gz" )icefall/utils.py (1)
1394-1399
: Makepad_left
keyword-only to avoid accidental positional misuseThis preserves backward compatibility for existing positional calls that pass only
lengths
or(lengths, max_len)
.-def make_pad_mask( - lengths: torch.Tensor, - max_len: int = 0, - pad_left: bool = False, -) -> torch.Tensor: +def make_pad_mask( + lengths: torch.Tensor, + max_len: int = 0, + *, + pad_left: bool = False, +) -> torch.Tensor:egs/mls_english/ASR/README.md (2)
10-12
: Replace bare URLs with Markdown linksSilences MD034 and improves readability.
-- Dataset: https://huggingface.co/datasets/parler-tts/mls_eng -- Original MLS dataset link: https://www.openslr.org/94 +- Dataset: [Hugging Face – parler-tts/mls_eng](https://huggingface.co/datasets/parler-tts/mls_eng) +- Original MLS dataset: [OpenSLR 94](https://www.openslr.org/94)
16-19
: Remove HTML/break hacks; tighten wordingAvoid backslash/HTML line breaks in Markdown; keep content concise.
-This recipe currently only supports on-the-fly feature bank computation, since `lhotse` manifests and feature banks are not pre-calculated in this recipe. This should mean that the dataset can be streamed from Hugging Face, but we have not tested this yet. We may add a version that supports pre-calculating features to better match existing recipes.\ -<br> - -[./RESULTS.md](./RESULTS.md) contains the latest results. This MLS English recipe was primarily developed for use in the ```multi_ja_en``` Japanese-English bilingual pipeline, which is based on MLS English and ReazonSpeech. +This recipe currently supports only on-the-fly feature computation (no pre-calculated `lhotse` manifests/feature banks yet). Streaming from Hugging Face should work, but is untested. We may add a pre-calculated features option later. + +See [RESULTS.md](./RESULTS.md) for the latest results. This MLS English recipe was primarily developed for the `multi_ja_en` Japanese–English bilingual pipeline (MLS English + ReazonSpeech).egs/multi_ja_en/ASR/RESULTS.md (3)
33-39
: Nit: capitalize ONNX consistentlyMinor wording polish in the heading; no code impact.
56-57
: Use canonical dataset name “Common Voice”Spelling consistency.
-| CommonVoice | 39.87 | +| Common Voice | 39.87 |
71-71
: Replace bare URL with Markdown linkImproves readability and satisfies MD034.
-Pre-trained model can be found here: https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/main +Pre-trained model: [reazon-research/reazonspeech-k2-v2-ja-en](https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/main)egs/reazonspeech/ASR/zipformer/train.py (2)
68-69
: Prefer lazy manifest loading for MUSAN to reduce memory footprint
load_manifest_lazy
avoids eagerly materializing the CutSet. MUSAN isn’t huge, but lazy loading is safer and consistent with other loaders in this repo.-from lhotse import load_manifest +from lhotse import load_manifest_lazy
1223-1235
: Switch MUSAN to lazy loading; keep graceful fallbackReduces peak memory while keeping behavior identical when the manifest is absent.
- if args.enable_musan: - musan_path = Path(args.manifest_dir) / "musan_cuts.jsonl.gz" - if musan_path.exists(): - cuts_musan = load_manifest(musan_path) - logging.info(f"Loaded MUSAN manifest from {musan_path}") - else: - logging.warning( - f"MUSAN manifest not found at {musan_path}, disabling MUSAN augmentation" - ) - cuts_musan = None + if args.enable_musan: + musan_path = Path(args.manifest_dir) / "musan_cuts.jsonl.gz" + if musan_path.exists(): + cuts_musan = load_manifest_lazy(musan_path) + logging.info(f"Loaded MUSAN manifest (lazy) from {musan_path}") + else: + logging.warning( + f"MUSAN manifest not found at {musan_path}, disabling MUSAN augmentation" + ) + cuts_musan = None else: cuts_musan = Noneegs/multi_ja_en/ASR/README.md (4)
3-3
: Fix stray punctuationRemoves the dangling comma/period.
-A bilingual Japanese-English ASR model developed by the developers of ReazonSpeech that utilizes ReazonSpeech and the English subset of Multilingual LibriSpeech (MLS English), . +A bilingual Japanese–English ASR model developed by the ReazonSpeech team that utilizes ReazonSpeech and the English subset of Multilingual LibriSpeech (MLS English).
7-7
: Use Markdown link for MLS EnglishConsistency with other references.
-**Multilingual LibriSpeech (MLS)** is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages - English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. It includes about 44.5K hours of English and a total of about 6K hours for other languages. This icefall training recipe was created for the restructured version of the English split of the dataset available on Hugging Face from `parler-tts` [here](https://huggingface.co/datasets/parler-tts/mls_eng). +**Multilingual LibriSpeech (MLS)** is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages—English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. It includes about 44.5K hours of English and a total of about 6K hours for other languages. This recipe uses the restructured English split on Hugging Face: [`parler-tts/mls_eng`](https://huggingface.co/datasets/parler-tts/mls_eng).
15-20
: Table fixes: header typo and linksCorrect “Datset” → “Dataset” and replace bare URLs with links.
-|Datset| Number of hours| URL| +|Dataset| Number of hours| URL| @@ -|MLS English|44,500|https://huggingface.co/datasets/parler-tts/mls_eng| -|ReazonSpeech (all)|35,000|https://huggingface.co/datasets/reazon-research/reazonspeech| +|MLS English|44,500|[Hugging Face – parler-tts/mls_eng](https://huggingface.co/datasets/parler-tts/mls_eng)| +|ReazonSpeech (all)|35,000|[Hugging Face – reazon-research/reazonspeech](https://huggingface.co/datasets/reazon-research/reazonspeech)|
23-35
: Tighten usage steps; clarify balance noteMinor wording/structure improvements.
-This recipe relies on the `mls_english` recipe and the `reazonspeech` recipe. +This recipe relies on the `mls_english` and `reazonspeech` recipes. @@ -This recipe does not enforce data balance: please ensure that the `mls_english` and `reazonspeech` datasets prepared above are balanced to your liking (you may use the utility script `create_subsets_greedy.py` in the `mls_english` recipe to create a custom-sized MLS English sub-dataset). +This recipe does not enforce dataset balance. If desired, use `mls_english/local/utils/create_subsets_greedy.py` to create an MLS English subset sized to your needs. @@ -0. Run `../../mls_english/ASR/prepare.sh` and `../../reazonspeech/ASR/prepare.sh` +0. Run `../../mls_english/ASR/prepare.sh` and `../../reazonspeech/ASR/prepare.sh` 1. Run `./prepare.sh` -2. Run `update_cutset_paths.py` (we will soon add this to `./prepare.sh`) +2. Run `local/utils/update_cutset_paths.py` (to be integrated into `./prepare.sh`) 3. Run `zipformer/train.py` (see example arguments inside the file)egs/multi_ja_en/ASR/local/prepare_lang_bbpe.py (1)
163-166
: OOV decomposition sanity-check.You emit OOV as ["▁", sp.id_to_piece(sp.unk_id())]. Ensure "▁" exists in the model pieces and downstream handles a two-piece OOV token; otherwise prefer just [sp.id_to_piece(sp.unk_id())].
- lexicon.append((oov, ["▁", sp.id_to_piece(sp.unk_id())])) + unk_piece = sp.id_to_piece(sp.unk_id()) + lexicon.append((oov, [unk_piece]))egs/mls_english/ASR/local/utils/generate_transcript.py (2)
22-23
: Remove unused import.-from typing import Optional
60-64
: Guard against None/newlines and normalize whitespace when writing transcripts.Prevents blank/garbled lines.
- with open(output_file, "w") as f: - for cut in tqdm(cuts, desc="Processing cuts"): - for sup in cut.supervisions: - f.write(f"{sup.text}\n") + with open(output_file, "w") as f: + for cut in tqdm(cuts, desc="Processing cuts"): + for sup in cut.supervisions: + if not sup.text: + continue + txt = " ".join(str(sup.text).split()) + f.write(txt + "\n")egs/mls_english/ASR/local/utils/create_subsets_greedy.py (4)
111-127
: Use enumerate for progress; avoid manual counter.Cleaner and aligns with static analysis hint.
- num_samples_processed = 0 - for original_idx in all_original_indices: + for num_samples_processed, original_idx in enumerate(all_original_indices, 1): @@ - num_samples_processed += 1 - if num_samples_processed % 10000 == 0: # Print progress periodically + if num_samples_processed % 10000 == 0: # Print progress periodically print(f" Processed {num_samples_processed} samples for '{split_name}'. Current duration: {current_duration_ms / (3600*1000):.2f} hours")
136-141
: Avoid mutating Features in-place; cast with a copied mapping.Prevents side effects on the original Features object.
- if "audio" in sampled_split.features and not isinstance(sampled_split.features["audio"], Audio): - sampling_rate = sampled_split.features["audio"].sampling_rate if isinstance(sampled_split.features["audio"], Audio) else 16000 - new_features = sampled_split.features - new_features["audio"] = Audio(sampling_rate=sampling_rate) - sampled_split = sampled_split.cast(new_features) + if "audio" in sampled_split.features and not isinstance(sampled_split.features["audio"], Audio): + sampling_rate = 16000 + new_features = sampled_split.features.copy() + new_features["audio"] = Audio(sampling_rate=sampling_rate) + sampled_split = sampled_split.cast(new_features)
170-175
: Iterate dict directly (no .keys()); skip ‘validation’ cleanly.- for split_name in full_dataset.keys(): + for split_name in full_dataset: if split_name not in ['train', 'dev', 'test', 'validation']: # Ensure 'validation' is not re-copied if not renamed
261-270
: Remove unused f-strings in usage prints.Silences F541 and reduces confusion.
- print(f"\nTo use your new subset dataset, you can load it like this:") - print(f"from datasets import load_dataset") - print(f"import os, glob") - print(f"data_files = {{}}") - print(f"for split_name in ['train', 'dev', 'test']: # Or iterate through actual splits created") - print(f" split_path = os.path.join('{output_subset_data_path}', f'{{split_name}}*.parquet')") - print(f" files = glob.glob(split_path)") - print(f" if files: data_files[split_name] = files") - print(f"subset = load_dataset('parquet', data_files=data_files)") - print(f"print(subset)") + print("\nTo use your new subset dataset, you can load it like this:") + print("from datasets import load_dataset") + print("import os, glob") + print("data_files = {}") + print("for split_name in ['train', 'dev', 'test']: # Or iterate through actual splits created") + print(f" split_path = os.path.join('{output_subset_data_path}', f'{{split_name}}*.parquet')") + print(" files = glob.glob(split_path)") + print(" if files: data_files[split_name] = files") + print("subset = load_dataset('parquet', data_files=data_files)") + print("print(subset)")egs/mls_english/ASR/local/compute_fbank_mls_english.py (1)
117-126
: Minor: argparse defaults and path handling.You gate recomputation on args.manifest_dir done file (good). Also ensure dl_dir is cast to str for load_dataset (already done) and audio_dir/manifest_dir exist.
logging.basicConfig(format=formatter, level=logging.INFO) - if (args.manifest_dir / ".mls-eng-fbank.done").exists(): + args.manifest_dir.mkdir(parents=True, exist_ok=True) + args.audio_dir.mkdir(parents=True, exist_ok=True) + if (args.manifest_dir / ".mls-eng-fbank.done").exists(): @@ - mls_eng_hf_dataset_path = args.dl_dir # "/root/datasets/parler-tts--mls_eng" + mls_eng_hf_dataset_path = args.dl_dir # e.g., "parler-tts/mls_eng" or local path cut_sets = make_cutset_blueprints(mls_eng_hf_dataset_path)egs/multi_ja_en/ASR/local/train_bbpe_model.py (2)
27-31
: Remove unused import
re
is unused here.-import re
102-104
: Atomicity and artifacts: skip/rename behavior
- Skipping solely on
is_file()
may miss existing symlinks. Prefer.exists()
.- SentencePiece produces
.model
and.vocab
; you move only.model
. Confirm.vocab
isn’t needed downstream or move it too.- if output_model.is_file(): + if output_model.exists(): print(f"{output_model} exists - skipping") return @@ - shutil.move(str(temp_model_file), str(output_model)) + shutil.move(str(temp_model_file), str(output_model)) + temp_vocab_file = Path(model_prefix + ".vocab") + if temp_vocab_file.exists(): + shutil.move(str(temp_vocab_file), str(output_model.with_suffix(".vocab")))Also applies to: 133-134
egs/mls_english/ASR/local/utils/asr_datamodule.py (2)
26-36
: Remove unused import
load_manifest
is not used.-from lhotse import CutSet, Fbank, FbankConfig, load_manifest, load_manifest_lazy +from lhotse import CutSet, Fbank, FbankConfig, load_manifest_lazy
346-365
: Avoid lru_cache on instance methods; fix return type of test_cuts
@lru_cache()
on methods can retainself
and leak memory (Ruff B019). Limit cache or switch tocached_property
.test_cuts
returns a singleCutSet
, notList[CutSet]
.- @lru_cache() - def train_cuts(self) -> CutSet: + @lru_cache(maxsize=1) + def train_cuts(self) -> CutSet: @@ - @lru_cache() - def valid_cuts(self) -> CutSet: + @lru_cache(maxsize=1) + def valid_cuts(self) -> CutSet: @@ - @lru_cache() - def test_cuts(self) -> List[CutSet]: + @lru_cache(maxsize=1) + def test_cuts(self) -> CutSet:egs/mls_english/ASR/zipformer/do_not_use_it_directly.py (6)
1052-1061
: Docstring vs thresholds mismatchComment says keep 1–20s, code uses 0.3–30s. Align the comment to code (or vice versa).
- # Keep only utterances with duration between 1 second and 20 seconds + # Keep only utterances with duration between 0.3 and 30.0 seconds
886-899
: GradScaler growth heuristicManually doubling grad scale inside loop is unusual; if intentional, add a brief comment referencing rationale.
1242-1262
: Unreachable code after explicit RuntimeError in main()
main()
raises; the rest ofmain()
is unreachable. If this file is strictly a library, consider removing the code below theraise
to avoid confusion.
1001-1006
: Model averaging dtype
model_avg = copy.deepcopy(model).to(torch.float64)
ups casts all params to float64. Confirm this matches usage in checkpointing and averaging utilities (most recipes keep float32). If unintended, useto(dtype=torch.float32)
.
818-845
: Zeroing grads orderConsider
optimizer.zero_grad(set_to_none=True)
before backward to save memory, as in other recipes; current order is still valid.
1-44
: Usage header references a different recipeUsage points to
pruned_transducer_stateless7_streaming/train.py
. If this is intentional boilerplate, ignore; otherwise update to MLS English path to prevent confusion.egs/mls_english/ASR/RESULTS.md (2)
7-12
: Tighten the table formatting and naming consistency.
- Use a caption or footnote to specify decoding method variants (e.g., “modified_beam_search”) for clarity.
- Keep naming consistent: “MLS English” vs “MLS-English”.
-| Type | Greedy | Beam search | +| Type | Greedy | Beam search (modified_beam_search) |
14-24
: Improve reproducibility details.Consider pinning exact commits/versions for icefall, k2, lhotse, and the BPE model checksum; optionally include the exact MLS-English subset/version.
If available, append a small “Environment” section with commit hashes and package versions.
egs/reazonspeech/ASR/prepare.sh (3)
57-60
: Fix subset comment mismatch and quote paths.Comment says "small-v1" but code downloads "medium"; also quote $dl_dir to avoid word-splitting.
- if [ ! -d $dl_dir/ReazonSpeech/downloads ]; then - # Download small-v1 by default. - lhotse download reazonspeech --subset medium $dl_dir + if [ ! -d "$dl_dir/ReazonSpeech/downloads" ]; then + # Download "medium" subset by default. + lhotse download reazonspeech --subset medium "$dl_dir"
61-68
: Quote musan path and dl_dir to avoid globbing/word-splitting.Safer with spaces and special chars.
- if [ ! -d $dl_dir/musan ]; then - lhotse download musan $dl_dir + if [ ! -d "$dl_dir/musan" ]; then + lhotse download musan "$dl_dir"
104-111
: Run Musan FBANK script via python for portability.Avoids relying on execute bits in git.
- ./local/compute_fbank_musan.py + python local/compute_fbank_musan.pyegs/reazonspeech/ASR/local/compute_fbank_musan.py (1)
64-66
: Guard against os.cpu_count() returning None.- num_jobs = min(15, os.cpu_count()) + cpu_cnt = os.cpu_count() or 1 + num_jobs = min(15, cpu_cnt)egs/multi_ja_en/ASR/zipformer/multi_dataset.py (2)
3-3
: Remove unused import.lru_cache is not used after refactor.
-from functools import lru_cache
27-47
: Optional: cache CutSet mux results to avoid repeated manifest scans.If these methods are called multiple times, consider caching (e.g., cached_property) to avoid repeated lazy-load counts.
egs/mls_english/ASR/prepare.sh (3)
66-71
: Typo and quoting fix in Musan guidance.- # If you ha`ve predownloaded it to /path/to/musan, + # If you have predownloaded it to /path/to/musan, @@ - if [ ! -d $dl_dir/musan ] ; then + if [ ! -d "$dl_dir/musan" ] ; then @@ - lhotse download musan $dl_dir + lhotse download musan "$dl_dir"
103-107
: Run Musan FBANK script via python for portability.- ./local/compute_fbank_musan.py + python local/compute_fbank_musan.py
79-89
: Stage 1 sequencing is fine; consider chaining validations with '&&' for clearer failure propagation.Not required (set -e already covers it), but improves readability.
- python local/compute_fbank_mls_english.py \ + python local/compute_fbank_mls_english.py \ --manifest-dir data/manifests \ --audio-dir data/audio \ - --dl-dir $dl_dir/mls_english - # --dl-dir /root/datasets/parler-tts--mls_eng - python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_train.jsonl.gz - python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_dev.jsonl.gz - python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_test.jsonl.gz - touch data/manifests/.mls_english-validated.done + --dl-dir "$dl_dir/mls_english" && \ + python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_train.jsonl.gz && \ + python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_dev.jsonl.gz && \ + python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_test.jsonl.gz && \ + touch data/manifests/.mls_english-validated.doneegs/multi_ja_en/ASR/local/utils/asr_datamodule.py (4)
25-25
: Remove unused import (ruff F401).
torch
is not used.- import torch
21-21
: Remove unused import.
lru_cache
is not used in this file.-from functools import lru_cache
205-213
: Prefer lazy loading for MUSAN to reduce memory footprint.Use
load_manifest_lazy
instead ofload_manifest
.- cuts_musan = load_manifest( - self.args.manifest_dir / "musan/musan_cuts.jsonl.gz" - ) + cuts_musan = load_manifest_lazy( + self.args.manifest_dir / "musan/musan_cuts.jsonl.gz" + )
281-291
: Sampler buffers are hard-coded; consider exposing as CLI or deriving from dataset size.Current
buffer_size
andshuffle_buffer_size
may be suboptimal across datasets.group.add_argument( "--num-buckets", type=int, default=30, help="The number of buckets for the DynamicBucketingSampler" "(you might want to increase it for larger datasets).", ) + group.add_argument( + "--sampler-buffer-size", + type=int, + default=0, + help="Override DynamicBucketingSampler buffer_size; 0 uses num_buckets*2000.", + ) + group.add_argument( + "--sampler-shuffle-buffer-size", + type=int, + default=0, + help="Override DynamicBucketingSampler shuffle_buffer_size; 0 uses num_buckets*5000.", + )train_sampler = DynamicBucketingSampler( cuts_train, max_duration=self.args.max_duration, shuffle=self.args.shuffle, num_buckets=self.args.num_buckets, - buffer_size=self.args.num_buckets * 2000, - shuffle_buffer_size=self.args.num_buckets * 5000, + buffer_size=( + self.args.sampler_buffer_size + if self.args.sampler_buffer_size > 0 + else self.args.num_buckets * 2000 + ), + shuffle_buffer_size=( + self.args.sampler_shuffle_buffer_size + if self.args.sampler_shuffle_buffer_size > 0 + else self.args.num_buckets * 5000 + ), drop_last=self.args.drop_last, )egs/multi_ja_en/ASR/local/utils/update_cutset_paths.py (1)
143-151
: Safer overwrite of symlink target.Use
unlink(missing_ok=True)
and ensure parent dir exists.- os.unlink(symlink_path) + symlink_path.parent.mkdir(parents=True, exist_ok=True) + try: + os.unlink(symlink_path) + except FileNotFoundError: + pass updated_cuts.to_file(symlink_path)egs/multi_ja_en/ASR/prepare.sh (1)
181-185
: Ensure target dir exists and quote path when symlinking validator.Also quote expansion (SC2046).
- ln -svfr $(realpath ../../multi_zh_en/ASR/local/validate_bpe_lexicon.py) local/ + mkdir -p local + ln -svfr "$(realpath ../../multi_zh_en/ASR/local/validate_bpe_lexicon.py)" local/egs/mls_english/ASR/zipformer/decode.py (2)
138-145
: Remove unused import (ruff F401).
make_pad_mask
is unused.from icefall.utils import ( AttributeDict, - make_pad_mask, setup_logger, store_transcripts, str2bool, write_error_stats, )
1030-1036
: Use context manager to read context file (SIM115).Also simplify list construction.
- if os.path.exists(params.context_file): - contexts = [] - for line in open(params.context_file).readlines(): - contexts.append(line.strip()) + if os.path.exists(params.context_file): + with open(params.context_file, "r", encoding="utf-8") as f: + contexts = [line.strip() for line in f] context_graph = ContextGraph(params.context_score) context_graph.build(sp.encode(contexts))egs/mls_english/ASR/zipformer/tokenizer.py (2)
171-190
: Avoid accessing private members of SymbolTable.Prefer public API to build maps.
- self._id2sym = token_table._id2sym - self._sym2id = token_table._sym2id + self._id2sym = {v: k for k, v in token_table.symbol_to_id.items()} + self._sym2id = dict(token_table.symbol_to_id)
232-253
: Keep tests out of production module entrypoint.Consider moving
test_CharTokenizer()
into a proper test file.-if __name__ == "__main__": - test_CharTokenizer() +if __name__ == "__main__": + # For ad-hoc debugging only + test_CharTokenizer()egs/mls_english/ASR/zipformer/streaming_decode.py (5)
25-31
: Remove unused imports.
os
,pdb
, andsubprocess
are unused; drop them to keep the module clean.import argparse import logging import math -import os -import pdb -import subprocess as sp from pathlib import Path from typing import Dict, List, Optional, Tuple
19-23
: Fix CLI usage string.Missing space after script path.
-./zipformer/streaming_decode.py--epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192 +./zipformer/streaming_decode.py --epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192
198-203
: Avoid function call in default argument (B008).Set the default to None and initialize inside the function.
-def get_init_states( - model: nn.Module, - batch_size: int = 1, - device: torch.device = torch.device("cpu"), -) -> List[torch.Tensor]: +def get_init_states( + model: nn.Module, + batch_size: int = 1, + device: Optional[torch.device] = None, +) -> List[torch.Tensor]: @@ - states = model.encoder.get_init_states(batch_size, device) + if device is None: + device = torch.device("cpu") + states = model.encoder.get_init_states(batch_size, device)
186-191
: Reduce default parallel streams to avoid excessive memory.2000 concurrent streams can blow up memory; 64 is a safer default for most setups.
- parser.add_argument( + parser.add_argument( "--num-decode-streams", type=int, - default=2000, + default=64, help="The number of streams that can be decoded parallel.", )
756-763
: Tokenizer: prefer explicit lang_dir usage or remove unused--bpe-model
.You already add Tokenizer args; consider removing
--bpe-model
from this script, or use it consistently. Also verifyparams.lang
/params.lang_type
exist from Tokenizer.add_arguments (they do).egs/multi_ja_en/ASR/zipformer/train.py (2)
1204-1211
: Minor: yoda condition.Prefer
len(tokens) > T
for readability (ruff SIM300).- if T < len(tokens): + if len(tokens) > T:
1222-1232
: Pre-BPE Japanese char tokenization may conflict with BPE training.You apply
tokenize_by_ja_char
then rely on SentencePiece for targets. This could distort English texts and duplicate segmentation work. Either:
- apply this only for Japanese lines; or
- drop it and normalize text elsewhere before SP encoding.
Example conditional application (pseudo-diff; adjust placement as needed):
- def tokenize_and_encode_text(c: Cut): + def tokenize_and_encode_text(c: Cut): # Text normalize for each sample text = c.supervisions[0].text - text = byte_encode(tokenize_by_ja_char(text)) + # Apply JA char tokenization only if text contains Japanese scripts + if any("\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff" for ch in text): + text = byte_encode(tokenize_by_ja_char(text)) c.supervisions[0].text = text return cegs/mls_english/ASR/zipformer/train.py (4)
1181-1217
: Silence overly verbose duration logs.Commented-out warnings suggest prior noise; keep logs at
debug
to avoid log spam during filtering.- if c.duration < 1.0 or c.duration > 30.0: - # logging.warning( - # f"Exclude cut with ID {c.id} from training. Duration: {c.duration}" - # ) + if c.duration < 1.0 or c.duration > 30.0: + logging.debug( + f"Exclude cut with ID {c.id} from training. Duration: {c.duration}" + ) return False
862-869
: Rename unused loop variable in validation.
batch_idx
isn’t used; use_
to reflect intent (ruff B007).- for batch_idx, batch in enumerate(valid_dl): + for _, batch in enumerate(valid_dl):
702-705
: Simplify nested if.Combine nested
if
checks when settingstart_epoch
from checkpoints.- if params.start_batch > 0: - if "cur_epoch" in saved_params: - params["start_epoch"] = saved_params["cur_epoch"] + if params.start_batch > 0 and "cur_epoch" in saved_params: + params["start_epoch"] = saved_params["cur_epoch"]
323-336
: Arg redundancy:--lang-dir
vs Tokenizer.add_arguments.You add a custom
--lang-dir
and alsoTokenizer.add_arguments
(which adds--lang
,--lang-type
). Consider using only the Tokenizer interface or harmonize both to avoid confusion.Also applies to: 1382-1385
model_file = Path(model_prefix + ".model") | ||
if not model_file.is_file(): | ||
spm.SentencePieceTrainer.train( | ||
input=train_text, | ||
vocab_size=vocab_size, | ||
model_type=model_type, | ||
model_prefix=model_prefix, | ||
input_sentence_size=input_sentence_size, | ||
character_coverage=args.character_coverage, | ||
user_defined_symbols=user_defined_symbols, | ||
byte_fallback=args.byte_fallback, | ||
unk_id=unk_id, | ||
bos_id=-1, | ||
eos_id=-1, | ||
) | ||
else: | ||
print(f"{model_file} exists - skipping") | ||
return | ||
|
||
shutil.copyfile(model_file, f"{lang_dir}/bpe.model") | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Always sync bpe_.model to bpe.model, even when training is skipped.
Currently, if the versioned model exists, you return early without copying, leaving bpe.model stale/missing.
- if not model_file.is_file():
+ if not model_file.is_file():
spm.SentencePieceTrainer.train(
input=train_text,
@@
)
- else:
- print(f"{model_file} exists - skipping")
- return
-
- shutil.copyfile(model_file, f"{lang_dir}/bpe.model")
+ else:
+ print(f"{model_file} exists - skipping training")
+ # Ensure canonical symlink/copy is updated
+ shutil.copyfile(model_file, f"{lang_dir}/bpe.model")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
model_file = Path(model_prefix + ".model") | |
if not model_file.is_file(): | |
spm.SentencePieceTrainer.train( | |
input=train_text, | |
vocab_size=vocab_size, | |
model_type=model_type, | |
model_prefix=model_prefix, | |
input_sentence_size=input_sentence_size, | |
character_coverage=args.character_coverage, | |
user_defined_symbols=user_defined_symbols, | |
byte_fallback=args.byte_fallback, | |
unk_id=unk_id, | |
bos_id=-1, | |
eos_id=-1, | |
) | |
else: | |
print(f"{model_file} exists - skipping") | |
return | |
shutil.copyfile(model_file, f"{lang_dir}/bpe.model") | |
model_file = Path(model_prefix + ".model") | |
if not model_file.is_file(): | |
spm.SentencePieceTrainer.train( | |
input=train_text, | |
vocab_size=vocab_size, | |
model_type=model_type, | |
model_prefix=model_prefix, | |
input_sentence_size=input_sentence_size, | |
character_coverage=args.character_coverage, | |
user_defined_symbols=user_defined_symbols, | |
byte_fallback=args.byte_fallback, | |
unk_id=unk_id, | |
bos_id=-1, | |
eos_id=-1, | |
) | |
else: | |
print(f"{model_file} exists - skipping training") | |
# Ensure canonical symlink/copy is updated | |
shutil.copyfile(model_file, f"{lang_dir}/bpe.model") |
🤖 Prompt for AI Agents
In egs/mls_english/ASR/local/train_bpe_model.py around lines 91 to 112, the
function returns early when the versioned model file exists and therefore never
syncs model_prefix+".model" to "{lang_dir}/bpe.model"; change the control flow
so the shutil.copyfile(model_file, f"{lang_dir}/bpe.model") is executed
regardless of whether training ran or was skipped (e.g., remove the early return
and only skip training but still perform the copy, or move the copy call out of
the if/else block after the training check), ensuring model_file points to the
correct Path before copying and handling/reporting any copy errors.
transforms = [] | ||
if cuts_musan is not None: | ||
logging.info("Enable MUSAN") | ||
transforms.append( | ||
CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True) | ||
) | ||
else: | ||
logging.info("Disable MUSAN") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Honor the CLI flag when enabling MUSAN and warn if requested but unavailable.
Currently MUSAN is applied whenever cuts_musan
is passed, ignoring --enable-musan
. Gate on both for predictable behavior, and warn if enabled but missing data.
Apply this diff:
- if cuts_musan is not None:
- logging.info("Enable MUSAN")
- transforms.append(
- CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True)
- )
- else:
- logging.info("Disable MUSAN")
+ if self.args.enable_musan:
+ if cuts_musan is not None:
+ logging.info("Enable MUSAN")
+ transforms.append(
+ CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True)
+ )
+ else:
+ logging.warning("MUSAN requested but no cuts provided; disabling MUSAN.")
+ else:
+ logging.info("Disable MUSAN")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
transforms = [] | |
if cuts_musan is not None: | |
logging.info("Enable MUSAN") | |
transforms.append( | |
CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True) | |
) | |
else: | |
logging.info("Disable MUSAN") | |
transforms = [] | |
if self.args.enable_musan: | |
if cuts_musan is not None: | |
logging.info("Enable MUSAN") | |
transforms.append( | |
CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True) | |
) | |
else: | |
logging.warning("MUSAN requested but no cuts provided; disabling MUSAN.") | |
else: | |
logging.info("Disable MUSAN") |
🤖 Prompt for AI Agents
In egs/reazonspeech/ASR/local/utils/asr_datamodule.py around lines 196 to 204,
the code currently enables MUSAN whenever cuts_musan is provided regardless of
the --enable-musan CLI flag; change the logic to only append the CutMix
transform when BOTH cuts_musan is not None and enable_musan is truthy, log.info
when MUSAN is enabled, log.info when MUSAN is explicitly disabled via the flag,
and emit a logging.warning if enable_musan is True but cuts_musan is None to
warn the user that MUSAN was requested but data is unavailable.
@@ -0,0 +1 @@ | |||
Subproject commit 157dc62a6439f03de1fcccb4914fc0d30b6f21b8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Submodule pointer update: verify gitlink, .gitmodules entry, CI/docs, and license.
- Ensure this SHA is reachable and intentional (ideally tagged for traceability).
- Confirm .gitmodules contains the correct URL/path for this submodule.
- Make sure CI and READMEs instruct users to init/update submodules; otherwise Musan-dependent stages will fail.
- Verify the submodule repo’s license is compatible and referenced in the top-level NOTICE/README.
Run:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
path="egs/reazonspeech/ASR/musan-k2-v2-reazonspeech-medium"
sha_expected="157dc62a6439f03de1fcccb4914fc0d30b6f21b8"
echo "== gitlink at HEAD =="
git ls-tree HEAD "$path" || true
echo "== verify SHA =="
if git ls-tree HEAD "$path" | awk '{print $3}' | grep -q "$sha_expected"; then
echo "OK: gitlink matches expected SHA"
else
echo "Mismatch: gitlink SHA does not match $sha_expected"; exit 1
fi
echo "== .gitmodules entries =="
if [ -f .gitmodules ]; then
git config -f .gitmodules --get-regexp 'submodule\..*' || true
echo "Filtered for path:"
rg -nP "(?s)\\[submodule\\s+\".*\"\\].*?\\n\\s*path\\s*=\\s*$path\\b" .gitmodules -n -C2 || true
else
echo "No .gitmodules found"; exit 1
fi
echo "== READMEs mention submodule init? =="
rg -n -S 'git submodule (update|init)' --glob 'egs/**/ASR/**/README*' || true
echo "== prepare scripts mention submodule? =="
rg -n -S 'git submodule (update|init)' --glob 'egs/**/ASR/**/prepare.sh' || true
echo "== license file in submodule (if initialized) =="
if [ -d "$path/.git" ] || [ -f "$path/.git" ]; then
fd -HI -a -t f -d 2 '(LICENSE|COPYING|NOTICE)(\..*)?$' "$path" || true
else
echo "Submodule not initialized; skip license check"
fi
Length of output: 774
Add missing .gitmodules entry and update submodule docs
- The
.gitmodules
file is currently absent; add a section for
egs/reazonspeech/ASR/musan-k2-v2-reazonspeech-medium
pointing to the correct repo URL at commit 157dc62a6439f03de1fcccb4914fc0d30b6f21b8 - Update CI configs and any
egs/reazonspeech/ASR/README*
orprepare.sh
to includegit submodule init --recursive
andgit submodule update --init --recursive
- After initializing the submodule, verify its LICENSE (or NOTICE) is compatible and reference it in the top-level NOTICE/README
🤖 Prompt for AI Agents
In egs/reazonspeech/ASR/musan-k2-v2-reazonspeech-medium around lines 1-1, add a
.gitmodules entry mapping this path to the repository URL at commit
157dc62a6439f03de1fcccb4914fc0d30b6f21b8; then update CI configs and any
egs/reazonspeech/ASR/README* and prepare.sh to run git submodule init
--recursive and git submodule update --init --recursive (or equivalent pipeline
steps) so the submodule is cloned and populated during CI and local setup;
finally, after initializing the submodule locally in your branch, inspect its
LICENSE/NOTICE for compatibility and add a reference to that license in the
top-level NOTICE/README as required.
if args.enable_musan: | ||
musan_path = Path(args.manifest_dir) / "musan_cuts.jsonl.gz" | ||
if musan_path.exists(): | ||
cuts_musan = load_manifest(musan_path) | ||
logging.info(f"Loaded MUSAN manifest from {musan_path}") | ||
else: | ||
logging.warning( | ||
f"MUSAN manifest not found at {musan_path}, disabling MUSAN augmentation" | ||
) | ||
cuts_musan = None | ||
else: | ||
cuts_musan = None | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Verify CLI flag is plumbed: --enable_musan
Ensure ReazonSpeechAsrDataModule.add_arguments
(or parser) defines --enable_musan
with a sensible default and that prepare.sh
documents generating musan_cuts.jsonl.gz
.
🏁 Script executed:
#!/bin/bash
# Verify the flag and where it's consumed.
rg -nC2 -g 'egs/reazonspeech/**' -P '\benable_musan\b'
# Verify MUSAN manifest generation is documented or scripted.
rg -nC2 -g 'egs/reazonspeech/**' -P 'musan_cuts\.jsonl\.gz|MUSAN'
Length of output: 3111
#!/bin/bash
# Locate prepare.sh in the egs/reazonspeech directory
find egs/reazonspeech -type f -name prepare.sh -print
#!/bin/bash
# List top-level files in egs/reazonspeech to find any prepare or doc scripts
find egs/reazonspeech -maxdepth 2 -type f
# Search for any README or documentation files under egs/reazonspeech
find egs/reazonspeech -type f -iname "*readme*" -o -iname "*.md"
🏁 Script executed:
#!/bin/bash
# List all shell scripts in the egs/reazonspeech directory
find egs/reazonspeech -type f -name "*.sh"
Length of output: 76
🏁 Script executed:
#!/bin/bash
sed -n '1,200p' egs/reazonspeech/ASR/prepare.sh
Length of output: 3949
🏁 Script executed:
#!/bin/bash
# Display beginning of compute_fbank_musan.py to confirm manifest creation
sed -n '1,200p' egs/reazonspeech/ASR/local/compute_fbank_musan.py
Length of output: 4444
Document musan_cuts.jsonl.gz generation in prepare.sh
--enable_musan
is already defined inReazonSpeechAsrDataModule
(default False).- In
egs/reazonspeech/ASR/prepare.sh
, Stage 4 invokescompute_fbank_musan.py
but doesn’t explicitly mention the output. Add a comment like
# This generates data/manifests/musan_cuts.jsonl.gz
before running./local/compute_fbank_musan.py
.
🤖 Prompt for AI Agents
In egs/reazonspeech/ASR/zipformer/train.py around lines 1223 to 1235, the code
expects a musan_cuts.jsonl.gz manifest when --enable_musan is used but the
project's prepare script doesn't document that file's generation; update
egs/reazonspeech/ASR/prepare.sh (Stage 4) to add an inline comment immediately
before invoking ./local/compute_fbank_musan.py stating that this step generates
data/manifests/musan_cuts.jsonl.gz (e.g. "# This generates
data/manifests/musan_cuts.jsonl.gz"), so users running preparation know where
the musan manifest comes from and the train.py check will succeed when
--enable_musan is set.
max_len = max(max_len, lengths.max()) | ||
n = lengths.size(0) | ||
seq_range = torch.arange(0, max_len, device=lengths.device) | ||
expaned_lengths = seq_range.unsqueeze(0).expand(n, max_len) | ||
expanded_lengths = seq_range.unsqueeze(0).expand(n, max_len) | ||
|
||
if pad_left: | ||
mask = expanded_lengths < (max_len - lengths).unsqueeze(1) | ||
else: | ||
mask = expanded_lengths >= lengths.unsqueeze(-1) | ||
|
||
return expaned_lengths >= lengths.unsqueeze(-1) | ||
return mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix potential int/Tensor mixing in max_len
computation
max(max_len, lengths.max())
can raise a TypeError (int vs Tensor). Cast to Python ints to make it torch/version-agnostic. Also keep return path unchanged.
- max_len = max(max_len, lengths.max())
+ max_len = max(int(max_len), int(lengths.max().item()))
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
max_len = max(max_len, lengths.max()) | |
n = lengths.size(0) | |
seq_range = torch.arange(0, max_len, device=lengths.device) | |
expaned_lengths = seq_range.unsqueeze(0).expand(n, max_len) | |
expanded_lengths = seq_range.unsqueeze(0).expand(n, max_len) | |
if pad_left: | |
mask = expanded_lengths < (max_len - lengths).unsqueeze(1) | |
else: | |
mask = expanded_lengths >= lengths.unsqueeze(-1) | |
return expaned_lengths >= lengths.unsqueeze(-1) | |
return mask | |
# ensure both sides are Python ints to avoid int/Tensor comparison issues | |
max_len = max(int(max_len), int(lengths.max().item())) | |
n = lengths.size(0) | |
seq_range = torch.arange(0, max_len, device=lengths.device) | |
expanded_lengths = seq_range.unsqueeze(0).expand(n, max_len) | |
if pad_left: | |
mask = expanded_lengths < (max_len - lengths).unsqueeze(1) | |
else: | |
mask = expanded_lengths >= lengths.unsqueeze(-1) | |
return mask |
🤖 Prompt for AI Agents
In icefall/utils.py around lines 1421 to 1431, the computation max_len =
max(max_len, lengths.max()) mixes a Python int and a torch Tensor which can
raise a TypeError; fix it by converting the tensor result to a Python int (e.g.,
lengths.max().item() or int(lengths.max().item())) before calling max so max_len
stays a plain int, then proceed with seq_range/expanded_lengths and return mask
unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review continued from previous batch...
from asr_datamodule import ReazonSpeechAsrDataModule | ||
from decode_stream import DecodeStream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use MLS English datamodule instead of ReazonSpeech here.
This script is under MLS English recipe but wires ReazonSpeech datamodule. Replace with MLSEnglishHFAsrDataModule throughout.
-from asr_datamodule import ReazonSpeechAsrDataModule
+from asr_datamodule import MLSEnglishHFAsrDataModule
@@
def main():
parser = get_parser()
- ReazonSpeechAsrDataModule.add_arguments(parser)
+ MLSEnglishHFAsrDataModule.add_arguments(parser)
Tokenizer.add_arguments(parser)
@@
- reazonspeech_corpus = ReazonSpeechAsrDataModule(args)
-
- valid_cuts = reazonspeech_corpus.valid_cuts()
- test_cuts = reazonspeech_corpus.test_cuts()
+ mls_english_corpus = MLSEnglishHFAsrDataModule(args)
+
+ valid_cuts = mls_english_corpus.valid_cuts()
+ test_cuts = mls_english_corpus.test_cuts()
Also applies to: 712-717, 856-865
🤖 Prompt for AI Agents
In egs/mls_english/ASR/zipformer/streaming_decode.py around lines 37-38 (and
also update occurrences at 712-717 and 856-865), the file imports and uses
ReazonSpeechAsrDataModule but should use the MLS English datamodule; replace the
import from "from asr_datamodule import ReazonSpeechAsrDataModule" with the
correct MLS English import "from ml_eng_hf_datamodule import
MLSEnglishHFAsrDataModule" (or the actual module path used in this repo), then
update all instantiations, type hints, and references from
ReazonSpeechAsrDataModule to MLSEnglishHFAsrDataModule across the noted line
ranges so the script consistently constructs and uses the
MLSEnglishHFAsrDataModule instance. Ensure any configuration keys or constructor
arguments match the MLSEnglishHFAsrDataModule API and adjust variable names if
needed.
finished_streams = [] | ||
for i in range(len(decode_streams)): | ||
decode_streams[i].states = states[i] | ||
decode_streams[i].done_frames += encoder_out_lens[i] | ||
# if decode_streams[i].done: | ||
# finished_streams.append(i) | ||
finished_streams.append(i) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All streams are marked finished; plus int/ Tensor add bug.
Only finalize streams that are truly done. Also convert encoder_out_lens[i]
to Python int when accumulating done_frames
.
finished_streams = []
for i in range(len(decode_streams)):
decode_streams[i].states = states[i]
- decode_streams[i].done_frames += encoder_out_lens[i]
- # if decode_streams[i].done:
- # finished_streams.append(i)
- finished_streams.append(i)
+ decode_streams[i].done_frames += int(encoder_out_lens[i].item())
+ if decode_streams[i].done:
+ finished_streams.append(i)
🤖 Prompt for AI Agents
In egs/mls_english/ASR/zipformer/streaming_decode.py around lines 515 to 522,
fix two issues: only mark streams as finished when they are actually done, and
ensure encoder_out_lens[i] is converted to a Python int before adding to
done_frames. Replace the unconditional finished_streams.append(i) with a
conditional that appends i only if decode_streams[i].done is truthy, and change
the done_frames increment to use an integer conversion like done_frames +=
int(encoder_out_lens[i]) (or encoder_out_lens[i].item() if it's a torch tensor).
test_set_wers = sorted(test_set_wers.items(), key=lambda x: x[1]) | ||
errs_info = ( | ||
params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt" | ||
) | ||
with open(errs_info, "w") as f: | ||
print("settings\tWER", file=f) | ||
for key, val in test_set_wers: | ||
print("{}\t{}".format(key, val), file=f) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary WER filename uses an out-of-scope/ambiguous key.
Don’t include key
(last-loop variable) in the summary filename.
- errs_info = (
- params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt"
- )
+ errs_info = params.res_dir / f"wer-summary-{test_set_name}-{params.suffix}.txt"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
test_set_wers = sorted(test_set_wers.items(), key=lambda x: x[1]) | |
errs_info = ( | |
params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt" | |
) | |
with open(errs_info, "w") as f: | |
print("settings\tWER", file=f) | |
for key, val in test_set_wers: | |
print("{}\t{}".format(key, val), file=f) | |
test_set_wers = sorted(test_set_wers.items(), key=lambda x: x[1]) | |
- errs_info = ( | |
- params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt" | |
errs_info = params.res_dir / f"wer-summary-{test_set_name}-{params.suffix}.txt" | |
with open(errs_info, "w") as f: | |
print("settings\tWER", file=f) | |
for key, val in test_set_wers: | |
print("{}\t{}".format(key, val), file=f) |
🤖 Prompt for AI Agents
In egs/mls_english/ASR/zipformer/streaming_decode.py around lines 695 to 703,
the summary WER filename is constructed using the loop variable `key` which is
shadowed by the later for-loop and therefore ambiguous/out-of-scope; change the
sorting and loop to use distinct variable names (e.g., rename sorted items to
`sorted_wers` or use `k, v` in the loop) and use the correct intended identifier
for the filename (for example `method_key`, `summary_key`, or the outer scoped
name such as `test_set_name` or `params.suffix`) so the filename no longer
depends on the loop variable; update the open and print loop to iterate over the
renamed variables accordingly.
mls_english_corpus = MLSEnglishHFAsrDataModule(args) | ||
train_cuts = mls_english_corpus.train_cuts() | ||
# mls_english_corpus.load_dataset(args.dataset_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Train-time filtering is defined but not used.
Apply remove_short_and_long_utt
to train_cuts
.
mls_english_corpus = MLSEnglishHFAsrDataModule(args)
train_cuts = mls_english_corpus.train_cuts()
+ train_cuts = train_cuts.filter(remove_short_and_long_utt)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
mls_english_corpus = MLSEnglishHFAsrDataModule(args) | |
train_cuts = mls_english_corpus.train_cuts() | |
# mls_english_corpus.load_dataset(args.dataset_path) | |
mls_english_corpus = MLSEnglishHFAsrDataModule(args) | |
train_cuts = mls_english_corpus.train_cuts() | |
train_cuts = train_cuts.filter(remove_short_and_long_utt) | |
# mls_english_corpus.load_dataset(args.dataset_path) |
🤖 Prompt for AI Agents
egs/mls_english/ASR/zipformer/train.py around lines 1218-1220: train-time
filtering function remove_short_and_long_utt is defined but never applied to the
training cuts; call the filter on train_cuts and reassign (e.g., train_cuts =
remove_short_and_long_utt(train_cuts) or, if the function accepts duration
bounds, pass the configured values from args like train_cuts =
remove_short_and_long_utt(train_cuts, args.min_duration, args.max_duration)) so
the training set respects the min/max utterance length.
if args.enable_musan: | ||
musan_path = Path(args.manifest_dir) / "musan_cuts.jsonl.gz" | ||
if musan_path.exists(): | ||
cuts_musan = load_manifest(musan_path) | ||
logging.info(f"Loaded MUSAN manifest from {musan_path}") | ||
else: | ||
logging.warning(f"MUSAN manifest not found at {musan_path}, disabling MUSAN augmentation") | ||
cuts_musan = None | ||
else: | ||
cuts_musan = None | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MUSAN manifest is loaded but never passed to dataloader.
Wire cuts_musan
into train_dataloaders
so augmentation actually happens.
- train_dl = mls_english_corpus.train_dataloaders(
- train_cuts, sampler_state_dict=sampler_state_dict
- )
+ train_dl = mls_english_corpus.train_dataloaders(
+ train_cuts, sampler_state_dict=sampler_state_dict, cuts_musan=cuts_musan
+ )
Also applies to: 1240-1245
🧰 Tools
🪛 Ruff (0.12.2)
1238-1238: Local variable cuts_musan
is assigned to but never used
Remove assignment to unused variable cuts_musan
(F841)
🤖 Prompt for AI Agents
In egs/mls_english/ASR/zipformer/train.py around lines 1229-1239 (and similarly
1240-1245), cuts_musan is loaded but never forwarded into the dataloader
creation; pass cuts_musan into the train_dataloaders call (e.g., as a keyword
argument cuts_musan=cuts_musan) where train/validation dataloaders are
constructed, and if train_dataloaders (or its upstream helpers) don't accept
this parameter, update their signature to accept and forward cuts_musan into the
augmentation/data pipeline so the MUSAN augmentation is actually applied.
def remove_short_and_long_utt(c: Cut): | ||
# Keep only utterances with duration between 1 second and 20 seconds | ||
# | ||
# Caution: There is a reason to select 20.0 here. Please see | ||
# ../local/display_manifest_statistics.py | ||
|
||
# Keep only utterances greater than 1 second | ||
# | ||
# You should use ../local/display_manifest_statistics.py to get | ||
# an utterance duration distribution for your dataset to select | ||
# the threshold | ||
# if c.duration < 1.0 or c.duration > 30.0: | ||
# logging.warning( | ||
# f"Exclude cut with ID {c.id} from training. Duration: {c.duration}" | ||
# ) | ||
# return False | ||
# the threshold as this is dependent on which datasets you choose | ||
if c.duration < 1.0: | ||
logging.warning( | ||
f"Exclude cut with ID {c.id} from training. Duration: {c.duration}" | ||
) | ||
return False | ||
|
||
# In pruned RNN-T, we require that T >= S |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Reintroduce an upper duration cap to prevent OOM/outliers.
Earlier recipes capped at 30s; current filter only enforces ≥1s.
- # Keep only utterances greater than 1 second
+ # Keep only utterances within a safe duration range
@@
- if c.duration < 1.0:
+ if c.duration < 1.0 or c.duration > 30.0:
logging.warning(
f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"
)
return False
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def remove_short_and_long_utt(c: Cut): | |
# Keep only utterances with duration between 1 second and 20 seconds | |
# | |
# Caution: There is a reason to select 20.0 here. Please see | |
# ../local/display_manifest_statistics.py | |
# Keep only utterances greater than 1 second | |
# | |
# You should use ../local/display_manifest_statistics.py to get | |
# an utterance duration distribution for your dataset to select | |
# the threshold | |
# if c.duration < 1.0 or c.duration > 30.0: | |
# logging.warning( | |
# f"Exclude cut with ID {c.id} from training. Duration: {c.duration}" | |
# ) | |
# return False | |
# the threshold as this is dependent on which datasets you choose | |
if c.duration < 1.0: | |
logging.warning( | |
f"Exclude cut with ID {c.id} from training. Duration: {c.duration}" | |
) | |
return False | |
# In pruned RNN-T, we require that T >= S | |
def remove_short_and_long_utt(c: Cut): | |
# Keep only utterances within a safe duration range | |
# | |
# You should use ../local/display_manifest_statistics.py to get | |
# an utterance duration distribution for your dataset to select | |
# the threshold as this is dependent on which datasets you choose | |
if c.duration < 1.0 or c.duration > 30.0: | |
logging.warning( | |
f"Exclude cut with ID {c.id} from training. Duration: {c.duration}" | |
) | |
return False | |
# In pruned RNN-T, we require that T >= S |
🤖 Prompt for AI Agents
In egs/multi_ja_en/ASR/zipformer/train.py around lines 1187 to 1200, the cut
filter enforces a minimum duration but lacks an upper duration cap which can
allow outlier long utterances to cause OOM; add a configurable maximum duration
check (e.g., MAX_UTT_DURATION = 30.0) and return False for cuts with c.duration
> MAX_UTT_DURATION while logging a warning with the cut id and duration; place
the constant near other config values or at module top and reference it in the
if-statement so the threshold is easy to adjust based on dataset statistics.
Not sure if adding a language tag prompt to the decoder/predictor can alleviate the issue of imbalanced multilingual data; it would be somewhat similar to the prompting approach Whisper uses for multi-language support. |
@pingfengluo I see, this could work well to increase the accuracy of prediction for one-language clips! But at the same time, I worry it could make it harder for the model to accurately code-switch when predicting bilingual audio without further explicit fine-tuning on labeled code-switching data. If you think the benefit would be large enough though, perhaps we can modify the code to add it. |
@JinZr Can you have a look at this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
egs/mls_english/ASR/local/compute_fbank_mls_english.py (1)
119-119
: Guard against os.cpu_count() returning None.- num_jobs = min(16, os.cpu_count()) + num_jobs = min(16, os.cpu_count() or 1)
♻️ Duplicate comments (3)
egs/mls_english/ASR/local/utils/download_mls_english.py (2)
8-13
: Parameterize HF repo and pin a revision for reproducibility (and align with future mirrors).Hard-coding the repo makes runs brittle. Expose repo_id/revision/token with sensible defaults.
@@ -import argparse -import os -import sys +import argparse +import os +import sys +from typing import Optional @@ -def download_dataset(dl_dir): +def download_dataset( + dl_dir: str, + repo_id: str = "parler-tts/mls_eng", + revision: Optional[str] = None, + token: Optional[str] = None, +) -> None: @@ - repo_id = "parler-tts/mls_eng" local_dataset_dir = os.path.join(dl_dir, "mls_english") @@ - snapshot_download( + snapshot_download( repo_id=repo_id, repo_type="dataset", local_dir=local_dataset_dir, local_dir_use_symlinks=False, + revision=revision, + token=token, )
31-33
: Do not exit from library code; let exceptions propagate and handle in CLI.Let callers handle failures; wrap the CLI call and exit non-zero there. Add flags to pass repo/revision; read token from HF_TOKEN.
@@ - except Exception as e: - print(f"Error downloading dataset '{repo_id}': {e}", file=sys.stderr) - sys.exit(1) + except Exception as e: + print(f"Error downloading dataset '{repo_id}': {e}", file=sys.stderr) + raise @@ - parser = argparse.ArgumentParser( + parser = argparse.ArgumentParser( description="Download MLS English dataset from Hugging Face." ) @@ parser.add_argument( "--dl-dir", type=str, required=True, help="The base directory where the 'mls_english' dataset will be downloaded.", ) + parser.add_argument("--repo-id", type=str, default="parler-tts/mls_eng", + help="HF dataset repo-id or local dataset path.") + parser.add_argument("--revision", type=str, default=None, + help="HF revision/commit/tag for reproducible snapshots.") args = parser.parse_args() - download_dataset(args.dl_dir) + try: + download_dataset( + args.dl_dir, + repo_id=args.repo_id, + revision=args.revision, + token=os.getenv("HF_TOKEN"), + ) + except Exception: + sys.exit(1)Also applies to: 36-48
egs/mls_english/ASR/local/utils/create_subsets_greedy.py (1)
107-117
: Harden duration parsing (handles None/NaN/strings/negatives).Current code
float(example[col])
will throw or propagate NaN. Use a safe parser. Also addmath
import.+import math @@ def get_duration_from_column(example): """Helper to safely get duration from the specified column, in milliseconds.""" - if duration_column_name in example: - return float(example[duration_column_name]) * 1000 - else: - print( - f"Warning: Duration column '{duration_column_name}' not found in example. Returning 0.", - file=sys.stderr, - ) - return 0 + try: + if duration_column_name in example: + val = example[duration_column_name] + if isinstance(val, (int, float)) and math.isfinite(val) and val > 0: + return float(val) * 1000 + if isinstance(val, str): + val = val.strip() + if val: + v = float(val) + if math.isfinite(v) and v > 0: + return v * 1000 + except (TypeError, ValueError): + pass + return 0
🧹 Nitpick comments (10)
egs/mls_english/ASR/local/compute_fbank_mls_english.py (2)
63-65
: Use logging instead of print for consistency.- print(f"{mls_eng_hf_dataset_path=}") + logging.info("mls_eng_hf_dataset_path=%s", mls_eng_hf_dataset_path)
57-59
: Shorten exception message per Ruff TRY003 (optional).- raise ImportError( - "To process the MLS English HF corpus, please install optional dependency: pip install datasets" - ) + raise ImportError("Missing optional dependency 'datasets'. Run: pip install datasets")egs/mls_english/ASR/local/utils/create_subsets_greedy.py (8)
51-52
: Generalize split filename regex to support any shard counts.Your pattern only matches exactly 5-digit shard counters. Many datasets use variable digits. Relax it to
\d+
for robustness.- split_pattern = re.compile(r"^(train|dev|test|validation)-\d{5}-of-\d{5}\.parquet$") + split_pattern = re.compile(r"^(train|dev|test|validation)-\d+-of-\d+\.parquet$")
76-84
: Narrow the exception type when loading datasets.Catching bare Exception hides actionable errors and triggers lints. Limit to expected types.
- try: - full_dataset = load_dataset("parquet", data_files=data_files) - except Exception as e: + try: + full_dataset = load_dataset("parquet", data_files=data_files) + except (OSError, ValueError, RuntimeError) as e: print( f"Error loading dataset from '{full_data_dir}' with load_dataset: {e}", file=sys.stderr, ) sys.exit(1)
217-234
: Compute “other splits” duration without row iteration.
sum(get_duration_from_column(ex) for ex in other_split)
materializes rows and may decode audio. Sum the prefetched column instead.- other_split = full_dataset[split_name] - subset_dataset[split_name] = other_split - other_duration_ms = sum(get_duration_from_column(ex) for ex in other_split) + other_split = full_dataset[split_name] + subset_dataset[split_name] = other_split + if duration_column_name in other_split.column_names: + vals = other_split[duration_column_name] + other_duration_ms = sum( + (float(v) * 1000) + for v in vals + if v is not None + and isinstance(v, (int, float, str)) + and str(v).strip() != "" + and math.isfinite(float(v)) + and float(v) > 0 + ) + else: + other_duration_ms = 0
244-256
: Narrow exception on save or explicitly ignore the lint.Same BLE001 as above. Either narrow to expected IO/Arrow errors or add a
# noqa: BLE001
if broad catch is intentional.- except Exception as e: + except (OSError, RuntimeError, ValueError) as e: print( f"Error saving subset dataset to '{output_subset_dir}': {e}", file=sys.stderr, ) sys.exit(1)
22-28
: Preserve fractional hours in output directory name.Casting to int drops information (e.g., 0.5h → 0h). Use a compact formatter.
- output_subset_dir = os.path.join( - output_base_dir, - f"mls_english_subset_train{int(target_train_hours)}h_dev{int(target_dev_hours)}h_test{int(target_test_hours)}h", - ) + def _fmt_hours(h): # avoids trailing .0 while preserving fractions + return f"{h:g}" + output_subset_dir = os.path.join( + output_base_dir, + f"mls_english_subset_train{_fmt_hours(target_train_hours)}h_dev{_fmt_hours(target_dev_hours)}h_test{_fmt_hours(target_test_hours)}h", + )
328-341
: Remove f-prefix from literal strings (Ruff F541).These have no placeholders; drop the
f
to satisfy Ruff.-print(f"\nTo use your new subset dataset, you can load it like this:") -print(f"from datasets import load_dataset") -print(f"import os, glob") -print(f"data_files = {}") +print("\nTo use your new subset dataset, you can load it like this:") +print("from datasets import load_dataset") +print("import os, glob") +print("data_files = {}") @@ - print(f" files = glob.glob(split_path)") - print(f" if files: data_files[split_name] = files") - print(f"subset = load_dataset('parquet', data_files=data_files)") - print(f"print(subset)") + print(" files = glob.glob(split_path)") + print(" if files: data_files[split_name] = files") + print("subset = load_dataset('parquet', data_files=data_files)") + print("print(subset)")
20-29
: Validate non-negative target hours early.Guard against accidental negative inputs; current logic would silently produce empty subsets.
random.seed(random_seed) + for _name, _hours in { + "train": target_train_hours, + "dev": target_dev_hours, + "test": target_test_hours, + }.items(): + if _hours < 0: + print(f"Error: target hours for '{_name}' must be >= 0 (got {_hours}).", file=sys.stderr) + sys.exit(1)Also applies to: 281-297
112-116
: Reduce noisy per-example warnings.If a duration column is missing, this will spam logs per row. Consider counting and emitting a single summary warning.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
egs/mls_english/ASR/local/compute_fbank_mls_english.py
(4 hunks)egs/mls_english/ASR/local/utils/create_subsets_greedy.py
(1 hunks)egs/mls_english/ASR/local/utils/download_mls_english.py
(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
egs/mls_english/ASR/local/compute_fbank_mls_english.py (1)
icefall/utils.py (1)
is_module_available
(1975-1984)
🪛 Ruff (0.12.2)
egs/mls_english/ASR/local/utils/create_subsets_greedy.py
78-78: Do not catch blind exception: Exception
(BLE001)
251-251: Do not catch blind exception: Exception
(BLE001)
328-328: f-string without any placeholders
Remove extraneous f
prefix
(F541)
329-329: f-string without any placeholders
Remove extraneous f
prefix
(F541)
330-330: f-string without any placeholders
Remove extraneous f
prefix
(F541)
331-331: f-string without any placeholders
Remove extraneous f
prefix
(F541)
333-333: f-string without any placeholders
Remove extraneous f
prefix
(F541)
338-338: f-string without any placeholders
Remove extraneous f
prefix
(F541)
339-339: f-string without any placeholders
Remove extraneous f
prefix
(F541)
340-340: f-string without any placeholders
Remove extraneous f
prefix
(F541)
341-341: f-string without any placeholders
Remove extraneous f
prefix
(F541)
egs/mls_english/ASR/local/compute_fbank_mls_english.py
57-59: Avoid specifying long messages outside the exception class
(TRY003)
egs/mls_english/ASR/local/utils/download_mls_english.py
31-31: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (48)
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
parser.add_argument("-a", "--audio-dir", type=Path) | ||
parser.add_argument("-d", "--dl-dir", type=Path) | ||
return parser.parse_args() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix --dl-dir semantics and resolve the actual dataset path; current code will miss the downloaded ‘mls_english’ subdir.
The downloader writes to <dl_dir>/mls_english, but main() passes args.dl_dir directly to load_dataset(), causing failures. Also mark required args.
@@
- parser.add_argument("-m", "--manifest-dir", type=Path)
- parser.add_argument("-a", "--audio-dir", type=Path)
- parser.add_argument("-d", "--dl-dir", type=Path)
+ parser.add_argument("-m", "--manifest-dir", type=Path, required=True,
+ help="Directory for manifests and features.")
+ parser.add_argument("-a", "--audio-dir", type=Path, required=True,
+ help="Directory to materialize audio shards.")
+ parser.add_argument("-d", "--dl-dir", type=Path, required=True,
+ help="Base download dir (expects '<dl-dir>/mls_english') or a local HF dataset dir or an HF repo-id.")
@@
- mls_eng_hf_dataset_path = args.dl_dir # "/root/datasets/parler-tts--mls_eng"
- cut_sets = make_cutset_blueprints(mls_eng_hf_dataset_path)
+ # Accept: (1) <dl-dir>/mls_english from our downloader,
+ # (2) a direct local HF dataset dir, or
+ # (3) a repo-id string (if user passes one).
+ if args.dl_dir.is_dir():
+ candidate = args.dl_dir / "mls_english"
+ mls_eng_hf_dataset_path = candidate if candidate.exists() else args.dl_dir
+ else:
+ mls_eng_hf_dataset_path = args.dl_dir # repo-id string allowed
+ cut_sets = make_cutset_blueprints(str(mls_eng_hf_dataset_path))
Also applies to: 132-134
🤖 Prompt for AI Agents
In egs/mls_english/ASR/local/compute_fbank_mls_english.py around lines 110-112
(and similarly lines 132-134), the CLI currently defines --dl-dir as an optional
Path and passes args.dl_dir directly to load_dataset(), but the downloader
writes into a subdirectory named "mls_english", so the code misses the actual
dataset path and can fail; update the argparse definitions to mark required
arguments (audio-dir and dl-dir) and, before calling load_dataset(), resolve the
real dataset root by appending the "mls_english" subdirectory (e.g.,
dataset_root = args.dl_dir / "mls_english" if args.dl_dir is not None) and use
that resolved path when invoking load_dataset(). Ensure you also handle the case
where the resolved path does not exist with a clear error.
# --- NEW: Generalized sampling function --- | ||
def sample_split_by_hours(split_name, original_split, target_hours): | ||
""" | ||
Samples a dataset split to reach a target number of hours. | ||
Returns the sampled Dataset object and its actual duration in milliseconds. | ||
""" | ||
target_duration_ms = target_hours * 3600 * 1000 | ||
current_duration_ms = 0 | ||
indices_to_include = [] | ||
|
||
if original_split is None or len(original_split) == 0: | ||
print( | ||
f" Warning: Original '{split_name}' split is empty or not found. Cannot sample.", | ||
file=sys.stderr, | ||
) | ||
return None, 0 | ||
|
||
print( | ||
f"\n Processing '{split_name}' split to reach approximately {target_hours} hours..." | ||
) | ||
print( | ||
f" Total samples in original '{split_name}' split: {len(original_split)}" | ||
) | ||
|
||
all_original_indices = list(range(len(original_split))) | ||
random.shuffle(all_original_indices) # Shuffle indices for random sampling | ||
|
||
num_samples_processed = 0 | ||
for original_idx in all_original_indices: | ||
if current_duration_ms >= target_duration_ms and target_hours > 0: | ||
print( | ||
f" Target {split_name} hours reached ({target_hours}h). Stopping processing." | ||
) | ||
break | ||
|
||
example = original_split[original_idx] | ||
duration_ms = get_duration_from_column(example) | ||
|
||
if duration_ms > 0: | ||
indices_to_include.append(original_idx) | ||
current_duration_ms += duration_ms | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Avoid row-wise access (decodes audio) and prefetch durations; huge perf win.
original_split[i]
materializes full rows and can decode Audio, which is prohibitively expensive. Prefetch the duration column and disable audio decode during sampling.
def sample_split_by_hours(split_name, original_split, target_hours):
@@
- target_duration_ms = target_hours * 3600 * 1000
+ target_duration_ms = target_hours * 3600 * 1000
@@
- all_original_indices = list(range(len(original_split)))
+ # Disable audio decoding to avoid loading waveforms while sampling
+ if "audio" in original_split.features and isinstance(original_split.features["audio"], Audio):
+ sr = getattr(original_split.features["audio"], "sampling_rate", 16000)
+ original_split = original_split.cast_column("audio", Audio(sampling_rate=sr, decode=False))
+
+ # Prefetch durations once to avoid per-row materialization
+ duration_values = None
+ if duration_column_name in original_split.column_names:
+ try:
+ duration_values = original_split[duration_column_name]
+ except (RuntimeError, ValueError, OSError):
+ duration_values = None
+
+ all_original_indices = list(range(len(original_split)))
random.shuffle(all_original_indices) # Shuffle indices for random sampling
@@
- example = original_split[original_idx]
- duration_ms = get_duration_from_column(example)
+ if duration_values is not None:
+ # Fast path: parse from prefetched column
+ try:
+ val = duration_values[original_idx]
+ duration_ms = 0
+ if val is not None:
+ v = float(val)
+ if math.isfinite(v) and v > 0:
+ duration_ms = v * 1000
+ except (TypeError, ValueError):
+ duration_ms = 0
+ else:
+ # Fallback: access a single row (audio decoding disabled above)
+ duration_ms = get_duration_from_column(original_split[original_idx])
@@
- if "audio" in sampled_split.features and not isinstance(
- sampled_split.features["audio"], Audio
- ):
- sampling_rate = (
- sampled_split.features["audio"].sampling_rate
- if isinstance(sampled_split.features["audio"], Audio)
- else 16000
- )
- new_features = sampled_split.features
- new_features["audio"] = Audio(sampling_rate=sampling_rate)
- sampled_split = sampled_split.cast(new_features)
+ if "audio" in sampled_split.features and not isinstance(sampled_split.features["audio"], Audio):
+ sr = getattr(sampled_split.features["audio"], "sampling_rate", 16000)
+ new_features = sampled_split.features.copy()
+ new_features["audio"] = Audio(sampling_rate=sr, decode=False)
+ sampled_split = sampled_split.cast(new_features)
Also applies to: 175-187
🤖 Prompt for AI Agents
egs/mls_english/ASR/local/utils/create_subsets_greedy.py lines 118-159 (also
apply same change to 175-187): the loop uses original_split[i] which
materializes full rows and decodes audio per-row; instead prefetch the duration
column into memory and iterate that array while ensuring audio decoding is
disabled for the split during sampling. Concretely: obtain a durations
list/array from the dataset (e.g., fetch the single duration column via the
dataset API so it does not decode audio), shuffle indices over that length, use
durations[idx] to accumulate current_duration_ms and only access full examples
later when building the final subset, and make sure the dataset is opened/used
with audio decoding turned off or limited to the columns you need while
sampling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (4)
egs/multi_ja_en/ASR/RESULTS.md (4)
33-38
: Align ONNX export averaging with reported decode settings.
Export uses --use-averaged-model 0 and --avg 1, while decode reports --avg 15. Mismatch can yield different WER/CER. Consider exporting the averaged model (or add a note explaining the difference).Apply:
./zipformer/export-onnx.py \ --tokens ./data/lang/bbpe_2000/tokens.txt \ - --use-averaged-model 0 \ - --epoch 21 \ - --avg 1 \ + --use-averaged-model 1 \ + --epoch 21 \ + --avg 15 \ --exp-dir ./zipformer/exp
30-31
: Nit: capitalize ONNX.
Minor documentation polish.-To export the model with onnx: +To export the model with ONNX:
51-58
: Make English WER table reproducible (versions/splits/normalization).
Briefly note dataset versions/splits (e.g., CV release, TED corpus variant, MLS split) and text normalization/tokenization used.
60-67
: Add dataset details for Japanese CER table.
Specify corpus versions/splits and scoring pipeline (e.g., full-/half-width normalization, punctuation handling) for reproducibility.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
egs/multi_ja_en/ASR/RESULTS.md
(3 hunks)
🧰 Additional context used
🪛 LanguageTool
egs/multi_ja_en/ASR/RESULTS.md
[grammar] ~53-~53: There might be a mistake here.
Context: ...Corpus | WER (%) | |-----------------------------|---------...
(QB_NEW_EN)
[grammar] ~54-~54: There might be a mistake here.
Context: ...-----------------------------|---------| | CommonVoice | 29.03 ...
(QB_NEW_EN)
[grammar] ~55-~55: There might be a mistake here.
Context: ... CommonVoice | 29.03 | | TED | 16.78 ...
(QB_NEW_EN)
[grammar] ~56-~56: There might be a mistake here.
Context: ... TED | 16.78 | | MLS English (test-clean) | 8.64 |...
(QB_NEW_EN)
[grammar] ~62-~62: There might be a mistake here.
Context: ...e datasets: | Corpus | CER (%) | |---------------|---------| | JSUT ...
(QB_NEW_EN)
[grammar] ~63-~63: There might be a mistake here.
Context: ... | CER (%) | |---------------|---------| | JSUT | 8.13 | | CommonVoice...
(QB_NEW_EN)
[grammar] ~64-~64: There might be a mistake here.
Context: ...--|---------| | JSUT | 8.13 | | CommonVoice | 9.82 | | TEDx ...
(QB_NEW_EN)
[grammar] ~65-~65: There might be a mistake here.
Context: ... | 8.13 | | CommonVoice | 9.82 | | TEDx | 11.64 | Pre-traine...
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (130)
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.4.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.4.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.4.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.4.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.4.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.4.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.4.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.4.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.4.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.4.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.0 v20250630
🔇 Additional comments (2)
egs/multi_ja_en/ASR/RESULTS.md (2)
12-12
: Epoch change to 21 is consistent with decode/export — LGTM.
No issues spotted with this value across commands.
23-23
: Decode epoch updated to 21 — LGTM.
Matches training/export epoch.
egs/multi_ja_en/ASR/RESULTS.md
Outdated
--use-fp16 1 \ | ||
--exp-dir zipformer/exp \ | ||
--max-duration 600 | ||
--manifest-dir data/manifests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
❓ Verification inconclusive
Verify manifest path expectations for training.
Switching to --manifest-dir data/manifests is fine; ensure the prep steps populate this directory and that the README points to the exact script/output layout.
Run to sanity-check presence:
🏁 Script executed:
#!/bin/bash
ls -lah data/manifests || exit 1
fd -e jsonl -e jsonl.gz data/manifests -x sh -c 'echo "found: $1" _ {}' \; || true
Length of output: 95
Ensure data/manifests
exists and is populated before using --manifest-dir
The directory data/manifests
doesn’t exist, so training will fail. Update your data-prep script (and the README) to output the generated JSONL/JSONL.GZ files into data/manifests
.
🤖 Prompt for AI Agents
In egs/multi_ja_en/ASR/RESULTS.md around line 16, the README references
--manifest-dir data/manifests but that directory doesn't exist; update the
data-prep script to create data/manifests and emit the generated JSONL/JSONL.GZ
manifest files there, and update the README to document this output path and
prerequisite (or change the flag to the actual manifest path your prep currently
writes to).
for sure! i’ll look into it tonight
Best Regards
Jin
…On Mon, 1 Sep 2025 at 21:44 Fangjun Kuang ***@***.***> wrote:
*csukuangfj* left a comment (k2-fsa/icefall#2015)
<#2015 (comment)>
@JinZr <https://github.com/JinZr> Can you have a look at this PR?
—
Reply to this email directly, view it on GitHub
<#2015 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AOON42FGBNWS3SMOKXGF7O33QREVHAVCNFSM6AAAAACFAOJHLCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTENBSGQ2DEMBUGM>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (3)
egs/multi_ja_en/ASR/RESULTS.md (1)
12-16
: Clarify manifests prep and make max-duration explicit (if non-default).
- Ensure prepare scripts create/populate data/manifests before training; this was flagged earlier and still applies.
- If train.py relies on a default max-duration, consider documenting the default or adding the flag for clarity.
Apply this doc tweak right below the training block to make the prerequisite unmistakable:
--exp-dir zipformer/exp \ --manifest-dir data/manifests
+Note: Run the data preparation steps first so that
data/manifests
exists and contains the JSONL/JSONL.GZ manifests used for training.
+If you rely on the default for--max-duration
, consider stating it here (or pass it explicitly) for reproducibility.</blockquote></details> <details> <summary>egs/mls_english/ASR/local/utils/asr_datamodule.py (2)</summary><blockquote> `77-82`: **Resolved: --max-duration now uses float.** Thanks for addressing the earlier inconsistency; `type=float` matches the default and expected semantics. --- `319-321`: **Use configured --num-workers for validation DataLoader (and set persistent_workers accordingly).** Hardcoding `num_workers=2` ignores the CLI knob and can underutilize hardware. Apply: ```diff valid_dl = DataLoader( validate, sampler=valid_sampler, batch_size=None, - num_workers=2, - persistent_workers=False, + num_workers=self.args.num_workers, + persistent_workers=self.args.num_workers > 0, )
Optional: make the same change for train/test DataLoaders for consistency (see Lines 278-284 and 342-343).
🧹 Nitpick comments (11)
egs/multi_ja_en/ASR/RESULTS.md (4)
33-38
: Confirm tokens path provenance and prep step.The tokens path moved to ./data/lang/bbpe_2000/tokens.txt. Add a short note pointing readers to the tokenizer prep that generates it (bbpe_2000), or they’ll hit a missing file.
Suggested doc nudge:
./zipformer/export-onnx.py \ --tokens ./data/lang/bbpe_2000/tokens.txt \ --use-averaged-model 0 \ --epoch 21 \ --avg 1 \ --exp-dir ./zipformer/exp + +# Ensure you have created ./data/lang/bbpe_2000 via the language prep step before exporting.
51-58
: Annotate dataset versions and evaluation settings for WERs.Specify dataset versions/splits (e.g., CV version, TED corpus variant), text normalization, and decoding method (greedy vs. MBS, LM? none). This aids reproducibility.
Example header tweak:
-| Corpus | WER (%) | +| Corpus (version/split, norm, method) | WER (%) |
60-67
: Do the same for Japanese CERs (versions/norm/method).List dataset versions (e.g., JSUT release), normalization rules, and decode method for CER.
69-69
: Pin the pre-trained model revision for stability.Link is good; consider adding a specific HF revision/commit/tag in the text so results remain reproducible if the repo updates.
egs/mls_english/ASR/local/utils/asr_datamodule.py (7)
346-351
: Avoid lru_cache on bound instance methods (B019); cache by path instead.
@lru_cache
on methods retains references toself
and can leak when multiple instances are created.Replace the decorators and route through a path-keyed cache:
- @lru_cache() def train_cuts(self) -> CutSet: logging.info("About to get train cuts") - return load_manifest_lazy( - self.args.manifest_dir / "mls_eng_cuts_train.jsonl.gz" - ) + return _cached_load_cuts(str(self.args.manifest_dir / "mls_eng_cuts_train.jsonl.gz")) - @lru_cache() def valid_cuts(self) -> CutSet: logging.info("About to get dev cuts") - return load_manifest_lazy( - self.args.manifest_dir / "mls_eng_cuts_dev.jsonl.gz" - ) + return _cached_load_cuts(str(self.args.manifest_dir / "mls_eng_cuts_dev.jsonl.gz")) - @lru_cache() def test_cuts(self) -> CutSet: logging.info("About to get test cuts") - return load_manifest_lazy( - self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz" - ) + return _cached_load_cuts(str(self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz"))Add the helper near the imports:
# top-level helper (place after imports) @lru_cache(maxsize=16) def _cached_load_cuts(path: str) -> CutSet: return load_manifest_lazy(Path(path))Also applies to: 353-358, 360-365
196-205
: Honor --concatenate-cuts during training too (not only in validation).The CLI flag suggests train-time concatenation support; applying it here reduces padding and aligns with the docstring.
Apply:
transforms = [] + if self.args.concatenate_cuts: + transforms = [ + CutConcatenate( + duration_factor=self.args.duration_factor, gap=self.args.gap + ) + ] + transformsIf order matters, keep
CutConcatenate
beforeCutMix
so noise is applied after concatenation.
231-236
: Be explicit about PrecomputedFeatures when not using on-the-fly features.K2SpeechRecognitionDataset defaults may vary; explicit is clearer and future-proof.
Apply:
train = K2SpeechRecognitionDataset( cut_transforms=transforms, + input_strategy=PrecomputedFeatures(), input_transforms=input_transforms, return_cuts=self.args.return_cuts, )
else: validate = K2SpeechRecognitionDataset( cut_transforms=transforms, + input_strategy=PrecomputedFeatures(), return_cuts=self.args.return_cuts, )
Also applies to: 305-309
247-247
: Remove unused noqa in commented code (RUF100).The
# noqa
is unnecessary on a commented-out line.Apply:
- # transforms = [PerturbSpeed(factors=[0.9, 1.1], p=2/3)] + transforms # noqa + # transforms = [PerturbSpeed(factors=[0.9, 1.1], p=2/3)] + transforms
291-295
: Use iterable unpacking over list concatenation (RUF005).Slightly cleaner and avoids an intermediate list.
Apply:
- transforms = [ - CutConcatenate( - duration_factor=self.args.duration_factor, gap=self.args.gap - ) - ] + transforms + transforms = [ + CutConcatenate( + duration_factor=self.args.duration_factor, gap=self.args.gap + ), + *transforms, + ]
26-26
: Drop unused importload_manifest
.Apply:
-from lhotse import CutSet, Fbank, FbankConfig, load_manifest, load_manifest_lazy +from lhotse import CutSet, Fbank, FbankConfig, load_manifest_lazy
278-284
: Optional: enable persistent_workers when workers > 0 for throughput.This can speed up data loading for long runs; keep it False when workers == 0.
Apply:
train_dl = DataLoader( train, sampler=train_sampler, batch_size=None, num_workers=self.args.num_workers, - persistent_workers=False, + persistent_workers=self.args.num_workers > 0, )test_dl = DataLoader( test, batch_size=None, sampler=sampler, - num_workers=self.args.num_workers, + num_workers=self.args.num_workers, + persistent_workers=self.args.num_workers > 0, )Also applies to: 338-343
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
egs/mls_english/ASR/local/utils/asr_datamodule.py
(1 hunks)egs/multi_ja_en/ASR/RESULTS.md
(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
egs/mls_english/ASR/local/utils/asr_datamodule.py (1)
icefall/utils.py (1)
str2bool
(142-158)
🪛 Ruff (0.12.2)
egs/mls_english/ASR/local/utils/asr_datamodule.py
247-247: Unused blanket noqa
directive
Remove unused noqa
directive
(RUF100)
291-295: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
346-346: Use of functools.lru_cache
or functools.cache
on methods can lead to memory leaks
(B019)
353-353: Use of functools.lru_cache
or functools.cache
on methods can lead to memory leaks
(B019)
360-360: Use of functools.lru_cache
or functools.cache
on methods can lead to memory leaks
(B019)
🪛 LanguageTool
egs/multi_ja_en/ASR/RESULTS.md
[grammar] ~53-~53: There might be a mistake here.
Context: ...Corpus | WER (%) | |-----------------------------|---------...
(QB_NEW_EN)
[grammar] ~54-~54: There might be a mistake here.
Context: ...-----------------------------|---------| | CommonVoice | 29.03 ...
(QB_NEW_EN)
[grammar] ~55-~55: There might be a mistake here.
Context: ... CommonVoice | 29.03 | | TED | 16.78 ...
(QB_NEW_EN)
[grammar] ~56-~56: There might be a mistake here.
Context: ... TED | 16.78 | | MLS English (test-clean) | 8.64 |...
(QB_NEW_EN)
[grammar] ~62-~62: There might be a mistake here.
Context: ...e datasets: | Corpus | CER (%) | |---------------|---------| | JSUT ...
(QB_NEW_EN)
[grammar] ~63-~63: There might be a mistake here.
Context: ... | CER (%) | |---------------|---------| | JSUT | 8.13 | | CommonVoice...
(QB_NEW_EN)
[grammar] ~64-~64: There might be a mistake here.
Context: ...--|---------| | JSUT | 8.13 | | CommonVoice | 9.82 | | TEDx ...
(QB_NEW_EN)
[grammar] ~65-~65: There might be a mistake here.
Context: ... | 8.13 | | CommonVoice | 9.82 | | TEDx | 11.64 | Pre-traine...
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (197)
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: style_check (ubuntu-latest, 3.10)
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: style_check (ubuntu-latest, 3.10)
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: style_check (ubuntu-latest, 3.10)
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: style_check (ubuntu-latest, 3.10)
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: style_check (ubuntu-latest, 3.10)
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: style_check (ubuntu-latest, 3.10)
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: style_check (ubuntu-latest, 3.10)
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: style_check (ubuntu-latest, 3.10)
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: style_check (ubuntu-latest, 3.10)
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: style_check (ubuntu-latest, 3.10)
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.5.0 v20250630
- GitHub Check: py3.10 torch2.2.0 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.0.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
🔇 Additional comments (1)
egs/multi_ja_en/ASR/RESULTS.md (1)
23-28
: LGTM: decode epoch/avg pair is consistent.Epoch 21 with avg 15 matches the training epoch change. No further action.
@lru_cache() | ||
def test_cuts(self) -> List[CutSet]: | ||
logging.info("About to get test cuts") | ||
return load_manifest_lazy( | ||
self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type annotation mismatch: test_cuts returns a CutSet, not List[CutSet].
This can cause confusion and downstream typing issues; test_dataloaders
also expects a CutSet
.
Apply:
- def test_cuts(self) -> List[CutSet]:
+ def test_cuts(self) -> CutSet:
logging.info("About to get test cuts")
return load_manifest_lazy(
self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz"
)
And remove the now-unused List
import:
-from typing import Any, Dict, List, Optional
+from typing import Any, Dict, Optional
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@lru_cache() | |
def test_cuts(self) -> List[CutSet]: | |
logging.info("About to get test cuts") | |
return load_manifest_lazy( | |
self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz" | |
) | |
++ b/egs/mls_english/ASR/local/utils/asr_datamodule.py | |
@@ –1,1 +1,1 @@ | |
from typing import Any, Dict, Optional | |
@@ –360,2 +360,2 @@ | |
def test_cuts(self) -> CutSet: | |
logging.info("About to get test cuts") | |
return load_manifest_lazy( | |
self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz" | |
) |
🧰 Tools
🪛 Ruff (0.12.2)
360-360: Use of functools.lru_cache
or functools.cache
on methods can lead to memory leaks
(B019)
🤖 Prompt for AI Agents
In egs/mls_english/ASR/local/utils/asr_datamodule.py around lines 360 to 365,
the test_cuts function is annotated to return List[CutSet] but actually returns
a CutSet; update the return type annotation to CutSet and remove the now-unused
List import from the file; ensure any related type hints (e.g., in
test_dataloaders) remain consistent with CutSet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (1)
egs/multi_ja_en/ASR/RESULTS.md (1)
13-20
: Boolean CLI arg: confirm expected type for --enable-musan and ensure manifests exist.
- Many icefall scripts expect 0/1 or store_true; passing literal "True" may be parsed as a string. Please confirm and align examples (e.g.,
--enable-musan 1
or just--enable-musan
).- Also verify that the README/prep steps create and populate
data/manifests
before running these commands.Run:
#!/bin/bash # Check how enable_musan is defined rg -nP --type=py -C2 '\benable[-_]?musan\b' egs/multi_ja_en/ASR/zipformer/train.py egs/multi_ja_en/ASR/zipformer/decode.py || true # Sanity-check manifests after prep ls -lah data/manifests || exit 1 fd -e jsonl -e jsonl.gz data/manifests -x sh -c 'echo "found: $1"' _ {}
🧹 Nitpick comments (5)
egs/multi_ja_en/ASR/RESULTS.md (5)
24-31
: Decode config is underspecified (beam/num-active/etc.).
modified_beam_search
usually needs beam/num_paths-like knobs for reproducibility. Add the key params you used (beam size, max contexts, n-best).
36-42
: Capitalize and clarify: “export to ONNX”.Minor wording fix and keep flags on one line per arg for copy/paste.
Apply:
-To export the model with onnx: +To export the model to ONNX:
56-71
: Clarify scoring setup for English WER and Japanese CER.Specify casing (lowercased or original), punctuation normalization, tokenizer for Japanese (e.g., character-based vs. MeCab), and scoring tool (e.g., sclite). Without this, numbers aren’t comparable.
116-116
: Document how chunk size affects WER/RTF.Add a brief note on the trade-off across {16,32,64,128} (latency vs. accuracy) and indicate the value used to report streaming WER.
5-8
: Add repro details: seed, commit hashes, and tool versions.Include torch/k2/lhotse/icefall commit IDs, random seed(s), and GPU type to make these results reproducible.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
egs/multi_ja_en/ASR/RESULTS.md
(1 hunks)
🧰 Additional context used
🪛 LanguageTool
egs/multi_ja_en/ASR/RESULTS.md
[grammar] ~44-~44: There might be a mistake here.
Context: ...xp-dir ./zipformer/exp ``` WER and CER on test set listed below (calculated with ...
(QB_NEW_EN)
[grammar] ~46-~46: There might be a mistake here.
Context: ...eech + MLS English (combined test set) | |----------------------|----------------...
(QB_NEW_EN)
[grammar] ~47-~47: There might be a mistake here.
Context: ...---------------------------------------| | Zipformer WER (%) | ...
(QB_NEW_EN)
[grammar] ~48-~48: There might be a mistake here.
Context: ... test | | greedy_search | ...
(QB_NEW_EN)
[grammar] ~49-~49: There might be a mistake here.
Context: ... 6.33 | | modified_beam_search | ...
(QB_NEW_EN)
[grammar] ~56-~56: There might be a mistake here.
Context: ... Corpus | WER (%) | |-----------------------------|---------...
(QB_NEW_EN)
[grammar] ~57-~57: There might be a mistake here.
Context: ...-----------------------------|---------| | CommonVoice | 29.03 ...
(QB_NEW_EN)
[grammar] ~58-~58: There might be a mistake here.
Context: ... CommonVoice | 29.03 | | TED | 16.78 ...
(QB_NEW_EN)
[grammar] ~59-~59: There might be a mistake here.
Context: ... TED | 16.78 | | MLS English (test set) | 8.64 ...
(QB_NEW_EN)
[grammar] ~65-~65: There might be a mistake here.
Context: ...e datasets: | Corpus | CER (%) | |---------------|---------| | JSUT ...
(QB_NEW_EN)
[grammar] ~66-~66: There might be a mistake here.
Context: ... | CER (%) | |---------------|---------| | JSUT | 8.13 | | CommonVoice...
(QB_NEW_EN)
[grammar] ~67-~67: There might be a mistake here.
Context: ...--|---------| | JSUT | 8.13 | | CommonVoice | 9.82 | | TEDx ...
(QB_NEW_EN)
[grammar] ~68-~68: There might be a mistake here.
Context: ... | 8.13 | | CommonVoice | 9.82 | | TEDx | 11.64 | Pre-traine...
(QB_NEW_EN)
🔇 Additional comments (3)
egs/multi_ja_en/ASR/RESULTS.md (3)
72-75
: HF link fix looks good.
Anchor and href are consistent now.
83-91
: Streaming training flags likely missing.The streaming section repeats non-streaming args. If you used streaming-specific options (e.g., causal/chunked training flags), include them; otherwise readers can’t reproduce “Streaming” results.
107-114
: Confirm ONNX export flags match the streaming architecture.
--decode-chunk-len 32
is set; if additional streaming-only flags are required at export (e.g., causal conv settings), document them.
WER and CER on test set listed below (calculated with `./zipformer/decode.py`): | ||
|
||
| Datasets | ReazonSpeech + MLS English (combined test set) | | ||
|----------------------|------------------------------------------------| | ||
| Zipformer WER (%) | test | | ||
| greedy_search | 6.33 | | ||
| modified_beam_search | 6.32 | | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Fix table layout for combined-test metrics.
Header/rows are misaligned (“Zipformer WER (%) | test” reads oddly). Consider:
-| Datasets | ReazonSpeech + MLS English (combined test set) |
-|----------------------|------------------------------------------------|
-| Zipformer WER (%) | test |
-| greedy_search | 6.33 |
-| modified_beam_search | 6.32 |
+| Method | WER (%) | Notes |
+|----------------------|--------:|-----------------------------------------|
+| greedy_search | 6.33 | Combined ReazonSpeech + MLS English |
+| modified_beam_search | 6.32 | Combined ReazonSpeech + MLS English |
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
WER and CER on test set listed below (calculated with `./zipformer/decode.py`): | |
| Datasets | ReazonSpeech + MLS English (combined test set) | | |
|----------------------|------------------------------------------------| | |
| Zipformer WER (%) | test | | |
| greedy_search | 6.33 | | |
| modified_beam_search | 6.32 | | |
WER and CER on test set listed below (calculated with `./zipformer/decode.py`): | |
| Method | WER (%) | Notes | | |
|----------------------|--------:|-----------------------------------------| | |
| greedy_search | 6.33 | Combined ReazonSpeech + MLS English | | |
| modified_beam_search | 6.32 | Combined ReazonSpeech + MLS English | |
🧰 Tools
🪛 LanguageTool
[grammar] ~44-~44: There might be a mistake here.
Context: ...xp-dir ./zipformer/exp ``` WER and CER on test set listed below (calculated with ...
(QB_NEW_EN)
[grammar] ~46-~46: There might be a mistake here.
Context: ...eech + MLS English (combined test set) | |----------------------|----------------...
(QB_NEW_EN)
[grammar] ~47-~47: There might be a mistake here.
Context: ...---------------------------------------| | Zipformer WER (%) | ...
(QB_NEW_EN)
[grammar] ~48-~48: There might be a mistake here.
Context: ... test | | greedy_search | ...
(QB_NEW_EN)
[grammar] ~49-~49: There might be a mistake here.
Context: ... 6.33 | | modified_beam_search | ...
(QB_NEW_EN)
🤖 Prompt for AI Agents
In egs/multi_ja_en/ASR/RESULTS.md around lines 44 to 51, the markdown table
layout for the combined test metrics is misaligned (the header cell
"ReazonSpeech + MLS English (combined test set)" and the "Zipformer WER (%) |
test" row read oddly); fix by restructuring the table to have clear column
headers (e.g., Dataset | Metric | Value or Dataset | Zipformer WER (%) with a
single header row), align the separator row to match the number of columns, and
move "test" into a proper column cell so each row maps correctly to the headers.
egs/multi_ja_en/ASR/RESULTS.md
Outdated
./zipformer/decode.py \ | ||
--epoch 10 \ | ||
--avg 1 \ | ||
--exp-dir ./zipformer/exp \ | ||
--decoding-method modified_beam_search \ | ||
--manifest-dir data/manifests | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use streaming decode script/flags for the streaming model.
The streaming section uses decode.py
without chunk args. Replace with the streaming decoder and include chunk length.
Apply:
-./zipformer/decode.py \
- --epoch 10 \
- --avg 1 \
- --exp-dir ./zipformer/exp \
- --decoding-method modified_beam_search \
- --manifest-dir data/manifests
+./zipformer/streaming_decode.py \
+ --epoch 10 \
+ --avg 1 \
+ --exp-dir ./zipformer/exp \
+ --decoding-method modified_beam_search \
+ --decode-chunk-len 32 \
+ --manifest-dir data/manifests
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
./zipformer/decode.py \ | |
--epoch 10 \ | |
--avg 1 \ | |
--exp-dir ./zipformer/exp \ | |
--decoding-method modified_beam_search \ | |
--manifest-dir data/manifests | |
``` | |
./zipformer/streaming_decode.py \ | |
--epoch 10 \ | |
--avg 1 \ | |
--exp-dir ./zipformer/exp \ | |
--decoding-method modified_beam_search \ | |
--decode-chunk-len 32 \ | |
--manifest-dir data/manifests |
🤖 Prompt for AI Agents
In egs/multi_ja_en/ASR/RESULTS.md around lines 96 to 102, the streaming decode
example uses the non-streaming decode.py and lacks chunking arguments; replace
it with the streaming decoder invocation (e.g., streaming_decode.py or the
decode.py streaming mode) and add required streaming flags such as
--decoding-method streaming_beam_search (or the project-specific streaming
method) plus chunk-related args like --chunk-length (and if applicable
--left-context/--right-context or --chunk-size) and any streaming-specific
model/exp flags; update the command line to call the streaming decoder script
and include a concrete --chunk-length value used by the streaming model.
*Please let us know which script to use to evaluate the streaming model!* | ||
|
||
|
||
We also include WER% for common English ASR datasets: | ||
|
||
*Please let us know which script to use to evaluate the streaming model!* | ||
|
||
|
||
And CER% for common Japanese datasets: | ||
|
||
*Please let us know which script to use to evaluate the streaming model!* | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Replace placeholders with concrete evaluation commands.
The repeated “Please let us know…” lines block reproducibility. Provide the exact script(s), command lines, and scoring steps for streaming WER/CER.
Apply:
-*Please let us know which script to use to evaluate the streaming model!*
+Evaluate streaming with:
+```shell
+./zipformer/streaming_decode.py \
+ --epoch 10 --avg 1 --exp-dir ./zipformer/exp \
+ --decoding-method modified_beam_search \
+ --decode-chunk-len 32 \
+ --manifest-dir data/manifests
+```
+Score with the same normalization used for non-streaming; list exact commands/tools.
🤖 Prompt for AI Agents
In egs/multi_ja_en/ASR/RESULTS.md around lines 120-131, replace the repeated
placeholder lines with concrete evaluation instructions: add the exact streaming
decode command (pointing to zipformer/streaming_decode.py with required flags
such as epoch, avg, exp-dir, decoding-method, decode-chunk-len, manifest-dir),
then add the precise scoring steps and commands (normalization pipeline matching
non-streaming evaluation, followed by invocation of the scoring tool such as
sclite or the repo's wer/cer scoring script) and an example invocation showing
how to produce hypothesis and reference files and run the scorer; ensure you
list any required normalization options (lowercasing, punctuation removal,
kana/kanji normalization for Japanese) and the exact files produced/consumed so
readers can reproduce the reported WER/CER.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (4)
egs/multi_ja_en/ASR/RESULTS.md (4)
19-21
: Ensuredata/manifests
is created and populated in prep.Docs should state that prepare scripts emit manifests into
data/manifests
before training/decoding uses it.Run to sanity‑check:
#!/bin/bash set -e ls -lah data/manifests fd -e jsonl -e jsonl.gz data/manifests -x sh -c 'echo "found: $1"' _ {}
45-52
: Fix misaligned WER table.Replace with a simple Method|WER table:
-| Datasets | ReazonSpeech + MLS English (combined test set) | -|----------------------|------------------------------------------------| -| Zipformer WER (%) | test | -| greedy_search | 6.33 | -| modified_beam_search | 6.32 | +| Method | WER (%) | Notes | +|----------------------|--------:|-------------------------------------| +| greedy_search | 6.33 | ReazonSpeech+MLS English, combined | +| modified_beam_search | 6.32 | ReazonSpeech+MLS English, combined |
97-99
: Replace TODO with a working streaming decode command.-```shell -TODO -``` +```shell +./zipformer/streaming_decode.py \ + --epoch 10 \ + --avg 1 \ + --exp-dir ./zipformer/exp-15k15k-streaming \ + --decoding-method modified_beam_search \ + --decode-chunk-len 32 \ + --manifest-dir data/manifests +```
149-160
: Replace placeholders with concrete streaming eval steps.-Word Error Rates (WERs) listed below: - -*Please let us know which script to use to evaluate the streaming model!* - - -We also include WER% for common English ASR datasets: - -*Please let us know which script to use to evaluate the streaming model!* - - -And CER% for common Japanese datasets: - -*Please let us know which script to use to evaluate the streaming model!* +Word Error Rates (WERs) for streaming: + +```shell +# Decode (streaming) +./zipformer/streaming_decode.py \ + --epoch 10 --avg 1 \ + --exp-dir ./zipformer/exp-15k15k-streaming \ + --decoding-method modified_beam_search \ + --decode-chunk-len 32 \ + --manifest-dir data/manifests + +# Score (same normalization as non-streaming) +# Example: produces refs/hyps then runs the repo scorer. +python ./zipformer/scoring/prepare_refs_hyps.py \ + --manifest-dir data/manifests \ + --output-dir ./zipformer/exp-15k15k-streaming/scoring +python ./zipformer/scoring/score_wer_cer.py \ + --scoring-dir ./zipformer/exp-15k15k-streaming/scoring +``` + +We also include WER% for common English ASR datasets and CER% for Japanese with the same normalization.
🧹 Nitpick comments (5)
egs/multi_ja_en/ASR/RESULTS.md (5)
34-36
: Capitalize ONNX.-To export the model with onnx: +To export the model with ONNX:
57-62
: Specify dataset versions/splits and normalization.State exact corpora (e.g., “Common Voice 17 test”, “TED-LIUM 3 test” or “TED Talks subset”), and text normalization used, to make WER reproducible.
66-71
: Clarify JP sets and CER normalization.Please note the dataset releases/splits (e.g., JSUT v1.1) and normalization (e.g., Kanji→Kana mapping, punctuation handling).
101-103
: Use canonical name “sherpa-onnx”.-To export the model with sherpa onnx: +To export the model with sherpa-onnx:
145-145
: Match flag name in the text.-You may also use decode chunk sizes `16`, `32`, `64`, `128`. +You may also use --decode-chunk-len values 16, 32, 64, 128.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
egs/multi_ja_en/ASR/RESULTS.md
(1 hunks)
🧰 Additional context used
🪛 LanguageTool
egs/multi_ja_en/ASR/RESULTS.md
[grammar] ~45-~45: There might be a mistake here.
Context: ...xp-dir ./zipformer/exp ``` WER and CER on test set listed below (calculated with ...
(QB_NEW_EN)
[grammar] ~47-~47: There might be a mistake here.
Context: ...eech + MLS English (combined test set) | |----------------------|----------------...
(QB_NEW_EN)
[grammar] ~48-~48: There might be a mistake here.
Context: ...---------------------------------------| | Zipformer WER (%) | ...
(QB_NEW_EN)
[grammar] ~49-~49: There might be a mistake here.
Context: ... test | | greedy_search | ...
(QB_NEW_EN)
[grammar] ~50-~50: There might be a mistake here.
Context: ... 6.33 | | modified_beam_search | ...
(QB_NEW_EN)
[grammar] ~57-~57: There might be a mistake here.
Context: ... Corpus | WER (%) | |-----------------------------|---------...
(QB_NEW_EN)
[grammar] ~58-~58: There might be a mistake here.
Context: ...-----------------------------|---------| | CommonVoice | 29.03 ...
(QB_NEW_EN)
[grammar] ~59-~59: There might be a mistake here.
Context: ... CommonVoice | 29.03 | | TED | 16.78 ...
(QB_NEW_EN)
[grammar] ~60-~60: There might be a mistake here.
Context: ... TED | 16.78 | | MLS English (test set) | 8.64 ...
(QB_NEW_EN)
[grammar] ~66-~66: There might be a mistake here.
Context: ...e datasets: | Corpus | CER (%) | |---------------|---------| | JSUT ...
(QB_NEW_EN)
[grammar] ~67-~67: There might be a mistake here.
Context: ... | CER (%) | |---------------|---------| | JSUT | 8.13 | | CommonVoice...
(QB_NEW_EN)
[grammar] ~68-~68: There might be a mistake here.
Context: ...--|---------| | JSUT | 8.13 | | CommonVoice | 9.82 | | TEDx ...
(QB_NEW_EN)
[grammar] ~69-~69: There might be a mistake here.
Context: ... | 8.13 | | CommonVoice | 9.82 | | TEDx | 11.64 | Pre-traine...
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (32)
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.6.0 v20250630
- GitHub Check: py3.10 torch2.7.1 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
🔇 Additional comments (1)
egs/multi_ja_en/ASR/RESULTS.md (1)
27-31
: Decode args look consistent with the new pipeline.
--world-size 8 \ | ||
--causal 1 \ | ||
--num-epochs 10 \ | ||
--start-epoch 1 \ | ||
--use-fp16 1 \ | ||
--exp-dir zipformer/exp \ | ||
--max-duration 600 | ||
--manifest-dir data/manifests \ | ||
--enable-musan True | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Non‑streaming section sets causal=1 — flip to 0 or clarify.
Using --causal 1
here will train a streaming/causal model under a “Non‑streaming” heading.
Apply:
./zipformer/train.py \
- --world-size 8 \
- --causal 1 \
+ --world-size 8 \
+ --causal 0 \
--num-epochs 10 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp \
--manifest-dir data/manifests \
--enable-musan True
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
--world-size 8 \ | |
--causal 1 \ | |
--num-epochs 10 \ | |
--start-epoch 1 \ | |
--use-fp16 1 \ | |
--exp-dir zipformer/exp \ | |
--max-duration 600 | |
--manifest-dir data/manifests \ | |
--enable-musan True | |
``` | |
./zipformer/train.py \ | |
--world-size 8 \ | |
--causal 0 \ | |
--num-epochs 10 \ | |
--start-epoch 1 \ | |
--use-fp16 1 \ | |
--exp-dir zipformer/exp \ | |
--manifest-dir data/manifests \ | |
--enable-musan True |
🤖 Prompt for AI Agents
egs/multi_ja_en/ASR/RESULTS.md around lines 13 to 21: the Non‑streaming section
currently sets --causal 1 which configures a streaming/causal model; change
--causal 1 to --causal 0 to match the Non‑streaming heading (or alternatively
add a clarifying note above the block that this run intentionally trains a
causal/streaming model), and ensure the comment/heading and flag are consistent.
./zipformer/export-onnx.py \ | ||
--tokens ./data/lang/bbpe_2000/tokens.txt \ | ||
--use-averaged-model 0 \ | ||
--epoch 10 \ | ||
--avg 1 \ | ||
--exp-dir ./zipformer/exp | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Token path: add a quick existence check in docs or prep.
Ensure data/lang/bbpe_2000/tokens.txt
is produced by your BBPE training step before export.
Run:
🏁 Script executed:
#!/bin/bash
test -f data/lang/bbpe_2000/tokens.txt && echo "tokens.txt OK" || (echo "tokens.txt missing" && exit 1)
Length of output: 96
Add existence check for tokens.txt in export instructions
In egs/multi_ja_en/ASR/RESULTS.md (lines 37–43), prepend a test to verify data/lang/bbpe_2000/tokens.txt
exists before invoking export-onnx.py
, for example:
test -f data/lang/bbpe_2000/tokens.txt || { echo "Error: tokens.txt not found. Run BBPE training first."; exit 1; }
This ensures users generate the tokens file beforehand and avoids silent failures.
🤖 Prompt for AI Agents
In egs/multi_ja_en/ASR/RESULTS.md around lines 37 to 43, the export example
calls export-onnx.py without checking that data/lang/bbpe_2000/tokens.txt
exists; add a pre-check that verifies that file exists and, if not, prints a
clear error like "Error: tokens.txt not found. Run BBPE training first." and
exits with a non-zero status before running export-onnx.py so users don't hit
silent failures.
--world-size 8 \ | ||
--causal 1 \ | ||
--num-epochs 10 \ | ||
--start-epoch 1 \ | ||
--use-fp16 1 \ | ||
--exp-dir zipformer/exp \ | ||
--manifest-dir data/manifests \ | ||
--enable-musan True | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Streaming exp-dir inconsistent with later export path.
Train uses ./zipformer/exp
, but export uses ./zipformer/exp-15k15k-streaming
. Align them.
./zipformer/train.py \
--world-size 8 \
--causal 1 \
--num-epochs 10 \
--start-epoch 1 \
--use-fp16 1 \
- --exp-dir zipformer/exp \
+ --exp-dir zipformer/exp-15k15k-streaming \
--manifest-dir data/manifests \
--enable-musan True
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
--world-size 8 \ | |
--causal 1 \ | |
--num-epochs 10 \ | |
--start-epoch 1 \ | |
--use-fp16 1 \ | |
--exp-dir zipformer/exp \ | |
--manifest-dir data/manifests \ | |
--enable-musan True | |
``` | |
./zipformer/train.py \ | |
--world-size 8 \ | |
--causal 1 \ | |
--num-epochs 10 \ | |
--start-epoch 1 \ | |
--use-fp16 1 \ | |
--exp-dir zipformer/exp-15k15k-streaming \ | |
--manifest-dir data/manifests \ | |
--enable-musan True |
🤖 Prompt for AI Agents
In egs/multi_ja_en/ASR/RESULTS.md around lines 85 to 93, the training command
uses --exp-dir zipformer/exp while the export later references
./zipformer/exp-15k15k-streaming; make them consistent by changing one to match
the other (preferred: set --exp-dir to zipformer/exp-15k15k-streaming to match
the streaming export) so both training and export use the identical experiment
directory path.
--chunk-size 16 \ | ||
--left-context-frames 128 \ | ||
--tokens data/lang/bbpe_2000/tokens.txt \ | ||
--epoch 10 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since your dataset is large, can you also try --iter
instead of --epoch
?
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (5)
egs/mls_english/ASR/zipformer/streaming_decode.py (5)
37-37
: Use MLSEnglishHFAsrDataModule instead of ReazonSpeechAsrDataModule.This script imports
ReazonSpeechAsrDataModule
but should import the MLS English data module to align with the MLS English recipe context.This is identical to the past review comment flagging the same issue.
515-522
: Fix stream completion logic and tensor conversion.Two critical issues: all streams are unconditionally marked as finished, and tensor values need conversion to Python integers.
This matches the past review comment identifying the same issues.
695-697
: Fix variable name collision in WER summary filename.The variable
key
is shadowed by the loop variable, creating ambiguous filename generation.This matches the past review comment about the same variable scoping issue.
712-717
: Use MLSEnglishHFAsrDataModule instead of ReazonSpeechAsrDataModule.The main function uses
ReazonSpeechAsrDataModule
but should use the MLS English data module for consistency with the MLS English recipe.This matches the past review comment flagging the same data module mismatch.
856-865
: Use MLSEnglishHFAsrDataModule instead of ReazonSpeechAsrDataModule.The dataset loading uses
ReazonSpeechAsrDataModule
but should use the appropriate MLS English data module.This matches the past review comment identifying the same data module issue.
🧹 Nitpick comments (2)
egs/mls_english/ASR/zipformer/streaming_decode.py (2)
29-29
: Remove unusedpdb
import.The
pdb
import is not used in the code and should be removed.Apply this diff:
-import pdb
415-418
: Consider using iterable unpacking for state concatenation.The state list concatenation can be simplified with iterable unpacking for better readability.
Apply this diff:
- new_states = new_encoder_states + [ - new_cached_embed_left_pad, - new_processed_lens, - ] + new_states = [*new_encoder_states, new_cached_embed_left_pad, new_processed_lens]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
egs/mls_english/ASR/zipformer/streaming_decode.py
(1 hunks)egs/reazonspeech/ASR/local/compute_fbank_musan.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- egs/reazonspeech/ASR/local/compute_fbank_musan.py
🧰 Additional context used
🧬 Code graph analysis (1)
egs/mls_english/ASR/zipformer/streaming_decode.py (7)
egs/mls_english/ASR/zipformer/decode_stream.py (5)
DecodeStream
(28-148)get_feature_frames
(123-137)id
(105-106)set_features
(108-121)decoding_result
(139-148)egs/mls_english/ASR/zipformer/streaming_beam_search.py (3)
fast_beam_search_one_best
(208-295)greedy_search
(30-96)modified_beam_search
(99-205)egs/mls_english/ASR/zipformer/tokenizer.py (4)
Tokenizer
(9-169)add_arguments
(13-25)piece_to_id
(191-195)get_piece_size
(200-201)egs/mls_english/ASR/zipformer/train.py (3)
add_model_arguments
(125-258)get_model
(612-640)get_params
(482-545)icefall/checkpoint.py (2)
average_checkpoints_with_averaged_model
(389-454)find_checkpoints
(254-314)icefall/utils.py (4)
AttributeDict
(214-236)make_pad_mask
(1394-1431)store_transcripts
(558-580)str2bool
(142-158)egs/mls_english/ASR/local/utils/asr_datamodule.py (3)
add_arguments
(62-180)valid_cuts
(354-358)test_cuts
(361-365)
🪛 Ruff (0.12.2)
egs/mls_english/ASR/zipformer/streaming_decode.py
29-29: Import for pdb
found
(T100)
201-201: Do not perform function call torch.device
in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
415-418: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
511-511: Avoid specifying long messages outside the exception class
(TRY003)
663-663: Avoid specifying long messages outside the exception class
(TRY003)
774-777: Avoid specifying long messages outside the exception class
(TRY003)
779-782: Avoid specifying long messages outside the exception class
(TRY003)
803-806: Avoid specifying long messages outside the exception class
(TRY003)
808-811: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (180)
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.0 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.0 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.0 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.0 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.0 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.0 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.0 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.0 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.0 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch1.13.1 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.0.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.1 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.0 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
- GitHub Check: py3.10 torch2.1.0 v20250630
- GitHub Check: py3.10 torch2.3.1 v20250630
- GitHub Check: py3.10 torch2.2.2 v20250630
- GitHub Check: py3.10 torch1.13.0 v20250630
- GitHub Check: py3.10 torch2.4.1 v20250630
- GitHub Check: py3.10 torch2.2.1 v20250630
- GitHub Check: py3.10 torch2.1.2 v20250630
🔇 Additional comments (2)
egs/mls_english/ASR/zipformer/streaming_decode.py (2)
388-396
: Correct processed mask construction.The mask construction is correct here and properly computes padding positions without the
flip(1)
issue mentioned in past comments.
32-32
: Add missing import for Optional type hint.The
Optional
type hint is used in the suggested fix but not imported.Apply this diff:
from pathlib import Path -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional, TupleNote: This appears to already be imported correctly, so no change is needed.
Likely an incorrect or invalid review comment.
Usage: | ||
./zipformer/streaming_decode.py--epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192 | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix malformed command line example in docstring.
The usage example has a syntax error with missing space after streaming_decode.py
.
Apply this diff:
-./zipformer/streaming_decode.py--epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192
+./zipformer/streaming_decode.py --epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Usage: | |
./zipformer/streaming_decode.py--epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192 | |
""" | |
Usage: | |
./zipformer/streaming_decode.py --epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192 | |
""" |
🤖 Prompt for AI Agents
egs/mls_english/ASR/zipformer/streaming_decode.py around lines 20 to 23: the
docstring usage example is malformed because there's no space between
"streaming_decode.py" and the command-line flags; fix by inserting a space after
"streaming_decode.py" so the example reads "streaming_decode.py --epoch 28 ..."
and ensure the rest of flags remain unchanged.
def get_init_states( | ||
model: nn.Module, | ||
batch_size: int = 1, | ||
device: torch.device = torch.device("cpu"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Avoid mutable default argument with function call.
Using torch.device("cpu")
as a default argument is problematic as it creates the device object at function definition time rather than call time.
Apply this diff:
-def get_init_states(
- model: nn.Module,
- batch_size: int = 1,
- device: torch.device = torch.device("cpu"),
-) -> List[torch.Tensor]:
+def get_init_states(
+ model: nn.Module,
+ batch_size: int = 1,
+ device: Optional[torch.device] = None,
+) -> List[torch.Tensor]:
And update the function body:
+ if device is None:
+ device = torch.device("cpu")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
device: torch.device = torch.device("cpu"), | |
def get_init_states( | |
model: nn.Module, | |
batch_size: int = 1, | |
device: Optional[torch.device] = None, | |
) -> List[torch.Tensor]: | |
if device is None: | |
device = torch.device("cpu") |
🧰 Tools
🪛 Ruff (0.12.2)
201-201: Do not perform function call torch.device
in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
🤖 Prompt for AI Agents
In egs/mls_english/ASR/zipformer/streaming_decode.py around line 201, the
function currently uses torch.device("cpu") as a default argument which is
created at import time; change the parameter default to None (typing it as
Optional[torch.device]), add "from typing import Optional" if not present, and
inside the function set device = torch.device("cpu") when device is None so the
device object is created at call time; update any type hints and callers if
necessary.
|
||
|
||
The pre-trained model is available here : [reazon-research/mls-english | ||
](https://huggingface.co/reazon-research/mls-english) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This URL leads to a 404 page. Can you make it public?
This PR adds a new recipe,
mls_english
, and overhauls an existing recipe,multi_ja_en
.In this prior PR, we added the
multi_ja_en
recipe for bilingual Japanese / English models. However, we observed that models performed significantly worse at English speech compared to Japanese speech. After determining the cause to be the large imbalance between the number of hours of English data compared to Japanese data, we decided to find a larger English dataset and use it instead of LibriSpeech.Ultimately, we decided to use the English portion of Multilingual LibriSpeech (MLS English). However, this dataset did not yet have a recipe in
icefall
. So, we created a new recipe for MLS English, and then updated themulti_ja_en
to rely on bothreazonspeech
recipe andmls_english
recipe.This PR includes both full recipes, so I will close my prior PR which implements the
mls_english
recipe alone.Please see
multi_ja_en/ASR/README.md
andmls_english/ASR/README.md
for more details about each recipe.Please let me and @baileyeet know if there are any comments or concerns.
Summary by CodeRabbit
New Features
Documentation
Refactor
Chores