Bilingual Japanese / English zipformer recipe (multi_ja_en) and MLS English recipe (mls_english) #2015

kinanmartin · 2025-08-28T07:12:56Z

This PR adds a new recipe, mls_english, and overhauls an existing recipe, multi_ja_en.

In this prior PR, we added the multi_ja_en recipe for bilingual Japanese / English models. However, we observed that models performed significantly worse at English speech compared to Japanese speech. After determining the cause to be the large imbalance between the number of hours of English data compared to Japanese data, we decided to find a larger English dataset and use it instead of LibriSpeech.

Ultimately, we decided to use the English portion of Multilingual LibriSpeech (MLS English). However, this dataset did not yet have a recipe in icefall. So, we created a new recipe for MLS English, and then updated the multi_ja_en to rely on both reazonspeech recipe and mls_english recipe.

This PR includes both full recipes, so I will close my prior PR which implements the mls_english recipe alone.

Please see multi_ja_en/ASR/README.md and mls_english/ASR/README.md for more details about each recipe.

Please let me and @baileyeet know if there are any comments or concerns.

Summary by CodeRabbit

New Features
- Added MLS English ASR recipe with end-to-end prepare, train, and decode (streaming and non-streaming).
- Introduced multi-dataset data module, Hugging Face dataset support, on-the-fly features, and MUSAN augmentation.
- Added utilities: dataset downloader, subset creation by hours, transcript generation, BPE training, and manifest path updater.
- Provided tokenizer support (BPE/char).
Documentation
- New MLS English README and RESULTS with WERs and commands.
- Updated multi_ja_en README/RESULTS to use MLS English + ReazonSpeech, with revised commands and metrics.
Refactor
- Reorganized data/manifests layout and language assets; unified SentencePiece pipeline.
Chores
- Improved device selection for Whisper Fbank; added shared Zipformer links.

…pare.sh

…ed symlink to librispeech

…just datamodule to load from manifest files

…for MLS English

This reverts commit ba603e0.

…ultiDatasetAsrDataModule, not tested yet

…re.sh from commit 547f5c5

…tructure, added script to update cutset paths. WIP

…the multilingual training recipe directory structure

…o make dev and test splits have matching sizes to reazonspeech

Musan mls clean final

coderabbitai · 2025-08-28T07:13:06Z

Walkthrough

Adds a complete MLS English ASR recipe and integrates it into the multi_ja_en pipeline. Introduces dataset download, feature/materialization from Hugging Face, transcripts/BPE tooling, a new MLS English data module, and Zipformer training/decoding (non-streaming/streaming). Updates multi_ja_en to use MLS English + ReazonSpeech, overhauls manifests/lang prep, and replaces legacy LibriSpeech/char-token flows.

Changes

Cohort / File(s)	Summary
Docs (MLS English, Multi JA-EN) `egs/mls_english/ASR/README.md`, `egs/mls_english/ASR/RESULTS.md`, `egs/multi_ja_en/ASR/README.md`, `egs/multi_ja_en/ASR/RESULTS.md`	Adds MLS English recipe docs and results; updates multi_ja_en docs/results to MLS English + ReazonSpeech, new commands, metrics, and links.
MLS English data prep pipeline `egs/mls_english/ASR/prepare.sh`, `egs/mls_english/ASR/local/compute_fbank_mls_english.py`, `egs/mls_english/ASR/local/display_manifest_statistics.py`, `egs/mls_english/ASR/local/compute_fbank_musan.py`, `egs/mls_english/ASR/shared`	Stage-based prep script; switches to Hugging Face CutSet creation; updates cut filenames and adds test split; adds Musan fbank symlink; shares common assets.
MLS English utilities `egs/mls_english/ASR/local/utils/download_mls_english.py`, `egs/mls_english/ASR/local/utils/generate_transcript.py`, `egs/mls_english/ASR/local/train_bpe_model.py`, `egs/mls_english/ASR/local/utils/asr_datamodule.py`, `egs/mls_english/ASR/local/utils/create_subsets_greedy.py`	New HF dataset downloader; transcript generator from cuts; BPE trainer; MLS English HF data module with on-the-fly/precomputed features, MUSAN, SpecAugment; tool to create hour-limited Parquet subsets.
MLS English Zipformer decoding/training `egs/mls_english/ASR/zipformer/decode.py`, `egs/mls_english/ASR/zipformer/streaming_decode.py`	Full-featured non-streaming and streaming Zipformer decoders with multiple decoding methods, LM integration, graphs, WER reporting.
MLS English Zipformer pointers `egs/mls_english/ASR/zipformer/asr_datamodule.py`, `.../beam_search.py`, `.../ctc_decode.py`, `.../decode_stream.py`, `.../decoder.py`, `.../encoder_interface.py`, `.../export-onnx.py`, `.../export.py`, `.../generate_averaged_model.py`, `.../joiner.py`, `.../model.py`, `.../my_profile.py`, `.../onnx_pretrained.py`, `.../optim.py`, `.../pretrained.py`, `.../scaling.py`, `.../scaling_converter.py`, `.../streaming_beam_search.py`, `.../subsampling.py`, `.../test_scaling.py`, `.../test_subsampling.py`, `.../zipformer.py`	Adds modules referencing existing Librispeech implementations to reuse core Zipformer components.
MLS English Zipformer training (transducer) `egs/mls_english/ASR/zipformer/do_not_use_it_directly.py`	Comprehensive RNN-T training script with DDP, mixed precision, checkpointing, diagnostics, and loss scheduling.
Multi JA-EN data/lang prep changes `egs/multi_ja_en/ASR/prepare.sh`, `egs/multi_ja_en/ASR/local/prepare_lang_bbpe.py`, `egs/multi_ja_en/ASR/local/train_bbpe_model.py`, `egs/multi_ja_en/ASR/local/utils/update_cutset_paths.py`, `egs/multi_ja_en/ASR/local/prepare_lang_char.py` (removed), `egs/multi_ja_en/ASR/local/utils/asr_datamodule.py`, `egs/multi_ja_en/ASR/local/validate_bpe_lexicon.py`	Reworks manifests layout and BBPE pipeline; adds path updater; enhances BBPE trainer options/output handling; adds MUSAN integration in datamodule; removes char-lang prep; adjusts validations and defaults.
Multi JA-EN Zipformer code updates `egs/multi_ja_en/ASR/zipformer/decode.py`, `.../do_not_use_it_directly.py`, `.../streaming_decode.py`, `.../train.py`, `.../multi_dataset.py`	Switches to MultiDatasetAsrDataModule; unifies SentencePiece-based text encoding; merges MLS English + ReazonSpeech mux; simplifies test handling; updates signatures and defaults.
ReazonSpeech utility tweak `egs/reazonspeech/ASR/local/compute_fbank_musan.py`	WhisperFbank now auto-selects device and warns if falling back to CPU.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant P as prepare.sh
  participant HF as HuggingFace Hub
  participant L as Lhotse
  participant M as MUSAN
  participant SPM as SentencePiece

  U->>P: Run prepare.sh (stages 0..6)
  P->>HF: snapshot_download mls_eng
  P->>L: CutSet.from_huggingface_dataset (train/dev/test)
  P->>L: save_audios + compute_and_store_features
  P->>M: lhotse prepare & fbank
  P->>L: generate transcript.txt
  P->>SPM: train BPE (bbpe.model)
  P->>P: update_cutset_paths
  P-->>U: Manifests, features, lang ready

sequenceDiagram
  autonumber
  participant T as train.py
  participant DM as MLSEnglish/MultiDataset DataModule
  participant D as DataLoader
  participant Z as Zipformer Model
  participant CK as Checkpointer

  T->>DM: build train/valid datasets (OTF feats, MUSAN, aug)
  DM-->>D: DynamicBucketingSampler loaders
  loop per epoch
    D->>Z: batches (features, tokens)
    Z-->>T: losses (transducer/CTC)
    T->>CK: save checkpoints / best
  end

sequenceDiagram
  autonumber
  participant Dec as decode.py
  participant DM as DataModule
  participant DL as DataLoader
  participant Z as Zipformer Model
  participant G as Decoding Graph/LM
  participant Eval as WER

  Dec->>DM: test_cuts -> DataLoader
  Dec->>Z: load/avg checkpoints
  Dec->>G: build (trivial/L/G/context)
  DL->>Z: batch features
  Z-->>Dec: hypotheses (greedy/beam/fast)
  Dec->>Eval: compute WER, save results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Possibly related PRs

Musan implementation for ReazonSpeech #1988 — Similar Musan integration patterns (CutMix in datamodule; compute_fbank_musan updates), likely shares implementation approach.

Suggested reviewers

csukuangfj

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title concisely and accurately summarizes the PR’s primary changes: adding an MLS English recipe and updating the bilingual Japanese/English Zipformer recipe (multi_ja_en), matching the provided PR objectives and file summaries. It is specific, focused on the main intent, and omits extraneous detail.

Poem

A rabbit hopped through manifests’ glen,
Wove MLS with ReazonSpeech again.
Buckets brim, beams align,
Zipformer sings in tidy time.
BPE seeds in moonlit byte,
WERs fall softly—what a sight.
Hop, commit—tests green tonight! 🐇✨

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 47

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (15)

egs/tedlium3/ASR/conformer_ctc2/transformer.py (1)

606-616: Document kwargs behavior and ensure TorchScript compatibility
In egs/tedlium3/ASR/conformer_ctc2/transformer.py’s TransformerDecoderLayer.forward, append to the docstring:
           warmup:
             controls selective bypass of layers; if < 1.0, we will bypass the layer more frequently (default=1.0).
     kwargs:
       accepted for API compatibility; currently ignored.
In egs/tedlium3/ASR/conformer_ctc2/export.py (around line 277), insert before model = torch.jit.script(model):
# Prevent **kwargs in TransformerDecoderLayer.forward from breaking JIT
model.__class__.forward = torch.jit.ignore(model.__class__.forward)
to avoid scripting errors.

egs/mls_english/ASR/zipformer/optim.py (2)

103-116: Fix NameError: ‘key’ is undefined when batching parameters.

stacked_params_dict[key] = p_stacked uses key that isn’t defined in this scope.

Apply:

         for batch, batch_names in zip(batches, batches_names):
             p = batch[0]
@@
-            p_stacked.grad = grad
-            stacked_params_dict[key] = p_stacked
+            p_stacked.grad = grad
+            # derive the batch key from a representative param
+            key = (str(p.dtype), *p.shape)
+            stacked_params_dict[key] = p_stacked
             tuples.append((p_stacked, state, batch_names))

551-553: Fix undefined variable: use first_p.device instead of p.device.

p is not defined in this scope; should reference first_p.

Apply:

-            first_state["model_norms"] = torch.zeros(
-                clipping_update_period, device=p.device
-            )
+            first_state["model_norms"] = torch.zeros(
+                clipping_update_period, device=first_p.device
+            )

egs/mls_english/ASR/zipformer/my_profile.py (2)

158-163: Fix typo in get_model_profile kwarg and avoid shadowing config “params”

Typo: module_hoop_mapping -> module_hook_mapping. Also rename the returned “params” to avoid shadowing the config dict.
-    flops, params = get_model_profile(
+    flops, param_count = get_model_profile(
         model=model,
         args=(feature, feature_lens),
-        module_hoop_mapping=MODULE_HOOK_MAPPING,
+        module_hook_mapping=MODULE_HOOK_MAPPING,
     )
-    logging.info(f"For the encoder part, params: {params}, flops: {flops}")
+    logging.info(f"For the encoder part, params: {param_count}, flops: {flops}")
1-171: Remove outdated relative-path license headers in MLS English zipformer files
Remove the line
# See ../../../../LICENSE for clarification regarding multiple authors  
from all Python scripts in egs/mls_english/ASR/zipformer (my_profile.py, train.py, streaming_decode.py, decode.py, do_not_use_it_directly.py).

egs/mls_english/ASR/zipformer/zipformer.py (1)

239-251: Fix dtype bug when building feature masks (logical_and on float tensors).

torch.logical_and expects bool tensors; here it’s applied to float masks, which will error. Build boolean masks, then cast once for multiplication.

Apply this diff:

-        # mask1 shape: (1, batch_size, 1)
-        mask1 = (
-            torch.rand(1, batch_size, 1, device=x.device) > feature_mask_dropout_prob
-        ).to(x.dtype)
-
-        # mask2 has additional sequences masked, about twice the number.
-        mask2 = torch.logical_and(
-            mask1,
-            (
-                torch.rand(1, batch_size, 1, device=x.device)
-                > feature_mask_dropout_prob
-            ).to(x.dtype),
-        )
+        # mask1 shape: (1, batch_size, 1)
+        mask1_bool = torch.rand(1, batch_size, 1, device=x.device) > feature_mask_dropout_prob
+        # mask2 has additional sequences masked, about twice the number.
+        mask2_bool = torch.logical_and(
+            mask1_bool,
+            torch.rand(1, batch_size, 1, device=x.device) > feature_mask_dropout_prob,
+        )
 
-        # dim: (1, batch_size, 2)
-        mask = torch.cat((mask1, mask2), dim=-1)
+        # dim: (1, batch_size, 2) as dtype of x for downstream multiplication
+        mask = torch.cat((mask1_bool.to(x.dtype), mask2_bool.to(x.dtype)), dim=-1)

egs/multi_ja_en/ASR/local/prepare_lang_bbpe.py (1)

24-34: Fix mismatch between CLI/docs and actual BBPE model path; wire up --vocab-size and add fallback search.

Docs and help mention bbpe_<vocab_size>/bbpe.model, but the code only checks lang_dir/bbpe.model and ignores --vocab-size. This will raise FileNotFoundError in the new layout.

Apply:

@@
-    parser.add_argument(
-        "--lang-dir",
-        type=str,
-        help="""Input and output directory.
-        It should contain the words.txt file and the
-        bbpe model in a subdirectory (e.g., bbpe_2000/bbpe.model).
-        """,
-    )
+    parser.add_argument(
+        "--lang-dir",
+        type=str,
+        help="""Input and output directory.
+        It should contain words.txt and a BBPE model, either:
+          - {lang_dir}/bbpe_{vocab_size}/bbpe.model, or
+          - {lang_dir}/bbpe.model (deprecated fallback).
+        """,
+    )
@@
-    parser.add_argument(
-        "--vocab-size",
-        type=int,
-        default=2000,  # Add a default value for vocab_size for consistency
-        help="Vocabulary size used for BPE training (determines the bbpe model directory).",
-    )
+    parser.add_argument(
+        "--vocab-size",
+        type=int,
+        default=2000,
+        help="Vocabulary size used for BBPE (drives bbpe_{size}/bbpe.model lookup).",
+    )
@@
-    model_file = lang_dir / "bbpe.model"
-
-    if not model_file.is_file():
-        raise FileNotFoundError(f"BPE model not found at: {model_file}")
+    # Prefer new layout; fall back to legacy flat file.
+    candidate_new = lang_dir / f"bbpe_{args.vocab_size}" / "bbpe.model"
+    candidate_legacy = lang_dir / "bbpe.model"
+    model_file = candidate_new if candidate_new.is_file() else candidate_legacy
+    if not model_file.is_file():
+        raise FileNotFoundError(
+            "BBPE model not found. Searched:\n"
+            f"  - {candidate_new}\n"
+            f"  - {candidate_legacy}\n"
+            "Please run local/train_bbpe_model.py or adjust --vocab-size/lang-dir."
+        )
@@
-    lexicon, token_sym_table = generate_lexicon(str(model_file), words, args.oov)
+    lexicon, token_sym_table = generate_lexicon(str(model_file), words, args.oov)

Also applies to: 170-179, 188-194, 215-219, 230-231

egs/mls_english/ASR/local/compute_fbank_mls_english.py (1)

97-104: Require essential CLI arguments.

These paths are dereferenced unconditionally; make them required to avoid None/TypeErrors.

-    parser.add_argument("-m", "--manifest-dir", type=Path)
-    parser.add_argument("-a", "--audio-dir", type=Path)
-    parser.add_argument("-d", "--dl-dir", type=Path)
+    parser.add_argument("-m", "--manifest-dir", type=Path, required=True, help="Directory for feature manifests and done marker")
+    parser.add_argument("-a", "--audio-dir", type=Path, required=True, help="Directory to materialize audio per split")
+    parser.add_argument("-d", "--dl-dir", type=Path, required=True, help="HF dataset name or local path (e.g., parler-tts/mls_eng)")

egs/multi_ja_en/ASR/local/train_bbpe_model.py (2)

83-87: Stream input conversion; avoid loading whole file into memory and set encoding

readlines() loads the entire transcript and encoding is unspecified.

-def _convert_to_bchar(in_path: str, out_path: str):
-    with open(out_path, "w") as f:
-        for line in open(in_path, "r").readlines():
-            f.write(byte_encode(tokenize_by_ja_char(line)) + "\n")
+def _convert_to_bchar(in_path: str, out_path: str):
+    with open(in_path, "r", encoding="utf-8") as src, open(
+        out_path, "w", encoding="utf-8"
+    ) as dst:
+        for line in src:
+            dst.write(byte_encode(tokenize_by_ja_char(line)) + "\n")

114-117: Use delete=False for NamedTemporaryFile to avoid Windows sharing issues

Opening the same path for write while NamedTemporaryFile is open can fail on Windows.

-    temp = tempfile.NamedTemporaryFile()
-    train_text = temp.name
+    temp = tempfile.NamedTemporaryFile(delete=False)
+    train_text = temp.name
+    temp.close()

egs/mls_english/ASR/zipformer/asr_datamodule.py (1)

1-366: De-duplicate: make this a thin re-export to the local/utils module

Keep a single implementation and re-export here to avoid drift.
@@
-<remove entire file contents>
+from ..local.utils.asr_datamodule import MLSEnglishHFAsrDataModule
+
+__all__ = ["MLSEnglishHFAsrDataModule"]

egs/mls_english/ASR/zipformer/do_not_use_it_directly.py (1)

1-1262: Replace lingering ReazonSpeechAsrDataModule imports with MLSEnglishHFAsrDataModule
In all scripts under egs/mls_english/ASR/zipformer/ (e.g. do_not_use_it_directly.py, streaming_decode.py, decode.py, train.py), remove
from asr_datamodule import ReazonSpeechAsrDataModule
and instead import
from ASR.local.utils.asr_datamodule import MLSEnglishHFAsrDataModule
Update all instantiations of ReazonSpeechAsrDataModule to use MLSEnglishHFAsrDataModule(args), ensuring consistency with the sole asr_datamodule.py under ASR/local/utils.

egs/multi_ja_en/ASR/zipformer/decode.py (3)

67-72: Import error: asr_datamodule is not in this package path. Add sys.path shim.

-import torch.nn as nn
-from asr_datamodule import MultiDatasetAsrDataModule
+import torch.nn as nn
+import sys as _sys
+from pathlib import Path as _P
+_sys.path.append(str((_P(__file__).resolve().parents[1] / "local" / "utils")))
+from asr_datamodule import MultiDatasetAsrDataModule

379-392: Bug: tokenization for oracle n-best uses a list where a string is expected.

supervisions["text"] is a list. Apply normalization per-utterance.

-            num_paths=params.num_paths,
-            ref_texts=sp.encode(byte_encode(tokenize_by_ja_char(supervisions["text"]))),
+            num_paths=params.num_paths,
+            ref_texts=sp.encode(
+                [byte_encode(tokenize_by_ja_char(t)) for t in supervisions["text"]]
+            ),
             nbest_scale=params.nbest_scale,

556-564: WER summary filename uses an out-of-scope 'key'.

This will reference the last loop value or fail if empty. Remove 'key' from the filename.

-    errs_info = (
-        params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt"
-    )
+    errs_info = params.res_dir / f"wer-summary-{test_set_name}-{params.suffix}.txt"

🧹 Nitpick comments (74)

egs/mls_english/ASR/zipformer/scaling.py (1)

193-206: Optional: fix odd extra_repr formatting.

extra_repr() prints self.schedule.pairs[1:-1], dropping the end points. Use the full list for clarity.
-        return (
-            f"batch_count={self.batch_count}, schedule={str(self.schedule.pairs[1:-1])}"
-        )
+        return f"batch_count={self.batch_count}, schedule={self.schedule.pairs}"

egs/mls_english/ASR/zipformer/export.py (2)

25-29: Adjust usage/docs to reference MLS English paths

The embedded usage/help still references “librispeech” paths. Since this file lives under mls_english, update examples to avoid confusion.

Example diff:
-Note: This is a example for librispeech dataset, if you are using different
+Note: This is an example for the MLS English dataset; if you are using a different
 dataset, you should change the argument values according to your dataset.
@@
-    cd /path/to/egs/librispeech/ASR
+    cd /path/to/egs/mls_english/ASR
@@
-    cd /path/to/egs/librispeech/ASR
+    cd /path/to/egs/mls_english/ASR
Also applies to: 99-107, 115-129, 129-140

399-405: Enable CUDA automatically (or add a flag)

Export can benefit from GPU when available. Consider auto-selecting CUDA to avoid accidental CPU-only runs.
-    device = torch.device("cpu")
-    # if torch.cuda.is_available():
-    #     device = torch.device("cuda", 0)
+    device = torch.device("cuda", 0) if torch.cuda.is_available() else torch.device("cpu")

egs/mls_english/ASR/zipformer/optim.py (2)

463-463: Dead variable.

batch = True is unused; remove or document if intended for future use.
-        batch = True
480-503: Guard in-place grad scaling.

p.grad.mul_(clipping_scale) mutates grad in-place; if the same grad tensor is reused elsewhere, this could have side effects. Consider using out-of-place scaling.
-                    grad = (
-                        p.grad if clipping_scale == 1.0 else p.grad.mul_(clipping_scale)
-                    )
+                    grad = p.grad if clipping_scale == 1.0 else p.grad * clipping_scale

egs/mls_english/ASR/zipformer/my_profile.py (1)

111-111: Fix misleading shape comment after permute

The permute converts (T, N, C) -> (N, T, C), not the other way around.
-        encoder_out = encoder_out.permute(1, 0, 2)  # (N, T, C) -> (T, N, C)
+        encoder_out = encoder_out.permute(1, 0, 2)  # (T, N, C) -> (N, T, C)

egs/mls_english/ASR/zipformer/subsampling.py (2)

205-227: Docstring parameters are inaccurate/outdated

Second “layer1_channels” actually documents layer2; “bottleneck” isn’t a ctor arg; missing docs for layer3_channels and dropout.

-          layer1_channels:
-            Number of channels in layer2
-          bottleneck:
-            bottleneck dimension for 1d squeeze-excite
+          layer2_channels:
+            Number of channels in layer2
+          layer3_channels:
+            Number of channels in layer3 (input to ConvNeXt)
+          dropout:
+            Dropout probability or ScheduledFloat applied after output norm.

306-309: Stale comment: no explicit scaling of x occurs here

Comment references scaling x by 0.1, but code relies on ScaleGrad modules instead.

-        # scaling x by 0.1 allows us to use a larger grad-scale in fp16 "amp" (automatic mixed precision)
-        # training, since the weights in the first convolution are otherwise the limiting factor for getting infinite
-        # gradients.
+        # Gradient stability for mixed precision is handled above via ScaleGrad;
+        # no explicit scaling of x is performed here.

egs/mls_english/ASR/zipformer/decoder.py (1)

79-88: Guard grouped Conv1d with a divisibility check

groups=decoder_dim//4 requires decoder_dim % 4 == 0; assert early for clearer errors.
         assert context_size >= 1, context_size
+        assert decoder_dim % 4 == 0, "decoder_dim must be divisible by 4 when using groups=decoder_dim//4"
         self.context_size = context_size
         self.vocab_size = vocab_size

egs/mls_english/ASR/local/display_manifest_statistics.py (1)

34-46: Rename helper to reflect return type and simplify usage.

get_parser() returns parsed args, not a parser. Rename for clarity.

Apply this diff:

-def get_parser():
+def get_args():
     parser = argparse.ArgumentParser(
         description=ARGPARSE_DESCRIPTION,
         formatter_class=argparse.ArgumentDefaultsHelpFormatter,
     )
@@
-    return parser.parse_args()
+    return parser.parse_args()
@@
-def main():
-    args = get_parser()
+def main():
+    args = get_args()

egs/reazonspeech/ASR/local/utils/asr_datamodule.py (2)

189-194: Docstring: include cuts_musan in Args for completeness.

Apply this diff:

         Args:
           cuts_train:
             CutSet for training.
           sampler_state_dict:
             The state dict for the training sampler.
+          cuts_musan:
+            Optional CutSet of MUSAN noise to mix in via CutMix when enabled.

362-366: Fix return type annotation: this returns a single CutSet, not a list.

Apply this diff:

-    def test_cuts(self) -> List[CutSet]:
+    def test_cuts(self) -> CutSet:
@@
-        return load_manifest_lazy(
+        return load_manifest_lazy(
             self.args.manifest_dir / "reazonspeech_cuts_test.jsonl.gz"
         )

icefall/utils.py (1)

1394-1399: Make pad_left keyword-only to avoid accidental positional misuse

This preserves backward compatibility for existing positional calls that pass only lengths or (lengths, max_len).
-def make_pad_mask(
-    lengths: torch.Tensor,
-    max_len: int = 0,
-    pad_left: bool = False,
-) -> torch.Tensor:
+def make_pad_mask(
+    lengths: torch.Tensor,
+    max_len: int = 0,
+    *,
+    pad_left: bool = False,
+) -> torch.Tensor:

egs/mls_english/ASR/README.md (2)

10-12: Replace bare URLs with Markdown links

Silences MD034 and improves readability.

-- Dataset: https://huggingface.co/datasets/parler-tts/mls_eng
-- Original MLS dataset link: https://www.openslr.org/94
+- Dataset: [Hugging Face – parler-tts/mls_eng](https://huggingface.co/datasets/parler-tts/mls_eng)
+- Original MLS dataset: [OpenSLR 94](https://www.openslr.org/94)

16-19: Remove HTML/break hacks; tighten wording

Avoid backslash/HTML line breaks in Markdown; keep content concise.

-This recipe currently only supports on-the-fly feature bank computation, since `lhotse` manifests and feature banks are not pre-calculated in this recipe. This should mean that the dataset can be streamed from Hugging Face, but we have not tested this yet. We may add a version that supports pre-calculating features to better match existing recipes.\
-<br>
-
-[./RESULTS.md](./RESULTS.md) contains the latest results. This MLS English recipe was primarily developed for use in the ```multi_ja_en``` Japanese-English bilingual pipeline, which is based on MLS English and ReazonSpeech.
+This recipe currently supports only on-the-fly feature computation (no pre-calculated `lhotse` manifests/feature banks yet). Streaming from Hugging Face should work, but is untested. We may add a pre-calculated features option later.
+
+See [RESULTS.md](./RESULTS.md) for the latest results. This MLS English recipe was primarily developed for the `multi_ja_en` Japanese–English bilingual pipeline (MLS English + ReazonSpeech).

egs/multi_ja_en/ASR/RESULTS.md (3)

33-39: Nit: capitalize ONNX consistently

Minor wording polish in the heading; no code impact.

56-57: Use canonical dataset name “Common Voice”

Spelling consistency.
-| CommonVoice                 | 39.87   |
+| Common Voice               | 39.87   |
71-71: Replace bare URL with Markdown link

Improves readability and satisfies MD034.
-Pre-trained model can be found here: https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/main
+Pre-trained model: [reazon-research/reazonspeech-k2-v2-ja-en](https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/main)

egs/reazonspeech/ASR/zipformer/train.py (2)

68-69: Prefer lazy manifest loading for MUSAN to reduce memory footprint

load_manifest_lazy avoids eagerly materializing the CutSet. MUSAN isn’t huge, but lazy loading is safer and consistent with other loaders in this repo.

-from lhotse import load_manifest
+from lhotse import load_manifest_lazy

1223-1235: Switch MUSAN to lazy loading; keep graceful fallback

Reduces peak memory while keeping behavior identical when the manifest is absent.

-    if args.enable_musan:
-        musan_path = Path(args.manifest_dir) / "musan_cuts.jsonl.gz"
-        if musan_path.exists():
-            cuts_musan = load_manifest(musan_path)
-            logging.info(f"Loaded MUSAN manifest from {musan_path}")
-        else:
-            logging.warning(
-                f"MUSAN manifest not found at {musan_path}, disabling MUSAN augmentation"
-            )
-            cuts_musan = None
+    if args.enable_musan:
+        musan_path = Path(args.manifest_dir) / "musan_cuts.jsonl.gz"
+        if musan_path.exists():
+            cuts_musan = load_manifest_lazy(musan_path)
+            logging.info(f"Loaded MUSAN manifest (lazy) from {musan_path}")
+        else:
+            logging.warning(
+                f"MUSAN manifest not found at {musan_path}, disabling MUSAN augmentation"
+            )
+            cuts_musan = None
     else:
         cuts_musan = None

egs/multi_ja_en/ASR/README.md (4)

3-3: Fix stray punctuation

Removes the dangling comma/period.

-A bilingual Japanese-English ASR model developed by the developers of ReazonSpeech that utilizes ReazonSpeech and the English subset of Multilingual LibriSpeech (MLS English), .
+A bilingual Japanese–English ASR model developed by the ReazonSpeech team that utilizes ReazonSpeech and the English subset of Multilingual LibriSpeech (MLS English).

7-7: Use Markdown link for MLS English

Consistency with other references.

-**Multilingual LibriSpeech (MLS)** is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages - English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. It includes about 44.5K hours of English and a total of about 6K hours for other languages. This icefall training recipe was created for the restructured version of the English split of the dataset available on Hugging Face from `parler-tts` [here](https://huggingface.co/datasets/parler-tts/mls_eng).
+**Multilingual LibriSpeech (MLS)** is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages—English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. It includes about 44.5K hours of English and a total of about 6K hours for other languages. This recipe uses the restructured English split on Hugging Face: [`parler-tts/mls_eng`](https://huggingface.co/datasets/parler-tts/mls_eng).

15-20: Table fixes: header typo and links

Correct “Datset” → “Dataset” and replace bare URLs with links.

-|Datset| Number of hours| URL|
+|Dataset| Number of hours| URL|
@@
-|MLS English|44,500|https://huggingface.co/datasets/parler-tts/mls_eng|
-|ReazonSpeech (all)|35,000|https://huggingface.co/datasets/reazon-research/reazonspeech|
+|MLS English|44,500|[Hugging Face – parler-tts/mls_eng](https://huggingface.co/datasets/parler-tts/mls_eng)|
+|ReazonSpeech (all)|35,000|[Hugging Face – reazon-research/reazonspeech](https://huggingface.co/datasets/reazon-research/reazonspeech)|

23-35: Tighten usage steps; clarify balance note

Minor wording/structure improvements.

-This recipe relies on the `mls_english` recipe and the `reazonspeech` recipe. 
+This recipe relies on the `mls_english` and `reazonspeech` recipes.
@@
-This recipe does not enforce data balance: please ensure that the `mls_english` and `reazonspeech` datasets prepared above are balanced to your liking (you may use the utility script `create_subsets_greedy.py` in the `mls_english` recipe to create a custom-sized MLS English sub-dataset).
+This recipe does not enforce dataset balance. If desired, use `mls_english/local/utils/create_subsets_greedy.py` to create an MLS English subset sized to your needs.
@@
-0. Run `../../mls_english/ASR/prepare.sh` and `../../reazonspeech/ASR/prepare.sh`
+0. Run `../../mls_english/ASR/prepare.sh` and `../../reazonspeech/ASR/prepare.sh`
 1. Run `./prepare.sh`
-2. Run `update_cutset_paths.py` (we will soon add this to `./prepare.sh`)
+2. Run `local/utils/update_cutset_paths.py` (to be integrated into `./prepare.sh`)
 3. Run `zipformer/train.py` (see example arguments inside the file)

egs/multi_ja_en/ASR/local/prepare_lang_bbpe.py (1)

163-166: OOV decomposition sanity-check.

You emit OOV as ["▁", sp.id_to_piece(sp.unk_id())]. Ensure "▁" exists in the model pieces and downstream handles a two-piece OOV token; otherwise prefer just [sp.id_to_piece(sp.unk_id())].
-    lexicon.append((oov, ["▁", sp.id_to_piece(sp.unk_id())]))
+    unk_piece = sp.id_to_piece(sp.unk_id())
+    lexicon.append((oov, [unk_piece]))

egs/mls_english/ASR/local/utils/generate_transcript.py (2)

22-23: Remove unused import.

-from typing import Optional

60-64: Guard against None/newlines and normalize whitespace when writing transcripts.

Prevents blank/garbled lines.

-    with open(output_file, "w") as f:
-        for cut in tqdm(cuts, desc="Processing cuts"):
-            for sup in cut.supervisions:
-                f.write(f"{sup.text}\n")
+    with open(output_file, "w") as f:
+        for cut in tqdm(cuts, desc="Processing cuts"):
+            for sup in cut.supervisions:
+                if not sup.text:
+                    continue
+                txt = " ".join(str(sup.text).split())
+                f.write(txt + "\n")

egs/mls_english/ASR/local/utils/create_subsets_greedy.py (4)

111-127: Use enumerate for progress; avoid manual counter.

Cleaner and aligns with static analysis hint.

-        num_samples_processed = 0
-        for original_idx in all_original_indices:
+        for num_samples_processed, original_idx in enumerate(all_original_indices, 1):
@@
-            num_samples_processed += 1
-            if num_samples_processed % 10000 == 0: # Print progress periodically
+            if num_samples_processed % 10000 == 0:  # Print progress periodically
                 print(f"  Processed {num_samples_processed} samples for '{split_name}'. Current duration: {current_duration_ms / (3600*1000):.2f} hours")

136-141: Avoid mutating Features in-place; cast with a copied mapping.

Prevents side effects on the original Features object.

-        if "audio" in sampled_split.features and not isinstance(sampled_split.features["audio"], Audio):
-            sampling_rate = sampled_split.features["audio"].sampling_rate if isinstance(sampled_split.features["audio"], Audio) else 16000
-            new_features = sampled_split.features
-            new_features["audio"] = Audio(sampling_rate=sampling_rate)
-            sampled_split = sampled_split.cast(new_features)
+        if "audio" in sampled_split.features and not isinstance(sampled_split.features["audio"], Audio):
+            sampling_rate = 16000
+            new_features = sampled_split.features.copy()
+            new_features["audio"] = Audio(sampling_rate=sampling_rate)
+            sampled_split = sampled_split.cast(new_features)

170-175: Iterate dict directly (no .keys()); skip ‘validation’ cleanly.

-    for split_name in full_dataset.keys():
+    for split_name in full_dataset:
         if split_name not in ['train', 'dev', 'test', 'validation']: # Ensure 'validation' is not re-copied if not renamed

261-270: Remove unused f-strings in usage prints.

Silences F541 and reduces confusion.

-    print(f"\nTo use your new subset dataset, you can load it like this:")
-    print(f"from datasets import load_dataset")
-    print(f"import os, glob")
-    print(f"data_files = {{}}")
-    print(f"for split_name in ['train', 'dev', 'test']: # Or iterate through actual splits created")
-    print(f"    split_path = os.path.join('{output_subset_data_path}', f'{{split_name}}*.parquet')")
-    print(f"    files = glob.glob(split_path)")
-    print(f"    if files: data_files[split_name] = files")
-    print(f"subset = load_dataset('parquet', data_files=data_files)")
-    print(f"print(subset)")
+    print("\nTo use your new subset dataset, you can load it like this:")
+    print("from datasets import load_dataset")
+    print("import os, glob")
+    print("data_files = {}")
+    print("for split_name in ['train', 'dev', 'test']: # Or iterate through actual splits created")
+    print(f"    split_path = os.path.join('{output_subset_data_path}', f'{{split_name}}*.parquet')")
+    print("    files = glob.glob(split_path)")
+    print("    if files: data_files[split_name] = files")
+    print("subset = load_dataset('parquet', data_files=data_files)")
+    print("print(subset)")

egs/mls_english/ASR/local/compute_fbank_mls_english.py (1)

117-126: Minor: argparse defaults and path handling.

You gate recomputation on args.manifest_dir done file (good). Also ensure dl_dir is cast to str for load_dataset (already done) and audio_dir/manifest_dir exist.

     logging.basicConfig(format=formatter, level=logging.INFO)
 
-    if (args.manifest_dir / ".mls-eng-fbank.done").exists():
+    args.manifest_dir.mkdir(parents=True, exist_ok=True)
+    args.audio_dir.mkdir(parents=True, exist_ok=True)
+    if (args.manifest_dir / ".mls-eng-fbank.done").exists():
@@
-        mls_eng_hf_dataset_path = args.dl_dir # "/root/datasets/parler-tts--mls_eng"
+        mls_eng_hf_dataset_path = args.dl_dir  # e.g., "parler-tts/mls_eng" or local path
         cut_sets = make_cutset_blueprints(mls_eng_hf_dataset_path)

egs/multi_ja_en/ASR/local/train_bbpe_model.py (2)

27-31: Remove unused import

re is unused here.
-import re
102-104: Atomicity and artifacts: skip/rename behavior

Skipping solely on is_file() may miss existing symlinks. Prefer .exists().

SentencePiece produces .model and .vocab; you move only .model. Confirm .vocab isn’t needed downstream or move it too.
-    if output_model.is_file():
+    if output_model.exists():
         print(f"{output_model} exists - skipping")
         return
@@
-    shutil.move(str(temp_model_file), str(output_model))
+    shutil.move(str(temp_model_file), str(output_model))
+    temp_vocab_file = Path(model_prefix + ".vocab")
+    if temp_vocab_file.exists():
+        shutil.move(str(temp_vocab_file), str(output_model.with_suffix(".vocab")))
Also applies to: 133-134

egs/mls_english/ASR/local/utils/asr_datamodule.py (2)

26-36: Remove unused import

load_manifest is not used.

-from lhotse import CutSet, Fbank, FbankConfig, load_manifest, load_manifest_lazy
+from lhotse import CutSet, Fbank, FbankConfig, load_manifest_lazy

346-365: Avoid lru_cache on instance methods; fix return type of test_cuts

@lru_cache() on methods can retain self and leak memory (Ruff B019). Limit cache or switch to cached_property.
test_cuts returns a single CutSet, not List[CutSet].

-    @lru_cache()
-    def train_cuts(self) -> CutSet:
+    @lru_cache(maxsize=1)
+    def train_cuts(self) -> CutSet:
@@
-    @lru_cache()
-    def valid_cuts(self) -> CutSet:
+    @lru_cache(maxsize=1)
+    def valid_cuts(self) -> CutSet:
@@
-    @lru_cache()
-    def test_cuts(self) -> List[CutSet]:
+    @lru_cache(maxsize=1)
+    def test_cuts(self) -> CutSet:

egs/mls_english/ASR/zipformer/do_not_use_it_directly.py (6)

1052-1061: Docstring vs thresholds mismatch

Comment says keep 1–20s, code uses 0.3–30s. Align the comment to code (or vice versa).
-        # Keep only utterances with duration between 1 second and 20 seconds
+        # Keep only utterances with duration between 0.3 and 30.0 seconds
886-899: GradScaler growth heuristic

Manually doubling grad scale inside loop is unusual; if intentional, add a brief comment referencing rationale.

1242-1262: Unreachable code after explicit RuntimeError in main()

main() raises; the rest of main() is unreachable. If this file is strictly a library, consider removing the code below the raise to avoid confusion.

1001-1006: Model averaging dtype

model_avg = copy.deepcopy(model).to(torch.float64) ups casts all params to float64. Confirm this matches usage in checkpointing and averaging utilities (most recipes keep float32). If unintended, use to(dtype=torch.float32).

818-845: Zeroing grads order

Consider optimizer.zero_grad(set_to_none=True) before backward to save memory, as in other recipes; current order is still valid.

1-44: Usage header references a different recipe

Usage points to pruned_transducer_stateless7_streaming/train.py. If this is intentional boilerplate, ignore; otherwise update to MLS English path to prevent confusion.

egs/mls_english/ASR/RESULTS.md (2)

7-12: Tighten the table formatting and naming consistency.

Use a caption or footnote to specify decoding method variants (e.g., “modified_beam_search”) for clarity.

Keep naming consistent: “MLS English” vs “MLS-English”.
-| Type          | Greedy | Beam search |
+| Type          | Greedy | Beam search (modified_beam_search) |
14-24: Improve reproducibility details.

Consider pinning exact commits/versions for icefall, k2, lhotse, and the BPE model checksum; optionally include the exact MLS-English subset/version.

If available, append a small “Environment” section with commit hashes and package versions.

egs/reazonspeech/ASR/prepare.sh (3)

57-60: Fix subset comment mismatch and quote paths.

Comment says "small-v1" but code downloads "medium"; also quote $dl_dir to avoid word-splitting.
-  if [ ! -d $dl_dir/ReazonSpeech/downloads ]; then
-    # Download small-v1 by default.
-    lhotse download reazonspeech --subset medium $dl_dir
+  if [ ! -d "$dl_dir/ReazonSpeech/downloads" ]; then
+    # Download "medium" subset by default.
+    lhotse download reazonspeech --subset medium "$dl_dir"
61-68: Quote musan path and dl_dir to avoid globbing/word-splitting.

Safer with spaces and special chars.
-  if [ ! -d $dl_dir/musan ]; then
-    lhotse download musan $dl_dir
+  if [ ! -d "$dl_dir/musan" ]; then
+    lhotse download musan "$dl_dir"
104-111: Run Musan FBANK script via python for portability.

Avoids relying on execute bits in git.
-    ./local/compute_fbank_musan.py
+    python local/compute_fbank_musan.py

egs/reazonspeech/ASR/local/compute_fbank_musan.py (1)

64-66: Guard against os.cpu_count() returning None.

-    num_jobs = min(15, os.cpu_count())
+    cpu_cnt = os.cpu_count() or 1
+    num_jobs = min(15, cpu_cnt)

egs/multi_ja_en/ASR/zipformer/multi_dataset.py (2)

3-3: Remove unused import.

lru_cache is not used after refactor.
-from functools import lru_cache
27-47: Optional: cache CutSet mux results to avoid repeated manifest scans.

If these methods are called multiple times, consider caching (e.g., cached_property) to avoid repeated lazy-load counts.

egs/mls_english/ASR/prepare.sh (3)

66-71: Typo and quoting fix in Musan guidance.

-  # If you ha`ve predownloaded it to /path/to/musan,
+  # If you have predownloaded it to /path/to/musan,
@@
-  if [ ! -d $dl_dir/musan ] ; then
+  if [ ! -d "$dl_dir/musan" ] ; then
@@
-	  lhotse download musan $dl_dir 
+	  lhotse download musan "$dl_dir"

103-107: Run Musan FBANK script via python for portability.

-		./local/compute_fbank_musan.py
+		python local/compute_fbank_musan.py

79-89: Stage 1 sequencing is fine; consider chaining validations with '&&' for clearer failure propagation.

Not required (set -e already covers it), but improves readability.

-		python local/compute_fbank_mls_english.py \
+		python local/compute_fbank_mls_english.py \
 			--manifest-dir data/manifests \
 			--audio-dir data/audio \
-			--dl-dir $dl_dir/mls_english
-					# --dl-dir /root/datasets/parler-tts--mls_eng
-					python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_train.jsonl.gz
-					python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_dev.jsonl.gz
-					python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_test.jsonl.gz
-					touch data/manifests/.mls_english-validated.done
+			--dl-dir "$dl_dir/mls_english" && \
+		python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_train.jsonl.gz && \
+		python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_dev.jsonl.gz && \
+		python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_test.jsonl.gz && \
+		touch data/manifests/.mls_english-validated.done

egs/multi_ja_en/ASR/local/utils/asr_datamodule.py (4)

25-25: Remove unused import (ruff F401).

torch is not used.

- import torch

21-21: Remove unused import.

lru_cache is not used in this file.

-from functools import lru_cache

205-213: Prefer lazy loading for MUSAN to reduce memory footprint.

Use load_manifest_lazy instead of load_manifest.

-            cuts_musan = load_manifest(
-                self.args.manifest_dir / "musan/musan_cuts.jsonl.gz"
-            )
+            cuts_musan = load_manifest_lazy(
+                self.args.manifest_dir / "musan/musan_cuts.jsonl.gz"
+            )

281-291: Sampler buffers are hard-coded; consider exposing as CLI or deriving from dataset size.

Current buffer_size and shuffle_buffer_size may be suboptimal across datasets.

         group.add_argument(
             "--num-buckets",
             type=int,
             default=30,
             help="The number of buckets for the DynamicBucketingSampler"
             "(you might want to increase it for larger datasets).",
         )
+        group.add_argument(
+            "--sampler-buffer-size",
+            type=int,
+            default=0,
+            help="Override DynamicBucketingSampler buffer_size; 0 uses num_buckets*2000.",
+        )
+        group.add_argument(
+            "--sampler-shuffle-buffer-size",
+            type=int,
+            default=0,
+            help="Override DynamicBucketingSampler shuffle_buffer_size; 0 uses num_buckets*5000.",
+        )

             train_sampler = DynamicBucketingSampler(
                 cuts_train,
                 max_duration=self.args.max_duration,
                 shuffle=self.args.shuffle,
                 num_buckets=self.args.num_buckets,
-                buffer_size=self.args.num_buckets * 2000,
-                shuffle_buffer_size=self.args.num_buckets * 5000,
+                buffer_size=(
+                    self.args.sampler_buffer_size
+                    if self.args.sampler_buffer_size > 0
+                    else self.args.num_buckets * 2000
+                ),
+                shuffle_buffer_size=(
+                    self.args.sampler_shuffle_buffer_size
+                    if self.args.sampler_shuffle_buffer_size > 0
+                    else self.args.num_buckets * 5000
+                ),
                 drop_last=self.args.drop_last,
             )

egs/multi_ja_en/ASR/local/utils/update_cutset_paths.py (1)

143-151: Safer overwrite of symlink target.

Use unlink(missing_ok=True) and ensure parent dir exists.

-                os.unlink(symlink_path)
+                symlink_path.parent.mkdir(parents=True, exist_ok=True)
+                try:
+                    os.unlink(symlink_path)
+                except FileNotFoundError:
+                    pass
                 updated_cuts.to_file(symlink_path)

egs/multi_ja_en/ASR/prepare.sh (1)

181-185: Ensure target dir exists and quote path when symlinking validator.

Also quote expansion (SC2046).

-      ln -svfr $(realpath ../../multi_zh_en/ASR/local/validate_bpe_lexicon.py) local/
+      mkdir -p local
+      ln -svfr "$(realpath ../../multi_zh_en/ASR/local/validate_bpe_lexicon.py)" local/

egs/mls_english/ASR/zipformer/decode.py (2)

138-145: Remove unused import (ruff F401).

make_pad_mask is unused.

 from icefall.utils import (
     AttributeDict,
-    make_pad_mask,
     setup_logger,
     store_transcripts,
     str2bool,
     write_error_stats,
 )

1030-1036: Use context manager to read context file (SIM115).

Also simplify list construction.

-        if os.path.exists(params.context_file):
-            contexts = []
-            for line in open(params.context_file).readlines():
-                contexts.append(line.strip())
+        if os.path.exists(params.context_file):
+            with open(params.context_file, "r", encoding="utf-8") as f:
+                contexts = [line.strip() for line in f]
             context_graph = ContextGraph(params.context_score)
             context_graph.build(sp.encode(contexts))

egs/mls_english/ASR/zipformer/tokenizer.py (2)

171-190: Avoid accessing private members of SymbolTable.

Prefer public API to build maps.

-        self._id2sym = token_table._id2sym
-        self._sym2id = token_table._sym2id
+        self._id2sym = {v: k for k, v in token_table.symbol_to_id.items()}
+        self._sym2id = dict(token_table.symbol_to_id)

232-253: Keep tests out of production module entrypoint.

Consider moving test_CharTokenizer() into a proper test file.

-if __name__ == "__main__":
-    test_CharTokenizer()
+if __name__ == "__main__":
+    # For ad-hoc debugging only
+    test_CharTokenizer()

egs/mls_english/ASR/zipformer/streaming_decode.py (5)

25-31: Remove unused imports.

os, pdb, and subprocess are unused; drop them to keep the module clean.

 import argparse
 import logging
 import math
-import os
-import pdb
-import subprocess as sp
 from pathlib import Path
 from typing import Dict, List, Optional, Tuple

19-23: Fix CLI usage string.

Missing space after script path.

-./zipformer/streaming_decode.py--epoch 28   --avg 15   --causal 1   --chunk-size 32   --left-context-frames 256   --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192
+./zipformer/streaming_decode.py --epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192

198-203: Avoid function call in default argument (B008).

Set the default to None and initialize inside the function.

-def get_init_states(
-    model: nn.Module,
-    batch_size: int = 1,
-    device: torch.device = torch.device("cpu"),
-) -> List[torch.Tensor]:
+def get_init_states(
+    model: nn.Module,
+    batch_size: int = 1,
+    device: Optional[torch.device] = None,
+) -> List[torch.Tensor]:
@@
-    states = model.encoder.get_init_states(batch_size, device)
+    if device is None:
+        device = torch.device("cpu")
+    states = model.encoder.get_init_states(batch_size, device)

186-191: Reduce default parallel streams to avoid excessive memory.

2000 concurrent streams can blow up memory; 64 is a safer default for most setups.

-    parser.add_argument(
+    parser.add_argument(
         "--num-decode-streams",
         type=int,
-        default=2000,
+        default=64,
         help="The number of streams that can be decoded parallel.",
     )

756-763: Tokenizer: prefer explicit lang_dir usage or remove unused --bpe-model.

You already add Tokenizer args; consider removing --bpe-model from this script, or use it consistently. Also verify params.lang/params.lang_type exist from Tokenizer.add_arguments (they do).

egs/multi_ja_en/ASR/zipformer/train.py (2)

1204-1211: Minor: yoda condition.

Prefer len(tokens) > T for readability (ruff SIM300).
-        if T < len(tokens):
+        if len(tokens) > T:
1222-1232: Pre-BPE Japanese char tokenization may conflict with BPE training.

You apply tokenize_by_ja_char then rely on SentencePiece for targets. This could distort English texts and duplicate segmentation work. Either:

apply this only for Japanese lines; or

drop it and normalize text elsewhere before SP encoding.

Example conditional application (pseudo-diff; adjust placement as needed):
-    def tokenize_and_encode_text(c: Cut):
+    def tokenize_and_encode_text(c: Cut):
         # Text normalize for each sample
         text = c.supervisions[0].text
-        text = byte_encode(tokenize_by_ja_char(text))
+        # Apply JA char tokenization only if text contains Japanese scripts
+        if any("\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff" for ch in text):
+            text = byte_encode(tokenize_by_ja_char(text))
         c.supervisions[0].text = text
         return c

egs/mls_english/ASR/zipformer/train.py (4)

1181-1217: Silence overly verbose duration logs.

Commented-out warnings suggest prior noise; keep logs at debug to avoid log spam during filtering.
-        if c.duration < 1.0 or c.duration > 30.0:
-            # logging.warning(
-            #     f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"
-            # )
+        if c.duration < 1.0 or c.duration > 30.0:
+            logging.debug(
+                f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"
+            )
             return False
862-869: Rename unused loop variable in validation.

batch_idx isn’t used; use _ to reflect intent (ruff B007).
-    for batch_idx, batch in enumerate(valid_dl):
+    for _, batch in enumerate(valid_dl):
702-705: Simplify nested if.

Combine nested if checks when setting start_epoch from checkpoints.
-    if params.start_batch > 0:
-        if "cur_epoch" in saved_params:
-            params["start_epoch"] = saved_params["cur_epoch"]
+    if params.start_batch > 0 and "cur_epoch" in saved_params:
+        params["start_epoch"] = saved_params["cur_epoch"]
323-336: Arg redundancy: --lang-dir vs Tokenizer.add_arguments.

You add a custom --lang-dir and also Tokenizer.add_arguments (which adds --lang, --lang-type). Consider using only the Tokenizer interface or harmonize both to avoid confusion.

Also applies to: 1382-1385

egs/mls_english/ASR/local/train_bpe_model.py

coderabbitai · 2025-08-28T07:32:22Z

egs/mls_english/ASR/local/train_bpe_model.py

+    model_file = Path(model_prefix + ".model")
+    if not model_file.is_file():
+        spm.SentencePieceTrainer.train(
+            input=train_text,
+            vocab_size=vocab_size,
+            model_type=model_type,
+            model_prefix=model_prefix,
+            input_sentence_size=input_sentence_size,
+            character_coverage=args.character_coverage,
+            user_defined_symbols=user_defined_symbols,
+            byte_fallback=args.byte_fallback,
+            unk_id=unk_id,
+            bos_id=-1,
+            eos_id=-1,
+        )
+    else:
+        print(f"{model_file} exists - skipping")
+        return
+
+    shutil.copyfile(model_file, f"{lang_dir}/bpe.model")
+
+


🛠️ Refactor suggestion

Always sync bpe_.model to bpe.model, even when training is skipped.

Currently, if the versioned model exists, you return early without copying, leaving bpe.model stale/missing.

- if not model_file.is_file(): + if not model_file.is_file(): spm.SentencePieceTrainer.train( input=train_text, @@ ) - else: - print(f"{model_file} exists - skipping") - return - - shutil.copyfile(model_file, f"{lang_dir}/bpe.model") + else: + print(f"{model_file} exists - skipping training") + # Ensure canonical symlink/copy is updated + shutil.copyfile(model_file, f"{lang_dir}/bpe.model")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

model_file = Path(model_prefix + ".model")

if not model_file.is_file():

spm.SentencePieceTrainer.train(

input=train_text,

vocab_size=vocab_size,

model_type=model_type,

model_prefix=model_prefix,

input_sentence_size=input_sentence_size,

character_coverage=args.character_coverage,

user_defined_symbols=user_defined_symbols,

byte_fallback=args.byte_fallback,

unk_id=unk_id,

bos_id=-1,

eos_id=-1,

)

else:

print(f"{model_file} exists - skipping")

return

shutil.copyfile(model_file, f"{lang_dir}/bpe.model")

model_file = Path(model_prefix + ".model")

if not model_file.is_file():

spm.SentencePieceTrainer.train(

input=train_text,

vocab_size=vocab_size,

model_type=model_type,

model_prefix=model_prefix,

input_sentence_size=input_sentence_size,

character_coverage=args.character_coverage,

user_defined_symbols=user_defined_symbols,

byte_fallback=args.byte_fallback,

unk_id=unk_id,

bos_id=-1,

eos_id=-1,

)

else:

print(f"{model_file} exists - skipping training")

# Ensure canonical symlink/copy is updated

shutil.copyfile(model_file, f"{lang_dir}/bpe.model")

🤖 Prompt for AI Agents

In egs/mls_english/ASR/local/train_bpe_model.py around lines 91 to 112, the function returns early when the versioned model file exists and therefore never syncs model_prefix+".model" to "{lang_dir}/bpe.model"; change the control flow so the shutil.copyfile(model_file, f"{lang_dir}/bpe.model") is executed regardless of whether training ran or was skipped (e.g., remove the early return and only skip training but still perform the copy, or move the copy call out of the if/else block after the training check), ensuring model_file points to the correct Path before copying and handling/reporting any copy errors.

egs/mls_english/ASR/local/utils/asr_datamodule.py

egs/mls_english/ASR/local/utils/create_subsets_greedy.py

egs/reazonspeech/ASR/local/compute_fbank_musan.py

coderabbitai · 2025-08-28T07:32:28Z

egs/reazonspeech/ASR/local/utils/asr_datamodule.py

        transforms = []
+        if cuts_musan is not None:
+            logging.info("Enable MUSAN")
+            transforms.append(
+                CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True)
+            )
+        else:
+            logging.info("Disable MUSAN")
+


🛠️ Refactor suggestion

Honor the CLI flag when enabling MUSAN and warn if requested but unavailable.

Currently MUSAN is applied whenever cuts_musan is passed, ignoring --enable-musan. Gate on both for predictable behavior, and warn if enabled but missing data.

Apply this diff:

- if cuts_musan is not None: - logging.info("Enable MUSAN") - transforms.append( - CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True) - ) - else: - logging.info("Disable MUSAN") + if self.args.enable_musan: + if cuts_musan is not None: + logging.info("Enable MUSAN") + transforms.append( + CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True) + ) + else: + logging.warning("MUSAN requested but no cuts provided; disabling MUSAN.") + else: + logging.info("Disable MUSAN")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

transforms = []

if cuts_musan is not None:

logging.info("Enable MUSAN")

transforms.append(

CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True)

)

else:

logging.info("Disable MUSAN")

transforms = []

if self.args.enable_musan:

if cuts_musan is not None:

logging.info("Enable MUSAN")

transforms.append(

CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True)

)

else:

logging.warning("MUSAN requested but no cuts provided; disabling MUSAN.")

else:

logging.info("Disable MUSAN")

🤖 Prompt for AI Agents

In egs/reazonspeech/ASR/local/utils/asr_datamodule.py around lines 196 to 204, the code currently enables MUSAN whenever cuts_musan is provided regardless of the --enable-musan CLI flag; change the logic to only append the CutMix transform when BOTH cuts_musan is not None and enable_musan is truthy, log.info when MUSAN is enabled, log.info when MUSAN is explicitly disabled via the flag, and emit a logging.warning if enable_musan is True but cuts_musan is None to warn the user that MUSAN was requested but data is unavailable.

coderabbitai · 2025-08-28T07:32:28Z

egs/reazonspeech/ASR/musan-k2-v2-reazonspeech-medium

@@ -0,0 +1 @@
+Subproject commit 157dc62a6439f03de1fcccb4914fc0d30b6f21b8


💡 Verification agent

🧩 Analysis chain

Submodule pointer update: verify gitlink, .gitmodules entry, CI/docs, and license.

Ensure this SHA is reachable and intentional (ideally tagged for traceability).

Confirm .gitmodules contains the correct URL/path for this submodule.

Make sure CI and READMEs instruct users to init/update submodules; otherwise Musan-dependent stages will fail.

Verify the submodule repo’s license is compatible and referenced in the top-level NOTICE/README.

Run:

🏁 Script executed:

#!/bin/bash set -euo pipefail path="egs/reazonspeech/ASR/musan-k2-v2-reazonspeech-medium" sha_expected="157dc62a6439f03de1fcccb4914fc0d30b6f21b8" echo "== gitlink at HEAD ==" git ls-tree HEAD "$path" || true echo "== verify SHA ==" if git ls-tree HEAD "$path" | awk '{print $3}' | grep -q "$sha_expected"; then echo "OK: gitlink matches expected SHA" else echo "Mismatch: gitlink SHA does not match $sha_expected"; exit 1 fi echo "== .gitmodules entries ==" if [ -f .gitmodules ]; then git config -f .gitmodules --get-regexp 'submodule\..*' || true echo "Filtered for path:" rg -nP "(?s)\\[submodule\\s+\".*\"\\].*?\\n\\s*path\\s*=\\s*$path\\b" .gitmodules -n -C2 || true else echo "No .gitmodules found"; exit 1 fi echo "== READMEs mention submodule init? ==" rg -n -S 'git submodule (update|init)' --glob 'egs/**/ASR/**/README*' || true echo "== prepare scripts mention submodule? ==" rg -n -S 'git submodule (update|init)' --glob 'egs/**/ASR/**/prepare.sh' || true echo "== license file in submodule (if initialized) ==" if [ -d "$path/.git" ] || [ -f "$path/.git" ]; then fd -HI -a -t f -d 2 '(LICENSE|COPYING|NOTICE)(\..*)?$' "$path" || true else echo "Submodule not initialized; skip license check" fi

Length of output: 774

Add missing .gitmodules entry and update submodule docs

The .gitmodules file is currently absent; add a section for
egs/reazonspeech/ASR/musan-k2-v2-reazonspeech-medium pointing to the correct repo URL at commit 157dc62a6439f03de1fcccb4914fc0d30b6f21b8

Update CI configs and any egs/reazonspeech/ASR/README* or prepare.sh to include git submodule init --recursive and git submodule update --init --recursive

After initializing the submodule, verify its LICENSE (or NOTICE) is compatible and reference it in the top-level NOTICE/README

🤖 Prompt for AI Agents

In egs/reazonspeech/ASR/musan-k2-v2-reazonspeech-medium around lines 1-1, add a .gitmodules entry mapping this path to the repository URL at commit 157dc62a6439f03de1fcccb4914fc0d30b6f21b8; then update CI configs and any egs/reazonspeech/ASR/README* and prepare.sh to run git submodule init --recursive and git submodule update --init --recursive (or equivalent pipeline steps) so the submodule is cloned and populated during CI and local setup; finally, after initializing the submodule locally in your branch, inspect its LICENSE/NOTICE for compatibility and add a reference to that license in the top-level NOTICE/README as required.

coderabbitai · 2025-08-28T07:32:28Z

egs/reazonspeech/ASR/zipformer/train.py

+    if args.enable_musan:
+        musan_path = Path(args.manifest_dir) / "musan_cuts.jsonl.gz"
+        if musan_path.exists():
+            cuts_musan = load_manifest(musan_path)
+            logging.info(f"Loaded MUSAN manifest from {musan_path}")
+        else:
+            logging.warning(
+                f"MUSAN manifest not found at {musan_path}, disabling MUSAN augmentation"
+            )
+            cuts_musan = None
+    else:
+        cuts_musan = None
+


💡 Verification agent

🧩 Analysis chain

Verify CLI flag is plumbed: --enable_musan

Ensure ReazonSpeechAsrDataModule.add_arguments (or parser) defines --enable_musan with a sensible default and that prepare.sh documents generating musan_cuts.jsonl.gz.

🏁 Script executed:

#!/bin/bash # Verify the flag and where it's consumed. rg -nC2 -g 'egs/reazonspeech/**' -P '\benable_musan\b' # Verify MUSAN manifest generation is documented or scripted. rg -nC2 -g 'egs/reazonspeech/**' -P 'musan_cuts\.jsonl\.gz|MUSAN'

Length of output: 3111

#!/bin/bash # Locate prepare.sh in the egs/reazonspeech directory find egs/reazonspeech -type f -name prepare.sh -print

#!/bin/bash # List top-level files in egs/reazonspeech to find any prepare or doc scripts find egs/reazonspeech -maxdepth 2 -type f # Search for any README or documentation files under egs/reazonspeech find egs/reazonspeech -type f -iname "*readme*" -o -iname "*.md"

🏁 Script executed:

#!/bin/bash # List all shell scripts in the egs/reazonspeech directory find egs/reazonspeech -type f -name "*.sh"

Length of output: 76

🏁 Script executed:

#!/bin/bash sed -n '1,200p' egs/reazonspeech/ASR/prepare.sh

Length of output: 3949

🏁 Script executed:

#!/bin/bash # Display beginning of compute_fbank_musan.py to confirm manifest creation sed -n '1,200p' egs/reazonspeech/ASR/local/compute_fbank_musan.py

Length of output: 4444

Document musan_cuts.jsonl.gz generation in prepare.sh

--enable_musan is already defined in ReazonSpeechAsrDataModule (default False).

In egs/reazonspeech/ASR/prepare.sh, Stage 4 invokes compute_fbank_musan.py but doesn’t explicitly mention the output. Add a comment like
# This generates data/manifests/musan_cuts.jsonl.gz
before running ./local/compute_fbank_musan.py.

🤖 Prompt for AI Agents

In egs/reazonspeech/ASR/zipformer/train.py around lines 1223 to 1235, the code expects a musan_cuts.jsonl.gz manifest when --enable_musan is used but the project's prepare script doesn't document that file's generation; update egs/reazonspeech/ASR/prepare.sh (Stage 4) to add an inline comment immediately before invoking ./local/compute_fbank_musan.py stating that this step generates data/manifests/musan_cuts.jsonl.gz (e.g. "# This generates data/manifests/musan_cuts.jsonl.gz"), so users running preparation know where the musan manifest comes from and the train.py check will succeed when --enable_musan is set.

coderabbitai · 2025-08-28T07:32:28Z

icefall/utils.py

    max_len = max(max_len, lengths.max())
    n = lengths.size(0)
    seq_range = torch.arange(0, max_len, device=lengths.device)
-    expaned_lengths = seq_range.unsqueeze(0).expand(n, max_len)
+    expanded_lengths = seq_range.unsqueeze(0).expand(n, max_len)
+
+    if pad_left:
+        mask = expanded_lengths < (max_len - lengths).unsqueeze(1)
+    else:
+        mask = expanded_lengths >= lengths.unsqueeze(-1)

-    return expaned_lengths >= lengths.unsqueeze(-1)
+    return mask


⚠️ Potential issue

Fix potential int/Tensor mixing in max_len computation

max(max_len, lengths.max()) can raise a TypeError (int vs Tensor). Cast to Python ints to make it torch/version-agnostic. Also keep return path unchanged.

- max_len = max(max_len, lengths.max()) + max_len = max(int(max_len), int(lengths.max().item()))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

max_len = max(max_len, lengths.max())

n = lengths.size(0)

seq_range = torch.arange(0, max_len, device=lengths.device)

expaned_lengths = seq_range.unsqueeze(0).expand(n, max_len)

expanded_lengths = seq_range.unsqueeze(0).expand(n, max_len)

if pad_left:

mask = expanded_lengths < (max_len - lengths).unsqueeze(1)

else:

mask = expanded_lengths >= lengths.unsqueeze(-1)

return expaned_lengths >= lengths.unsqueeze(-1)

return mask

# ensure both sides are Python ints to avoid int/Tensor comparison issues

max_len = max(int(max_len), int(lengths.max().item()))

n = lengths.size(0)

seq_range = torch.arange(0, max_len, device=lengths.device)

expanded_lengths = seq_range.unsqueeze(0).expand(n, max_len)

if pad_left:

mask = expanded_lengths < (max_len - lengths).unsqueeze(1)

else:

mask = expanded_lengths >= lengths.unsqueeze(-1)

return mask

🤖 Prompt for AI Agents

In icefall/utils.py around lines 1421 to 1431, the computation max_len = max(max_len, lengths.max()) mixes a Python int and a torch Tensor which can raise a TypeError; fix it by converting the tensor result to a Python int (e.g., lengths.max().item() or int(lengths.max().item())) before calling max so max_len stays a plain int, then proceed with seq_range/expanded_lengths and return mask unchanged.

coderabbitai

Review continued from previous batch...

coderabbitai · 2025-08-28T07:33:36Z

egs/mls_english/ASR/zipformer/streaming_decode.py

+from asr_datamodule import ReazonSpeechAsrDataModule
+from decode_stream import DecodeStream


⚠️ Potential issue

Use MLS English datamodule instead of ReazonSpeech here.

This script is under MLS English recipe but wires ReazonSpeech datamodule. Replace with MLSEnglishHFAsrDataModule throughout.

-from asr_datamodule import ReazonSpeechAsrDataModule +from asr_datamodule import MLSEnglishHFAsrDataModule @@ def main(): parser = get_parser() - ReazonSpeechAsrDataModule.add_arguments(parser) + MLSEnglishHFAsrDataModule.add_arguments(parser) Tokenizer.add_arguments(parser) @@ - reazonspeech_corpus = ReazonSpeechAsrDataModule(args) - - valid_cuts = reazonspeech_corpus.valid_cuts() - test_cuts = reazonspeech_corpus.test_cuts() + mls_english_corpus = MLSEnglishHFAsrDataModule(args) + + valid_cuts = mls_english_corpus.valid_cuts() + test_cuts = mls_english_corpus.test_cuts()

Also applies to: 712-717, 856-865

🤖 Prompt for AI Agents

In egs/mls_english/ASR/zipformer/streaming_decode.py around lines 37-38 (and also update occurrences at 712-717 and 856-865), the file imports and uses ReazonSpeechAsrDataModule but should use the MLS English datamodule; replace the import from "from asr_datamodule import ReazonSpeechAsrDataModule" with the correct MLS English import "from ml_eng_hf_datamodule import MLSEnglishHFAsrDataModule" (or the actual module path used in this repo), then update all instantiations, type hints, and references from ReazonSpeechAsrDataModule to MLSEnglishHFAsrDataModule across the noted line ranges so the script consistently constructs and uses the MLSEnglishHFAsrDataModule instance. Ensure any configuration keys or constructor arguments match the MLSEnglishHFAsrDataModule API and adjust variable names if needed.

egs/mls_english/ASR/zipformer/streaming_decode.py

coderabbitai · 2025-08-28T07:33:36Z

egs/mls_english/ASR/zipformer/streaming_decode.py

+    finished_streams = []
+    for i in range(len(decode_streams)):
+        decode_streams[i].states = states[i]
+        decode_streams[i].done_frames += encoder_out_lens[i]
+        # if decode_streams[i].done:
+        # finished_streams.append(i)
+        finished_streams.append(i)
+


⚠️ Potential issue

All streams are marked finished; plus int/ Tensor add bug.

Only finalize streams that are truly done. Also convert encoder_out_lens[i] to Python int when accumulating done_frames.

finished_streams = [] for i in range(len(decode_streams)): decode_streams[i].states = states[i] - decode_streams[i].done_frames += encoder_out_lens[i] - # if decode_streams[i].done: - # finished_streams.append(i) - finished_streams.append(i) + decode_streams[i].done_frames += int(encoder_out_lens[i].item()) + if decode_streams[i].done: + finished_streams.append(i)

🤖 Prompt for AI Agents

In egs/mls_english/ASR/zipformer/streaming_decode.py around lines 515 to 522, fix two issues: only mark streams as finished when they are actually done, and ensure encoder_out_lens[i] is converted to a Python int before adding to done_frames. Replace the unconditional finished_streams.append(i) with a conditional that appends i only if decode_streams[i].done is truthy, and change the done_frames increment to use an integer conversion like done_frames += int(encoder_out_lens[i]) (or encoder_out_lens[i].item() if it's a torch tensor).

coderabbitai · 2025-08-28T07:33:36Z

egs/mls_english/ASR/zipformer/streaming_decode.py

+    test_set_wers = sorted(test_set_wers.items(), key=lambda x: x[1])
+    errs_info = (
+        params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt"
+    )
+    with open(errs_info, "w") as f:
+        print("settings\tWER", file=f)
+        for key, val in test_set_wers:
+            print("{}\t{}".format(key, val), file=f)
+


⚠️ Potential issue

Summary WER filename uses an out-of-scope/ambiguous key.

Don’t include key (last-loop variable) in the summary filename.

- errs_info = ( - params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt" - ) + errs_info = params.res_dir / f"wer-summary-{test_set_name}-{params.suffix}.txt"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

test_set_wers = sorted(test_set_wers.items(), key=lambda x: x[1])

errs_info = (

params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt"

)

with open(errs_info, "w") as f:

print("settings\tWER", file=f)

for key, val in test_set_wers:

print("{}\t{}".format(key, val), file=f)

test_set_wers = sorted(test_set_wers.items(), key=lambda x: x[1])

- errs_info = (

- params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt"

errs_info = params.res_dir / f"wer-summary-{test_set_name}-{params.suffix}.txt"

with open(errs_info, "w") as f:

print("settings\tWER", file=f)

for key, val in test_set_wers:

print("{}\t{}".format(key, val), file=f)

🤖 Prompt for AI Agents

In egs/mls_english/ASR/zipformer/streaming_decode.py around lines 695 to 703, the summary WER filename is constructed using the loop variable `key` which is shadowed by the later for-loop and therefore ambiguous/out-of-scope; change the sorting and loop to use distinct variable names (e.g., rename sorted items to `sorted_wers` or use `k, v` in the loop) and use the correct intended identifier for the filename (for example `method_key`, `summary_key`, or the outer scoped name such as `test_set_name` or `params.suffix`) so the filename no longer depends on the loop variable; update the open and print loop to iterate over the renamed variables accordingly.

coderabbitai · 2025-08-28T07:33:36Z

egs/mls_english/ASR/zipformer/train.py

+    mls_english_corpus = MLSEnglishHFAsrDataModule(args)
+    train_cuts = mls_english_corpus.train_cuts()
+    # mls_english_corpus.load_dataset(args.dataset_path)


🛠️ Refactor suggestion

Train-time filtering is defined but not used.

Apply remove_short_and_long_utt to train_cuts.

mls_english_corpus = MLSEnglishHFAsrDataModule(args) train_cuts = mls_english_corpus.train_cuts() + train_cuts = train_cuts.filter(remove_short_and_long_utt)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

mls_english_corpus = MLSEnglishHFAsrDataModule(args)

train_cuts = mls_english_corpus.train_cuts()

# mls_english_corpus.load_dataset(args.dataset_path)

mls_english_corpus = MLSEnglishHFAsrDataModule(args)

train_cuts = mls_english_corpus.train_cuts()

train_cuts = train_cuts.filter(remove_short_and_long_utt)

# mls_english_corpus.load_dataset(args.dataset_path)

🤖 Prompt for AI Agents

egs/mls_english/ASR/zipformer/train.py around lines 1218-1220: train-time filtering function remove_short_and_long_utt is defined but never applied to the training cuts; call the filter on train_cuts and reassign (e.g., train_cuts = remove_short_and_long_utt(train_cuts) or, if the function accepts duration bounds, pass the configured values from args like train_cuts = remove_short_and_long_utt(train_cuts, args.min_duration, args.max_duration)) so the training set respects the min/max utterance length.

coderabbitai · 2025-08-28T07:33:37Z

egs/mls_english/ASR/zipformer/train.py

+    if args.enable_musan:
+        musan_path = Path(args.manifest_dir) / "musan_cuts.jsonl.gz"
+        if musan_path.exists():
+            cuts_musan = load_manifest(musan_path)
+            logging.info(f"Loaded MUSAN manifest from {musan_path}")
+        else:
+            logging.warning(f"MUSAN manifest not found at {musan_path}, disabling MUSAN augmentation")
+            cuts_musan = None
+    else:
+        cuts_musan = None
+


⚠️ Potential issue

MUSAN manifest is loaded but never passed to dataloader.

Wire cuts_musan into train_dataloaders so augmentation actually happens.

- train_dl = mls_english_corpus.train_dataloaders( - train_cuts, sampler_state_dict=sampler_state_dict - ) + train_dl = mls_english_corpus.train_dataloaders( + train_cuts, sampler_state_dict=sampler_state_dict, cuts_musan=cuts_musan + )

Also applies to: 1240-1245

🧰 Tools

🪛 Ruff (0.12.2)

1238-1238: Local variable cuts_musan is assigned to but never used

Remove assignment to unused variable cuts_musan

(F841)

🤖 Prompt for AI Agents

In egs/mls_english/ASR/zipformer/train.py around lines 1229-1239 (and similarly 1240-1245), cuts_musan is loaded but never forwarded into the dataloader creation; pass cuts_musan into the train_dataloaders call (e.g., as a keyword argument cuts_musan=cuts_musan) where train/validation dataloaders are constructed, and if train_dataloaders (or its upstream helpers) don't accept this parameter, update their signature to accept and forward cuts_musan into the augmentation/data pipeline so the MUSAN augmentation is actually applied.

coderabbitai · 2025-08-28T07:33:37Z

egs/multi_ja_en/ASR/zipformer/train.py

    def remove_short_and_long_utt(c: Cut):
-        # Keep only utterances with duration between 1 second and 20 seconds
-        #
-        # Caution: There is a reason to select 20.0 here. Please see
-        # ../local/display_manifest_statistics.py
+
+        # Keep only utterances greater than 1 second
        #
        # You should use ../local/display_manifest_statistics.py to get
        # an utterance duration distribution for your dataset to select
-        # the threshold
-        # if c.duration < 1.0 or c.duration > 30.0:
-        #     logging.warning(
-        #         f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"
-        #     )
-        #     return False
+        # the threshold as this is dependent on which datasets you choose
+        if c.duration < 1.0:
+            logging.warning(
+                f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"
+            )
+            return False

        # In pruned RNN-T, we require that T >= S


🛠️ Refactor suggestion

Reintroduce an upper duration cap to prevent OOM/outliers.

Earlier recipes capped at 30s; current filter only enforces ≥1s.

- # Keep only utterances greater than 1 second + # Keep only utterances within a safe duration range @@ - if c.duration < 1.0: + if c.duration < 1.0 or c.duration > 30.0: logging.warning( f"Exclude cut with ID {c.id} from training. Duration: {c.duration}" ) return False

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def remove_short_and_long_utt(c: Cut):

# Keep only utterances with duration between 1 second and 20 seconds

#

# Caution: There is a reason to select 20.0 here. Please see

# ../local/display_manifest_statistics.py

# Keep only utterances greater than 1 second

#

# You should use ../local/display_manifest_statistics.py to get

# an utterance duration distribution for your dataset to select

# the threshold

# if c.duration < 1.0 or c.duration > 30.0:

# logging.warning(

# f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"

# )

# return False

# the threshold as this is dependent on which datasets you choose

if c.duration < 1.0:

logging.warning(

f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"

)

return False

# In pruned RNN-T, we require that T >= S

def remove_short_and_long_utt(c: Cut):

# Keep only utterances within a safe duration range

#

# You should use ../local/display_manifest_statistics.py to get

# an utterance duration distribution for your dataset to select

# the threshold as this is dependent on which datasets you choose

if c.duration < 1.0 or c.duration > 30.0:

logging.warning(

f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"

)

return False

# In pruned RNN-T, we require that T >= S

🤖 Prompt for AI Agents

In egs/multi_ja_en/ASR/zipformer/train.py around lines 1187 to 1200, the cut filter enforces a minimum duration but lacks an upper duration cap which can allow outlier long utterances to cause OOM; add a configurable maximum duration check (e.g., MAX_UTT_DURATION = 30.0) and return False for cuts with c.duration > MAX_UTT_DURATION while logging a warning with the cut id and duration; place the constant near other config values or at module top and reference it in the if-statement so the threshold is easy to adjust based on dataset statistics.

pingfengluo · 2025-08-29T08:02:02Z

This PR adds a new recipe, mls_english, and overhauls an existing recipe, multi_ja_en.

In this prior PR, we added the multi_ja_en recipe for bilingual Japanese / English models. However, we observed that models performed significantly worse at English speech compared to Japanese speech. After determining the cause to be the large imbalance between the number of hours of English data compared to Japanese data, we decided to find a larger English dataset and use it instead of LibriSpeech.

Ultimately, we decided to use the English portion of Multilingual LibriSpeech (MLS English). However, this dataset did not yet have a recipe in icefall. So, we created a new recipe for MLS English, and then updated the multi_ja_en to rely on both reazonspeech recipe and mls_english recipe.

This PR includes both full recipes, so I will close my prior PR which implements the mls_english recipe alone.

Please see multi_ja_en/ASR/README.md and mls_english/ASR/README.md for more details about each recipe.

Please let me and @baileyeet know if there are any comments or concerns.

Summary by CodeRabbit

New Features

Added MLS English ASR recipe with end-to-end prepare, training, decoding, streaming, and tokenizer utilities.

Introduced tools for dataset download, transcript generation, BBPE training, subset creation, and manifest path updates.

Enabled MUSAN augmentation and feature computation across recipes.

Added left-padding support to pad mask utilities.

Refactor

Unified multi-dataset pipeline, replacing prior single-dataset data modules and updating default manifest/layout paths.

Decoders accept extra keyword arguments for broader compatibility.

Documentation

New MLS English README and RESULTS with commands and WER.

Updated bilingual (multi_ja_en) docs, usage, results, and preparation steps.

Not sure if adding a language tag prompt to the decoder/predictor can alleviate the issue of imbalanced multilingual data; it would be somewhat similar to the prompting approach Whisper uses for multi-language support.

kinanmartin · 2025-09-01T10:04:23Z

Not sure if adding a language tag prompt to the decoder/predictor can alleviate the issue of imbalanced multilingual data; it would be somewhat similar to the prompting approach Whisper uses for multi-language support.

@pingfengluo I see, this could work well to increase the accuracy of prediction for one-language clips! But at the same time, I worry it could make it harder for the model to accurately code-switch when predicting bilingual audio without further explicit fine-tuning on labeled code-switching data. If you think the benefit would be large enough though, perhaps we can modify the code to add it.

csukuangfj · 2025-09-01T13:44:30Z

@JinZr Can you have a look at this PR?

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

egs/mls_english/ASR/local/compute_fbank_mls_english.py (1)
119-119: Guard against os.cpu_count() returning None.
-    num_jobs = min(16, os.cpu_count())
+    num_jobs = min(16, os.cpu_count() or 1)

♻️ Duplicate comments (3)

egs/mls_english/ASR/local/utils/download_mls_english.py (2)

8-13: Parameterize HF repo and pin a revision for reproducibility (and align with future mirrors).

Hard-coding the repo makes runs brittle. Expose repo_id/revision/token with sensible defaults.

@@
-import argparse
-import os
-import sys
+import argparse
+import os
+import sys
+from typing import Optional
@@
-def download_dataset(dl_dir):
+def download_dataset(
+    dl_dir: str,
+    repo_id: str = "parler-tts/mls_eng",
+    revision: Optional[str] = None,
+    token: Optional[str] = None,
+) -> None:
@@
-    repo_id = "parler-tts/mls_eng"
     local_dataset_dir = os.path.join(dl_dir, "mls_english")
@@
-        snapshot_download(
+        snapshot_download(
             repo_id=repo_id,
             repo_type="dataset",
             local_dir=local_dataset_dir,
             local_dir_use_symlinks=False,
+            revision=revision,
+            token=token,
         )

31-33: Do not exit from library code; let exceptions propagate and handle in CLI.

Let callers handle failures; wrap the CLI call and exit non-zero there. Add flags to pass repo/revision; read token from HF_TOKEN.

@@
-    except Exception as e:
-        print(f"Error downloading dataset '{repo_id}': {e}", file=sys.stderr)
-        sys.exit(1)
+    except Exception as e:
+        print(f"Error downloading dataset '{repo_id}': {e}", file=sys.stderr)
+        raise
@@
-    parser = argparse.ArgumentParser(
+    parser = argparse.ArgumentParser(
         description="Download MLS English dataset from Hugging Face."
     )
@@
     parser.add_argument(
         "--dl-dir",
         type=str,
         required=True,
         help="The base directory where the 'mls_english' dataset will be downloaded.",
     )
+    parser.add_argument("--repo-id", type=str, default="parler-tts/mls_eng",
+                        help="HF dataset repo-id or local dataset path.")
+    parser.add_argument("--revision", type=str, default=None,
+                        help="HF revision/commit/tag for reproducible snapshots.")
     args = parser.parse_args()
 
-    download_dataset(args.dl_dir)
+    try:
+        download_dataset(
+            args.dl_dir,
+            repo_id=args.repo_id,
+            revision=args.revision,
+            token=os.getenv("HF_TOKEN"),
+        )
+    except Exception:
+        sys.exit(1)

Also applies to: 36-48

egs/mls_english/ASR/local/utils/create_subsets_greedy.py (1)

107-117: Harden duration parsing (handles None/NaN/strings/negatives).

Current code float(example[col]) will throw or propagate NaN. Use a safe parser. Also add math import.

+import math
@@
     def get_duration_from_column(example):
         """Helper to safely get duration from the specified column, in milliseconds."""
-        if duration_column_name in example:
-            return float(example[duration_column_name]) * 1000
-        else:
-            print(
-                f"Warning: Duration column '{duration_column_name}' not found in example. Returning 0.",
-                file=sys.stderr,
-            )
-            return 0
+        try:
+            if duration_column_name in example:
+                val = example[duration_column_name]
+                if isinstance(val, (int, float)) and math.isfinite(val) and val > 0:
+                    return float(val) * 1000
+                if isinstance(val, str):
+                    val = val.strip()
+                    if val:
+                        v = float(val)
+                        if math.isfinite(v) and v > 0:
+                            return v * 1000
+        except (TypeError, ValueError):
+            pass
+        return 0

🧹 Nitpick comments (10)

egs/mls_english/ASR/local/compute_fbank_mls_english.py (2)

63-65: Use logging instead of print for consistency.

-    print(f"{mls_eng_hf_dataset_path=}")
+    logging.info("mls_eng_hf_dataset_path=%s", mls_eng_hf_dataset_path)

57-59: Shorten exception message per Ruff TRY003 (optional).

-        raise ImportError(
-            "To process the MLS English HF corpus, please install optional dependency: pip install datasets"
-        )
+        raise ImportError("Missing optional dependency 'datasets'. Run: pip install datasets")

egs/mls_english/ASR/local/utils/create_subsets_greedy.py (8)

51-52: Generalize split filename regex to support any shard counts.

Your pattern only matches exactly 5-digit shard counters. Many datasets use variable digits. Relax it to \d+ for robustness.

-    split_pattern = re.compile(r"^(train|dev|test|validation)-\d{5}-of-\d{5}\.parquet$")
+    split_pattern = re.compile(r"^(train|dev|test|validation)-\d+-of-\d+\.parquet$")

76-84: Narrow the exception type when loading datasets.

Catching bare Exception hides actionable errors and triggers lints. Limit to expected types.

-    try:
-        full_dataset = load_dataset("parquet", data_files=data_files)
-    except Exception as e:
+    try:
+        full_dataset = load_dataset("parquet", data_files=data_files)
+    except (OSError, ValueError, RuntimeError) as e:
         print(
             f"Error loading dataset from '{full_data_dir}' with load_dataset: {e}",
             file=sys.stderr,
         )
         sys.exit(1)

217-234: Compute “other splits” duration without row iteration.

sum(get_duration_from_column(ex) for ex in other_split) materializes rows and may decode audio. Sum the prefetched column instead.

-            other_split = full_dataset[split_name]
-            subset_dataset[split_name] = other_split
-            other_duration_ms = sum(get_duration_from_column(ex) for ex in other_split)
+            other_split = full_dataset[split_name]
+            subset_dataset[split_name] = other_split
+            if duration_column_name in other_split.column_names:
+                vals = other_split[duration_column_name]
+                other_duration_ms = sum(
+                    (float(v) * 1000)
+                    for v in vals
+                    if v is not None
+                    and isinstance(v, (int, float, str))
+                    and str(v).strip() != ""
+                    and math.isfinite(float(v))
+                    and float(v) > 0
+                )
+            else:
+                other_duration_ms = 0

244-256: Narrow exception on save or explicitly ignore the lint.

Same BLE001 as above. Either narrow to expected IO/Arrow errors or add a # noqa: BLE001 if broad catch is intentional.

-    except Exception as e:
+    except (OSError, RuntimeError, ValueError) as e:
         print(
             f"Error saving subset dataset to '{output_subset_dir}': {e}",
             file=sys.stderr,
         )
         sys.exit(1)

22-28: Preserve fractional hours in output directory name.

Casting to int drops information (e.g., 0.5h → 0h). Use a compact formatter.

-    output_subset_dir = os.path.join(
-        output_base_dir,
-        f"mls_english_subset_train{int(target_train_hours)}h_dev{int(target_dev_hours)}h_test{int(target_test_hours)}h",
-    )
+    def _fmt_hours(h):  # avoids trailing .0 while preserving fractions
+        return f"{h:g}"
+    output_subset_dir = os.path.join(
+        output_base_dir,
+        f"mls_english_subset_train{_fmt_hours(target_train_hours)}h_dev{_fmt_hours(target_dev_hours)}h_test{_fmt_hours(target_test_hours)}h",
+    )

328-341: Remove f-prefix from literal strings (Ruff F541).

These have no placeholders; drop the f to satisfy Ruff.

-print(f"\nTo use your new subset dataset, you can load it like this:")
-print(f"from datasets import load_dataset")
-print(f"import os, glob")
-print(f"data_files = {}")
+print("\nTo use your new subset dataset, you can load it like this:")
+print("from datasets import load_dataset")
+print("import os, glob")
+print("data_files = {}")
@@
-    print(f"    files = glob.glob(split_path)")
-    print(f"    if files: data_files[split_name] = files")
-    print(f"subset = load_dataset('parquet', data_files=data_files)")
-    print(f"print(subset)")
+    print("    files = glob.glob(split_path)")
+    print("    if files: data_files[split_name] = files")
+    print("subset = load_dataset('parquet', data_files=data_files)")
+    print("print(subset)")

20-29: Validate non-negative target hours early.

Guard against accidental negative inputs; current logic would silently produce empty subsets.

     random.seed(random_seed)
+    for _name, _hours in {
+        "train": target_train_hours,
+        "dev": target_dev_hours,
+        "test": target_test_hours,
+    }.items():
+        if _hours < 0:
+            print(f"Error: target hours for '{_name}' must be >= 0 (got {_hours}).", file=sys.stderr)
+            sys.exit(1)

Also applies to: 281-297

112-116: Reduce noisy per-example warnings.

If a duration column is missing, this will spam logs per row. Consider counting and emitting a single summary warning.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 36fc1f1 and a4c1db5.

📒 Files selected for processing (3)

egs/mls_english/ASR/local/compute_fbank_mls_english.py (4 hunks)
egs/mls_english/ASR/local/utils/create_subsets_greedy.py (1 hunks)
egs/mls_english/ASR/local/utils/download_mls_english.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

egs/mls_english/ASR/local/compute_fbank_mls_english.py (1)

icefall/utils.py (1)

is_module_available (1975-1984)

🪛 Ruff (0.12.2)

egs/mls_english/ASR/local/utils/create_subsets_greedy.py

78-78: Do not catch blind exception: Exception

(BLE001)

251-251: Do not catch blind exception: Exception

(BLE001)

328-328: f-string without any placeholders

Remove extraneous f prefix

(F541)

329-329: f-string without any placeholders

Remove extraneous f prefix

(F541)

330-330: f-string without any placeholders

Remove extraneous f prefix

(F541)

331-331: f-string without any placeholders

Remove extraneous f prefix

(F541)

333-333: f-string without any placeholders

Remove extraneous f prefix

(F541)

338-338: f-string without any placeholders

Remove extraneous f prefix

(F541)

339-339: f-string without any placeholders

Remove extraneous f prefix

(F541)

340-340: f-string without any placeholders

Remove extraneous f prefix

(F541)

341-341: f-string without any placeholders

Remove extraneous f prefix

(F541)

egs/mls_english/ASR/local/compute_fbank_mls_english.py

57-59: Avoid specifying long messages outside the exception class

(TRY003)

egs/mls_english/ASR/local/utils/download_mls_english.py

31-31: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (48)

GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630

coderabbitai · 2025-09-02T02:50:23Z

egs/mls_english/ASR/local/compute_fbank_mls_english.py

+    parser.add_argument("-a", "--audio-dir", type=Path)
+    parser.add_argument("-d", "--dl-dir", type=Path)
    return parser.parse_args()


⚠️ Potential issue

Fix --dl-dir semantics and resolve the actual dataset path; current code will miss the downloaded ‘mls_english’ subdir.

The downloader writes to <dl_dir>/mls_english, but main() passes args.dl_dir directly to load_dataset(), causing failures. Also mark required args.

@@ - parser.add_argument("-m", "--manifest-dir", type=Path) - parser.add_argument("-a", "--audio-dir", type=Path) - parser.add_argument("-d", "--dl-dir", type=Path) + parser.add_argument("-m", "--manifest-dir", type=Path, required=True, + help="Directory for manifests and features.") + parser.add_argument("-a", "--audio-dir", type=Path, required=True, + help="Directory to materialize audio shards.") + parser.add_argument("-d", "--dl-dir", type=Path, required=True, + help="Base download dir (expects '<dl-dir>/mls_english') or a local HF dataset dir or an HF repo-id.") @@ - mls_eng_hf_dataset_path = args.dl_dir # "/root/datasets/parler-tts--mls_eng" - cut_sets = make_cutset_blueprints(mls_eng_hf_dataset_path) + # Accept: (1) <dl-dir>/mls_english from our downloader, + # (2) a direct local HF dataset dir, or + # (3) a repo-id string (if user passes one). + if args.dl_dir.is_dir(): + candidate = args.dl_dir / "mls_english" + mls_eng_hf_dataset_path = candidate if candidate.exists() else args.dl_dir + else: + mls_eng_hf_dataset_path = args.dl_dir # repo-id string allowed + cut_sets = make_cutset_blueprints(str(mls_eng_hf_dataset_path))

Also applies to: 132-134

🤖 Prompt for AI Agents

In egs/mls_english/ASR/local/compute_fbank_mls_english.py around lines 110-112 (and similarly lines 132-134), the CLI currently defines --dl-dir as an optional Path and passes args.dl_dir directly to load_dataset(), but the downloader writes into a subdirectory named "mls_english", so the code misses the actual dataset path and can fail; update the argparse definitions to mark required arguments (audio-dir and dl-dir) and, before calling load_dataset(), resolve the real dataset root by appending the "mls_english" subdirectory (e.g., dataset_root = args.dl_dir / "mls_english" if args.dl_dir is not None) and use that resolved path when invoking load_dataset(). Ensure you also handle the case where the resolved path does not exist with a clear error.

coderabbitai · 2025-09-02T02:50:23Z

egs/mls_english/ASR/local/utils/create_subsets_greedy.py

+    # --- NEW: Generalized sampling function ---
+    def sample_split_by_hours(split_name, original_split, target_hours):
+        """
+        Samples a dataset split to reach a target number of hours.
+        Returns the sampled Dataset object and its actual duration in milliseconds.
+        """
+        target_duration_ms = target_hours * 3600 * 1000
+        current_duration_ms = 0
+        indices_to_include = []
+
+        if original_split is None or len(original_split) == 0:
+            print(
+                f"  Warning: Original '{split_name}' split is empty or not found. Cannot sample.",
+                file=sys.stderr,
+            )
+            return None, 0
+
+        print(
+            f"\n  Processing '{split_name}' split to reach approximately {target_hours} hours..."
+        )
+        print(
+            f"  Total samples in original '{split_name}' split: {len(original_split)}"
+        )
+
+        all_original_indices = list(range(len(original_split)))
+        random.shuffle(all_original_indices)  # Shuffle indices for random sampling
+
+        num_samples_processed = 0
+        for original_idx in all_original_indices:
+            if current_duration_ms >= target_duration_ms and target_hours > 0:
+                print(
+                    f"  Target {split_name} hours reached ({target_hours}h). Stopping processing."
+                )
+                break
+
+            example = original_split[original_idx]
+            duration_ms = get_duration_from_column(example)
+
+            if duration_ms > 0:
+                indices_to_include.append(original_idx)
+                current_duration_ms += duration_ms
+


🛠️ Refactor suggestion

Avoid row-wise access (decodes audio) and prefetch durations; huge perf win.

original_split[i] materializes full rows and can decode Audio, which is prohibitively expensive. Prefetch the duration column and disable audio decode during sampling.

def sample_split_by_hours(split_name, original_split, target_hours): @@ - target_duration_ms = target_hours * 3600 * 1000 + target_duration_ms = target_hours * 3600 * 1000 @@ - all_original_indices = list(range(len(original_split))) + # Disable audio decoding to avoid loading waveforms while sampling + if "audio" in original_split.features and isinstance(original_split.features["audio"], Audio): + sr = getattr(original_split.features["audio"], "sampling_rate", 16000) + original_split = original_split.cast_column("audio", Audio(sampling_rate=sr, decode=False)) + + # Prefetch durations once to avoid per-row materialization + duration_values = None + if duration_column_name in original_split.column_names: + try: + duration_values = original_split[duration_column_name] + except (RuntimeError, ValueError, OSError): + duration_values = None + + all_original_indices = list(range(len(original_split))) random.shuffle(all_original_indices) # Shuffle indices for random sampling @@ - example = original_split[original_idx] - duration_ms = get_duration_from_column(example) + if duration_values is not None: + # Fast path: parse from prefetched column + try: + val = duration_values[original_idx] + duration_ms = 0 + if val is not None: + v = float(val) + if math.isfinite(v) and v > 0: + duration_ms = v * 1000 + except (TypeError, ValueError): + duration_ms = 0 + else: + # Fallback: access a single row (audio decoding disabled above) + duration_ms = get_duration_from_column(original_split[original_idx]) @@ - if "audio" in sampled_split.features and not isinstance( - sampled_split.features["audio"], Audio - ): - sampling_rate = ( - sampled_split.features["audio"].sampling_rate - if isinstance(sampled_split.features["audio"], Audio) - else 16000 - ) - new_features = sampled_split.features - new_features["audio"] = Audio(sampling_rate=sampling_rate) - sampled_split = sampled_split.cast(new_features) + if "audio" in sampled_split.features and not isinstance(sampled_split.features["audio"], Audio): + sr = getattr(sampled_split.features["audio"], "sampling_rate", 16000) + new_features = sampled_split.features.copy() + new_features["audio"] = Audio(sampling_rate=sr, decode=False) + sampled_split = sampled_split.cast(new_features)

Also applies to: 175-187

🤖 Prompt for AI Agents

egs/mls_english/ASR/local/utils/create_subsets_greedy.py lines 118-159 (also apply same change to 175-187): the loop uses original_split[i] which materializes full rows and decodes audio per-row; instead prefetch the duration column into memory and iterate that array while ensuring audio decoding is disabled for the split during sampling. Concretely: obtain a durations list/array from the dataset (e.g., fetch the single duration column via the dataset API so it does not decode audio), shuffle indices over that length, use durations[idx] to accumulate current_duration_ms and only access full examples later when building the final subset, and make sure the dataset is opened/used with audio decoding turned off or limited to the columns you need while sampling.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (4)

egs/multi_ja_en/ASR/RESULTS.md (4)
33-38: Align ONNX export averaging with reported decode settings.
Export uses --use-averaged-model 0 and --avg 1, while decode reports --avg 15. Mismatch can yield different WER/CER. Consider exporting the averaged model (or add a note explaining the difference).

Apply:
 ./zipformer/export-onnx.py \
   --tokens ./data/lang/bbpe_2000/tokens.txt \
-  --use-averaged-model 0 \
-  --epoch 21 \
-  --avg 1 \
+  --use-averaged-model 1 \
+  --epoch 21 \
+  --avg 15 \
   --exp-dir ./zipformer/exp
30-31: Nit: capitalize ONNX.
Minor documentation polish.
-To export the model with onnx:
+To export the model with ONNX:
51-58: Make English WER table reproducible (versions/splits/normalization).
Briefly note dataset versions/splits (e.g., CV release, TED corpus variant, MLS split) and text normalization/tokenization used.

60-67: Add dataset details for Japanese CER table.
Specify corpus versions/splits and scoring pipeline (e.g., full-/half-width normalization, punctuation handling) for reproducibility.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a4c1db5 and 9a940c3.

📒 Files selected for processing (1)

egs/multi_ja_en/ASR/RESULTS.md (3 hunks)

🧰 Additional context used

🪛 LanguageTool

egs/multi_ja_en/ASR/RESULTS.md

[grammar] ~53-~53: There might be a mistake here.
Context: ...Corpus | WER (%) | |-----------------------------|---------...

(QB_NEW_EN)

[grammar] ~54-~54: There might be a mistake here.
Context: ...-----------------------------|---------| | CommonVoice | 29.03 ...

(QB_NEW_EN)

[grammar] ~55-~55: There might be a mistake here.
Context: ... CommonVoice | 29.03 | | TED | 16.78 ...

(QB_NEW_EN)

[grammar] ~56-~56: There might be a mistake here.
Context: ... TED | 16.78 | | MLS English (test-clean) | 8.64 |...

(QB_NEW_EN)

[grammar] ~62-~62: There might be a mistake here.
Context: ...e datasets: | Corpus | CER (%) | |---------------|---------| | JSUT ...

(QB_NEW_EN)

[grammar] ~63-~63: There might be a mistake here.
Context: ... | CER (%) | |---------------|---------| | JSUT | 8.13 | | CommonVoice...

(QB_NEW_EN)

[grammar] ~64-~64: There might be a mistake here.
Context: ...--|---------| | JSUT | 8.13 | | CommonVoice | 9.82 | | TEDx ...

(QB_NEW_EN)

[grammar] ~65-~65: There might be a mistake here.
Context: ... | 8.13 | | CommonVoice | 9.82 | | TEDx | 11.64 | Pre-traine...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (130)

GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.4.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.4.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.4.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.4.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.4.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.4.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.4.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.4.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.4.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.4.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.0 v20250630

🔇 Additional comments (2)

egs/multi_ja_en/ASR/RESULTS.md (2)

12-12: Epoch change to 21 is consistent with decode/export — LGTM.
No issues spotted with this value across commands.

23-23: Decode epoch updated to 21 — LGTM.
Matches training/export epoch.

coderabbitai · 2025-09-02T02:54:29Z

egs/multi_ja_en/ASR/RESULTS.md

  --use-fp16 1 \
  --exp-dir zipformer/exp \
-  --max-duration 600
+  --manifest-dir data/manifests


💡 Verification agent

❓ Verification inconclusive

Verify manifest path expectations for training.
Switching to --manifest-dir data/manifests is fine; ensure the prep steps populate this directory and that the README points to the exact script/output layout.

Run to sanity-check presence:

🏁 Script executed:

#!/bin/bash ls -lah data/manifests || exit 1 fd -e jsonl -e jsonl.gz data/manifests -x sh -c 'echo "found: $1" _ {}' \; || true

Length of output: 95

Ensure data/manifests exists and is populated before using --manifest-dir
The directory data/manifests doesn’t exist, so training will fail. Update your data-prep script (and the README) to output the generated JSONL/JSONL.GZ files into data/manifests.

🤖 Prompt for AI Agents

In egs/multi_ja_en/ASR/RESULTS.md around line 16, the README references --manifest-dir data/manifests but that directory doesn't exist; update the data-prep script to create data/manifests and emit the generated JSONL/JSONL.GZ manifest files there, and update the README to document this output path and prerequisite (or change the flag to the actual manifest path your prep currently writes to).

egs/multi_ja_en/ASR/RESULTS.md

JinZr · 2025-09-02T07:09:35Z

for sure! i’ll look into it tonight Best Regards Jin

…

On Mon, 1 Sep 2025 at 21:44 Fangjun Kuang ***@***.***> wrote: *csukuangfj* left a comment (k2-fsa/icefall#2015) <#2015 (comment)> @JinZr <https://github.com/JinZr> Can you have a look at this PR? — Reply to this email directly, view it on GitHub <#2015 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AOON42FGBNWS3SMOKXGF7O33QREVHAVCNFSM6AAAAACFAOJHLCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTENBSGQ2DEMBUGM> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (3)

egs/multi_ja_en/ASR/RESULTS.md (1)
12-16: Clarify manifests prep and make max-duration explicit (if non-default).

Ensure prepare scripts create/populate data/manifests before training; this was flagged earlier and still applies.

If train.py relies on a default max-duration, consider documenting the default or adding the flag for clarity.

Apply this doc tweak right below the training block to make the prerequisite unmistakable:
   --exp-dir zipformer/exp \
   --manifest-dir data/manifests
+Note: Run the data preparation steps first so that data/manifests exists and contains the JSONL/JSONL.GZ manifests used for training.
+If you rely on the default for --max-duration, consider stating it here (or pass it explicitly) for reproducibility.
</blockquote></details>
<details>
<summary>egs/mls_english/ASR/local/utils/asr_datamodule.py (2)</summary><blockquote>

`77-82`: **Resolved: --max-duration now uses float.**

Thanks for addressing the earlier inconsistency; `type=float` matches the default and expected semantics.

---

`319-321`: **Use configured --num-workers for validation DataLoader (and set persistent_workers accordingly).**

Hardcoding `num_workers=2` ignores the CLI knob and can underutilize hardware.


Apply:

```diff
         valid_dl = DataLoader(
             validate,
             sampler=valid_sampler,
             batch_size=None,
-            num_workers=2,
-            persistent_workers=False,
+            num_workers=self.args.num_workers,
+            persistent_workers=self.args.num_workers > 0,
         )
Optional: make the same change for train/test DataLoaders for consistency (see Lines 278-284 and 342-343).

🧹 Nitpick comments (11)

egs/multi_ja_en/ASR/RESULTS.md (4)
33-38: Confirm tokens path provenance and prep step.

The tokens path moved to ./data/lang/bbpe_2000/tokens.txt. Add a short note pointing readers to the tokenizer prep that generates it (bbpe_2000), or they’ll hit a missing file.

Suggested doc nudge:
 ./zipformer/export-onnx.py \
   --tokens ./data/lang/bbpe_2000/tokens.txt \
   --use-averaged-model 0 \
   --epoch 21 \
   --avg 1 \
   --exp-dir ./zipformer/exp
+
+# Ensure you have created ./data/lang/bbpe_2000 via the language prep step before exporting.
51-58: Annotate dataset versions and evaluation settings for WERs.

Specify dataset versions/splits (e.g., CV version, TED corpus variant), text normalization, and decoding method (greedy vs. MBS, LM? none). This aids reproducibility.

Example header tweak:
-| Corpus                       | WER (%) |
+| Corpus (version/split, norm, method) | WER (%) |
60-67: Do the same for Japanese CERs (versions/norm/method).

List dataset versions (e.g., JSUT release), normalization rules, and decode method for CER.

69-69: Pin the pre-trained model revision for stability.

Link is good; consider adding a specific HF revision/commit/tag in the text so results remain reproducible if the repo updates.
egs/mls_english/ASR/local/utils/asr_datamodule.py (7)
346-351: Avoid lru_cache on bound instance methods (B019); cache by path instead.

@lru_cache on methods retains references to self and can leak when multiple instances are created.

Replace the decorators and route through a path-keyed cache:
-    @lru_cache()
     def train_cuts(self) -> CutSet:
         logging.info("About to get train cuts")
-        return load_manifest_lazy(
-            self.args.manifest_dir / "mls_eng_cuts_train.jsonl.gz"
-        )
+        return _cached_load_cuts(str(self.args.manifest_dir / "mls_eng_cuts_train.jsonl.gz"))
 
-    @lru_cache()
     def valid_cuts(self) -> CutSet:
         logging.info("About to get dev cuts")
-        return load_manifest_lazy(
-            self.args.manifest_dir / "mls_eng_cuts_dev.jsonl.gz"
-        )
+        return _cached_load_cuts(str(self.args.manifest_dir / "mls_eng_cuts_dev.jsonl.gz"))
 
-    @lru_cache()
     def test_cuts(self) -> CutSet:
         logging.info("About to get test cuts")
-        return load_manifest_lazy(
-            self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz"
-        )
+        return _cached_load_cuts(str(self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz"))
Add the helper near the imports:
# top-level helper (place after imports)
@lru_cache(maxsize=16)
def _cached_load_cuts(path: str) -> CutSet:
    return load_manifest_lazy(Path(path))
Also applies to: 353-358, 360-365

196-205: Honor --concatenate-cuts during training too (not only in validation).

The CLI flag suggests train-time concatenation support; applying it here reduces padding and aligns with the docstring.

Apply:
         transforms = []
+        if self.args.concatenate_cuts:
+            transforms = [
+                CutConcatenate(
+                    duration_factor=self.args.duration_factor, gap=self.args.gap
+                )
+            ] + transforms
If order matters, keep CutConcatenate before CutMix so noise is applied after concatenation.

231-236: Be explicit about PrecomputedFeatures when not using on-the-fly features.

K2SpeechRecognitionDataset defaults may vary; explicit is clearer and future-proof.

Apply:
         train = K2SpeechRecognitionDataset(
             cut_transforms=transforms,
+            input_strategy=PrecomputedFeatures(),
             input_transforms=input_transforms,
             return_cuts=self.args.return_cuts,
         )
         else:
             validate = K2SpeechRecognitionDataset(
                 cut_transforms=transforms,
+                input_strategy=PrecomputedFeatures(),
                 return_cuts=self.args.return_cuts,
             )
Also applies to: 305-309

247-247: Remove unused noqa in commented code (RUF100).

The # noqa is unnecessary on a commented-out line.

Apply:
-            # transforms = [PerturbSpeed(factors=[0.9, 1.1], p=2/3)] + transforms   # noqa
+            # transforms = [PerturbSpeed(factors=[0.9, 1.1], p=2/3)] + transforms
291-295: Use iterable unpacking over list concatenation (RUF005).

Slightly cleaner and avoids an intermediate list.

Apply:
-            transforms = [
-                CutConcatenate(
-                    duration_factor=self.args.duration_factor, gap=self.args.gap
-                )
-            ] + transforms
+            transforms = [
+                CutConcatenate(
+                    duration_factor=self.args.duration_factor, gap=self.args.gap
+                ),
+                *transforms,
+            ]
26-26: Drop unused import load_manifest.

Apply:
-from lhotse import CutSet, Fbank, FbankConfig, load_manifest, load_manifest_lazy
+from lhotse import CutSet, Fbank, FbankConfig, load_manifest_lazy
278-284: Optional: enable persistent_workers when workers > 0 for throughput.

This can speed up data loading for long runs; keep it False when workers == 0.

Apply:
         train_dl = DataLoader(
             train,
             sampler=train_sampler,
             batch_size=None,
             num_workers=self.args.num_workers,
-            persistent_workers=False,
+            persistent_workers=self.args.num_workers > 0,
         )
         test_dl = DataLoader(
             test,
             batch_size=None,
             sampler=sampler,
-            num_workers=self.args.num_workers,
+            num_workers=self.args.num_workers,
+            persistent_workers=self.args.num_workers > 0,
         )
Also applies to: 338-343

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9a940c3 and ef7664e.

📒 Files selected for processing (2)

egs/mls_english/ASR/local/utils/asr_datamodule.py (1 hunks)
egs/multi_ja_en/ASR/RESULTS.md (3 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

egs/mls_english/ASR/local/utils/asr_datamodule.py (1)

icefall/utils.py (1)

str2bool (142-158)

🪛 Ruff (0.12.2)

egs/mls_english/ASR/local/utils/asr_datamodule.py

247-247: Unused blanket noqa directive

Remove unused noqa directive

(RUF100)

291-295: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

346-346: Use of functools.lru_cache or functools.cache on methods can lead to memory leaks

(B019)

353-353: Use of functools.lru_cache or functools.cache on methods can lead to memory leaks

(B019)

360-360: Use of functools.lru_cache or functools.cache on methods can lead to memory leaks

(B019)

🪛 LanguageTool

egs/multi_ja_en/ASR/RESULTS.md

[grammar] ~53-~53: There might be a mistake here.
Context: ...Corpus | WER (%) | |-----------------------------|---------...

(QB_NEW_EN)

[grammar] ~54-~54: There might be a mistake here.
Context: ...-----------------------------|---------| | CommonVoice | 29.03 ...

(QB_NEW_EN)

[grammar] ~55-~55: There might be a mistake here.
Context: ... CommonVoice | 29.03 | | TED | 16.78 ...

(QB_NEW_EN)

[grammar] ~56-~56: There might be a mistake here.
Context: ... TED | 16.78 | | MLS English (test-clean) | 8.64 |...

(QB_NEW_EN)

[grammar] ~62-~62: There might be a mistake here.
Context: ...e datasets: | Corpus | CER (%) | |---------------|---------| | JSUT ...

(QB_NEW_EN)

[grammar] ~63-~63: There might be a mistake here.
Context: ... | CER (%) | |---------------|---------| | JSUT | 8.13 | | CommonVoice...

(QB_NEW_EN)

[grammar] ~64-~64: There might be a mistake here.
Context: ...--|---------| | JSUT | 8.13 | | CommonVoice | 9.82 | | TEDx ...

(QB_NEW_EN)

[grammar] ~65-~65: There might be a mistake here.
Context: ... | 8.13 | | CommonVoice | 9.82 | | TEDx | 11.64 | Pre-traine...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (197)

GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: style_check (ubuntu-latest, 3.10)
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: style_check (ubuntu-latest, 3.10)
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: style_check (ubuntu-latest, 3.10)
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: style_check (ubuntu-latest, 3.10)
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: style_check (ubuntu-latest, 3.10)
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: style_check (ubuntu-latest, 3.10)
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: style_check (ubuntu-latest, 3.10)
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: style_check (ubuntu-latest, 3.10)
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: style_check (ubuntu-latest, 3.10)
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: style_check (ubuntu-latest, 3.10)
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.5.0 v20250630
GitHub Check: py3.10 torch2.2.0 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.0.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630

🔇 Additional comments (1)

egs/multi_ja_en/ASR/RESULTS.md (1)

23-28: LGTM: decode epoch/avg pair is consistent.

Epoch 21 with avg 15 matches the training epoch change. No further action.

coderabbitai · 2025-09-02T09:17:15Z

egs/mls_english/ASR/local/utils/asr_datamodule.py

+    @lru_cache()
+    def test_cuts(self) -> List[CutSet]:
+        logging.info("About to get test cuts")
+        return load_manifest_lazy(
+            self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz"
+        )


⚠️ Potential issue

Type annotation mismatch: test_cuts returns a CutSet, not List[CutSet].

This can cause confusion and downstream typing issues; test_dataloaders also expects a CutSet.

Apply:

- def test_cuts(self) -> List[CutSet]: + def test_cuts(self) -> CutSet: logging.info("About to get test cuts") return load_manifest_lazy( self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz" )

And remove the now-unused List import:

-from typing import Any, Dict, List, Optional +from typing import Any, Dict, Optional

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

@lru_cache()

def test_cuts(self) -> List[CutSet]:

logging.info("About to get test cuts")

return load_manifest_lazy(

self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz"

)

++ b/egs/mls_english/ASR/local/utils/asr_datamodule.py

@@ –1,1 +1,1 @@

from typing import Any, Dict, Optional

@@ –360,2 +360,2 @@

def test_cuts(self) -> CutSet:

logging.info("About to get test cuts")

return load_manifest_lazy(

self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz"

)

🧰 Tools

🪛 Ruff (0.12.2)

360-360: Use of functools.lru_cache or functools.cache on methods can lead to memory leaks

(B019)

🤖 Prompt for AI Agents

In egs/mls_english/ASR/local/utils/asr_datamodule.py around lines 360 to 365, the test_cuts function is annotated to return List[CutSet] but actually returns a CutSet; update the return type annotation to CutSet and remove the now-unused List import from the file; ensure any related type hints (e.g., in test_dataloaders) remain consistent with CutSet.

… model section

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

egs/multi_ja_en/ASR/RESULTS.md (1)
13-20: Boolean CLI arg: confirm expected type for --enable-musan and ensure manifests exist.

Many icefall scripts expect 0/1 or store_true; passing literal "True" may be parsed as a string. Please confirm and align examples (e.g., --enable-musan 1 or just --enable-musan).

Also verify that the README/prep steps create and populate data/manifests before running these commands.

Run:
#!/bin/bash
# Check how enable_musan is defined
rg -nP --type=py -C2 '\benable[-_]?musan\b' egs/multi_ja_en/ASR/zipformer/train.py egs/multi_ja_en/ASR/zipformer/decode.py || true

# Sanity-check manifests after prep
ls -lah data/manifests || exit 1
fd -e jsonl -e jsonl.gz data/manifests -x sh -c 'echo "found: $1"' _ {}

🧹 Nitpick comments (5)

egs/multi_ja_en/ASR/RESULTS.md (5)
24-31: Decode config is underspecified (beam/num-active/etc.).

modified_beam_search usually needs beam/num_paths-like knobs for reproducibility. Add the key params you used (beam size, max contexts, n-best).

36-42: Capitalize and clarify: “export to ONNX”.

Minor wording fix and keep flags on one line per arg for copy/paste.

Apply:
-To export the model with onnx:
+To export the model to ONNX:
56-71: Clarify scoring setup for English WER and Japanese CER.

Specify casing (lowercased or original), punctuation normalization, tokenizer for Japanese (e.g., character-based vs. MeCab), and scoring tool (e.g., sclite). Without this, numbers aren’t comparable.

116-116: Document how chunk size affects WER/RTF.

Add a brief note on the trade-off across {16,32,64,128} (latency vs. accuracy) and indicate the value used to report streaming WER.

5-8: Add repro details: seed, commit hashes, and tool versions.

Include torch/k2/lhotse/icefall commit IDs, random seed(s), and GPU type to make these results reproducible.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ef7664e and bc2560c.

📒 Files selected for processing (1)

egs/multi_ja_en/ASR/RESULTS.md (1 hunks)

🧰 Additional context used

🪛 LanguageTool

egs/multi_ja_en/ASR/RESULTS.md

[grammar] ~44-~44: There might be a mistake here.
Context: ...xp-dir ./zipformer/exp ``` WER and CER on test set listed below (calculated with ...

(QB_NEW_EN)

[grammar] ~46-~46: There might be a mistake here.
Context: ...eech + MLS English (combined test set) | |----------------------|----------------...

(QB_NEW_EN)

[grammar] ~47-~47: There might be a mistake here.
Context: ...---------------------------------------| | Zipformer WER (%) | ...

(QB_NEW_EN)

[grammar] ~48-~48: There might be a mistake here.
Context: ... test | | greedy_search | ...

(QB_NEW_EN)

[grammar] ~49-~49: There might be a mistake here.
Context: ... 6.33 | | modified_beam_search | ...

(QB_NEW_EN)

[grammar] ~56-~56: There might be a mistake here.
Context: ... Corpus | WER (%) | |-----------------------------|---------...

(QB_NEW_EN)

[grammar] ~57-~57: There might be a mistake here.
Context: ...-----------------------------|---------| | CommonVoice | 29.03 ...

(QB_NEW_EN)

[grammar] ~58-~58: There might be a mistake here.
Context: ... CommonVoice | 29.03 | | TED | 16.78 ...

(QB_NEW_EN)

[grammar] ~59-~59: There might be a mistake here.
Context: ... TED | 16.78 | | MLS English (test set) | 8.64 ...

(QB_NEW_EN)

[grammar] ~65-~65: There might be a mistake here.
Context: ...e datasets: | Corpus | CER (%) | |---------------|---------| | JSUT ...

(QB_NEW_EN)

[grammar] ~66-~66: There might be a mistake here.
Context: ... | CER (%) | |---------------|---------| | JSUT | 8.13 | | CommonVoice...

(QB_NEW_EN)

[grammar] ~67-~67: There might be a mistake here.
Context: ...--|---------| | JSUT | 8.13 | | CommonVoice | 9.82 | | TEDx ...

(QB_NEW_EN)

[grammar] ~68-~68: There might be a mistake here.
Context: ... | 8.13 | | CommonVoice | 9.82 | | TEDx | 11.64 | Pre-traine...

(QB_NEW_EN)

🔇 Additional comments (3)

egs/multi_ja_en/ASR/RESULTS.md (3)

72-75: HF link fix looks good.
Anchor and href are consistent now.

83-91: Streaming training flags likely missing.

The streaming section repeats non-streaming args. If you used streaming-specific options (e.g., causal/chunked training flags), include them; otherwise readers can’t reproduce “Streaming” results.

107-114: Confirm ONNX export flags match the streaming architecture.

--decode-chunk-len 32 is set; if additional streaming-only flags are required at export (e.g., causal conv settings), document them.

coderabbitai · 2025-09-03T09:00:47Z

egs/multi_ja_en/ASR/RESULTS.md

+WER and CER on test set listed below (calculated with `./zipformer/decode.py`):
+
+|       Datasets       | ReazonSpeech + MLS English (combined test set) |
+|----------------------|------------------------------------------------|
+|   Zipformer WER (%)  |                      test                      |
+|     greedy_search    |                      6.33                      |
+| modified_beam_search |                      6.32                      |
+


🛠️ Refactor suggestion

Fix table layout for combined-test metrics.

Header/rows are misaligned (“Zipformer WER (%) | test” reads oddly). Consider:

-| Datasets | ReazonSpeech + MLS English (combined test set) | -|----------------------|------------------------------------------------| -| Zipformer WER (%) | test | -| greedy_search | 6.33 | -| modified_beam_search | 6.32 | +| Method | WER (%) | Notes | +|----------------------|--------:|-----------------------------------------| +| greedy_search | 6.33 | Combined ReazonSpeech + MLS English | +| modified_beam_search | 6.32 | Combined ReazonSpeech + MLS English |

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

WER and CER on test set listed below (calculated with `./zipformer/decode.py`):

| Datasets | ReazonSpeech + MLS English (combined test set) |

|----------------------|------------------------------------------------|

| Zipformer WER (%) | test |

| greedy_search | 6.33 |

| modified_beam_search | 6.32 |

WER and CER on test set listed below (calculated with `./zipformer/decode.py`):

| Method | WER (%) | Notes |

|----------------------|--------:|-----------------------------------------|

| greedy_search | 6.33 | Combined ReazonSpeech + MLS English |

| modified_beam_search | 6.32 | Combined ReazonSpeech + MLS English |

🧰 Tools

🪛 LanguageTool

[grammar] ~44-~44: There might be a mistake here.
Context: ...xp-dir ./zipformer/exp ``` WER and CER on test set listed below (calculated with ...

(QB_NEW_EN)

[grammar] ~46-~46: There might be a mistake here.
Context: ...eech + MLS English (combined test set) | |----------------------|----------------...

(QB_NEW_EN)

[grammar] ~47-~47: There might be a mistake here.
Context: ...---------------------------------------| | Zipformer WER (%) | ...

(QB_NEW_EN)

[grammar] ~48-~48: There might be a mistake here.
Context: ... test | | greedy_search | ...

(QB_NEW_EN)

[grammar] ~49-~49: There might be a mistake here.
Context: ... 6.33 | | modified_beam_search | ...

(QB_NEW_EN)

🤖 Prompt for AI Agents

In egs/multi_ja_en/ASR/RESULTS.md around lines 44 to 51, the markdown table layout for the combined test metrics is misaligned (the header cell "ReazonSpeech + MLS English (combined test set)" and the "Zipformer WER (%) | test" row read oddly); fix by restructuring the table to have clear column headers (e.g., Dataset | Metric | Value or Dataset | Zipformer WER (%) with a single header row), align the separator row to match the number of columns, and move "test" into a proper column cell so each row maps correctly to the headers.

coderabbitai · 2025-09-03T09:00:48Z

egs/multi_ja_en/ASR/RESULTS.md

+./zipformer/decode.py \
+    --epoch 10 \
+    --avg 1 \
+    --exp-dir ./zipformer/exp \
+    --decoding-method modified_beam_search \
+    --manifest-dir data/manifests
+```


⚠️ Potential issue

Use streaming decode script/flags for the streaming model.

The streaming section uses decode.py without chunk args. Replace with the streaming decoder and include chunk length.

Apply:

-./zipformer/decode.py \ - --epoch 10 \ - --avg 1 \ - --exp-dir ./zipformer/exp \ - --decoding-method modified_beam_search \ - --manifest-dir data/manifests +./zipformer/streaming_decode.py \ + --epoch 10 \ + --avg 1 \ + --exp-dir ./zipformer/exp \ + --decoding-method modified_beam_search \ + --decode-chunk-len 32 \ + --manifest-dir data/manifests

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

./zipformer/decode.py \

--epoch 10 \

--avg 1 \

--exp-dir ./zipformer/exp \

--decoding-method modified_beam_search \

--manifest-dir data/manifests

```

./zipformer/streaming_decode.py \

--epoch 10 \

--avg 1 \

--exp-dir ./zipformer/exp \

--decoding-method modified_beam_search \

--decode-chunk-len 32 \

--manifest-dir data/manifests

🤖 Prompt for AI Agents

In egs/multi_ja_en/ASR/RESULTS.md around lines 96 to 102, the streaming decode example uses the non-streaming decode.py and lacks chunking arguments; replace it with the streaming decoder invocation (e.g., streaming_decode.py or the decode.py streaming mode) and add required streaming flags such as --decoding-method streaming_beam_search (or the project-specific streaming method) plus chunk-related args like --chunk-length (and if applicable --left-context/--right-context or --chunk-size) and any streaming-specific model/exp flags; update the command line to call the streaming decoder script and include a concrete --chunk-length value used by the streaming model.

coderabbitai · 2025-09-03T09:00:48Z

egs/multi_ja_en/ASR/RESULTS.md

+*Please let us know which script to use to evaluate the streaming model!*
+
+
+We also include WER% for common English ASR datasets:
+
+*Please let us know which script to use to evaluate the streaming model!*
+
+
+And CER% for common Japanese datasets:

+*Please let us know which script to use to evaluate the streaming model!*



🛠️ Refactor suggestion

Replace placeholders with concrete evaluation commands.

The repeated “Please let us know…” lines block reproducibility. Provide the exact script(s), command lines, and scoring steps for streaming WER/CER.

Apply:

-*Please let us know which script to use to evaluate the streaming model!* +Evaluate streaming with: +```shell +./zipformer/streaming_decode.py \ + --epoch 10 --avg 1 --exp-dir ./zipformer/exp \ + --decoding-method modified_beam_search \ + --decode-chunk-len 32 \ + --manifest-dir data/manifests +``` +Score with the same normalization used for non-streaming; list exact commands/tools.

🤖 Prompt for AI Agents

In egs/multi_ja_en/ASR/RESULTS.md around lines 120-131, replace the repeated placeholder lines with concrete evaluation instructions: add the exact streaming decode command (pointing to zipformer/streaming_decode.py with required flags such as epoch, avg, exp-dir, decoding-method, decode-chunk-len, manifest-dir), then add the precise scoring steps and commands (normalization pipeline matching non-streaming evaluation, followed by invocation of the scoring tool such as sclite or the repo's wer/cer scoring script) and an example invocation showing how to produce hypothesis and reference files and run the scorer; ensure you list any required normalization options (lowercasing, punctuation removal, kana/kanji normalization for Japanese) and the exact files produced/consumed so readers can reproduce the reported WER/CER.

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (4)

egs/multi_ja_en/ASR/RESULTS.md (4)

19-21: Ensure data/manifests is created and populated in prep.

Docs should state that prepare scripts emit manifests into data/manifests before training/decoding uses it.

Run to sanity‑check:

#!/bin/bash
set -e
ls -lah data/manifests
fd -e jsonl -e jsonl.gz data/manifests -x sh -c 'echo "found: $1"' _ {}

45-52: Fix misaligned WER table.

Replace with a simple Method|WER table:

-|       Datasets       | ReazonSpeech + MLS English (combined test set) |
-|----------------------|------------------------------------------------|
-|   Zipformer WER (%)  |                      test                      |
-|     greedy_search    |                      6.33                      |
-| modified_beam_search |                      6.32                      |
+| Method               | WER (%) | Notes                               |
+|----------------------|--------:|-------------------------------------|
+| greedy_search        |   6.33  | ReazonSpeech+MLS English, combined  |
+| modified_beam_search |   6.32  | ReazonSpeech+MLS English, combined  |

97-99: Replace TODO with a working streaming decode command.

-```shell
-TODO
-```
+```shell
+./zipformer/streaming_decode.py \
+  --epoch 10 \
+  --avg 1 \
+  --exp-dir ./zipformer/exp-15k15k-streaming \
+  --decoding-method modified_beam_search \
+  --decode-chunk-len 32 \
+  --manifest-dir data/manifests
+```

149-160: Replace placeholders with concrete streaming eval steps.

-Word Error Rates (WERs) listed below:
-
-*Please let us know which script to use to evaluate the streaming model!*
-
-
-We also include WER% for common English ASR datasets:
-
-*Please let us know which script to use to evaluate the streaming model!*
-
-
-And CER% for common Japanese datasets:
-
-*Please let us know which script to use to evaluate the streaming model!*
+Word Error Rates (WERs) for streaming:
+
+```shell
+# Decode (streaming)
+./zipformer/streaming_decode.py \
+  --epoch 10 --avg 1 \
+  --exp-dir ./zipformer/exp-15k15k-streaming \
+  --decoding-method modified_beam_search \
+  --decode-chunk-len 32 \
+  --manifest-dir data/manifests
+
+# Score (same normalization as non-streaming)
+# Example: produces refs/hyps then runs the repo scorer.
+python ./zipformer/scoring/prepare_refs_hyps.py \
+  --manifest-dir data/manifests \
+  --output-dir ./zipformer/exp-15k15k-streaming/scoring
+python ./zipformer/scoring/score_wer_cer.py \
+  --scoring-dir ./zipformer/exp-15k15k-streaming/scoring
+```
+
+We also include WER% for common English ASR datasets and CER% for Japanese with the same normalization.

🧹 Nitpick comments (5)

egs/multi_ja_en/ASR/RESULTS.md (5)
34-36: Capitalize ONNX.
-To export the model with onnx:
+To export the model with ONNX:
57-62: Specify dataset versions/splits and normalization.

State exact corpora (e.g., “Common Voice 17 test”, “TED-LIUM 3 test” or “TED Talks subset”), and text normalization used, to make WER reproducible.

66-71: Clarify JP sets and CER normalization.

Please note the dataset releases/splits (e.g., JSUT v1.1) and normalization (e.g., Kanji→Kana mapping, punctuation handling).

101-103: Use canonical name “sherpa-onnx”.
-To export the model with sherpa onnx:
+To export the model with sherpa-onnx:
145-145: Match flag name in the text.
-You may also use decode chunk sizes `16`, `32`, `64`, `128`.
+You may also use --decode-chunk-len values 16, 32, 64, 128.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between bc2560c and ecbe985.

📒 Files selected for processing (1)

egs/multi_ja_en/ASR/RESULTS.md (1 hunks)

🧰 Additional context used

🪛 LanguageTool

egs/multi_ja_en/ASR/RESULTS.md

[grammar] ~45-~45: There might be a mistake here.
Context: ...xp-dir ./zipformer/exp ``` WER and CER on test set listed below (calculated with ...

(QB_NEW_EN)

[grammar] ~47-~47: There might be a mistake here.
Context: ...eech + MLS English (combined test set) | |----------------------|----------------...

(QB_NEW_EN)

[grammar] ~48-~48: There might be a mistake here.
Context: ...---------------------------------------| | Zipformer WER (%) | ...

(QB_NEW_EN)

[grammar] ~49-~49: There might be a mistake here.
Context: ... test | | greedy_search | ...

(QB_NEW_EN)

[grammar] ~50-~50: There might be a mistake here.
Context: ... 6.33 | | modified_beam_search | ...

(QB_NEW_EN)

[grammar] ~57-~57: There might be a mistake here.
Context: ... Corpus | WER (%) | |-----------------------------|---------...

(QB_NEW_EN)

[grammar] ~58-~58: There might be a mistake here.
Context: ...-----------------------------|---------| | CommonVoice | 29.03 ...

(QB_NEW_EN)

[grammar] ~59-~59: There might be a mistake here.
Context: ... CommonVoice | 29.03 | | TED | 16.78 ...

(QB_NEW_EN)

[grammar] ~60-~60: There might be a mistake here.
Context: ... TED | 16.78 | | MLS English (test set) | 8.64 ...

(QB_NEW_EN)

[grammar] ~66-~66: There might be a mistake here.
Context: ...e datasets: | Corpus | CER (%) | |---------------|---------| | JSUT ...

(QB_NEW_EN)

[grammar] ~67-~67: There might be a mistake here.
Context: ... | CER (%) | |---------------|---------| | JSUT | 8.13 | | CommonVoice...

(QB_NEW_EN)

[grammar] ~68-~68: There might be a mistake here.
Context: ...--|---------| | JSUT | 8.13 | | CommonVoice | 9.82 | | TEDx ...

(QB_NEW_EN)

[grammar] ~69-~69: There might be a mistake here.
Context: ... | 8.13 | | CommonVoice | 9.82 | | TEDx | 11.64 | Pre-traine...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (32)

GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.6.0 v20250630
GitHub Check: py3.10 torch2.7.1 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630

🔇 Additional comments (1)

egs/multi_ja_en/ASR/RESULTS.md (1)

27-31: Decode args look consistent with the new pipeline.

coderabbitai · 2025-09-04T02:02:38Z

egs/multi_ja_en/ASR/RESULTS.md

+  --world-size 8 \
+  --causal 1 \
+  --num-epochs 10 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
-  --max-duration 600
+  --manifest-dir data/manifests \
+  --enable-musan True
 ```


🛠️ Refactor suggestion

Non‑streaming section sets causal=1 — flip to 0 or clarify.

Using --causal 1 here will train a streaming/causal model under a “Non‑streaming” heading.

Apply:

./zipformer/train.py \ - --world-size 8 \ - --causal 1 \ + --world-size 8 \ + --causal 0 \ --num-epochs 10 \ --start-epoch 1 \ --use-fp16 1 \ --exp-dir zipformer/exp \ --manifest-dir data/manifests \ --enable-musan True

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

--world-size 8 \

--causal 1 \

--num-epochs 10 \

--start-epoch 1 \

--use-fp16 1 \

--exp-dir zipformer/exp \

--max-duration 600

--manifest-dir data/manifests \

--enable-musan True

```

./zipformer/train.py \

--world-size 8 \

--causal 0 \

--num-epochs 10 \

--start-epoch 1 \

--use-fp16 1 \

--exp-dir zipformer/exp \

--manifest-dir data/manifests \

--enable-musan True

🤖 Prompt for AI Agents

egs/multi_ja_en/ASR/RESULTS.md around lines 13 to 21: the Non‑streaming section currently sets --causal 1 which configures a streaming/causal model; change --causal 1 to --causal 0 to match the Non‑streaming heading (or alternatively add a clarifying note above the block that this run intentionally trains a causal/streaming model), and ensure the comment/heading and flag are consistent.

coderabbitai · 2025-09-04T02:02:38Z

egs/multi_ja_en/ASR/RESULTS.md

+./zipformer/export-onnx.py \
+  --tokens ./data/lang/bbpe_2000/tokens.txt \
+  --use-averaged-model 0 \
+  --epoch 10 \
+  --avg 1 \
+  --exp-dir ./zipformer/exp
 ```


💡 Verification agent

🧩 Analysis chain

Token path: add a quick existence check in docs or prep.

Ensure data/lang/bbpe_2000/tokens.txt is produced by your BBPE training step before export.

Run:

🏁 Script executed:

#!/bin/bash test -f data/lang/bbpe_2000/tokens.txt && echo "tokens.txt OK" || (echo "tokens.txt missing" && exit 1)

Length of output: 96

Add existence check for tokens.txt in export instructions
In egs/multi_ja_en/ASR/RESULTS.md (lines 37–43), prepend a test to verify data/lang/bbpe_2000/tokens.txt exists before invoking export-onnx.py, for example:

test -f data/lang/bbpe_2000/tokens.txt || { echo "Error: tokens.txt not found. Run BBPE training first."; exit 1; }

This ensures users generate the tokens file beforehand and avoids silent failures.

🤖 Prompt for AI Agents

In egs/multi_ja_en/ASR/RESULTS.md around lines 37 to 43, the export example calls export-onnx.py without checking that data/lang/bbpe_2000/tokens.txt exists; add a pre-check that verifies that file exists and, if not, prints a clear error like "Error: tokens.txt not found. Run BBPE training first." and exits with a non-zero status before running export-onnx.py so users don't hit silent failures.

coderabbitai · 2025-09-04T02:02:38Z

egs/multi_ja_en/ASR/RESULTS.md

+  --world-size 8 \
+  --causal 1 \
+  --num-epochs 10 \
+  --start-epoch 1 \
+  --use-fp16 1 \
+  --exp-dir zipformer/exp \
+  --manifest-dir data/manifests \
+  --enable-musan True
+```


🛠️ Refactor suggestion

Streaming exp-dir inconsistent with later export path.

Train uses ./zipformer/exp, but export uses ./zipformer/exp-15k15k-streaming. Align them.

./zipformer/train.py \ --world-size 8 \ --causal 1 \ --num-epochs 10 \ --start-epoch 1 \ --use-fp16 1 \ - --exp-dir zipformer/exp \ + --exp-dir zipformer/exp-15k15k-streaming \ --manifest-dir data/manifests \ --enable-musan True

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

--world-size 8 \

--causal 1 \

--num-epochs 10 \

--start-epoch 1 \

--use-fp16 1 \

--exp-dir zipformer/exp \

--manifest-dir data/manifests \

--enable-musan True

```

./zipformer/train.py \

--world-size 8 \

--causal 1 \

--num-epochs 10 \

--start-epoch 1 \

--use-fp16 1 \

--exp-dir zipformer/exp-15k15k-streaming \

--manifest-dir data/manifests \

--enable-musan True

🤖 Prompt for AI Agents

In egs/multi_ja_en/ASR/RESULTS.md around lines 85 to 93, the training command uses --exp-dir zipformer/exp while the export later references ./zipformer/exp-15k15k-streaming; make them consistent by changing one to match the other (preferred: set --exp-dir to zipformer/exp-15k15k-streaming to match the streaming export) so both training and export use the identical experiment directory path.

csukuangfj · 2025-09-04T02:03:45Z

egs/multi_ja_en/ASR/RESULTS.md

+  --chunk-size 16 \
+  --left-context-frames 128 \
+  --tokens data/lang/bbpe_2000/tokens.txt \
+  --epoch 10 \


Since your dataset is large, can you also try --iter instead of --epoch?

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (5)

egs/mls_english/ASR/zipformer/streaming_decode.py (5)

37-37: Use MLSEnglishHFAsrDataModule instead of ReazonSpeechAsrDataModule.

This script imports ReazonSpeechAsrDataModule but should import the MLS English data module to align with the MLS English recipe context.

This is identical to the past review comment flagging the same issue.

515-522: Fix stream completion logic and tensor conversion.

Two critical issues: all streams are unconditionally marked as finished, and tensor values need conversion to Python integers.

This matches the past review comment identifying the same issues.

695-697: Fix variable name collision in WER summary filename.

The variable key is shadowed by the loop variable, creating ambiguous filename generation.

This matches the past review comment about the same variable scoping issue.

712-717: Use MLSEnglishHFAsrDataModule instead of ReazonSpeechAsrDataModule.

The main function uses ReazonSpeechAsrDataModule but should use the MLS English data module for consistency with the MLS English recipe.

This matches the past review comment flagging the same data module mismatch.

856-865: Use MLSEnglishHFAsrDataModule instead of ReazonSpeechAsrDataModule.

The dataset loading uses ReazonSpeechAsrDataModule but should use the appropriate MLS English data module.

This matches the past review comment identifying the same data module issue.

🧹 Nitpick comments (2)

egs/mls_english/ASR/zipformer/streaming_decode.py (2)
29-29: Remove unused pdb import.

The pdb import is not used in the code and should be removed.

Apply this diff:
-import pdb
415-418: Consider using iterable unpacking for state concatenation.

The state list concatenation can be simplified with iterable unpacking for better readability.

Apply this diff:
-    new_states = new_encoder_states + [
-        new_cached_embed_left_pad,
-        new_processed_lens,
-    ]
+    new_states = [*new_encoder_states, new_cached_embed_left_pad, new_processed_lens]

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ecbe985 and d74e232.

📒 Files selected for processing (2)

egs/mls_english/ASR/zipformer/streaming_decode.py (1 hunks)
egs/reazonspeech/ASR/local/compute_fbank_musan.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

egs/reazonspeech/ASR/local/compute_fbank_musan.py

🧰 Additional context used

🧬 Code graph analysis (1)

egs/mls_english/ASR/zipformer/streaming_decode.py (7)

egs/mls_english/ASR/zipformer/decode_stream.py (5)

DecodeStream (28-148)

get_feature_frames (123-137)

id (105-106)

set_features (108-121)

decoding_result (139-148)

egs/mls_english/ASR/zipformer/streaming_beam_search.py (3)

fast_beam_search_one_best (208-295)

greedy_search (30-96)

modified_beam_search (99-205)

egs/mls_english/ASR/zipformer/tokenizer.py (4)

Tokenizer (9-169)

add_arguments (13-25)

piece_to_id (191-195)

get_piece_size (200-201)

egs/mls_english/ASR/zipformer/train.py (3)

add_model_arguments (125-258)

get_model (612-640)

get_params (482-545)

icefall/checkpoint.py (2)

average_checkpoints_with_averaged_model (389-454)

find_checkpoints (254-314)

icefall/utils.py (4)

AttributeDict (214-236)

make_pad_mask (1394-1431)

store_transcripts (558-580)

str2bool (142-158)

egs/mls_english/ASR/local/utils/asr_datamodule.py (3)

add_arguments (62-180)

valid_cuts (354-358)

test_cuts (361-365)

🪛 Ruff (0.12.2)

egs/mls_english/ASR/zipformer/streaming_decode.py

29-29: Import for pdb found

(T100)

201-201: Do not perform function call torch.device in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

415-418: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

511-511: Avoid specifying long messages outside the exception class

(TRY003)

663-663: Avoid specifying long messages outside the exception class

(TRY003)

774-777: Avoid specifying long messages outside the exception class

(TRY003)

779-782: Avoid specifying long messages outside the exception class

(TRY003)

803-806: Avoid specifying long messages outside the exception class

(TRY003)

808-811: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (180)

GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.0 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.0 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.0 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.0 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.0 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.0 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.0 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.0 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.0 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch1.13.1 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.0.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.1 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.0 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630
GitHub Check: py3.10 torch2.1.0 v20250630
GitHub Check: py3.10 torch2.3.1 v20250630
GitHub Check: py3.10 torch2.2.2 v20250630
GitHub Check: py3.10 torch1.13.0 v20250630
GitHub Check: py3.10 torch2.4.1 v20250630
GitHub Check: py3.10 torch2.2.1 v20250630
GitHub Check: py3.10 torch2.1.2 v20250630

🔇 Additional comments (2)

egs/mls_english/ASR/zipformer/streaming_decode.py (2)
388-396: Correct processed mask construction.

The mask construction is correct here and properly computes padding positions without the flip(1) issue mentioned in past comments.

32-32: Add missing import for Optional type hint.

The Optional type hint is used in the suggested fix but not imported.

Apply this diff:
 from pathlib import Path
-from typing import Dict, List, Optional, Tuple
+from typing import Dict, List, Optional, Tuple
Note: This appears to already be imported correctly, so no change is needed.

Likely an incorrect or invalid review comment.

coderabbitai · 2025-09-11T06:49:15Z

egs/mls_english/ASR/zipformer/streaming_decode.py

+Usage:
+./zipformer/streaming_decode.py--epoch 28   --avg 15   --causal 1   --chunk-size 32   --left-context-frames 256   --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192
+
+"""


⚠️ Potential issue

Fix malformed command line example in docstring.

The usage example has a syntax error with missing space after streaming_decode.py.

Apply this diff:

-./zipformer/streaming_decode.py--epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192 +./zipformer/streaming_decode.py --epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Usage:

./zipformer/streaming_decode.py--epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192

"""

Usage:

./zipformer/streaming_decode.py --epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192

"""

🤖 Prompt for AI Agents

egs/mls_english/ASR/zipformer/streaming_decode.py around lines 20 to 23: the docstring usage example is malformed because there's no space between "streaming_decode.py" and the command-line flags; fix by inserting a space after "streaming_decode.py" so the example reads "streaming_decode.py --epoch 28 ..." and ensure the rest of flags remain unchanged.

coderabbitai · 2025-09-11T06:49:16Z

egs/mls_english/ASR/zipformer/streaming_decode.py

+def get_init_states(
+    model: nn.Module,
+    batch_size: int = 1,
+    device: torch.device = torch.device("cpu"),


🛠️ Refactor suggestion

Avoid mutable default argument with function call.

Using torch.device("cpu") as a default argument is problematic as it creates the device object at function definition time rather than call time.

Apply this diff:

-def get_init_states( - model: nn.Module, - batch_size: int = 1, - device: torch.device = torch.device("cpu"), -) -> List[torch.Tensor]: +def get_init_states( + model: nn.Module, + batch_size: int = 1, + device: Optional[torch.device] = None, +) -> List[torch.Tensor]:

And update the function body:

+ if device is None: + device = torch.device("cpu")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

device: torch.device = torch.device("cpu"),

def get_init_states(

model: nn.Module,

batch_size: int = 1,

device: Optional[torch.device] = None,

) -> List[torch.Tensor]:

if device is None:

device = torch.device("cpu")

🧰 Tools

🪛 Ruff (0.12.2)

201-201: Do not perform function call torch.device in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

🤖 Prompt for AI Agents

In egs/mls_english/ASR/zipformer/streaming_decode.py around line 201, the function currently uses torch.device("cpu") as a default argument which is created at import time; change the parameter default to None (typing it as Optional[torch.device]), add "from typing import Optional" if not present, and inside the function set device = torch.device("cpu") when device is None so the device object is created at call time; update any type hints and callers if necessary.

csukuangfj · 2025-09-16T03:37:34Z

egs/mls_english/ASR/RESULTS.md

+
+
+The pre-trained model is available here : [reazon-research/mls-english
+](https://huggingface.co/reazon-research/mls-english)


This URL leads to a 404 page. Can you make it public?

kinanmartin and others added 30 commits July 28, 2025 17:49

WIP v0 MLS English recipe

28f6545

update prepare.sh, fix asr_datamodule.py

ac0c0ed

change default path

a1fc642

replace file

defc71b

cleaned-up version of recipe

efe015d

symlink copied files to librispeech recipe dir

8c1c710

separate transcript prep stage from bpe train stage

8985259

pre-commit hooks

a34d34a

readme

ce44150

instead of on-the-fly features, precompute fbank and manifests in pre…

68e3cea

…pare.sh

move compute_fbank_mls_english.py, add validate_manifest.py, add shar…

d6e3c98

…ed symlink to librispeech

adjusted prepare.sh to only calculate fbank and manifest together; ad…

4ca8ee9

…just datamodule to load from manifest files

fix validation manifest name

59519a4

fix stage 2 and 3

f2e0171

optimize with num_jobs on save_audios

fa84782

new version of multi_ja_en prepare.sh script which swaps Librispeech …

abebb6a

…for MLS English

add fbank

c83b115

Revert "add fbank"

61e81bf

This reverts commit ba603e0.

deprecate params.bilingual=0, replace ReazonSpeechAsrDataModule for M…

3751441

…ultiDatasetAsrDataModule, not tested yet

remove bilingual tag from train.py

6d71d9c

restore version of mls_english compute_fbank_mls_english.py and prepa…

5417e09

…re.sh from commit 547f5c5

fix stage 5 output pathing

782e1fb

switch mls_english clone from https to ssh

f4b2987

use huggingface_hub library to download mls_english

a8ecb16

Combined updates. Changed BBPE path structure, changed dataset path s…

3307836

…tructure, added script to update cutset paths. WIP

fix decode script data module usage

2f1c611

add utility file for updating the storage_path of cutsets for use in …

eafbd64

…the multilingual training recipe directory structure

add utility file for creating subsets of mls english. must be fixed t…

b167ac7

…o make dev and test splits have matching sizes to reazonspeech

Parametrize dev and test split sizes.

ad1be22

Add failsafe for MLS English dev set key alternate name as validation

78ee595

baileyeet and others added 3 commits August 14, 2025 16:05

Update RESULTS.md

8e18616

Update README.md

556a3f0

Merge pull request #4 from reazon-research/musan-mls-clean-final

36fc1f1

Musan mls clean final

coderabbitai bot reviewed Aug 28, 2025

View reviewed changes

Remove changes to files outside of relevant recipes

7231cf4

baileyeet and others added 3 commits September 2, 2025 11:42

reformat

a4c1db5

Update RESULTS.md

2859c22

Update RESULTS.md

9a940c3

coderabbitai bot reviewed Sep 2, 2025

View reviewed changes

kinanmartin and others added 2 commits September 2, 2025 18:05

Update egs/multi_ja_en/ASR/RESULTS.md

f64a706

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

Update egs/mls_english/ASR/local/utils/asr_datamodule.py

ef7664e

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai bot reviewed Sep 2, 2025

View reviewed changes

Update training commands and decode.py accuracy values, add streaming…

bc2560c

… model section

coderabbitai bot reviewed Sep 3, 2025

View reviewed changes

Update streaming train and export commands

ecbe985

coderabbitai bot reviewed Sep 4, 2025

View reviewed changes

csukuangfj reviewed Sep 4, 2025

View reviewed changes

baileyeet and others added 4 commits September 11, 2025 15:18

Remove accidentally added submodule musan-k2-v2-reazonspeech-medium

a30e80c

Update egs/reazonspeech/ASR/local/compute_fbank_musan.py

9d389cd

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

Update egs/mls_english/ASR/zipformer/streaming_decode.py

8c84639

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

Merge branch 'master' into multi_ja_en_mls_english_clean

d74e232

coderabbitai bot reviewed Sep 11, 2025

View reviewed changes

csukuangfj reviewed Sep 16, 2025

View reviewed changes

		@@ -0,0 +1 @@
		Subproject commit 157dc62a6439f03de1fcccb4914fc0d30b6f21b8

		from asr_datamodule import ReazonSpeechAsrDataModule
		from decode_stream import DecodeStream



		The pre-trained model is available here : [reazon-research/mls-english
		](https://huggingface.co/reazon-research/mls-english)

Bilingual Japanese / English zipformer recipe (multi_ja_en) and MLS English recipe (mls_english) #2015

Are you sure you want to change the base?

Bilingual Japanese / English zipformer recipe (multi_ja_en) and MLS English recipe (mls_english) #2015

Uh oh!

Conversation

kinanmartin commented Aug 28, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Aug 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Pre-merge checks (2 passed, 1 warning)

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

pingfengluo commented Aug 29, 2025

Summary by CodeRabbit

Uh oh!

kinanmartin commented Sep 1, 2025

Uh oh!

csukuangfj commented Sep 1, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 2, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 2, 2025

Choose a reason for hiding this comment

kinanmartin commented Aug 28, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Aug 28, 2025 •

edited

Loading