Skip to content

Conversation

@kinanmartin
Copy link

@kinanmartin kinanmartin commented Aug 28, 2025

This PR adds a new recipe, mls_english, and overhauls an existing recipe, multi_ja_en.

In this prior PR, we added the multi_ja_en recipe for bilingual Japanese / English models. However, we observed that models performed significantly worse at English speech compared to Japanese speech. After determining the cause to be the large imbalance between the number of hours of English data compared to Japanese data, we decided to find a larger English dataset and use it instead of LibriSpeech.

Ultimately, we decided to use the English portion of Multilingual LibriSpeech (MLS English). However, this dataset did not yet have a recipe in icefall. So, we created a new recipe for MLS English, and then updated the multi_ja_en to rely on both reazonspeech recipe and mls_english recipe.

This PR includes both full recipes, so I will close my prior PR which implements the mls_english recipe alone.

Please see multi_ja_en/ASR/README.md and mls_english/ASR/README.md for more details about each recipe.

Please let me and @baileyeet know if there are any comments or concerns.

Summary by CodeRabbit

  • New Features

    • Added MLS English ASR recipe with end-to-end prepare, train, and decode (streaming and non-streaming).
    • Introduced multi-dataset data module, Hugging Face dataset support, on-the-fly features, and MUSAN augmentation.
    • Added utilities: dataset downloader, subset creation by hours, transcript generation, BPE training, and manifest path updater.
    • Provided tokenizer support (BPE/char).
  • Documentation

    • New MLS English README and RESULTS with WERs and commands.
    • Updated multi_ja_en README/RESULTS to use MLS English + ReazonSpeech, with revised commands and metrics.
  • Refactor

    • Reorganized data/manifests layout and language assets; unified SentencePiece pipeline.
  • Chores

    • Improved device selection for Whisper Fbank; added shared Zipformer links.

kinanmartin and others added 30 commits July 28, 2025 17:49
This reverts commit ba603e0.
…tructure, added script to update cutset paths. WIP
…the multilingual training recipe directory structure
…o make dev and test splits have matching sizes to reazonspeech
@coderabbitai
Copy link

coderabbitai bot commented Aug 28, 2025

Walkthrough

Adds a complete MLS English ASR recipe and integrates it into the multi_ja_en pipeline. Introduces dataset download, feature/materialization from Hugging Face, transcripts/BPE tooling, a new MLS English data module, and Zipformer training/decoding (non-streaming/streaming). Updates multi_ja_en to use MLS English + ReazonSpeech, overhauls manifests/lang prep, and replaces legacy LibriSpeech/char-token flows.

Changes

Cohort / File(s) Summary
Docs (MLS English, Multi JA-EN)
egs/mls_english/ASR/README.md, egs/mls_english/ASR/RESULTS.md, egs/multi_ja_en/ASR/README.md, egs/multi_ja_en/ASR/RESULTS.md
Adds MLS English recipe docs and results; updates multi_ja_en docs/results to MLS English + ReazonSpeech, new commands, metrics, and links.
MLS English data prep pipeline
egs/mls_english/ASR/prepare.sh, egs/mls_english/ASR/local/compute_fbank_mls_english.py, egs/mls_english/ASR/local/display_manifest_statistics.py, egs/mls_english/ASR/local/compute_fbank_musan.py, egs/mls_english/ASR/shared
Stage-based prep script; switches to Hugging Face CutSet creation; updates cut filenames and adds test split; adds Musan fbank symlink; shares common assets.
MLS English utilities
egs/mls_english/ASR/local/utils/download_mls_english.py, egs/mls_english/ASR/local/utils/generate_transcript.py, egs/mls_english/ASR/local/train_bpe_model.py, egs/mls_english/ASR/local/utils/asr_datamodule.py, egs/mls_english/ASR/local/utils/create_subsets_greedy.py
New HF dataset downloader; transcript generator from cuts; BPE trainer; MLS English HF data module with on-the-fly/precomputed features, MUSAN, SpecAugment; tool to create hour-limited Parquet subsets.
MLS English Zipformer decoding/training
egs/mls_english/ASR/zipformer/decode.py, egs/mls_english/ASR/zipformer/streaming_decode.py
Full-featured non-streaming and streaming Zipformer decoders with multiple decoding methods, LM integration, graphs, WER reporting.
MLS English Zipformer pointers
egs/mls_english/ASR/zipformer/asr_datamodule.py, .../beam_search.py, .../ctc_decode.py, .../decode_stream.py, .../decoder.py, .../encoder_interface.py, .../export-onnx.py, .../export.py, .../generate_averaged_model.py, .../joiner.py, .../model.py, .../my_profile.py, .../onnx_pretrained.py, .../optim.py, .../pretrained.py, .../scaling.py, .../scaling_converter.py, .../streaming_beam_search.py, .../subsampling.py, .../test_scaling.py, .../test_subsampling.py, .../zipformer.py
Adds modules referencing existing Librispeech implementations to reuse core Zipformer components.
MLS English Zipformer training (transducer)
egs/mls_english/ASR/zipformer/do_not_use_it_directly.py
Comprehensive RNN-T training script with DDP, mixed precision, checkpointing, diagnostics, and loss scheduling.
Multi JA-EN data/lang prep changes
egs/multi_ja_en/ASR/prepare.sh, egs/multi_ja_en/ASR/local/prepare_lang_bbpe.py, egs/multi_ja_en/ASR/local/train_bbpe_model.py, egs/multi_ja_en/ASR/local/utils/update_cutset_paths.py, egs/multi_ja_en/ASR/local/prepare_lang_char.py (removed), egs/multi_ja_en/ASR/local/utils/asr_datamodule.py, egs/multi_ja_en/ASR/local/validate_bpe_lexicon.py
Reworks manifests layout and BBPE pipeline; adds path updater; enhances BBPE trainer options/output handling; adds MUSAN integration in datamodule; removes char-lang prep; adjusts validations and defaults.
Multi JA-EN Zipformer code updates
egs/multi_ja_en/ASR/zipformer/decode.py, .../do_not_use_it_directly.py, .../streaming_decode.py, .../train.py, .../multi_dataset.py
Switches to MultiDatasetAsrDataModule; unifies SentencePiece-based text encoding; merges MLS English + ReazonSpeech mux; simplifies test handling; updates signatures and defaults.
ReazonSpeech utility tweak
egs/reazonspeech/ASR/local/compute_fbank_musan.py
WhisperFbank now auto-selects device and warns if falling back to CPU.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant P as prepare.sh
  participant HF as HuggingFace Hub
  participant L as Lhotse
  participant M as MUSAN
  participant SPM as SentencePiece

  U->>P: Run prepare.sh (stages 0..6)
  P->>HF: snapshot_download mls_eng
  P->>L: CutSet.from_huggingface_dataset (train/dev/test)
  P->>L: save_audios + compute_and_store_features
  P->>M: lhotse prepare & fbank
  P->>L: generate transcript.txt
  P->>SPM: train BPE (bbpe.model)
  P->>P: update_cutset_paths
  P-->>U: Manifests, features, lang ready
Loading
sequenceDiagram
  autonumber
  participant T as train.py
  participant DM as MLSEnglish/MultiDataset DataModule
  participant D as DataLoader
  participant Z as Zipformer Model
  participant CK as Checkpointer

  T->>DM: build train/valid datasets (OTF feats, MUSAN, aug)
  DM-->>D: DynamicBucketingSampler loaders
  loop per epoch
    D->>Z: batches (features, tokens)
    Z-->>T: losses (transducer/CTC)
    T->>CK: save checkpoints / best
  end
Loading
sequenceDiagram
  autonumber
  participant Dec as decode.py
  participant DM as DataModule
  participant DL as DataLoader
  participant Z as Zipformer Model
  participant G as Decoding Graph/LM
  participant Eval as WER

  Dec->>DM: test_cuts -> DataLoader
  Dec->>Z: load/avg checkpoints
  Dec->>G: build (trivial/L/G/context)
  DL->>Z: batch features
  Z-->>Dec: hypotheses (greedy/beam/fast)
  Dec->>Eval: compute WER, save results
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Possibly related PRs

Suggested reviewers

  • csukuangfj

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title concisely and accurately summarizes the PR’s primary changes: adding an MLS English recipe and updating the bilingual Japanese/English Zipformer recipe (multi_ja_en), matching the provided PR objectives and file summaries. It is specific, focused on the main intent, and omits extraneous detail.

Poem

A rabbit hopped through manifests’ glen,
Wove MLS with ReazonSpeech again.
Buckets brim, beams align,
Zipformer sings in tidy time.
BPE seeds in moonlit byte,
WERs fall softly—what a sight.
Hop, commit—tests green tonight! 🐇✨

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 47

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (15)
egs/tedlium3/ASR/conformer_ctc2/transformer.py (1)

606-616: Document kwargs behavior and ensure TorchScript compatibility

  • In egs/tedlium3/ASR/conformer_ctc2/transformer.py’s TransformerDecoderLayer.forward, append to the docstring:
               warmup:
                 controls selective bypass of layers; if < 1.0, we will bypass the layer more frequently (default=1.0).
  •      kwargs:
    
  •        accepted for API compatibility; currently ignored.
    
  • In egs/tedlium3/ASR/conformer_ctc2/export.py (around line 277), insert before model = torch.jit.script(model):
    # Prevent **kwargs in TransformerDecoderLayer.forward from breaking JIT
    model.__class__.forward = torch.jit.ignore(model.__class__.forward)
    to avoid scripting errors.
egs/mls_english/ASR/zipformer/optim.py (2)

103-116: Fix NameError: ‘key’ is undefined when batching parameters.

stacked_params_dict[key] = p_stacked uses key that isn’t defined in this scope.

Apply:

         for batch, batch_names in zip(batches, batches_names):
             p = batch[0]
@@
-            p_stacked.grad = grad
-            stacked_params_dict[key] = p_stacked
+            p_stacked.grad = grad
+            # derive the batch key from a representative param
+            key = (str(p.dtype), *p.shape)
+            stacked_params_dict[key] = p_stacked
             tuples.append((p_stacked, state, batch_names))

551-553: Fix undefined variable: use first_p.device instead of p.device.

p is not defined in this scope; should reference first_p.

Apply:

-            first_state["model_norms"] = torch.zeros(
-                clipping_update_period, device=p.device
-            )
+            first_state["model_norms"] = torch.zeros(
+                clipping_update_period, device=first_p.device
+            )
egs/mls_english/ASR/zipformer/my_profile.py (2)

158-163: Fix typo in get_model_profile kwarg and avoid shadowing config “params”

Typo: module_hoop_mapping -> module_hook_mapping. Also rename the returned “params” to avoid shadowing the config dict.

-    flops, params = get_model_profile(
+    flops, param_count = get_model_profile(
         model=model,
         args=(feature, feature_lens),
-        module_hoop_mapping=MODULE_HOOK_MAPPING,
+        module_hook_mapping=MODULE_HOOK_MAPPING,
     )
-    logging.info(f"For the encoder part, params: {params}, flops: {flops}")
+    logging.info(f"For the encoder part, params: {param_count}, flops: {flops}")

1-171: Remove outdated relative-path license headers in MLS English zipformer files
Remove the line

# See ../../../../LICENSE for clarification regarding multiple authors  

from all Python scripts in egs/mls_english/ASR/zipformer (my_profile.py, train.py, streaming_decode.py, decode.py, do_not_use_it_directly.py).

egs/mls_english/ASR/zipformer/zipformer.py (1)

239-251: Fix dtype bug when building feature masks (logical_and on float tensors).

torch.logical_and expects bool tensors; here it’s applied to float masks, which will error. Build boolean masks, then cast once for multiplication.

Apply this diff:

-        # mask1 shape: (1, batch_size, 1)
-        mask1 = (
-            torch.rand(1, batch_size, 1, device=x.device) > feature_mask_dropout_prob
-        ).to(x.dtype)
-
-        # mask2 has additional sequences masked, about twice the number.
-        mask2 = torch.logical_and(
-            mask1,
-            (
-                torch.rand(1, batch_size, 1, device=x.device)
-                > feature_mask_dropout_prob
-            ).to(x.dtype),
-        )
+        # mask1 shape: (1, batch_size, 1)
+        mask1_bool = torch.rand(1, batch_size, 1, device=x.device) > feature_mask_dropout_prob
+        # mask2 has additional sequences masked, about twice the number.
+        mask2_bool = torch.logical_and(
+            mask1_bool,
+            torch.rand(1, batch_size, 1, device=x.device) > feature_mask_dropout_prob,
+        )
 
-        # dim: (1, batch_size, 2)
-        mask = torch.cat((mask1, mask2), dim=-1)
+        # dim: (1, batch_size, 2) as dtype of x for downstream multiplication
+        mask = torch.cat((mask1_bool.to(x.dtype), mask2_bool.to(x.dtype)), dim=-1)
egs/multi_ja_en/ASR/local/prepare_lang_bbpe.py (1)

24-34: Fix mismatch between CLI/docs and actual BBPE model path; wire up --vocab-size and add fallback search.

Docs and help mention bbpe_<vocab_size>/bbpe.model, but the code only checks lang_dir/bbpe.model and ignores --vocab-size. This will raise FileNotFoundError in the new layout.

Apply:

@@
-    parser.add_argument(
-        "--lang-dir",
-        type=str,
-        help="""Input and output directory.
-        It should contain the words.txt file and the
-        bbpe model in a subdirectory (e.g., bbpe_2000/bbpe.model).
-        """,
-    )
+    parser.add_argument(
+        "--lang-dir",
+        type=str,
+        help="""Input and output directory.
+        It should contain words.txt and a BBPE model, either:
+          - {lang_dir}/bbpe_{vocab_size}/bbpe.model, or
+          - {lang_dir}/bbpe.model (deprecated fallback).
+        """,
+    )
@@
-    parser.add_argument(
-        "--vocab-size",
-        type=int,
-        default=2000,  # Add a default value for vocab_size for consistency
-        help="Vocabulary size used for BPE training (determines the bbpe model directory).",
-    )
+    parser.add_argument(
+        "--vocab-size",
+        type=int,
+        default=2000,
+        help="Vocabulary size used for BBPE (drives bbpe_{size}/bbpe.model lookup).",
+    )
@@
-    model_file = lang_dir / "bbpe.model"
-
-    if not model_file.is_file():
-        raise FileNotFoundError(f"BPE model not found at: {model_file}")
+    # Prefer new layout; fall back to legacy flat file.
+    candidate_new = lang_dir / f"bbpe_{args.vocab_size}" / "bbpe.model"
+    candidate_legacy = lang_dir / "bbpe.model"
+    model_file = candidate_new if candidate_new.is_file() else candidate_legacy
+    if not model_file.is_file():
+        raise FileNotFoundError(
+            "BBPE model not found. Searched:\n"
+            f"  - {candidate_new}\n"
+            f"  - {candidate_legacy}\n"
+            "Please run local/train_bbpe_model.py or adjust --vocab-size/lang-dir."
+        )
@@
-    lexicon, token_sym_table = generate_lexicon(str(model_file), words, args.oov)
+    lexicon, token_sym_table = generate_lexicon(str(model_file), words, args.oov)

Also applies to: 170-179, 188-194, 215-219, 230-231

egs/mls_english/ASR/local/compute_fbank_mls_english.py (1)

97-104: Require essential CLI arguments.

These paths are dereferenced unconditionally; make them required to avoid None/TypeErrors.

-    parser.add_argument("-m", "--manifest-dir", type=Path)
-    parser.add_argument("-a", "--audio-dir", type=Path)
-    parser.add_argument("-d", "--dl-dir", type=Path)
+    parser.add_argument("-m", "--manifest-dir", type=Path, required=True, help="Directory for feature manifests and done marker")
+    parser.add_argument("-a", "--audio-dir", type=Path, required=True, help="Directory to materialize audio per split")
+    parser.add_argument("-d", "--dl-dir", type=Path, required=True, help="HF dataset name or local path (e.g., parler-tts/mls_eng)")
egs/multi_ja_en/ASR/local/train_bbpe_model.py (2)

83-87: Stream input conversion; avoid loading whole file into memory and set encoding

readlines() loads the entire transcript and encoding is unspecified.

-def _convert_to_bchar(in_path: str, out_path: str):
-    with open(out_path, "w") as f:
-        for line in open(in_path, "r").readlines():
-            f.write(byte_encode(tokenize_by_ja_char(line)) + "\n")
+def _convert_to_bchar(in_path: str, out_path: str):
+    with open(in_path, "r", encoding="utf-8") as src, open(
+        out_path, "w", encoding="utf-8"
+    ) as dst:
+        for line in src:
+            dst.write(byte_encode(tokenize_by_ja_char(line)) + "\n")

114-117: Use delete=False for NamedTemporaryFile to avoid Windows sharing issues

Opening the same path for write while NamedTemporaryFile is open can fail on Windows.

-    temp = tempfile.NamedTemporaryFile()
-    train_text = temp.name
+    temp = tempfile.NamedTemporaryFile(delete=False)
+    train_text = temp.name
+    temp.close()
egs/mls_english/ASR/zipformer/asr_datamodule.py (1)

1-366: De-duplicate: make this a thin re-export to the local/utils module

Keep a single implementation and re-export here to avoid drift.

@@
-<remove entire file contents>
+from ..local.utils.asr_datamodule import MLSEnglishHFAsrDataModule
+
+__all__ = ["MLSEnglishHFAsrDataModule"]
egs/mls_english/ASR/zipformer/do_not_use_it_directly.py (1)

1-1262: Replace lingering ReazonSpeechAsrDataModule imports with MLSEnglishHFAsrDataModule

  • In all scripts under egs/mls_english/ASR/zipformer/ (e.g. do_not_use_it_directly.py, streaming_decode.py, decode.py, train.py), remove
    from asr_datamodule import ReazonSpeechAsrDataModule
    and instead import
    from ASR.local.utils.asr_datamodule import MLSEnglishHFAsrDataModule
  • Update all instantiations of ReazonSpeechAsrDataModule to use MLSEnglishHFAsrDataModule(args), ensuring consistency with the sole asr_datamodule.py under ASR/local/utils.
egs/multi_ja_en/ASR/zipformer/decode.py (3)

67-72: Import error: asr_datamodule is not in this package path. Add sys.path shim.

-import torch.nn as nn
-from asr_datamodule import MultiDatasetAsrDataModule
+import torch.nn as nn
+import sys as _sys
+from pathlib import Path as _P
+_sys.path.append(str((_P(__file__).resolve().parents[1] / "local" / "utils")))
+from asr_datamodule import MultiDatasetAsrDataModule

379-392: Bug: tokenization for oracle n-best uses a list where a string is expected.

supervisions["text"] is a list. Apply normalization per-utterance.

-            num_paths=params.num_paths,
-            ref_texts=sp.encode(byte_encode(tokenize_by_ja_char(supervisions["text"]))),
+            num_paths=params.num_paths,
+            ref_texts=sp.encode(
+                [byte_encode(tokenize_by_ja_char(t)) for t in supervisions["text"]]
+            ),
             nbest_scale=params.nbest_scale,

556-564: WER summary filename uses an out-of-scope 'key'.

This will reference the last loop value or fail if empty. Remove 'key' from the filename.

-    errs_info = (
-        params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt"
-    )
+    errs_info = params.res_dir / f"wer-summary-{test_set_name}-{params.suffix}.txt"
🧹 Nitpick comments (74)
egs/mls_english/ASR/zipformer/scaling.py (1)

193-206: Optional: fix odd extra_repr formatting.

extra_repr() prints self.schedule.pairs[1:-1], dropping the end points. Use the full list for clarity.

-        return (
-            f"batch_count={self.batch_count}, schedule={str(self.schedule.pairs[1:-1])}"
-        )
+        return f"batch_count={self.batch_count}, schedule={self.schedule.pairs}"
egs/mls_english/ASR/zipformer/export.py (2)

25-29: Adjust usage/docs to reference MLS English paths

The embedded usage/help still references “librispeech” paths. Since this file lives under mls_english, update examples to avoid confusion.

Example diff:

-Note: This is a example for librispeech dataset, if you are using different
+Note: This is an example for the MLS English dataset; if you are using a different
 dataset, you should change the argument values according to your dataset.
@@
-    cd /path/to/egs/librispeech/ASR
+    cd /path/to/egs/mls_english/ASR
@@
-    cd /path/to/egs/librispeech/ASR
+    cd /path/to/egs/mls_english/ASR

Also applies to: 99-107, 115-129, 129-140


399-405: Enable CUDA automatically (or add a flag)

Export can benefit from GPU when available. Consider auto-selecting CUDA to avoid accidental CPU-only runs.

-    device = torch.device("cpu")
-    # if torch.cuda.is_available():
-    #     device = torch.device("cuda", 0)
+    device = torch.device("cuda", 0) if torch.cuda.is_available() else torch.device("cpu")
egs/mls_english/ASR/zipformer/optim.py (2)

463-463: Dead variable.

batch = True is unused; remove or document if intended for future use.

-        batch = True

480-503: Guard in-place grad scaling.

p.grad.mul_(clipping_scale) mutates grad in-place; if the same grad tensor is reused elsewhere, this could have side effects. Consider using out-of-place scaling.

-                    grad = (
-                        p.grad if clipping_scale == 1.0 else p.grad.mul_(clipping_scale)
-                    )
+                    grad = p.grad if clipping_scale == 1.0 else p.grad * clipping_scale
egs/mls_english/ASR/zipformer/my_profile.py (1)

111-111: Fix misleading shape comment after permute

The permute converts (T, N, C) -> (N, T, C), not the other way around.

-        encoder_out = encoder_out.permute(1, 0, 2)  # (N, T, C) -> (T, N, C)
+        encoder_out = encoder_out.permute(1, 0, 2)  # (T, N, C) -> (N, T, C)
egs/mls_english/ASR/zipformer/subsampling.py (2)

205-227: Docstring parameters are inaccurate/outdated

Second “layer1_channels” actually documents layer2; “bottleneck” isn’t a ctor arg; missing docs for layer3_channels and dropout.

-          layer1_channels:
-            Number of channels in layer2
-          bottleneck:
-            bottleneck dimension for 1d squeeze-excite
+          layer2_channels:
+            Number of channels in layer2
+          layer3_channels:
+            Number of channels in layer3 (input to ConvNeXt)
+          dropout:
+            Dropout probability or ScheduledFloat applied after output norm.

306-309: Stale comment: no explicit scaling of x occurs here

Comment references scaling x by 0.1, but code relies on ScaleGrad modules instead.

-        # scaling x by 0.1 allows us to use a larger grad-scale in fp16 "amp" (automatic mixed precision)
-        # training, since the weights in the first convolution are otherwise the limiting factor for getting infinite
-        # gradients.
+        # Gradient stability for mixed precision is handled above via ScaleGrad;
+        # no explicit scaling of x is performed here.
egs/mls_english/ASR/zipformer/decoder.py (1)

79-88: Guard grouped Conv1d with a divisibility check

groups=decoder_dim//4 requires decoder_dim % 4 == 0; assert early for clearer errors.

         assert context_size >= 1, context_size
+        assert decoder_dim % 4 == 0, "decoder_dim must be divisible by 4 when using groups=decoder_dim//4"
         self.context_size = context_size
         self.vocab_size = vocab_size
egs/mls_english/ASR/local/display_manifest_statistics.py (1)

34-46: Rename helper to reflect return type and simplify usage.

get_parser() returns parsed args, not a parser. Rename for clarity.

Apply this diff:

-def get_parser():
+def get_args():
     parser = argparse.ArgumentParser(
         description=ARGPARSE_DESCRIPTION,
         formatter_class=argparse.ArgumentDefaultsHelpFormatter,
     )
@@
-    return parser.parse_args()
+    return parser.parse_args()
@@
-def main():
-    args = get_parser()
+def main():
+    args = get_args()
egs/reazonspeech/ASR/local/utils/asr_datamodule.py (2)

189-194: Docstring: include cuts_musan in Args for completeness.

Apply this diff:

         Args:
           cuts_train:
             CutSet for training.
           sampler_state_dict:
             The state dict for the training sampler.
+          cuts_musan:
+            Optional CutSet of MUSAN noise to mix in via CutMix when enabled.

362-366: Fix return type annotation: this returns a single CutSet, not a list.

Apply this diff:

-    def test_cuts(self) -> List[CutSet]:
+    def test_cuts(self) -> CutSet:
@@
-        return load_manifest_lazy(
+        return load_manifest_lazy(
             self.args.manifest_dir / "reazonspeech_cuts_test.jsonl.gz"
         )
icefall/utils.py (1)

1394-1399: Make pad_left keyword-only to avoid accidental positional misuse

This preserves backward compatibility for existing positional calls that pass only lengths or (lengths, max_len).

-def make_pad_mask(
-    lengths: torch.Tensor,
-    max_len: int = 0,
-    pad_left: bool = False,
-) -> torch.Tensor:
+def make_pad_mask(
+    lengths: torch.Tensor,
+    max_len: int = 0,
+    *,
+    pad_left: bool = False,
+) -> torch.Tensor:
egs/mls_english/ASR/README.md (2)

10-12: Replace bare URLs with Markdown links

Silences MD034 and improves readability.

-- Dataset: https://huggingface.co/datasets/parler-tts/mls_eng
-- Original MLS dataset link: https://www.openslr.org/94
+- Dataset: [Hugging Face – parler-tts/mls_eng](https://huggingface.co/datasets/parler-tts/mls_eng)
+- Original MLS dataset: [OpenSLR 94](https://www.openslr.org/94)

16-19: Remove HTML/break hacks; tighten wording

Avoid backslash/HTML line breaks in Markdown; keep content concise.

-This recipe currently only supports on-the-fly feature bank computation, since `lhotse` manifests and feature banks are not pre-calculated in this recipe. This should mean that the dataset can be streamed from Hugging Face, but we have not tested this yet. We may add a version that supports pre-calculating features to better match existing recipes.\
-<br>
-
-[./RESULTS.md](./RESULTS.md) contains the latest results. This MLS English recipe was primarily developed for use in the ```multi_ja_en``` Japanese-English bilingual pipeline, which is based on MLS English and ReazonSpeech.
+This recipe currently supports only on-the-fly feature computation (no pre-calculated `lhotse` manifests/feature banks yet). Streaming from Hugging Face should work, but is untested. We may add a pre-calculated features option later.
+
+See [RESULTS.md](./RESULTS.md) for the latest results. This MLS English recipe was primarily developed for the `multi_ja_en` Japanese–English bilingual pipeline (MLS English + ReazonSpeech).
egs/multi_ja_en/ASR/RESULTS.md (3)

33-39: Nit: capitalize ONNX consistently

Minor wording polish in the heading; no code impact.


56-57: Use canonical dataset name “Common Voice”

Spelling consistency.

-| CommonVoice                 | 39.87   |
+| Common Voice               | 39.87   |

71-71: Replace bare URL with Markdown link

Improves readability and satisfies MD034.

-Pre-trained model can be found here: https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/main
+Pre-trained model: [reazon-research/reazonspeech-k2-v2-ja-en](https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/main)
egs/reazonspeech/ASR/zipformer/train.py (2)

68-69: Prefer lazy manifest loading for MUSAN to reduce memory footprint

load_manifest_lazy avoids eagerly materializing the CutSet. MUSAN isn’t huge, but lazy loading is safer and consistent with other loaders in this repo.

-from lhotse import load_manifest
+from lhotse import load_manifest_lazy

1223-1235: Switch MUSAN to lazy loading; keep graceful fallback

Reduces peak memory while keeping behavior identical when the manifest is absent.

-    if args.enable_musan:
-        musan_path = Path(args.manifest_dir) / "musan_cuts.jsonl.gz"
-        if musan_path.exists():
-            cuts_musan = load_manifest(musan_path)
-            logging.info(f"Loaded MUSAN manifest from {musan_path}")
-        else:
-            logging.warning(
-                f"MUSAN manifest not found at {musan_path}, disabling MUSAN augmentation"
-            )
-            cuts_musan = None
+    if args.enable_musan:
+        musan_path = Path(args.manifest_dir) / "musan_cuts.jsonl.gz"
+        if musan_path.exists():
+            cuts_musan = load_manifest_lazy(musan_path)
+            logging.info(f"Loaded MUSAN manifest (lazy) from {musan_path}")
+        else:
+            logging.warning(
+                f"MUSAN manifest not found at {musan_path}, disabling MUSAN augmentation"
+            )
+            cuts_musan = None
     else:
         cuts_musan = None
egs/multi_ja_en/ASR/README.md (4)

3-3: Fix stray punctuation

Removes the dangling comma/period.

-A bilingual Japanese-English ASR model developed by the developers of ReazonSpeech that utilizes ReazonSpeech and the English subset of Multilingual LibriSpeech (MLS English), .
+A bilingual Japanese–English ASR model developed by the ReazonSpeech team that utilizes ReazonSpeech and the English subset of Multilingual LibriSpeech (MLS English).

7-7: Use Markdown link for MLS English

Consistency with other references.

-**Multilingual LibriSpeech (MLS)** is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages - English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. It includes about 44.5K hours of English and a total of about 6K hours for other languages. This icefall training recipe was created for the restructured version of the English split of the dataset available on Hugging Face from `parler-tts` [here](https://huggingface.co/datasets/parler-tts/mls_eng).
+**Multilingual LibriSpeech (MLS)** is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages—English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. It includes about 44.5K hours of English and a total of about 6K hours for other languages. This recipe uses the restructured English split on Hugging Face: [`parler-tts/mls_eng`](https://huggingface.co/datasets/parler-tts/mls_eng).

15-20: Table fixes: header typo and links

Correct “Datset” → “Dataset” and replace bare URLs with links.

-|Datset| Number of hours| URL|
+|Dataset| Number of hours| URL|
@@
-|MLS English|44,500|https://huggingface.co/datasets/parler-tts/mls_eng|
-|ReazonSpeech (all)|35,000|https://huggingface.co/datasets/reazon-research/reazonspeech|
+|MLS English|44,500|[Hugging Face – parler-tts/mls_eng](https://huggingface.co/datasets/parler-tts/mls_eng)|
+|ReazonSpeech (all)|35,000|[Hugging Face – reazon-research/reazonspeech](https://huggingface.co/datasets/reazon-research/reazonspeech)|

23-35: Tighten usage steps; clarify balance note

Minor wording/structure improvements.

-This recipe relies on the `mls_english` recipe and the `reazonspeech` recipe. 
+This recipe relies on the `mls_english` and `reazonspeech` recipes.
@@
-This recipe does not enforce data balance: please ensure that the `mls_english` and `reazonspeech` datasets prepared above are balanced to your liking (you may use the utility script `create_subsets_greedy.py` in the `mls_english` recipe to create a custom-sized MLS English sub-dataset).
+This recipe does not enforce dataset balance. If desired, use `mls_english/local/utils/create_subsets_greedy.py` to create an MLS English subset sized to your needs.
@@
-0. Run `../../mls_english/ASR/prepare.sh` and `../../reazonspeech/ASR/prepare.sh`
+0. Run `../../mls_english/ASR/prepare.sh` and `../../reazonspeech/ASR/prepare.sh`
 1. Run `./prepare.sh`
-2. Run `update_cutset_paths.py` (we will soon add this to `./prepare.sh`)
+2. Run `local/utils/update_cutset_paths.py` (to be integrated into `./prepare.sh`)
 3. Run `zipformer/train.py` (see example arguments inside the file)
egs/multi_ja_en/ASR/local/prepare_lang_bbpe.py (1)

163-166: OOV decomposition sanity-check.

You emit OOV as ["▁", sp.id_to_piece(sp.unk_id())]. Ensure "▁" exists in the model pieces and downstream handles a two-piece OOV token; otherwise prefer just [sp.id_to_piece(sp.unk_id())].

-    lexicon.append((oov, ["▁", sp.id_to_piece(sp.unk_id())]))
+    unk_piece = sp.id_to_piece(sp.unk_id())
+    lexicon.append((oov, [unk_piece]))
egs/mls_english/ASR/local/utils/generate_transcript.py (2)

22-23: Remove unused import.

-from typing import Optional

60-64: Guard against None/newlines and normalize whitespace when writing transcripts.

Prevents blank/garbled lines.

-    with open(output_file, "w") as f:
-        for cut in tqdm(cuts, desc="Processing cuts"):
-            for sup in cut.supervisions:
-                f.write(f"{sup.text}\n")
+    with open(output_file, "w") as f:
+        for cut in tqdm(cuts, desc="Processing cuts"):
+            for sup in cut.supervisions:
+                if not sup.text:
+                    continue
+                txt = " ".join(str(sup.text).split())
+                f.write(txt + "\n")
egs/mls_english/ASR/local/utils/create_subsets_greedy.py (4)

111-127: Use enumerate for progress; avoid manual counter.

Cleaner and aligns with static analysis hint.

-        num_samples_processed = 0
-        for original_idx in all_original_indices:
+        for num_samples_processed, original_idx in enumerate(all_original_indices, 1):
@@
-            num_samples_processed += 1
-            if num_samples_processed % 10000 == 0: # Print progress periodically
+            if num_samples_processed % 10000 == 0:  # Print progress periodically
                 print(f"  Processed {num_samples_processed} samples for '{split_name}'. Current duration: {current_duration_ms / (3600*1000):.2f} hours")

136-141: Avoid mutating Features in-place; cast with a copied mapping.

Prevents side effects on the original Features object.

-        if "audio" in sampled_split.features and not isinstance(sampled_split.features["audio"], Audio):
-            sampling_rate = sampled_split.features["audio"].sampling_rate if isinstance(sampled_split.features["audio"], Audio) else 16000
-            new_features = sampled_split.features
-            new_features["audio"] = Audio(sampling_rate=sampling_rate)
-            sampled_split = sampled_split.cast(new_features)
+        if "audio" in sampled_split.features and not isinstance(sampled_split.features["audio"], Audio):
+            sampling_rate = 16000
+            new_features = sampled_split.features.copy()
+            new_features["audio"] = Audio(sampling_rate=sampling_rate)
+            sampled_split = sampled_split.cast(new_features)

170-175: Iterate dict directly (no .keys()); skip ‘validation’ cleanly.

-    for split_name in full_dataset.keys():
+    for split_name in full_dataset:
         if split_name not in ['train', 'dev', 'test', 'validation']: # Ensure 'validation' is not re-copied if not renamed

261-270: Remove unused f-strings in usage prints.

Silences F541 and reduces confusion.

-    print(f"\nTo use your new subset dataset, you can load it like this:")
-    print(f"from datasets import load_dataset")
-    print(f"import os, glob")
-    print(f"data_files = {{}}")
-    print(f"for split_name in ['train', 'dev', 'test']: # Or iterate through actual splits created")
-    print(f"    split_path = os.path.join('{output_subset_data_path}', f'{{split_name}}*.parquet')")
-    print(f"    files = glob.glob(split_path)")
-    print(f"    if files: data_files[split_name] = files")
-    print(f"subset = load_dataset('parquet', data_files=data_files)")
-    print(f"print(subset)")
+    print("\nTo use your new subset dataset, you can load it like this:")
+    print("from datasets import load_dataset")
+    print("import os, glob")
+    print("data_files = {}")
+    print("for split_name in ['train', 'dev', 'test']: # Or iterate through actual splits created")
+    print(f"    split_path = os.path.join('{output_subset_data_path}', f'{{split_name}}*.parquet')")
+    print("    files = glob.glob(split_path)")
+    print("    if files: data_files[split_name] = files")
+    print("subset = load_dataset('parquet', data_files=data_files)")
+    print("print(subset)")
egs/mls_english/ASR/local/compute_fbank_mls_english.py (1)

117-126: Minor: argparse defaults and path handling.

You gate recomputation on args.manifest_dir done file (good). Also ensure dl_dir is cast to str for load_dataset (already done) and audio_dir/manifest_dir exist.

     logging.basicConfig(format=formatter, level=logging.INFO)
 
-    if (args.manifest_dir / ".mls-eng-fbank.done").exists():
+    args.manifest_dir.mkdir(parents=True, exist_ok=True)
+    args.audio_dir.mkdir(parents=True, exist_ok=True)
+    if (args.manifest_dir / ".mls-eng-fbank.done").exists():
@@
-        mls_eng_hf_dataset_path = args.dl_dir # "/root/datasets/parler-tts--mls_eng"
+        mls_eng_hf_dataset_path = args.dl_dir  # e.g., "parler-tts/mls_eng" or local path
         cut_sets = make_cutset_blueprints(mls_eng_hf_dataset_path)
egs/multi_ja_en/ASR/local/train_bbpe_model.py (2)

27-31: Remove unused import

re is unused here.

-import re

102-104: Atomicity and artifacts: skip/rename behavior

  • Skipping solely on is_file() may miss existing symlinks. Prefer .exists().
  • SentencePiece produces .model and .vocab; you move only .model. Confirm .vocab isn’t needed downstream or move it too.
-    if output_model.is_file():
+    if output_model.exists():
         print(f"{output_model} exists - skipping")
         return
@@
-    shutil.move(str(temp_model_file), str(output_model))
+    shutil.move(str(temp_model_file), str(output_model))
+    temp_vocab_file = Path(model_prefix + ".vocab")
+    if temp_vocab_file.exists():
+        shutil.move(str(temp_vocab_file), str(output_model.with_suffix(".vocab")))

Also applies to: 133-134

egs/mls_english/ASR/local/utils/asr_datamodule.py (2)

26-36: Remove unused import

load_manifest is not used.

-from lhotse import CutSet, Fbank, FbankConfig, load_manifest, load_manifest_lazy
+from lhotse import CutSet, Fbank, FbankConfig, load_manifest_lazy

346-365: Avoid lru_cache on instance methods; fix return type of test_cuts

  • @lru_cache() on methods can retain self and leak memory (Ruff B019). Limit cache or switch to cached_property.
  • test_cuts returns a single CutSet, not List[CutSet].
-    @lru_cache()
-    def train_cuts(self) -> CutSet:
+    @lru_cache(maxsize=1)
+    def train_cuts(self) -> CutSet:
@@
-    @lru_cache()
-    def valid_cuts(self) -> CutSet:
+    @lru_cache(maxsize=1)
+    def valid_cuts(self) -> CutSet:
@@
-    @lru_cache()
-    def test_cuts(self) -> List[CutSet]:
+    @lru_cache(maxsize=1)
+    def test_cuts(self) -> CutSet:
egs/mls_english/ASR/zipformer/do_not_use_it_directly.py (6)

1052-1061: Docstring vs thresholds mismatch

Comment says keep 1–20s, code uses 0.3–30s. Align the comment to code (or vice versa).

-        # Keep only utterances with duration between 1 second and 20 seconds
+        # Keep only utterances with duration between 0.3 and 30.0 seconds

886-899: GradScaler growth heuristic

Manually doubling grad scale inside loop is unusual; if intentional, add a brief comment referencing rationale.


1242-1262: Unreachable code after explicit RuntimeError in main()

main() raises; the rest of main() is unreachable. If this file is strictly a library, consider removing the code below the raise to avoid confusion.


1001-1006: Model averaging dtype

model_avg = copy.deepcopy(model).to(torch.float64) ups casts all params to float64. Confirm this matches usage in checkpointing and averaging utilities (most recipes keep float32). If unintended, use to(dtype=torch.float32).


818-845: Zeroing grads order

Consider optimizer.zero_grad(set_to_none=True) before backward to save memory, as in other recipes; current order is still valid.


1-44: Usage header references a different recipe

Usage points to pruned_transducer_stateless7_streaming/train.py. If this is intentional boilerplate, ignore; otherwise update to MLS English path to prevent confusion.

egs/mls_english/ASR/RESULTS.md (2)

7-12: Tighten the table formatting and naming consistency.

  • Use a caption or footnote to specify decoding method variants (e.g., “modified_beam_search”) for clarity.
  • Keep naming consistent: “MLS English” vs “MLS-English”.
-| Type          | Greedy | Beam search |
+| Type          | Greedy | Beam search (modified_beam_search) |

14-24: Improve reproducibility details.

Consider pinning exact commits/versions for icefall, k2, lhotse, and the BPE model checksum; optionally include the exact MLS-English subset/version.

If available, append a small “Environment” section with commit hashes and package versions.

egs/reazonspeech/ASR/prepare.sh (3)

57-60: Fix subset comment mismatch and quote paths.

Comment says "small-v1" but code downloads "medium"; also quote $dl_dir to avoid word-splitting.

-  if [ ! -d $dl_dir/ReazonSpeech/downloads ]; then
-    # Download small-v1 by default.
-    lhotse download reazonspeech --subset medium $dl_dir
+  if [ ! -d "$dl_dir/ReazonSpeech/downloads" ]; then
+    # Download "medium" subset by default.
+    lhotse download reazonspeech --subset medium "$dl_dir"

61-68: Quote musan path and dl_dir to avoid globbing/word-splitting.

Safer with spaces and special chars.

-  if [ ! -d $dl_dir/musan ]; then
-    lhotse download musan $dl_dir
+  if [ ! -d "$dl_dir/musan" ]; then
+    lhotse download musan "$dl_dir"

104-111: Run Musan FBANK script via python for portability.

Avoids relying on execute bits in git.

-    ./local/compute_fbank_musan.py
+    python local/compute_fbank_musan.py
egs/reazonspeech/ASR/local/compute_fbank_musan.py (1)

64-66: Guard against os.cpu_count() returning None.

-    num_jobs = min(15, os.cpu_count())
+    cpu_cnt = os.cpu_count() or 1
+    num_jobs = min(15, cpu_cnt)
egs/multi_ja_en/ASR/zipformer/multi_dataset.py (2)

3-3: Remove unused import.

lru_cache is not used after refactor.

-from functools import lru_cache

27-47: Optional: cache CutSet mux results to avoid repeated manifest scans.

If these methods are called multiple times, consider caching (e.g., cached_property) to avoid repeated lazy-load counts.

egs/mls_english/ASR/prepare.sh (3)

66-71: Typo and quoting fix in Musan guidance.

-  # If you ha`ve predownloaded it to /path/to/musan,
+  # If you have predownloaded it to /path/to/musan,
@@
-  if [ ! -d $dl_dir/musan ] ; then
+  if [ ! -d "$dl_dir/musan" ] ; then
@@
-	  lhotse download musan $dl_dir 
+	  lhotse download musan "$dl_dir"

103-107: Run Musan FBANK script via python for portability.

-		./local/compute_fbank_musan.py
+		python local/compute_fbank_musan.py

79-89: Stage 1 sequencing is fine; consider chaining validations with '&&' for clearer failure propagation.

Not required (set -e already covers it), but improves readability.

-		python local/compute_fbank_mls_english.py \
+		python local/compute_fbank_mls_english.py \
 			--manifest-dir data/manifests \
 			--audio-dir data/audio \
-			--dl-dir $dl_dir/mls_english
-					# --dl-dir /root/datasets/parler-tts--mls_eng
-					python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_train.jsonl.gz
-					python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_dev.jsonl.gz
-					python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_test.jsonl.gz
-					touch data/manifests/.mls_english-validated.done
+			--dl-dir "$dl_dir/mls_english" && \
+		python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_train.jsonl.gz && \
+		python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_dev.jsonl.gz && \
+		python local/validate_manifest.py --manifest data/manifests/mls_eng_cuts_test.jsonl.gz && \
+		touch data/manifests/.mls_english-validated.done
egs/multi_ja_en/ASR/local/utils/asr_datamodule.py (4)

25-25: Remove unused import (ruff F401).

torch is not used.

- import torch

21-21: Remove unused import.

lru_cache is not used in this file.

-from functools import lru_cache

205-213: Prefer lazy loading for MUSAN to reduce memory footprint.

Use load_manifest_lazy instead of load_manifest.

-            cuts_musan = load_manifest(
-                self.args.manifest_dir / "musan/musan_cuts.jsonl.gz"
-            )
+            cuts_musan = load_manifest_lazy(
+                self.args.manifest_dir / "musan/musan_cuts.jsonl.gz"
+            )

281-291: Sampler buffers are hard-coded; consider exposing as CLI or deriving from dataset size.

Current buffer_size and shuffle_buffer_size may be suboptimal across datasets.

         group.add_argument(
             "--num-buckets",
             type=int,
             default=30,
             help="The number of buckets for the DynamicBucketingSampler"
             "(you might want to increase it for larger datasets).",
         )
+        group.add_argument(
+            "--sampler-buffer-size",
+            type=int,
+            default=0,
+            help="Override DynamicBucketingSampler buffer_size; 0 uses num_buckets*2000.",
+        )
+        group.add_argument(
+            "--sampler-shuffle-buffer-size",
+            type=int,
+            default=0,
+            help="Override DynamicBucketingSampler shuffle_buffer_size; 0 uses num_buckets*5000.",
+        )
             train_sampler = DynamicBucketingSampler(
                 cuts_train,
                 max_duration=self.args.max_duration,
                 shuffle=self.args.shuffle,
                 num_buckets=self.args.num_buckets,
-                buffer_size=self.args.num_buckets * 2000,
-                shuffle_buffer_size=self.args.num_buckets * 5000,
+                buffer_size=(
+                    self.args.sampler_buffer_size
+                    if self.args.sampler_buffer_size > 0
+                    else self.args.num_buckets * 2000
+                ),
+                shuffle_buffer_size=(
+                    self.args.sampler_shuffle_buffer_size
+                    if self.args.sampler_shuffle_buffer_size > 0
+                    else self.args.num_buckets * 5000
+                ),
                 drop_last=self.args.drop_last,
             )
egs/multi_ja_en/ASR/local/utils/update_cutset_paths.py (1)

143-151: Safer overwrite of symlink target.

Use unlink(missing_ok=True) and ensure parent dir exists.

-                os.unlink(symlink_path)
+                symlink_path.parent.mkdir(parents=True, exist_ok=True)
+                try:
+                    os.unlink(symlink_path)
+                except FileNotFoundError:
+                    pass
                 updated_cuts.to_file(symlink_path)
egs/multi_ja_en/ASR/prepare.sh (1)

181-185: Ensure target dir exists and quote path when symlinking validator.

Also quote expansion (SC2046).

-      ln -svfr $(realpath ../../multi_zh_en/ASR/local/validate_bpe_lexicon.py) local/
+      mkdir -p local
+      ln -svfr "$(realpath ../../multi_zh_en/ASR/local/validate_bpe_lexicon.py)" local/
egs/mls_english/ASR/zipformer/decode.py (2)

138-145: Remove unused import (ruff F401).

make_pad_mask is unused.

 from icefall.utils import (
     AttributeDict,
-    make_pad_mask,
     setup_logger,
     store_transcripts,
     str2bool,
     write_error_stats,
 )

1030-1036: Use context manager to read context file (SIM115).

Also simplify list construction.

-        if os.path.exists(params.context_file):
-            contexts = []
-            for line in open(params.context_file).readlines():
-                contexts.append(line.strip())
+        if os.path.exists(params.context_file):
+            with open(params.context_file, "r", encoding="utf-8") as f:
+                contexts = [line.strip() for line in f]
             context_graph = ContextGraph(params.context_score)
             context_graph.build(sp.encode(contexts))
egs/mls_english/ASR/zipformer/tokenizer.py (2)

171-190: Avoid accessing private members of SymbolTable.

Prefer public API to build maps.

-        self._id2sym = token_table._id2sym
-        self._sym2id = token_table._sym2id
+        self._id2sym = {v: k for k, v in token_table.symbol_to_id.items()}
+        self._sym2id = dict(token_table.symbol_to_id)

232-253: Keep tests out of production module entrypoint.

Consider moving test_CharTokenizer() into a proper test file.

-if __name__ == "__main__":
-    test_CharTokenizer()
+if __name__ == "__main__":
+    # For ad-hoc debugging only
+    test_CharTokenizer()
egs/mls_english/ASR/zipformer/streaming_decode.py (5)

25-31: Remove unused imports.

os, pdb, and subprocess are unused; drop them to keep the module clean.

 import argparse
 import logging
 import math
-import os
-import pdb
-import subprocess as sp
 from pathlib import Path
 from typing import Dict, List, Optional, Tuple

19-23: Fix CLI usage string.

Missing space after script path.

-./zipformer/streaming_decode.py--epoch 28   --avg 15   --causal 1   --chunk-size 32   --left-context-frames 256   --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192
+./zipformer/streaming_decode.py --epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192

198-203: Avoid function call in default argument (B008).

Set the default to None and initialize inside the function.

-def get_init_states(
-    model: nn.Module,
-    batch_size: int = 1,
-    device: torch.device = torch.device("cpu"),
-) -> List[torch.Tensor]:
+def get_init_states(
+    model: nn.Module,
+    batch_size: int = 1,
+    device: Optional[torch.device] = None,
+) -> List[torch.Tensor]:
@@
-    states = model.encoder.get_init_states(batch_size, device)
+    if device is None:
+        device = torch.device("cpu")
+    states = model.encoder.get_init_states(batch_size, device)

186-191: Reduce default parallel streams to avoid excessive memory.

2000 concurrent streams can blow up memory; 64 is a safer default for most setups.

-    parser.add_argument(
+    parser.add_argument(
         "--num-decode-streams",
         type=int,
-        default=2000,
+        default=64,
         help="The number of streams that can be decoded parallel.",
     )

756-763: Tokenizer: prefer explicit lang_dir usage or remove unused --bpe-model.

You already add Tokenizer args; consider removing --bpe-model from this script, or use it consistently. Also verify params.lang/params.lang_type exist from Tokenizer.add_arguments (they do).

egs/multi_ja_en/ASR/zipformer/train.py (2)

1204-1211: Minor: yoda condition.

Prefer len(tokens) > T for readability (ruff SIM300).

-        if T < len(tokens):
+        if len(tokens) > T:

1222-1232: Pre-BPE Japanese char tokenization may conflict with BPE training.

You apply tokenize_by_ja_char then rely on SentencePiece for targets. This could distort English texts and duplicate segmentation work. Either:

  • apply this only for Japanese lines; or
  • drop it and normalize text elsewhere before SP encoding.

Example conditional application (pseudo-diff; adjust placement as needed):

-    def tokenize_and_encode_text(c: Cut):
+    def tokenize_and_encode_text(c: Cut):
         # Text normalize for each sample
         text = c.supervisions[0].text
-        text = byte_encode(tokenize_by_ja_char(text))
+        # Apply JA char tokenization only if text contains Japanese scripts
+        if any("\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff" for ch in text):
+            text = byte_encode(tokenize_by_ja_char(text))
         c.supervisions[0].text = text
         return c
egs/mls_english/ASR/zipformer/train.py (4)

1181-1217: Silence overly verbose duration logs.

Commented-out warnings suggest prior noise; keep logs at debug to avoid log spam during filtering.

-        if c.duration < 1.0 or c.duration > 30.0:
-            # logging.warning(
-            #     f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"
-            # )
+        if c.duration < 1.0 or c.duration > 30.0:
+            logging.debug(
+                f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"
+            )
             return False

862-869: Rename unused loop variable in validation.

batch_idx isn’t used; use _ to reflect intent (ruff B007).

-    for batch_idx, batch in enumerate(valid_dl):
+    for _, batch in enumerate(valid_dl):

702-705: Simplify nested if.

Combine nested if checks when setting start_epoch from checkpoints.

-    if params.start_batch > 0:
-        if "cur_epoch" in saved_params:
-            params["start_epoch"] = saved_params["cur_epoch"]
+    if params.start_batch > 0 and "cur_epoch" in saved_params:
+        params["start_epoch"] = saved_params["cur_epoch"]

323-336: Arg redundancy: --lang-dir vs Tokenizer.add_arguments.

You add a custom --lang-dir and also Tokenizer.add_arguments (which adds --lang, --lang-type). Consider using only the Tokenizer interface or harmonize both to avoid confusion.

Also applies to: 1382-1385

Comment on lines +91 to +112
model_file = Path(model_prefix + ".model")
if not model_file.is_file():
spm.SentencePieceTrainer.train(
input=train_text,
vocab_size=vocab_size,
model_type=model_type,
model_prefix=model_prefix,
input_sentence_size=input_sentence_size,
character_coverage=args.character_coverage,
user_defined_symbols=user_defined_symbols,
byte_fallback=args.byte_fallback,
unk_id=unk_id,
bos_id=-1,
eos_id=-1,
)
else:
print(f"{model_file} exists - skipping")
return

shutil.copyfile(model_file, f"{lang_dir}/bpe.model")


Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Always sync bpe_.model to bpe.model, even when training is skipped.

Currently, if the versioned model exists, you return early without copying, leaving bpe.model stale/missing.

-    if not model_file.is_file():
+    if not model_file.is_file():
         spm.SentencePieceTrainer.train(
             input=train_text,
@@
         )
-    else:
-        print(f"{model_file} exists - skipping")
-        return
-
-    shutil.copyfile(model_file, f"{lang_dir}/bpe.model")
+    else:
+        print(f"{model_file} exists - skipping training")
+    # Ensure canonical symlink/copy is updated
+    shutil.copyfile(model_file, f"{lang_dir}/bpe.model")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
model_file = Path(model_prefix + ".model")
if not model_file.is_file():
spm.SentencePieceTrainer.train(
input=train_text,
vocab_size=vocab_size,
model_type=model_type,
model_prefix=model_prefix,
input_sentence_size=input_sentence_size,
character_coverage=args.character_coverage,
user_defined_symbols=user_defined_symbols,
byte_fallback=args.byte_fallback,
unk_id=unk_id,
bos_id=-1,
eos_id=-1,
)
else:
print(f"{model_file} exists - skipping")
return
shutil.copyfile(model_file, f"{lang_dir}/bpe.model")
model_file = Path(model_prefix + ".model")
if not model_file.is_file():
spm.SentencePieceTrainer.train(
input=train_text,
vocab_size=vocab_size,
model_type=model_type,
model_prefix=model_prefix,
input_sentence_size=input_sentence_size,
character_coverage=args.character_coverage,
user_defined_symbols=user_defined_symbols,
byte_fallback=args.byte_fallback,
unk_id=unk_id,
bos_id=-1,
eos_id=-1,
)
else:
print(f"{model_file} exists - skipping training")
# Ensure canonical symlink/copy is updated
shutil.copyfile(model_file, f"{lang_dir}/bpe.model")
🤖 Prompt for AI Agents
In egs/mls_english/ASR/local/train_bpe_model.py around lines 91 to 112, the
function returns early when the versioned model file exists and therefore never
syncs model_prefix+".model" to "{lang_dir}/bpe.model"; change the control flow
so the shutil.copyfile(model_file, f"{lang_dir}/bpe.model") is executed
regardless of whether training ran or was skipped (e.g., remove the early return
and only skip training but still perform the copy, or move the copy call out of
the if/else block after the training check), ensuring model_file points to the
correct Path before copying and handling/reporting any copy errors.

Comment on lines 196 to 204
transforms = []
if cuts_musan is not None:
logging.info("Enable MUSAN")
transforms.append(
CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True)
)
else:
logging.info("Disable MUSAN")

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Honor the CLI flag when enabling MUSAN and warn if requested but unavailable.

Currently MUSAN is applied whenever cuts_musan is passed, ignoring --enable-musan. Gate on both for predictable behavior, and warn if enabled but missing data.

Apply this diff:

-        if cuts_musan is not None:
-            logging.info("Enable MUSAN")
-            transforms.append(
-                CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True)
-            )
-        else:
-            logging.info("Disable MUSAN")
+        if self.args.enable_musan:
+            if cuts_musan is not None:
+                logging.info("Enable MUSAN")
+                transforms.append(
+                    CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True)
+                )
+            else:
+                logging.warning("MUSAN requested but no cuts provided; disabling MUSAN.")
+        else:
+            logging.info("Disable MUSAN")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
transforms = []
if cuts_musan is not None:
logging.info("Enable MUSAN")
transforms.append(
CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True)
)
else:
logging.info("Disable MUSAN")
transforms = []
if self.args.enable_musan:
if cuts_musan is not None:
logging.info("Enable MUSAN")
transforms.append(
CutMix(cuts=cuts_musan, p=0.5, snr=(10, 20), preserve_id=True)
)
else:
logging.warning("MUSAN requested but no cuts provided; disabling MUSAN.")
else:
logging.info("Disable MUSAN")
🤖 Prompt for AI Agents
In egs/reazonspeech/ASR/local/utils/asr_datamodule.py around lines 196 to 204,
the code currently enables MUSAN whenever cuts_musan is provided regardless of
the --enable-musan CLI flag; change the logic to only append the CutMix
transform when BOTH cuts_musan is not None and enable_musan is truthy, log.info
when MUSAN is enabled, log.info when MUSAN is explicitly disabled via the flag,
and emit a logging.warning if enable_musan is True but cuts_musan is None to
warn the user that MUSAN was requested but data is unavailable.

@@ -0,0 +1 @@
Subproject commit 157dc62a6439f03de1fcccb4914fc0d30b6f21b8
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Submodule pointer update: verify gitlink, .gitmodules entry, CI/docs, and license.

  • Ensure this SHA is reachable and intentional (ideally tagged for traceability).
  • Confirm .gitmodules contains the correct URL/path for this submodule.
  • Make sure CI and READMEs instruct users to init/update submodules; otherwise Musan-dependent stages will fail.
  • Verify the submodule repo’s license is compatible and referenced in the top-level NOTICE/README.

Run:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

path="egs/reazonspeech/ASR/musan-k2-v2-reazonspeech-medium"
sha_expected="157dc62a6439f03de1fcccb4914fc0d30b6f21b8"

echo "== gitlink at HEAD =="
git ls-tree HEAD "$path" || true

echo "== verify SHA =="
if git ls-tree HEAD "$path" | awk '{print $3}' | grep -q "$sha_expected"; then
  echo "OK: gitlink matches expected SHA"
else
  echo "Mismatch: gitlink SHA does not match $sha_expected"; exit 1
fi

echo "== .gitmodules entries =="
if [ -f .gitmodules ]; then
  git config -f .gitmodules --get-regexp 'submodule\..*' || true
  echo "Filtered for path:"
  rg -nP "(?s)\\[submodule\\s+\".*\"\\].*?\\n\\s*path\\s*=\\s*$path\\b" .gitmodules -n -C2 || true
else
  echo "No .gitmodules found"; exit 1
fi

echo "== READMEs mention submodule init? =="
rg -n -S 'git submodule (update|init)' --glob 'egs/**/ASR/**/README*' || true

echo "== prepare scripts mention submodule? =="
rg -n -S 'git submodule (update|init)' --glob 'egs/**/ASR/**/prepare.sh' || true

echo "== license file in submodule (if initialized) =="
if [ -d "$path/.git" ] || [ -f "$path/.git" ]; then
  fd -HI -a -t f -d 2 '(LICENSE|COPYING|NOTICE)(\..*)?$' "$path" || true
else
  echo "Submodule not initialized; skip license check"
fi

Length of output: 774


Add missing .gitmodules entry and update submodule docs

  • The .gitmodules file is currently absent; add a section for
    egs/reazonspeech/ASR/musan-k2-v2-reazonspeech-medium pointing to the correct repo URL at commit 157dc62a6439f03de1fcccb4914fc0d30b6f21b8
  • Update CI configs and any egs/reazonspeech/ASR/README* or prepare.sh to include git submodule init --recursive and git submodule update --init --recursive
  • After initializing the submodule, verify its LICENSE (or NOTICE) is compatible and reference it in the top-level NOTICE/README
🤖 Prompt for AI Agents
In egs/reazonspeech/ASR/musan-k2-v2-reazonspeech-medium around lines 1-1, add a
.gitmodules entry mapping this path to the repository URL at commit
157dc62a6439f03de1fcccb4914fc0d30b6f21b8; then update CI configs and any
egs/reazonspeech/ASR/README* and prepare.sh to run git submodule init
--recursive and git submodule update --init --recursive (or equivalent pipeline
steps) so the submodule is cloned and populated during CI and local setup;
finally, after initializing the submodule locally in your branch, inspect its
LICENSE/NOTICE for compatibility and add a reference to that license in the
top-level NOTICE/README as required.

Comment on lines 1223 to 1235
if args.enable_musan:
musan_path = Path(args.manifest_dir) / "musan_cuts.jsonl.gz"
if musan_path.exists():
cuts_musan = load_manifest(musan_path)
logging.info(f"Loaded MUSAN manifest from {musan_path}")
else:
logging.warning(
f"MUSAN manifest not found at {musan_path}, disabling MUSAN augmentation"
)
cuts_musan = None
else:
cuts_musan = None

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Verify CLI flag is plumbed: --enable_musan

Ensure ReazonSpeechAsrDataModule.add_arguments (or parser) defines --enable_musan with a sensible default and that prepare.sh documents generating musan_cuts.jsonl.gz.


🏁 Script executed:

#!/bin/bash
# Verify the flag and where it's consumed.
rg -nC2 -g 'egs/reazonspeech/**' -P '\benable_musan\b'

# Verify MUSAN manifest generation is documented or scripted.
rg -nC2 -g 'egs/reazonspeech/**' -P 'musan_cuts\.jsonl\.gz|MUSAN'

Length of output: 3111


#!/bin/bash
# Locate prepare.sh in the egs/reazonspeech directory
find egs/reazonspeech -type f -name prepare.sh -print

#!/bin/bash
# List top-level files in egs/reazonspeech to find any prepare or doc scripts
find egs/reazonspeech -maxdepth 2 -type f

# Search for any README or documentation files under egs/reazonspeech
find egs/reazonspeech -type f -iname "*readme*" -o -iname "*.md"

🏁 Script executed:

#!/bin/bash
# List all shell scripts in the egs/reazonspeech directory
find egs/reazonspeech -type f -name "*.sh"

Length of output: 76


🏁 Script executed:

#!/bin/bash
sed -n '1,200p' egs/reazonspeech/ASR/prepare.sh

Length of output: 3949


🏁 Script executed:

#!/bin/bash
# Display beginning of compute_fbank_musan.py to confirm manifest creation
sed -n '1,200p' egs/reazonspeech/ASR/local/compute_fbank_musan.py

Length of output: 4444


Document musan_cuts.jsonl.gz generation in prepare.sh

  • --enable_musan is already defined in ReazonSpeechAsrDataModule (default False).
  • In egs/reazonspeech/ASR/prepare.sh, Stage 4 invokes compute_fbank_musan.py but doesn’t explicitly mention the output. Add a comment like
    # This generates data/manifests/musan_cuts.jsonl.gz
    before running ./local/compute_fbank_musan.py.
🤖 Prompt for AI Agents
In egs/reazonspeech/ASR/zipformer/train.py around lines 1223 to 1235, the code
expects a musan_cuts.jsonl.gz manifest when --enable_musan is used but the
project's prepare script doesn't document that file's generation; update
egs/reazonspeech/ASR/prepare.sh (Stage 4) to add an inline comment immediately
before invoking ./local/compute_fbank_musan.py stating that this step generates
data/manifests/musan_cuts.jsonl.gz (e.g. "# This generates
data/manifests/musan_cuts.jsonl.gz"), so users running preparation know where
the musan manifest comes from and the train.py check will succeed when
--enable_musan is set.

Comment on lines 1421 to 1431
max_len = max(max_len, lengths.max())
n = lengths.size(0)
seq_range = torch.arange(0, max_len, device=lengths.device)
expaned_lengths = seq_range.unsqueeze(0).expand(n, max_len)
expanded_lengths = seq_range.unsqueeze(0).expand(n, max_len)

if pad_left:
mask = expanded_lengths < (max_len - lengths).unsqueeze(1)
else:
mask = expanded_lengths >= lengths.unsqueeze(-1)

return expaned_lengths >= lengths.unsqueeze(-1)
return mask
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix potential int/Tensor mixing in max_len computation

max(max_len, lengths.max()) can raise a TypeError (int vs Tensor). Cast to Python ints to make it torch/version-agnostic. Also keep return path unchanged.

-    max_len = max(max_len, lengths.max())
+    max_len = max(int(max_len), int(lengths.max().item()))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
max_len = max(max_len, lengths.max())
n = lengths.size(0)
seq_range = torch.arange(0, max_len, device=lengths.device)
expaned_lengths = seq_range.unsqueeze(0).expand(n, max_len)
expanded_lengths = seq_range.unsqueeze(0).expand(n, max_len)
if pad_left:
mask = expanded_lengths < (max_len - lengths).unsqueeze(1)
else:
mask = expanded_lengths >= lengths.unsqueeze(-1)
return expaned_lengths >= lengths.unsqueeze(-1)
return mask
# ensure both sides are Python ints to avoid int/Tensor comparison issues
max_len = max(int(max_len), int(lengths.max().item()))
n = lengths.size(0)
seq_range = torch.arange(0, max_len, device=lengths.device)
expanded_lengths = seq_range.unsqueeze(0).expand(n, max_len)
if pad_left:
mask = expanded_lengths < (max_len - lengths).unsqueeze(1)
else:
mask = expanded_lengths >= lengths.unsqueeze(-1)
return mask
🤖 Prompt for AI Agents
In icefall/utils.py around lines 1421 to 1431, the computation max_len =
max(max_len, lengths.max()) mixes a Python int and a torch Tensor which can
raise a TypeError; fix it by converting the tensor result to a Python int (e.g.,
lengths.max().item() or int(lengths.max().item())) before calling max so max_len
stays a plain int, then proceed with seq_range/expanded_lengths and return mask
unchanged.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review continued from previous batch...

Comment on lines +37 to +38
from asr_datamodule import ReazonSpeechAsrDataModule
from decode_stream import DecodeStream
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Use MLS English datamodule instead of ReazonSpeech here.

This script is under MLS English recipe but wires ReazonSpeech datamodule. Replace with MLSEnglishHFAsrDataModule throughout.

-from asr_datamodule import ReazonSpeechAsrDataModule
+from asr_datamodule import MLSEnglishHFAsrDataModule
@@
 def main():
     parser = get_parser()
-    ReazonSpeechAsrDataModule.add_arguments(parser)
+    MLSEnglishHFAsrDataModule.add_arguments(parser)
     Tokenizer.add_arguments(parser)
@@
-    reazonspeech_corpus = ReazonSpeechAsrDataModule(args)
-
-    valid_cuts = reazonspeech_corpus.valid_cuts()
-    test_cuts = reazonspeech_corpus.test_cuts()
+    mls_english_corpus = MLSEnglishHFAsrDataModule(args)
+
+    valid_cuts = mls_english_corpus.valid_cuts()
+    test_cuts = mls_english_corpus.test_cuts()

Also applies to: 712-717, 856-865

🤖 Prompt for AI Agents
In egs/mls_english/ASR/zipformer/streaming_decode.py around lines 37-38 (and
also update occurrences at 712-717 and 856-865), the file imports and uses
ReazonSpeechAsrDataModule but should use the MLS English datamodule; replace the
import from "from asr_datamodule import ReazonSpeechAsrDataModule" with the
correct MLS English import "from ml_eng_hf_datamodule import
MLSEnglishHFAsrDataModule" (or the actual module path used in this repo), then
update all instantiations, type hints, and references from
ReazonSpeechAsrDataModule to MLSEnglishHFAsrDataModule across the noted line
ranges so the script consistently constructs and uses the
MLSEnglishHFAsrDataModule instance. Ensure any configuration keys or constructor
arguments match the MLSEnglishHFAsrDataModule API and adjust variable names if
needed.

Comment on lines +515 to +522
finished_streams = []
for i in range(len(decode_streams)):
decode_streams[i].states = states[i]
decode_streams[i].done_frames += encoder_out_lens[i]
# if decode_streams[i].done:
# finished_streams.append(i)
finished_streams.append(i)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

All streams are marked finished; plus int/ Tensor add bug.

Only finalize streams that are truly done. Also convert encoder_out_lens[i] to Python int when accumulating done_frames.

     finished_streams = []
     for i in range(len(decode_streams)):
         decode_streams[i].states = states[i]
-        decode_streams[i].done_frames += encoder_out_lens[i]
-        # if decode_streams[i].done:
-        # finished_streams.append(i)
-        finished_streams.append(i)
+        decode_streams[i].done_frames += int(encoder_out_lens[i].item())
+        if decode_streams[i].done:
+            finished_streams.append(i)
🤖 Prompt for AI Agents
In egs/mls_english/ASR/zipformer/streaming_decode.py around lines 515 to 522,
fix two issues: only mark streams as finished when they are actually done, and
ensure encoder_out_lens[i] is converted to a Python int before adding to
done_frames. Replace the unconditional finished_streams.append(i) with a
conditional that appends i only if decode_streams[i].done is truthy, and change
the done_frames increment to use an integer conversion like done_frames +=
int(encoder_out_lens[i]) (or encoder_out_lens[i].item() if it's a torch tensor).

Comment on lines +695 to +703
test_set_wers = sorted(test_set_wers.items(), key=lambda x: x[1])
errs_info = (
params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt"
)
with open(errs_info, "w") as f:
print("settings\tWER", file=f)
for key, val in test_set_wers:
print("{}\t{}".format(key, val), file=f)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Summary WER filename uses an out-of-scope/ambiguous key.

Don’t include key (last-loop variable) in the summary filename.

-    errs_info = (
-        params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt"
-    )
+    errs_info = params.res_dir / f"wer-summary-{test_set_name}-{params.suffix}.txt"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
test_set_wers = sorted(test_set_wers.items(), key=lambda x: x[1])
errs_info = (
params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt"
)
with open(errs_info, "w") as f:
print("settings\tWER", file=f)
for key, val in test_set_wers:
print("{}\t{}".format(key, val), file=f)
test_set_wers = sorted(test_set_wers.items(), key=lambda x: x[1])
- errs_info = (
- params.res_dir / f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt"
errs_info = params.res_dir / f"wer-summary-{test_set_name}-{params.suffix}.txt"
with open(errs_info, "w") as f:
print("settings\tWER", file=f)
for key, val in test_set_wers:
print("{}\t{}".format(key, val), file=f)
🤖 Prompt for AI Agents
In egs/mls_english/ASR/zipformer/streaming_decode.py around lines 695 to 703,
the summary WER filename is constructed using the loop variable `key` which is
shadowed by the later for-loop and therefore ambiguous/out-of-scope; change the
sorting and loop to use distinct variable names (e.g., rename sorted items to
`sorted_wers` or use `k, v` in the loop) and use the correct intended identifier
for the filename (for example `method_key`, `summary_key`, or the outer scoped
name such as `test_set_name` or `params.suffix`) so the filename no longer
depends on the loop variable; update the open and print loop to iterate over the
renamed variables accordingly.

Comment on lines +1218 to +1220
mls_english_corpus = MLSEnglishHFAsrDataModule(args)
train_cuts = mls_english_corpus.train_cuts()
# mls_english_corpus.load_dataset(args.dataset_path)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Train-time filtering is defined but not used.

Apply remove_short_and_long_utt to train_cuts.

     mls_english_corpus = MLSEnglishHFAsrDataModule(args)
     train_cuts = mls_english_corpus.train_cuts()
+    train_cuts = train_cuts.filter(remove_short_and_long_utt)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
mls_english_corpus = MLSEnglishHFAsrDataModule(args)
train_cuts = mls_english_corpus.train_cuts()
# mls_english_corpus.load_dataset(args.dataset_path)
mls_english_corpus = MLSEnglishHFAsrDataModule(args)
train_cuts = mls_english_corpus.train_cuts()
train_cuts = train_cuts.filter(remove_short_and_long_utt)
# mls_english_corpus.load_dataset(args.dataset_path)
🤖 Prompt for AI Agents
egs/mls_english/ASR/zipformer/train.py around lines 1218-1220: train-time
filtering function remove_short_and_long_utt is defined but never applied to the
training cuts; call the filter on train_cuts and reassign (e.g., train_cuts =
remove_short_and_long_utt(train_cuts) or, if the function accepts duration
bounds, pass the configured values from args like train_cuts =
remove_short_and_long_utt(train_cuts, args.min_duration, args.max_duration)) so
the training set respects the min/max utterance length.

Comment on lines +1229 to +1239
if args.enable_musan:
musan_path = Path(args.manifest_dir) / "musan_cuts.jsonl.gz"
if musan_path.exists():
cuts_musan = load_manifest(musan_path)
logging.info(f"Loaded MUSAN manifest from {musan_path}")
else:
logging.warning(f"MUSAN manifest not found at {musan_path}, disabling MUSAN augmentation")
cuts_musan = None
else:
cuts_musan = None

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

MUSAN manifest is loaded but never passed to dataloader.

Wire cuts_musan into train_dataloaders so augmentation actually happens.

-    train_dl = mls_english_corpus.train_dataloaders(
-            train_cuts, sampler_state_dict=sampler_state_dict
-    )
+    train_dl = mls_english_corpus.train_dataloaders(
+        train_cuts, sampler_state_dict=sampler_state_dict, cuts_musan=cuts_musan
+    )

Also applies to: 1240-1245

🧰 Tools
🪛 Ruff (0.12.2)

1238-1238: Local variable cuts_musan is assigned to but never used

Remove assignment to unused variable cuts_musan

(F841)

🤖 Prompt for AI Agents
In egs/mls_english/ASR/zipformer/train.py around lines 1229-1239 (and similarly
1240-1245), cuts_musan is loaded but never forwarded into the dataloader
creation; pass cuts_musan into the train_dataloaders call (e.g., as a keyword
argument cuts_musan=cuts_musan) where train/validation dataloaders are
constructed, and if train_dataloaders (or its upstream helpers) don't accept
this parameter, update their signature to accept and forward cuts_musan into the
augmentation/data pipeline so the MUSAN augmentation is actually applied.

Comment on lines 1187 to 1200
def remove_short_and_long_utt(c: Cut):
# Keep only utterances with duration between 1 second and 20 seconds
#
# Caution: There is a reason to select 20.0 here. Please see
# ../local/display_manifest_statistics.py

# Keep only utterances greater than 1 second
#
# You should use ../local/display_manifest_statistics.py to get
# an utterance duration distribution for your dataset to select
# the threshold
# if c.duration < 1.0 or c.duration > 30.0:
# logging.warning(
# f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"
# )
# return False
# the threshold as this is dependent on which datasets you choose
if c.duration < 1.0:
logging.warning(
f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"
)
return False

# In pruned RNN-T, we require that T >= S
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Reintroduce an upper duration cap to prevent OOM/outliers.

Earlier recipes capped at 30s; current filter only enforces ≥1s.

-        # Keep only utterances greater than 1 second
+        # Keep only utterances within a safe duration range
@@
-        if c.duration < 1.0:
+        if c.duration < 1.0 or c.duration > 30.0:
             logging.warning(
                 f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"
             )
             return False
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def remove_short_and_long_utt(c: Cut):
# Keep only utterances with duration between 1 second and 20 seconds
#
# Caution: There is a reason to select 20.0 here. Please see
# ../local/display_manifest_statistics.py
# Keep only utterances greater than 1 second
#
# You should use ../local/display_manifest_statistics.py to get
# an utterance duration distribution for your dataset to select
# the threshold
# if c.duration < 1.0 or c.duration > 30.0:
# logging.warning(
# f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"
# )
# return False
# the threshold as this is dependent on which datasets you choose
if c.duration < 1.0:
logging.warning(
f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"
)
return False
# In pruned RNN-T, we require that T >= S
def remove_short_and_long_utt(c: Cut):
# Keep only utterances within a safe duration range
#
# You should use ../local/display_manifest_statistics.py to get
# an utterance duration distribution for your dataset to select
# the threshold as this is dependent on which datasets you choose
if c.duration < 1.0 or c.duration > 30.0:
logging.warning(
f"Exclude cut with ID {c.id} from training. Duration: {c.duration}"
)
return False
# In pruned RNN-T, we require that T >= S
🤖 Prompt for AI Agents
In egs/multi_ja_en/ASR/zipformer/train.py around lines 1187 to 1200, the cut
filter enforces a minimum duration but lacks an upper duration cap which can
allow outlier long utterances to cause OOM; add a configurable maximum duration
check (e.g., MAX_UTT_DURATION = 30.0) and return False for cuts with c.duration
> MAX_UTT_DURATION while logging a warning with the cut id and duration; place
the constant near other config values or at module top and reference it in the
if-statement so the threshold is easy to adjust based on dataset statistics.

@pingfengluo
Copy link
Contributor

This PR adds a new recipe, mls_english, and overhauls an existing recipe, multi_ja_en.

In this prior PR, we added the multi_ja_en recipe for bilingual Japanese / English models. However, we observed that models performed significantly worse at English speech compared to Japanese speech. After determining the cause to be the large imbalance between the number of hours of English data compared to Japanese data, we decided to find a larger English dataset and use it instead of LibriSpeech.

Ultimately, we decided to use the English portion of Multilingual LibriSpeech (MLS English). However, this dataset did not yet have a recipe in icefall. So, we created a new recipe for MLS English, and then updated the multi_ja_en to rely on both reazonspeech recipe and mls_english recipe.

This PR includes both full recipes, so I will close my prior PR which implements the mls_english recipe alone.

Please see multi_ja_en/ASR/README.md and mls_english/ASR/README.md for more details about each recipe.

Please let me and @baileyeet know if there are any comments or concerns.

Summary by CodeRabbit

  • New Features

    • Added MLS English ASR recipe with end-to-end prepare, training, decoding, streaming, and tokenizer utilities.
    • Introduced tools for dataset download, transcript generation, BBPE training, subset creation, and manifest path updates.
    • Enabled MUSAN augmentation and feature computation across recipes.
    • Added left-padding support to pad mask utilities.
  • Refactor

    • Unified multi-dataset pipeline, replacing prior single-dataset data modules and updating default manifest/layout paths.
    • Decoders accept extra keyword arguments for broader compatibility.
  • Documentation

    • New MLS English README and RESULTS with commands and WER.
    • Updated bilingual (multi_ja_en) docs, usage, results, and preparation steps.

Not sure if adding a language tag prompt to the decoder/predictor can alleviate the issue of imbalanced multilingual data; it would be somewhat similar to the prompting approach Whisper uses for multi-language support.

@kinanmartin
Copy link
Author

Not sure if adding a language tag prompt to the decoder/predictor can alleviate the issue of imbalanced multilingual data; it would be somewhat similar to the prompting approach Whisper uses for multi-language support.

@pingfengluo I see, this could work well to increase the accuracy of prediction for one-language clips! But at the same time, I worry it could make it harder for the model to accurately code-switch when predicting bilingual audio without further explicit fine-tuning on labeled code-switching data. If you think the benefit would be large enough though, perhaps we can modify the code to add it.

@csukuangfj
Copy link
Collaborator

@JinZr Can you have a look at this PR?

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
egs/mls_english/ASR/local/compute_fbank_mls_english.py (1)

119-119: Guard against os.cpu_count() returning None.

-    num_jobs = min(16, os.cpu_count())
+    num_jobs = min(16, os.cpu_count() or 1)
♻️ Duplicate comments (3)
egs/mls_english/ASR/local/utils/download_mls_english.py (2)

8-13: Parameterize HF repo and pin a revision for reproducibility (and align with future mirrors).

Hard-coding the repo makes runs brittle. Expose repo_id/revision/token with sensible defaults.

@@
-import argparse
-import os
-import sys
+import argparse
+import os
+import sys
+from typing import Optional
@@
-def download_dataset(dl_dir):
+def download_dataset(
+    dl_dir: str,
+    repo_id: str = "parler-tts/mls_eng",
+    revision: Optional[str] = None,
+    token: Optional[str] = None,
+) -> None:
@@
-    repo_id = "parler-tts/mls_eng"
     local_dataset_dir = os.path.join(dl_dir, "mls_english")
@@
-        snapshot_download(
+        snapshot_download(
             repo_id=repo_id,
             repo_type="dataset",
             local_dir=local_dataset_dir,
             local_dir_use_symlinks=False,
+            revision=revision,
+            token=token,
         )

31-33: Do not exit from library code; let exceptions propagate and handle in CLI.

Let callers handle failures; wrap the CLI call and exit non-zero there. Add flags to pass repo/revision; read token from HF_TOKEN.

@@
-    except Exception as e:
-        print(f"Error downloading dataset '{repo_id}': {e}", file=sys.stderr)
-        sys.exit(1)
+    except Exception as e:
+        print(f"Error downloading dataset '{repo_id}': {e}", file=sys.stderr)
+        raise
@@
-    parser = argparse.ArgumentParser(
+    parser = argparse.ArgumentParser(
         description="Download MLS English dataset from Hugging Face."
     )
@@
     parser.add_argument(
         "--dl-dir",
         type=str,
         required=True,
         help="The base directory where the 'mls_english' dataset will be downloaded.",
     )
+    parser.add_argument("--repo-id", type=str, default="parler-tts/mls_eng",
+                        help="HF dataset repo-id or local dataset path.")
+    parser.add_argument("--revision", type=str, default=None,
+                        help="HF revision/commit/tag for reproducible snapshots.")
     args = parser.parse_args()
 
-    download_dataset(args.dl_dir)
+    try:
+        download_dataset(
+            args.dl_dir,
+            repo_id=args.repo_id,
+            revision=args.revision,
+            token=os.getenv("HF_TOKEN"),
+        )
+    except Exception:
+        sys.exit(1)

Also applies to: 36-48

egs/mls_english/ASR/local/utils/create_subsets_greedy.py (1)

107-117: Harden duration parsing (handles None/NaN/strings/negatives).

Current code float(example[col]) will throw or propagate NaN. Use a safe parser. Also add math import.

+import math
@@
     def get_duration_from_column(example):
         """Helper to safely get duration from the specified column, in milliseconds."""
-        if duration_column_name in example:
-            return float(example[duration_column_name]) * 1000
-        else:
-            print(
-                f"Warning: Duration column '{duration_column_name}' not found in example. Returning 0.",
-                file=sys.stderr,
-            )
-            return 0
+        try:
+            if duration_column_name in example:
+                val = example[duration_column_name]
+                if isinstance(val, (int, float)) and math.isfinite(val) and val > 0:
+                    return float(val) * 1000
+                if isinstance(val, str):
+                    val = val.strip()
+                    if val:
+                        v = float(val)
+                        if math.isfinite(v) and v > 0:
+                            return v * 1000
+        except (TypeError, ValueError):
+            pass
+        return 0
🧹 Nitpick comments (10)
egs/mls_english/ASR/local/compute_fbank_mls_english.py (2)

63-65: Use logging instead of print for consistency.

-    print(f"{mls_eng_hf_dataset_path=}")
+    logging.info("mls_eng_hf_dataset_path=%s", mls_eng_hf_dataset_path)

57-59: Shorten exception message per Ruff TRY003 (optional).

-        raise ImportError(
-            "To process the MLS English HF corpus, please install optional dependency: pip install datasets"
-        )
+        raise ImportError("Missing optional dependency 'datasets'. Run: pip install datasets")
egs/mls_english/ASR/local/utils/create_subsets_greedy.py (8)

51-52: Generalize split filename regex to support any shard counts.

Your pattern only matches exactly 5-digit shard counters. Many datasets use variable digits. Relax it to \d+ for robustness.

-    split_pattern = re.compile(r"^(train|dev|test|validation)-\d{5}-of-\d{5}\.parquet$")
+    split_pattern = re.compile(r"^(train|dev|test|validation)-\d+-of-\d+\.parquet$")

76-84: Narrow the exception type when loading datasets.

Catching bare Exception hides actionable errors and triggers lints. Limit to expected types.

-    try:
-        full_dataset = load_dataset("parquet", data_files=data_files)
-    except Exception as e:
+    try:
+        full_dataset = load_dataset("parquet", data_files=data_files)
+    except (OSError, ValueError, RuntimeError) as e:
         print(
             f"Error loading dataset from '{full_data_dir}' with load_dataset: {e}",
             file=sys.stderr,
         )
         sys.exit(1)

217-234: Compute “other splits” duration without row iteration.

sum(get_duration_from_column(ex) for ex in other_split) materializes rows and may decode audio. Sum the prefetched column instead.

-            other_split = full_dataset[split_name]
-            subset_dataset[split_name] = other_split
-            other_duration_ms = sum(get_duration_from_column(ex) for ex in other_split)
+            other_split = full_dataset[split_name]
+            subset_dataset[split_name] = other_split
+            if duration_column_name in other_split.column_names:
+                vals = other_split[duration_column_name]
+                other_duration_ms = sum(
+                    (float(v) * 1000)
+                    for v in vals
+                    if v is not None
+                    and isinstance(v, (int, float, str))
+                    and str(v).strip() != ""
+                    and math.isfinite(float(v))
+                    and float(v) > 0
+                )
+            else:
+                other_duration_ms = 0

244-256: Narrow exception on save or explicitly ignore the lint.

Same BLE001 as above. Either narrow to expected IO/Arrow errors or add a # noqa: BLE001 if broad catch is intentional.

-    except Exception as e:
+    except (OSError, RuntimeError, ValueError) as e:
         print(
             f"Error saving subset dataset to '{output_subset_dir}': {e}",
             file=sys.stderr,
         )
         sys.exit(1)

22-28: Preserve fractional hours in output directory name.

Casting to int drops information (e.g., 0.5h → 0h). Use a compact formatter.

-    output_subset_dir = os.path.join(
-        output_base_dir,
-        f"mls_english_subset_train{int(target_train_hours)}h_dev{int(target_dev_hours)}h_test{int(target_test_hours)}h",
-    )
+    def _fmt_hours(h):  # avoids trailing .0 while preserving fractions
+        return f"{h:g}"
+    output_subset_dir = os.path.join(
+        output_base_dir,
+        f"mls_english_subset_train{_fmt_hours(target_train_hours)}h_dev{_fmt_hours(target_dev_hours)}h_test{_fmt_hours(target_test_hours)}h",
+    )

328-341: Remove f-prefix from literal strings (Ruff F541).

These have no placeholders; drop the f to satisfy Ruff.

-print(f"\nTo use your new subset dataset, you can load it like this:")
-print(f"from datasets import load_dataset")
-print(f"import os, glob")
-print(f"data_files = {}")
+print("\nTo use your new subset dataset, you can load it like this:")
+print("from datasets import load_dataset")
+print("import os, glob")
+print("data_files = {}")
@@
-    print(f"    files = glob.glob(split_path)")
-    print(f"    if files: data_files[split_name] = files")
-    print(f"subset = load_dataset('parquet', data_files=data_files)")
-    print(f"print(subset)")
+    print("    files = glob.glob(split_path)")
+    print("    if files: data_files[split_name] = files")
+    print("subset = load_dataset('parquet', data_files=data_files)")
+    print("print(subset)")

20-29: Validate non-negative target hours early.

Guard against accidental negative inputs; current logic would silently produce empty subsets.

     random.seed(random_seed)
+    for _name, _hours in {
+        "train": target_train_hours,
+        "dev": target_dev_hours,
+        "test": target_test_hours,
+    }.items():
+        if _hours < 0:
+            print(f"Error: target hours for '{_name}' must be >= 0 (got {_hours}).", file=sys.stderr)
+            sys.exit(1)

Also applies to: 281-297


112-116: Reduce noisy per-example warnings.

If a duration column is missing, this will spam logs per row. Consider counting and emitting a single summary warning.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 36fc1f1 and a4c1db5.

📒 Files selected for processing (3)
  • egs/mls_english/ASR/local/compute_fbank_mls_english.py (4 hunks)
  • egs/mls_english/ASR/local/utils/create_subsets_greedy.py (1 hunks)
  • egs/mls_english/ASR/local/utils/download_mls_english.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
egs/mls_english/ASR/local/compute_fbank_mls_english.py (1)
icefall/utils.py (1)
  • is_module_available (1975-1984)
🪛 Ruff (0.12.2)
egs/mls_english/ASR/local/utils/create_subsets_greedy.py

78-78: Do not catch blind exception: Exception

(BLE001)


251-251: Do not catch blind exception: Exception

(BLE001)


328-328: f-string without any placeholders

Remove extraneous f prefix

(F541)


329-329: f-string without any placeholders

Remove extraneous f prefix

(F541)


330-330: f-string without any placeholders

Remove extraneous f prefix

(F541)


331-331: f-string without any placeholders

Remove extraneous f prefix

(F541)


333-333: f-string without any placeholders

Remove extraneous f prefix

(F541)


338-338: f-string without any placeholders

Remove extraneous f prefix

(F541)


339-339: f-string without any placeholders

Remove extraneous f prefix

(F541)


340-340: f-string without any placeholders

Remove extraneous f prefix

(F541)


341-341: f-string without any placeholders

Remove extraneous f prefix

(F541)

egs/mls_english/ASR/local/compute_fbank_mls_english.py

57-59: Avoid specifying long messages outside the exception class

(TRY003)

egs/mls_english/ASR/local/utils/download_mls_english.py

31-31: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (48)
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630

Comment on lines +110 to 112
parser.add_argument("-a", "--audio-dir", type=Path)
parser.add_argument("-d", "--dl-dir", type=Path)
return parser.parse_args()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix --dl-dir semantics and resolve the actual dataset path; current code will miss the downloaded ‘mls_english’ subdir.

The downloader writes to <dl_dir>/mls_english, but main() passes args.dl_dir directly to load_dataset(), causing failures. Also mark required args.

@@
-    parser.add_argument("-m", "--manifest-dir", type=Path)
-    parser.add_argument("-a", "--audio-dir", type=Path)
-    parser.add_argument("-d", "--dl-dir", type=Path)
+    parser.add_argument("-m", "--manifest-dir", type=Path, required=True,
+                        help="Directory for manifests and features.")
+    parser.add_argument("-a", "--audio-dir", type=Path, required=True,
+                        help="Directory to materialize audio shards.")
+    parser.add_argument("-d", "--dl-dir", type=Path, required=True,
+                        help="Base download dir (expects '<dl-dir>/mls_english') or a local HF dataset dir or an HF repo-id.")
@@
-        mls_eng_hf_dataset_path = args.dl_dir  # "/root/datasets/parler-tts--mls_eng"
-        cut_sets = make_cutset_blueprints(mls_eng_hf_dataset_path)
+        # Accept: (1) <dl-dir>/mls_english from our downloader,
+        #         (2) a direct local HF dataset dir, or
+        #         (3) a repo-id string (if user passes one).
+        if args.dl_dir.is_dir():
+            candidate = args.dl_dir / "mls_english"
+            mls_eng_hf_dataset_path = candidate if candidate.exists() else args.dl_dir
+        else:
+            mls_eng_hf_dataset_path = args.dl_dir  # repo-id string allowed
+        cut_sets = make_cutset_blueprints(str(mls_eng_hf_dataset_path))

Also applies to: 132-134

🤖 Prompt for AI Agents
In egs/mls_english/ASR/local/compute_fbank_mls_english.py around lines 110-112
(and similarly lines 132-134), the CLI currently defines --dl-dir as an optional
Path and passes args.dl_dir directly to load_dataset(), but the downloader
writes into a subdirectory named "mls_english", so the code misses the actual
dataset path and can fail; update the argparse definitions to mark required
arguments (audio-dir and dl-dir) and, before calling load_dataset(), resolve the
real dataset root by appending the "mls_english" subdirectory (e.g.,
dataset_root = args.dl_dir / "mls_english" if args.dl_dir is not None) and use
that resolved path when invoking load_dataset(). Ensure you also handle the case
where the resolved path does not exist with a clear error.

Comment on lines +118 to +159
# --- NEW: Generalized sampling function ---
def sample_split_by_hours(split_name, original_split, target_hours):
"""
Samples a dataset split to reach a target number of hours.
Returns the sampled Dataset object and its actual duration in milliseconds.
"""
target_duration_ms = target_hours * 3600 * 1000
current_duration_ms = 0
indices_to_include = []

if original_split is None or len(original_split) == 0:
print(
f" Warning: Original '{split_name}' split is empty or not found. Cannot sample.",
file=sys.stderr,
)
return None, 0

print(
f"\n Processing '{split_name}' split to reach approximately {target_hours} hours..."
)
print(
f" Total samples in original '{split_name}' split: {len(original_split)}"
)

all_original_indices = list(range(len(original_split)))
random.shuffle(all_original_indices) # Shuffle indices for random sampling

num_samples_processed = 0
for original_idx in all_original_indices:
if current_duration_ms >= target_duration_ms and target_hours > 0:
print(
f" Target {split_name} hours reached ({target_hours}h). Stopping processing."
)
break

example = original_split[original_idx]
duration_ms = get_duration_from_column(example)

if duration_ms > 0:
indices_to_include.append(original_idx)
current_duration_ms += duration_ms

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Avoid row-wise access (decodes audio) and prefetch durations; huge perf win.

original_split[i] materializes full rows and can decode Audio, which is prohibitively expensive. Prefetch the duration column and disable audio decode during sampling.

     def sample_split_by_hours(split_name, original_split, target_hours):
@@
-        target_duration_ms = target_hours * 3600 * 1000
+        target_duration_ms = target_hours * 3600 * 1000
@@
-        all_original_indices = list(range(len(original_split)))
+        # Disable audio decoding to avoid loading waveforms while sampling
+        if "audio" in original_split.features and isinstance(original_split.features["audio"], Audio):
+            sr = getattr(original_split.features["audio"], "sampling_rate", 16000)
+            original_split = original_split.cast_column("audio", Audio(sampling_rate=sr, decode=False))
+
+        # Prefetch durations once to avoid per-row materialization
+        duration_values = None
+        if duration_column_name in original_split.column_names:
+            try:
+                duration_values = original_split[duration_column_name]
+            except (RuntimeError, ValueError, OSError):
+                duration_values = None
+
+        all_original_indices = list(range(len(original_split)))
         random.shuffle(all_original_indices)  # Shuffle indices for random sampling
@@
-            example = original_split[original_idx]
-            duration_ms = get_duration_from_column(example)
+            if duration_values is not None:
+                # Fast path: parse from prefetched column
+                try:
+                    val = duration_values[original_idx]
+                    duration_ms = 0
+                    if val is not None:
+                        v = float(val)
+                        if math.isfinite(v) and v > 0:
+                            duration_ms = v * 1000
+                except (TypeError, ValueError):
+                    duration_ms = 0
+            else:
+                # Fallback: access a single row (audio decoding disabled above)
+                duration_ms = get_duration_from_column(original_split[original_idx])
@@
-        if "audio" in sampled_split.features and not isinstance(
-            sampled_split.features["audio"], Audio
-        ):
-            sampling_rate = (
-                sampled_split.features["audio"].sampling_rate
-                if isinstance(sampled_split.features["audio"], Audio)
-                else 16000
-            )
-            new_features = sampled_split.features
-            new_features["audio"] = Audio(sampling_rate=sampling_rate)
-            sampled_split = sampled_split.cast(new_features)
+        if "audio" in sampled_split.features and not isinstance(sampled_split.features["audio"], Audio):
+            sr = getattr(sampled_split.features["audio"], "sampling_rate", 16000)
+            new_features = sampled_split.features.copy()
+            new_features["audio"] = Audio(sampling_rate=sr, decode=False)
+            sampled_split = sampled_split.cast(new_features)

Also applies to: 175-187

🤖 Prompt for AI Agents
egs/mls_english/ASR/local/utils/create_subsets_greedy.py lines 118-159 (also
apply same change to 175-187): the loop uses original_split[i] which
materializes full rows and decodes audio per-row; instead prefetch the duration
column into memory and iterate that array while ensuring audio decoding is
disabled for the split during sampling. Concretely: obtain a durations
list/array from the dataset (e.g., fetch the single duration column via the
dataset API so it does not decode audio), shuffle indices over that length, use
durations[idx] to accumulate current_duration_ms and only access full examples
later when building the final subset, and make sure the dataset is opened/used
with audio decoding turned off or limited to the columns you need while
sampling.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
egs/multi_ja_en/ASR/RESULTS.md (4)

33-38: Align ONNX export averaging with reported decode settings.
Export uses --use-averaged-model 0 and --avg 1, while decode reports --avg 15. Mismatch can yield different WER/CER. Consider exporting the averaged model (or add a note explaining the difference).

Apply:

 ./zipformer/export-onnx.py \
   --tokens ./data/lang/bbpe_2000/tokens.txt \
-  --use-averaged-model 0 \
-  --epoch 21 \
-  --avg 1 \
+  --use-averaged-model 1 \
+  --epoch 21 \
+  --avg 15 \
   --exp-dir ./zipformer/exp

30-31: Nit: capitalize ONNX.
Minor documentation polish.

-To export the model with onnx:
+To export the model with ONNX:

51-58: Make English WER table reproducible (versions/splits/normalization).
Briefly note dataset versions/splits (e.g., CV release, TED corpus variant, MLS split) and text normalization/tokenization used.


60-67: Add dataset details for Japanese CER table.
Specify corpus versions/splits and scoring pipeline (e.g., full-/half-width normalization, punctuation handling) for reproducibility.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a4c1db5 and 9a940c3.

📒 Files selected for processing (1)
  • egs/multi_ja_en/ASR/RESULTS.md (3 hunks)
🧰 Additional context used
🪛 LanguageTool
egs/multi_ja_en/ASR/RESULTS.md

[grammar] ~53-~53: There might be a mistake here.
Context: ...Corpus | WER (%) | |-----------------------------|---------...

(QB_NEW_EN)


[grammar] ~54-~54: There might be a mistake here.
Context: ...-----------------------------|---------| | CommonVoice | 29.03 ...

(QB_NEW_EN)


[grammar] ~55-~55: There might be a mistake here.
Context: ... CommonVoice | 29.03 | | TED | 16.78 ...

(QB_NEW_EN)


[grammar] ~56-~56: There might be a mistake here.
Context: ... TED | 16.78 | | MLS English (test-clean) | 8.64 |...

(QB_NEW_EN)


[grammar] ~62-~62: There might be a mistake here.
Context: ...e datasets: | Corpus | CER (%) | |---------------|---------| | JSUT ...

(QB_NEW_EN)


[grammar] ~63-~63: There might be a mistake here.
Context: ... | CER (%) | |---------------|---------| | JSUT | 8.13 | | CommonVoice...

(QB_NEW_EN)


[grammar] ~64-~64: There might be a mistake here.
Context: ...--|---------| | JSUT | 8.13 | | CommonVoice | 9.82 | | TEDx ...

(QB_NEW_EN)


[grammar] ~65-~65: There might be a mistake here.
Context: ... | 8.13 | | CommonVoice | 9.82 | | TEDx | 11.64 | Pre-traine...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (130)
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.4.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.4.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.4.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.4.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.4.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.4.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.4.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.4.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.4.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.4.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.0 v20250630
🔇 Additional comments (2)
egs/multi_ja_en/ASR/RESULTS.md (2)

12-12: Epoch change to 21 is consistent with decode/export — LGTM.
No issues spotted with this value across commands.


23-23: Decode epoch updated to 21 — LGTM.
Matches training/export epoch.

--use-fp16 1 \
--exp-dir zipformer/exp \
--max-duration 600
--manifest-dir data/manifests
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

❓ Verification inconclusive

Verify manifest path expectations for training.
Switching to --manifest-dir data/manifests is fine; ensure the prep steps populate this directory and that the README points to the exact script/output layout.

Run to sanity-check presence:


🏁 Script executed:

#!/bin/bash
ls -lah data/manifests || exit 1
fd -e jsonl -e jsonl.gz data/manifests -x sh -c 'echo "found: $1" _ {}' \; || true

Length of output: 95


Ensure data/manifests exists and is populated before using --manifest-dir
The directory data/manifests doesn’t exist, so training will fail. Update your data-prep script (and the README) to output the generated JSONL/JSONL.GZ files into data/manifests.

🤖 Prompt for AI Agents
In egs/multi_ja_en/ASR/RESULTS.md around line 16, the README references
--manifest-dir data/manifests but that directory doesn't exist; update the
data-prep script to create data/manifests and emit the generated JSONL/JSONL.GZ
manifest files there, and update the README to document this output path and
prerequisite (or change the flag to the actual manifest path your prep currently
writes to).

@JinZr
Copy link
Collaborator

JinZr commented Sep 2, 2025 via email

kinanmartin and others added 2 commits September 2, 2025 18:05
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
egs/multi_ja_en/ASR/RESULTS.md (1)

12-16: Clarify manifests prep and make max-duration explicit (if non-default).

  • Ensure prepare scripts create/populate data/manifests before training; this was flagged earlier and still applies.
  • If train.py relies on a default max-duration, consider documenting the default or adding the flag for clarity.

Apply this doc tweak right below the training block to make the prerequisite unmistakable:

   --exp-dir zipformer/exp \
   --manifest-dir data/manifests

+Note: Run the data preparation steps first so that data/manifests exists and contains the JSONL/JSONL.GZ manifests used for training.
+If you rely on the default for --max-duration, consider stating it here (or pass it explicitly) for reproducibility.


</blockquote></details>
<details>
<summary>egs/mls_english/ASR/local/utils/asr_datamodule.py (2)</summary><blockquote>

`77-82`: **Resolved: --max-duration now uses float.**

Thanks for addressing the earlier inconsistency; `type=float` matches the default and expected semantics.

---

`319-321`: **Use configured --num-workers for validation DataLoader (and set persistent_workers accordingly).**

Hardcoding `num_workers=2` ignores the CLI knob and can underutilize hardware.


Apply:

```diff
         valid_dl = DataLoader(
             validate,
             sampler=valid_sampler,
             batch_size=None,
-            num_workers=2,
-            persistent_workers=False,
+            num_workers=self.args.num_workers,
+            persistent_workers=self.args.num_workers > 0,
         )

Optional: make the same change for train/test DataLoaders for consistency (see Lines 278-284 and 342-343).

🧹 Nitpick comments (11)
egs/multi_ja_en/ASR/RESULTS.md (4)

33-38: Confirm tokens path provenance and prep step.

The tokens path moved to ./data/lang/bbpe_2000/tokens.txt. Add a short note pointing readers to the tokenizer prep that generates it (bbpe_2000), or they’ll hit a missing file.

Suggested doc nudge:

 ./zipformer/export-onnx.py \
   --tokens ./data/lang/bbpe_2000/tokens.txt \
   --use-averaged-model 0 \
   --epoch 21 \
   --avg 1 \
   --exp-dir ./zipformer/exp
+
+# Ensure you have created ./data/lang/bbpe_2000 via the language prep step before exporting.

51-58: Annotate dataset versions and evaluation settings for WERs.

Specify dataset versions/splits (e.g., CV version, TED corpus variant), text normalization, and decoding method (greedy vs. MBS, LM? none). This aids reproducibility.

Example header tweak:

-| Corpus                       | WER (%) |
+| Corpus (version/split, norm, method) | WER (%) |

60-67: Do the same for Japanese CERs (versions/norm/method).

List dataset versions (e.g., JSUT release), normalization rules, and decode method for CER.


69-69: Pin the pre-trained model revision for stability.

Link is good; consider adding a specific HF revision/commit/tag in the text so results remain reproducible if the repo updates.

egs/mls_english/ASR/local/utils/asr_datamodule.py (7)

346-351: Avoid lru_cache on bound instance methods (B019); cache by path instead.

@lru_cache on methods retains references to self and can leak when multiple instances are created.

Replace the decorators and route through a path-keyed cache:

-    @lru_cache()
     def train_cuts(self) -> CutSet:
         logging.info("About to get train cuts")
-        return load_manifest_lazy(
-            self.args.manifest_dir / "mls_eng_cuts_train.jsonl.gz"
-        )
+        return _cached_load_cuts(str(self.args.manifest_dir / "mls_eng_cuts_train.jsonl.gz"))
 
-    @lru_cache()
     def valid_cuts(self) -> CutSet:
         logging.info("About to get dev cuts")
-        return load_manifest_lazy(
-            self.args.manifest_dir / "mls_eng_cuts_dev.jsonl.gz"
-        )
+        return _cached_load_cuts(str(self.args.manifest_dir / "mls_eng_cuts_dev.jsonl.gz"))
 
-    @lru_cache()
     def test_cuts(self) -> CutSet:
         logging.info("About to get test cuts")
-        return load_manifest_lazy(
-            self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz"
-        )
+        return _cached_load_cuts(str(self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz"))

Add the helper near the imports:

# top-level helper (place after imports)
@lru_cache(maxsize=16)
def _cached_load_cuts(path: str) -> CutSet:
    return load_manifest_lazy(Path(path))

Also applies to: 353-358, 360-365


196-205: Honor --concatenate-cuts during training too (not only in validation).

The CLI flag suggests train-time concatenation support; applying it here reduces padding and aligns with the docstring.

Apply:

         transforms = []
+        if self.args.concatenate_cuts:
+            transforms = [
+                CutConcatenate(
+                    duration_factor=self.args.duration_factor, gap=self.args.gap
+                )
+            ] + transforms

If order matters, keep CutConcatenate before CutMix so noise is applied after concatenation.


231-236: Be explicit about PrecomputedFeatures when not using on-the-fly features.

K2SpeechRecognitionDataset defaults may vary; explicit is clearer and future-proof.

Apply:

         train = K2SpeechRecognitionDataset(
             cut_transforms=transforms,
+            input_strategy=PrecomputedFeatures(),
             input_transforms=input_transforms,
             return_cuts=self.args.return_cuts,
         )
         else:
             validate = K2SpeechRecognitionDataset(
                 cut_transforms=transforms,
+                input_strategy=PrecomputedFeatures(),
                 return_cuts=self.args.return_cuts,
             )

Also applies to: 305-309


247-247: Remove unused noqa in commented code (RUF100).

The # noqa is unnecessary on a commented-out line.

Apply:

-            # transforms = [PerturbSpeed(factors=[0.9, 1.1], p=2/3)] + transforms   # noqa
+            # transforms = [PerturbSpeed(factors=[0.9, 1.1], p=2/3)] + transforms

291-295: Use iterable unpacking over list concatenation (RUF005).

Slightly cleaner and avoids an intermediate list.

Apply:

-            transforms = [
-                CutConcatenate(
-                    duration_factor=self.args.duration_factor, gap=self.args.gap
-                )
-            ] + transforms
+            transforms = [
+                CutConcatenate(
+                    duration_factor=self.args.duration_factor, gap=self.args.gap
+                ),
+                *transforms,
+            ]

26-26: Drop unused import load_manifest.

Apply:

-from lhotse import CutSet, Fbank, FbankConfig, load_manifest, load_manifest_lazy
+from lhotse import CutSet, Fbank, FbankConfig, load_manifest_lazy

278-284: Optional: enable persistent_workers when workers > 0 for throughput.

This can speed up data loading for long runs; keep it False when workers == 0.

Apply:

         train_dl = DataLoader(
             train,
             sampler=train_sampler,
             batch_size=None,
             num_workers=self.args.num_workers,
-            persistent_workers=False,
+            persistent_workers=self.args.num_workers > 0,
         )
         test_dl = DataLoader(
             test,
             batch_size=None,
             sampler=sampler,
-            num_workers=self.args.num_workers,
+            num_workers=self.args.num_workers,
+            persistent_workers=self.args.num_workers > 0,
         )

Also applies to: 338-343

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9a940c3 and ef7664e.

📒 Files selected for processing (2)
  • egs/mls_english/ASR/local/utils/asr_datamodule.py (1 hunks)
  • egs/multi_ja_en/ASR/RESULTS.md (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
egs/mls_english/ASR/local/utils/asr_datamodule.py (1)
icefall/utils.py (1)
  • str2bool (142-158)
🪛 Ruff (0.12.2)
egs/mls_english/ASR/local/utils/asr_datamodule.py

247-247: Unused blanket noqa directive

Remove unused noqa directive

(RUF100)


291-295: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


346-346: Use of functools.lru_cache or functools.cache on methods can lead to memory leaks

(B019)


353-353: Use of functools.lru_cache or functools.cache on methods can lead to memory leaks

(B019)


360-360: Use of functools.lru_cache or functools.cache on methods can lead to memory leaks

(B019)

🪛 LanguageTool
egs/multi_ja_en/ASR/RESULTS.md

[grammar] ~53-~53: There might be a mistake here.
Context: ...Corpus | WER (%) | |-----------------------------|---------...

(QB_NEW_EN)


[grammar] ~54-~54: There might be a mistake here.
Context: ...-----------------------------|---------| | CommonVoice | 29.03 ...

(QB_NEW_EN)


[grammar] ~55-~55: There might be a mistake here.
Context: ... CommonVoice | 29.03 | | TED | 16.78 ...

(QB_NEW_EN)


[grammar] ~56-~56: There might be a mistake here.
Context: ... TED | 16.78 | | MLS English (test-clean) | 8.64 |...

(QB_NEW_EN)


[grammar] ~62-~62: There might be a mistake here.
Context: ...e datasets: | Corpus | CER (%) | |---------------|---------| | JSUT ...

(QB_NEW_EN)


[grammar] ~63-~63: There might be a mistake here.
Context: ... | CER (%) | |---------------|---------| | JSUT | 8.13 | | CommonVoice...

(QB_NEW_EN)


[grammar] ~64-~64: There might be a mistake here.
Context: ...--|---------| | JSUT | 8.13 | | CommonVoice | 9.82 | | TEDx ...

(QB_NEW_EN)


[grammar] ~65-~65: There might be a mistake here.
Context: ... | 8.13 | | CommonVoice | 9.82 | | TEDx | 11.64 | Pre-traine...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (197)
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: style_check (ubuntu-latest, 3.10)
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: style_check (ubuntu-latest, 3.10)
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: style_check (ubuntu-latest, 3.10)
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: style_check (ubuntu-latest, 3.10)
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: style_check (ubuntu-latest, 3.10)
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: style_check (ubuntu-latest, 3.10)
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: style_check (ubuntu-latest, 3.10)
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: style_check (ubuntu-latest, 3.10)
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: style_check (ubuntu-latest, 3.10)
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: style_check (ubuntu-latest, 3.10)
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.5.0 v20250630
  • GitHub Check: py3.10 torch2.2.0 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.0.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
🔇 Additional comments (1)
egs/multi_ja_en/ASR/RESULTS.md (1)

23-28: LGTM: decode epoch/avg pair is consistent.

Epoch 21 with avg 15 matches the training epoch change. No further action.

Comment on lines +360 to +365
@lru_cache()
def test_cuts(self) -> List[CutSet]:
logging.info("About to get test cuts")
return load_manifest_lazy(
self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz"
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Type annotation mismatch: test_cuts returns a CutSet, not List[CutSet].

This can cause confusion and downstream typing issues; test_dataloaders also expects a CutSet.

Apply:

-    def test_cuts(self) -> List[CutSet]:
+    def test_cuts(self) -> CutSet:
         logging.info("About to get test cuts")
         return load_manifest_lazy(
             self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz"
         )

And remove the now-unused List import:

-from typing import Any, Dict, List, Optional
+from typing import Any, Dict, Optional
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@lru_cache()
def test_cuts(self) -> List[CutSet]:
logging.info("About to get test cuts")
return load_manifest_lazy(
self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz"
)
++ b/egs/mls_english/ASR/local/utils/asr_datamodule.py
@@ –1,1 +1,1 @@
from typing import Any, Dict, Optional
@@ –360,2 +360,2 @@
def test_cuts(self) -> CutSet:
logging.info("About to get test cuts")
return load_manifest_lazy(
self.args.manifest_dir / "mls_eng_cuts_test.jsonl.gz"
)
🧰 Tools
🪛 Ruff (0.12.2)

360-360: Use of functools.lru_cache or functools.cache on methods can lead to memory leaks

(B019)

🤖 Prompt for AI Agents
In egs/mls_english/ASR/local/utils/asr_datamodule.py around lines 360 to 365,
the test_cuts function is annotated to return List[CutSet] but actually returns
a CutSet; update the return type annotation to CutSet and remove the now-unused
List import from the file; ensure any related type hints (e.g., in
test_dataloaders) remain consistent with CutSet.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
egs/multi_ja_en/ASR/RESULTS.md (1)

13-20: Boolean CLI arg: confirm expected type for --enable-musan and ensure manifests exist.

  • Many icefall scripts expect 0/1 or store_true; passing literal "True" may be parsed as a string. Please confirm and align examples (e.g., --enable-musan 1 or just --enable-musan).
  • Also verify that the README/prep steps create and populate data/manifests before running these commands.

Run:

#!/bin/bash
# Check how enable_musan is defined
rg -nP --type=py -C2 '\benable[-_]?musan\b' egs/multi_ja_en/ASR/zipformer/train.py egs/multi_ja_en/ASR/zipformer/decode.py || true

# Sanity-check manifests after prep
ls -lah data/manifests || exit 1
fd -e jsonl -e jsonl.gz data/manifests -x sh -c 'echo "found: $1"' _ {}
🧹 Nitpick comments (5)
egs/multi_ja_en/ASR/RESULTS.md (5)

24-31: Decode config is underspecified (beam/num-active/etc.).

modified_beam_search usually needs beam/num_paths-like knobs for reproducibility. Add the key params you used (beam size, max contexts, n-best).


36-42: Capitalize and clarify: “export to ONNX”.

Minor wording fix and keep flags on one line per arg for copy/paste.

Apply:

-To export the model with onnx:
+To export the model to ONNX:

56-71: Clarify scoring setup for English WER and Japanese CER.

Specify casing (lowercased or original), punctuation normalization, tokenizer for Japanese (e.g., character-based vs. MeCab), and scoring tool (e.g., sclite). Without this, numbers aren’t comparable.


116-116: Document how chunk size affects WER/RTF.

Add a brief note on the trade-off across {16,32,64,128} (latency vs. accuracy) and indicate the value used to report streaming WER.


5-8: Add repro details: seed, commit hashes, and tool versions.

Include torch/k2/lhotse/icefall commit IDs, random seed(s), and GPU type to make these results reproducible.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ef7664e and bc2560c.

📒 Files selected for processing (1)
  • egs/multi_ja_en/ASR/RESULTS.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
egs/multi_ja_en/ASR/RESULTS.md

[grammar] ~44-~44: There might be a mistake here.
Context: ...xp-dir ./zipformer/exp ``` WER and CER on test set listed below (calculated with ...

(QB_NEW_EN)


[grammar] ~46-~46: There might be a mistake here.
Context: ...eech + MLS English (combined test set) | |----------------------|----------------...

(QB_NEW_EN)


[grammar] ~47-~47: There might be a mistake here.
Context: ...---------------------------------------| | Zipformer WER (%) | ...

(QB_NEW_EN)


[grammar] ~48-~48: There might be a mistake here.
Context: ... test | | greedy_search | ...

(QB_NEW_EN)


[grammar] ~49-~49: There might be a mistake here.
Context: ... 6.33 | | modified_beam_search | ...

(QB_NEW_EN)


[grammar] ~56-~56: There might be a mistake here.
Context: ... Corpus | WER (%) | |-----------------------------|---------...

(QB_NEW_EN)


[grammar] ~57-~57: There might be a mistake here.
Context: ...-----------------------------|---------| | CommonVoice | 29.03 ...

(QB_NEW_EN)


[grammar] ~58-~58: There might be a mistake here.
Context: ... CommonVoice | 29.03 | | TED | 16.78 ...

(QB_NEW_EN)


[grammar] ~59-~59: There might be a mistake here.
Context: ... TED | 16.78 | | MLS English (test set) | 8.64 ...

(QB_NEW_EN)


[grammar] ~65-~65: There might be a mistake here.
Context: ...e datasets: | Corpus | CER (%) | |---------------|---------| | JSUT ...

(QB_NEW_EN)


[grammar] ~66-~66: There might be a mistake here.
Context: ... | CER (%) | |---------------|---------| | JSUT | 8.13 | | CommonVoice...

(QB_NEW_EN)


[grammar] ~67-~67: There might be a mistake here.
Context: ...--|---------| | JSUT | 8.13 | | CommonVoice | 9.82 | | TEDx ...

(QB_NEW_EN)


[grammar] ~68-~68: There might be a mistake here.
Context: ... | 8.13 | | CommonVoice | 9.82 | | TEDx | 11.64 | Pre-traine...

(QB_NEW_EN)

🔇 Additional comments (3)
egs/multi_ja_en/ASR/RESULTS.md (3)

72-75: HF link fix looks good.
Anchor and href are consistent now.


83-91: Streaming training flags likely missing.

The streaming section repeats non-streaming args. If you used streaming-specific options (e.g., causal/chunked training flags), include them; otherwise readers can’t reproduce “Streaming” results.


107-114: Confirm ONNX export flags match the streaming architecture.

--decode-chunk-len 32 is set; if additional streaming-only flags are required at export (e.g., causal conv settings), document them.

Comment on lines +44 to +51
WER and CER on test set listed below (calculated with `./zipformer/decode.py`):

| Datasets | ReazonSpeech + MLS English (combined test set) |
|----------------------|------------------------------------------------|
| Zipformer WER (%) | test |
| greedy_search | 6.33 |
| modified_beam_search | 6.32 |

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Fix table layout for combined-test metrics.

Header/rows are misaligned (“Zipformer WER (%) | test” reads oddly). Consider:

-|       Datasets       | ReazonSpeech + MLS English (combined test set) |
-|----------------------|------------------------------------------------|
-|   Zipformer WER (%)  |                      test                      |
-|     greedy_search    |                      6.33                      |
-| modified_beam_search |                      6.32                      |
+| Method               | WER (%) | Notes                                   |
+|----------------------|--------:|-----------------------------------------|
+| greedy_search        |   6.33  | Combined ReazonSpeech + MLS English     |
+| modified_beam_search |   6.32  | Combined ReazonSpeech + MLS English     |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
WER and CER on test set listed below (calculated with `./zipformer/decode.py`):
| Datasets | ReazonSpeech + MLS English (combined test set) |
|----------------------|------------------------------------------------|
| Zipformer WER (%) | test |
| greedy_search | 6.33 |
| modified_beam_search | 6.32 |
WER and CER on test set listed below (calculated with `./zipformer/decode.py`):
| Method | WER (%) | Notes |
|----------------------|--------:|-----------------------------------------|
| greedy_search | 6.33 | Combined ReazonSpeech + MLS English |
| modified_beam_search | 6.32 | Combined ReazonSpeech + MLS English |
🧰 Tools
🪛 LanguageTool

[grammar] ~44-~44: There might be a mistake here.
Context: ...xp-dir ./zipformer/exp ``` WER and CER on test set listed below (calculated with ...

(QB_NEW_EN)


[grammar] ~46-~46: There might be a mistake here.
Context: ...eech + MLS English (combined test set) | |----------------------|----------------...

(QB_NEW_EN)


[grammar] ~47-~47: There might be a mistake here.
Context: ...---------------------------------------| | Zipformer WER (%) | ...

(QB_NEW_EN)


[grammar] ~48-~48: There might be a mistake here.
Context: ... test | | greedy_search | ...

(QB_NEW_EN)


[grammar] ~49-~49: There might be a mistake here.
Context: ... 6.33 | | modified_beam_search | ...

(QB_NEW_EN)

🤖 Prompt for AI Agents
In egs/multi_ja_en/ASR/RESULTS.md around lines 44 to 51, the markdown table
layout for the combined test metrics is misaligned (the header cell
"ReazonSpeech + MLS English (combined test set)" and the "Zipformer WER (%) |
test" row read oddly); fix by restructuring the table to have clear column
headers (e.g., Dataset | Metric | Value or Dataset | Zipformer WER (%) with a
single header row), align the separator row to match the number of columns, and
move "test" into a proper column cell so each row maps correctly to the headers.

Comment on lines 96 to 102
./zipformer/decode.py \
--epoch 10 \
--avg 1 \
--exp-dir ./zipformer/exp \
--decoding-method modified_beam_search \
--manifest-dir data/manifests
```
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Use streaming decode script/flags for the streaming model.

The streaming section uses decode.py without chunk args. Replace with the streaming decoder and include chunk length.

Apply:

-./zipformer/decode.py \
-    --epoch 10 \
-    --avg 1 \
-    --exp-dir ./zipformer/exp \
-    --decoding-method modified_beam_search \
-    --manifest-dir data/manifests
+./zipformer/streaming_decode.py \
+    --epoch 10 \
+    --avg 1 \
+    --exp-dir ./zipformer/exp \
+    --decoding-method modified_beam_search \
+    --decode-chunk-len 32 \
+    --manifest-dir data/manifests
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
./zipformer/decode.py \
--epoch 10 \
--avg 1 \
--exp-dir ./zipformer/exp \
--decoding-method modified_beam_search \
--manifest-dir data/manifests
```
./zipformer/streaming_decode.py \
--epoch 10 \
--avg 1 \
--exp-dir ./zipformer/exp \
--decoding-method modified_beam_search \
--decode-chunk-len 32 \
--manifest-dir data/manifests
🤖 Prompt for AI Agents
In egs/multi_ja_en/ASR/RESULTS.md around lines 96 to 102, the streaming decode
example uses the non-streaming decode.py and lacks chunking arguments; replace
it with the streaming decoder invocation (e.g., streaming_decode.py or the
decode.py streaming mode) and add required streaming flags such as
--decoding-method streaming_beam_search (or the project-specific streaming
method) plus chunk-related args like --chunk-length (and if applicable
--left-context/--right-context or --chunk-size) and any streaming-specific
model/exp flags; update the command line to call the streaming decoder script
and include a concrete --chunk-length value used by the streaming model.

Comment on lines +120 to 131
*Please let us know which script to use to evaluate the streaming model!*


We also include WER% for common English ASR datasets:

*Please let us know which script to use to evaluate the streaming model!*


And CER% for common Japanese datasets:

*Please let us know which script to use to evaluate the streaming model!*

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Replace placeholders with concrete evaluation commands.

The repeated “Please let us know…” lines block reproducibility. Provide the exact script(s), command lines, and scoring steps for streaming WER/CER.

Apply:

-*Please let us know which script to use to evaluate the streaming model!*
+Evaluate streaming with:
+```shell
+./zipformer/streaming_decode.py \
+  --epoch 10 --avg 1 --exp-dir ./zipformer/exp \
+  --decoding-method modified_beam_search \
+  --decode-chunk-len 32 \
+  --manifest-dir data/manifests
+```
+Score with the same normalization used for non-streaming; list exact commands/tools.
🤖 Prompt for AI Agents
In egs/multi_ja_en/ASR/RESULTS.md around lines 120-131, replace the repeated
placeholder lines with concrete evaluation instructions: add the exact streaming
decode command (pointing to zipformer/streaming_decode.py with required flags
such as epoch, avg, exp-dir, decoding-method, decode-chunk-len, manifest-dir),
then add the precise scoring steps and commands (normalization pipeline matching
non-streaming evaluation, followed by invocation of the scoring tool such as
sclite or the repo's wer/cer scoring script) and an example invocation showing
how to produce hypothesis and reference files and run the scorer; ensure you
list any required normalization options (lowercasing, punctuation removal,
kana/kanji normalization for Japanese) and the exact files produced/consumed so
readers can reproduce the reported WER/CER.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (4)
egs/multi_ja_en/ASR/RESULTS.md (4)

19-21: Ensure data/manifests is created and populated in prep.

Docs should state that prepare scripts emit manifests into data/manifests before training/decoding uses it.

Run to sanity‑check:

#!/bin/bash
set -e
ls -lah data/manifests
fd -e jsonl -e jsonl.gz data/manifests -x sh -c 'echo "found: $1"' _ {}

45-52: Fix misaligned WER table.

Replace with a simple Method|WER table:

-|       Datasets       | ReazonSpeech + MLS English (combined test set) |
-|----------------------|------------------------------------------------|
-|   Zipformer WER (%)  |                      test                      |
-|     greedy_search    |                      6.33                      |
-| modified_beam_search |                      6.32                      |
+| Method               | WER (%) | Notes                               |
+|----------------------|--------:|-------------------------------------|
+| greedy_search        |   6.33  | ReazonSpeech+MLS English, combined  |
+| modified_beam_search |   6.32  | ReazonSpeech+MLS English, combined  |

97-99: Replace TODO with a working streaming decode command.

-```shell
-TODO
-```
+```shell
+./zipformer/streaming_decode.py \
+  --epoch 10 \
+  --avg 1 \
+  --exp-dir ./zipformer/exp-15k15k-streaming \
+  --decoding-method modified_beam_search \
+  --decode-chunk-len 32 \
+  --manifest-dir data/manifests
+```

149-160: Replace placeholders with concrete streaming eval steps.

-Word Error Rates (WERs) listed below:
-
-*Please let us know which script to use to evaluate the streaming model!*
-
-
-We also include WER% for common English ASR datasets:
-
-*Please let us know which script to use to evaluate the streaming model!*
-
-
-And CER% for common Japanese datasets:
-
-*Please let us know which script to use to evaluate the streaming model!*
+Word Error Rates (WERs) for streaming:
+
+```shell
+# Decode (streaming)
+./zipformer/streaming_decode.py \
+  --epoch 10 --avg 1 \
+  --exp-dir ./zipformer/exp-15k15k-streaming \
+  --decoding-method modified_beam_search \
+  --decode-chunk-len 32 \
+  --manifest-dir data/manifests
+
+# Score (same normalization as non-streaming)
+# Example: produces refs/hyps then runs the repo scorer.
+python ./zipformer/scoring/prepare_refs_hyps.py \
+  --manifest-dir data/manifests \
+  --output-dir ./zipformer/exp-15k15k-streaming/scoring
+python ./zipformer/scoring/score_wer_cer.py \
+  --scoring-dir ./zipformer/exp-15k15k-streaming/scoring
+```
+
+We also include WER% for common English ASR datasets and CER% for Japanese with the same normalization.
🧹 Nitpick comments (5)
egs/multi_ja_en/ASR/RESULTS.md (5)

34-36: Capitalize ONNX.

-To export the model with onnx:
+To export the model with ONNX:

57-62: Specify dataset versions/splits and normalization.

State exact corpora (e.g., “Common Voice 17 test”, “TED-LIUM 3 test” or “TED Talks subset”), and text normalization used, to make WER reproducible.


66-71: Clarify JP sets and CER normalization.

Please note the dataset releases/splits (e.g., JSUT v1.1) and normalization (e.g., Kanji→Kana mapping, punctuation handling).


101-103: Use canonical name “sherpa-onnx”.

-To export the model with sherpa onnx:
+To export the model with sherpa-onnx:

145-145: Match flag name in the text.

-You may also use decode chunk sizes `16`, `32`, `64`, `128`.
+You may also use --decode-chunk-len values 16, 32, 64, 128.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between bc2560c and ecbe985.

📒 Files selected for processing (1)
  • egs/multi_ja_en/ASR/RESULTS.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
egs/multi_ja_en/ASR/RESULTS.md

[grammar] ~45-~45: There might be a mistake here.
Context: ...xp-dir ./zipformer/exp ``` WER and CER on test set listed below (calculated with ...

(QB_NEW_EN)


[grammar] ~47-~47: There might be a mistake here.
Context: ...eech + MLS English (combined test set) | |----------------------|----------------...

(QB_NEW_EN)


[grammar] ~48-~48: There might be a mistake here.
Context: ...---------------------------------------| | Zipformer WER (%) | ...

(QB_NEW_EN)


[grammar] ~49-~49: There might be a mistake here.
Context: ... test | | greedy_search | ...

(QB_NEW_EN)


[grammar] ~50-~50: There might be a mistake here.
Context: ... 6.33 | | modified_beam_search | ...

(QB_NEW_EN)


[grammar] ~57-~57: There might be a mistake here.
Context: ... Corpus | WER (%) | |-----------------------------|---------...

(QB_NEW_EN)


[grammar] ~58-~58: There might be a mistake here.
Context: ...-----------------------------|---------| | CommonVoice | 29.03 ...

(QB_NEW_EN)


[grammar] ~59-~59: There might be a mistake here.
Context: ... CommonVoice | 29.03 | | TED | 16.78 ...

(QB_NEW_EN)


[grammar] ~60-~60: There might be a mistake here.
Context: ... TED | 16.78 | | MLS English (test set) | 8.64 ...

(QB_NEW_EN)


[grammar] ~66-~66: There might be a mistake here.
Context: ...e datasets: | Corpus | CER (%) | |---------------|---------| | JSUT ...

(QB_NEW_EN)


[grammar] ~67-~67: There might be a mistake here.
Context: ... | CER (%) | |---------------|---------| | JSUT | 8.13 | | CommonVoice...

(QB_NEW_EN)


[grammar] ~68-~68: There might be a mistake here.
Context: ...--|---------| | JSUT | 8.13 | | CommonVoice | 9.82 | | TEDx ...

(QB_NEW_EN)


[grammar] ~69-~69: There might be a mistake here.
Context: ... | 8.13 | | CommonVoice | 9.82 | | TEDx | 11.64 | Pre-traine...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (32)
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.6.0 v20250630
  • GitHub Check: py3.10 torch2.7.1 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
🔇 Additional comments (1)
egs/multi_ja_en/ASR/RESULTS.md (1)

27-31: Decode args look consistent with the new pipeline.

Comment on lines +13 to 21
--world-size 8 \
--causal 1 \
--num-epochs 10 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp \
--max-duration 600
--manifest-dir data/manifests \
--enable-musan True
```
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Non‑streaming section sets causal=1 — flip to 0 or clarify.

Using --causal 1 here will train a streaming/causal model under a “Non‑streaming” heading.

Apply:

   ./zipformer/train.py \
-  --world-size 8 \
-  --causal 1 \
+  --world-size 8 \
+  --causal 0 \
   --num-epochs 10 \
   --start-epoch 1 \
   --use-fp16 1 \
   --exp-dir zipformer/exp \
   --manifest-dir data/manifests \
   --enable-musan True
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
--world-size 8 \
--causal 1 \
--num-epochs 10 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp \
--max-duration 600
--manifest-dir data/manifests \
--enable-musan True
```
./zipformer/train.py \
--world-size 8 \
--causal 0 \
--num-epochs 10 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp \
--manifest-dir data/manifests \
--enable-musan True
🤖 Prompt for AI Agents
egs/multi_ja_en/ASR/RESULTS.md around lines 13 to 21: the Non‑streaming section
currently sets --causal 1 which configures a streaming/causal model; change
--causal 1 to --causal 0 to match the Non‑streaming heading (or alternatively
add a clarifying note above the block that this run intentionally trains a
causal/streaming model), and ensure the comment/heading and flag are consistent.

Comment on lines +37 to 43
./zipformer/export-onnx.py \
--tokens ./data/lang/bbpe_2000/tokens.txt \
--use-averaged-model 0 \
--epoch 10 \
--avg 1 \
--exp-dir ./zipformer/exp
```
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Token path: add a quick existence check in docs or prep.

Ensure data/lang/bbpe_2000/tokens.txt is produced by your BBPE training step before export.

Run:


🏁 Script executed:

#!/bin/bash
test -f data/lang/bbpe_2000/tokens.txt && echo "tokens.txt OK" || (echo "tokens.txt missing" && exit 1)

Length of output: 96


Add existence check for tokens.txt in export instructions
In egs/multi_ja_en/ASR/RESULTS.md (lines 37–43), prepend a test to verify data/lang/bbpe_2000/tokens.txt exists before invoking export-onnx.py, for example:

test -f data/lang/bbpe_2000/tokens.txt || { echo "Error: tokens.txt not found. Run BBPE training first."; exit 1; }

This ensures users generate the tokens file beforehand and avoids silent failures.

🤖 Prompt for AI Agents
In egs/multi_ja_en/ASR/RESULTS.md around lines 37 to 43, the export example
calls export-onnx.py without checking that data/lang/bbpe_2000/tokens.txt
exists; add a pre-check that verifies that file exists and, if not, prints a
clear error like "Error: tokens.txt not found. Run BBPE training first." and
exits with a non-zero status before running export-onnx.py so users don't hit
silent failures.

Comment on lines +85 to +93
--world-size 8 \
--causal 1 \
--num-epochs 10 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp \
--manifest-dir data/manifests \
--enable-musan True
```
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Streaming exp-dir inconsistent with later export path.

Train uses ./zipformer/exp, but export uses ./zipformer/exp-15k15k-streaming. Align them.

 ./zipformer/train.py \
   --world-size 8 \
   --causal 1 \
   --num-epochs 10 \
   --start-epoch 1 \
   --use-fp16 1 \
-  --exp-dir zipformer/exp \
+  --exp-dir zipformer/exp-15k15k-streaming \
   --manifest-dir data/manifests \
   --enable-musan True
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
--world-size 8 \
--causal 1 \
--num-epochs 10 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp \
--manifest-dir data/manifests \
--enable-musan True
```
./zipformer/train.py \
--world-size 8 \
--causal 1 \
--num-epochs 10 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp-15k15k-streaming \
--manifest-dir data/manifests \
--enable-musan True
🤖 Prompt for AI Agents
In egs/multi_ja_en/ASR/RESULTS.md around lines 85 to 93, the training command
uses --exp-dir zipformer/exp while the export later references
./zipformer/exp-15k15k-streaming; make them consistent by changing one to match
the other (preferred: set --exp-dir to zipformer/exp-15k15k-streaming to match
the streaming export) so both training and export use the identical experiment
directory path.

--chunk-size 16 \
--left-context-frames 128 \
--tokens data/lang/bbpe_2000/tokens.txt \
--epoch 10 \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since your dataset is large, can you also try --iter instead of --epoch?

baileyeet and others added 4 commits September 11, 2025 15:18
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (5)
egs/mls_english/ASR/zipformer/streaming_decode.py (5)

37-37: Use MLSEnglishHFAsrDataModule instead of ReazonSpeechAsrDataModule.

This script imports ReazonSpeechAsrDataModule but should import the MLS English data module to align with the MLS English recipe context.

This is identical to the past review comment flagging the same issue.


515-522: Fix stream completion logic and tensor conversion.

Two critical issues: all streams are unconditionally marked as finished, and tensor values need conversion to Python integers.

This matches the past review comment identifying the same issues.


695-697: Fix variable name collision in WER summary filename.

The variable key is shadowed by the loop variable, creating ambiguous filename generation.

This matches the past review comment about the same variable scoping issue.


712-717: Use MLSEnglishHFAsrDataModule instead of ReazonSpeechAsrDataModule.

The main function uses ReazonSpeechAsrDataModule but should use the MLS English data module for consistency with the MLS English recipe.

This matches the past review comment flagging the same data module mismatch.


856-865: Use MLSEnglishHFAsrDataModule instead of ReazonSpeechAsrDataModule.

The dataset loading uses ReazonSpeechAsrDataModule but should use the appropriate MLS English data module.

This matches the past review comment identifying the same data module issue.

🧹 Nitpick comments (2)
egs/mls_english/ASR/zipformer/streaming_decode.py (2)

29-29: Remove unused pdb import.

The pdb import is not used in the code and should be removed.

Apply this diff:

-import pdb

415-418: Consider using iterable unpacking for state concatenation.

The state list concatenation can be simplified with iterable unpacking for better readability.

Apply this diff:

-    new_states = new_encoder_states + [
-        new_cached_embed_left_pad,
-        new_processed_lens,
-    ]
+    new_states = [*new_encoder_states, new_cached_embed_left_pad, new_processed_lens]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ecbe985 and d74e232.

📒 Files selected for processing (2)
  • egs/mls_english/ASR/zipformer/streaming_decode.py (1 hunks)
  • egs/reazonspeech/ASR/local/compute_fbank_musan.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • egs/reazonspeech/ASR/local/compute_fbank_musan.py
🧰 Additional context used
🧬 Code graph analysis (1)
egs/mls_english/ASR/zipformer/streaming_decode.py (7)
egs/mls_english/ASR/zipformer/decode_stream.py (5)
  • DecodeStream (28-148)
  • get_feature_frames (123-137)
  • id (105-106)
  • set_features (108-121)
  • decoding_result (139-148)
egs/mls_english/ASR/zipformer/streaming_beam_search.py (3)
  • fast_beam_search_one_best (208-295)
  • greedy_search (30-96)
  • modified_beam_search (99-205)
egs/mls_english/ASR/zipformer/tokenizer.py (4)
  • Tokenizer (9-169)
  • add_arguments (13-25)
  • piece_to_id (191-195)
  • get_piece_size (200-201)
egs/mls_english/ASR/zipformer/train.py (3)
  • add_model_arguments (125-258)
  • get_model (612-640)
  • get_params (482-545)
icefall/checkpoint.py (2)
  • average_checkpoints_with_averaged_model (389-454)
  • find_checkpoints (254-314)
icefall/utils.py (4)
  • AttributeDict (214-236)
  • make_pad_mask (1394-1431)
  • store_transcripts (558-580)
  • str2bool (142-158)
egs/mls_english/ASR/local/utils/asr_datamodule.py (3)
  • add_arguments (62-180)
  • valid_cuts (354-358)
  • test_cuts (361-365)
🪛 Ruff (0.12.2)
egs/mls_english/ASR/zipformer/streaming_decode.py

29-29: Import for pdb found

(T100)


201-201: Do not perform function call torch.device in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


415-418: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


511-511: Avoid specifying long messages outside the exception class

(TRY003)


663-663: Avoid specifying long messages outside the exception class

(TRY003)


774-777: Avoid specifying long messages outside the exception class

(TRY003)


779-782: Avoid specifying long messages outside the exception class

(TRY003)


803-806: Avoid specifying long messages outside the exception class

(TRY003)


808-811: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (180)
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.0 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.0 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.0 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.0 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.0 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.0 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.0 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.0 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.0 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch1.13.1 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.0.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.1 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.0 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
  • GitHub Check: py3.10 torch2.1.0 v20250630
  • GitHub Check: py3.10 torch2.3.1 v20250630
  • GitHub Check: py3.10 torch2.2.2 v20250630
  • GitHub Check: py3.10 torch1.13.0 v20250630
  • GitHub Check: py3.10 torch2.4.1 v20250630
  • GitHub Check: py3.10 torch2.2.1 v20250630
  • GitHub Check: py3.10 torch2.1.2 v20250630
🔇 Additional comments (2)
egs/mls_english/ASR/zipformer/streaming_decode.py (2)

388-396: Correct processed mask construction.

The mask construction is correct here and properly computes padding positions without the flip(1) issue mentioned in past comments.


32-32: Add missing import for Optional type hint.

The Optional type hint is used in the suggested fix but not imported.

Apply this diff:

 from pathlib import Path
-from typing import Dict, List, Optional, Tuple
+from typing import Dict, List, Optional, Tuple

Note: This appears to already be imported correctly, so no change is needed.

Likely an incorrect or invalid review comment.

Comment on lines +20 to +23
Usage:
./zipformer/streaming_decode.py--epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192
"""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix malformed command line example in docstring.

The usage example has a syntax error with missing space after streaming_decode.py.

Apply this diff:

-./zipformer/streaming_decode.py--epoch 28   --avg 15   --causal 1   --chunk-size 32   --left-context-frames 256   --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192
+./zipformer/streaming_decode.py --epoch 28   --avg 15   --causal 1   --chunk-size 32   --left-context-frames 256   --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Usage:
./zipformer/streaming_decode.py--epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192
"""
Usage:
./zipformer/streaming_decode.py --epoch 28 --avg 15 --causal 1 --chunk-size 32 --left-context-frames 256 --exp-dir ./zipformer/exp-large --lang data/lang_char --num-encoder-layers 2,2,4,5,4,2 --feedforward-dim 512,768,1536,2048,1536,768 --encoder-dim 192,256,512,768,512,256 --encoder-unmasked-dim 192,192,256,320,256,192
"""
🤖 Prompt for AI Agents
egs/mls_english/ASR/zipformer/streaming_decode.py around lines 20 to 23: the
docstring usage example is malformed because there's no space between
"streaming_decode.py" and the command-line flags; fix by inserting a space after
"streaming_decode.py" so the example reads "streaming_decode.py --epoch 28 ..."
and ensure the rest of flags remain unchanged.

def get_init_states(
model: nn.Module,
batch_size: int = 1,
device: torch.device = torch.device("cpu"),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Avoid mutable default argument with function call.

Using torch.device("cpu") as a default argument is problematic as it creates the device object at function definition time rather than call time.

Apply this diff:

-def get_init_states(
-    model: nn.Module,
-    batch_size: int = 1,
-    device: torch.device = torch.device("cpu"),
-) -> List[torch.Tensor]:
+def get_init_states(
+    model: nn.Module,
+    batch_size: int = 1,
+    device: Optional[torch.device] = None,
+) -> List[torch.Tensor]:

And update the function body:

+    if device is None:
+        device = torch.device("cpu")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
device: torch.device = torch.device("cpu"),
def get_init_states(
model: nn.Module,
batch_size: int = 1,
device: Optional[torch.device] = None,
) -> List[torch.Tensor]:
if device is None:
device = torch.device("cpu")
🧰 Tools
🪛 Ruff (0.12.2)

201-201: Do not perform function call torch.device in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

🤖 Prompt for AI Agents
In egs/mls_english/ASR/zipformer/streaming_decode.py around line 201, the
function currently uses torch.device("cpu") as a default argument which is
created at import time; change the parameter default to None (typing it as
Optional[torch.device]), add "from typing import Optional" if not present, and
inside the function set device = torch.device("cpu") when device is None so the
device object is created at call time; update any type hints and callers if
necessary.



The pre-trained model is available here : [reazon-research/mls-english
](https://huggingface.co/reazon-research/mls-english)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This URL leads to a 404 page. Can you make it public?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants