145 commits
28f6545
WIP v0 MLS English recipe
kinanmartin Apr 9, 2025
ac0c0ed
update prepare.sh, fix asr_datamodule.py
kinanmartin Apr 11, 2025
a1fc642
change default path
kinanmartin Apr 11, 2025
defc71b
replace file
kinanmartin Apr 13, 2025
efe015d
cleaned-up version of recipe
kinanmartin Apr 15, 2025
8c1c710
symlink copied files to librispeech recipe dir
kinanmartin Apr 15, 2025
8985259
separate transcript prep stage from bpe train stage
kinanmartin Apr 15, 2025
a34d34a
pre-commit hooks
kinanmartin Apr 15, 2025
ce44150
readme
kinanmartin Apr 15, 2025
68e3cea
instead of on-the-fly features, precompute fbank and manifests in pre…
kinanmartin Apr 23, 2025
d6e3c98
move compute_fbank_mls_english.py, add validate_manifest.py, add shar…
kinanmartin Apr 24, 2025
4ca8ee9
adjusted prepare.sh to only calculate fbank and manifest together; ad…
kinanmartin Apr 30, 2025
59519a4
fix validation manifest name
kinanmartin Apr 30, 2025
f2e0171
fix stage 2 and 3
kinanmartin Apr 30, 2025
fa84782
optimize with num_jobs on save_audios
kinanmartin May 1, 2025
abebb6a
new version of multi_ja_en prepare.sh script which swaps Librispeech …
kinanmartin May 9, 2025
c83b115
add fbank
baileyeet May 1, 2025
61e81bf
Revert "add fbank"
baileyeet May 2, 2025
3751441
deprecate params.bilingual=0, replace ReazonSpeechAsrDataModule for M…
kinanmartin May 13, 2025
6d71d9c
remove bilingual tag from train.py
baileyeet May 13, 2025
5417e09
restore version of mls_english compute_fbank_mls_english.py and prepa…
kinanmartin May 14, 2025
782e1fb
fix stage 5 output pathing
kinanmartin May 15, 2025
f4b2987
switch mls_english clone from https to ssh
kinanmartin May 21, 2025
a8ecb16
use huggingface_hub library to download mls_english
kinanmartin May 22, 2025
3307836
Combined updates. Changed BBPE path structure, changed dataset path s…
kinanmartin Jun 4, 2025
2f1c611
fix decode script data module usage
kinanmartin Jun 6, 2025
eafbd64
add utility file for updating the storage_path of cutsets for use in …
kinanmartin Jun 6, 2025
b167ac7
add utility file for creating subsets of mls english. must be fixed t…
kinanmartin Jun 6, 2025
ad1be22
Parametrize dev and test split sizes.
kinanmartin Jun 10, 2025
78ee595
Add failsafe for MLS English dev set key alternate name as validation
kinanmartin Jun 11, 2025
fd3fbe6
Update README.md to reflect MLS English dataset
kinanmartin Jun 11, 2025
c77a847
add step 4: display manifest stats to mls_eng
baileyeet Jun 11, 2025
cdf246c
update manifest dir path
baileyeet Jun 11, 2025
f3e59df
add stage 6 - update cutset paths to prepare
baileyeet Jun 11, 2025
ddc2daa
remove commented out codels
baileyeet Jun 12, 2025
f6ad423
changes to train script - no need for limiting utterance length here
baileyeet Jun 12, 2025
19b62c0
remove unused local scripts
baileyeet Jun 12, 2025
5f2f684
make prepare.sh symlinks relative
kinanmartin Jul 8, 2025
70a7940
changes to asr_datamodule for musan support
baileyeet Jul 1, 2025
df923f3
typos
baileyeet Jul 1, 2025
5ec9389
commenting
baileyeet Jul 1, 2025
de35cc2
remove comment
baileyeet Jul 4, 2025
f51621b
resolve typos and import issues
baileyeet Jul 9, 2025
4e92879
update musan path
baileyeet Jul 10, 2025
093a035
update musan paths
baileyeet Jul 10, 2025
0f700ed
update musan symlinks
baileyeet Jul 11, 2025
d5cc030
attempt to fix musan paths
baileyeet Jul 14, 2025
aee7b87
working changes for musan mixing
baileyeet Jul 15, 2025
310aaec
Update egs/multi_ja_en/ASR/local/utils/update_cutset_paths.py
baileyeet Jul 16, 2025
542620c
Update egs/multi_ja_en/ASR/local/utils/update_cutset_paths.py
baileyeet Jul 16, 2025
f7fec4a
Update egs/multi_ja_en/ASR/local/utils/update_cutset_paths.py
baileyeet Jul 16, 2025
154ef43
Update egs/multi_ja_en/ASR/local/utils/update_cutset_paths.py
baileyeet Jul 16, 2025
6012edb
black and isort formatting
baileyeet Jul 16, 2025
dc4db37
PR review suggestions implemented
baileyeet Jul 16, 2025
9d93d63
Update RESULTS.md
baileyeet Jul 18, 2025
aed139f
Musan implementation for ReazonSpeech (#1988)
baileyeet Jul 18, 2025
dbd8977
Manually fix merge conflict in multi_ja_en/ASR/zipformer/train.py
kinanmartin Jul 28, 2025
1c5d792
Validate generated manifest files. (#338)
csukuangfj May 2, 2022
94cf8c3
support left pad for make_pad_mask (#1990)
yfyeung Jul 16, 2025
2d8e3fd
Fix transformer decoder layer (#1995)
csukuangfj Jul 18, 2025
c23af2e
musan implementation for mls_english
baileyeet Aug 5, 2025
ed79fa3
revert unrelated transformer.py diffs from rebase
baileyeet Aug 5, 2025
4e05d70
fix stash commit
baileyeet Aug 6, 2025
130c2a5
Merge branch 'multi_ja_en_mls_english_clean' into musan-mls-clean-final
baileyeet Aug 6, 2025
5400f43
training and decoding compatibility changes
baileyeet Aug 11, 2025
8c08c9c
Create RESULTS.md
baileyeet Aug 14, 2025
8e18616
Update RESULTS.md
baileyeet Aug 14, 2025
556a3f0
Update README.md
baileyeet Aug 14, 2025
36fc1f1
Merge pull request #4 from reazon-research/musan-mls-clean-final
kinanmartin Aug 22, 2025
7231cf4
Remove changes to files outside of relevant recipes
kinanmartin Aug 29, 2025
a4c1db5
reformat
baileyeet Sep 2, 2025
2859c22
Update RESULTS.md
baileyeet Sep 2, 2025
9a940c3
Update RESULTS.md
baileyeet Sep 2, 2025
f64a706
Update egs/multi_ja_en/ASR/RESULTS.md
kinanmartin Sep 2, 2025
ef7664e
Update egs/mls_english/ASR/local/utils/asr_datamodule.py
kinanmartin Sep 2, 2025
bc2560c
Update training commands and decode.py accuracy values, add streaming…
kinanmartin Sep 3, 2025
ecbe985
Update streaming train and export commands
kinanmartin Sep 4, 2025
a30e80c
Remove accidentally added submodule musan-k2-v2-reazonspeech-medium
baileyeet Sep 11, 2025
9d389cd
Update egs/reazonspeech/ASR/local/compute_fbank_musan.py
baileyeet Sep 11, 2025
8c84639
Update egs/mls_english/ASR/zipformer/streaming_decode.py
baileyeet Sep 11, 2025
d74e232
Merge branch 'master' into multi_ja_en_mls_english_clean
baileyeet Sep 11, 2025
19 changes: 19 additions & 0 deletions egs/mls_english/ASR/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
# Introduction



**Multilingual LibriSpeech (MLS)** is a large multilingual corpus suitable for speech research. The dataset is derived from read LibriVox audiobooks and covers eight languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish. It includes about 44.5K hours of English and a total of about 6K hours across the other languages. This icefall training recipe was created for the restructured version of the English split of the dataset available on Hugging Face below.


The dataset is available on Hugging Face. For more details, please visit:

- Dataset: https://huggingface.co/datasets/parler-tts/mls_eng
- Original MLS dataset link: https://www.openslr.org/94


## On-the-fly feature computation

This recipe currently only supports on-the-fly feature bank computation, since `lhotse` manifests and feature banks are not pre-calculated. This should mean that the dataset can be streamed from Hugging Face, but we have not tested this yet. We may add a version that pre-computes features to better match existing recipes.

[./RESULTS.md](./RESULTS.md) contains the latest results. This MLS English recipe was primarily developed for use in the `multi_ja_en` Japanese-English bilingual pipeline, which is based on MLS English and ReazonSpeech.
41 changes: 41 additions & 0 deletions egs/mls_english/ASR/RESULTS.md
@@ -0,0 +1,41 @@
## Results

### MLS English training results (non-streaming) on the zipformer model

**WER on Test Set (Epoch 20)**

| Type | Greedy | Beam search |
|---------------|--------|-------------|
| Non-streaming | 6.65 | 6.57 |
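For reference, the WER figures above are word error rates: word-level edit distance divided by the number of reference words. A quick sanity-check implementation (not the scoring script icefall actually uses):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count, in percent."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, sub)
    return 100.0 * dp[len(r)][len(h)] / len(r)

print(round(wer("the cat sat on the mat", "the cat sat on mat"), 2))  # → 16.67 (one deletion out of six words)
```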


The training command:

```bash
./zipformer/train.py \
--world-size 8 \
--num-epochs 20 \
--start-epoch 9 \
--use-fp16 1 \
--exp-dir zipformer/exp \
--lang-dir data/lang/bpe_2000/
```

The decoding command:

```bash
./zipformer/decode.py \
--epoch 20 \
--exp-dir ./zipformer/exp \
--lang-dir data/lang/bpe_2000/ \
--decoding-method greedy_search
```


The pre-trained model is available here: [reazon-research/mls-english](https://huggingface.co/reazon-research/mls-english)
> A collaborator commented: This URL leads to a 404 page. Can you make it public?



Please note that this recipe was developed primarily as the source of English input in the bilingual Japanese-English recipe `multi_ja_en`, which uses ReazonSpeech and MLS English.
egs/mls_english/ASR/local/compute_fbank_mls_english.py
@@ -33,6 +33,7 @@
     RecordingSet,
     SupervisionSet,
 )
+from lhotse.utils import is_module_available

 # fmt: on

@@ -48,55 +49,54 @@
 def make_cutset_blueprints(
-    manifest_dir: Path,
+    mls_eng_hf_dataset_path: str = "parler-tts/mls_eng",
 ) -> List[Tuple[str, CutSet]]:
     cut_sets = []

+    if not is_module_available("datasets"):
+        raise ImportError(
+            "To process the MLS English HF corpus, please install optional dependency: pip install datasets"
+        )
+
+    from datasets import load_dataset
+
+    print(f"{mls_eng_hf_dataset_path=}")
+    dataset = load_dataset(str(mls_eng_hf_dataset_path))
+
     # Create test dataset
     logging.info("Creating test cuts.")
     cut_sets.append(
         (
             "test",
-            CutSet.from_manifests(
-                recordings=RecordingSet.from_file(
-                    manifest_dir / "reazonspeech_recordings_test.jsonl.gz"
-                ),
-                supervisions=SupervisionSet.from_file(
-                    manifest_dir / "reazonspeech_supervisions_test.jsonl.gz"
-                ),
-            ),
+            CutSet.from_huggingface_dataset(dataset["test"], text_key="transcript"),
         )
     )

     # Create dev dataset
     logging.info("Creating dev cuts.")
-    cut_sets.append(
-        (
-            "dev",
-            CutSet.from_manifests(
-                recordings=RecordingSet.from_file(
-                    manifest_dir / "reazonspeech_recordings_dev.jsonl.gz"
-                ),
-                supervisions=SupervisionSet.from_file(
-                    manifest_dir / "reazonspeech_supervisions_dev.jsonl.gz"
-                ),
-            ),
-        )
-    )
+    try:
+        cut_sets.append(
+            (
+                "dev",
+                CutSet.from_huggingface_dataset(dataset["dev"], text_key="transcript"),
+            )
+        )
+    except KeyError:
+        cut_sets.append(
+            (
+                "dev",
+                CutSet.from_huggingface_dataset(
+                    dataset["validation"], text_key="transcript"
+                ),
+            )
+        )

     # Create train dataset
     logging.info("Creating train cuts.")
     cut_sets.append(
         (
             "train",
-            CutSet.from_manifests(
-                recordings=RecordingSet.from_file(
-                    manifest_dir / "reazonspeech_recordings_train.jsonl.gz"
-                ),
-                supervisions=SupervisionSet.from_file(
-                    manifest_dir / "reazonspeech_supervisions_train.jsonl.gz"
-                ),
-            ),
+            CutSet.from_huggingface_dataset(dataset["train"], text_key="transcript"),
         )
     )
     return cut_sets
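The `dev`/`validation` fallback in the hunk above reduces to a dictionary lookup with a `KeyError` fallback. A minimal standalone sketch, with a plain dict standing in for the Hugging Face `DatasetDict`:

```python
def pick_dev_split(splits: dict):
    # Mirror the recipe's fallback: some exports of the dataset name the
    # held-out split "dev", others "validation".
    try:
        return splits["dev"]
    except KeyError:
        return splits["validation"]

print(pick_dev_split({"validation": ["utt1", "utt2"], "train": [], "test": []}))  # → ['utt1', 'utt2']
```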
@@ -107,6 +107,8 @@ def get_args():
         formatter_class=argparse.ArgumentDefaultsHelpFormatter,
     )
     parser.add_argument("-m", "--manifest-dir", type=Path)
+    parser.add_argument("-a", "--audio-dir", type=Path)
+    parser.add_argument("-d", "--dl-dir", type=Path)
     return parser.parse_args()
Comment on lines +110 to 112
⚠️ Potential issue

Fix --dl-dir semantics and resolve the actual dataset path; current code will miss the downloaded ‘mls_english’ subdir.

The downloader writes to <dl_dir>/mls_english, but main() passes args.dl_dir directly to load_dataset(), causing failures. Also mark required args.

@@
-    parser.add_argument("-m", "--manifest-dir", type=Path)
-    parser.add_argument("-a", "--audio-dir", type=Path)
-    parser.add_argument("-d", "--dl-dir", type=Path)
+    parser.add_argument("-m", "--manifest-dir", type=Path, required=True,
+                        help="Directory for manifests and features.")
+    parser.add_argument("-a", "--audio-dir", type=Path, required=True,
+                        help="Directory to materialize audio shards.")
+    parser.add_argument("-d", "--dl-dir", type=Path, required=True,
+                        help="Base download dir (expects '<dl-dir>/mls_english') or a local HF dataset dir or an HF repo-id.")
@@
-        mls_eng_hf_dataset_path = args.dl_dir  # "/root/datasets/parler-tts--mls_eng"
-        cut_sets = make_cutset_blueprints(mls_eng_hf_dataset_path)
+        # Accept: (1) <dl-dir>/mls_english from our downloader,
+        #         (2) a direct local HF dataset dir, or
+        #         (3) a repo-id string (if user passes one).
+        if args.dl_dir.is_dir():
+            candidate = args.dl_dir / "mls_english"
+            mls_eng_hf_dataset_path = candidate if candidate.exists() else args.dl_dir
+        else:
+            mls_eng_hf_dataset_path = args.dl_dir  # repo-id string allowed
+        cut_sets = make_cutset_blueprints(str(mls_eng_hf_dataset_path))

Also applies to: 132-134
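The path-resolution logic in the suggestion above can be sketched as a standalone helper (`resolve_mls_path` is a hypothetical name, not part of the PR):

```python
from pathlib import Path


def resolve_mls_path(dl_dir: str) -> str:
    """Prefer <dl_dir>/mls_english when the downloader created it;
    otherwise fall back to dl_dir itself, which may be a local HF
    dataset directory or an HF repo-id string."""
    p = Path(dl_dir)
    if p.is_dir():
        candidate = p / "mls_english"
        return str(candidate) if candidate.exists() else str(p)
    return dl_dir  # not a local dir: treat as a repo-id
```

With this in place, `make_cutset_blueprints(resolve_mls_path(args.dl_dir))` works for all three input shapes the review lists.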


@@ -120,26 +122,33 @@ def main():

     logging.basicConfig(format=formatter, level=logging.INFO)

-    if (args.manifest_dir / ".reazonspeech-fbank.done").exists():
+    if (args.manifest_dir / ".mls-eng-fbank.done").exists():
         logging.info(
-            "Previous fbank computed for ReazonSpeech found. "
-            f"Delete {args.manifest_dir / '.reazonspeech-fbank.done'} to allow recomputing fbank."
+            "Previous fbank computed for MLS English found. "
+            f"Delete {args.manifest_dir / '.mls-eng-fbank.done'} to allow recomputing fbank."
         )
         return
     else:
-        cut_sets = make_cutset_blueprints(args.manifest_dir)
+        mls_eng_hf_dataset_path = args.dl_dir  # "/root/datasets/parler-tts--mls_eng"
+        cut_sets = make_cutset_blueprints(mls_eng_hf_dataset_path)
         for part, cut_set in cut_sets:
             logging.info(f"Processing {part}")
+            cut_set = cut_set.save_audios(
+                num_jobs=num_jobs,
+                storage_path=(args.audio_dir / part).as_posix(),
+            )  # makes new cutset that loads audio from paths to actual audio files
+
             cut_set = cut_set.compute_and_store_features(
                 extractor=extractor,
                 num_jobs=num_jobs,
                 storage_path=(args.manifest_dir / f"feats_{part}").as_posix(),
                 storage_type=LilcomChunkyWriter,
             )
-            cut_set.to_file(args.manifest_dir / f"reazonspeech_cuts_{part}.jsonl.gz")
+            cut_set.to_file(args.manifest_dir / f"mls_eng_cuts_{part}.jsonl.gz")

-    logging.info("All fbank computed for ReazonSpeech.")
-    (args.manifest_dir / ".reazonspeech-fbank.done").touch()
+    logging.info("All fbank computed for MLS English.")
+    (args.manifest_dir / ".mls-eng-fbank.done").touch()


 if __name__ == "__main__":
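The `.mls-eng-fbank.done` sentinel in `main()` is a simple idempotency guard: skip the expensive job if a marker file exists, otherwise run it and create the marker. Sketched in isolation (`compute_once` is a hypothetical helper, not in the PR):

```python
import tempfile
from pathlib import Path


def compute_once(workdir: Path, marker_name: str, job) -> bool:
    """Run `job` only if the sentinel file is absent, then create it.
    Returns True if the job ran, False if it was skipped."""
    marker = workdir / marker_name
    if marker.exists():
        return False  # previous run found; delete the marker to recompute
    job()
    marker.touch()
    return True


workdir = Path(tempfile.mkdtemp())
print(compute_once(workdir, ".mls-eng-fbank.done", lambda: None))  # → True (job ran)
print(compute_once(workdir, ".mls-eng-fbank.done", lambda: None))  # → False (skipped)
```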
1 change: 1 addition & 0 deletions egs/mls_english/ASR/local/compute_fbank_musan.py
@@ -45,8 +45,8 @@ def get_parser():
 def main():
     args = get_parser()

-    for part in ["train", "dev"]:
-        path = args.manifest_dir / f"reazonspeech_cuts_{part}.jsonl.gz"
+    for part in ["dev", "test", "train"]:
+        path = args.manifest_dir / f"mls_eng_cuts_{part}.jsonl.gz"
         cuts: CutSet = load_manifest(path)

         print("\n---------------------------------\n")
114 changes: 114 additions & 0 deletions egs/mls_english/ASR/local/train_bpe_model.py
@@ -0,0 +1,114 @@
#!/usr/bin/env python3
# Copyright 2021 Xiaomi Corp. (authors: Fangjun Kuang)
# Copyright 2024 Xiaomi Corp. (authors: Xiaoyu Yang)
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# You can install sentencepiece via:
#
# pip install sentencepiece
#
# Due to an issue reported in
# https://github.com/google/sentencepiece/pull/642#issuecomment-857972030
#
# Please install a version >=0.1.96

import argparse
import shutil
from pathlib import Path

import sentencepiece as spm


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--lang-dir",
        type=str,
        help="""Input and output directory.
        The generated bpe.model is saved to this directory.
        """,
    )

    parser.add_argument(
        "--byte-fallback",
        action="store_true",
        help="""Whether to enable byte_fallback when training bpe.""",
    )

    parser.add_argument(
        "--character-coverage",
        type=float,
        default=1.0,
        help="Character coverage in vocabulary.",
    )

    parser.add_argument(
        "--transcript",
        type=str,
        help="Training transcript.",
    )

    parser.add_argument(
        "--vocab-size",
        type=int,
        help="Vocabulary size for BPE training",
    )

    return parser.parse_args()


def main():
    args = get_args()
    vocab_size = args.vocab_size
    lang_dir = Path(args.lang_dir)

    model_type = "bpe"

    model_prefix = f"{lang_dir}/{model_type}_{vocab_size}"
    train_text = args.transcript
    input_sentence_size = 100000000

    user_defined_symbols = ["<blk>", "<sos/eos>"]
    unk_id = len(user_defined_symbols)
    # Note: unk_id is fixed to 2.
    # If you change it, you should also change other
    # places that are using it.

    model_file = Path(model_prefix + ".model")
    if not model_file.is_file():
        spm.SentencePieceTrainer.train(
            input=train_text,
            vocab_size=vocab_size,
            model_type=model_type,
            model_prefix=model_prefix,
            input_sentence_size=input_sentence_size,
            character_coverage=args.character_coverage,
            user_defined_symbols=user_defined_symbols,
            byte_fallback=args.byte_fallback,
            unk_id=unk_id,
            bos_id=-1,
            eos_id=-1,
        )
    else:
        print(f"{model_file} exists - skipping")
        return

    shutil.copyfile(model_file, f"{lang_dir}/bpe.model")

Comment on lines +91 to +112
🛠️ Refactor suggestion

Always sync `bpe_{vocab_size}.model` to `bpe.model`, even when training is skipped.

Currently, if the versioned model exists, you return early without copying, leaving bpe.model stale/missing.

-    if not model_file.is_file():
+    if not model_file.is_file():
         spm.SentencePieceTrainer.train(
             input=train_text,
@@
         )
-    else:
-        print(f"{model_file} exists - skipping")
-        return
-
-    shutil.copyfile(model_file, f"{lang_dir}/bpe.model")
+    else:
+        print(f"{model_file} exists - skipping training")
+    # Ensure canonical symlink/copy is updated
+    shutil.copyfile(model_file, f"{lang_dir}/bpe.model")
📝 Committable suggestion


Suggested change:

    model_file = Path(model_prefix + ".model")
    if not model_file.is_file():
        spm.SentencePieceTrainer.train(
            input=train_text,
            vocab_size=vocab_size,
            model_type=model_type,
            model_prefix=model_prefix,
            input_sentence_size=input_sentence_size,
            character_coverage=args.character_coverage,
            user_defined_symbols=user_defined_symbols,
            byte_fallback=args.byte_fallback,
            unk_id=unk_id,
            bos_id=-1,
            eos_id=-1,
        )
    else:
        print(f"{model_file} exists - skipping training")
    # Ensure canonical symlink/copy is updated
    shutil.copyfile(model_file, f"{lang_dir}/bpe.model")
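The control-flow fix above ("train only if missing, but always refresh the copy") can be exercised on its own. Here `ensure_bpe_model` is a hypothetical stand-in for the relevant lines of `main()`:

```python
import shutil
import tempfile
from pathlib import Path


def ensure_bpe_model(model_file: Path, canonical: Path, train) -> None:
    # Train only when the versioned model is absent, but always
    # refresh the canonical bpe.model copy afterwards.
    if not model_file.is_file():
        train(model_file)
    shutil.copyfile(model_file, canonical)


d = Path(tempfile.mkdtemp())
model, canonical = d / "bpe_2000.model", d / "bpe.model"
ensure_bpe_model(model, canonical, lambda p: p.write_bytes(b"v1"))
canonical.unlink()  # simulate a stale/missing canonical copy
ensure_bpe_model(model, canonical, lambda p: p.write_bytes(b"v2"))
print(canonical.read_bytes())  # → b'v1': training was skipped, copy was still refreshed
```

Unlike the original early `return`, the second call restores `bpe.model` even though training is skipped.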

if __name__ == "__main__":
    main()