Implemented fast processing of extract_embedding #356
Conversation
flake8 --max-line-length 150 --ignore B006,B008,B905,C408,E402,E741,W503,W504 --exclude ./third_party/,./runtime/python/grpc/cosyvoice_pb2*py
./cosyvoice/utils/scheduler.py:92:75: E231 missing whitespace after ','
./cosyvoice/utils/train_utils.py:133:9: F811 redefinition of unused 'scheduler' from line 127

I have a newer version of the linter, but I don't think that's the problem.
Yes, using multiple threads can increase throughput, but I don't think using a queue is a good idea. You can use ThreadPoolExecutor.map, which makes the code clearer. For example: …
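The suggestion above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `extract_embedding` here is a placeholder worker, and `utt2wav` is toy stand-in data.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker: compute an embedding for one (utt, wav) pair.
# In the real script this would run fbank extraction + ONNX inference.
def extract_embedding(item):
    utt, wav = item
    return utt, len(wav)  # placeholder result

utt2wav = {"utt1": "a.wav", "utt2": "bb.wav"}  # toy stand-in data

# ThreadPoolExecutor.map keeps the code flat: no explicit queue or
# worker-management code, and results come back in submission order.
with ThreadPoolExecutor(max_workers=8) as pool:
    utt2embedding = dict(pool.map(extract_embedding, utt2wav.items()))
```

Because `map` preserves input order and the results are keyed by utterance id, collecting them into a dict is straightforward.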
The concurrent.futures implementation looks good; I had thought ONNX couldn't be used this way.
tools/extract_embedding.py
from tqdm import tqdm

def extract_embedding(input_list):
The input parameters should just be changed to utt, wav directly.
done
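Applying the reviewer's suggestion would mean the worker takes the utterance id and wav path as separate arguments instead of a packed list. A hedged sketch of that signature change (the body is a placeholder, not the PR's real implementation):

```python
# Sketch of the suggested signature: utt and wav are passed directly,
# rather than unpacking them from a single input_list argument.
def extract_embedding(utt, wav):
    # placeholder for the real fbank + ONNX embedding extraction
    return utt, wav.upper()

utt, emb = extract_embedding("utt1", "a.wav")
```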
tools/extract_embedding.py
)(audio)
feat = kaldi.fbank(audio, num_mel_bins=80, dither=0, sample_frequency=16000)
feat = feat - feat.mean(dim=0, keepdim=True)
embedding = (
There's no need for so many line breaks. See our workflow/lint.py: the maximum allowed line length is 150. Too many line breaks also hurt readability.
done
Please revise according to the review comments. I've changed this to merge into dev/lyuxiang.lx; after I've tested it on my side, it will be merged into main together with several other new changes. Thanks @MiXaiLL76
Because we run this model on the CPU, we can process the dataset in parallel.
In my experiment, this cut dataset preprocessing from 2 hours to 30 minutes.
Moreover, we work with dicts, so the order in which items are processed doesn't matter to us.
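The order-independence point can be illustrated with `concurrent.futures.as_completed`: results land in a dict as each worker finishes, and the dict's contents are the same regardless of completion order. The names below are illustrative, not taken from the PR.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical per-utterance worker; stands in for embedding extraction.
def embed(utt):
    return utt, len(utt)

utts = ["spk1_utt1", "spk2_utt1", "spk1_utt2"]

results = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(embed, u) for u in utts]
    # as_completed yields futures in completion order, which may differ
    # from submission order; since results are keyed by utt, the final
    # dict is identical either way.
    for fut in as_completed(futures):
        utt, emb = fut.result()
        results[utt] = emb
```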