@khanld khanld commented Apr 17, 2025

This PR is an implementation of ChunkFormer for WeNet encoder structure.

ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription

Features:

  • Supports transcription of long-form audio, up to 16 hours, currently among the longest in open-source models.
  • Efficient batch decoding without padding: batch transcription of a 1-hour audio and a 1-second audio takes roughly the computational resources (speed, memory) of 1 hour + 1 s of audio, instead of 2 hours due to padding.
  • Extremely fast and memory-efficient: a batch of 2-hour audios is processed in ~5 seconds on an Nvidia T4 (15 GB) - [ref]
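The padding-free batching above can be sketched as follows: instead of padding every utterance to the longest one, each utterance is split into fixed-size chunks and the chunks are stacked, so compute scales with the total audio duration rather than `batch_size * max_length`. This is a minimal illustration, not the PR's actual API; the function name and chunk length are made up.

```python
import torch

def chunk_and_stack(feats_list, chunk_frames=512):
    """Hypothetical sketch of padding-free batching: split each
    variable-length feature tensor into fixed-size chunks and stack
    them, recording how many chunks each utterance produced."""
    chunks, counts = [], []
    for feats in feats_list:  # each feats: (num_frames, feat_dim)
        n = feats.shape[0]
        pad = (-n) % chunk_frames  # pad only up to the next chunk boundary
        feats = torch.nn.functional.pad(feats, (0, 0, 0, pad))
        pieces = feats.split(chunk_frames, dim=0)
        chunks.extend(pieces)
        counts.append(len(pieces))
    # (total_chunks, chunk_frames, feat_dim); total_chunks tracks the
    # sum of audio lengths, not batch_size * max_length
    return torch.stack(chunks), counts
```

With this layout, a 1-hour and a 1-second utterance contribute chunks proportional to their own lengths, which is the effect the feature list describes.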

Modules:

  • Add ChunkAttentionWithRelativeRightContext
  • Add ChunkConvolutionModule - https://arxiv.org/abs/2304.09325
  • Add DepthwiseConvSubsampling
  • Add Masked Batch Decoding
  • Add RelPositionalEncodingWithRightContext
  • Add chunk-context training
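A minimal sketch of the chunk-context attention masking these modules rely on: each frame attends to its own chunk plus a bounded left and right context. Names and signature are illustrative, not the PR's actual API.

```python
import torch

def chunk_context_mask(num_frames, chunk, left, right):
    """Hypothetical sketch: frame i in chunk k may attend to its own
    chunk plus `left` frames before the chunk and `right` frames after
    it. Returns a boolean (num_frames, num_frames) mask, True = attend."""
    idx = torch.arange(num_frames)
    chunk_start = (idx // chunk) * chunk          # start frame of each chunk
    start = (chunk_start - left).clamp(min=0)     # leftmost visible frame
    end = chunk_start + chunk + right             # one past rightmost visible
    j = idx.unsqueeze(0)                          # key (attended-to) index
    return (j >= start.unsqueeze(1)) & (j < end.unsqueeze(1))
```

For example, with chunk 4 and left/right context 2, frame 0 sees frames 0-5 and frame 4 sees frames 2-7; setting chunk to the full length recovers full-context attention.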

Todos:

  • Streaming long-form transcription (e.g., decode 10 hours of speech on an 8 GB GPU)

Evaluation Results:

  • Model info:
    • Encoder Params: 32,356,096
    • Downsample rate: dw_striding 8x
    • encoder_dim 256, head 4, linear_units 2048
    • num_blocks 12, cnn_module_kernel 15
  • Feature info: using fbank feature, cmvn, dither, online speed perturb
  • Training info:
    • train_u2++_chunkformer_small.yaml, kernel size 15
    • dynamic batch size 120,000, 2 GPU, acc_grad 4, 200 epochs, dither 1.0
    • adamw, lr 1e-3, warmuplr, warmup_steps: 25000
    • specaug and speed perturb
  • Decoding info: ctc_weight 0.3, reverse weight 0.5, average_num 100, beam size 10
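For context, the `ctc_weight` and `reverse_weight` above are used in attention rescoring: each CTC n-best hypothesis is re-scored with the left-to-right and right-to-left decoder scores. The sketch below assumes the u2++-style convention of interpolating the two decoder directions by `reverse_weight` and adding the CTC score scaled by `ctc_weight`; the hypothesis fields are illustrative, not WeNet's actual data structures.

```python
def rescore(hyps, ctc_weight=0.3, reverse_weight=0.5):
    """Hypothetical sketch of attention rescoring over a CTC n-best
    list. Each hyp is a dict with 'tokens', 'ctc' (CTC score), 'attn'
    (left-to-right decoder score) and 'r_attn' (right-to-left score)."""
    best, best_score = None, float("-inf")
    for hyp in hyps:
        # interpolate the two decoder directions, then add the CTC score
        score = hyp["attn"] * (1.0 - reverse_weight) + hyp["r_attn"] * reverse_weight
        score += hyp["ctc"] * ctc_weight
        if score > best_score:
            best, best_score = hyp, score
    return best
```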

Full-context training -> chunk-context inference (chunk size 64, left context size = right context size = 128):

⚠️ The attention decoder does not support chunk-context inference here due to a cross-attention mismatch with full-context training; chunk-context training is required to resolve it.

| Decoding Mode | Dev Clean | Dev Other | Test Clean | Test Other |
| --- | --- | --- | --- | --- |
| CTC Greedy Search | 3.05 | 8.84 | 3.27 | 8.54 |
| CTC Prefix Beam Search | 3.04 | 8.83 | 3.26 | 8.54 |
| Attention Decoder | 4.58 | 9.62 | 5.07 | 9.22 |
| Attention Rescoring | 2.83 | 8.39 | 2.97 | 8.02 |

Full-context training -> full-context inference:

| Decoding Mode | Dev Clean | Dev Other | Test Clean | Test Other |
| --- | --- | --- | --- | --- |
| CTC Greedy Search | 3.08 | 8.82 | 3.24 | 8.55 |
| CTC Prefix Beam Search | 3.06 | 8.80 | 3.23 | 8.53 |
| Attention Decoder | 2.92 | 8.28 | 3.03 | 8.05 |
| Attention Rescoring | 2.80 | 8.37 | 2.94 | 8.03 |

khanld commented Apr 17, 2025

@robin1001 @xingchensong @Mddct Hi there, all checks are now passing, review appreciated.

khanld commented Apr 29, 2025

I added an implementation that enables joint training with both full and limited contexts. The configuration remains unchanged, and the attention decoder now works as expected for chunk-context inference.

Encoder:

  • dynamic_conv: true
  • dynamic_chunk_sizes: [-1, -1, 64, 128, 256] # -1 means full context
  • dynamic_left_context_sizes: [64, 128, 256]
  • dynamic_right_context_sizes: [64, 128, 256]
  • chunk size, left context size, and right context size are represented as (c, l, r)
  • Results on test-clean / test-other (* marks the best result in each row):

| Decoding Mode | (-1, -1, -1) | (64, 128, 128) | (128, 128, 128) | (128, 256, 256) | (256, 64, 64) | (256, 128, 128) |
| --- | --- | --- | --- | --- | --- | --- |
| ctc_greedy_search | 3.19 / 8.51* | 3.22 / 8.54 | 3.20 / 8.53 | 3.20 / 8.53 | 3.18* / 8.52 | 3.18* / 8.51* |
| ctc_prefix_beam_search | 3.17 / 8.50 | 3.20 / 8.53 | 3.18 / 8.51 | 3.19 / 8.51 | 3.16* / 8.50 | 3.16* / 8.49* |
| attention | 3.24* / 8.03* | 3.38 / 8.16 | 3.24* / 8.07 | 3.28 / 8.05 | 3.29 / 8.13 | 3.26 / 8.08 |
| attention_rescoring | 2.96 / 7.88* | 2.82* / 7.89 | 2.96 / 7.90 | 2.97 / 7.90 | 2.95 / 7.89 | 2.95 / 7.88* |
| Average | 3.14* / 8.23* | 3.16 / 8.28 | 3.15 / 8.25 | 3.16 / 8.25 | 3.15 / 8.26 | 3.14* / 8.24 |
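The joint full/limited-context training described by the config above boils down to sampling one context configuration per batch from the `dynamic_*` lists. A hedged sketch of that sampling step, with a hypothetical helper name (the actual PR code may structure this differently):

```python
import random

def sample_context(dynamic_chunk_sizes=(-1, -1, 64, 128, 256),
                   dynamic_left_context_sizes=(64, 128, 256),
                   dynamic_right_context_sizes=(64, 128, 256),
                   rng=random):
    """Hypothetical sketch of per-batch context sampling for joint
    full/limited-context training. A chunk size of -1 means full
    context, in which case left/right context sizes do not apply."""
    chunk = rng.choice(dynamic_chunk_sizes)
    if chunk == -1:
        return -1, -1, -1  # full context: no chunking at all
    return (chunk,
            rng.choice(dynamic_left_context_sizes),
            rng.choice(dynamic_right_context_sizes))
```

Listing -1 twice in `dynamic_chunk_sizes` weights full-context batches more heavily, which matches the comment in the config above.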

@khanld khanld force-pushed the khanhle-chunkformer branch from cd5b97d to 6948c56 on May 6, 2025 09:36
@Mddct Mddct self-requested a review May 6, 2025 14:17