@khanld khanld commented Apr 17, 2025

This PR is an implementation of ChunkFormer for WeNet encoder structure.

ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription

Features:

  • Supports transcription of long-form audio, up to 16 hours, currently among the longest in open-source models.
  • Efficient batch decoding without padding: batch transcription of a 1-hour audio and a 1-second audio takes roughly the computational resources (speed, memory) of 1 hour + 1 s of audio, instead of 2 hours due to padding.
  • Extremely fast and memory-efficient: a batch of 2-hour audios is processed in ~5 seconds on an Nvidia T4 (15 GB) - [ref]
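The padding-free batching above can be sketched as follows: instead of padding every utterance to the longest one, each utterance is split into fixed-size chunks and the chunks are stacked, so compute scales with the total audio duration rather than `batch_size * max_length`. This is a minimal illustration, not the PR's actual API; the function name and chunk length are made up.

```python
import torch

def chunk_and_stack(feats_list, chunk_frames=512):
    """Hypothetical sketch of padding-free batching: split each
    variable-length feature tensor into fixed-size chunks and stack
    them, recording how many chunks each utterance produced."""
    chunks, counts = [], []
    for feats in feats_list:  # each feats: (num_frames, feat_dim)
        n = feats.shape[0]
        pad = (-n) % chunk_frames  # pad only up to the next chunk boundary
        feats = torch.nn.functional.pad(feats, (0, 0, 0, pad))
        pieces = feats.split(chunk_frames, dim=0)
        chunks.extend(pieces)
        counts.append(len(pieces))
    # (total_chunks, chunk_frames, feat_dim); total_chunks tracks the
    # sum of audio lengths, not batch_size * max_length
    return torch.stack(chunks), counts
```

With this layout, a 1-hour and a 1-second utterance contribute chunks proportional to their own lengths, which is the effect the feature list describes.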

Modules:

  • Add ChunkAttentionWithRelativeRightContext
  • Add ChunkConvolutionModule - https://arxiv.org/abs/2304.09325
  • Add DepthwiseConvSubsampling
  • Add Masked Batch Decoding
  • Add RelPositionalEncodingWithRightContext
  • Add chunk-context training
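A minimal sketch of the chunk-context attention masking these modules rely on: each frame attends to its own chunk plus a bounded left and right context. Names and signature are illustrative, not the PR's actual API.

```python
import torch

def chunk_context_mask(num_frames, chunk, left, right):
    """Hypothetical sketch: frame i in chunk k may attend to its own
    chunk plus `left` frames before the chunk and `right` frames after
    it. Returns a boolean (num_frames, num_frames) mask, True = attend."""
    idx = torch.arange(num_frames)
    chunk_start = (idx // chunk) * chunk          # start frame of each chunk
    start = (chunk_start - left).clamp(min=0)     # leftmost visible frame
    end = chunk_start + chunk + right             # one past rightmost visible
    j = idx.unsqueeze(0)                          # key (attended-to) index
    return (j >= start.unsqueeze(1)) & (j < end.unsqueeze(1))
```

For example, with chunk 4 and left/right context 2, frame 0 sees frames 0-5 and frame 4 sees frames 2-7; setting chunk to the full length recovers full-context attention.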

Todos:

  • Streaming long-form transcription (e.g., decode 10 hours of speech on an 8 GB GPU)

Evaluation Results:

  • Model info:
    • Encoder Params: 32,356,096
    • Downsample rate: dw_striding 8x
    • encoder_dim 256, head 4, linear_units 2048
    • num_blocks 12, cnn_module_kernel 15
  • Feature info: using fbank feature, cmvn, dither, online speed perturb
  • Training info:
    • train_u2++_chunkformer_small.yaml, kernel size 15
    • dynamic batch size 120,000, 2 GPU, acc_grad 4, 200 epochs, dither 1.0
    • adamw, lr 1e-3, warmuplr, warmup_steps: 25000
    • specaug and speed perturb
  • Decoding info: ctc_weight 0.3, reverse weight 0.5, average_num 100, beam size 10
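For context, the `ctc_weight` and `reverse_weight` above are used in attention rescoring: each CTC n-best hypothesis is re-scored with the left-to-right and right-to-left decoder scores. The sketch below assumes the u2++-style convention of interpolating the two decoder directions by `reverse_weight` and adding the CTC score scaled by `ctc_weight`; the hypothesis fields are illustrative, not WeNet's actual data structures.

```python
def rescore(hyps, ctc_weight=0.3, reverse_weight=0.5):
    """Hypothetical sketch of attention rescoring over a CTC n-best
    list. Each hyp is a dict with 'tokens', 'ctc' (CTC score), 'attn'
    (left-to-right decoder score) and 'r_attn' (right-to-left score)."""
    best, best_score = None, float("-inf")
    for hyp in hyps:
        # interpolate the two decoder directions, then add the CTC score
        score = hyp["attn"] * (1.0 - reverse_weight) + hyp["r_attn"] * reverse_weight
        score += hyp["ctc"] * ctc_weight
        if score > best_score:
            best, best_score = hyp, score
    return best
```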

Full-context training -> chunk-context inference (chunk size 64, left context size = right context size = 128):

⚠️ The attention decoder does not support chunk-context inference here due to a cross-attention mismatch with full-context training; chunk-context training is required to resolve it.

| Decoding Mode | Dev Clean | Dev Other | Test Clean | Test Other |
| --- | --- | --- | --- | --- |
| CTC Greedy Search | 3.05 | 8.84 | 3.27 | 8.54 |
| CTC Prefix Beam Search | 3.04 | 8.83 | 3.26 | 8.54 |
| Attention Decoder | 4.58 | 9.62 | 5.07 | 9.22 |
| Attention Rescoring | 2.83 | 8.39 | 2.97 | 8.02 |

Full-context training -> full-context inference:

| Decoding Mode | Dev Clean | Dev Other | Test Clean | Test Other |
| --- | --- | --- | --- | --- |
| CTC Greedy Search | 3.08 | 8.82 | 3.24 | 8.55 |
| CTC Prefix Beam Search | 3.06 | 8.80 | 3.23 | 8.53 |
| Attention Decoder | 2.92 | 8.28 | 3.03 | 8.05 |
| Attention Rescoring | 2.80 | 8.37 | 2.94 | 8.03 |

khanld commented Apr 17, 2025

@robin1001 @xingchensong @Mddct Hi there, all checks are now passing, review appreciated.

khanld commented Apr 29, 2025

I added an implementation that enables joint training with both full and limited contexts. The configuration remains unchanged, and the attention decoder now works as expected for chunk-context inference.

Encoder:

  • dynamic_conv: true
  • dynamic_chunk_sizes: [-1, -1, 64, 128, 256] # -1 means full context
  • dynamic_left_context_sizes: [64, 128, 256]
  • dynamic_right_context_sizes: [64, 128, 256]
  • chunk size, left context size, and right context size are represented as (c, l, r)
  • Results on test-clean / test-other (* marks the best result in each row):

| Decoding Mode | (-1, -1, -1) | (64, 128, 128) | (128, 128, 128) | (128, 256, 256) | (256, 64, 64) | (256, 128, 128) |
| --- | --- | --- | --- | --- | --- | --- |
| ctc_greedy_search | 3.19 / 8.51* | 3.22 / 8.54 | 3.20 / 8.53 | 3.20 / 8.53 | 3.18* / 8.52 | 3.18* / 8.51* |
| ctc_prefix_beam_search | 3.17 / 8.50 | 3.20 / 8.53 | 3.18 / 8.51 | 3.19 / 8.51 | 3.16* / 8.50 | 3.16* / 8.49* |
| attention | 3.24* / 8.03* | 3.38 / 8.16 | 3.24* / 8.07 | 3.28 / 8.05 | 3.29 / 8.13 | 3.26 / 8.08 |
| attention_rescoring | 2.96 / 7.88* | 2.82* / 7.89 | 2.96 / 7.90 | 2.97 / 7.90 | 2.95 / 7.89 | 2.95 / 7.88* |
| Average | 3.14* / 8.23* | 3.16 / 8.28 | 3.15 / 8.25 | 3.16 / 8.25 | 3.15 / 8.26 | 3.14* / 8.24 |
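The joint full/limited-context training described by the config above boils down to sampling one context configuration per batch from the `dynamic_*` lists. A hedged sketch of that sampling step, with a hypothetical helper name (the actual PR code may structure this differently):

```python
import random

def sample_context(dynamic_chunk_sizes=(-1, -1, 64, 128, 256),
                   dynamic_left_context_sizes=(64, 128, 256),
                   dynamic_right_context_sizes=(64, 128, 256),
                   rng=random):
    """Hypothetical sketch of per-batch context sampling for joint
    full/limited-context training. A chunk size of -1 means full
    context, in which case left/right context sizes do not apply."""
    chunk = rng.choice(dynamic_chunk_sizes)
    if chunk == -1:
        return -1, -1, -1  # full context: no chunking at all
    return (chunk,
            rng.choice(dynamic_left_context_sizes),
            rng.choice(dynamic_right_context_sizes))
```

Listing -1 twice in `dynamic_chunk_sizes` weights full-context batches more heavily, which matches the comment in the config above.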

@khanld khanld force-pushed the khanhle-chunkformer branch from cd5b97d to 6948c56 on May 6, 2025 09:36
@Mddct Mddct self-requested a review May 6, 2025 14:17