
fix(doc_vlm): remove ROCm BF16 _keep_in_fp32_modules workaround in PaddleOCR-VL #5077

Open

fchange wants to merge 1 commit into PaddlePaddle:develop from fchange:fix/remove-rocm-bf16-workaround

Conversation

@fchange fchange commented Apr 4, 2026

Remove ROCm BF16 _keep_in_fp32_modules workaround in PaddleOCR-VL

Summary

Removes the _keep_in_fp32_modules = ["visual", "mlp_AR"] workaround from PaddleOCRVLForConditionalGeneration, enabling BF16 precision inference for the vision encoder on AMD GPUs (ROCm).

Related Issue: #5076
Depends on: PaddlePaddle/Paddle#78587

Problem

The _keep_in_fp32_modules workaround forces the SigLIP vision encoder to run in FP32 on ROCm, even when the model is loaded with BF16 dtype. This was necessary because Paddle's HIP backend did not register BF16 convolution kernels.
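The effect of the workaround can be illustrated with a small sketch. This is not the actual Paddle loader code; `select_dtype` is a hypothetical helper showing how a `_keep_in_fp32_modules`-style prefix list overrides the dtype the user requested at load time:

```python
# Illustrative sketch only (not Paddle internals): how a
# _keep_in_fp32_modules-style list silently overrides the requested dtype.
def select_dtype(param_name, requested, keep_in_fp32):
    """Return the dtype a parameter is actually loaded in."""
    if keep_in_fp32:
        for prefix in keep_in_fp32:
            # Any parameter under a listed module prefix is pinned to FP32.
            if param_name == prefix or param_name.startswith(prefix + "."):
                return "float32"
    return requested

# With the workaround: visual weights stay FP32 despite dtype=bfloat16.
print(select_dtype("visual.patch_embed.weight", "bfloat16", ["visual", "mlp_AR"]))
# Output: float32

# After this PR (_keep_in_fp32_modules = None): BF16 is honored.
print(select_dtype("visual.patch_embed.weight", "bfloat16", None))
# Output: bfloat16
```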

Impact:

  • VRAM usage doubled (FP32 = 2x BF16)
  • Inference throughput reduced (no BF16 Tensor Core utilization)
  • Inconsistent model behavior (the model reports BF16, but the visual tower actually runs in FP32)

Change

```diff
 class PaddleOCRVLForConditionalGeneration(Ernie4_5PretrainedModel):
     _tied_weights_keys = ["lm_head.weight"]
     config_class = PaddleOCRVLConfig
     _no_split_modules = ["Ernie4_5DecoderLayer", "SiglipEncoderLayer"]
-    _keep_in_fp32_modules = ["visual", "mlp_AR"]
+    _keep_in_fp32_modules = None
     base_model_prefix = ""
```

Dependency

This PR requires the Paddle framework fix (PaddlePaddle/Paddle#78587) to be merged first.

Verification

Environment

  • AMD MI300X (gfx942), ROCm 7.0.51
  • PaddlePaddle 3.4.0.dev (with PR #78587 applied and rebuilt)
  • PaddleOCR-VL-1.5-0.9B (dtype=bfloat16)

Test: Native Backend Inference

```bash
cd /opt/PaddleX
paddlex --pipeline PaddleOCR-VL-native.yaml --input /tmp/test_ocr.png
```

Result: Successfully processed boarding pass image with correct OCR output:

  • "登机牌 BOARDING PASS"
  • Flight: MU 2379, Date: 03DEC
  • Destination: 福州 (FUZHOU), From: TAIYUAN
  • Passenger: 张祺伟 / ZHANGQIWEI, Gate: G11
  • 31 layout elements detected, all text correctly recognized

Test: Vision Encoder BF16 Verification

After this change, the vision encoder runs in BF16 (no longer forced to FP32):

```python
from paddlex.inference.models.doc_vlm.modeling.paddleocr_vl._paddleocr_vl import (
    PaddleOCRVLForConditionalGeneration,
)
print(PaddleOCRVLForConditionalGeneration._keep_in_fp32_modules)
# Output: None (was ["visual", "mlp_AR"])
```

Before vs After

| Metric | Before (FP32 workaround) | After (BF16) |
| --- | --- | --- |
| Vision encoder precision | FP32 | BF16 |
| VRAM for visual weights | 2x baseline | 1x baseline |
| Compute precision | FP32 | BF16 (native Tensor Core) |
| OCR accuracy | Correct | Correct |
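The "2x baseline" VRAM row follows directly from storage sizes: FP32 uses 4 bytes per parameter, BF16 uses 2. A back-of-envelope check (the 400M parameter count below is a placeholder, not a measured figure for the SigLIP tower):

```python
# Back-of-envelope check of the VRAM table row.
# `params` is an assumed placeholder count, not a measured value.
params = 400_000_000
fp32_bytes = params * 4  # FP32: 4 bytes per parameter
bf16_bytes = params * 2  # BF16: 2 bytes per parameter
print(fp32_bytes / bf16_bytes)
# Output: 2.0
```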

Notes

  • The static graph fuse pass disabling for ROCm (conv2d_add_act_fuse_pass, conv2d_add_fuse_pass in static_infer.py) is unchanged — these depend on cuDNN and are correctly disabled on HIP.
  • The is_bfloat16_available() function in misc.py does not have a ROCm override in the upstream develop branch, so no changes needed there.

Remove _keep_in_fp32_modules = ["visual", "mlp_AR"] from
PaddleOCRVLForConditionalGeneration. This workaround was added to
avoid MIOpen BF16 convolution bugs on ROCm 7.0 by forcing the visual
encoder to FP32, which doubled VRAM usage and reduced throughput.

The Paddle framework now registers BF16 conv kernels for HIP backend,
making this workaround unnecessary.

See: PaddlePaddle/Paddle#78587

Signed-off-by: fchange

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

paddle-bot bot commented Apr 4, 2026

Thanks for your contribution!


CLAassistant commented Apr 4, 2026

CLA assistant check
All committers have signed the CLA.

@paddle-bot added the `contributor` (External developers) label on Apr 4, 2026