Description
Hello,
Looking at the documentation, enabling FP8 operations requires some ONNX surgery (inserting Q/DQ nodes at specific locations) to trigger the right MHA (Multi-Head Attention) fusion in conjunction with FP8 precision.
However, the performance improvement over FP16 is quite low for the base ViT model (~20% latency reduction), and it is even worse on the EfficientSAM encoder, where there is basically no gain.
Looking at the profiling and layer info from TensorRT, FP8 does seem to be in use (even though some tactic names are quite cryptic, especially the gmm_mha_v2_* ones with their opaque suffixes).
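For reference, the kind of Q/DQ insertion I mean looks roughly like the sketch below. This is a minimal, hypothetical example using onnx.helper: the tensor/node names, scale value, and file paths are placeholders (not taken from the actual models), and it assumes an onnx version with FLOAT8E4M3FN support (>= 1.14) and opset >= 19 for FP8 QuantizeLinear.

import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

model = onnx.load("model_fp16.onnx")  # placeholder input model
graph = model.graph

# Hypothetical name of the tensor feeding one of the attention MatMuls.
target = "attn0/softmax_output"
consumer = next(n for n in graph.node if target in n.input)

# Per-tensor scale plus an FP8 (E4M3) zero point; the FP8 zero point is what
# makes QuantizeLinear emit FP8 instead of the default uint8.
graph.initializer.extend([
    numpy_helper.from_array(np.array(1.0, dtype=np.float32), "qdq0_scale"),
    helper.make_tensor("qdq0_zp", TensorProto.FLOAT8E4M3FN, [], [0.0]),
])

q = helper.make_node("QuantizeLinear", [target, "qdq0_scale", "qdq0_zp"],
                     [target + "_q"], name="qdq0_q")
dq = helper.make_node("DequantizeLinear", [target + "_q", "qdq0_scale", "qdq0_zp"],
                      [target + "_dq"], name="qdq0_dq")

# Rewire the consumer to read the dequantized tensor, then splice the new
# nodes in just before it so the graph stays topologically sorted.
for i, name in enumerate(consumer.input):
    if name == target:
        consumer.input[i] = target + "_dq"
idx = next(i for i, n in enumerate(graph.node) if n is consumer)
graph.node.insert(idx, dq)
graph.node.insert(idx, q)

onnx.save(model, "model_fp8_qdq.onnx")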
Environment
- TensorRT Version: 10.13.3
- NVIDIA GPU: Thor (Jetson DevKit)
- NVIDIA Driver Version: 580.00
- CUDA Version: 13
Relevant Files
- Model link: EfficientSAM-S
- Model link: ViT-Base
Steps To Reproduce
Model Optimizer -> commit
ViT-Base FP8 onnx generation:
python3 -m modelopt.onnx.quantization --onnx_path=./vit_base_patch8_224_Opset17.onnx --quantize_mode=fp8 --output_path=./vitb_fp8.onnx
EfficientSAM-S FP8 onnx generation:
python3 -m modelopt.onnx.quantization --onnx_path=./efficientsam_s_encoder.onnx --quantize_mode=fp8 --output_path=./sam_s_fp8.onnx
ViT-Base FP8 engine generation:
trtexec --stronglyTyped --onnx=./vitb_fp8.onnx --saveEngine=./vitb_fp8.engine
EfficientSAM-S FP8 engine generation:
trtexec --stronglyTyped --onnx=./sam_s_fp8.onnx --saveEngine=./sam_s_fp8.engine
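The per-layer info and timing files listed below are the kind of output trtexec can export when the profiling flags are added to the engine build; a sketch of such an invocation (output file names illustrative, the exact command used for the attached files may differ):
trtexec --stronglyTyped --onnx=./vitb_fp8.onnx --saveEngine=./vitb_fp8.engine --profilingVerbosity=detailed --dumpProfile --separateProfileRun --exportLayerInfo=./vitb_fp8.json --exportProfile=./vitb_fp8.profile.json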
TensorRT Layer Info and Profiles
vit_base_patch8_224_Opset17_fp8.json
vit_base_patch8_224_Opset17_fp8.profile.txt
vit_base_patch8_224_Opset17_fp16.json
vit_base_patch8_224_Opset17_fp16.profile.txt
efficientsam_s_encoder_fp8.json
efficientsam_s_encoder_fp8.profile.txt
efficientsam_s_encoder_fp16.json
efficientsam_s_encoder_fp16.profile.txt