Description
Hello,
Looking at the documentation, enabling FP8 operations requires some ONNX surgery (inserting Q/DQ nodes at specific locations) to trigger the right MHA (Multi-Head Attention) fusion in conjunction with FP8 precision.
However, the performance improvement over FP16 is quite low for the base ViT model (~20% latency reduction), and it is even worse on the EfficientSAM encoder, where there is basically no gain.
Looking at the profiling and layer info from TensorRT, FP8 does seem to be in use (even though some tactic names are quite cryptic, especially the gmm_mha_v2_* ones with their opaque suffixes).
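For reference, the kind of Q/DQ insertion I mean looks roughly like the sketch below. This is a minimal, hypothetical example using onnx.helper: the tensor/node names, scale value, and file paths are placeholders (not taken from the actual models), and it assumes an onnx version with FLOAT8E4M3FN support (>= 1.14) and opset >= 19 for FP8 QuantizeLinear.

import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

model = onnx.load("model_fp16.onnx")  # placeholder input model
graph = model.graph

# Hypothetical name of the tensor feeding one of the attention MatMuls.
target = "attn0/softmax_output"
consumer = next(n for n in graph.node if target in n.input)

# Per-tensor scale plus an FP8 (E4M3) zero point; the FP8 zero point is what
# makes QuantizeLinear emit FP8 instead of the default uint8.
graph.initializer.extend([
    numpy_helper.from_array(np.array(1.0, dtype=np.float32), "qdq0_scale"),
    helper.make_tensor("qdq0_zp", TensorProto.FLOAT8E4M3FN, [], [0.0]),
])

q = helper.make_node("QuantizeLinear", [target, "qdq0_scale", "qdq0_zp"],
                     [target + "_q"], name="qdq0_q")
dq = helper.make_node("DequantizeLinear", [target + "_q", "qdq0_scale", "qdq0_zp"],
                      [target + "_dq"], name="qdq0_dq")

# Rewire the consumer to read the dequantized tensor, then splice the new
# nodes in just before it so the graph stays topologically sorted.
for i, name in enumerate(consumer.input):
    if name == target:
        consumer.input[i] = target + "_dq"
idx = next(i for i, n in enumerate(graph.node) if n is consumer)
graph.node.insert(idx, dq)
graph.node.insert(idx, q)

onnx.save(model, "model_fp8_qdq.onnx")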
Environment
- TensorRT Version: 10.13.3
- NVIDIA GPU: Thor (Jetson DevKit)
- NVIDIA Driver Version: 580.00
- CUDA Version: 13
Relevant Files
- Model link: EfficientSAM-S
- Model link: ViT-Base
Steps To Reproduce
Model Optimizer -> commit
ViT-Base FP8 onnx generation:
python3 -m modelopt.onnx.quantization --onnx_path=./vit_base_patch8_224_Opset17.onnx --quantize_mode=fp8 --output_path=./vitb_fp8.onnx
EfficientSAM-S FP8 onnx generation:
python3 -m modelopt.onnx.quantization --onnx_path=./efficientsam_s_encoder.onnx --quantize_mode=fp8 --output_path=./sam_s_fp8.onnx
ViT-Base FP8 engine generation:
trtexec --stronglyTyped --onnx=./vitb_fp8.onnx --saveEngine=./vitb_fp8.engine
EfficientSAM-S FP8 engine generation:
trtexec --stronglyTyped --onnx=./sam_s_fp8.onnx --saveEngine=./sam_s_fp8.engine
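The per-layer info and timing files listed below are the kind of output trtexec can export when the profiling flags are added to the engine build; a sketch of such an invocation (output file names illustrative, the exact command used for the attached files may differ):
trtexec --stronglyTyped --onnx=./vitb_fp8.onnx --saveEngine=./vitb_fp8.engine --profilingVerbosity=detailed --dumpProfile --separateProfileRun --exportLayerInfo=./vitb_fp8.json --exportProfile=./vitb_fp8.profile.json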
TensorRT Layer Info and Profiles
vit_base_patch8_224_Opset17_fp8.json
vit_base_patch8_224_Opset17_fp8.profile.txt
vit_base_patch8_224_Opset17_fp16.json
vit_base_patch8_224_Opset17_fp16.profile.txt
efficientsam_s_encoder_fp8.json
efficientsam_s_encoder_fp8.profile.txt
efficientsam_s_encoder_fp16.json
efficientsam_s_encoder_fp16.profile.txt