Bug Description
from tensorrt import Logger, Runtime
from torch import randn
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from torch_tensorrt import convert_method_to_trt_engine
# Create model
weights = MobileNet_V2_Weights.DEFAULT
model = mobilenet_v2(weights=weights).eval()
example_input = randn(1, 3, 224, 224)
# Create TRT engine
engine_bytes = convert_method_to_trt_engine(
    model,
    ir="dynamo",
    inputs=[example_input],
    version_compatible=True,
    hardware_compatible=True,
    require_full_compilation=True,
)
# Check hardware compat
logger = Logger(Logger.WARNING)
runtime = Runtime(logger)
engine = runtime.deserialize_cuda_engine(engine_bytes)
print("Hardware compat level:", engine.hardware_compatibility_level)
# prints: Hardware compat level: HardwareCompatibilityLevel.NONEI am running on an A100 (sm_80). I also see that the correct hardware_compatible flag is being passed to C++, from the torch-tensorrt logger:
CompilationSettings(
enabled_precisions={<dtype.f32: 7>},
workspace_size=1073741824,
min_block_size=5,
torch_executed_ops=set(),
pass_through_build_failures=False,
max_aux_streams=None,
version_compatible=True,
optimization_level=3,
use_python_runtime=False,
truncate_double=False,
use_fast_partitioner=True,
enable_experimental_decompositions=False,
device=Device(type=DeviceType.GPU, gpu_id=0),
require_full_compilation=True,
disable_tf32=False,
assume_dynamic_shape_support=False,
sparse_weights=False,
engine_capability=<EngineCapability.STANDARD: 1>,
num_avg_timing_iters=1,
dla_sram_size=1048576,
dla_local_dram_size=1073741824,
dla_global_dram_size=536870912,
dryrun=False,
hardware_compatible=True,
timing_cache_path='/tmp/torch_tensorrt_engine_cache/timing_cache.bin',
lazy_engine_init=False,
cache_built_engines=False,
reuse_cached_engines=False,
use_explicit_typing=False,
use_fp32_acc=False,
refit_identical_engine_weights=False,
strip_engine_weights=False,
immutable_weights=True,
enable_weight_streaming=False,
enable_cross_compile_for_windows=False,
tiling_optimization_level='none',
l2_limit_for_tiling=-1,
use_distributed_mode_trace=False,
offload_module_to_cpu=False
)

Any ideas?
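For completeness, here is how I surfaced the CompilationSettings dump above — a minimal sketch, assuming the dynamo frontend emits it through Python's standard logging under the torch_tensorrt namespace:

import logging

import torch_tensorrt  # noqa: F401  # importing registers its loggers

# Assumption: torch-tensorrt's dynamo frontend logs CompilationSettings at
# DEBUG level via Python's standard logging; raising the level makes the
# dump above appear during convert_method_to_trt_engine().
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("torch_tensorrt").setLevel(logging.DEBUG)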
To Reproduce
Steps to reproduce the behavior:
- Run the Python script above
Expected behavior
The engine's hardware compatibility level should be AMPERE_PLUS.
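To help isolate whether this is a Torch-TensorRT or a TensorRT issue, here is a minimal sketch that sets the flag through the raw TensorRT builder API on a trivial identity network (assumes TensorRT >= 8.6, where HardwareCompatibilityLevel was introduced, and the explicit-batch default of TensorRT 10):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch (default in TRT 10)

# Trivial identity network so the builder has something to compile
x = network.add_input("x", trt.float32, (1, 3, 224, 224))
identity = network.add_identity(x)
network.mark_output(identity.get_output(0))

config = builder.create_builder_config()
config.hardware_compatibility_level = trt.HardwareCompatibilityLevel.AMPERE_PLUS

plan = builder.build_serialized_network(network, config)
engine = trt.Runtime(logger).deserialize_cuda_engine(plan)
print("Hardware compat level:", engine.hardware_compatibility_level)
# If this prints AMPERE_PLUS, TensorRT honors the flag and the problem is
# likely in how torch-tensorrt forwards hardware_compatible to the builder.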
Environment
Build information about Torch-TensorRT can be found by turning on debug messages
- Torch-TensorRT Version (e.g. 1.0.0): 2.9.0
- PyTorch Version (e.g. 1.0): 2.9.0
- CPU Architecture: x86_64
- OS (e.g., Linux): Linux
- How you installed PyTorch (conda, pip, libtorch, source): pip
- Build command you used (if compiling from source):
- Are you using local sources or building from archives: No
- Python version: 3.12
- CUDA version: 12.8
- GPU models and configuration: Nvidia A100
- Any other relevant information: None