TensorRT-LLM 0.12.0 Release
Hi,
We are very pleased to announce the 0.12.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Supported LoRA for MoE models.
- The `ModelWeightsLoader` is enabled for LLaMA family models (experimental), see `docs/source/architecture/model-weights-loader.md`.
- Supported FP8 FMHA for NVIDIA Ada Lovelace Architecture.
- Supported GPT-J, Phi, Phi-3, Qwen, GPT, GLM, Baichuan, Falcon and Gemma models for the `LLM` class (see the usage sketch after this list).
- Supported FP8 OOTB MoE.
- Supported Starcoder2 SmoothQuant. (#1886)
- Supported ReDrafter Speculative Decoding, see the "ReDrafter" section in `docs/source/speculative_decoding.md`.
- Supported padding removal for BERT, thanks to the contribution from @Altair-Alpha in #1834.
- Added in-flight batching support for the GLM 10B model.
- Supported the `gelu_pytorch_tanh` activation function, thanks to the contribution from @ttim in #1897.
- Added a `chunk_length` parameter to Whisper, thanks to the contribution from @MahmoudAshraf97 in #1909.
- Added a `concurrency` argument for `gptManagerBenchmark`.
- The Executor API supports requests with different beam widths, see `docs/source/executor.md#sending-requests-with-different-beam-widths`.
- Added the flag `--fast_build` to the `trtllm-build` command (experimental).
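For context, below is a minimal sketch of generating text with the `LLM` class. It assumes the high-level Python API shown in the LLM API examples; the model name and sampling settings are illustrative only.

```python
from tensorrt_llm import LLM, SamplingParams

# Build (or load a cached) engine from a Hugging Face checkpoint.
# The model name is illustrative; any of the supported architectures
# listed above should work the same way.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Sampling settings are optional; library defaults apply when omitted.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(["Hello, my name is"], sampling_params):
    print(output.outputs[0].text)
```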
API Changes
- [BREAKING CHANGE] `max_output_len` is removed from the `trtllm-build` command; if you want to limit sequence length at the engine build stage, specify `max_seq_len` instead (see the Python sketch after this list).
- [BREAKING CHANGE] The `use_custom_all_reduce` argument is removed from `trtllm-build`.
- [BREAKING CHANGE] The `multi_block_mode` argument is moved from the build stage (`trtllm-build` and the builder API) to the runtime.
- [BREAKING CHANGE] The build-time argument `context_fmha_fp32_acc` is moved to the runtime for decoder models.
- [BREAKING CHANGE] The arguments `tp_size`, `pp_size` and `cp_size` are removed from the `trtllm-build` command.
- The C++ batch manager API is deprecated in favor of the C++ `executor` API, and it will be removed in a future release of TensorRT-LLM.
- Added a version API to the C++ library; a `cpp/include/tensorrt_llm/executor/version.h` file will be generated.
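To illustrate the `max_output_len` removal, here is a hedged sketch of capping sequence length at build time through the Python API. It assumes `BuildConfig` is exported at the package root and that `LLM` accepts a `build_config` argument; check both against your installed version.

```python
from tensorrt_llm import LLM, BuildConfig

# max_seq_len bounds the total sequence length (prompt + generated tokens)
# at engine build time, replacing the removed max_output_len argument.
build_config = BuildConfig(max_seq_len=4096)

# The model name is illustrative.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          build_config=build_config)
```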
Model Updates
- Supported LLaMA 3.1 model.
- Supported Mamba-2 model.
- Supported EXAONE model, see `examples/exaone/README.md`.
- Supported Qwen 2 model.
- Supported GLM4 models, see `examples/chatglm/README.md`.
- Added LLaVa-1.6 (LLaVa-NeXT) multimodal support, see the "LLaVA, LLaVa-NeXT and VILA" section in `examples/multimodal/README.md`.
Fixed Issues
- Fixed wrong pad token for the CodeQwen models. (#1953)
- Fixed a typo in `cluster_infos` defined in `tensorrt_llm/auto_parallel/cluster_info.py`, thanks to the contribution from @saeyoonoh in #1987.
- Removed duplicated flags in the command at `docs/source/reference/troubleshooting.md`, thanks to the contribution from @hattizai in #1937.
- Fixed a segmentation fault in the TopP sampling layer, thanks to the contribution from @akhoroshev in #2039. (#2040)
- Fixed the failure when converting the checkpoint for the Mistral Nemo model. (#1985)
- Propagated `exclude_modules` to weight-only quantization, thanks to the contribution from @fjosw in #2056.
- Fixed wrong links in the README, thanks to the contribution from @Tayef-Shah in #2028.
- Fixed some typos in the documentation, thanks to the contribution from @lfz941 in #1939.
- Fixed the engine build failure when the deduced `max_seq_len` is not an integer. (#2018)
Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- Base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.
- The dependent TensorRT version is updated to 10.3.0.
- The dependent CUDA version is updated to 12.5.1.
- The dependent PyTorch version is updated to 2.4.0.
- The dependent ModelOpt version is updated to v0.15.0.
Known Issues
- On Windows, installation of TensorRT-LLM may succeed, but you might hit `OSError: exception: access violation reading 0x0000000000000000` when importing the library in Python. See the Installing on Windows documentation for workarounds.
Currently, there are two key branches in the project:
- The `rel` branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
- The `main` branch is the dev branch. It is more experimental.
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequency depends on your feedback.
Thanks,
The TensorRT-LLM Engineering Team