TensorRT-LLM 0.15.0 Release
Hi,
We are very pleased to announce the 0.15.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Added support for EAGLE. Refer to examples/eagle/README.md.
- Added functional support for GH200 systems.
- Added AutoQ (mixed precision) support.
- Added a trtllm-serve command to start a FastAPI based server.
- Added FP8 support for Nemotron NAS 51B. Refer to examples/nemotron_nas/README.md.
- Added INT8 support for GPTQ quantization.
- Added TensorRT native support for INT8 Smooth Quantization.
- Added quantization support for Exaone model. Refer to examples/exaone/README.md.
- Enabled Medusa for Qwen2 models. Refer to “Medusa with Qwen2” section in examples/medusa/README.md.
- Optimized pipeline parallelism with ReduceScatter and AllGather for Mixtral models.
- Added support for Qwen2ForSequenceClassification model architecture.
- Added Python plugin support to simplify plugin development efforts. Refer to examples/python_plugin/README.md.
- Added different rank dimensions support for LoRA modules when using the Hugging Face format. Thanks for the contribution from @AlessioNetti in #2366.
- Enabled embedding sharing by default. Refer to "Embedding Parallelism, Embedding Sharing, and Look-Up Plugin" section in docs/source/performance/perf-best-practices.md for information about the required conditions for embedding sharing.
- Added support for per-token per-channel FP8 (namely row-wise FP8) on Ada.
- Extended the maximum supported beam_width to 256.
- Added FP8 and INT8 SmoothQuant quantization support for the InternVL2-4B variant (LLM model only). Refer to examples/multimodal/README.md.
- Added support for prompt-lookup speculative decoding. Refer to examples/prompt_lookup/README.md.
- Integrated the QServe w4a8 per-group/per-channel quantization. Refer to “w4aINT8 quantization (QServe)” section in examples/llama/README.md.
- Added a C++ example for fast logits using the executor API. Refer to “executorExampleFastLogits” section in examples/cpp/executor/README.md.
- [BREAKING CHANGE] NVIDIA Volta GPU support is removed in this and future releases.
- Added the following enhancements to the LLM API:
  - [BREAKING CHANGE] Moved the runtime initialization from the first invocation of LLM.generate to LLM.__init__ for better generation performance without warmup.
  - Added n and best_of arguments to the SamplingParams class. These arguments enable returning multiple generations for a single request.
  - Added ignore_eos, detokenize, skip_special_tokens, spaces_between_special_tokens, and truncate_prompt_tokens arguments to the SamplingParams class. These arguments enable more control over the tokenizer behavior.
  - Added support for incremental detokenization to improve the detokenization performance for streaming generation.
  - Added the enable_prompt_adapter argument to the LLM class and the prompt_adapter_request argument for the LLM.generate method. These arguments enable prompt tuning.
- Added support for a gpt_variant argument to the examples/gpt/convert_checkpoint.py file. This enhancement enables checkpoint conversion with more GPT model variants. Thanks to the contribution from @tonylek in #2352.
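To give a feel for the prompt-lookup speculative decoding feature listed above, here is a minimal, library-free sketch of the general idea: draft tokens are proposed by finding an earlier occurrence of the most recent n-gram and copying the tokens that followed it. The function name propose_draft_tokens and its parameters are hypothetical and chosen for illustration; this is not the TensorRT-LLM implementation, so refer to examples/prompt_lookup/README.md for the actual usage.

```python
from typing import List, Sequence


def propose_draft_tokens(
    token_ids: Sequence[int],
    max_ngram_size: int = 3,
    num_draft_tokens: int = 4,
) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram against earlier tokens.

    Hypothetical helper for illustration only (not a TensorRT-LLM API).
    """
    tokens = list(token_ids)
    # Try the longest n-gram first: a longer match is a stronger hint
    # that the text following it will repeat.
    for ngram_size in range(min(max_ngram_size, len(tokens) - 1), 0, -1):
        suffix = tokens[-ngram_size:]
        # Scan earlier positions, most recent first, skipping the suffix itself.
        for start in range(len(tokens) - ngram_size - 1, -1, -1):
            if tokens[start:start + ngram_size] == suffix:
                end = start + ngram_size + num_draft_tokens
                continuation = tokens[start + ngram_size:end]
                if continuation:
                    return continuation
    # No earlier match: no draft is proposed for this step.
    return []


# The trailing n-gram [1, 2, 3] also appears at the start of the sequence,
# so the tokens that followed it there are proposed as the draft.
print(propose_draft_tokens([1, 2, 3, 4, 5, 1, 2, 3]))  # prints [4, 5, 1, 2]
```

In a speculative decoding loop, the target model would then verify the proposed tokens in a single forward pass and keep only the accepted prefix, which is why a wrong draft costs little.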
API Changes
- [BREAKING CHANGE] Moved the flag builder_force_num_profiles in trtllm-build command to the BUILDER_FORCE_NUM_PROFILES environment variable.
- [BREAKING CHANGE] Modified defaults for BuildConfig class so that they are aligned with the trtllm-build command.
- [BREAKING CHANGE] Removed Python bindings of GptManager.
- [BREAKING CHANGE] auto is used as the default value for --dtype option in quantize and checkpoints conversion scripts.
- [BREAKING CHANGE] Deprecated gptManager API path in gptManagerBenchmark.
- [BREAKING CHANGE] Deprecated the beam_width and num_return_sequences arguments to the SamplingParams class in the LLM API. Use the n, best_of and use_beam_search arguments instead.
- Exposed --trust_remote_code argument to the OpenAI API server. (#2357)
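For build scripts that previously passed builder_force_num_profiles on the command line, the change above means the value now travels through the environment instead. The sketch below shows one way to prepare such an invocation from Python. The helper name make_build_invocation and the example paths are hypothetical, and it assumes the environment variable takes the same integer value the old flag did. The command is only constructed here, not executed.

```python
import os
from typing import Dict, List, Tuple


def make_build_invocation(
    checkpoint_dir: str,
    output_dir: str,
    num_profiles: int,
) -> Tuple[List[str], Dict[str, str]]:
    """Return the trtllm-build command and the environment to run it with.

    Hypothetical helper for illustration only (not part of TensorRT-LLM).
    """
    cmd = [
        "trtllm-build",
        "--checkpoint_dir", checkpoint_dir,
        "--output_dir", output_dir,
    ]
    # Copy the current environment so the parent process is left untouched.
    env = dict(os.environ)
    # Assumption: the variable accepts the integer the removed flag used to take.
    env["BUILDER_FORCE_NUM_PROFILES"] = str(num_profiles)
    return cmd, env


cmd, env = make_build_invocation("./ckpt", "./engine", num_profiles=1)
# To actually run the build (requires TensorRT-LLM to be installed):
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```

Passing a copied environment to the subprocess, rather than mutating os.environ, keeps the setting scoped to the one build.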
Model Updates
- Added support for Llama 3.2 and Llama 3.2-Vision models. Refer to examples/mllama/README.md for more details on the Llama 3.2-Vision model.
- Added support for Deepseek-v2. Refer to examples/deepseek_v2/README.md.
- Added support for Cohere Command R models. Refer to examples/commandr/README.md.
- Added support for Falcon 2. Refer to examples/falcon/README.md. Thanks to the contribution from @puneeshkhanna in #1926.
- Added support for InternVL2. Refer to examples/multimodal/README.md.
examples/multimodal/README.md. - Added support for Qwen2-0.5B and Qwen2.5-1.5B model. (#2388)
- Added support for Minitron. Refer to examples/nemotron.
- Added a GPT Variant - Granite (20B and 34B). Refer to “GPT Variant - Granite” section in examples/gpt/README.md.
- Added support for LLaVA-OneVision model. Refer to “LLaVA, LLaVa-NeXT, LLaVA-OneVision and VILA” section in examples/multimodal/README.md.
Fixed Issues
- Fixed a slice error in forward function. (#1480)
- Fixed an issue that appears when building BERT. (#2373)
- Fixed an issue where the model was not loaded when building BERT. (#2379)
- Fixed the broken executor examples. (#2294)
- Fixed the issue that the kernel moeTopK() cannot find the correct expert when the number of experts is not a power of two. Thanks @dongjiyingdjy for reporting this bug.
- Fixed an assertion failure on crossKvCacheFraction. (#2419)
- Fixed an issue when using smoothquant to quantize Qwen2 model. (#2370)
- Fixed a PDL typo in docs/source/performance/perf-benchmarking.md, thanks @MARD1NO for pointing it out in #2425.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.10-py3.
- The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.10-py3.
- The dependent TensorRT version is updated to 10.6.
- The dependent CUDA version is updated to 12.6.2.
- The dependent PyTorch version is updated to 2.5.1.
- The dependent ModelOpt version is updated to 0.19 for Linux platform, while 0.17 is still used on Windows platform.
Documentation
- Added a copy button for code snippets in the documentation. (#2288)
We are updating the main branch regularly with new features, bug fixes, and performance optimizations. The rel branch will be updated less frequently, and the exact frequency depends on your feedback.
Thanks,
The TensorRT-LLM Engineering Team