Releases: NVIDIA/TensorRT-LLM
TensorRT-LLM Release 0.18.2
Key Features and Enhancements
- This update addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/.
TensorRT-LLM Release 0.18.1
Key Features and Enhancements
- The 0.18.x series of releases builds upon the 0.17.0 release, focusing exclusively on dependency updates without incorporating features from the previous 0.18.0.dev pre-releases. These features will be included in future stable releases.
Infrastructure Changes
- The dependent `transformers` package version is updated to 4.48.3.
TensorRT-LLM Release 0.18.0
Hi,
We are very pleased to announce the 0.18.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Features that were previously available in the 0.18.0.dev pre-releases are not included in this release.
- [BREAKING CHANGE] Windows platform support is deprecated as of v0.18.0. All Windows-related code and functionality will be completely removed in future releases.
Known Issues
- The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the PyTorch NGC Container for optimal support on SBSA platforms.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.03-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.03-py3`.
- The dependent TensorRT version is updated to 10.9.
- The dependent CUDA version is updated to 12.8.1.
- The dependent NVIDIA ModelOpt version is updated to 0.25 for the Linux platform.
TensorRT-LLM Release 0.17.0
Hi,
We are very pleased to announce the 0.17.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Blackwell support
- NOTE: On Blackwell platforms only, pip installation is not supported for TRT-LLM 0.17. Instead, it is recommended that users build from source using the NVIDIA NGC 25.01 PyTorch container.
- Added support for B200.
- Added support for GeForce RTX 50 series using Windows Subsystem for Linux (WSL) for limited models.
- Added NVFP4 Gemm support for Llama and Mixtral models.
- Added NVFP4 support for the `LLM` API and the `trtllm-bench` command.
- GB200 NVL is not fully supported.
- Added a benchmark script to measure the performance benefits of KV cache host offload, with expected runtime improvements on GH200.
- PyTorch workflow
- The PyTorch workflow is an experimental feature in `tensorrt_llm._torch`. The following is a list of supported infrastructure, models, and features that can be used with the PyTorch workflow.
- Added support for H100/H200/B200.
- Added support for Llama models, Mixtral, QWen, Vila.
- Added support for FP16/BF16/FP8/NVFP4 Gemm and fused Mixture-Of-Experts (MOE), FP16/BF16/FP8 KVCache.
- Added custom context and decoding attention kernels support via PyTorch custom op.
- Added support for chunked context (default off).
- Added CudaGraph support for decoding only.
- Added overlap scheduler support to overlap input preparation and the model forward pass by decoding one extra token.
- Added FP8 context FMHA support for the W4A8 quantization workflow.
- Added ModelOpt quantized checkpoint support for the `LLM` API (see the sketch after this list).
- Added FP8 support for the Llama-3.2 VLM model. Refer to the “MLLaMA” section in `examples/multimodal/README.md`.
- Added PDL support for the `userbuffer`-based AllReduce-Norm fusion kernel.
- Added runtime support for seamless lookahead decoding.
- Added token-aligned arbitrary output tensors support for the C++ `executor` API.
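As a quick illustration of the ModelOpt quantized checkpoint support in the `LLM` API mentioned above, here is a minimal, hedged sketch. The checkpoint path is a placeholder and exact keyword names may differ between releases.

```python
# Hedged sketch: loading a ModelOpt-quantized checkpoint through the LLM API.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/path/to/modelopt_quantized_ckpt")  # hypothetical local path

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```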
API Changes
- [BREAKING CHANGE] KV cache reuse is enabled automatically when `paged_context_fmha` is enabled.
- Added `--concurrency` support for the `throughput` subcommand of `trtllm-bench`.
Fixed Issues
- Fixed incorrect LoRA output dimension. Thanks for the contribution from @akhoroshev in #2484.
- Added NVIDIA H200 GPU into the `cluster_key` for auto parallelism feature. (#2552)
- Fixed a typo in the `__post_init__` function of `LLmArgs` Class. Thanks for the contribution from @topenkoff in #2691.
- Fixed workspace size issue in the GPT attention plugin. Thanks for the contribution from @AIDC-AI.
- Fixed Deepseek-V2 model accuracy.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.01-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.01-py3`.
- The dependent TensorRT version is updated to 10.8.0.
- The dependent CUDA version is updated to 12.8.0.
- The dependent ModelOpt version is updated to 0.23 for the Linux platform, while 0.17 is still used on the Windows platform.
Known Issues
- Need `--extra-index-url https://pypi.nvidia.com` when running `pip install tensorrt-llm` due to new third-party dependencies.
- The PyPI SBSA wheel is incompatible with PyTorch 2.5.1 due to a break in the PyTorch ABI/API, as detailed in the related GitHub issue.
TensorRT-LLM Release 0.16.0
Hi,
We are very pleased to announce the 0.16.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Added guided decoding support with XGrammar backend.
- Added quantization support for RecurrentGemma. Refer to `examples/recurrentgemma/README.md`.
- Added Ulysses context parallel support. Refer to an example of building LLaMA 7B using 2-way tensor parallelism and 2-way context parallelism at `examples/llama/README.md`.
- Added W4A8 quantization support to BF16 models on Ada (SM89).
- Added PDL support for the FP8 GEMM plugins.
- Added a runtime `max_num_tokens` dynamic tuning feature, which can be enabled by passing `--enable_max_num_tokens_tuning` to `gptManagerBenchmark`.
- Added typical acceptance support for EAGLE.
- Supported enabling chunked context and sliding window attention together.
- Added head size 64 support for the XQA kernel.
- Added the following features to the LLM API:
- Lookahead decoding.
- DeepSeek V1 support.
- Medusa support.
- `max_num_tokens` and `max_batch_size` arguments to control the runtime parameters (see the sketch after this list).
- `extended_runtime_perf_knob_config` to enable various performance configurations.
- Added LogN scaling support for Qwen models.
- Added `AutoAWQ` checkpoints support for Qwen. Refer to the “INT4-AWQ” section in `examples/qwen/README.md`.
- Added `AutoAWQ` and `AutoGPTQ` Hugging Face checkpoints support for LLaMA. (#2458)
- Added `allottedTimeMs` to the C++ `Request` class to support per-request timeout.
- [BREAKING CHANGE] Removed NVIDIA V100 GPU support.
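A hedged sketch of the new `max_num_tokens` and `max_batch_size` arguments in the LLM API listed above; the model path is a placeholder and the keyword names are assumed to be forwarded to the runtime configuration of your release.

```python
from tensorrt_llm import LLM, SamplingParams

# Assumed keyword arguments; check the LlmArgs definition of your release for exact names.
llm = LLM(
    model="/path/to/model",   # hypothetical path
    max_batch_size=64,        # cap on concurrently scheduled requests at runtime
    max_num_tokens=8192,      # cap on total tokens scheduled per iteration
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```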
API Changes
- [BREAKING CHANGE] Removed the `enable_xqa` argument from `trtllm-build`.
- [BREAKING CHANGE] Chunked context is enabled by default when KV cache and paged context FMHA are enabled on non-RNN based models.
- [BREAKING CHANGE] Enabled embedding sharing automatically when possible and removed the `--use_embedding_sharing` flag from the checkpoint conversion scripts.
- [BREAKING CHANGE] The `if __name__ == "__main__"` entry point is required for both single-GPU and multi-GPU cases when using the `LLM` API.
- [BREAKING CHANGE] Cancelled requests now return empty results.
- Added the `enable_chunked_prefill` flag to the `LlmArgs` of the `LLM` API (see the sketch after this list).
- Integrated BERT and RoBERTa models into the `trtllm-build` command.
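A hedged sketch combining two of the API changes above: the now-required `if __name__ == "__main__"` entry point and the new `enable_chunked_prefill` flag. The model path is a placeholder, and the flag is assumed to be forwarded from the `LLM` constructor to `LlmArgs`.

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    # enable_chunked_prefill is assumed to be accepted as a keyword argument here.
    llm = LLM(model="/path/to/model", enable_chunked_prefill=True)
    out = llm.generate(["Explain chunked prefill in one sentence."],
                       SamplingParams(max_tokens=48))
    print(out[0].outputs[0].text)

# Required for both single-GPU and multi-GPU cases when using the LLM API.
if __name__ == "__main__":
    main()
```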
Model Updates
- Added Qwen2-VL support. Refer to the “Qwen2-VL” section of `examples/multimodal/README.md`.
- Added multimodal evaluation examples. Refer to `examples/multimodal`.
- Added Stable Diffusion XL support. Refer to `examples/sdxl/README.md`. Thanks for the contribution from @Zars19 in #1514.
Fixed Issues
- Fixed unnecessary batch logits post processor calls. (#2439)
- Fixed a typo in the error message. (#2473)
- Fixed the in-place clamp operation usage in smooth quant. Thanks for the contribution from @StarrickLiu in #2485.
- Fixed `sampling_params` to only be set up if `end_id` is None and `tokenizer` is not None in the `LLM` API. Thanks to the contribution from @mfuntowicz in #2573.
Infrastructure Changes
- Updated the base Docker image for TensorRT-LLM to `nvcr.io/nvidia/pytorch:24.11-py3`.
- Updated the base Docker image for TensorRT-LLM Backend to `nvcr.io/nvidia/tritonserver:24.11-py3`.
- Updated to TensorRT v10.7.
- Updated to CUDA v12.6.3.
- Added support for Python 3.10 and 3.12 to TensorRT-LLM Python wheels on PyPI.
- Updated to ModelOpt v0.21 for the Linux platform, while v0.17 is still used on the Windows platform.
Known Issues
- There is a known AllReduce performance issue on AMD-based CPU platforms with NCCL 2.23.4, which can be worked around by setting `export NCCL_P2P_LEVEL=SYS`.
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The rel branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.15.0 Release
Hi,
We are very pleased to announce the 0.15.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Added support for EAGLE. Refer to `examples/eagle/README.md`.
- Added functional support for GH200 systems.
- Added AutoQ (mixed precision) support.
- Added a `trtllm-serve` command to start a FastAPI based server.
- Added FP8 support for Nemotron NAS 51B. Refer to `examples/nemotron_nas/README.md`.
- Added INT8 support for GPTQ quantization.
- Added TensorRT native support for INT8 Smooth Quantization.
- Added quantization support for Exaone model. Refer to `examples/exaone/README.md`.
- Enabled Medusa for Qwen2 models. Refer to “Medusa with Qwen2” section in `examples/medusa/README.md`.
- Optimized pipeline parallelism with ReduceScatter and AllGather for Mixtral models.
- Added support for `Qwen2ForSequenceClassification` model architecture.
- Added Python plugin support to simplify plugin development efforts. Refer to `examples/python_plugin/README.md`.
- Added different rank dimensions support for LoRA modules when using the Hugging Face format. Thanks for the contribution from @AlessioNetti in #2366.
- Enabled embedding sharing by default. Refer to "Embedding Parallelism, Embedding Sharing, and Look-Up Plugin" section in `docs/source/performance/perf-best-practices.md` for information about the required conditions for embedding sharing.
- Added support for per-token per-channel FP8 (namely row-wise FP8) on Ada.
- Extended the maximum supported `beam_width` to `256`.
- Added FP8 and INT8 SmoothQuant quantization support for the InternVL2-4B variant (LLM model only). Refer to `examples/multimodal/README.md`.
- Added support for prompt-lookup speculative decoding. Refer to `examples/prompt_lookup/README.md`.
- Integrated the QServe w4a8 per-group/per-channel quantization. Refer to “w4aINT8 quantization (QServe)” section in `examples/llama/README.md`.
- Added a C++ example for fast logits using the `executor` API. Refer to “executorExampleFastLogits” section in `examples/cpp/executor/README.md`.
- [BREAKING CHANGE] NVIDIA Volta GPU support is removed in this and future releases.
- Added the following enhancements to the LLM API:
- [BREAKING CHANGE] Moved the runtime initialization from the first invocation of `LLM.generate` to `LLM.__init__` for better generation performance without warmup.
- Added `n` and `best_of` arguments to the `SamplingParams` class. These arguments enable returning multiple generations for a single request.
- Added `ignore_eos`, `detokenize`, `skip_special_tokens`, `spaces_between_special_tokens`, and `truncate_prompt_tokens` arguments to the `SamplingParams` class. These arguments enable more control over the tokenizer behavior (see the sketch after this list).
- Added support for incremental detokenization to improve the detokenization performance for streaming generation.
- Added the `enable_prompt_adapter` argument to the `LLM` class and the `prompt_adapter_request` argument for the `LLM.generate` method. These arguments enable prompt tuning.
- Added support for a `gpt_variant` argument to the `examples/gpt/convert_checkpoint.py` file. This enhancement enables checkpoint conversion with more GPT model variants. Thanks to the contribution from @tonylek in #2352.
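A hedged sketch of the expanded `SamplingParams` options described above (`n`, `best_of`, `ignore_eos`, `detokenize`, `skip_special_tokens`); the model path is a placeholder and some parameter names (for example the token limit) may differ between releases.

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/path/to/model")  # runtime now initializes here, not at the first generate()

params = SamplingParams(
    max_tokens=64,            # token limit; may be named differently in some releases
    n=2,                      # return two generations per request
    best_of=4,                # sample four candidates, keep the best two
    ignore_eos=False,         # stop on end-of-sequence as usual
    detokenize=True,          # return text rather than raw token ids
    skip_special_tokens=True,
)

for result in llm.generate(["Write a haiku about GPUs."], params):
    for candidate in result.outputs:
        print(candidate.text)
```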
API Changes
- [BREAKING CHANGE] Moved the flag `builder_force_num_profiles` in `trtllm-build` command to the `BUILDER_FORCE_NUM_PROFILES` environment variable.
- [BREAKING CHANGE] Modified defaults for `BuildConfig` class so that they are aligned with the `trtllm-build` command.
- [BREAKING CHANGE] Removed Python bindings of `GptManager`.
- [BREAKING CHANGE] `auto` is used as the default value for `--dtype` option in quantize and checkpoints conversion scripts.
- [BREAKING CHANGE] Deprecated `gptManager` API path in `gptManagerBenchmark`.
- [BREAKING CHANGE] Deprecated the `beam_width` and `num_return_sequences` arguments to the `SamplingParams` class in the LLM API. Use the `n`, `best_of` and `use_beam_search` arguments instead.
- Exposed `--trust_remote_code` argument to the OpenAI API server. (#2357)
Model Updates
- Added support for Llama 3.2 and Llama 3.2-Vision models. Refer to `examples/mllama/README.md` for more details on the Llama 3.2-Vision model.
- Added support for Deepseek-v2. Refer to `examples/deepseek_v2/README.md`.
- Added support for Cohere Command R models. Refer to `examples/commandr/README.md`.
- Added support for Falcon 2, refer to `examples/falcon/README.md`, thanks to the contribution from @puneeshkhanna in #1926.
- Added support for InternVL2. Refer to `examples/multimodal/README.md`.
- Added support for Qwen2-0.5B and Qwen2.5-1.5B models. (#2388)
- Added support for Minitron. Refer to `examples/nemotron`.
- Added a GPT Variant - Granite (20B and 34B). Refer to the “GPT Variant - Granite” section in `examples/gpt/README.md`.
- Added support for the LLaVA-OneVision model. Refer to the “LLaVA, LLaVa-NeXT, LLaVA-OneVision and VILA” section in `examples/multimodal/README.md`.
Fixed Issues
- Fixed a slice error in forward function. (#1480)
- Fixed an issue that appears when building BERT. (#2373)
- Fixed an issue where the model is not loaded when building BERT. (#2379)
- Fixed the broken executor examples. (#2294)
- Fixed the issue that the kernel `moeTopK()` cannot find the correct expert when the number of experts is not a power of two. Thanks @dongjiyingdjy for reporting this bug.
- Fixed an assertion failure on `crossKvCacheFraction`. (#2419)
- Fixed an issue when using smoothquant to quantize Qwen2 model. (#2370)
- Fixed a PDL typo in `docs/source/performance/perf-benchmarking.md`, thanks @MARD1NO for pointing it out in #2425.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.10-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.10-py3`.
- The dependent TensorRT version is updated to 10.6.
- The dependent CUDA version is updated to 12.6.2.
- The dependent PyTorch version is updated to 2.5.1.
- The dependent ModelOpt version is updated to 0.19 for the Linux platform, while 0.17 is still used on the Windows platform.
Documentation
- Added a copy button for code snippets in the documentation. (#2288)
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The rel branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.14.0 Release
Hi,
We are very pleased to announce the 0.14.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Enhanced the `LLM` class in the LLM API.
- Added support for calibration with offline dataset.
- Added support for Mamba2.
- Added support for `finish_reason` and `stop_reason` (see the sketch after this list).
- Added FP8 support for CodeLlama.
- Added `__repr__` methods for class `Module`, thanks to the contribution from @1ytic in #2191.
- Added BFloat16 support for fused gated MLP.
- Updated ReDrafter beam search logic to match Apple ReDrafter v1.1.
- Improved `customAllReduce` performance.
- The draft model can now copy logits directly over MPI to the target model's process in `orchestrator` mode. This fast logits copy reduces the delay between draft token generation and the beginning of target model inference.
- NVIDIA Volta GPU support is deprecated and will be removed in a future release.
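A hedged sketch of reading the new `finish_reason` and `stop_reason` fields mentioned above; the model path is a placeholder and the attribute locations are assumptions based on the LLM API's output objects.

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/path/to/model")  # hypothetical path
result = llm.generate(["Tell me a short joke."],
                      SamplingParams(max_tokens=32, stop=["\n"]))[0]

completion = result.outputs[0]
print(completion.text)
# finish_reason is assumed to report why generation ended (e.g. "stop" or "length"),
# and stop_reason which stop string or token id triggered it, if any.
print("finish_reason:", completion.finish_reason)
print("stop_reason:", completion.stop_reason)
```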
API Changes
- [BREAKING CHANGE] The default `max_batch_size` of the `trtllm-build` command is set to `2048`.
- [BREAKING CHANGE] Removed `builder_opt` from the `BuildConfig` class and the `trtllm-build` command.
- Added logits post-processor support to the `ModelRunnerCpp` class.
- Added the `isParticipant` method to the C++ `Executor` API to check if the current process is a participant in the executor instance.
Model Updates
- Added support for NemotronNas, see `examples/nemotron_nas/README.md`.
- Added support for Deepseek-v1, see `examples/deepseek_v1/README.md`.
- Added support for Phi-3.5 models, see `examples/phi/README.md`.
Fixed Issues
- Fixed a typo in `tensorrt_llm/models/model_weights_loader.py`, thanks to the contribution from @wangkuiyi in #2152.
- Fixed duplicated import module in `tensorrt_llm/runtime/generation.py`, thanks to the contribution from @lkm2835 in #2182.
- Enabled `share_embedding` for the models that have no `lm_head` in legacy checkpoint conversion path, thanks to the contribution from @lkm2835 in #2232.
- Fixed `kv_cache_type` issue in the Python benchmark, thanks to the contribution from @qingquansong in #2219.
- Fixed an issue with SmoothQuant calibration with custom datasets. Thanks to the contribution by @Bhuvanesh09 in #2243.
- Fixed an issue surrounding `trtllm-build --fast-build` with fake or random weights. Thanks to @ZJLi2013 for flagging it in #2135.
- Fixed missing `use_fused_mlp` when constructing `BuildConfig` from dict, thanks for the fix from @ethnzhng in #2081.
- Fixed lookahead batch layout for `numNewTokensCumSum`. (#2263)
Infrastructure Changes
- The dependent ModelOpt version is updated to v0.17.
Documentation
- @Sherlock113 added a tech blog to the latest news in #2169, thanks for the contribution.
Known Issues
- Replit Code is not supported with transformers 4.45+.
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The rel branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.13.0 Release
Hi,
We are very pleased to announce the 0.13.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Supported lookahead decoding (experimental), see `docs/source/speculative_decoding.md`.
- Added some enhancements to the `ModelWeightsLoader` (a unified checkpoint converter, see `docs/source/architecture/model-weights-loader.md`).
- Supported Qwen models.
- Supported auto-padding for indivisible TP shape in INT4-wo/INT8-wo/INT4-GPTQ.
- Improved performance on `*.bin` and `*.pth`.
- Supported OpenAI Whisper in C++ runtime.
- Added some enhancements to the `LLM` class.
- Supported LoRA.
- Supported engine building using dummy weights.
- Supported `trust_remote_code` for customized models and tokenizers downloaded from Hugging Face Hub (see the sketch after this list).
- Supported beam search for streaming mode.
- Supported tensor parallelism for Mamba2.
- Supported returning generation logits for streaming mode.
- Added `curand` and `bfloat16` support for `ReDrafter`.
- Added sparse mixer normalization mode for MoE models.
- Added support for QKV scaling in FP8 FMHA.
- Supported FP8 for MoE LoRA.
- Supported KV cache reuse for P-Tuning and LoRA.
- Supported in-flight batching for CogVLM models.
- Supported LoRA for the `ModelRunnerCpp` class.
- Supported `head_size=48` cases for FMHA kernels.
- Added FP8 examples for DiT models, see `examples/dit/README.md`.
- Supported decoder with encoder input features for the C++ `executor` API.
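A hedged sketch of the `trust_remote_code` enhancement to the `LLM` class mentioned above; the repository id is a placeholder for any Hugging Face model that ships custom modeling or tokenizer code.

```python
from tensorrt_llm import LLM, SamplingParams

# Only enable trust_remote_code for repositories you trust: it allows custom
# Python code from the Hub to run locally during model/tokenizer loading.
llm = LLM(model="org/model-with-custom-code",  # hypothetical Hugging Face repo id
          trust_remote_code=True)

print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```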
API Changes
- [BREAKING CHANGE] Set `use_fused_mlp` to `True` by default.
- [BREAKING CHANGE] Enabled `multi_block_mode` by default.
- [BREAKING CHANGE] Enabled `strongly_typed` by default in `builder` API.
- [BREAKING CHANGE] Renamed `maxNewTokens`, `randomSeed` and `minLength` to `maxTokens`, `seed` and `minTokens` following OpenAI style.
- The `LLM` class
- [BREAKING CHANGE] Updated `LLM.generate` arguments to include `PromptInputs` and `tqdm`.
- The C++ `executor` API
- [BREAKING CHANGE] Added `LogitsPostProcessorConfig`.
- Added `FinishReason` to `Result`.
- [BREAKING CHANGE] Added
Model Updates
- Supported Gemma 2, see "Run Gemma 2" section in `examples/gemma/README.md`.
Fixed Issues
- Fixed an accuracy issue when enabling remove padding for cross attention. (#1999)
- Fixed the failure in converting qwen2-0.5b-instruct when using `smoothquant`. (#2087)
- Matched the `exclude_modules` pattern in `convert_utils.py` to the changes in `quantize.py`. (#2113)
- Fixed build engine error when `FORCE_NCCL_ALL_REDUCE_STRATEGY` is set.
- Fixed unexpected truncation in the quant mode of `gpt_attention`.
- Fixed the hang caused by race condition when canceling requests.
- Fixed the default factory for `LoraConfig`. (#1323)
Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- Base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.
- The dependent TensorRT version is updated to 10.4.0.
- The dependent CUDA version is updated to 12.5.1.
- The dependent PyTorch version is updated to 2.4.0.
- The dependent ModelOpt version is updated to v0.15.
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The rel branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.12.0 Release
Hi,
We are very pleased to announce the 0.12.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Supported LoRA for MoE models.
- The `ModelWeightsLoader` is enabled for LLaMA family models (experimental), see `docs/source/architecture/model-weights-loader.md`.
- Supported FP8 FMHA for NVIDIA Ada Lovelace Architecture.
- Supported GPT-J, Phi, Phi-3, Qwen, GPT, GLM, Baichuan, Falcon and Gemma models for the `LLM` class.
- Supported FP8 OOTB MoE.
- Supported Starcoder2 SmoothQuant. (#1886)
- Supported ReDrafter Speculative Decoding, see “ReDrafter” section in `docs/source/speculative_decoding.md`.
- Supported padding removal for BERT, thanks to the contribution from @Altair-Alpha in #1834.
- Added in-flight batching support for GLM 10B model.
- Supported `gelu_pytorch_tanh` activation function, thanks to the contribution from @ttim in #1897.
- Added `chunk_length` parameter to Whisper, thanks to the contribution from @MahmoudAshraf97 in #1909.
- Added `concurrency` argument for `gptManagerBenchmark`.
- Executor API supports requests with different beam widths, see `docs/source/executor.md#sending-requests-with-different-beam-widths`.
- Added the flag `--fast_build` to `trtllm-build` command (experimental).
API Changes
- [BREAKING CHANGE] `max_output_len` is removed from the `trtllm-build` command; if you want to limit the sequence length at the engine build stage, specify `max_seq_len`.
- [BREAKING CHANGE] The `use_custom_all_reduce` argument is removed from `trtllm-build`.
- [BREAKING CHANGE] The `multi_block_mode` argument is moved from the build stage (`trtllm-build` and builder API) to the runtime.
- [BREAKING CHANGE] The build time argument `context_fmha_fp32_acc` is moved to runtime for decoder models.
- [BREAKING CHANGE] The arguments `tp_size`, `pp_size` and `cp_size` are removed from the `trtllm-build` command.
- The C++ batch manager API is deprecated in favor of the C++ `executor` API, and it will be removed in a future release of TensorRT-LLM.
- Added a version API to the C++ library; a `cpp/include/tensorrt_llm/executor/version.h` file is going to be generated.
Model Updates
- Supported LLaMA 3.1 model.
- Supported Mamba-2 model.
- Supported EXAONE model, see `examples/exaone/README.md`.
- Supported Qwen 2 model.
- Supported GLM4 models, see `examples/chatglm/README.md`.
- Added LLaVa-1.6 (LLaVa-NeXT) multimodal support, see “LLaVA, LLaVa-NeXT and VILA” section in `examples/multimodal/README.md`.
Fixed Issues
- Fixed wrong pad token for the CodeQwen models. (#1953)
- Fixed typo in `cluster_infos` defined in `tensorrt_llm/auto_parallel/cluster_info.py`, thanks to the contribution from @saeyoonoh in #1987.
- Removed duplicated flags in the command at `docs/source/reference/troubleshooting.md`, thanks for the contribution from @hattizai in #1937.
- Fixed segmentation fault in TopP sampling layer, thanks to the contribution from @akhoroshev in #2039. (#2040)
- Fixed the failure when converting the checkpoint for Mistral Nemo model. (#1985)
- Propagated `exclude_modules` to weight-only quantization, thanks to the contribution from @fjosw in #2056.
- Fixed wrong links in README, thanks to the contribution from @Tayef-Shah in #2028.
- Fixed some typos in the documentation, thanks to the contribution from @lfz941 in #1939.
- Fixed the engine build failure when deduced `max_seq_len` is not an integer. (#2018)
Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- Base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.
- The dependent TensorRT version is updated to 10.3.0.
- The dependent CUDA version is updated to 12.5.1.
- The dependent PyTorch version is updated to 2.4.0.
- The dependent ModelOpt version is updated to v0.15.0.
Known Issues
- On Windows, installation of TensorRT-LLM may succeed, but you might hit `OSError: exception: access violation reading 0x0000000000000000` when importing the library in Python. See Installing on Windows for workarounds.
Currently, there are two key branches in the project:
- The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
- The main branch is the dev branch. It is more experimental.
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The rel branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.11.0 Release
Hi,
We are very pleased to announce the 0.11.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Supported very long context for LLaMA (see “Long context evaluation” section in `examples/llama/README.md`).
- Low latency optimization
- Added a reduce-norm feature that fuses the ResidualAdd and LayerNorm kernels after AllReduce into a single kernel; enabling it is recommended when the batch size is small and the generation phase time is dominant.
- Added FP8 support to the GEMM plugin, which benefits the cases when batch size is smaller than 4.
- Added a fused GEMM-SwiGLU plugin for FP8 on SM90.
- LoRA enhancements
- Supported running FP8 LLaMA with FP16 LoRA checkpoints.
- Added support for quantized base model and FP16/BF16 LoRA.
- SQ OOTB (INT8 A/W) + FP16/BF16/FP32 LoRA
- INT8/INT4 Weight-Only (INT8/W) + FP16/BF16/FP32 LoRA
- Weight-Only Group-wise + FP16/BF16/FP32 LoRA
- Added LoRA support to Qwen2, see “Run models with LoRA” section in `examples/qwen/README.md`.
- Added support for Phi-3-mini/small FP8 base + FP16/BF16 LoRA, see “Run Phi-3 with LoRA” section in `examples/phi/README.md`.
- Added support for starcoder-v2 FP8 base + FP16/BF16 LoRA, see “Run StarCoder2 with LoRA” section in `examples/gpt/README.md`.
- Encoder-decoder models C++ runtime enhancements
- Supported paged KV cache and inflight batching. (#800)
- Supported tensor parallelism.
- Supported INT8 quantization with embedding layer excluded.
- Updated the default model for Whisper to `distil-whisper/distil-large-v3`, thanks to the contribution from @IbrahimAmin1 in #1337.
- Supported automatic HuggingFace model download for the Python high-level API.
- Supported explicit draft tokens for in-flight batching.
- Supported local custom calibration datasets, thanks to the contribution from @DreamGenX in #1762.
- Added batched logits post processor.
- Added Hopper qgmma kernel to XQA JIT codepath.
- Supported tensor parallelism and expert parallelism enabled together for MoE.
- Supported the pipeline parallelism cases when the number of layers cannot be divided by PP size.
- Added `numQueuedRequests` to the iteration stats log of the executor API.
- Added `iterLatencyMilliSec` to the iteration stats log of the executor API.
- Added a HuggingFace model zoo from the community, thanks to the contribution from @matichon-vultureprime in #1674.
API Changes
- [BREAKING CHANGE] `trtllm-build` command
- Migrated Whisper to unified workflow (`trtllm-build` command), see documents: `examples/whisper/README.md`.
- `max_batch_size` in `trtllm-build` command is switched to 256 by default.
- `max_num_tokens` in `trtllm-build` command is switched to 8192 by default.
- Deprecated `max_output_len` and added `max_seq_len`.
- Removed unnecessary `--weight_only_precision` argument from `trtllm-build` command.
- Removed `attention_qk_half_accumulation` argument from `trtllm-build` command.
- Removed `use_context_fmha_for_generation` argument from `trtllm-build` command.
- Removed `strongly_typed` argument from `trtllm-build` command.
- The default value of `max_seq_len` reads from the HuggingFace model config now.
- C++ runtime
- [BREAKING CHANGE] Renamed `free_gpu_memory_fraction` in `ModelRunnerCpp` to `kv_cache_free_gpu_memory_fraction`.
- [BREAKING CHANGE] Refactored `GptManager` API
- Moved `maxBeamWidth` into `TrtGptModelOptionalParams`.
- Moved `schedulerConfig` into `TrtGptModelOptionalParams`.
- Added some more options to `ModelRunnerCpp`, including `max_tokens_in_paged_kv_cache`, `kv_cache_enable_block_reuse` and `enable_chunked_context`.
- [BREAKING CHANGE] Python high-level API
- Removed the `ModelConfig` class, and all the options are moved to the `LLM` class.
- Refactored the `LLM` class, please refer to `examples/high-level-api/README.md`.
- Moved the most commonly used options into the explicit arg-list, and hid the expert options in the kwargs.
- Exposed `model` to accept either a HuggingFace model name or a local HuggingFace model/TensorRT-LLM checkpoint/TensorRT-LLM engine.
- Supported downloading models from the HuggingFace model hub; currently only Llama variants are supported.
- Supported a build cache to reuse the built TensorRT-LLM engines by setting the environment variable `TLLM_HLAPI_BUILD_CACHE=1` or passing `enable_build_cache=True` to the `LLM` class.
- Exposed low-level options including `BuildConfig`, `SchedulerConfig` and so on in the kwargs; ideally you should be able to configure details about the build and runtime phases.
- Refactored the `LLM.generate()` and `LLM.generate_async()` APIs (see the sketch after this list).
- Removed `SamplingConfig`.
- Added `SamplingParams` with more extensive parameters, see `tensorrt_llm/hlapi/utils.py`.
- The new `SamplingParams` contains and manages fields from the Python bindings of `SamplingConfig`, `OutputConfig`, and so on.
- Refactored `LLM.generate()` output as `RequestOutput`, see `tensorrt_llm/hlapi/llm.py`.
- Updated the `apps` examples, especially by rewriting both `chat.py` and `fastapi_server.py` using the `LLM` APIs, please refer to `examples/apps/README.md` for details.
- Updated `chat.py` to support multi-turn conversation, allowing users to chat with a model in the terminal.
- Fixed `fastapi_server.py` and eliminated the need for `mpirun` in multi-GPU scenarios.
- [BREAKING CHANGE] Speculative decoding configurations unification
- Introduction of `SpeculativeDecodingMode.h` to choose between different speculative decoding techniques.
- Introduction of `SpeculativeDecodingModule.h` base class for speculative decoding techniques.
- Removed `decodingMode.h`.
- `gptManagerBenchmark`
apiingptManagerBenchmarkcommand isexecutorby default now. - Added a runtime
max_batch_size. - Added a runtime
max_num_tokens.
- [BREAKING CHANGE]
- [BREAKING CHANGE] Added a
biasargument to theLayerNormmodule, and supports non-bias layer normalization. - [BREAKING CHANGE] Removed
GptSessionPython bindings.
Model Updates
- Supported Jais, see `examples/jais/README.md`.
- Supported DiT, see `examples/dit/README.md`.
- Supported VILA 1.5.
- Supported Video NeVA, see `Video NeVA` section in `examples/multimodal/README.md`.
- Supported Grok-1, see `examples/grok/README.md`.
- Supported Qwen1.5-110B with FP8 PTQ.
- Supported Phi-3 small model with block sparse attention.
- Supported InternLM2 7B/20B, thanks to the contribution from @RunningLeon in #1392.
- Supported Phi-3-medium models, see `examples/phi/README.md`.
- Supported Qwen1.5 MoE A2.7B.
- Supported phi 3 vision multimodal.
Fixed Issues
- Fixed broken outputs for the cases when batch size is larger than 1. (#1539)
- Fixed
`top_k` type in `executor.py`, thanks to the contribution from @vonjackustc in #1329.
- Fixed stop and bad word list pointer offset in Python runtime, thanks to the contribution from @fjosw in #1486.
- Fixed some typos for Whisper model, thanks to the contribution from @Pzzzzz5142 in #1328.
- Fixed export failure with CUDA driver < 526 and pynvml >= 11.5.0, thanks to the contribution from @CoderHam in #1537.
- Fixed an issue in NMT weight conversion, thanks to the contribution from @Pzzzzz5142 in #1660.
- Fixed LLaMA Smooth Quant conversion, thanks to the contribution from @lopuhin in #1650.
- Fixed
`qkv_bias` shape issue for Qwen1.5-32B (#1589), thanks to the contribution from @Tlntin in #1637.
- Fixed the error of Ada traits for `fpA_intB`, thanks to the contribution from @JamesTheZ in #1583.
- Updated `examples/qwenvl/requirements.txt`, thanks to the contribution from @ngoanpv in #1248.
- Fixed rsLoRA scaling in `lora_manager`, thanks to the contribution from @TheCodeWrangler in #1669.
- Fixed Qwen1.5 checkpoint convert failure #1675.
- Fixed Medusa safetensors and AWQ conversion, thanks to the contribution from @Tushar-ml in #1535.
- Fixed
`convert_hf_mpt_legacy` call failure when the function is called in other than global scope, thanks to the contribution from @bloodeagle40234 in #1534.
- Fixed `use_fp8_context_fmha` broken outputs (#1539).
- Fixed pre-norm weight conversion for NMT models, thanks to the contribution from @Pzzzzz5142 in #1723.
- Fixed random seed initialization issue, thanks to the contribution from @pathorn in #1742.
- Fixed stop words and bad words in python bindings. (#1642)
- Fixed an issue when converting the checkpoint for Mistral 7B v0.3, thanks to the contribution from @Ace-RR: #1732.
- Fixed broken inflight batching for fp8 Llama and Mixtral, thanks to the contribution from @bprus: #1738
- Fixed the failure when
`quantize.py` exports data to config.json, thanks to the contribution from @janpetrov: #1676
- Raised an error when autopp detects an unsupported quant plugin #1626.
- Fixed the issue that
`shared_embedding_table` is not being set when loading Gemma #1799, thanks to the contribution from @mfuntowicz.
- Fixed stop and bad words list contiguous for `ModelRunner` #1815, thanks to the contribution from @Marks101.
- Fixed missing comment for `FAST_BUILD`, thanks to the support from @lkm2835 in #1851.
- Fixed the issues that Top-P sampling occasionally produces invalid tokens. #1590
- Fixed #1424.
- Fixed #1529.
- Fixed
`benchmarks/cpp/README.md` for #1562 and #1552.
- Fixed dead link, thanks to the help from @DefTruth, @buvnswrn and @sunjiabin17 in: triton-inference-server/tensorrtllm_backend#478, triton-inference-server/tensorrtllm_backend#482 and triton-inference-server/tensorrtllm_backend#449.
Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to
`nvcr.io/nvidia/pytorch:24.05-py3`.
- Base Docker image for TensorRT-LLM backend is updated to `nvcr.io/nvidia/...