TensorRT-LLM 0.12.0 Release #2167
Shixiaowei02 announced in Announcements
Hi,
We are very pleased to announce the 0.12.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- `ModelWeightsLoader` is enabled for LLaMA family models (experimental), see `docs/source/architecture/model-weights-loader.md`.
- … `LLM` class.
- … see `docs/source/speculative_decoding.md`.
- Supported the `gelu_pytorch_tanh` activation function, thanks to the contribution from @ttim in #1897.
- Added the `chunk_length` parameter to Whisper, thanks to the contribution from @MahmoudAshraf97 in #1909.
- Added the `concurrency` argument for `gptManagerBenchmark`.
- … see `docs/source/executor.md#sending-requests-with-different-beam-widths`.
- Added `--fast_build` to the `trtllm-build` command (experimental).

API Changes
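As a hedged illustration of the build-command changes described in this section (the model, checkpoint, and output paths below are hypothetical), a 0.12.0-style `trtllm-build` invocation caps the sequence length with `--max_seq_len` rather than the removed `--max_output_len`:

```shell
# Sketch of a 0.12.0 engine build; paths are illustrative only.
# --max_output_len no longer exists: cap the sequence length at build
# time with --max_seq_len. --use_custom_all_reduce and tp_size/pp_size/
# cp_size are likewise removed from trtllm-build, and --multi_block_mode
# and context_fmha_fp32_acc are now configured at runtime instead.
trtllm-build \
    --checkpoint_dir ./llama_ckpt \
    --output_dir ./llama_engine \
    --max_seq_len 4096
```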
- `max_output_len` is removed from the `trtllm-build` command; to limit the sequence length at engine build stage, specify `max_seq_len` instead.
- The `use_custom_all_reduce` argument is removed from `trtllm-build`.
- The `multi_block_mode` argument is moved from the build stage (`trtllm-build` and builder API) to the runtime.
- `context_fmha_fp32_acc` is moved to the runtime for decoder models.
- `tp_size`, `pp_size` and `cp_size` are removed from the `trtllm-build` command.
- … `executor` API, and it will be removed in a future release of TensorRT-LLM.
- The `cpp/include/tensorrt_llm/executor/version.h` file is going to be generated.

Model Updates
- … see `examples/exaone/README.md`.
- … see `examples/chatglm/README.md`.
- … see `examples/multimodal/README.md`.

Fixed Issues
- Fixed the wrong `cluster_infos` defined in `tensorrt_llm/auto_parallel/cluster_info.py`, thanks to the contribution from @saeyoonoh in #1987.
- Removed a duplicate flag from `docs/source/reference/troubleshooting.md`, thanks to the contribution from @hattizai in #1937.
- Propagated `QuantConfig.exclude_modules` to weight-only quantization, thanks to the contribution from @fjosw in #2056.
- Fixed an engine build failure (`TypeError: set_shape(): incompatible function arguments.` with Llama 3.1 70B Instruct) when `max_seq_len` is not an integer (#2018).

Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.

Known Issues
- On Windows, importing the library in Python may fail with `OSError: exception: access violation reading 0x0000000000000000`. See Installing on Windows for workarounds.

Currently, there are two key branches in the project:
- The `main` branch is updated regularly with new features, bug fixes and performance optimizations.
- The `rel` branch will be updated less frequently, and the exact frequency depends on your feedback.

Thanks,
The TensorRT-LLM Engineering Team
This discussion was created from the release TensorRT-LLM 0.12.0 Release.