TensorRT-LLM 0.12.0 Release #2167
Shixiaowei02 announced in Announcements
Hi,
We are very pleased to announce the 0.12.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- `ModelWeightsLoader` is enabled for LLaMA family models (experimental), see `docs/source/architecture/model-weights-loader.md`.
- … `LLM` class.
- … see `docs/source/speculative_decoding.md`.
- Supported the `gelu_pytorch_tanh` activation function, thanks to the contribution from @ttim in #1897.
- Added the `chunk_length` parameter to Whisper, thanks to the contribution from @MahmoudAshraf97 in #1909.
- Added the `concurrency` argument for `gptManagerBenchmark`.
- … see `docs/source/executor.md#sending-requests-with-different-beam-widths`.
- Added `--fast_build` to the `trtllm-build` command (experimental).

API Changes
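As a hedged illustration of the build-command changes described in this section (the model, checkpoint, and output paths below are hypothetical), a 0.12.0-style `trtllm-build` invocation caps the sequence length with `--max_seq_len` rather than the removed `--max_output_len`:

```shell
# Sketch of a 0.12.0 engine build; paths are illustrative only.
# --max_output_len no longer exists: cap the sequence length at build
# time with --max_seq_len. --use_custom_all_reduce and tp_size/pp_size/
# cp_size are likewise removed from trtllm-build, and --multi_block_mode
# and context_fmha_fp32_acc are now configured at runtime instead.
trtllm-build \
    --checkpoint_dir ./llama_ckpt \
    --output_dir ./llama_engine \
    --max_seq_len 4096
```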
- `max_output_len` is removed from the `trtllm-build` command; to limit the sequence length at engine build stage, specify `max_seq_len` instead.
- The `use_custom_all_reduce` argument is removed from `trtllm-build`.
- The `multi_block_mode` argument is moved from the build stage (`trtllm-build` and builder API) to the runtime.
- `context_fmha_fp32_acc` is moved to the runtime for decoder models.
- `tp_size`, `pp_size` and `cp_size` are removed from the `trtllm-build` command.
- … `executor` API, and it will be removed in a future release of TensorRT-LLM.
- The `cpp/include/tensorrt_llm/executor/version.h` file is going to be generated.

Model Updates
- … see `examples/exaone/README.md`.
- … see `examples/chatglm/README.md`.
- … see `examples/multimodal/README.md`.

Fixed Issues
- Fixed the wrong `cluster_infos` defined in `tensorrt_llm/auto_parallel/cluster_info.py`, thanks to the contribution from @saeyoonoh in #1987.
- Removed a duplicate flag from `docs/source/reference/troubleshooting.md`, thanks to the contribution from @hattizai in #1937.
- Propagated `QuantConfig.exclude_modules` to weight-only quantization, thanks to the contribution from @fjosw in #2056.
- Fixed an engine build failure (`TypeError: set_shape(): incompatible function arguments.` with Llama 3.1 70B Instruct) when `max_seq_len` is not an integer (#2018).

Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.

Known Issues
- On Windows, importing the library in Python may fail with `OSError: exception: access violation reading 0x0000000000000000`. See Installing on Windows for workarounds.

Currently, there are two key branches in the project:
- The `main` branch is updated regularly with new features, bug fixes and performance optimizations.
- The `rel` branch will be updated less frequently, and the exact frequency depends on your feedback.

Thanks,
The TensorRT-LLM Engineering Team
This discussion was created from the release TensorRT-LLM 0.12.0 Release.