TensorRT-LLM 0.11.0 Release #1970
kaiyux announced in Announcements
Hi,
We are very pleased to announce the 0.11.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Supported very long context for LLaMA (see examples/llama/README.md).
- Added LoRA support for Qwen models, see examples/qwen/README.md.
- Added FP8 base + FP16/BF16 LoRA support for Phi-3, see examples/phi/README.md.
- Added FP8 base + FP16/BF16 LoRA support for StarCoder2, see examples/gpt/README.md.
- Updated the default model for Whisper to distil-whisper/distil-large-v3, thanks to the contribution from @IbrahimAmin1 in [feat]: Add Option to convert and run distil-whisper large-v3 #1337.
- Added `numQueuedRequests` to the iteration stats log of the executor API.
- Added `iterLatencyMilliSec` to the iteration stats log of the executor API (see the sketch after this list).
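The two new iteration-stats fields can be consumed programmatically as well as read from the log. Below is a minimal sketch, assuming the stats are available as one JSON record per line; the file name and record layout are illustrative assumptions, only the two field names come from the notes above.

```python
import json

def summarize_iteration_stats(path="iteration_stats.jsonl"):
    """Hypothetical example: aggregate the new 0.11.0 iteration-stats fields."""
    latencies, queued = [], []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # New in 0.11.0: per-iteration latency in milliseconds.
            latencies.append(record["iterLatencyMilliSec"])
            # New in 0.11.0: number of requests waiting in the executor queue.
            queued.append(record["numQueuedRequests"])
    if latencies:
        print(f"mean iteration latency: {sum(latencies) / len(latencies):.2f} ms")
        print(f"max queued requests:    {max(queued)}")

if __name__ == "__main__":
    summarize_iteration_stats()
```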
API Changes
- `trtllm-build` command
  - Migrated Whisper to the unified workflow (`trtllm-build` command), see documents: examples/whisper/README.md.
  - `max_batch_size` in the `trtllm-build` command is switched to 256 by default.
  - `max_num_tokens` in the `trtllm-build` command is switched to 8192 by default.
  - Deprecated `max_output_len` and added `max_seq_len`.
  - Removed the `--weight_only_precision` argument from the `trtllm-build` command.
  - Removed the `attention_qk_half_accumulation` argument from the `trtllm-build` command.
  - Removed the `use_context_fmha_for_generation` argument from the `trtllm-build` command.
  - Removed the `strongly_typed` argument from the `trtllm-build` command.
  - The default value of `max_seq_len` now reads from the HuggingFace model config.
- Renamed `free_gpu_memory_fraction` in `ModelRunnerCpp` to `kv_cache_free_gpu_memory_fraction`.
- Refactored the `GptManager` API
  - Moved `maxBeamWidth` into `TrtGptModelOptionalParams`.
  - Moved `schedulerConfig` into `TrtGptModelOptionalParams`.
- Added more options to `ModelRunnerCpp`, including `max_tokens_in_paged_kv_cache`, `kv_cache_enable_block_reuse` and `enable_chunked_context`.
- Removed the `ModelConfig` class; all of its options are moved to the `LLM` class.
- Refactored the `LLM` class, please refer to examples/high-level-api/README.md (a usage sketch follows after this list).
  - Exposed `model` to accept either a HuggingFace model name or a local HuggingFace model/TensorRT-LLM checkpoint/TensorRT-LLM engine.
  - Supported a build cache by setting the environment variable `TLLM_HLAPI_BUILD_CACHE=1` or passing `enable_build_cache=True` to the `LLM` class.
  - Exposed low-level options including `BuildConfig`, `SchedulerConfig` and so on in the kwargs; ideally you should be able to configure details about the build and runtime phases.
- Refactored the `LLM.generate()` and `LLM.generate_async()` APIs.
  - Removed `SamplingConfig`.
  - Added `SamplingParams` with more extensive parameters, see tensorrt_llm/hlapi/utils.py. The new `SamplingParams` contains and manages fields from the Python bindings of `SamplingConfig`, `OutputConfig`, and so on.
  - Refactored the `LLM.generate()` output as `RequestOutput`, see tensorrt_llm/hlapi/llm.py.
- Updated the `apps` examples, especially by rewriting both `chat.py` and `fastapi_server.py` using the `LLM` APIs; please refer to examples/apps/README.md for details.
  - Updated `chat.py` to support multi-turn conversation, allowing users to chat with a model in the terminal.
  - Fixed `fastapi_server.py` and eliminated the need for `mpirun` in multi-GPU scenarios.
- Introduced `SpeculativeDecodingMode.h` to choose between different speculative decoding techniques.
- Introduced the `SpeculativeDecodingModule.h` base class for speculative decoding techniques.
- Removed `decodingMode.h`.
- `gptManagerBenchmark`
  - `api` in the `gptManagerBenchmark` command is `executor` by default now.
  - Added a runtime `max_batch_size`.
  - Added a runtime `max_num_tokens`.
- Added a `bias` argument to the `LayerNorm` module to support layer normalization without bias.
- Removed the `GptSession` Python bindings.
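To illustrate the refactored high-level API described above, here is a minimal sketch. It assumes `LLM` and `SamplingParams` are importable from `tensorrt_llm.hlapi` (as the tensorrt_llm/hlapi/... paths above suggest), that `BuildConfig` is exported from the top-level package, and that it is passed via a `build_config` keyword; the exact import paths, keyword names, and parameter names may differ, so treat examples/high-level-api/README.md as the authoritative reference.

```python
# Minimal sketch of the refactored high-level API; import locations and
# keyword names below are assumptions based on the notes above.
from tensorrt_llm.hlapi import LLM, SamplingParams  # assumed import location
from tensorrt_llm import BuildConfig                # assumed import location

# `model` now accepts a HuggingFace model name, a local HuggingFace model,
# a TensorRT-LLM checkpoint, or a TensorRT-LLM engine directory.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",               # illustrative model choice
    enable_build_cache=True,                        # or set TLLM_HLAPI_BUILD_CACHE=1
    build_config=BuildConfig(                       # low-level options via kwargs
        max_batch_size=256,                         # new trtllm-build default
        max_num_tokens=8192,                        # new trtllm-build default
    ),
)

# SamplingParams replaces the removed SamplingConfig; see tensorrt_llm/hlapi/utils.py
# for the full field list (the names used here are assumed).
params = SamplingParams(max_new_tokens=64, temperature=0.8, top_p=0.95)

# LLM.generate() now returns RequestOutput objects (see tensorrt_llm/hlapi/llm.py).
for output in llm.generate(["Hello, my name is"], params):
    print(output)
```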
Model Updates
- Supported Jais, see examples/jais/README.md.
- Supported DiT, see examples/dit/README.md.
- Supported Video NeVA, see the "Video NeVA" section in examples/multimodal/README.md.
- Supported Grok-1, see examples/grok/README.md.
- Supported Phi-3-medium models, see examples/phi/README.md.
Fixed Issues
- Fixed the wrong `top_k` type in `executor.py`, thanks to the contribution from @vonjackustc in Fix top_k type (float => int32) executor.py #1329.
- Fixed the `qkv_bias` shape issue for Qwen1.5-32B (the `qkv_bias` shape is not divisible by 3 when converting the Qwen-110B GPTQ checkpoint, #1589), thanks to the contribution from @Tlntin in fix up qkv.bias error when use qwen1.5-32b-gptq-int4 #1637.
- Fixed the error of Ada traits for `fpA_intB`, thanks to the contribution from @JamesTheZ in Fix the error of Ada traits for fpA_intB. #1583.
- Updated `examples/qwenvl/requirements.txt`, thanks to the contribution from @ngoanpv in Update requirements.txt #1248.
- Fixed rsLoRA scaling in `lora_manager`, thanks to the contribution from @TheCodeWrangler in Fixed rslora scaling in lora_manager #1669.
- Fixed the `convert_hf_mpt_legacy` call failure when the function is called outside the global scope, thanks to the contribution from @bloodeagle40234 in Define hf_config explisitly for convert_hf_mpt_legacy #1534.
- Fixed `use_fp8_context_fmha` broken outputs (use_fp8_context_fmha broken outputs #1539).
- Fixed the issue where `quantize.py` fails to export important data to config.json (e.g. rotary scaling), thanks to the contribution from @janpetrov: quantize.py fails to export important data to config.json (eg rotary scaling) #1676.
- Fixed `shared_embedding_table` not being set when loading Gemma ([GEMMA] from_hugging_face not setting share_embedding_table to True leading to incapacity to load Gemma #1799), thanks to the contribution from @mfuntowicz.
- Fixed stop words and bad words lists to be contiguous for offsets in `ModelRunner` ([ModelRunner] Fix stop and bad words list contiguous for offsets #1815), thanks to the contribution from @Marks101.
- Added a `FAST_BUILD` comment at `#endif`, thanks to the support from @lkm2835 in Add FAST_BUILD comment at #endif #1851.
- Updated `benchmarks/cpp/README.md` to address gptManagerBenchmark seems to go into a dead loop with GPU usage 0% #1562 and Cannot process new request: [TensorRT-LLM][ERROR] Assertion failed: LoRA task 0 not found in cache. Please send LoRA weights with request #1552.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.05-py3.
- The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.05-py3.
Known Issues
- On Windows, installation in a conda environment may succeed, but importing TensorRT-LLM in Python can fail with OSError: exception: access violation reading 0x0000000000000000. This issue is under investigation.

Currently, there are two key branches in the project:
- The `rel` branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
- The `main` branch is the dev branch. It is more experimental.
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team
This discussion was created from the release TensorRT-LLM 0.11.0 Release.