Releases · PaddlePaddle/FastDeploy
v2.2.0
New Features
- The bad_words sampling option now accepts token ids
- Added support for the Qwen2.5-VL model series (video requests do not support enable-chunked-prefill)
- The API server completions endpoint now accepts lists of token ids in the prompt field and supports batched inference (a sketch follows after this list)
- Added function call parsing: function call results can be parsed via `tool-call-parse`
- Support custom chat_template at service startup or per request
- Support loading a model's chat_template.jinja file
- Request error responses now include the exception stack trace, and exception logging has been improved
- Added a speculative decoding method that mixes MTP and Ngram
- Added Tree Attention support for speculative decoding
- Enhanced model loading: model weights are now loaded through an iterator, further improving load speed and memory footprint
- Improved the API server log format and added timestamps
- Added a plugin mechanism that lets users extend FastDeploy with custom functionality without modifying its core code
- Marlin kernel files can now be generated automatically from template configurations at build time
- Support loading ERNIE and Qwen series models in native HuggingFace Safetensors format
- Improved DP+TP+EP hybrid parallel inference
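Below is a minimal, unofficial sketch of the batched token-id prompt feature mentioned above, using the OpenAI-compatible completions endpoint. The server address, port, model name, and token ids are placeholders; adjust them to your deployment.

```python
# Minimal sketch (not an official example): send token-id prompts, batched,
# to FastDeploy's OpenAI-compatible /v1/completions endpoint.
# Assumes a FastDeploy API server is already running; base_url, model name,
# and the token ids below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8180/v1", api_key="EMPTY")

# v2.2.0 lets `prompt` be a list of token ids, or a batch of such lists.
resp = client.completions.create(
    model="default",
    prompt=[[101, 2023, 2003], [101, 7592, 2088]],  # two token-id prompts in one request
    max_tokens=32,
)
for choice in resp.choices:
    print(choice.index, choice.text)
```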
Performance Optimizations
- Added a W4Afp8 MoE Group GEMM operator
- CUDA Graph now supports contexts longer than 32K
- Optimized the moe_topk_select operator, improving MoE model performance
- Added a Machete WINT4 GEMM operator that improves WINT4 GEMM performance; enable it with FD_USE_MACHETE=1 (a sketch follows after this list)
- Chunked prefill is now enabled by default
- The V1 KVCache scheduling policy and context caching are now enabled by default
- MTP supports inferring more draft tokens, improving the multi-step acceptance rate
- Added pluggable lightweight sparse attention to accelerate long-context inference
- Added adaptive two-stage All-to-All communication for decode, improving communication speed
- DeepSeek series models can enable Flash-Attention-V3 in the encoder stage of the MLA backend
- Support horizontal fusion of the q_a_proj & kv_a_proj_with_mqa linear layers for DeepSeek series models
- Added a zmq dealer-mode communication management module to the API server, supporting connection reuse to further raise the maximum concurrency the service can handle
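As a rough illustration of the FD_USE_MACHETE switch referenced above, the snippet below sets the variable before launching the server from Python; the launch module path and CLI flags are assumptions based on FastDeploy's serving docs, not part of this release note.

```python
# Illustrative only: enable the Machete WINT4 GEMM path via FD_USE_MACHETE=1
# in the environment of the process that starts the FastDeploy API server.
# The module path and CLI flags below are assumptions; check the serving docs.
import os
import subprocess

env = dict(os.environ, FD_USE_MACHETE="1")  # opt into Machete WINT4 GEMM kernels
subprocess.run(
    [
        "python", "-m", "fastdeploy.entrypoints.openai.api_server",
        "--model", "path/to/your-wint4-model",  # placeholder model path
        "--port", "8180",
    ],
    env=env,
    check=True,
)
```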
Bug Fixes
- Added echo support to the completion endpoint
- Fixed a context-cache management bug under V1 scheduling
- Fixed inconsistent outputs between runs for Qwen models with top_p fixed to 0
- Fixed random crashes when starting or running uvicorn with multiple workers
- Fixed logprobs aggregation for multiple prompts in the API server completions endpoint
- Fixed an MTP sampling issue
- Fixed an incorrect cache transfer signal in PD disaggregation
- Fixed flow-control signal release when an exception is raised
- Fixed the failure to raise an exception when `max_tokens` is 0
- Fixed a hang when exiting offline inference in EP + DP mixed mode
Documentation
- Updated the best-practices docs on the usage of, and conflicts between, several techniques
- Added documentation for multi-node tensor parallel deployment
- Added documentation for data parallel deployment
Other
- Added a CI approval gate for changes to custom operators
- Cleaned up and standardized Config
What's Changed
- Describe PR diff coverage using JSON file by @XieYunshen in #3114
- [CI] add xpu ci case by @plusNew001 in #3111
- disable test_cuda_graph.py by @XieYunshen in #3124
- [CE] Add base test class for web server testing by @DDDivano in #3120
- [OPs] MoE Preprocess OPs Support 160 Experts by @ckl117 in #3121
- [Docs] Optimal Deployment by @ming1753 in #2768
- fix stop seq unittest by @zoooo0820 in #3126
- [XPU]Fix out-of-memory issue during single-XPU deployment by @iosmers in #3133
- [Code Simplification] Refactor Post-processing in VL Model Forward Method by @DrRyanHuang in #2937
- add case by @DDDivano in #3150
- fix ci by @XieYunshen in #3141
- Fa3 support for centralized deployment by @yangjianfengo1 in #3112
- Add CI cases by @ZhangYulongg in #3155
- [XPU]Updata XPU dockerfiles by @plusNew001 in #3144
- [Feature] remove dependency on enable_mm and refine multimodal's code by @ApplEOFDiscord in #3014
- 【Inference Optimize】Support automatic generation of marlin kernel by @chang-wenbin in #3149
- Update init.py by @DDDivano in #3163
- fix load_pre_sharded_checkpoint by @bukejiyu in #3152
- 【Feature】add fd plugins && rm model_classes by @gzy19990617 in #3123
- [Bug Fix] fix pd disaggregated kv cache signal by @ltd0924 in #3172
- Update test_base_chat.py by @DDDivano in #3183
- Fix approve shell scripts by @YuanRisheng in #3108
- [Bug Fix] fix the bug in test_sampler by @zeroRains in #3157
- 【Feature】support qwen3 name_mapping by @gzy19990617 in #3179
- remove useless code by @zhoutianzi666 in #3166
- [Bug fix] Fix cudagraph when use ep. by @Wanglongzhi2001 in #3130
- [Bugfix] Fix uninitialized decoded_token and add corresponding unit t… by @sunlei1024 in #3195
- [CI] add test_compare_top_logprobs by @EmmonsCurse in #3191
- fix expertwise_scale by @rsmallblue in #3181
- [FIX]fix bad_words when sending requests consecutively by @Sunny-bot1 in #3197
- [plugin] Custom model_runner/model support by @lizhenyun01 in #3186
- Add more base chat cases by @DDDivano in #3203
- Add switch to apply fine-grained per token quant fp8 by @RichardWooSJTU in #3192
- [Bug Fix]Fix bug of append attention test case by @gongshaotian in #3202
- add more cases by @DDDivano in #3207
- fix coverage report by @XieYunshen in #3198
- [New Feature] fa3 supports flash mask by @yangjianfengo1 in #3184
- [Test] scaled_gemm_f8_i4_f16 skip test while sm != 89 by @ming1753 in #3210
- [EP] Refactor DeepEP Engine Organization for Mixed Mode & Buffer Management Optimization by @RichardWooSJTU in #3182
- [Bug fix] Fix lm head bias by @RichardWooSJTU in #3185
- Ce add repitation early stop cases by @DDDivano in #3213
- [BugFix]fix test_air_top_p_sampling name by @ckl117 in #3211
- [BugFix] support real batch_size by @lizexu123 in #3109
- Ce add bad cases by @DDDivano in #3215
- revise noaux_tc by @rsmallblue in #3164
- [Bug Fix] Fix bug of MLA Attention Backend by @gongshaotian in #3176
- support qk norm for append attn by @rsmallblue in #3145
- Fix approve ci by @XieYunshen in #3212
- [Trace]add trace when fd start by @sg263 in #3174
- [New Feature] Support W4Afp8 MoE GroupGemm by @yangjianfengo1 in #3171
- Perfect approve error message by @YuanRisheng in #3224
- Fix the confused enable_early_stop when only set early_stop_config by @zeroRains in #3214
- [CI] Add ci case for min token and max token by @xjkmfa in #3229
- add some evil cases by @DDDivano in #3240
- support qwen3moe by @bukejiyu in #3084
- [Feature] support seed parameter by @lizexu123 in #3161
- 【Fix Bug】Fix the fa3 centralized-deployment bug by @yangjianfengo1 in #3235
- [bugfix]fix blockwisefp8 and all_reduce by @bukejiyu in #3243
- [Feature] multi source download by @Yzc216 in #3125
- [fix] fix completion stream api output_tokens not in usage by @liyonghua0910 in #3247
- [Doc][XPU] Update deps and fix dead links by @hong19860320 in #3252
- Fix approve ci bug by @YuanRisheng in #3239
- [Executor]Update graph test case and delete test_attention by @gongshaotian in #3257
- [CI] remove useless case by @EmmonsCurse in #3261
- Ce add benchmark test by @DDDivano in #3262
- [stop_seq] fix out-bound value for stop sequence by @zoooo0820 in #3216
- [fix] multi source download by @Yzc216 in #3259
- [Bug fix] support logprob in scheduler v1 by @rainyfly in #3249
- [feat]add fast_weights_iterator by @bukejiyu in #3258
- [Iluvatar GPU] Optimze attention and moe performance by @wuyujiji in #3234
- delete parallel_state.py by @yuanlehome in #3250
- [bugfix]qwen3_fix and qwq fix by @bukejiyu in #3255
- 【Fix】【MTP】Fix MTP sample bug by @freeliuzc in #3139
- [CI] add CI logprobs case by @plusNew001 in #3189
- Move create_parameters to init in FuseMOE for CultassBackend and TritonBackend by @zeroRains in #3148
- [Bugfix] Fix model accuracy in some ops by @gzy19990617 in #3231
- add base test ci by @XieYunshen in #3225
- [BugFix] fix too many ...
v2.1.1
Documentation
- Added documentation for multi-node tensor parallel deployment
- Updated the ERNIE model series best-practices docs to the latest usage
- Updated the CUDA Graph usage guide
New Features
- Responses now include `completion_tokens` and `prompt_tokens`, which return the raw input text and the model's raw output text
- The completion endpoint supports the `echo` parameter (a sketch follows after this list)
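A small sketch of the `echo` parameter mentioned above, against the OpenAI-compatible completions endpoint; server address and model name are placeholders.

```python
# Minimal sketch: ask the completions endpoint to echo the original prompt
# back in front of the generated text (the new `echo` support in v2.1.1).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8180/v1", api_key="EMPTY")
resp = client.completions.create(
    model="default",
    prompt="FastDeploy is",
    max_tokens=16,
    echo=True,  # prepend the prompt to the returned text
)
print(resp.choices[0].text)  # begins with "FastDeploy is"
```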
Bug Fixes
- Fixed LogProb not being returned under V1 KVCache scheduling
- Fixed the `chat_template_kwargs` parameter not taking effect
- Fixed EP parallelism issues in mixed-architecture deployments
- Fixed incorrect output token counts in completion endpoint responses
- Fixed logprobs aggregation in returned results
What's Changed
- [Docs] Add Multinode deployment document by @ltd0924 in #3416
- [docs] cherry-pick update docs by @zoooo0820 in #3422
- [Docs]update installation readme by @yongqiangma in #3435
- [Docs] release 2.1 by @ming1753 in #3441
- [Docs]Updata docs of graph opt backend by @gongshaotian in #3443
- [Feature] Support logprob in scheduler v1 for release/2.1 by @rainyfly in #3446
- [Bugfix]fix config bug in dynamic_weight_manager by @gzy19990617 in #3432
- [Feature] Pass through the chat_template_kwargs to the data processing module by @luukunn in #3469
- [CI] fix run_ci error in release/2.1 by @EmmonsCurse in #3499
- [BugFix] fix ep real_bsz by @lizexu123 in #3396
- [Feature] add prompt_tokens and completion_tokens by @memoryCoderC in #3505
- [fix] setting disable_chat_template while passing prompt_token_ids led to response error by @liyonghua0910 in #3511
- [Excutor] Fixed the issue of CUDA graph execution failure caused by d… by @gongshaotian in #3512
- [Feature] add tool parser by @luukunn in #3518
- [BUGFIX] fix ep mixed bug by @ltd0924 in #3513
- [BugFix] Api server bugs by @ltd0924 in #3530
- [Feature] Support limit thinking len for text models by @K11OntheBoat in #3527
- [Bug Fix] Close get think_end_id for XPU for now. by @K11OntheBoat in #3563
- [Feature] Support mixed deployment with yiyan adapter by @rainyfly in #3533
- [Cherry-Pick] Launch expert_service before kv_cache initialization in worker_process by @zeroRains in #3558
- 【BugFix】Support echo on the completion endpoint by @AuferGachet in #3477
- [fix] fix completion stream api output_tokens not in usage by @liyonghua0910 in #3588
- [fix] fix ZmqIpcClient.close() error by @liyonghua0910 in #3600
- [Bugfix] Correct logprobs aggregation for multiple prompts in /completions endpoint by @sunlei1024 in #3620
- [BugFix] ep mixed mode offline exit failed by @ltd0924 in #3623
- 【Bugfix】Fix the large performance drop of the 0.3B model on the 2.1 branch by @AuferGachet in #3624
- [CI] add cleanup logic in release/2.1 workflows by @EmmonsCurse in #3655
- [BugFix] fix parameter is 0 by @ltd0924 in #3663
- [fix] qwen output inconsistency when top_p=0 (#3634) by @liyonghua0910 in #3662
- Revert "[BugFix] fix parameter is 0" by @Jiang-Jia-Jun in #3681
- [feat] add metrics for yiyan adapter by @liyonghua0910 in #3615
- [bugfix]PR3663 parameter is 0 by @ltd0924 in #3679
- [BugFix] Modify the bug in Qwen2 when enabling ENABLE_V1_KVCACHE_SCHEDULER. by @lizexu123 in #3670
- Revert "[BugFix] Modify the bug in Qwen2 when enabling ENABLE_V1_KVCACHE_SCHEDULER." by @Jiang-Jia-Jun in #3719
- [Cherry-Pick] fix the bug when num_key_value_heads < tensor_parallel_size by @zeroRains in #3722
- [Optimize] Increase zmq buffer size to prevent apiserver too slowly t… by @gongshaotian in #3728
- [Fix] Do not drop result when request result slowly by @rainyfly in #3704
- [Bug fix] Fix prefix cache in v1 by @rainyfly in #3710
- [Bug fix] Fix mix deployment perf with yiyan adapter in release21 by @rainyfly in #3703
Full Changelog: v2.1.0...v2.1.1
v2.1.0
FastDeploy v2.1.0 further improves user experience and service stability by upgrading the KVCache scheduling mechanism, strengthening high-concurrency capabilities, and enriching sampling strategies; it boosts inference performance through CUDA Graph, MTP, and several other optimizations; in addition, it adds inference support for the open-source ERNIE models on more domestic Chinese hardware.
User Experience Improvements
- KVCache scheduling upgrade: input and output KVCache are now managed jointly, fixing the OOM issues previously caused by a misconfigured `kv_cache_ratio` and the premature end of generation in multimodal models when output KVCache ran short. Enable it at deployment time with the environment variable `export ENABLE_V1_KVCACHE_SCHEDULER=1` (it will be on by default in the next release); `kv_cache_ratio` then no longer needs to be set. Recommended.
- Better behavior under high concurrency: added `max_concurrency`/`max_waiting_time` to cap concurrency and reject requests that wait too long, improving user experience and keeping the service stable.
- Richer sampling options: added `min_p` and `top_k_top_p` sampling (see the sampling documentation for usage), plus early stopping based on a repetition policy or a stop-word list (see the early-stopping documentation); a request sketch follows after this list.
- Improved serving capabilities: added parameters such as `return_token_ids`/`include_stop_str_in_output`/`logprobs` to return more complete inference information.
- Better performance with default parameters: mitigated the performance drop that occurred when the default `max_num_seqs` did not match the actual concurrency, so `max_num_seqs` no longer has to be tuned by hand.
Inference Performance Optimizations
- CUDA Graph covers more scenarios: it now covers multi-GPU inference and can be combined with context caching and Chunked Prefill, delivering 17%–91% speedups on the ERNIE 4.5 and Qwen3 model series; see the best-practices docs for details.
- Faster MTP speculative decoding: optimized operator performance and reduced CPU scheduling overhead, improving overall performance; compared with v2.0.0, MTP speculative decoding is now also supported for ERNIE-4.5-21B-A3B.
- Operator performance optimizations: improved the W4A8, KVCache INT4, and WINT2 Group GEMM compute kernels; for example, the ERNIE-4.5-300B-A47B WINT2 model is 25.5% faster.
- PD disaggregation validated on more models: improved the FlashAttention backend on P nodes to speed up long-context inference, validated on lightweight models such as ERNIE-4.5-21B-A3B.
Domestic Hardware Deployment Upgrades
- Added support for deploying ERNIE-4.5-21B-A3B on Kunlunxin P800; see the Kunlunxin P800 deployment docs for details.
- Added support for deploying the ERNIE 4.5 text model series on Hygon K100-AI; see the Hygon K100-AI deployment docs for details.
- Added support for deploying the ERNIE 4.5 text model series on Enflame S60; see the Enflame S60 deployment docs for details.
- Added support for deploying ERNIE-4.5-300B-A47B and ERNIE-4.5-21B-A3B on Iluvatar Tiangai 150, with inference performance optimizations; see the Iluvatar deployment docs for details.
ERNIE 4.5 model support on domestic hardware (✅ supported, 🚧 in progress, ⛔ not planned):
Model | Kunlunxin P800 | Ascend 910B | Hygon K100-AI | Iluvatar Tiangai 150 | MetaX Xiyun C550 | Enflame S60/L600
---|---|---|---|---|---|---
ERNIE4.5-VL-424B-A47B | 🚧 | 🚧 | ⛔ | ⛔ | ⛔ | ⛔
ERNIE4.5-300B-A47B | ✅ | 🚧 | ✅ | ✅ | 🚧 | ✅
ERNIE4.5-VL-28B-A3B | 🚧 | 🚧 | ⛔ | 🚧 | ⛔ | ⛔
ERNIE4.5-21B-A3B | ✅ | 🚧 | ✅ | ✅ | ✅ | ✅
ERNIE4.5-0.3B | ✅ | 🚧 | ✅ | ✅ | ✅ | ✅
Related Documentation and Notes
- Upgraded the PaddlePaddle framework dependency: **FastDeploy v2.1.0 requires PaddlePaddle v3.1.1**; see the installation guide on the PaddlePaddle website for installation instructions
- Service deployment requests in FastDeploy v2.1.0 should no longer use the metadata field (deprecated: still usable in v2.1.0, but it will be removed in the future); use extra_body instead, see the supported-parameters documentation
- FastDeploy multi-hardware installation and build guide
- FastDeploy deployment parameters
- Serving deployment usage guide
- GPU deployment best practices
A more detailed list of changes follows.
- New Features
  - The D service in PD disaggregation supports W4A8 online/offline quantization
  - PD disaggregation supports chunk-by-chunk KVCache transfer when Chunked Prefill is enabled
  - Support returning logprobs
  - Support collecting request processing status via OpenTelemetry
  - Added the return_token_ids parameter to return a request's input and output token id lists
  - Added the include_stop_str_in_output parameter to return the stop string in the output
  - Added the enable_thinking parameter for QwQ models to toggle thinking mode
  - Added repetition-based early stopping
  - Added the stop parameter
  - Added multi-node tensor parallel deployment support
  - Added request concurrency and timeout controls
  - Support min_p/top_k_top_p sampling
  - Support bad_words
  - Improved the OpenAI API server interface: extra parameters can be passed via extra_body, and metadata is deprecated (a sketch follows after this list)
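A hedged sketch of the request-level switches listed above (return_token_ids, include_stop_str_in_output), passed through extra_body since the standard OpenAI schema does not define them; the exact field placement should be checked against the parameter documentation.

```python
# Minimal sketch: combine the standard `stop` parameter with the new
# request-level switches from this release, sent via extra_body.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8180/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Say hello, then say DONE."}],
    stop=["DONE"],
    extra_body={
        "return_token_ids": True,            # also return input/output token id lists
        "include_stop_str_in_output": True,  # keep the stop string in the output text
    },
)
print(resp.choices[0].message.content)
```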
- Performance Optimizations
  - Optimized W4A8 decode compute performance under PD disaggregation with EP parallelism
  - Optimized the WINT2 Group-GEMM kernel via weight reordering
  - Chunked Prefill can be enabled together with the MTP optimization
  - Optimized MTP & speculative decoding inference performance
  - Optimized Blockwise FP8 quantization performance with Triton
  - CUDA Graph supports padded batches, greatly reducing memory usage
  - Added a Custom All Reduce operator; CUDA Graph supports TP parallelism
  - CUDA Graph can be enabled together with Chunked Prefill
  - Optimized the GetBlockShapeAndSplitKVBlock operator
  - Attention supports C4 asymmetric quantized inference
  - The FlashAttn backend supports TP parallelism and FlashAttention V2
  - Upgraded the KVCache management mechanism (currently GPU only), enabled via export ENABLE_V1_KVCACHE_SCHEDULER=1
  - Chunked Prefill with C16/C8/C4 can be enabled under FlashAttention V3
  - The serving engine automatically aggregates generation results, improving server-client communication efficiency
- Multi-Hardware Support
  - Kunlunxin P800 supports ERNIE-21B-A3B Wint4/Wint8 models
  - Hygon K100-AI supports the ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B models
  - Enflame S60 supports the ERNIE 4.5 model series
  - Iluvatar supports the ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B models, with performance optimizations
- Bug Fixes
  - Fixed wrong first tokens from the D service when MTP is enabled in the PD-disaggregated architecture
  - Fixed out-of-range token sampling in SFT-tuned ERNIE text models
  - Fixed OOM when launching XPU on a card other than card 0
  - Fixed the performance drop on XPU with ENABLE_V1_KVCACHE_SCHEDULER=1
  - Fixed multimodal model crashes under concurrent inference with Chunked Prefill
  - Fixed garbled output from the Qwen3-8B model
  - Fixed hard-coded RMSNorm
  - Fixed qkv_bias being undefined in linear.py
  - Fixed an error when max_tokens=1
  - Fixed the token_processor input log format
  - Fixed a service hang under chunked_prefill when the chunk size is smaller than the block size
  - Fixed data saving in VL scenarios
- Documentation
  - Added a Chinese README and MkDocs support
  - Added deployment best-practices documentation for each model
  - Added documentation for sampling and early stopping
  - Updated the CUDA Graph and dynamic-to-static interfaces and documentation
  - Updated the supported-models documentation
- Other
  - Added Chinese and English documentation support
  - Improved error messages when loading quantized model parameters
  - Unified the ModelRunner for multimodal and text-only models
  - Updated the WINT2 Triton operator based on triton_utils
  - Consolidated the scattered Config implementations in the code
What's Changed
- add wint2 performance by @ZhangHandi in #2673
- Update gh-pages.yml by @DDDivano in #2680
- add --force-reinstall --no-cache-dir when pip install fastdeploy*.whl by @yuanlehome in #2682
- [Sync] Update to latest code by @Jiang-Jia-Jun in #2679
- [doc] update docs by @kevincheng2 in #2690
- [Bug] fix logger format by @ltd0924 in #2689
- [feat] support fa3 backend for pd disaggregated by @yuanlehome in #2695
- add quick benchmark script by @DDDivano in #2703
- [Doc] modify reasoning_output docs by @LiqinruiG in #2696
- [MTP] Support chunked_prefill in speculative decoding(MTP) by @freeliuzc in #2705
- [RL] update reschedule finish reason by @ltd0924 in #2709
- [feature]add fd whl version info by @gzy19990617 in #2698
- Extract eh_proj Layer from ParallelLMHead for MTP to Avoid Weight Transposition Issue by @Deleter-D in #2707
- Add XPU CI, test=model by @quanxiang-liu in #2701
- [CI] Add validation for MTP and CUDAGraph by @EmmonsCurse in #2710
- add support QWQ enable_thinking by @lizexu123 in #2706
- [BugFix] fix paddle_git_commit_id error by @EmmonsCurse in #2714
- spec token map lazy. by @wtmlon in #2715
- fix bug. by @wtmlon in #2718
- Modify XPU CI, test=model by @quanxiang-liu in #2721
- [LLM] support multi node deploy by @ltd0924 in #2708
- [Doc]Update eb45-0.3B minimum memory requirement by @ckl117 in #2686
- [RL] Check if the controller port is available by @lddfym in #2724
- remove redundant install whl of fastdeploy by @yuanlehome in #2726
- support FastDeploy version setting by @XieYunshen in #2725
- [iluvatar_gpu] Adapt for iluvatar gpu by @liddk in #2684
- [Optimize] Optimize tensorwise fp8 performance by @ming1753 in #2729
- [Bug fix] fix complie bug when sm < 89 by @ming1753 in #2738
- [SOT] Remove BreakGraph with `paddle.maximum` by @DrRyanHuang in #2731
- 【Fearture】support qwen2 some func by @gzy19990617 in #2740
- [GCU] Support gcu platform by @EnflameGCU in #2702
- [Bug fix] Fixed the garbled text issues in Qwen3-8B by @lizexu123 in #2737
- [Bug fix] Add the missing `pod_ip` param to the launch_cache_manager function. by @Wanglongzhi2001 in #2742
- [Bug fix] fix attention rank init by @RichardWooSJTU in #2743
- add precision check for ci by @xiegetest in #2732
- [SOT] Make custom_op dy&st unified by @DrRyanHuang in #2733
- Revert "[Bug fix] fix attention rank init" by @RichardWooSJ...
v2.0.0
FastDeploy 2.0: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
News
🔥 Released FastDeploy v2.0: supports inference and deployment for ERNIE 4.5. We also open-source an industrial-grade PD disaggregation solution with context caching and dynamic role switching, improving resource utilization and further enhancing inference performance for MoE models.
About
FastDeploy is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers production-ready, out-of-the-box deployment solutions with core acceleration technologies:
- 🚀 Load-Balanced PD Disaggregation: Industrial-grade solution featuring context caching and dynamic instance role switching. Optimizes resource utilization while balancing SLO compliance and throughput.
- 🔄 Unified KV Cache Transmission: Lightweight high-performance transport library with intelligent NVLink/RDMA selection.
- 🤝 OpenAI API Server and vLLM Compatible: One-command deployment with vLLM interface compatibility.
- 🧮 Comprehensive Quantization Format Support: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
- ⏩ Advanced Acceleration Techniques: Speculative decoding, Multi-Token Prediction (MTP) and Chunked Prefill.
- 🖥️ Multi-Hardware Support: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU etc.
Supported Models
Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
---|---|---|---|---|---|---|---|
ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅ | ✅ | ✅ | ✅(WINT4) | WIP | 128K |
ERNIE-4.5-300B-A47B-Base | BF16/WINT4/WINT8 | ✅ | ✅ | ✅ | ✅(WINT4) | WIP | 128K |
ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP | 128K |
ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP | 128K |
ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | WIP | ✅ | 128K |
ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | WIP | ✅ | 128K |
ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ❌ | ✅ | ✅ | ❌ | ✅ | 128K |