New Features
- bad_words in the sampling strategy now accepts token ids
- Added support for the Qwen2.5-VL model series (video requests do not support enable-chunked-prefill)
- The API-Server completions endpoint `prompt` field now accepts lists of token ids, with batched inference supported
- Added function call parsing; results can be parsed via `tool-call-parse`
- Support for a custom chat_template at service startup or in individual requests
- Support loading a model's chat_template.jinja file
- Error responses now include the exception stack trace, with more complete exception logging
- Added a speculative decoding method combining MTP and Ngram
- Added Tree Attention support for speculative decoding
- Enhanced model loading: weights are loaded through an iterator, further improving load speed and memory usage
- Improved the API-Server log format, adding timestamps
- Added a plugin mechanism that lets users extend FastDeploy with custom features without modifying its core code
- Marlin kernel files can be generated automatically from templates at build time
- Support loading ERNIE and Qwen series models in native HuggingFace Safetensors format
- Improved DP+TP+EP hybrid parallel inference
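As a sketch of the completions change above, the `prompt` field can now carry token-id lists, including a batch of them in one request. This builds a request body for the OpenAI-compatible `/v1/completions` endpoint; the token id values and the model name are placeholders, not taken from these notes:

```python
import json

# Hypothetical batched request body for /v1/completions:
# `prompt` holds two token-id lists, so both sequences are
# inferred in a single call. Token ids are placeholders.
payload = {
    "model": "default",
    "prompt": [[1, 3342, 88], [1, 907, 15, 2]],  # batch of token-id prompts
    "max_tokens": 16,
}
body = json.dumps(payload)
print(body)
```

A plain string, or a list of strings, remains valid for `prompt` as before.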
Performance Optimizations
- Added a W4Afp8 MoE Group GEMM operator
- CUDA Graph now supports contexts longer than 32K
- Optimized the moe_topk_select operator, improving MoE model performance
- Added a Machete WINT4 GEMM operator to improve WINT4 GEMM performance; enable it with FD_USE_MACHETE=1
- Chunked prefill is enabled by default
- The V1 KVCache scheduling policy and context caching are enabled by default
- MTP supports inferring more draft tokens, improving the multi-step acceptance rate
- Added pluggable, lightweight sparse attention to accelerate long-context inference
- Added adaptive two-stage All-to-All communication for the Decode phase, improving communication speed
- DeepSeek series models can enable Flash-Attention-V3 in the MLA backend encoder stage
- Support horizontal fusion of the q_a_proj & kv_a_proj_with_mqa linear layers for DeepSeek series models
- Added a zmq dealer-mode communication management module to the API-Server; connection reuse further raises the maximum concurrency the service can support
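The Machete WINT4 path above is opt-in via an environment variable. A minimal launch sketch follows; the server entrypoint path and the model name are assumptions for illustration, not part of these notes:

```shell
# Enable the Machete WINT4 GEMM kernel for this server process
# (per the FD_USE_MACHETE=1 switch described above).
export FD_USE_MACHETE=1

# Launch the OpenAI-compatible server; entrypoint and model are assumptions.
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-21B-A3B-Paddle \
  --port 8188
```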
Bug Fixes
- Added echo support to the completions endpoint
- Fixed context cache management under the V1 scheduler
- Fixed inconsistent outputs across two runs of Qwen models with top_p fixed at 0
- Fixed uvicorn multi-worker startup failures and random crashes at runtime
- Fixed logprobs aggregation for multiple prompts in the API-Server completions endpoint
- Fixed an MTP sampling issue
- Fixed an incorrect cache transfer signal in PD disaggregation
- Fixed flow-control signal release when an exception is thrown
- Fixed failure to raise an exception when `max_tokens` is 0
- Fixed a hang on offline inference exit in EP + DP mixed mode
Documentation
- Updated the usage and conflict notes for several techniques in the best-practices docs
- Added multi-node tensor parallel deployment docs
- Added data parallel deployment docs
Others
- Added a CI approve gate for custom-operator changes
- Cleaned up and standardized Config
What's Changed
- Describe PR diff coverage using JSON file by @XieYunshen in #3114
- [CI] add xpu ci case by @plusNew001 in #3111
- disable test_cuda_graph.py by @XieYunshen in #3124
- [CE] Add base test class for web server testing by @DDDivano in #3120
- [OPs] MoE Preprocess OPs Support 160 Experts by @ckl117 in #3121
- [Docs] Optimal Deployment by @ming1753 in #2768
- fix stop seq unittest by @zoooo0820 in #3126
- [XPU]Fix out-of-memory issue during single-XPU deployment by @iosmers in #3133
- [Code Simplification] Refactor Post-processing in VL Model Forward Method by @DrRyanHuang in #2937
- add case by @DDDivano in #3150
- fix ci by @XieYunshen in #3141
- Fa3 support for centralized deployment by @yangjianfengo1 in #3112
- Add CI cases by @ZhangYulongg in #3155
- [XPU]Update XPU dockerfiles by @plusNew001 in #3144
- [Feature] remove dependency on enable_mm and refine multimodal's code by @ApplEOFDiscord in #3014
- 【Inference Optimize】Support automatic generation of marlin kernel by @chang-wenbin in #3149
- Update init.py by @DDDivano in #3163
- fix load_pre_sharded_checkpoint by @bukejiyu in #3152
- 【Feature】add fd plugins && rm model_classes by @gzy19990617 in #3123
- [Bug Fix] fix pd disaggregated kv cache signal by @ltd0924 in #3172
- Update test_base_chat.py by @DDDivano in #3183
- Fix approve shell scripts by @YuanRisheng in #3108
- [Bug Fix] fix the bug in test_sampler by @zeroRains in #3157
- 【Feature】support qwen3 name_mapping by @gzy19990617 in #3179
- remove useless code by @zhoutianzi666 in #3166
- [Bug fix] Fix cudagraph when use ep. by @Wanglongzhi2001 in #3130
- [Bugfix] Fix uninitialized decoded_token and add corresponding unit t… by @sunlei1024 in #3195
- [CI] add test_compare_top_logprobs by @EmmonsCurse in #3191
- fix expertwise_scale by @rsmallblue in #3181
- [FIX]fix bad_words when sending requests consecutively by @Sunny-bot1 in #3197
- [plugin] Custom model_runner/model support by @lizhenyun01 in #3186
- Add more base chat cases by @DDDivano in #3203
- Add switch to apply fine-grained per token quant fp8 by @RichardWooSJTU in #3192
- [Bug Fix]Fix bug of append attention test case by @gongshaotian in #3202
- add more cases by @DDDivano in #3207
- fix coverage report by @XieYunshen in #3198
- [New Feature] fa3 supports flash mask by @yangjianfengo1 in #3184
- [Test] scaled_gemm_f8_i4_f16 skip test while sm != 89 by @ming1753 in #3210
- [EP] Refactor DeepEP Engine Organization for Mixed Mode & Buffer Management Optimization by @RichardWooSJTU in #3182
- [Bug fix] Fix lm head bias by @RichardWooSJTU in #3185
- Ce add repetition early stop cases by @DDDivano in #3213
- [BugFix]fix test_air_top_p_sampling name by @ckl117 in #3211
- [BugFix] support real batch_size by @lizexu123 in #3109
- Ce add bad cases by @DDDivano in #3215
- revise noaux_tc by @rsmallblue in #3164
- [Bug Fix] Fix bug of MLA Attention Backend by @gongshaotian in #3176
- support qk norm for append attn by @rsmallblue in #3145
- Fix approve ci by @XieYunshen in #3212
- [Trace]add trace when fd start by @sg263 in #3174
- [New Feature] Support W4Afp8 MoE GroupGemm by @yangjianfengo1 in #3171
- Perfect approve error message by @YuanRisheng in #3224
- Fix the confused enable_early_stop when only set early_stop_config by @zeroRains in #3214
- [CI] Add ci case for min token and max token by @xjkmfa in #3229
- add some evil cases by @DDDivano in #3240
- support qwen3moe by @bukejiyu in #3084
- [Feature] support seed parameter by @lizexu123 in #3161
- 【Fix Bug】Fix fa3 centralized deployment bug by @yangjianfengo1 in #3235
- [bugfix]fix blockwisefp8 and all_reduce by @bukejiyu in #3243
- [Feature] multi source download by @Yzc216 in #3125
- [fix] fix completion stream api output_tokens not in usage by @liyonghua0910 in #3247
- [Doc][XPU] Update deps and fix dead links by @hong19860320 in #3252
- Fix approve ci bug by @YuanRisheng in #3239
- [Executor]Update graph test case and delete test_attention by @gongshaotian in #3257
- [CI] remove useless case by @EmmonsCurse in #3261
- Ce add benchmark test by @DDDivano in #3262
- [stop_seq] fix out-bound value for stop sequence by @zoooo0820 in #3216
- [fix] multi source download by @Yzc216 in #3259
- [Bug fix] support logprob in scheduler v1 by @rainyfly in #3249
- [feat]add fast_weights_iterator by @bukejiyu in #3258
- [Iluvatar GPU] Optimize attention and moe performance by @wuyujiji in #3234
- delete parallel_state.py by @yuanlehome in #3250
- [bugfix]qwen3_fix and qwq fix by @bukejiyu in #3255
- 【Fix】【MTP】Fix MTP sample bug by @freeliuzc in #3139
- [CI] add CI logprobs case by @plusNew001 in #3189
- Move create_parameters to init in FuseMOE for CultassBackend and TritonBackend by @zeroRains in #3148
- [Bugfix] Fix model accuracy in some ops by @gzy19990617 in #3231
- add base test ci by @XieYunshen in #3225
- [BugFix] fix too many open files problem by @ltd0924 in #3256
- [Executor] Fixed the issue of CUDA graph execution failure caused by different branches during decoding by @littledgg in #3223
- [Bug Fix] Fix scheduler bug in develop by @rainyfly in #3292
- Split cases by @DDDivano in #3297
- Update _base_test.yml by @DDDivano in #3298
- [Logprob] merge logprob into _process_batch_output func by @ckl117 in #3266
- Update _base_test.yml by @DDDivano in #3299
- Acc by @DDDivano in #3301
- 【CI case】include total_tokens in the last packet of completion interface stream output by @xjkmfa in #3279
- [Bug fix] fix bug for scheduler v0 by @rainyfly in #3308
- [BugFix] num_seqs by @lizexu123 in #3291
- [V1 Loader] Support DeepSeekV3(bf16) by @zeroRains in #3294
- update base test by @DDDivano in #3304
- enhance eos_tokens by @yuanlehome in #3274
- Revert "[BugFix] num_seqs" by @Jiang-Jia-Jun in #3316
- Update deploy.py by @ZhangYulongg in #3310
- Launch expert_service before kv_cache initialization in worker_process by @zeroRains in #3045
- [Bug Fix] fix uvicorn multi worker error by @kevincheng2 in #3300
- fix ci pypi index error by @XieYunshen in #3326
- [Docs]fix sampling docs by @Sunny-bot1 in #3113
- Update _base_test.yml by @DDDivano in #3331
- [Test ] fix unittest by @zoooo0820 in #3328
- [Docs]remove redundent "bad_words" by @Sunny-bot1 in #3332
- Remove useless code by @Jiang-Jia-Jun in #3337
- [Bug fix] fix block num setting in scheduler v1 for develop by @rainyfly in #3303
- [Doc] Add Chinese/English language toggle by @yangjianfengo1 in #3318
- [Bug Fix] fix vl V1 schedule bug by @ming1753 in #3323
- [Bug fix] fix ep lm head by @RichardWooSJTU in #3244
- [BugFix]fix namemaping when use it many times by @gzy19990617 in #3320
- Use latest PaddlePaddle package by @XieYunshen in #3347
- [BugFix] v1/completions add finish_reason by @memoryCoderC in #3246
- Pre ce modified by @XieYunshen in #3335
- add test for CustomAllreduce by @zhink in #3313
- Completion add raw_prediction/text_after_process by @memoryCoderC in #3356
- add Tool Parser by @luukunn in #3272
- Refactor moe_topk_select op to use apply_norm_weight as a template parameter by @Sunny-bot1 in #3345
- [BUG FIX][SOT] Fix parameter order for custom op: rms_norm_eps by @DrRyanHuang in #3348
- [MetaxGPU] Support FastDeploy on metax gpu by @Kane2011 in #3241
- [Iluvatar GPU] Modify the names of some variables by @wuyujiji in #3273
- [GCU] Enable gcu CI by @EnflameGCU in #3190
- fix TestOpenAIServingCompletion fail by @memoryCoderC in #3368
- [Loader V1] modify layername for DeepSeekV3 by @zeroRains in #3336
- Optimize CI execution workflow by @XieYunshen in #3371
- [Bug Fix] Fix V1 video bug by @ming1753 in #3388
- feat(log):add_request_and_response_log by @xiaolei373 in #3373
- [BugFix] Fix default log level of paddleformers by @Jiang-Jia-Jun in #3376
- [Optimize]Add norm_weights feature for topk_gating_softmax by @Sunny-bot1 in #3372
- 【BugFix】fix some op tests by @gzy19990617 in #3398
- [BugFix] fix real_bsz in ep by @lizexu123 in #3366
- Add requirements for running unit tests by @XieYunshen in #3350
- [BugFix] fix ErnieProcessor not set raw_prediction by @memoryCoderC in #3400
- make append_attn supports mask_offset by @carryyu in #3138
- [UT Fix] Fix bad_words test by @Sunny-bot1 in #3385
- [Docs] Modify readme about MTP by @freeliuzc in #3409
- 【CI】 evil case by @xjkmfa in #3359
- [V1 Loader] Support Ernie text(moe and dense) by @YuanRisheng in #3110
- [OPs] Universal optimization and Fix early_stop cuda 700 by @ckl117 in #3375
- [Doc] Add multinode deployment documents by @ltd0924 in #3417
- [Optimize]Add bias feature for topk_gating_softmax by @Sunny-bot1 in #3405
- [docs] update best practice docs by @zoooo0820 in #3420
- [Docs]XPU Update 2.1 Release Documentation by @iosmers in #3423
- [Docs] Release 2.1 docs and fix some description by @ming1753 in #3424
- [Bugs] Fix DeepGEMM pre-compile tools. by @Deleter-D in #3351
- add accuracy check ci by @XieYunshen in #3389
- [Docs]Modify the gpu-memory-utilization of the 128K 8-card Wint4 model to 0.95 by @iosmers in #3428
- [docs] fix some docs error by @zoooo0820 in #3439
- Update README by @yangjianfengo1 in #3426
- [Docs]update installation readme by @yongqiangma in #3429
- [Docs]Update docs of graph opt backend by @gongshaotian in #3442
- [Docs] Update mkdocs.yml by @gongshaotian in #3444
- [BugFix] CPU init.py Delete import base by @ckl117 in #3448
- Add ci case by @ZhangYulongg in #3355
- [Feature][MTP]update multi-draft-token strategy by @freeliuzc in #3369
- [Bugfix]fix some bug in plugins and config bug in dynamic_weight_manager by @gzy19990617 in #3434
- Perf by @DDDivano in #3453
- [BugFix]fix mtp_rej_topp missing top_k_list input by @ckl117 in #3450
- [Executor] Increase buffer size to prevent address corruption; add forward metadata debug tool by @littledgg in #3404
- [Executor] Change cudagraph hashkey from batch size to num_tokens by @littledgg in #3454
- add custom chat template by @luukunn in #3251
- add publish workflow by @XieYunshen in #3063
- [Code Simplification] remove cum_offsets by @lizexu123 in #3410
- 【BugFix】Support echo in the completion endpoint by @AuferGachet in #3245
- 【Inference Optimize】DeepSeek-v3 model inference performance optimization by @chang-wenbin in #3455
- [BugFix] fix num_running_requests in cuda_graph by @lizexu123 in #3457
- [Feature] Pass through the `chat_template_kwargs` to the data processing module by @luukunn in #3421
- [FixBug] compute early stopping with real batch size by @zeroRains in #3418
- [BugFix] fix control signal release failed by @ltd0924 in #3390
- [BugFix] fix request_output sampling_params (#3154) by @ckl117 in #3464
- [V1 Loader] Support MOE parameters create and load for DeepGemm and marlin backend by @zeroRains in #3447
- [CI] add test generation demo by @ltd0924 in #3270
- 【New Feature】Support Fp8 group GEMM with 2:4 sparsity by @yangjianfengo1 in #3463
- add error traceback info by @kevincheng2 in #3419
- Add stable ci by @XieYunshen in #3460
- add error log to file by @xiaolei373 in #3431
- [bugfix] temporarily lock the paddlepaddle-gpu==3.0.0.dev20250818 by @EmmonsCurse in #3482
- Add eb45-8k-fp8-tp1-dp8_ep.yaml by @ZhangYulongg in #3485
- add e2e cases by @XieYunshen in #3476
- Update disaggregated.md by @ZhangYulongg in #3495
- Add custom op declaration for `all_reduce` by @DrRyanHuang in #3473
- [BugFix][V1 Loader] fix the bug in create weight for block_wise_fp8 by @zeroRains in #3486
- [Feature] add prompt_tokens and completion_tokens by @memoryCoderC in #3504
- CE build task (triggered on merge) by @XieYunshen in #3491
- [Feature] add dealer manager to reuse the connection by @ltd0924 in #3471
- [XPU] Cherry-pick Commit Release2.1 44c0c7e into develop by @qw86972190 in #3481
- Update CI by @ZhangYulongg in #3474
- [Feature] Models api by @Yzc216 in #3073
- [Feature] add tool parser by @luukunn in #3483
- [fix] setting disable_chat_template while passing prompt_token_ids led to response error by @liyonghua0910 in #3228
- [fix] fix output tokens count in streaming completion api by @liyonghua0910 in #3507
- Add PD CI case by @ZhangYulongg in #3490
- 【bug fix】Fix slow w4a8 compilation by @yangjianfengo1 in #3510
- Unify server-side and model-side Config(Part-5) by @YuanRisheng in #3497
- Update ci by @ZhangYulongg in #3519
- [V1 Loader]Ernie VL support loader v1 by @YuanRisheng in #3494
- disable stable test by @XieYunshen in #3529
- [CI] add container naming and cleanup logic in workflows by @EmmonsCurse in #3526
- [Feature][SpeculativeDecoding]Support tree-attention by @freeliuzc in #3514
- [CI] fix xpu ci bug by @plusNew001 in #3535
- Fix fdconfig bugs by @YuanRisheng in #3528
- [Feature] Add Qwen25-VL Processor by @lddfym in #3501
- Modified to support custom all reduce by default by @zhink in #3538
- [UnitTest][Copilot] Improve unit test coverage for entrypoints modules by @Copilot in #3546
- fix test name by @XieYunshen in #3493
- [V1 Loader] Support qwen2(bf16) by @zeroRains in #3502
- [BugFix]Fix FDconfig for SOT by @YuanRisheng in #3556
- Revert "[UnitTest][Copilot] Improve unit test coverage for entrypoints modules" by @Jiang-Jia-Jun in #3564
- [V1 Loader] support weight_only by @bukejiyu in #3413
- [V1 Loader]support qwen2 weight only by @bukejiyu in #3571
- [Feature][XPU] add custom kernels for mtp by @lengxia in #3537
- [CI] add sot test by @EmmonsCurse in #3579
- Modify the existing coverage collection method by @XieYunshen in #3573
- support w4afp8 EP inference by @rsmallblue in #3044
- Add coverage skip by @XieYunshen in #3553
- [Feature] Add temp_scaled_logprobs and top_p_normalized_logprobs parameters for logits and logprobs post processing by @ckl117 in #3552
- [CI] temporarily disable sot test due to occasional timeout issue by @EmmonsCurse in #3586
- [MetaxGPU] Adapt to the latest fastdeploy on metax gpu by @Kane2011 in #3492
- [Executor] CUDAGraph support RL training by @gongshaotian in #3265
- [Bugfix] fix api server control signal bugs by @ltd0924 in #3531
- [Features] support hugging face qwen3 dense and qwen2 model by @lizexu123 in #3574
- [Feature] bad words support v1 scheduler and specifiy token ids by @Sunny-bot1 in #3608
- [CudaGraph][SOT] Add unit tests for splitting the static graph into piecewise graphs that support cuda_graph by @DrRyanHuang in #3590
- [CE]add x1 w4a8c8 benchmark config by @tianlef in #3607
- [NewFeatures] support noex rope3d by @xiaoxiaohehe001 in #3542
- [CI] reopen sot test by @EmmonsCurse in #3613
- 【Inference Optimize】Support DSK qkv_a_proj horizontal fusion under V0 Loader by @chang-wenbin in #3591
- [feature][MTP]Support new speculative decoding method "hybrid mtp with ngram" by @freeliuzc in #3610
- Supports DP+TP+EP hybrid parallel deployment strategy by @carryyu in #3489
- adaptive rms_norm's dtype by @yuanlehome in #3617
- [NewFeatures] support eplb by @xiaoxiaohehe001 in #3547
- [CUDAGraph]Add debug func by @gongshaotian in #3616
- [v1 loader]support fp8 by @bukejiyu in #3593
- [Bugfix] Correct logprobs aggregation for multiple prompts in /completions endpoint by @sunlei1024 in #3618
- [CI] Standard unittest by @YuanRisheng in #3606
- rename ernie_xxx to ernie4_5_xxx by @yuanlehome in #3621
- [NewFeature]Support dp multi api server && Fix some bug in mixed ep && merge develop by @gzy19990617 in #3598
- [CUDAGraph]Switch the scope so that output buffer of CUDAGraph can automatically release by @gongshaotian in #3612
- [Feature] block sparse attention by @yangjianfengo1 in #3209
- fix publish task by @XieYunshen in #3635
- [New Features] support fa3 rope3d by @xiaoxiaohehe001 in #3622
- [Precision] Support lm_head layer running in float32 by @ckl117 in #3597
- update xpu ci by @plusNew001 in #3632
- [BugFix] Fix qwen3 lm_head load bug by @ckl117 in #3639
- 【CI case】for echo finish_reason text_after_process and raw_prediction check by @xjkmfa in #3630
- [docs] Update best practice doc by @zoooo0820 in #3539
- Change paddlepaddle-xpu installation command by @plusNew001 in #3646
- deepgemm don't support tp+ep (for ci) by @carryyu in #3638
- [fix] qwen output inconsistency when top_p=0 by @liyonghua0910 in #3634
- Revert "[Feature] block sparse attention" by @Jiang-Jia-Jun in #3647
- 【fix】 undefined cuPointerGetAttribute symbol error by @Lmywl in #3628
- delete ernie4_5_vl_tokenizer by @yuanlehome in #3631
- [BugFix] fix ce bugs by @ltd0924 in #3641
- [BugFix] Modify the bug in Qwen2 when enabling ENABLE_V1_KVCACHE_SCHEDULER. by @lizexu123 in #3625
- [V1 Loader]support param create and load for wint2 and xpu backend by @zeroRains in #3581
- [Optimize]support machete weight only gemm by @Sunny-bot1 in #3561
- [BugFix] fix parameter is 0 by @ltd0924 in #3592
- 【Hackathon 9th No.77】supplementary unit test for get_filtered_metrics by @Echo-Nie in #3578
- Support 45t fp8 8 GPU by @zhoutianzi666 in #3659
- 【New Feature】Support w4afp8 in centralized deployment by @yangjianfengo1 in #3644
- [BufFix]Fix rl bugs by @YuanRisheng in #3654
- fix w4afp8_gemm_scale_permute import error on A100 by @rsmallblue in #3611
- Update run_ci_xpu.sh to lock xvllm version by @plusNew001 in #3671
- [Docs] add fastdeploy_unit_test_guide.md by @mattheliu in #3484
- Fix target_version by @co63oc in #3159
- fix typos by @co63oc in #3633
- [BugFix] pd disaggregate port format by @ltd0924 in #3669
- fix by @bukejiyu in #3676
- [BugFix] fix logger by @ltd0924 in #3666
- [BugFix] ep mixed offline exit by @ltd0924 in #3661
- Add with_output version AppendAttention by @Lmywl in #3302
- 【fix】fix fp8 deepgemm_moe TP parallel Bug by @Lmywl in #3658
- add concurrency cases by @DDDivano in #3689
- [BugFix]fix dp&ep&tp and muti node infer by @gzy19990617 in #3629
- Update _base_test.yml by @DDDivano in #3690
- [V1 Loader]Ernie mtp support loader v1 by @YuanRisheng in #3675
- [Bug Fix] VL Support w4a8/w4afp8 by @ming1753 in #3686
- add input_processor plugin by @yuanlehome in #3657
- [v1 loader]fix qwen3 235B tp 8 by @bukejiyu in #3697
- update ci envs for structured output by @kevincheng2 in #3687
- 【DCU】enable dcu ci by @lifulll in #3402
- 【Hackathon 9th No.70】supplementary unit test for CPUPlatform and CUDAPlatform by @Echo-Nie in #3580
- fix MultimodalRegistry by @yuanlehome in #3699
- fix scaled_gemm_f8_i4_f16_weight_quantize input by @co63oc in #3685
- MoE Default use triton's blockwise fp8 in TP Case by @zhoutianzi666 in #3678
- [feat] completion api supports passing input token ids in either `prompt` or `prompt_token_ids` by @liyonghua0910 in #3311
- [BugFix] fix key error in mm by @yuanlehome in #3702
- [NewFeature] support w4afp8 eplb by @xiaoxiaohehe001 in #3680
- Specify dtype as int32 for paddle.cumsum due to API Change by @DrRyanHuang in #3692
- Fix mtp tp group by @carryyu in #3648
- [CudaGraph] [SOT] Support spliting static graph into piecewise graph with cuda_graph by @zyfncg in #3478
- add w4afp8 offline script by @rsmallblue in #3636
- [CI] update paddle version to nightly by @EmmonsCurse in #3698
- [Model]support qwen2_5_vl by @xyxinyang in #3557
- [Feature] block sparse attention by @yangjianfengo1 in #3668
- [Feature]support load eb 0.3B and 21B torch model by @ckl117 in #3660
- Optimize coverage jobs by @XieYunshen in #3683
- [Optimize] Increase zmq buffer size to prevent apiserver too slowly t… by @rainyfly in #3723
- [Bug Fix] fix the bug when num_key_value_heads < tensor_parallel_size in launching kv_cahce_manager by @zeroRains in #3717
- [Features] support hugging face qwen3 moe by @lizexu123 in #3649
- [Attn] max_partition_size default is 1024 by @ckl117 in #3720
- [Code Simplification] delete print by @lizexu123 in #3729
- [Feature]support chat_template.jinja by @luukunn in #3721
- [Feature] Add AsyncTokenizerClient&ChatResponseProcessor with remote encode&decode support. by @sunlei1024 in #3674
- [FIX]Fix Machete compile via ENABLE_MACHETE by @Sunny-bot1 in #3727
- [feat] add metrics for yiyan adapter by @liyonghua0910 in #3614
- default enable chunked prefill by @kevincheng2 in #3731
- fix mask_offset in append_attn by @lizhenyun01 in #3745
- [Bug fix] Fix prefix cache in V1 by @rainyfly in #3715
- fix ce build job by @XieYunshen in #3777
- fix ce compile task upload error by @XieYunshen in #3788
- Fix chunked prefill by @kevincheng2 in #3778
- [Feature] Setting number of apiserver workers automatically by @Jiang-Jia-Jun in #3794
- [Feature] support model weight update in ep by @ltd0924 in #3802
- [BugFix] fix max streaming tokens invalid by @ltd0924 in #3799
- [Executor] Fix bug of import paddle with RLHF (#3781) by @gongshaotian in #3817
- [BugFix] fix qwen vl processor by @ltd0924 in #3806
- [cp] fix error of import paddle.base.core.Config (#3761) by @yuanlehome in #3804
- [v1loader]Reduce EB300B model loading time (#3700) by @bukejiyu in #3810
- [BugFix] fix scheduler by @ltd0924 in #3818
- add reasoning parser plugin by @luukunn in #3820
- [XPU] fix xpu ci bug by @plusNew001 in #3834
- 【BugFix】add moe noaux_tc tactics in triton backend by @gzy19990617 in #3821
- [XPU]Update XPU CI Case by @plusNew001 in #3844
- [XPU] Update XPU stable xvllm and xtdk version for 2.2 & Change CI Case by @plusNew001 in #3855
- 【Fix bug】The nblock of w4afp8 is fixed to 256, and the mask parameter is added to the append attn of fa3. (#3771) by @yangjianfengo1 in #3835
- [Bug Fix] Fix bug of multimodal inputs only text by @ming1753 in #3850
- 【BUG FIX】 Fixed moba single test port conflict by @yangjianfengo1 in #3865
- Support for async processor added. by @sunlei1024 in #3870
- 【BugFix】fix gpu mem oom by @gzy19990617 in #3852
- [Feature] Set scheduler v1 as default by @rainyfly in #3812
- [Bug fix] Fix prompt token ids dtype in v1 by @rainyfly in #3861
- [bugfix] scheduler by @ltd0924 in #3871
- [bug] fix finish reason by @luukunn in #3858
- fix DP&&TP by @lizhenyun01 in #3872
- [CI] update paddleformers==0.2 in release/2.2 by @EmmonsCurse in #3828
- [Fix] disable scheduler v1 in guided decoding by @rainyfly in #3877
- paddleformers==0.1.4 by @yuanlehome in #3908
- [Fix] fix qwen_vl_processor miss image_patch_id by @xyxinyang in #3894
- [Feature] support controller port in multi api server by @ltd0924 in #3895
- [BugFix] fix OOM while TPDP weight loading by @lizhenyun01 in #3882
- [Fix] mv connection_manager init by @ltd0924 in #3902
- 【CP】Compatible with EB 0.3B torch model arch by @ckl117 in #3914
- paddleformers==0.2.1 by @yuanlehome in #3925
- [XPU]Fixed the issue of performance degradation caused by enabling ENABLE_V1_KVCACHE_SCHEDULER by @iosmers in #3900
- Revert "[Feature] Setting number of apiserver workers automatically" by @rainyfly in #3918
- [BugFix] fix TaskQueue dp_id in multi node by @lizhenyun01 in #3919
- [MTP]Update hybrid mtp with ngram r2.2 by @freeliuzc in #3924
- add cache queue port (#3904) by @ZhangYulongg in #3926
- [Feature] Enable prefix caching as default by @rainyfly in #3816
- [Optimize] optimize prefix cache in release22 by @rainyfly in #3889
- [BugFix] qwen2.5vl enable_thinking=true bug fix by @CSWYF3634076 in #3920
- [Fix] when prompt token ids is numpy by @rainyfly in #3944
- ignore by @bukejiyu in #3949
- [Feature] support hierarchical cache in v1 by @rainyfly in #3939
- [Bug Fix] Fix mm performance degradation by @ming1753 in #3942
- Update paddleformers version to >=0.2.3 by @yuanlehome in #3936
- [Cherry-Pick][Bug Fix]fix the bug for real size 0 in cudagraph by @zeroRains in #3888
- [BugFix] fix default parser by @luukunn in #3932
- [CI] update ci by @ZhangYulongg in #3953
- [Docs][CP 2.2] Update env docs for Machete by @Sunny-bot1 in #3960
- [Feature] support rl_tp_degree by @lizhenyun01 in #3934
New Contributors
- @xjkmfa made their first contribution in #3229
- @wuyujiji made their first contribution in #3234
- @Kane2011 made their first contribution in #3241
- @qw86972190 made their first contribution in #3481
- @Copilot made their first contribution in #3546
- @lengxia made their first contribution in #3537
- @tianlef made their first contribution in #3607
- @Lmywl made their first contribution in #3628
- @mattheliu made their first contribution in #3484
- @zyfncg made their first contribution in #3478
- @xyxinyang made their first contribution in #3557
- @CSWYF3634076 made their first contribution in #3920
Full Changelog: v2.1.1...v2.2.0