Release v2.2.0 · PaddlePaddle/FastDeploy

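New Features

bad_words in the sampling strategy now accepts token ids
Added support for the Qwen2.5-VL model series (video requests do not support enable-chunked-prefill)
The prompt field of the API-Server completions endpoint now accepts a list of token ids and supports batch inference
Added function call parsing; function call results can be parsed via tool-call-parse
Support for a custom chat_template at service startup or per request
Support for loading a model's chat_template.jinja file
Request error responses now include the exception stack trace, and exception logging is more complete
Added a speculative decoding method that combines MTP and Ngram
Added Tree Attention support for speculative decoding
Enhanced model loading: weights are now loaded through an iterator, further improving loading speed and memory usage
Improved the API-Server log format and added timestamps
Added a plugin mechanism that lets users extend custom functionality without modifying FastDeploy core code
Marlin kernel files are now generated automatically at compile time from a template configuration
Support for loading ERNIE and Qwen series models in native HuggingFace Safetensors format
Improved DP+TP+EP hybrid parallel inference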
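
As an illustration of the completions change listed above, the sketch below builds request bodies whose prompt is a list of token ids, or a batch of such lists. It is a minimal sketch, not FastDeploy code: the helper name build_completion_payload, the "model" and "max_tokens" fields (borrowed from the OpenAI-style completions format), and the token id values are all assumptions for illustration.

```python
import json


def _is_token_ids(value):
    """True for a non-empty list of plain ints (bools are rejected)."""
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(isinstance(t, int) and not isinstance(t, bool) for t in value)
    )


def build_completion_payload(prompt, model="default", max_tokens=16):
    """Hypothetical helper: build a JSON-serializable completions request body.

    `prompt` may be a string, a list of strings, one list of token ids,
    or a list of token-id lists (a batch). Field names other than
    `prompt` are assumptions based on the OpenAI-style format.
    """
    is_batch = isinstance(prompt, list) and len(prompt) > 0 and (
        all(isinstance(p, str) for p in prompt)
        or all(_is_token_ids(p) for p in prompt)
    )
    if not (isinstance(prompt, str) or _is_token_ids(prompt) or is_batch):
        raise ValueError("prompt must be text, token ids, or a batch of either")
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


# One prompt given as token ids (the ids are made-up values).
single = build_completion_payload([101, 2023, 2003, 102])

# A batch of two prompts, each given as token ids.
batch = build_completion_payload([[101, 2023, 102], [101, 2178, 102]])

print(json.dumps(batch))
```

The resulting dictionary would be sent as the JSON body of a POST to the server's completions endpoint; the host, port, and model name depend on the deployment.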

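Performance Optimization

Added a W4Afp8 MoE Group GEMM operator
CUDA Graph now supports long contexts beyond 32K
Optimized the moe_topk_select operator, improving MoE model performance
Added a Machete WINT4 GEMM operator to optimize WINT4 GEMM performance; enable it with FD_USE_MACHETE=1
Chunked prefill is enabled by default
The V1 KVCache scheduling policy and context caching are enabled by default
MTP supports inference with more draft tokens, improving the multi-step acceptance rate
Added pluggable, lightweight sparse attention to accelerate long-context inference
Decode now supports adaptive two-stage All-to-All communication, improving communication speed
The MLA Backend for DeepSeek series models can enable Flash-Attention-V3 in the encoder stage
Support for horizontal fusion of the q_a_proj & kv_a_proj_with_mqa linear layers in DeepSeek series models
The API-Server adds a zmq dealer-mode communication management module that supports connection reuse, further raising the maximum concurrency the service can handle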
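
The FD_USE_MACHETE=1 switch mentioned above is an environment variable, so it has to be present in the environment of the serving process. The sketch below shows one way to pass it to a child process from Python. The helper name machete_env is hypothetical, and the child command is only a stand-in that echoes the variable; the actual FastDeploy launch command is not shown in these notes and is not assumed here.

```python
import os
import subprocess
import sys


def machete_env(base=None):
    """Hypothetical helper: copy an environment and set FD_USE_MACHETE=1."""
    env = dict(os.environ if base is None else base)
    env["FD_USE_MACHETE"] = "1"
    return env


# Stand-in child process that prints the variable it received.
# Replace this command with the real service launch command.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['FD_USE_MACHETE'])"],
    env=machete_env(),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```

Copying the parent environment rather than building a fresh one keeps variables such as PATH and CUDA settings visible to the child process.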

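Bug Fixes

Added echo support to the completion endpoint
Fixed a context cache management bug under V1 scheduling
Fixed inconsistent outputs between two runs of Qwen models with top_p fixed at 0
Fixed uvicorn crashing randomly at startup or during runtime with multiple workers
Fixed the logprobs aggregation for multiple prompts in the API-Server completions endpoint
Fixed an MTP sampling issue
Fixed incorrect cache transfer signals in PD disaggregation
Fixed the flow-control signal not being released when an exception is thrown
Fixed the failure to raise an exception when max_tokens is 0
Fixed offline inference hanging on exit in EP + DP mixed mode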

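Documentation

Updated the usage and conflict relationships of several techniques in the best practices documentation
Added multi-node tensor parallel deployment documentation
Added data parallel deployment documentation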

Others

CI now includes an approval gate for custom operators
Config cleanup and standardization

What's Changed

Describe PR diff coverage using JSON file by @XieYunshen in #3114
[CI] add xpu ci case by @plusNew001 in #3111
disable test_cuda_graph.py by @XieYunshen in #3124
[CE] Add base test class for web server testing by @DDDivano in #3120
[OPs] MoE Preprocess OPs Support 160 Experts by @ckl117 in #3121
[Docs] Optimal Deployment by @ming1753 in #2768
fix stop seq unittest by @zoooo0820 in #3126
[XPU]Fix out-of-memory issue during single-XPU deployment by @iosmers in #3133
[Code Simplification] Refactor Post-processing in VL Model Forward Method by @DrRyanHuang in #2937
add case by @DDDivano in #3150
fix ci by @XieYunshen in #3141
Fa3 supports centralized deployment by @yangjianfengo1 in #3112
Add CI cases by @ZhangYulongg in #3155
[XPU]Updata XPU dockerfiles by @plusNew001 in #3144
[Feature] remove dependency on enable_mm and refine multimodal's code by @ApplEOFDiscord in #3014
【Inference Optimize】Support automatic generation of marlin kernel by @chang-wenbin in #3149
Update init.py by @DDDivano in #3163
fix load_pre_sharded_checkpoint by @bukejiyu in #3152
【Feature】add fd plugins && rm model_classes by @gzy19990617 in #3123
[Bug Fix] fix pd disaggregated kv cache signal by @ltd0924 in #3172
Update test_base_chat.py by @DDDivano in #3183
Fix approve shell scripts by @YuanRisheng in #3108
[Bug Fix] fix the bug in test_sampler by @zeroRains in #3157
【Feature】support qwen3 name_mapping by @gzy19990617 in #3179
remove useless code by @zhoutianzi666 in #3166
[Bug fix] Fix cudagraph when use ep. by @Wanglongzhi2001 in #3130
[Bugfix] Fix uninitialized decoded_token and add corresponding unit t… by @sunlei1024 in #3195
[CI] add test_compare_top_logprobs by @EmmonsCurse in #3191
fix expertwise_scale by @rsmallblue in #3181
[FIX]fix bad_words when sending requests consecutively by @Sunny-bot1 in #3197
[plugin] Custom model_runner/model support by @lizhenyun01 in #3186
Add more base chat cases by @DDDivano in #3203
Add switch to apply fine-grained per token quant fp8 by @RichardWooSJTU in #3192
[Bug Fix]Fix bug of append attention test case by @gongshaotian in #3202
add more cases by @DDDivano in #3207
fix coverage report by @XieYunshen in #3198
[New Feature] fa3 supports flash mask by @yangjianfengo1 in #3184
[Test] scaled_gemm_f8_i4_f16 skip test while sm != 89 by @ming1753 in #3210
[EP] Refactor DeepEP Engine Organization for Mixed Mode & Buffer Management Optimization by @RichardWooSJTU in #3182
[Bug fix] Fix lm head bias by @RichardWooSJTU in #3185
Ce add repitation early stop cases by @DDDivano in #3213
[BugFix]fix test_air_top_p_sampling name by @ckl117 in #3211
[BugFix] support real batch_size by @lizexu123 in #3109
Ce add bad cases by @DDDivano in #3215
revise noaux_tc by @rsmallblue in #3164
[Bug Fix] Fix bug of MLA Attention Backend by @gongshaotian in #3176
support qk norm for append attn by @rsmallblue in #3145
Fix approve ci by @XieYunshen in #3212
[Trace]add trace when fd start by @sg263 in #3174
[New Feature] Support W4Afp8 MoE GroupGemm by @yangjianfengo1 in #3171
Perfect approve error message by @YuanRisheng in #3224
Fix the confused enable_early_stop when only set early_stop_config by @zeroRains in #3214
[CI] Add ci case for min token and max token by @xjkmfa in #3229
add some evil cases by @DDDivano in #3240
support qwen3moe by @bukejiyu in #3084
[Feature] support seed parameter by @lizexu123 in #3161
【Fix Bug】Fix fa3 centralized deployment support bug by @yangjianfengo1 in #3235
[bugfix]fix blockwisefp8 and all_reduce by @bukejiyu in #3243
[Feature] multi source download by @Yzc216 in #3125
[fix] fix completion stream api output_tokens not in usage by @liyonghua0910 in #3247
[Doc][XPU] Update deps and fix dead links by @hong19860320 in #3252
Fix approve ci bug by @YuanRisheng in #3239
[Executor]Update graph test case and delete test_attention by @gongshaotian in #3257
[CI] remove useless case by @EmmonsCurse in #3261
Ce add benchmark test by @DDDivano in #3262
[stop_seq] fix out-bound value for stop sequence by @zoooo0820 in #3216
[fix] multi source download by @Yzc216 in #3259
[Bug fix] support logprob in scheduler v1 by @rainyfly in #3249
[feat]add fast_weights_iterator by @bukejiyu in #3258
[Iluvatar GPU] Optimze attention and moe performance by @wuyujiji in #3234
delete parallel_state.py by @yuanlehome in #3250
[bugfix]qwen3_fix and qwq fix by @bukejiyu in #3255
【Fix】【MTP】Fix MTP sample bug by @freeliuzc in #3139
[CI] add CI logprobs case by @plusNew001 in #3189
Move create_parameters to init in FuseMOE for CultassBackend and TritonBackend by @zeroRains in #3148
[Bugfix] Fix model accuracy in some ops by @gzy19990617 in #3231
add base test ci by @XieYunshen in #3225
[BugFix] fix too many open files problem by @ltd0924 in #3256
[Executor] Fixed the issue of CUDA graph execution failure caused by different branches during decoding by @littledgg in #3223
[Bug Fix] Fix scheduler bug in develop by @rainyfly in #3292
Split cases by @DDDivano in #3297
Update _base_test.yml by @DDDivano in #3298
[Logprob] merge logprob into _process_batch_output func by @ckl117 in #3266
Update _base_test.yml by @DDDivano in #3299
Acc by @DDDivano in #3301
【CI case】include total_tokens in the last packet of completion interface stream output by @xjkmfa in #3279
[Bug fix] fix bug for scheduler v0 by @rainyfly in #3308
[BugFix] num_seqs by @lizexu123 in #3291
[V1 Loader] Support DeepSeekV3(bf16) by @zeroRains in #3294
update base test by @DDDivano in #3304
enhance eos_tokens by @yuanlehome in #3274
Revert "[BugFix] num_seqs" by @Jiang-Jia-Jun in #3316
Update deploy.py by @ZhangYulongg in #3310
Launch expert_service before kv_cache initialization in worker_process by @zeroRains in #3045
[Bug Fix] fix uvicorn multi worker error by @kevincheng2 in #3300
fix ci pypi index error by @XieYunshen in #3326
[Docs]fix sampling docs by @Sunny-bot1 in #3113
Update _base_test.yml by @DDDivano in #3331
[Test ] fix unittest by @zoooo0820 in #3328
[Docs]remove redundent "bad_words" by @Sunny-bot1 in #3332
Remove useless code by @Jiang-Jia-Jun in #3337
[Bug fix] fix block num setting in scheduler v1 for develop by @rainyfly in #3303
[Doc] Add Chinese/English language switch by @yangjianfengo1 in #3318
[Bug Fix] fix vl V1 schedule bug by @ming1753 in #3323
[Bug fix] fix ep lm head by @RichardWooSJTU in #3244
[BugFix]fix namemaping when use it many times by @gzy19990617 in #3320
Use latest PaddlePaddle package by @XieYunshen in #3347
[BugFix] v1/completions add finish_reason by @memoryCoderC in #3246
Pre ce modified by @XieYunshen in #3335
add test for CustomAllreduce by @zhink in #3313
Completion add raw_prediction/text_after_process by @memoryCoderC in #3356
add Tool Parser by @luukunn in #3272
Refactor moe_topk_select op to use apply_norm_weight as a template parameter by @Sunny-bot1 in #3345
[BUG FIX][SOT] Fix parameter order for custom op: rms_norm_eps by @DrRyanHuang in #3348
[MetaxGPU] Support FastDeploy on metax gpu by @Kane2011 in #3241
[Iluvatar GPU] Modify the names of some variables by @wuyujiji in #3273
[GCU] Enable gcu CI by @EnflameGCU in #3190
fix TestOpenAIServingCompletion fail by @memoryCoderC in #3368
[Loader V1] modify layername for DeepSeekV3 by @zeroRains in #3336
Optimize CI execution workflow by @XieYunshen in #3371
[Bug Fix] Fix V1 video bug by @ming1753 in #3388
feat(log):add_request_and_response_log by @xiaolei373 in #3373
[BugFix] Fix default log level of paddleformers by @Jiang-Jia-Jun in #3376
[Optimize]Add norm_weights feature for topk_gating_softmax by @Sunny-bot1 in #3372
【BugFix】fix some op tests by @gzy19990617 in #3398
[BugFix] fix real_bsz in ep by @lizexu123 in #3366
Add requirements for running unit tests by @XieYunshen in #3350
[BugFix] fix ErnieProcessor not set raw_prediction by @memoryCoderC in #3400
make append_attn supports mask_offset by @carryyu in #3138
[UT Fix] Fix bad_words test by @Sunny-bot1 in #3385
[Docs] Modify readme about MTP by @freeliuzc in #3409
【CI】 evil case by @xjkmfa in #3359
[V1 Loader] Support Ernie text（moe and dense） by @YuanRisheng in #3110
[OPs] Universal optimization and Fix early_stop cuda 700 by @ckl117 in #3375
[Doc] Add multinode deployment documents by @ltd0924 in #3417
[Optimize]Add bias feature for topk_gating_softmax by @Sunny-bot1 in #3405
[docs] update best practice docs by @zoooo0820 in #3420
[Docs]XPU Update 2.1 Release Documentation by @iosmers in #3423
[Docs] Release 2.1 docs and fix some description by @ming1753 in #3424
[Bugs] Fix DeepGEMM pre-compile tools. by @Deleter-D in #3351
add accuracy check ci by @XieYunshen in #3389
[Docs]Modify the gpu-memory-utilization of the 128K 8-card Wint4 model to 0.95 by @iosmers in #3428
[docs] fix some docs error by @zoooo0820 in #3439
Update README by @yangjianfengo1 in #3426
[Docs]update installation readme by @yongqiangma in #3429
[Docs]Updata docs of graph opt backend by @gongshaotian in #3442
[Docs] Update mkdocs.yml by @gongshaotian in #3444
[BugFix] CPU init.py Delete import base by @ckl117 in #3448
Add ci case by @ZhangYulongg in #3355
[Feature][MTP]update multi-draft-token strategy by @freeliuzc in #3369
[Bugfix]fix some bug in plugins and config bug in dynamic_weight_manager by @gzy19990617 in #3434
Perf by @DDDivano in #3453
[BugFix]fix mtp_rej_topp missing top_k_list input by @ckl117 in #3450
[Excutor] Increase buffer size to prevent address corruption; add forward metadata debug tool by @littledgg in #3404
[Excutor] Change cudagraph hashkey from batch size to num_tokens by @littledgg in #3454
add custom chat template by @luukunn in #3251
add publish workflow by @XieYunshen in #3063
[Code Simplification] remove cum_offsets by @lizexu123 in #3410
【BugFix】Add echo support to the completion endpoint by @AuferGachet in #3245
【Inference Optimize】DeepSeek-v3 model inference performance optimization by @chang-wenbin in #3455
[BugFix] fix num_running_requests in cuda_graph by @lizexu123 in #3457
[Feature] Pass through the chat_template_kwargs to the data processing module by @luukunn in #3421
[FixBug] compute early stopping with real batch size by @zeroRains in #3418
[BugFix] fix control signal release failed by @ltd0924 in #3390
[BugFix] fix request_output sampling_params (#3154) by @ckl117 in #3464
[V1 Loader] Support MOE parameters create and load for DeepGemm and marlin backend by @zeroRains in #3447
[CI] add test generation demo by @ltd0924 in #3270
【New Feature】Support Fp8 group Gemm 2:4 sparsity by @yangjianfengo1 in #3463
add error traceback info by @kevincheng2 in #3419
Add stable ci by @XieYunshen in #3460
add error log to file by @xiaolei373 in #3431
[bigfix] temporarily lock the paddlepaddle-gpu==3.0.0.dev20250818 by @EmmonsCurse in #3482
Add eb45-8k-fp8-tp1-dp8_ep.yaml by @ZhangYulongg in #3485
add e2e cases by @XieYunshen in #3476
Update disaggregated.md by @ZhangYulongg in #3495
Add custom op declaration for all_reduce by @DrRyanHuang in #3473
[BugFix][V1 Loader] fix the bug in creat weight for block_wise_fp8 by @zeroRains in #3486
[Feature] add prompt_tokens and completion_tokens by @memoryCoderC in #3504
CE build job (triggered on merge) by @XieYunshen in #3491
[Feature] add dealer manager to reuse the connection by @ltd0924 in #3471
[XPU] Cherry-pick Commit Release2.1 44c0c7e into develop by @qw86972190 in #3481
Update CI by @ZhangYulongg in #3474
[Feature] Models api by @Yzc216 in #3073
[Feature] add tool parser by @luukunn in #3483
[fix] setting disable_chat_template while passing prompt_token_ids led to response error by @liyonghua0910 in #3228
[fix] fix output tokens count in streaming completion api by @liyonghua0910 in #3507
Add PD CI case by @ZhangYulongg in #3490
【bug fix】Fix slow w4a8 compilation by @yangjianfengo1 in #3510
Unify server-side and model-side Config(Part-5) by @YuanRisheng in #3497
Update ci by @ZhangYulongg in #3519
[V1 Loader]Ernie VL support loader v1 by @YuanRisheng in #3494
disable stable test by @XieYunshen in #3529
[CI] add container naming and cleanup logic in workflows by @EmmonsCurse in #3526
[Feature][SpeculativeDecoding]Support tree-attention by @freeliuzc in #3514
[CI] fix xpu ci bug by @plusNew001 in #3535
Fix fdconfig bugs by @YuanRisheng in #3528
[Feature] Add Qwen25-VL Processor by @lddfym in #3501
Modified to support custom all reduce by default by @zhink in #3538
[UnitTest][Copilot] Improve unit test coverage for entrypoints modules by @Copilot in #3546
fix test name by @XieYunshen in #3493
[V1 Loader] Support qwen2(bf16) by @zeroRains in #3502
[BugFix]Fix FDconfig for SOT by @YuanRisheng in #3556
Revert "[UnitTest][Copilot] Improve unit test coverage for entrypoints modules" by @Jiang-Jia-Jun in #3564
[V1 Loader] support weight_only by @bukejiyu in #3413
[V1 Loaderr]support qwen2 weight only by @bukejiyu in #3571
[Feature][XPU] add custom kernels for mtp by @lengxia in #3537
[CI] add sot test by @EmmonsCurse in #3579
Modify the existing coverage collection method by @XieYunshen in #3573
support w4afp8 EP inference by @rsmallblue in #3044
Add coverage skip by @XieYunshen in #3553
[Feature] Add temp_scaled_logprobs and top_p_normalized_logprobs parameters for logits and logprobs post processing by @ckl117 in #3552
[CI] temporarily disable sot test due to occasional timeout issue by @EmmonsCurse in #3586
[MetaxGPU] Adapt to the latest fastdeploy on metax gpu by @Kane2011 in #3492
[Executor] CUDAGraph support RL training by @gongshaotian in #3265
[Bugfix] fix api server control signal bugs by @ltd0924 in #3531
[Features] support hugging face qwen3 dense and qwen2 model by @lizexu123 in #3574
[Feature] bad words support v1 scheduler and specifiy token ids by @Sunny-bot1 in #3608
[CudaGraph][SOT] Add unit tests for splitting the static graph into piecewise graphs that support cuda_graph by @DrRyanHuang in #3590
[CE]add x1 w4a8c8 benchamrk config by @tianlef in #3607
[NewFeatures] support noex rope3d by @xiaoxiaohehe001 in #3542
[CI] reopen sot test by @EmmonsCurse in #3613
【Inference Optimize】Support DSK qkv_a_proj horizontal fusion under V0 Loder by @chang-wenbin in #3591
[feature][MTP]Support new speculative decoding method "hybrid mtp with ngram" by @freeliuzc in #3610
Supports DP+TP+EP hybrid parallel deployment strategy by @carryyu in #3489
adaptive rms_norm's dtype by @yuanlehome in #3617
[NewFeatures] support eplb by @xiaoxiaohehe001 in #3547
[CUDAGraph]Add debug func by @gongshaotian in #3616
[v1 loader]support fp8 by @bukejiyu in #3593
[Bugfix] Correct logprobs aggregation for multiple prompts in /completions endpoint by @sunlei1024 in #3618
[CI] Standard unittest by @YuanRisheng in #3606
rename ernie_xxx to ernie4_5_xxx by @yuanlehome in #3621
[NewFeature]Support dp multi api server && Fix some bug in mixed ep && merge develop by @gzy19990617 in #3598
[CUDAGraph]Switch the scope so that output buffer of CUDAGraph can automatically release by @gongshaotian in #3612
[Feature] block sparse attention by @yangjianfengo1 in #3209
fix publish task by @XieYunshen in #3635
[New Features] support fa3 rope3d by @xiaoxiaohehe001 in #3622
[Precision] Support lm_head layer running in float32 by @ckl117 in #3597
update xpu ci by @plusNew001 in #3632
[BugFix] Fix qwen3 lm_head load bug by @ckl117 in #3639
【CI case】for echo finish_reason text_after_process and raw_prediction check by @xjkmfa in #3630
[docs] Update best practice doc by @zoooo0820 in #3539
Change paddlepaddle-xpu installation command by @plusNew001 in #3646
deepgemm don't support tp+ep (for ci) by @carryyu in #3638
[fix] qwen output inconsistency when top_p=0 by @liyonghua0910 in #3634
Revert "[Feature] block sparse attention" by @Jiang-Jia-Jun in #3647
【fix】 undefined cuPointerGetAttribute symbol error by @Lmywl in #3628
delete ernie4_5_vl_tokenizer by @yuanlehome in #3631
[BugFix] fix ce bugs by @ltd0924 in #3641
[BugFix] Modify the bug in Qwen2 when enabling ENABLE_V1_KVCACHE_SCHEDULER. by @lizexu123 in #3625
[V1 Loader]support param create and load for wint2 and xpu backend by @zeroRains in #3581
[Optimize]support machete weight only gemm by @Sunny-bot1 in #3561
[BugFix] fix parameter is 0 by @ltd0924 in #3592
【Hackathon 9th No.77】supplementary unit test for get_filtered_metrics by @Echo-Nie in #3578
Support 45t fp8 8 GPU by @zhoutianzi666 in #3659
【New Feature】Centralized deployment supports w4afp8 by @yangjianfengo1 in #3644
[BufFix]Fix rl bugs by @YuanRisheng in #3654
fix w4afp8_gemm_scale_permute import error on A100 by @rsmallblue in #3611
Update run_ci_xpu.sh to lock xvllm version by @plusNew001 in #3671
[Docs] add fastdeploy_unit_test_guide.md by @mattheliu in #3484
Fix target_version by @co63oc in #3159
fix typos by @co63oc in #3633
[BugFix] pd disaggregate port format by @ltd0924 in #3669
fix by @bukejiyu in #3676
[BugFix] fix logger by @ltd0924 in #3666
[BugFix] ep mixed offline exit by @ltd0924 in #3661
Add with_output version AppendAttention by @Lmywl in #3302
【fix】fix fp8 deepgemm_moe TP parallel Bug by @Lmywl in #3658
add concurrency cases by @DDDivano in #3689
[BugFix]fix dp&ep&tp and muti node infer by @gzy19990617 in #3629
Update _base_test.yml by @DDDivano in #3690
[V1 Loader]Ernie mtp support loader v1 by @YuanRisheng in #3675
[Bug Fix] VL Support w4a8/w4afp8 by @ming1753 in #3686
add input_processor plugin by @yuanlehome in #3657
[v1 loader]fix qwen3 235B tp 8 by @bukejiyu in #3697
update ci envs for structred output by @kevincheng2 in #3687
【DCU】enable dcu ci by @lifulll in #3402
【Hackathon 9th No.70】supplementary unit test for CPUPlatform and CUDAPlatform by @Echo-Nie in #3580
fix MultimodalRegistry by @yuanlehome in #3699
fix scaled_gemm_f8_i4_f16_weight_quantize input by @co63oc in #3685
MoE Default use triton's blockwise fp8 in TP Case by @zhoutianzi666 in #3678
[feat] completion api supports passing input token ids in either prompt or prompt_token_ids by @liyonghua0910 in #3311
[BugFix] fix key error in mm by @yuanlehome in #3702
[NewFeature] support w4afp8 eplb by @xiaoxiaohehe001 in #3680
Specify dtype as int32 for paddle.cumsum due to API Change by @DrRyanHuang in #3692
Fix mtp tp group by @carryyu in #3648
[CudaGraph] [SOT] Support spliting static graph into piecewise graph with cuda_graph by @zyfncg in #3478
add w4afp8 offline script by @rsmallblue in #3636
[CI] update paddle version to nightly by @EmmonsCurse in #3698
[Model]support qwen2_5_vl by @xyxinyang in #3557
[Feature] block sparse attention by @yangjianfengo1 in #3668
[Feature]support load eb 0.3B and 21B torch model by @ckl117 in #3660
Optimize coverage jobs by @XieYunshen in #3683
[Optimize] Increase zmq buffer size to prevent apiserver too slowly t… by @rainyfly in #3723
[Bug Fix] fix the bug when num_key_value_heads < tensor_parallel_size in launching kv_cahce_manager by @zeroRains in #3717
[Features] support hugging face qwen3 moe by @lizexu123 in #3649
[Attn] max_partition_size default is 1024 by @ckl117 in #3720
[Code Simplification] delete print by @lizexu123 in #3729
[Feature]support chat_template.jinja by @luukunn in #3721
[Feature] Add AsyncTokenizerClient&ChatResponseProcessor with remote encode&decode support. by @sunlei1024 in #3674
[FIX]Fix Machete compile via ENABLE_MACHETE by @Sunny-bot1 in #3727
[feat] add metrics for yiyan adapter by @liyonghua0910 in #3614
default enable chunked prefill by @kevincheng2 in #3731
fix mask_offset in append_attn by @lizhenyun01 in #3745
[Bug fix] Fix prefix cache in V1 by @rainyfly in #3715
fix ce build job by @XieYunshen in #3777
fix ce compile task upload error by @XieYunshen in #3788
Fix chunked prefill by @kevincheng2 in #3778
[Feature] Setting number of apiserver workers automatically by @Jiang-Jia-Jun in #3794
[Feature] support model weight update in ep by @ltd0924 in #3802
[BugFix] fix max streaming tokens invalid by @ltd0924 in #3799
[Executor] Fix bug of import paddle with RLHF (#3781) by @gongshaotian in #3817
[BugFix] fix qwen vl processor by @ltd0924 in #3806
[cp] fix error of import paddle.base.core.Config (#3761) by @yuanlehome in #3804
[v1loader]Reduce EB300B model loading time (#3700) by @bukejiyu in #3810
[BugFix] fix scheduler by @ltd0924 in #3818
add reasoning parser plugin by @luukunn in #3820
[XPU] fix xpu ci bug by @plusNew001 in #3834
【BugFix】add moe noaux_tc tatics in trition backend by @gzy19990617 in #3821
[XPU]Update XPU CI Case by @plusNew001 in #3844
[XPU] Update XPU stable xvllm and xtdk version for 2.2 & Change CI Case by @plusNew001 in #3855
【Fix bug] The nblock of w4afp8 is fixed to 256, and the mask parameter is added to the append attn of fa3. (#3771) by @yangjianfengo1 in #3835
[Bug Fix] Fix bug of multimodal inputs only text by @ming1753 in #3850
【BUG FIX】 Fixed moba single test port conflict by @yangjianfengo1 in #3865
Support for async processor added. by @sunlei1024 in #3870
【BugFix】fix gpu mem oom by @gzy19990617 in #3852
[Feature] Set scheduler v1 as default by @rainyfly in #3812
[Bug fix] Fix prompt token ids dtype in v1 by @rainyfly in #3861
[bugfix] scheduler by @ltd0924 in #3871
[bug] fix finish reason by @luukunn in #3858
fix DP&&TP by @lizhenyun01 in #3872
[CI] update paddleformers==0.2 in release/2.2 by @EmmonsCurse in #3828
[Fix] disable scheduler v1 in guided decoding by @rainyfly in #3877
paddleformers==0.1.4 by @yuanlehome in #3908
[Fix] fix qwen_vl_processor miss image_patch_id by @xyxinyang in #3894
[Feature] support controller port in multi api server by @ltd0924 in #3895
[BugFix] fix OOM while TPDP weight loading by @lizhenyun01 in #3882
[Fix] mv connection_manager init by @ltd0924 in #3902
【CP】Compatible with EB 0.3B torch model arch by @ckl117 in #3914
paddleformers==0.2.1 by @yuanlehome in #3925
[XPU]Fixed the issue of performance degradation caused by enabling ENABLE_V1_KVCACHE_SCHEDULER by @iosmers in #3900
Revert "[Feature] Setting number of apiserver workers automatically" by @rainyfly in #3918
[BugFix] fix TaskQueue dp_id in multi node by @lizhenyun01 in #3919
[MTP]Update hybrid mtp with ngram r2.2 by @freeliuzc in #3924
add cache queue port (#3904) by @ZhangYulongg in #3926
[Feature] Enable prefix caching as default by @rainyfly in #3816
[Optimize] optimize prefix cache in release22 by @rainyfly in #3889
[BugFix] qwen2.5vl enable_thinking=true bug fix by @CSWYF3634076 in #3920
[Fix] when prompt token ids is numpy by @rainyfly in #3944
ignore by @bukejiyu in #3949
[Feature] support hierarchical cache in v1 by @rainyfly in #3939
[Bug Fix] Fix mm performance degradation by @ming1753 in #3942
Update paddleformers version to >=0.2.3 by @yuanlehome in #3936
[Cherry-Pick][Bug Fix]fix the bug for real size 0 in cudagraph by @zeroRains in #3888
[BugFix] fix default parser by @luukunn in #3932
[CI] update ci by @ZhangYulongg in #3953
[Docs][CP 2.2] Update env docs for Machete by @Sunny-bot1 in #3960
[Feature] support rl_tp_degree by @lizhenyun01 in #3934

New Contributors

@xjkmfa made their first contribution in #3229
@wuyujiji made their first contribution in #3234
@Kane2011 made their first contribution in #3241
@qw86972190 made their first contribution in #3481
@Copilot made their first contribution in #3546
@lengxia made their first contribution in #3537
@tianlef made their first contribution in #3607
@Lmywl made their first contribution in #3628
@mattheliu made their first contribution in #3484
@zyfncg made their first contribution in #3478
@xyxinyang made their first contribution in #3557
@CSWYF3634076 made their first contribution in #3920

Full Changelog: v2.1.1...v2.2.0
