
[Cherry-pick][Performance Optimization] Rewrite GPU TopK kernel with radix-select and multi-tier sorting #78409 #78659

Open

zhengshengning wants to merge 13 commits into PaddlePaddle:release/3.3 from zhengshengning:cp_acc_opt_topk

Conversation

@zhengshengning
Contributor

PR Category

Operator Mechanism

PR Types

Improvements

Description

dev PR: #78409

Whether precision changes are introduced

Replace the existing GPU TopK implementation with a new radix-select-based
algorithm and multi-tier sorting strategy for improved performance:

- Radix-select for efficient top-k selection
- Multi-block top-k (mbtopk) for large slices
- Single-block top-k (sbtopk) for smaller slices
- Three-tier sort dispatch: Bitonic Sort (k<=32), WarpMergeSort (k<=128),
  BlockRadixSort (k<=4096), with an ArgsortKernel fallback (k>4096)
- Rename old TopkKernel to TopkKernelOld for reference
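The tier dispatch above can be sketched as a simple threshold ladder. This is an illustrative model only; the function name and the kernel-name strings are stand-ins, and the real dispatch in the PR selects actual CUDA kernels rather than returning labels.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the three-tier dispatch plus argsort fallback;
// the thresholds match the PR description, the names are illustrative.
std::string SelectSortTier(int k) {
  if (k <= 32) return "BitonicSort";       // small k: one warp suffices
  if (k <= 128) return "WarpMergeSort";    // warp-level merge sort path
  if (k <= 4096) return "BlockRadixSort";  // block-level radix sort path
  return "ArgsortKernel";                  // very large k: full argsort
}
```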
On LP64 Linux, int64_t is a typedef of long, not long long, so using
int64_t caused a duplicate template specialization. Restore the original
long long / unsigned long long specializations with NOLINT to suppress
cpplint, and remove the duplicate int64_t specialization.
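The restored shape can be sketched as below. The trait name is hypothetical (the real code lives in Paddle's TopK sources); the point is that the specializations name the exact builtin types, with NOLINT silencing cpplint's "use int64_t" advice, so they cannot collide with a long-based int64_t alias.

```cpp
#include <cassert>

// Illustrative trait specialized per radix key type. Specializing the
// builtin types directly (not the int64_t alias) avoids the duplicate
// specialization on LP64, where int64_t is long.
template <typename T>
struct RadixTraits;  // primary template, no definition

template <>
struct RadixTraits<long long> {  // NOLINT
  static constexpr int bits = 64;
};

template <>
struct RadixTraits<unsigned long long> {  // NOLINT
  static constexpr int bits = 64;
};
```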
When k comes from a tensor, InferMeta may set the output dims with -1,
leaving the metadata invalid. Calling Alloc before resolving the actual
k value then triggers a PreconditionNotMetError.

Fix: move Alloc after the FromTensor() resize, and add an empty-output
guard and empty-input handling to match the old kernel's behavior.
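A minimal model of that control flow, with stand-in types rather than Paddle's DenseTensor/Alloc API, looks like this: resolve k and resize first, run the empty guards, and only then allocate.

```cpp
#include <cassert>
#include <vector>

// Stand-in for the output tensor; not Paddle API.
struct Output {
  std::vector<int> dims;
  bool allocated = false;
};

// Sketch of the fixed ordering: resize with the resolved k before Alloc.
void RunTopK(int k_from_tensor, int input_numel, Output* out) {
  out->dims = {k_from_tensor};           // resize after FromTensor()
  if (k_from_tensor == 0) return;        // empty-output guard
  if (input_numel == 0) return;          // empty-input handling
  out->allocated = true;                 // Alloc only once dims are valid
}
```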
- Bitfield: add HIP fallback using bit shifts instead of PTX asm
  (bfe.u32/u64, bfi.b32/b64 are NVIDIA PTX only)
- getLaneId/getLaneMaskLe/getLaneMaskLt: use HIP intrinsics on __HIPCC__
- CubKeyType<bfloat16>: use hip_bfloat16 instead of __nv_bfloat16
- Replace cudaStream_t with gpuStream_t (Paddle's unified type alias)
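The shift-based Bitfield fallback can be sketched as follows. Function names are illustrative; the semantics are intended to mirror PTX bfe.u32 (extract len bits at pos) and bfi.b32 (insert len bits at pos) for in-range arguments, which is what the HIP path needs since those instructions are NVIDIA-only.

```cpp
#include <cassert>
#include <cstdint>

// Portable bit-field extract: len bits of val starting at bit pos.
inline uint32_t GetBitfield(uint32_t val, int pos, int len) {
  if (len == 0) return 0;
  uint32_t mask = (len >= 32) ? 0xffffffffu : ((1u << len) - 1u);
  return (val >> pos) & mask;
}

// Portable bit-field insert: write the low len bits of to_insert into
// val at bit pos, leaving the other bits of val untouched.
inline uint32_t SetBitfield(uint32_t val, uint32_t to_insert,
                            int pos, int len) {
  if (len == 0) return val;
  uint32_t mask = (len >= 32) ? 0xffffffffu : ((1u << len) - 1u);
  mask <<= pos;
  return (val & ~mask) | ((to_insert << pos) & mask);
}
```

The len >= 32 guard avoids the undefined behavior of shifting a 32-bit value by 32, which the naive mask expression would hit.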
gpuStream_t is defined in the phi:: namespace (via gpu_decls.h), so the
helper functions in the anonymous namespace cannot refer to it without
qualification. Add 'using phi::gpuStream_t;' at the top of the
anonymous namespace.
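A minimal model of that lookup issue and its fix, with the stream type itself stubbed out (in Paddle the alias resolves to cudaStream_t or hipStream_t):

```cpp
#include <cassert>

namespace phi {
using gpuStream_t = void*;  // stand-in for the real alias in gpu_decls.h
}

namespace {
using phi::gpuStream_t;  // the fix: pull the name into this namespace

// Without the using-declaration above, this signature would need the
// phi:: qualification on every mention of gpuStream_t.
bool StreamIsNull(gpuStream_t stream) { return stream == nullptr; }
}  // namespace
```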
- Guard __syncwarp() with #if !defined(__HIPCC__) since HIP/DCU
  does not provide this intrinsic (AMD wavefronts are lockstep)
- Replace cudaMemsetAsync with hipMemsetAsync under PADDLE_WITH_HIP
- Use conservative defaults for regsPerMultiprocessor (65536) and
  maxBlocksPerMultiProcessor on HIP since hipDeviceProp_t lacks
  these members
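The conservative-default pattern for the missing device properties can be sketched as below. The struct and function are stand-ins that mirror the PR's description, not its actual code; the idea is that the HIP build hardcodes a safe value where hipDeviceProp_t has no field to read.

```cpp
#include <cassert>

// Stand-in for the device property struct; on CUDA the real
// cudaDeviceProp exposes regsPerMultiprocessor, on HIP it does not.
struct FakeDeviceProp {
  int regsPerMultiprocessor;
};

int GetRegsPerMultiprocessor(const FakeDeviceProp* prop) {
#ifdef PADDLE_WITH_HIP
  (void)prop;
  return 65536;  // conservative default: field absent on hipDeviceProp_t
#else
  return prop->regsPerMultiprocessor;  // CUDA reports the real value
#endif
}
```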
@paddle-bot

paddle-bot bot commented Apr 13, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.
