[Cherry-pick][Performance Optimization] Rewrite GPU TopK kernel with radix-select and multi-tier sorting #78409 #78659
Open
zhengshengning wants to merge 13 commits into PaddlePaddle:release/3.3 from
Replace the existing GPU TopK implementation with a new radix-select based algorithm and multi-tier sorting strategy for improved performance:
- Radix-select for efficient top-k selection
- Multi-block top-k (mbtopk) for large slices
- Single-block top-k (sbtopk) for smaller slices
- Three-tier sort dispatch: Bitonic Sort (k<=32), WarpMergeSort (k<=128), BlockRadixSort (k<=4096), ArgsortKernel fallback (k>4096)
- Rename old TopkKernel to TopkKernelOld for reference
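For readers unfamiliar with the approach, here is a minimal host-side C++ sketch of the two ideas in this commit: radix-select and the k-based sort dispatch. It is an illustration under stated assumptions, not the kernel code from this PR. The names `RadixSelectKthLargest`, `SortTier`, and `SelectSortTier` are hypothetical, the 8-bit digit width is an assumption, and keys are taken to be plain `uint32_t` values. Only the k thresholds come from the commit message.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: returns the k-th largest key (k is 1-based).
// Each pass histograms one 8-bit digit of the keys that still match the
// prefix chosen so far, then narrows to the bucket holding the k-th largest.
std::uint32_t RadixSelectKthLargest(const std::vector<std::uint32_t>& keys,
                                    std::size_t k) {
  if (k == 0 || k > keys.size()) throw std::invalid_argument("bad k");
  constexpr int kRadixBits = 8;  // assumed digit width, not from the PR
  constexpr std::uint32_t kRadixMask = (1u << kRadixBits) - 1u;
  std::uint32_t desired = 0;       // answer bits fixed so far
  std::uint32_t desired_mask = 0;  // which bit positions are fixed
  for (int shift = 32 - kRadixBits; shift >= 0; shift -= kRadixBits) {
    std::array<std::size_t, kRadixMask + 1> counts{};
    for (std::uint32_t key : keys) {
      if ((key & desired_mask) == desired) ++counts[(key >> shift) & kRadixMask];
    }
    // Scan digits from largest to smallest, skipping whole buckets.
    for (int digit = static_cast<int>(kRadixMask); digit >= 0; --digit) {
      const std::size_t c = counts[static_cast<std::size_t>(digit)];
      if (k <= c) {
        desired |= static_cast<std::uint32_t>(digit) << shift;
        desired_mask |= kRadixMask << shift;
        break;
      }
      k -= c;
    }
  }
  return desired;
}

// Hypothetical enum mirroring the dispatch thresholds listed above.
enum class SortTier { kBitonicSort, kWarpMergeSort, kBlockRadixSort, kArgsortFallback };

SortTier SelectSortTier(std::int64_t k) {
  if (k <= 32) return SortTier::kBitonicSort;
  if (k <= 128) return SortTier::kWarpMergeSort;
  if (k <= 4096) return SortTier::kBlockRadixSort;
  return SortTier::kArgsortFallback;
}
```

Once the k-th largest key is known, a top-k pass only needs to collect every element greater than that key plus enough ties to reach k, and then sort those k results with the tier chosen by `SelectSortTier`.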
On LP64 Linux, int64_t is a typedef of long, not long long. Using int64_t caused a duplicate specialization. Restore the original long long / unsigned long long types with NOLINT to suppress cpplint, and remove the duplicate int64_t specialization.
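The collision can be reproduced in a few lines. The sketch below uses a hypothetical trait `KeyBits` as a stand-in for the per-type specializations in the kernel; it is not the code from this PR. It relies only on the fact that `long` and `long long` are always distinct types, while `std::int64_t` is an alias for one of them.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical trait standing in for the kernel's per-type specializations.
template <typename T>
struct KeyBits;

// long and long long are distinct types even when both are 64-bit wide,
// so these two explicit specializations never collide.
template <>
struct KeyBits<long> {  // NOLINT
  static constexpr int value = static_cast<int>(sizeof(long) * 8);  // NOLINT
};

template <>
struct KeyBits<long long> {  // NOLINT
  static constexpr int value = static_cast<int>(sizeof(long long) * 8);  // NOLINT
};

// std::int64_t is only an alias. Adding `template <> struct KeyBits<std::int64_t>`
// here would redefine whichever specialization above it aliases.
constexpr bool kInt64IsLong = std::is_same<std::int64_t, long>::value;            // NOLINT
constexpr bool kInt64IsLongLong = std::is_same<std::int64_t, long long>::value;   // NOLINT

// The alias resolves to one of the existing specializations.
constexpr int kInt64Bits = KeyBits<std::int64_t>::value;
```

On LP64 Linux `kInt64IsLong` is true, which is why a specialization written for `int64_t` clashed there with one that already covered the same underlying type.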
When k comes from a tensor, InferMeta may set output dims with -1, making metadata invalid. Calling Alloc before resolving the actual k value triggers PreconditionNotMetError. Fix: move Alloc after FromTensor() resize, add empty-output guard and empty-input handling to match the old kernel behavior.
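The ordering issue can be illustrated with a small mock. `MockTensor` and `PrepareTopkOutput` below are hypothetical and do not use Paddle's API; the sketch assumes top-k runs along the last axis and models "invalid metadata" as any negative dimension.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tensor: Alloc() refuses unresolved (-1) dims.
struct MockTensor {
  std::vector<std::int64_t> dims;
  std::vector<float> data;

  void Resize(std::vector<std::int64_t> new_dims) { dims = std::move(new_dims); }

  std::int64_t Numel() const {
    std::int64_t n = 1;
    for (std::int64_t d : dims) n *= d;
    return n;
  }

  void Alloc() {
    for (std::int64_t d : dims) {
      if (d < 0) throw std::logic_error("cannot allocate with unresolved dim");
    }
    data.assign(static_cast<std::size_t>(Numel()), 0.0f);
  }
};

// Resolve k first, resize the output, and only then allocate.
// Returns false when the kernel should return early (empty input or output).
bool PrepareTopkOutput(const MockTensor& input, std::int64_t k, MockTensor* out) {
  if (input.dims.empty() || k < 0) throw std::invalid_argument("bad input or k");
  std::vector<std::int64_t> out_dims = input.dims;
  out_dims.back() = k;  // assumption: top-k along the last axis
  out->Resize(std::move(out_dims));
  out->Alloc();         // safe now: every dim is resolved
  return input.Numel() != 0 && out->Numel() != 0;
}
```

Calling `Alloc()` on an output whose dims still contain -1 throws in this mock, which mirrors the failure described above; calling it after the resize succeeds.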
- Bitfield: add HIP fallback using bit shifts instead of PTX asm (bfe.u32/u64, bfi.b32/b64 are NVIDIA PTX only)
- getLaneId/getLaneMaskLe/getLaneMaskLt: use HIP intrinsics on __HIPCC__
- CubKeyType<bfloat16>: use hip_bfloat16 instead of __nv_bfloat16
- Replace cudaStream_t with gpuStream_t (Paddle's unified type alias)
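A shift-and-mask replacement for the bit-field extract and insert instructions might look like the sketch below. The function names are hypothetical, and the code assumes `0 <= pos`, `0 < len`, and `pos + len` no larger than the word width; it does not reproduce how PTX clamps out-of-range operands.

```cpp
#include <cstdint>

// Extract `len` bits of `val` starting at bit `pos` (portable analogue of bfe.u32).
inline std::uint32_t GetBitfield32(std::uint32_t val, int pos, int len) {
  const std::uint32_t mask = (len >= 32) ? 0xFFFFFFFFu : ((1u << len) - 1u);
  return (val >> pos) & mask;
}

// Overwrite `len` bits of `val` at bit `pos` with the low bits of `to_insert`
// (portable analogue of bfi.b32).
inline std::uint32_t SetBitfield32(std::uint32_t val, std::uint32_t to_insert,
                                   int pos, int len) {
  const std::uint32_t low = (len >= 32) ? 0xFFFFFFFFu : ((1u << len) - 1u);
  const std::uint32_t mask = low << pos;
  return (val & ~mask) | ((to_insert << pos) & mask);
}

// 64-bit variant of the extract helper (analogue of bfe.u64).
inline std::uint64_t GetBitfield64(std::uint64_t val, int pos, int len) {
  const std::uint64_t mask =
      (len >= 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << len) - 1u);
  return (val >> pos) & mask;
}
```

The explicit `len >= 32` and `len >= 64` branches avoid shifting by the full word width, which is undefined behavior in C++.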
gpuStream_t is defined in phi:: namespace (via gpu_decls.h). The helper functions in the anonymous namespace cannot access it without qualification. Add 'using phi::gpuStream_t;' at the top of the anonymous namespace.
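The lookup problem and the fix can be shown with a stand-in type. In the sketch below, `MockStream` and `StreamId` are hypothetical; only the `phi::gpuStream_t` name and the using-declaration come from the commit message.

```cpp
namespace phi {
// Hypothetical stand-in; the real alias comes from gpu_decls.h.
struct MockStream {
  int id;
};
using gpuStream_t = MockStream*;
}  // namespace phi

namespace {

// Without this using-declaration, the unqualified name gpuStream_t below
// would not be found from inside the anonymous namespace.
using phi::gpuStream_t;

// Hypothetical helper that takes the stream type by its unqualified name.
int StreamId(gpuStream_t stream) { return stream != nullptr ? stream->id : -1; }

}  // namespace
```

Writing `phi::gpuStream_t` at every use site would work equally well; the using-declaration keeps the helper signatures unchanged.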
- Guard __syncwarp() with #if !defined(__HIPCC__) since HIP/DCU does not provide this intrinsic (AMD wavefronts are lockstep)
- Replace cudaMemsetAsync with hipMemsetAsync under PADDLE_WITH_HIP
- Use conservative defaults for regsPerMultiprocessor (65536) and maxBlocksPerMultiProcessor on HIP since hipDeviceProp_t lacks these members
Your PR was submitted successfully. Thank you for contributing to the open-source project!
PR Category
Operator Mechanism
PR Types
Improvements
Description
devPR: #78409
Does this cause a precision change?
No