
hltPhase2SiPixelClustersSoA throws cudaErrorLaunchOutOfResources in CMSSW_15_1_0_pre4 when running on T4 GPUs #48460

@Parsifal-2045

Description


While running alpaka-enabled workflows in vanilla CMSSW_15_1_0_pre4 on a machine equipped with a Tesla T4 GPU, I noticed errors like the following:

---- Begin Fatal Exception 30-Jun-2025 18:54:24 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 10 event: 9002 stream: 0
   [1] Running path 'HLT_PFPuppiHT1070'
   [2] Calling method for module SiPixelPhase2DigiToCluster@alpaka/'hltPhase2SiPixelClustersSoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_15_1_0_pre4-el8_amd64_gcc12/build/CMSSW_15_1_0_pre4-build/el8_amd64_gcc12/external/alpaka/1.2.0-23a2bf2e896b7aace8e772f289604b47/include/alpaka/mem/buf/uniformCudaHip/Copy.hpp(143) 'TApi::setDevice(m_iDstDevice)' A previous API call (not this one) set the error  : 'cudaErrorLaunchOutOfResources': 'too many resources requested for launch'!
----- End Fatal Exception -------------------------------------------------

Since this was not the case in CMSSW_15_1_0_pre3, I did some digging to figure out the cause of the issue. First, running the same workflows on lxplus8-gpu or on the NGT farm does not crash (in both cases the GPUs are more powerful than a T4). Since the error points to hltPhase2SiPixelClustersSoA, I investigated the various kernel launches in that module and finally found the culprit here:

const auto workDivOneBlock = cms::alpakatools::make_workdiv<Acc1D>(1u, 1024u);
alpaka::exec<Acc1D>(queue, workDivOneBlock, FillHitsModuleStart<TrackerTraits>{}, clusters_d->view());

In particular, this kernel is launched with a single block of 1024 threads, which is the maximum threads per block on a T4. Drastically reducing the threads per block from 1024 to 64 fixes the crash. Since changing a number at random didn't sit right with me, I checked the resource utilisation of that particular kernel (in CMSSW_15_1_0_pre4):

ptxas info    : Function properties for _ZN6alpaka6detail9gpuKernelIN17alpaka_cuda_async12pixelDetails19FillHitsModuleStartIN13pixelTopology6Phase2EEENS_9ApiCudaRtENS_22AccGpuUniformCudaHipRtIS8_St17integral_constantImLm1EEjEESB_jJN21SiPixelClustersLayoutILm128ELb0EE22ViewTemplateFreeParamsILm128ELb0ELb1ELb1EEEEEEvNS_3VecIT2_T3_EET_DpT4_
    32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 82 registers, used 1 barriers, 432 bytes cmem[0], 4 bytes cmem[2]
ptxas info    : Compile time = 58.720 ms

This tells us that the FillHitsModuleStart kernel uses 82 registers per thread and no shared memory. Plugging these parameters into the occupancy calculator in Nsight Compute, with compute capability 7.5 to model a T4, we get something like this:

(Nsight Compute occupancy calculator screenshot)

It turns out that the upper limit on threads per block is 640, the highest multiple of 32 that does not crash: I tried both 640 and the next multiple of 32 (672), and the former runs fine while the latter crashes.
The launch parameters for this kernel have not been changed in the past few months, but looking at the same metrics for CMSSW_15_1_0_pre3:

ptxas info    : Function properties for _ZN6alpaka6detail9gpuKernelIN17alpaka_cuda_async12pixelDetails19FillHitsModuleStartIN13pixelTopology6Phase2EEENS_9ApiCudaRtENS_22AccGpuUniformCudaHipRtIS8_St17integral_constantImLm1EEjEESB_jJN21SiPixelClustersLayoutILm128ELb0EE22ViewTemplateFreeParamsILm128ELb0ELb1ELb1EEEEEEvNS_3VecIT2_T3_EET_DpT4_
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 58 registers, used 1 barriers, 432 bytes cmem[0], 4 bytes cmem[2]
ptxas info    : Compile time = 41.599 ms

you can see that the register usage jumped from 58 to 82 going from pre3 to pre4, which makes the launch parameters used until now incompatible with a T4.
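One possible mitigation (a sketch, not necessarily the fix that will land in CMSSW) is to stop hardcoding 1024 threads and instead clamp the requested block size to a per-device, per-kernel maximum, rounded down to a warp multiple. With the CUDA back-end that maximum is available at runtime (e.g. `maxThreadsPerBlock` from `cudaFuncGetAttributes`, which already folds in the kernel's register usage for the current device); the clamping logic itself is trivial:

```cpp
#include <algorithm>

constexpr unsigned warpSize = 32;

// Clamp a requested block size to what the kernel can actually launch with,
// rounded down to a whole multiple of the warp size. `kernelMaxThreads` would
// come from a runtime query such as cudaFuncGetAttributes(...).maxThreadsPerBlock.
unsigned clampBlockSize(unsigned requested, unsigned kernelMaxThreads) {
  unsigned t = std::min(requested, kernelMaxThreads);
  return std::max(warpSize, (t / warpSize) * warpSize);
}
```

The work division would then be built as, hypothetically, `make_workdiv<Acc1D>(1u, clampBlockSize(1024u, kernelMaxThreads))`: on a T4 in pre4 the query would cap this kernel at or below 640 threads, while on larger GPUs the full 1024 remains usable.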
The easiest way to reproduce this issue is to run the first two steps of workflow 29606.402 (SingleMuon, alpaka enabled) on a machine equipped with a T4:

runTheMatrix.py -w upgrade -l 29606.402 --maxSteps 2
