Description
While running alpaka-enabled workflows in CMSSW_15_1_0_pre4 (vanilla) on a machine equipped with a Tesla T4 GPU, I've noticed errors like the following:
---- Begin Fatal Exception 30-Jun-2025 18:54:24 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 1 lumi: 10 event: 9002 stream: 0
[1] Running path 'HLT_PFPuppiHT1070'
[2] Calling method for module SiPixelPhase2DigiToCluster@alpaka/'hltPhase2SiPixelClustersSoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_15_1_0_pre4-el8_amd64_gcc12/build/CMSSW_15_1_0_pre4-build/el8_amd64_gcc12/external/alpaka/1.2.0-23a2bf2e896b7aace8e772f289604b47/include/alpaka/mem/buf/uniformCudaHip/Copy.hpp(143) 'TApi::setDevice(m_iDstDevice)' A previous API call (not this one) set the error : 'cudaErrorLaunchOutOfResources': 'too many resources requested for launch'!
----- End Fatal Exception -------------------------------------------------
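As a side note, the exception surfaces inside alpaka's Copy.hpp only because that is where the error state happens to be checked next: the call that actually fails is the earlier kernel launch, and its cudaErrorLaunchOutOfResources stays set as the CUDA runtime's "last error" until a later API check picks it up (hence the "A previous API call (not this one) set the error" wording). A minimal standalone CUDA sketch of this last-error mechanism, not CMSSW code, with a trivial placeholder kernel that will not itself exceed any resource limit:

#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder; in the real case it is the register pressure of
// FillHitsModuleStart that exceeds the T4 resources.
__global__ void placeholderKernel() {}

int main() {
  // One block of 1024 threads, as in SiPixelRawToClusterKernel.dev.cc; whether
  // this exceeds the register budget depends on the kernel and the GPU.
  placeholderKernel<<<1, 1024>>>();

  // The <<<>>> launch returns no status; a failed launch sets the runtime's
  // "last error", which is only reported by whichever call checks it next
  // (here cudaGetLastError, in CMSSW alpaka's check around the memory copy).
  cudaError_t launchErr = cudaGetLastError();
  std::printf("launch status: %s\n", cudaGetErrorString(launchErr));

  cudaError_t syncErr = cudaDeviceSynchronize();
  std::printf("sync status:   %s\n", cudaGetErrorString(syncErr));
  return 0;
}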
Since this was not the case in CMSSW_15_1_0_pre3, I did some digging to figure out what the cause of the issue might be. First of all, running the same workflows on lxplus8-gpu or on the NGT farm does not crash (in both cases the GPUs are more powerful than a T4). Since the error is related to hltPhase2SiPixelClustersSoA, I went on to investigate the various kernel launches in that module, eventually finding the culprit here:
cmssw/RecoLocalTracker/SiPixelClusterizer/plugins/alpaka/SiPixelRawToClusterKernel.dev.cc
Lines 741 to 742 in 5c7b422
const auto workDivOneBlock = cms::alpakatools::make_workdiv<Acc1D>(1u, 1024u);
alpaka::exec<Acc1D>(queue, workDivOneBlock, FillHitsModuleStart<TrackerTraits>{}, clusters_d->view());
In particular, this kernel is launched with a single block of 1024 threads (the hardware maximum for a T4). Reducing the threads per block drastically from 1024 to 64 fixes the crash. Since changing a number at random didn't sit quite right with me, I checked the resource utilisation of that particular kernel (in CMSSW_15_1_0_pre4):
ptxas info : Function properties for _ZN6alpaka6detail9gpuKernelIN17alpaka_cuda_async12pixelDetails19FillHitsModuleStartIN13pixelTopology6Phase2EEENS_9ApiCudaRtENS_22AccGpuUniformCudaHipRtIS8_St17integral_constantImLm1EEjEESB_jJN21SiPixelClustersLayoutILm128ELb0EE22ViewTemplateFreeParamsILm128ELb0ELb1ELb1EEEEEEvNS_3VecIT2_T3_EET_DpT4_
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 82 registers, used 1 barriers, 432 bytes cmem[0], 4 bytes cmem[2]
ptxas info : Compile time = 58.720 ms
This tells us that the FillHitsModuleStart kernel uses 82 registers and no shared memory. Plugging these parameters into the occupancy calculator in Nsight Compute, with compute capability 7.5 to simulate a T4, we get something like this:
It turns out that the upper limit of threads per block we can actually use is 640 (the highest multiple of 32 that does not crash). I did indeed try 640 and the next multiple of 32 (672): the former runs fine, while the latter crashes.
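For reference, the same limit can also be queried programmatically rather than found by bisecting thread counts by hand: cudaFuncGetAttributes reports, for a compiled kernel on the current device, both its register count and the maximum block size that can actually be launched. A minimal standalone CUDA sketch, outside the alpaka code path and with a placeholder kernel standing in for the real FillHitsModuleStart instantiation:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void placeholderKernel() {}  // stand-in for the real kernel

int main() {
  cudaFuncAttributes attr{};
  cudaFuncGetAttributes(&attr, placeholderKernel);
  // numRegs matches the ptxas report; maxThreadsPerBlock already folds in the
  // register budget of the current device, so for a kernel using 82 registers
  // on a T4 it ends up well below the hardware limit of 1024.
  std::printf("registers per thread:           %d\n", attr.numRegs);
  std::printf("max threads per block (kernel): %d\n", attr.maxThreadsPerBlock);

  // Alternatively, let the runtime suggest an occupancy-optimal block size.
  int minGridSize = 0, blockSize = 0;
  cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, placeholderKernel);
  std::printf("occupancy-optimal block size:   %d\n", blockSize);
  return 0;
}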
The launch parameters for this kernel have not been changed in the past few months, but looking at the same metrics for CMSSW_15_1_0_pre3:
ptxas info : Function properties for _ZN6alpaka6detail9gpuKernelIN17alpaka_cuda_async12pixelDetails19FillHitsModuleStartIN13pixelTopology6Phase2EEENS_9ApiCudaRtENS_22AccGpuUniformCudaHipRtIS8_St17integral_constantImLm1EEjEESB_jJN21SiPixelClustersLayoutILm128ELb0EE22ViewTemplateFreeParamsILm128ELb0ELb1ELb1EEEEEEvNS_3VecIT2_T3_EET_DpT4_
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 58 registers, used 1 barriers, 432 bytes cmem[0], 4 bytes cmem[2]
ptxas info : Compile time = 41.599 ms
you can see that the register usage jumped from 58 to 82 going from pre3 to pre4, which makes the launch parameters used until now incompatible with a T4.
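Purely for illustration, a hedged sketch of what the alpaka side exposes (not a proposed fix): alpaka::getAccDevProps reports the device-wide block-size limit, which could be used to clamp the hard-coded 1024; however, on a T4 that limit is itself 1024, so it would not catch the tighter, register-driven per-kernel limit that matters here, and a per-kernel cap or kernel-attribute query would still be needed.

// Hedged sketch only (assumes <algorithm> and <cstdint> are available): clamp
// the hard-coded 1024 threads to the device-wide limit reported by alpaka.
// On a T4 this limit is still 1024, so by itself this does not capture the
// register-driven per-kernel limit that actually causes the crash.
auto const device = alpaka::getDev(queue);
auto const props = alpaka::getAccDevProps<Acc1D>(device);
auto const deviceMaxThreads = static_cast<uint32_t>(props.m_blockThreadCountMax);
auto const threadsPerBlock = std::min(1024u, deviceMaxThreads);

const auto workDivOneBlock = cms::alpakatools::make_workdiv<Acc1D>(1u, threadsPerBlock);
alpaka::exec<Acc1D>(queue, workDivOneBlock, FillHitsModuleStart<TrackerTraits>{}, clusters_d->view());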
The easiest way to reproduce this issue is to run the first two steps of workflow 29606.402 (SingleMuon, alpaka enabled) on a machine equipped with a T4:
runTheMatrix.py -w upgrade -l 29606.402 --maxSteps 2