Description
While running alpaka-enabled workflows in CMSSW_15_1_0_pre4 (vanilla) on a machine equipped with a Tesla T4 GPU, I've noticed errors like the following:
---- Begin Fatal Exception 30-Jun-2025 18:54:24 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 1 lumi: 10 event: 9002 stream: 0
[1] Running path 'HLT_PFPuppiHT1070'
[2] Calling method for module SiPixelPhase2DigiToCluster@alpaka/'hltPhase2SiPixelClustersSoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_15_1_0_pre4-el8_amd64_gcc12/build/CMSSW_15_1_0_pre4-build/el8_amd64_gcc12/external/alpaka/1.2.0-23a2bf2e896b7aace8e772f289604b47/include/alpaka/mem/buf/uniformCudaHip/Copy.hpp(143) 'TApi::setDevice(m_iDstDevice)' A previous API call (not this one) set the error : 'cudaErrorLaunchOutOfResources': 'too many resources requested for launch'!
----- End Fatal Exception -------------------------------------------------
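As a side note, the exception surfaces inside alpaka's Copy.hpp only because that is where the error state happens to be checked next: the call that actually fails is the earlier kernel launch, and its cudaErrorLaunchOutOfResources stays set as the CUDA runtime's "last error" until a later API check picks it up (hence the "A previous API call (not this one) set the error" wording). A minimal standalone CUDA sketch of this last-error mechanism, not CMSSW code, with a trivial placeholder kernel that will not itself exceed any resource limit:

#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder; in the real case it is the register pressure of
// FillHitsModuleStart that exceeds the T4 resources.
__global__ void placeholderKernel() {}

int main() {
  // One block of 1024 threads, as in SiPixelRawToClusterKernel.dev.cc; whether
  // this exceeds the register budget depends on the kernel and the GPU.
  placeholderKernel<<<1, 1024>>>();

  // The <<<>>> launch returns no status; a failed launch sets the runtime's
  // "last error", which is only reported by whichever call checks it next
  // (here cudaGetLastError, in CMSSW alpaka's check around the memory copy).
  cudaError_t launchErr = cudaGetLastError();
  std::printf("launch status: %s\n", cudaGetErrorString(launchErr));

  cudaError_t syncErr = cudaDeviceSynchronize();
  std::printf("sync status:   %s\n", cudaGetErrorString(syncErr));
  return 0;
}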
Since this was not the case in CMSSW_15_1_0_pre3, I did some digging to figure out what the cause of the issue might be. First of all, running the same workflows on lxplus8-gpu or on the NGT farm does not crash (in both cases the GPUs are more powerful than a T4). Since the error is related to hltPhase2SiPixelClustersSoA, I went on to investigate the various kernel launches in that module, eventually finding the culprit here:
cmssw/RecoLocalTracker/SiPixelClusterizer/plugins/alpaka/SiPixelRawToClusterKernel.dev.cc
Lines 741 to 742 in 5c7b422
const auto workDivOneBlock = cms::alpakatools::make_workdiv<Acc1D>(1u, 1024u);
alpaka::exec<Acc1D>(queue, workDivOneBlock, FillHitsModuleStart<TrackerTraits>{}, clusters_d->view());
In particular, this kernel is launched with a single block of 1024 threads (the hardware maximum for a T4). Reducing the threads per block drastically from 1024 to 64 fixes the crash. Since changing a number at random didn't sit quite right with me, I checked the resource utilisation of that particular kernel (in CMSSW_15_1_0_pre4):
ptxas info : Function properties for _ZN6alpaka6detail9gpuKernelIN17alpaka_cuda_async12pixelDetails19FillHitsModuleStartIN13pixelTopology6Phase2EEENS_9ApiCudaRtENS_22AccGpuUniformCudaHipRtIS8_St17integral_constantImLm1EEjEESB_jJN21SiPixelClustersLayoutILm128ELb0EE22ViewTemplateFreeParamsILm128ELb0ELb1ELb1EEEEEEvNS_3VecIT2_T3_EET_DpT4_
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 82 registers, used 1 barriers, 432 bytes cmem[0], 4 bytes cmem[2]
ptxas info : Compile time = 58.720 ms
This tells us that the FillHitsModuleStart kernel uses 82 registers and no shared memory. Plugging these parameters into the occupancy calculator in Nsight Compute, with compute capability 7.5 to simulate a T4, we get something like this:
It turns out that the upper limit of threads per block we can actually use is 640 (the highest multiple of 32 that does not crash). I did indeed try 640 and the next multiple of 32 (672): the former runs fine, while the latter crashes.
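For reference, the same limit can also be queried programmatically rather than found by bisecting thread counts by hand: cudaFuncGetAttributes reports, for a compiled kernel on the current device, both its register count and the maximum block size that can actually be launched. A minimal standalone CUDA sketch, outside the alpaka code path and with a placeholder kernel standing in for the real FillHitsModuleStart instantiation:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void placeholderKernel() {}  // stand-in for the real kernel

int main() {
  cudaFuncAttributes attr{};
  cudaFuncGetAttributes(&attr, placeholderKernel);
  // numRegs matches the ptxas report; maxThreadsPerBlock already folds in the
  // register budget of the current device, so for a kernel using 82 registers
  // on a T4 it ends up well below the hardware limit of 1024.
  std::printf("registers per thread:           %d\n", attr.numRegs);
  std::printf("max threads per block (kernel): %d\n", attr.maxThreadsPerBlock);

  // Alternatively, let the runtime suggest an occupancy-optimal block size.
  int minGridSize = 0, blockSize = 0;
  cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, placeholderKernel);
  std::printf("occupancy-optimal block size:   %d\n", blockSize);
  return 0;
}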
The launch parameters for this kernel have not been changed in the past few months, but looking at the same metrics for CMSSW_15_1_0_pre3:
ptxas info : Function properties for _ZN6alpaka6detail9gpuKernelIN17alpaka_cuda_async12pixelDetails19FillHitsModuleStartIN13pixelTopology6Phase2EEENS_9ApiCudaRtENS_22AccGpuUniformCudaHipRtIS8_St17integral_constantImLm1EEjEESB_jJN21SiPixelClustersLayoutILm128ELb0EE22ViewTemplateFreeParamsILm128ELb0ELb1ELb1EEEEEEvNS_3VecIT2_T3_EET_DpT4_
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 58 registers, used 1 barriers, 432 bytes cmem[0], 4 bytes cmem[2]
ptxas info : Compile time = 41.599 ms
you can see that the register usage jumped from 58 to 82 going from pre3 to pre4, which makes the launch parameters used until now incompatible with a T4.
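Purely for illustration, a hedged sketch of what the alpaka side exposes (not a proposed fix): alpaka::getAccDevProps reports the device-wide block-size limit, which could be used to clamp the hard-coded 1024; however, on a T4 that limit is itself 1024, so it would not catch the tighter, register-driven per-kernel limit that matters here, and a per-kernel cap or kernel-attribute query would still be needed.

// Hedged sketch only (assumes <algorithm> and <cstdint> are available): clamp
// the hard-coded 1024 threads to the device-wide limit reported by alpaka.
// On a T4 this limit is still 1024, so by itself this does not capture the
// register-driven per-kernel limit that actually causes the crash.
auto const device = alpaka::getDev(queue);
auto const props = alpaka::getAccDevProps<Acc1D>(device);
auto const deviceMaxThreads = static_cast<uint32_t>(props.m_blockThreadCountMax);
auto const threadsPerBlock = std::min(1024u, deviceMaxThreads);

const auto workDivOneBlock = cms::alpakatools::make_workdiv<Acc1D>(1u, threadsPerBlock);
alpaka::exec<Acc1D>(queue, workDivOneBlock, FillHitsModuleStart<TrackerTraits>{}, clusters_d->view());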
The easiest way to reproduce this issue is to run the first two steps of workflow 29606.402 (SingleMuon, alpaka enabled) on a machine equipped with a T4:
runTheMatrix.py -w upgrade -l 29606.402 --maxSteps 2