Skip to content

[Issue]: Building ck fails on Fedora 42 with reference to __host__ function '__glibcxx_assert_fail' in __host__ __device__ function #2887

@Aethor

Description

@Aethor

Problem Description

Following #2759, I am reporting an additional issue encountered when compiling CK on Fedora 42. The build fails with the following error:

/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/array:210:2: error: reference to __host__ function '__glibcxx_assert_fail' in __host__ __device__ function
  210 |         __glibcxx_requires_subscript(__n);
      |         ^
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/debug/assertions.h:39:3: note: expanded from macro '__glibcxx_requires_subscript'
   39 |   __glibcxx_assert(_N < this->size())
      |   ^
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/x86_64-redhat-linux/bits/c++config.h:2572:12: note: expanded from macro '__glibcxx_assert'
 2572 |       std::__glibcxx_assert_fail();                                     \
      |            ^
/home/aethor/Dev/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_contraction_multiple_d_wmma_cshuffle.hpp:524:17: note: called by 'operator()<ck::integral_constant<int, 0>>'
  524 |                 ds_offset[i] = static_cast<long_index_t>(g_idx) *
      |                 ^
/home/aethor/Dev/composable_kernel/include/ck/utility/functional2.hpp:60:10: note: called by 'operator()<(lambda at /home/aethor/Dev/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_contraction_multiple_d_wmma_cshuffle.hpp:523:44)>'
   60 |         (f(Number<Is>{}), ...);
      |          ^
/home/aethor/Dev/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_contraction_multiple_d_wmma_cshuffle.hpp:523:13: note: called by 'GetDsPtrOffset'
  523 |             static_for<0, NumDTensor, 1>{}([&](auto i) {
      |             ^
/home/aethor/Dev/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_d_wmma_cshuffle.hpp:165:62: note: called by 'kernel_contraction_multiple_d_wmma_cshuffle<ck::GridwiseGemmMultipleD_Wmma<_Float16, _Float16, float, _Float16, ck::Tuple<_Float16>, _Float16, ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int>, ck::Tuple<int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int>>, ck::RightPad<int, int>, ck::RightPad<int, int>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 2>, ck::integral_constant<int, 2>, ck::integral_constant<int, 4>>, false>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 4>, ck::integral_constant<int, 16>>, false>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<7>, ck::Sequence<6>>, ck::Tuple<ck::Sequence<1, 2, 3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8, 9, 10, 11>, ck::Sequence<12, 13, 14>>, ck::Sequence<8, 12, 13, 9, 10, 14, 11>, long>, ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int>, ck::Tuple<int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int>>, ck::RightPad<int, int>, ck::RightPad<int, int>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 4>>, false>, ck::PassThrough<int>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<7>, ck::Sequence<6>>, ck::Tuple<ck::Sequence<1, 2, 3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8, 9>, ck::Sequence<10>>, ck::Sequence<8, 10, 9>, long>, ck::Tuple<ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int, int>, ck::Tuple<int, int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::RightPad<int, int>, ck::RightPad<int, int>>, 
ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3, 4>, ck::Sequence<5>, ck::Sequence<6>>, ck::Tuple<ck::Sequence<1, 2, 3, 4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8>>, ck::Sequence<7, 8>, long>>, ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int, int>, ck::Tuple<int, int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::RightPad<int, int>, ck::RightPad<int, int>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3, 4>, ck::Sequence<5>, ck::Sequence<6>>, ck::Tuple<ck::Sequence<1, 2, 3, 4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8>>, ck::Sequence<7, 8>, long>, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::Add, ck::InMemoryDataOperationEnum::Set, 64, 64, 64, 16, 16, 4, 1, 4, 128, ck::Sequence<4, 32, 1>, ck::Sequence<1, 0, 2>, ck::Sequence<1, 0, 2>, 2, 4, 4, false, false, false, ck::Sequence<4, 32, 1>, ck::Sequence<1, 0, 2>, ck::Sequence<1, 0, 2>, 2, 4, 4, false, true, false, 1, 1, ck::Sequence<1, 64, 1, 2>, 8, 1, ck::LoopScheduler::Default, ck::PipelineVersion::v1>, _Float16, _Float16, ck::Tuple<const _Float16 *>, _Float16, ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int>, ck::Tuple<int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int>>, ck::RightPad<int, int>, ck::RightPad<int, int>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 2>, ck::integral_constant<int, 2>, ck::integral_constant<int, 4>>, false>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 4>, ck::integral_constant<int, 16>>, false>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<7>, ck::Sequence<6>>, ck::Tuple<ck::Sequence<1, 2, 3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8, 
9, 10, 11>, ck::Sequence<12, 13, 14>>, ck::Sequence<8, 12, 13, 9, 10, 14, 11>, long>, ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int>, ck::Tuple<int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int>>, ck::RightPad<int, int>, ck::RightPad<int, int>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 4>>, false>, ck::PassThrough<int>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<7>, ck::Sequence<6>>, ck::Tuple<ck::Sequence<1, 2, 3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8, 9>, ck::Sequence<10>>, ck::Sequence<8, 10, 9>, long>, ck::Tuple<ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int, int>, ck::Tuple<int, int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::RightPad<int, int>, ck::RightPad<int, int>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 64>>, false>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 64>>, false>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3, 4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8>>, ck::Tuple<ck::Sequence<1, 2, 3, 4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8>, ck::Sequence<9, 10>, ck::Sequence<11, 12>>, ck::Sequence<9, 10, 11, 12>, long>>, ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int, int>, ck::Tuple<int, int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::RightPad<int, int>, ck::RightPad<int, int>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 64>>, false>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 64>>, false>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3, 4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8>>, ck::Tuple<ck::Sequence<1, 2, 3, 4>, 
ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8>, ck::Sequence<9, 10>, ck::Sequence<11, 12>>, ck::Sequence<9, 10, 11, 12>, long>, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::Add, ck::tensor_operation::device::DeviceBatchedContractionMultipleD_Wmma_CShuffle<2, 2, 2, 1, _Float16, _Float16, float, _Float16, ck::Tuple<_Float16>, _Float16, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::Add, ck::tensor_operation::device::GemmSpecialization::MNKPadding, ck::tensor_operation::device::TensorSpecialization::Default, ck::tensor_operation::device::TensorSpecialization::Default, ck::tensor_operation::device::TensorSpecialization::Default, 1, 128, 64, 64, 64, 4, 16, 16, 1, 4, ck::Sequence<4, 32, 1>, ck::Sequence<1, 0, 2>, ck::Sequence<1, 0, 2>, 2, 4, 4, false, ck::Sequence<4, 32, 1>, ck::Sequence<1, 0, 2>, ck::Sequence<1, 0, 2>, 2, 4, 4, false, 1, 1, ck::Sequence<1, 64, 1, 2>, 8>::ComputePtrOffsetOfStridedBatch, ck::BlockToCTileMap_M00_N0_M01Adapt<64, 64, ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int, int>, ck::Tuple<int, int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::RightPad<int, int>, ck::RightPad<int, int>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3, 4>, ck::Sequence<5>, ck::Sequence<6>>, ck::Tuple<ck::Sequence<1, 2, 3, 4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8>>, ck::Sequence<7, 8>, long>>, true>'
  165 |     const auto ds_batch_offset = compute_ptr_offset_of_batch.GetDsPtrOffset(g_idx);
      |                                                              ^
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/x86_64-redhat-linux/bits/c++config.h:2566:3: note: '__glibcxx_assert_fail' declared here
 2566 |   __glibcxx_assert_fail()
      |   ^

Operating System

Fedora 42

CPU

AMD Ryzen 7 7800X3D

GPU

Other

Other

RX 9070 XT

ROCm Version

ROCm 6.3.1

ROCm Component

Composable Kernel

Steps to Reproduce

Try to build Composable Kernel by running the following CMake command from the build folder:

cmake \
-D CMAKE_PREFIX_PATH=$(hipconfig -R) \
-D CMAKE_CXX_COMPILER=$(hipconfig -R)/bin/hipcc  \
-D CMAKE_BUILD_TYPE=Release \
-D GPU_TARGETS=gfx1100 \
-D CMAKE_CXX_FLAGS="-Wno-error=fuse-ld-path" \
-D CMAKE_HIP_FLAGS="-Wno-error=fuse-ld-path" \
..

(note that the last two flags, CMAKE_CXX_FLAGS and CMAKE_HIP_FLAGS, are necessary on Fedora 42 due to #2759)

Then, run make and observe the following error:

In file included from /home/aethor/Dev/composable_kernel/example/29_batched_gemm_bias_e_permute/batched_gemm_bias_e_permute_wmma_fp16.cpp:9:
In file included from /home/aethor/Dev/composable_kernel/include/ck/ck.hpp:10:
In file included from /usr/include/hip/hip_runtime.h:62:
In file included from /usr/include/hip/amd_detail/amd_hip_runtime.h:114:
In file included from /usr/include/hip/hip_runtime_api.h:575:
In file included from /usr/include/hip/texture_types.h:47:
In file included from /usr/include/hip/channel_descriptor.h:32:
In file included from /usr/include/hip/amd_detail/amd_channel_descriptor.h:29:
In file included from /usr/include/hip/amd_detail/amd_hip_vector_types.h:49:
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/array:210:2: error: reference to __host__ function '__glibcxx_assert_fail' in __host__ __device__ function
  210 |         __glibcxx_requires_subscript(__n);
      |         ^
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/debug/assertions.h:39:3: note: expanded from macro '__glibcxx_requires_subscript'
   39 |   __glibcxx_assert(_N < this->size())
      |   ^
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/x86_64-redhat-linux/bits/c++config.h:2572:12: note: expanded from macro '__glibcxx_assert'
 2572 |       std::__glibcxx_assert_fail();                                     \
      |            ^
/home/aethor/Dev/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_contraction_multiple_d_wmma_cshuffle.hpp:524:17: note: called by 'operator()<ck::integral_constant<int, 0>>'
  524 |                 ds_offset[i] = static_cast<long_index_t>(g_idx) *
      |                 ^
/home/aethor/Dev/composable_kernel/include/ck/utility/functional2.hpp:60:10: note: called by 'operator()<(lambda at /home/aethor/Dev/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_contraction_multiple_d_wmma_cshuffle.hpp:523:44)>'
   60 |         (f(Number<Is>{}), ...);
      |          ^
/home/aethor/Dev/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_contraction_multiple_d_wmma_cshuffle.hpp:523:13: note: called by 'GetDsPtrOffset'
  523 |             static_for<0, NumDTensor, 1>{}([&](auto i) {
      |             ^
/home/aethor/Dev/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_d_wmma_cshuffle.hpp:165:62: note: called by 'kernel_contraction_multiple_d_wmma_cshuffle<ck::GridwiseGemmMultipleD_Wmma<_Float16, _Float16, float, _Float16, ck::Tuple<_Float16>, _Float16, ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int>, ck::Tuple<int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int>>, ck::RightPad<int, int>, ck::RightPad<int, int>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 2>, ck::integral_constant<int, 2>, ck::integral_constant<int, 4>>, false>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 4>, ck::integral_constant<int, 16>>, false>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<7>, ck::Sequence<6>>, ck::Tuple<ck::Sequence<1, 2, 3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8, 9, 10, 11>, ck::Sequence<12, 13, 14>>, ck::Sequence<8, 12, 13, 9, 10, 14, 11>, long>, ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int>, ck::Tuple<int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int>>, ck::RightPad<int, int>, ck::RightPad<int, int>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 4>>, false>, ck::PassThrough<int>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<7>, ck::Sequence<6>>, ck::Tuple<ck::Sequence<1, 2, 3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8, 9>, ck::Sequence<10>>, ck::Sequence<8, 10, 9>, long>, ck::Tuple<ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int, int>, ck::Tuple<int, int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::RightPad<int, int>, ck::RightPad<int, int>>, 
ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3, 4>, ck::Sequence<5>, ck::Sequence<6>>, ck::Tuple<ck::Sequence<1, 2, 3, 4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8>>, ck::Sequence<7, 8>, long>>, ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int, int>, ck::Tuple<int, int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::RightPad<int, int>, ck::RightPad<int, int>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3, 4>, ck::Sequence<5>, ck::Sequence<6>>, ck::Tuple<ck::Sequence<1, 2, 3, 4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8>>, ck::Sequence<7, 8>, long>, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::Add, ck::InMemoryDataOperationEnum::Set, 64, 64, 64, 16, 16, 4, 1, 4, 128, ck::Sequence<4, 32, 1>, ck::Sequence<1, 0, 2>, ck::Sequence<1, 0, 2>, 2, 4, 4, false, false, false, ck::Sequence<4, 32, 1>, ck::Sequence<1, 0, 2>, ck::Sequence<1, 0, 2>, 2, 4, 4, false, true, false, 1, 1, ck::Sequence<1, 64, 1, 2>, 8, 1, ck::LoopScheduler::Default, ck::PipelineVersion::v1>, _Float16, _Float16, ck::Tuple<const _Float16 *>, _Float16, ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int>, ck::Tuple<int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int>>, ck::RightPad<int, int>, ck::RightPad<int, int>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 2>, ck::integral_constant<int, 2>, ck::integral_constant<int, 4>>, false>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 4>, ck::integral_constant<int, 16>>, false>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<7>, ck::Sequence<6>>, ck::Tuple<ck::Sequence<1, 2, 3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8, 
9, 10, 11>, ck::Sequence<12, 13, 14>>, ck::Sequence<8, 12, 13, 9, 10, 14, 11>, long>, ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int>, ck::Tuple<int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int>>, ck::RightPad<int, int>, ck::RightPad<int, int>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 4>>, false>, ck::PassThrough<int>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<7>, ck::Sequence<6>>, ck::Tuple<ck::Sequence<1, 2, 3>, ck::Sequence<4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8, 9>, ck::Sequence<10>>, ck::Sequence<8, 10, 9>, long>, ck::Tuple<ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int, int>, ck::Tuple<int, int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::RightPad<int, int>, ck::RightPad<int, int>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 64>>, false>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 64>>, false>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3, 4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8>>, ck::Tuple<ck::Sequence<1, 2, 3, 4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8>, ck::Sequence<9, 10>, ck::Sequence<11, 12>>, ck::Sequence<9, 10, 11, 12>, long>>, ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int, int>, ck::Tuple<int, int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::RightPad<int, int>, ck::RightPad<int, int>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 64>>, false>, ck::UnMerge<ck::Tuple<int, ck::integral_constant<int, 64>>, false>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3, 4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8>>, ck::Tuple<ck::Sequence<1, 2, 3, 4>, 
ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8>, ck::Sequence<9, 10>, ck::Sequence<11, 12>>, ck::Sequence<9, 10, 11, 12>, long>, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::Add, ck::tensor_operation::device::DeviceBatchedContractionMultipleD_Wmma_CShuffle<2, 2, 2, 1, _Float16, _Float16, float, _Float16, ck::Tuple<_Float16>, _Float16, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::PassThrough, ck::tensor_operation::element_wise::Add, ck::tensor_operation::device::GemmSpecialization::MNKPadding, ck::tensor_operation::device::TensorSpecialization::Default, ck::tensor_operation::device::TensorSpecialization::Default, ck::tensor_operation::device::TensorSpecialization::Default, 1, 128, 64, 64, 64, 4, 16, 16, 1, 4, ck::Sequence<4, 32, 1>, ck::Sequence<1, 0, 2>, ck::Sequence<1, 0, 2>, 2, 4, 4, false, ck::Sequence<4, 32, 1>, ck::Sequence<1, 0, 2>, ck::Sequence<1, 0, 2>, 2, 4, 4, false, 1, 1, ck::Sequence<1, 64, 1, 2>, 8>::ComputePtrOffsetOfStridedBatch, ck::BlockToCTileMap_M00_N0_M01Adapt<64, 64, ck::TensorDescriptor<ck::Tuple<ck::Embed<ck::Tuple<int, int, int, int>, ck::Tuple<int, int, int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::Merge_v2_magic_division<ck::Tuple<int, int>>, ck::RightPad<int, int>, ck::RightPad<int, int>>, ck::Tuple<ck::Sequence<0>, ck::Sequence<1, 2>, ck::Sequence<3, 4>, ck::Sequence<5>, ck::Sequence<6>>, ck::Tuple<ck::Sequence<1, 2, 3, 4>, ck::Sequence<5>, ck::Sequence<6>, ck::Sequence<7>, ck::Sequence<8>>, ck::Sequence<7, 8>, long>>, true>'
  165 |     const auto ds_batch_offset = compute_ptr_offset_of_batch.GetDsPtrOffset(g_idx);
      |                                                              ^
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/x86_64-redhat-linux/bits/c++config.h:2566:3: note: '__glibcxx_assert_fail' declared here
 2566 |   __glibcxx_assert_fail()
      |   ^

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 7800X3D 8-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 7800X3D 8-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   5053                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    32392232(0x1ee4428) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    32392232(0x1ee4428) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32392232(0x1ee4428) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 4                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32392232(0x1ee4428) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1201                            
  Uuid:                    GPU-8c7c68d970d211b8               
  Marketing Name:          AMD Radeon RX 9070 XT              
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      8192(0x2000) KB                    
    L3:                      65536(0x10000) KB                  
  Chip ID:                 30032(0x7550)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          256(0x100)                         
  Max Clock Freq. (MHz):   2460                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            64                                 
  SIMDs per CU:            2                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 58                                 
  SDMA engine uCode::      380                                
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16695296(0xfec000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1201         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done *** 

Additional Information

No response

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions