Failed running demo/guide-python/external_memory.py built from main #11751

@SkqLiao

Hi, I'm trying to use ExtMemQuantileDMatrix to train on a huge dataset on GPUs.

For example, training on a 1 TB raw fp32 dataset with 4 or 8 RTX 4090 GPUs (24 GB each) and 2 or 4 TB of host memory (which is sufficient for the same dataset on CPU). By the way, is this possible?
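
As a rough sanity check on the sizes involved, here is a back-of-the-envelope sketch. It assumes each fp32 value is quantized to about one byte; that figure is a placeholder for illustration only, since the real compressed size depends on settings such as the number of bins and on sparsity. The helper name `quantized_size_bytes` is hypothetical.

```python
def quantized_size_bytes(raw_bytes: float, raw_itemsize: int = 4,
                         bytes_per_bin: float = 1.0) -> float:
    """Estimate the quantized data size from the raw size.

    bytes_per_bin is an ASSUMPTION (about one byte per value), not a
    figure taken from XGBoost.
    """
    n_values = raw_bytes / raw_itemsize  # number of fp32 entries
    return n_values * bytes_per_bin


TB = 1e12
GB = 1e9

raw = 1 * TB
quantized = quantized_size_bytes(raw)
print(f"raw: {raw / TB:.2f} TB, quantized estimate: {quantized / GB:.0f} GB")

# Compare the estimate with the aggregate device memory of each setup.
for n_gpus in (4, 8):
    total_vram = n_gpus * 24 * GB
    fits = quantized <= total_vram
    print(f"{n_gpus} x 24 GB = {total_vram / GB:.0f} GB of device memory, "
          f"fits entirely on device: {fits}")
```

Under this assumption the quantized data is larger than the combined device memory in both setups but far smaller than the host memory, which is why I am looking at the external-memory path.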

However, I encountered this error when running the demo code demo/guide-python/external_memory.py:

/cache/xgboost/python-package/xgboost/core.py:1893: UserWarning: [01:54:44] WARNING: /cache/xgboost/src/data/ellpack_page_source.h:191: CUDA heterogeneous memory management is not available. The overhead of iterating through external memory might be significant.
  self._init(
/cache/xgboost/python-package/xgboost/core.py:1893: UserWarning: [01:54:44] WARNING: /cache/xgboost/src/data/ellpack_page_source.cu:618: Running on a NUMA system without membind. The overhead of iterating through external memory might be significant.
  self._init(
Traceback (most recent call last):
  File "/cache/xgboost/demo/guide-python/external_memory.py", line 213, in <module>
    main(tmpdir, args)
  File "/cache/xgboost/demo/guide-python/external_memory.py", line 172, in main
    hist_train(it)
  File "/cache/xgboost/demo/guide-python/external_memory.py", line 136, in hist_train
    Xy = xgboost.ExtMemQuantileDMatrix(it, missing=np.nan, enable_categorical=False)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cache/xgboost/python-package/xgboost/core.py", line 774, in inner_f
    return func(**kwargs)
           ^^^^^^^^^^^^^^
  File "/cache/xgboost/python-package/xgboost/core.py", line 1893, in __init__
    self._init(
  File "/cache/xgboost/python-package/xgboost/core.py", line 1940, in _init
    _check_call(ret)
  File "/cache/xgboost/python-package/xgboost/core.py", line 323, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [01:54:44] /cache/xgboost/src/common/common.cu:16: /cache/xgboost/src/common/cuda_pinned_allocator.cu: 49: cudaErrorInvalidValue: invalid argument
Stack trace:
  [bt] (0) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(+0x4c8bb1) [0x7f0fa1bc1bb1]
  [bt] (1) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(dh::ThrowOnCudaError(cudaError, char const*, int)+0x603) [0x7f0fa24ee453]
  [bt] (2) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(+0xdf5900) [0x7f0fa24ee900]
  [bt] (3) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::common::cuda_impl::CreateHostMemPool()+0x9) [0x7f0fa24ee969]
  [bt] (4) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::data::EllpackMemCache::EllpackMemCache(xgboost::data::EllpackCacheInfo, int)+0x306) [0x7f0fa260dd66]
  [bt] (5) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::data::EllpackCacheStreamPolicy<xgboost::EllpackPage, xgboost::data::EllpackFormatPolicy>::CreateWriter(xgboost::StringView, unsigned int)+0x3cb) [0x7f0fa26161db]
  [bt] (6) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::data::SparsePageSourceImpl<xgboost::EllpackPage, xgboost::data::EllpackCacheStreamPolicy<xgboost::EllpackPage, xgboost::data::EllpackFormatPolicy> >::WriteCache()+0x84) [0x7f0fa261af54]
  [bt] (7) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::data::ExtEllpackPageSourceImpl<xgboost::data::EllpackCacheStreamPolicy<xgboost::EllpackPage, xgboost::data::EllpackFormatPolicy> >::ExtEllpackPageSourceImpl(xgboost::Context const*, xgboost::MetaInfo*, xgboost::data::ExternalDataInfo, std::shared_ptr<xgboost::data::Cache>, std::shared_ptr<xgboost::common::HistogramCuts>, std::shared_ptr<xgboost::data::DataIterProxy<void (void*), int (void*)> >, xgboost::data::DMatrixProxy*, xgboost::data::EllpackCacheInfo const&)+0x743) [0x7f0fa262ab83]
  [bt] (8) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::data::ExtMemQuantileDMatrix::InitFromCUDA(xgboost::Context const*, std::shared_ptr<xgboost::data::DataIterProxy<void (void*), int (void*)> >, void*, xgboost::BatchParam const&, std::shared_ptr<xgboost::DMatrix>, long, xgboost::ExtMemConfig const&)+0xa47) [0x7f0fa26266e7]

I built XGBoost from source following the documentation:

git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
cmake -B build -S . -DUSE_CUDA=ON -DUSE_NCCL=ON -DPLUGIN_RMM=ON -DCMAKE_PREFIX_PATH=$CONDA_PREFIX  -DBUILD_WITH_SHARED_NCCL=ON
cd build && make -j$(nproc)
cd ../python-package && pip install -e .

The environment is an RTX 4090 with CUDA 12.9 and 2 TB of memory. Thanks!
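
To make the environment easier to reproduce, a small script along the following lines can collect the relevant versions. This is a minimal sketch: `collect_env` is a hypothetical helper name, every probe is optional, and it assumes `xgboost.build_info()` is available in the installed build and that `nvidia-smi` is on `PATH`. Missing pieces are reported as `None`.

```python
import importlib
import platform
import shutil
import subprocess


def collect_env() -> dict:
    """Gather basic version info; each probe may yield None if unavailable."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }

    # XGBoost version and compile-time flags, if the package can be imported.
    try:
        xgb = importlib.import_module("xgboost")
        info["xgboost"] = xgb.__version__
        info["build_info"] = xgb.build_info()
    except Exception:  # not installed, or the shared library failed to load
        info["xgboost"] = None
        info["build_info"] = None

    # GPU name, driver version and memory via nvidia-smi, if present.
    info["gpu"] = None
    smi = shutil.which("nvidia-smi")
    if smi is not None:
        try:
            out = subprocess.run(
                [smi, "--query-gpu=name,driver_version,memory.total",
                 "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10, check=True,
            )
            info["gpu"] = out.stdout.strip().splitlines()
        except (subprocess.SubprocessError, OSError):
            pass  # leave "gpu" as None when the query fails

    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

The output can be pasted into the issue as-is. It does not cover the NUMA or heterogeneous memory management settings mentioned in the warnings above.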
