Hi, I'm trying to use ExtMemQuantileDMatrix to train on a huge dataset on GPUs.
For example, training on a 1 TB raw fp32 dataset with 4 or 8 RTX 4090s (24 GB each) and 2 or 4 TB of host memory (which is sufficient for the same dataset on CPU). By the way, is this possible?
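For a rough sense of scale, here is a back-of-the-envelope sketch of the sizes involved. It assumes decimal units and one byte per quantized value after binning. That byte count is an illustrative assumption, not XGBoost's actual storage layout, and missing values, padding, and any memory reserved for training are ignored. The helper name `quantized_bytes` is hypothetical.

```python
# Back-of-the-envelope sizing; every constant below is an assumption.
RAW_BYTES = 10**12        # "1 TB" taken as a decimal terabyte
BYTES_PER_FP32 = 4        # size of one raw fp32 value
BYTES_PER_BIN = 1         # assumed: one byte per quantized value
GPU_BYTES = 24 * 10**9    # "24 GB" per RTX 4090, ignoring reserved memory


def quantized_bytes(raw_bytes, bytes_per_value=BYTES_PER_FP32,
                    bytes_per_bin=BYTES_PER_BIN):
    """Estimate the size of the data after quantization (hypothetical helper)."""
    n_values = raw_bytes // bytes_per_value
    return n_values * bytes_per_bin


q = quantized_bytes(RAW_BYTES)
for n_gpus in (4, 8):
    total = n_gpus * GPU_BYTES
    print(f"{n_gpus} GPUs: quantized ~{q / 1e9:.0f} GB, "
          f"device memory {total / 1e9:.0f} GB, fits={q <= total}")
```

Under these assumptions the quantized data is about 250 GB, which is more than the combined device memory of 8 cards, so most of it would have to live outside device memory.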
However, I encountered this error when running the demo code demo/guide-python/external_memory.py:
/cache/xgboost/python-package/xgboost/core.py:1893: UserWarning: [01:54:44] WARNING: /cache/xgboost/src/data/ellpack_page_source.h:191: CUDA heterogeneous memory management is not available. The overhead of iterating through external memory might be significant.
self._init(
/cache/xgboost/python-package/xgboost/core.py:1893: UserWarning: [01:54:44] WARNING: /cache/xgboost/src/data/ellpack_page_source.cu:618: Running on a NUMA system without membind. The overhead of iterating through external memory might be significant.
self._init(
Traceback (most recent call last):
File "/cache/xgboost/demo/guide-python/external_memory.py", line 213, in <module>
main(tmpdir, args)
File "/cache/xgboost/demo/guide-python/external_memory.py", line 172, in main
hist_train(it)
File "/cache/xgboost/demo/guide-python/external_memory.py", line 136, in hist_train
Xy = xgboost.ExtMemQuantileDMatrix(it, missing=np.nan, enable_categorical=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/cache/xgboost/python-package/xgboost/core.py", line 774, in inner_f
return func(**kwargs)
^^^^^^^^^^^^^^
File "/cache/xgboost/python-package/xgboost/core.py", line 1893, in __init__
self._init(
File "/cache/xgboost/python-package/xgboost/core.py", line 1940, in _init
_check_call(ret)
File "/cache/xgboost/python-package/xgboost/core.py", line 323, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [01:54:44] /cache/xgboost/src/common/common.cu:16: /cache/xgboost/src/common/cuda_pinned_allocator.cu: 49: cudaErrorInvalidValue: invalid argument
Stack trace:
[bt] (0) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(+0x4c8bb1) [0x7f0fa1bc1bb1]
[bt] (1) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(dh::ThrowOnCudaError(cudaError, char const*, int)+0x603) [0x7f0fa24ee453]
[bt] (2) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(+0xdf5900) [0x7f0fa24ee900]
[bt] (3) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::common::cuda_impl::CreateHostMemPool()+0x9) [0x7f0fa24ee969]
[bt] (4) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::data::EllpackMemCache::EllpackMemCache(xgboost::data::EllpackCacheInfo, int)+0x306) [0x7f0fa260dd66]
[bt] (5) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::data::EllpackCacheStreamPolicy<xgboost::EllpackPage, xgboost::data::EllpackFormatPolicy>::CreateWriter(xgboost::StringView, unsigned int)+0x3cb) [0x7f0fa26161db]
[bt] (6) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::data::SparsePageSourceImpl<xgboost::EllpackPage, xgboost::data::EllpackCacheStreamPolicy<xgboost::EllpackPage, xgboost::data::EllpackFormatPolicy> >::WriteCache()+0x84) [0x7f0fa261af54]
[bt] (7) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::data::ExtEllpackPageSourceImpl<xgboost::data::EllpackCacheStreamPolicy<xgboost::EllpackPage, xgboost::data::EllpackFormatPolicy> >::ExtEllpackPageSourceImpl(xgboost::Context const*, xgboost::MetaInfo*, xgboost::data::ExternalDataInfo, std::shared_ptr<xgboost::data::Cache>, std::shared_ptr<xgboost::common::HistogramCuts>, std::shared_ptr<xgboost::data::DataIterProxy<void (void*), int (void*)> >, xgboost::data::DMatrixProxy*, xgboost::data::EllpackCacheInfo const&)+0x743) [0x7f0fa262ab83]
[bt] (8) /cache/xgboost/python-package/xgboost/../../lib/libxgboost.so(xgboost::data::ExtMemQuantileDMatrix::InitFromCUDA(xgboost::Context const*, std::shared_ptr<xgboost::data::DataIterProxy<void (void*), int (void*)> >, void*, xgboost::BatchParam const&, std::shared_ptr<xgboost::DMatrix>, long, xgboost::ExtMemConfig const&)+0xa47) [0x7f0fa26266e7]
I built and installed XGBoost from source following the documentation:
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
cmake -B build -S . -DUSE_CUDA=ON -DUSE_NCCL=ON -DPLUGIN_RMM=ON -DCMAKE_PREFIX_PATH=$CONDA_PREFIX -DBUILD_WITH_SHARED_NCCL=ON
cd build && make -j$(nproc)
cd ../python-package && pip install -e .
The environment is an RTX 4090 with CUDA 12.9 and 2 TB of memory. Thanks!