Description
I am attempting to run distributed XGBoost with Dask on an HTCondor cluster. I've tested that Dask on its own works with this setup, but when attempting training with XGBoost I get the following trace:
Traceback (most recent call last):
File "/home/hep/tr1123/dask_xgb/test_futures.py", line 63, in main
model.fit(X=X, y=y, verbose=True)
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/dask/__init__.py", line 1825, in fit
return self._client_sync(self._fit_async, **args)
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/dask/__init__.py", line 1620, in _client_sync
return self.client.sync(func, **kwargs, asynchronous=self.client.asynchronous)
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/distributed/utils.py", line 363, in sync
return sync(
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/distributed/utils.py", line 439, in sync
raise error
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/distributed/utils.py", line 413, in f
result = yield future
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
value = future.result()
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/dask/__init__.py", line 1785, in _fit_async
results = await self.client.sync(
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/dask/__init__.py", line 856, in _train_async
result = await map_worker_partitions(
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/dask/__init__.py", line 590, in map_worker_partitions
result = await client.compute(fut).result()
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/distributed/client.py", line 409, in _result
raise exc.with_traceback(tb)
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/dask/__init__.py", line 554, in fn
return [func(*args, **kwargs)]
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/dask/__init__.py", line 811, in do_train
with CommunicatorContext(**coll_args), config.config_context(**global_config):
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/collective.py", line 326, in __enter__
init(**self.args)
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/collective.py", line 99, in init
_check_call(_LIB.XGCommunicatorInit(make_jcargs(**args)))
File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/core.py", line 310, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [15:58:51] /workspace/src/collective/result.cc:78:
- [comm.cc:220|15:58:51]: Failed to bootstrap the communication group.
- [comm.cc:332|15:58:51]: Failed to connect to other workers.
- [comm.cc:72|15:58:51]: Bootstrap failed to connect to ring next.
- [socket.cc:189|15:58:51]: Failed to connect to 146.179.108.84:57459
- [socket.h:79|15:58:51]: Poll error condition:Operation now in progress code:115 Operation now in progress
- [socket.h:348|15:58:51]: Socket error. No route to host
Stack trace:
[bt] (0) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x14bb8a4a7e7c]
[bt] (1) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x364e21) [0x14bb8a565e21]
[bt] (2) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x33264b) [0x14bb8a53364b]
[bt] (3) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x340687) [0x14bb8a541687]
[bt] (4) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x34217c) [0x14bb8a54317c]
[bt] (5) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(XGCommunicatorInit+0x64) [0x14bb8a4f3624]
[bt] (6) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/lib-dynload/../../libffi.so.8(+0x6a4a) [0x14bbbc5d4a4a]
[bt] (7) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/lib-dynload/../../libffi.so.8(+0x5fea) [0x14bbbc5d3fea]
[bt] (8) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x12461) [0x14bbbc5ec461]
This comes from a Python script along the lines of:
import socket

from dask_iclx import ICCluster
from dask.distributed import Client
from dask_ml.datasets import make_classification
from xgboost import dask as dxgb
from xgboost.collective import Config

X, y = make_classification(n_samples=10_000, n_features=5)

cluster = ICCluster()
client = Client(cluster)

params = {
    "objective": "binary:logistic",
    "max_depth": 4,
    "eta": 0.1,
    "tree_method": "hist",
    "n_estimators": 50,
}

# Pin the tracker to this host and to a port inside the allowed range.
coll_cfg = Config(tracker_host_ip=socket.gethostbyname(socket.gethostname()), tracker_port=60000)
model = dxgb.DaskXGBClassifier(**params, coll_cfg=coll_cfg)
model.fit(X, y)
Here ICCluster is a thin wrapper around HTCondorCluster.
The workers within this pool are allowed to communicate with each other only over ports in the range 60000 to 60099, so since XGBoost is attempting to use port 57459, the error is not unexpected. I've manually verified with nc that two workers in the pool can establish a connection over these "good" ports, so as far as I can tell this is just a matter of picking the right port.
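For reference, the nc check can be approximated in Python. This is only a minimal sketch using the standard socket module: the helper names can_connect and first_bindable_port are hypothetical (they are not part of XGBoost or Dask), and the 60000 to 60099 range is the one allowed on this particular pool.

```python
import socket
from typing import Optional


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and "No route to host".
        return False


def first_bindable_port(start: int, end: int, host: str = "") -> Optional[int]:
    """Return the first port in [start, end] that can be bound locally, else None."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # Port is in use or otherwise unavailable; try the next one.
                continue
            return port
    return None
```

Running first_bindable_port(60000, 60099) on one worker, listening on the returned port, and calling can_connect(other_worker_ip, port) from a second worker reproduces the manual test. It only confirms that the firewall allows the range; it does not change which port XGBoost picks.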
For the life of me, I cannot see a way to tell rabit to use this port range, either in the documentation or from a glance over the source code. Have I missed something? I found xgboost.collective.Config, which allows one to configure worker-tracker communication (which originally had the same problem), but I see no mention of a setting that tells the communication group to use a particular port range. Apologies if this isn't the right place or if I'm missing something very obvious!