Dask XGBoost fails on HTCondor cluster #11757

@runtingt

I am attempting to run distributed XGBoost with Dask on an HTCondor cluster. I've tested that Dask on its own works with this setup, but when attempting training with XGBoost I get the following trace:

Traceback (most recent call last):
  File "/home/hep/tr1123/dask_xgb/test_futures.py", line 63, in main
    model.fit(X=X, y=y, verbose=True)
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/dask/__init__.py", line 1825, in fit
    return self._client_sync(self._fit_async, **args)
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/dask/__init__.py", line 1620, in _client_sync
    return self.client.sync(func, **kwargs, asynchronous=self.client.asynchronous)
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/distributed/utils.py", line 363, in sync
    return sync(
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/distributed/utils.py", line 439, in sync
    raise error
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/distributed/utils.py", line 413, in f
    result = yield future
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/dask/__init__.py", line 1785, in _fit_async
    results = await self.client.sync(
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/dask/__init__.py", line 856, in _train_async
    result = await map_worker_partitions(
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/dask/__init__.py", line 590, in map_worker_partitions
    result = await client.compute(fut).result()
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/distributed/client.py", line 409, in _result
    raise exc.with_traceback(tb)
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/dask/__init__.py", line 554, in fn
    return [func(*args, **kwargs)]
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/dask/__init__.py", line 811, in do_train
    with CommunicatorContext(**coll_args), config.config_context(**global_config):
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/collective.py", line 326, in __enter__
    init(**self.args)
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/collective.py", line 99, in init
    _check_call(_LIB.XGCommunicatorInit(make_jcargs(**args)))
  File "/home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/core.py", line 310, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [15:58:51] /workspace/src/collective/result.cc:78: 
- [comm.cc:220|15:58:51]: Failed to bootstrap the communication group.
- [comm.cc:332|15:58:51]: Failed to connect to other workers.
- [comm.cc:72|15:58:51]: Bootstrap failed to connect to ring next.
- [socket.cc:189|15:58:51]: Failed to connect to 146.179.108.84:57459
- [socket.h:79|15:58:51]: Poll error condition:Operation now in progress code:115 Operation now in progress
- [socket.h:348|15:58:51]: Socket error. No route to host
Stack trace:
  [bt] (0) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x14bb8a4a7e7c]
  [bt] (1) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x364e21) [0x14bb8a565e21]
  [bt] (2) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x33264b) [0x14bb8a53364b]
  [bt] (3) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x340687) [0x14bb8a541687]
  [bt] (4) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x34217c) [0x14bb8a54317c]
  [bt] (5) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(XGCommunicatorInit+0x64) [0x14bb8a4f3624]
  [bt] (6) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/lib-dynload/../../libffi.so.8(+0x6a4a) [0x14bbbc5d4a4a]
  [bt] (7) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/lib-dynload/../../libffi.so.8(+0x5fea) [0x14bbbc5d3fea]
  [bt] (8) /home/hep/tr1123/micromamba/envs/higgs-dna/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x12461) [0x14bbbc5ec461]

This comes from a Python script along the following lines:

import socket

from dask.distributed import Client
from dask_iclx import ICCluster
from dask_ml.datasets import make_classification
from xgboost import dask as dxgb
from xgboost.collective import Config

# Small synthetic classification dataset
X, y = make_classification(n_samples=10_000, n_features=5)

# Start the cluster and connect a client to it
cluster = ICCluster()
client = Client(cluster)

params = {
    "objective": "binary:logistic",
    "max_depth": 4,
    "eta": 0.1,
    "tree_method": "hist",
    "n_estimators": 50,
}

# Pin the tracker to this host and to a port inside the allowed range
coll_cfg = Config(tracker_host_ip=socket.gethostbyname(socket.gethostname()), tracker_port=60000)
model = dxgb.DaskXGBClassifier(**params, coll_cfg=coll_cfg)
model.fit(X, y)

where ICCluster is a thin wrapper around HTCondorCluster.

The workers within this pool are allowed to communicate with each other only over ports in the range 60000 to 60099, so given that XGBoost is attempting to use port 57459, the error is not unexpected. I've manually verified with nc that two workers in the pool can indeed establish a connection over these "good" ports, so to me this is just a case of picking the right port.
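For anyone wanting to repeat the nc check from Python, here is a minimal sketch of the same idea. The helper names `can_connect` and `reachable_ports` are hypothetical (they are not part of XGBoost or Dask), and the check only succeeds if something is already listening on the target port of the other worker, e.g. `nc -l 60000`.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and "No route to host".
        return False


def reachable_ports(host, ports, timeout=2.0):
    """Return the subset of `ports` on `host` that accept a TCP connection."""
    return [p for p in ports if can_connect(host, p, timeout)]
```

Run from one worker against another worker's address, `reachable_ports(other_worker_ip, range(60000, 60100))` would list the ports that are both open in the firewall and have a listener behind them; `other_worker_ip` is a placeholder for whatever address the second worker has.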

For the life of me, I cannot see a way to tell rabit to use this port range, either in the documentation or from a glance over the source code. Have I missed something? I found xgboost.collective.Config, which allows one to configure worker-tracker communication (which originally had the same problem), but no mention of an option that tells the communication group to use a particular port range. Apologies if this isn't the right place or I'm missing something very obvious!
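On the tracker side, which Config does cover, hard-coding a port only works if that port happens to be free. Below is a minimal sketch of picking a free port from the allowed range instead. `pick_free_port` is a hypothetical helper, not an XGBoost function, and it does nothing for the worker-to-worker ports that this issue is about.

```python
import socket


def pick_free_port(low=60000, high=60099, host=""):
    """Return the first port in [low, high] that can currently be bound on `host`."""
    for port in range(low, high + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
            except OSError:
                # Port is in use (or otherwise not bindable); try the next one.
                continue
            # The socket is closed on leaving the `with` block, freeing the port.
            return port
    raise RuntimeError(f"no free port in range {low}-{high}")
```

The result could then be passed as `tracker_port=pick_free_port()` when building the Config. The port is released before the tracker binds it, so another process could in principle grab it in between; for a sketch like this that race is accepted.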
