
Runtime error #6

@RealAntonVoronov


Hello. I'm having difficulty running the provided code. First, a question: is it possible to run your code without InfiniBand? I'm launching it as follows:

nohup sh run_tc_pipetransformer.sh 8 2 0 65.108.32.147 11111 0 "lo" 1e-5 8 0 freeze 2 > ./PipeTransformer-TC.log 2>&1 &
nohup sh run_tc_pipetransformer.sh 8 2 1 65.108.32.147 11111 0 "lo" 1e-5 8 0 freeze 2 > ./PipeTransformer-TC.log 2>&1 &

I get the following error:

Traceback (most recent call last):
  File "/root/PipeTransformer/examples/text_classification/main_tc.py", line 261, in <module>
    pipe_transformer = PipeTransformer(config, tc_data_manager, model_config, model)
  File "/root/PipeTransformer/pipe_transformer/pipe_transformer.py", line 15, in __init__
    self.auto_dp = AutoDataParallel(config)
  File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 46, in __init__
    self.init_rpc()
  File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 117, in init_rpc
    rpc.init_rpc(
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 360, in _tensorpipe_init_backend_handler
    api._all_gather(None, timeout=rpc_backend_options.rpc_timeout)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 224, in _all_gather
    rpc_sync(
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 809, in rpc_sync
    return fut.wait()
RuntimeError: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)

Do you have any idea what could be causing this?
I thought it might be because I haven't enabled InfiniBand, but when I change 0 "lo" to 1 "ib0" in both commands, I get a different error:

Traceback (most recent call last):
  File "/root/PipeTransformer/examples/text_classification/main_tc.py", line 261, in <module>
    pipe_transformer = PipeTransformer(config, tc_data_manager, model_config, model)
  File "/root/PipeTransformer/pipe_transformer/pipe_transformer.py", line 15, in __init__
    self.auto_dp = AutoDataParallel(config)
  File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 45, in __init__
    self.init_ddp()
  File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 86, in init_ddp
    dist.init_process_group(init_method='tcp://' + str(self.config.master_addr) + ':' + str(self.config.master_port),
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
    pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1670525541990/work/third_party/gloo/gloo/transport/tcp/device.cc:80] ifa != nullptr. Unable to find address for: ib0
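
The second error says Gloo could not find an address for an interface named ib0, which suggests that no such interface is visible on this machine. Below is a minimal sketch for checking which interface names exist on a node before passing one to the launch script. It uses only the Python standard library. The helper names `list_interfaces` and `has_interface` are made up for illustration and are not part of PipeTransformer or PyTorch. Note that an interface being listed does not guarantee it has an address assigned, so this is only a first sanity check.

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces visible to this process."""
    # socket.if_nameindex() yields (index, name) pairs, e.g. (1, "lo").
    return [name for _index, name in socket.if_nameindex()]


def has_interface(name):
    """Return True if an interface with this exact name exists on the host."""
    return name in list_interfaces()


if __name__ == "__main__":
    print("interfaces:", list_interfaces())
    # "lo" and "ib0" are the two names tried in the commands above.
    for candidate in ("lo", "ib0"):
        status = "present" if has_interface(candidate) else "missing"
        print(candidate, status)
```

If ib0 is reported as missing, the name passed to the script would presumably need to be one of the listed interfaces instead.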

Any help would be appreciated.
