Runtime error

Hello. I'm having difficulty running the provided code. First, a question: is it possible to run your code without InfiniBand? I'm running it as follows:
```
nohup sh run_tc_pipetransformer.sh 8 2 0 65.108.32.147 11111 0 "lo" 1e-5 8 0 freeze 2 > ./PipeTransformer-TC.log 2>&1 &
nohup sh run_tc_pipetransformer.sh 8 2 1 65.108.32.147 11111 0 "lo" 1e-5 8 0 freeze 2 > ./PipeTransformer-TC.log 2>&1 &
```
I get the following error:
```
Traceback (most recent call last):
  File "/root/PipeTransformer/examples/text_classification/main_tc.py", line 261, in <module>
    pipe_transformer = PipeTransformer(config, tc_data_manager, model_config, model)
  File "/root/PipeTransformer/pipe_transformer/pipe_transformer.py", line 15, in __init__
    self.auto_dp = AutoDataParallel(config)
  File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 46, in __init__
    self.init_rpc()
  File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 117, in init_rpc
    rpc.init_rpc(
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 360, in _tensorpipe_init_backend_handler
    api._all_gather(None, timeout=rpc_backend_options.rpc_timeout)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 224, in _all_gather
    rpc_sync(
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 809, in rpc_sync
    return fut.wait()
RuntimeError: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
```

Do you have any idea what could be causing this?
I thought it might be because I haven't enabled InfiniBand, but when I change `0 "lo"` to `1 "ib0"` in both commands, I get a different error:
```
Traceback (most recent call last):
  File "/root/PipeTransformer/examples/text_classification/main_tc.py", line 261, in <module>
    pipe_transformer = PipeTransformer(config, tc_data_manager, model_config, model)
  File "/root/PipeTransformer/pipe_transformer/pipe_transformer.py", line 15, in __init__
    self.auto_dp = AutoDataParallel(config)
  File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 45, in __init__
    self.init_ddp()
  File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 86, in init_ddp
    dist.init_process_group(init_method='tcp://' + str(self.config.master_addr) + ':' + str(self.config.master_port),
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
    pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1670525541990/work/third_party/gloo/gloo/transport/tcp/device.cc:80] ifa != nullptr. Unable to find address for: ib0
```
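The second error says Gloo was unable to find an address for `ib0`, which suggests this machine may simply not have an interface with that name. A minimal sketch for checking which interface names are visible, using only the Python standard library (the helper names `list_interfaces` and `has_interface` are hypothetical and not part of PipeTransformer):

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces visible to this process."""
    # socket.if_nameindex() yields (index, name) pairs, e.g. (1, "lo").
    return [name for _index, name in socket.if_nameindex()]


def has_interface(name):
    """Return True if an interface with the given name exists on this host."""
    return name in list_interfaces()


if __name__ == "__main__":
    print("interfaces:", list_interfaces())
    # "ib0" is the name passed to the launch script in the second attempt.
    print("ib0 present:", has_interface("ib0"))
```

If `ib0` is not in the list, that would be consistent with the `Unable to find address for: ib0` message, though it would not by itself explain the first `eof` error.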
Any help would be appreciated.
