Open
Description
Hello. I'm having difficulty running the provided code. First, a question: is it possible to run your code without InfiniBand? I'm running it as follows:
nohup sh run_tc_pipetransformer.sh 8 2 0 65.108.32.147 11111 0 "lo" 1e-5 8 0 freeze 2 > ./PipeTransformer-TC.log 2>&1 &
nohup sh run_tc_pipetransformer.sh 8 2 1 65.108.32.147 11111 0 "lo" 1e-5 8 0 freeze 2 > ./PipeTransformer-TC.log 2>&1 &
And get the following error:
Traceback (most recent call last):
File "/root/PipeTransformer/examples/text_classification/main_tc.py", line 261, in <module>
pipe_transformer = PipeTransformer(config, tc_data_manager, model_config, model)
File "/root/PipeTransformer/pipe_transformer/pipe_transformer.py", line 15, in __init__
self.auto_dp = AutoDataParallel(config)
File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 46, in __init__
self.init_rpc()
File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 117, in init_rpc
rpc.init_rpc(
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
_init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
rpc_agent = backend_registry.init_backend(
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
return backend.value.init_backend_handler(*args, **kwargs)
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 360, in _tensorpipe_init_backend_handler
api._all_gather(None, timeout=rpc_backend_options.rpc_timeout)
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 224, in _all_gather
rpc_sync(
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 809, in rpc_sync
return fut.wait()
RuntimeError: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
Do you have any idea what could be causing this?
I thought it might be because I haven't enabled InfiniBand, but when I change 0 "lo" to 1 "ib0" in both commands, I get a different error:
Traceback (most recent call last):
File "/root/PipeTransformer/examples/text_classification/main_tc.py", line 261, in <module>
pipe_transformer = PipeTransformer(config, tc_data_manager, model_config, model)
File "/root/PipeTransformer/pipe_transformer/pipe_transformer.py", line 15, in __init__
self.auto_dp = AutoDataParallel(config)
File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 45, in __init__
self.init_ddp()
File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 86, in init_ddp
dist.init_process_group(init_method='tcp://' + str(self.config.master_addr) + ':' + str(self.config.master_port),
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 761, in init_process_group
default_pg = _new_process_group_helper(
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1670525541990/work/third_party/gloo/gloo/transport/tcp/device.cc:80] ifa != nullptr. Unable to find address for: ib0
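The second error says Gloo could not find an address for ib0, which may mean the node has no interface with that name. A minimal sketch for checking which interface names are present on each node, using only the Python standard library on Linux (list_interfaces and interface_exists are hypothetical helper names, not part of PipeTransformer or PyTorch):

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces visible to this host."""
    # socket.if_nameindex() returns a list of (index, name) pairs.
    return [name for _index, name in socket.if_nameindex()]


def interface_exists(name):
    """Check whether an interface name such as "ib0" or "lo" is present."""
    return name in list_interfaces()


if __name__ == "__main__":
    print("Interfaces on this node:", list_interfaces())
    # The two names passed to run_tc_pipetransformer.sh in this report.
    for candidate in ("lo", "ib0"):
        print(candidate, "present:", interface_exists(candidate))
```

If ib0 is missing from the list, the name passed to the script would presumably need to be one that is listed on both machines.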
Any help would be appreciated.