Hi,
I was trying to launch federate/cross_silo/cuda_rpc_fedavg_mnist_lr_example, mapping all processes (1 server and 2 clients) to a single GPU.
It ended with this error:
  File "/home/myhome/.local/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 235, in _validate_device_maps
    raise ValueError(
ValueError: Node worker0 has target devices with invalid indices in its device map for worker2
device map = {device(type='cuda', index=0): device(type='cuda', index=2)}
device count = 1
I suspect there is a bug in python/fedml/core/distributed/communication/trpc/utils.py:
# Generate Device Map for Cuda RPC
def set_device_map(options, worker_idx, device_list):
    local_device = device_list[worker_idx]
    for index, remote_device in enumerate(device_list):
        logging.warn(f"Setting device map for client {index} as {remote_device}")
        if index != worker_idx:
            options.set_device_map(WORKER_NAME.format(index), {local_device: remote_device})
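Here device_list is a dict {0:0, 1:0, 2:0}, but enumerate iterates over its keys, so the key (0, 1, 2) rather than the mapped device is assigned to remote_device. That is why worker2 ends up with target device index 2 in the error above, even though only one GPU is available.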
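To illustrate the behaviour described above, here is a minimal sketch in plain Python (no torch or FedML needed). The dict literal is the one quoted in this issue; the variable names `pairs`, `remote_devices_buggy` and `remote_devices_looked_up` are only for illustration.

```python
# Mapping of worker index -> GPU index, as quoted in this issue.
device_list = {0: 0, 1: 0, 2: 0}

# enumerate() over a dict walks its keys, not its values.
pairs = list(enumerate(device_list))
print(pairs)  # [(0, 0), (1, 1), (2, 2)]

# The second element of each pair is the dict key, so it counts up 0, 1, 2.
remote_devices_buggy = [key for _, key in enumerate(device_list)]

# Looking the key up in the dict gives the GPU index that was configured.
remote_devices_looked_up = [device_list[key] for _, key in enumerate(device_list)]

print(remote_devices_buggy)      # [0, 1, 2]
print(remote_devices_looked_up)  # [0, 0, 0]
```

With a single GPU, only index 0 is valid, so the values 1 and 2 in the first list are the ones that fail validation.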
I tried correcting this as follows:
    for index, remote_device in enumerate(device_list):
        logging.warn(f"Setting device map for client {index} as {device_list[remote_device]}")
        if index != worker_idx:
            options.set_device_map(WORKER_NAME.format(index), {local_device: device_list[remote_device]})
and the example worked ok.
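For reference, a possible alternative would be to iterate over the dict's items directly. The sketch below is not the FedML code: `RecordingOptions` is a hypothetical stand-in that only records the maps it receives (so the snippet runs without torch), and `WORKER_NAME = "worker{}"` is an assumption based on the worker names in the error message. It also assumes the dict keys are the worker indices.

```python
import logging

# Assumed naming pattern, inferred from "worker0" / "worker2" in the error message.
WORKER_NAME = "worker{}"


class RecordingOptions:
    """Hypothetical stand-in for the RPC backend options; it only records calls."""

    def __init__(self):
        self.device_maps = {}

    def set_device_map(self, to, device_map):
        self.device_maps[to] = dict(device_map)


def set_device_map(options, worker_idx, device_list):
    # device_list is assumed to map worker index -> GPU index.
    local_device = device_list[worker_idx]
    for index, remote_device in device_list.items():
        if index != worker_idx:
            logging.warning(f"Setting device map for client {index} as {remote_device}")
            options.set_device_map(WORKER_NAME.format(index), {local_device: remote_device})


options = RecordingOptions()
set_device_map(options, 0, {0: 0, 1: 0, 2: 0})
print(options.device_maps)  # {'worker1': {0: 0}, 'worker2': {0: 0}}
```

Every remote worker is now mapped to GPU 0, which matches the single-GPU setup described above. Whether device_list is always a dict in the real code path is something I have not checked.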