Operating System
Linux
Version Information
$ az version --output=yaml
azure-cli: 2.71.0
azure-cli-core: 2.71.0
azure-cli-telemetry: 1.1.0
extensions:
ml: 2.27.0
Steps to reproduce
- Create two compute clusters in the AML workspace: cluster-a100 with at least 2 Standard_NC48ads_A100_v4 instances, as well as cluster-h100 with at least 2 Standard_NC80adis_H100_v5 instances.
- Download the CIFAR-10 dataset and upload it as an Azure data asset called cifar10.
- Create folders with mkdir -p tmp/src
- Download the script train.py from https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/pytorch/distributed-training/src/train.py and put it under the tmp/src/ folder.
- Create a YAML file called job_a100.yml with the job definition below, which follows the official command job schema. (A command sketch of the setup steps above is given after this job definition.)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
python train.py
--epochs ${{inputs.epochs}}
--learning-rate ${{inputs.learning_rate}}
--data-dir ${{inputs.cifar}}
inputs:
epochs: 1
learning_rate: 0.2
cifar:
type: uri_folder
path: azureml:cifar10@latest
environment: azureml:AzureML-acpt-pytorch-2.2-cuda12.1@latest
compute: azureml:cluster-a100
distribution:
type: pytorch
process_count_per_instance: 2
resources:
instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.
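
For reference, the setup steps above can be scripted roughly as follows. This is a minimal sketch, assuming az login and the workspace defaults from the last step have already been applied; the min/max instance counts and the local data path ./data/cifar10 are illustrative assumptions.

# Create the two compute clusters (instance bounds are assumptions)
az ml compute create --name cluster-a100 --type AmlCompute --size Standard_NC48ads_A100_v4 --min-instances 0 --max-instances 2
az ml compute create --name cluster-h100 --type AmlCompute --size Standard_NC80adis_H100_v5 --min-instances 0 --max-instances 2

# Register the downloaded CIFAR-10 folder as a data asset (local path is an assumption)
az ml data create --name cifar10 --version 1 --type uri_folder --path ./data/cifar10

# Fetch the training script into tmp/src/ (raw URL derived from the blob URL above)
mkdir -p tmp/src
curl -fsSL -o tmp/src/train.py https://raw.githubusercontent.com/Azure/azureml-examples/main/sdk/python/jobs/single-step/pytorch/distributed-training/src/train.py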
- Create the YAML file called job_h100.yml as above, but replace the compute with azureml:cluster-h100
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
python train.py
--epochs ${{inputs.epochs}}
--learning-rate ${{inputs.learning_rate}}
--data-dir ${{inputs.cifar}}
inputs:
epochs: 1
learning_rate: 0.2
cifar:
type: uri_folder
path: azureml:cifar10@latest
environment: azureml:AzureML-acpt-pytorch-2.2-cuda12.1@latest
compute: azureml:cluster-h100
distribution:
type: pytorch
process_count_per_instance: 2
resources:
instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.
- Log in to the Azure account and set the default workspace via az login followed by az configure --defaults workspace=xxxx group=xxxx
- Submit the AML jobs with az ml job create -f job_a100.yml and az ml job create -f job_h100.yml (see the submission sketch below).
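
To capture the logs referenced below, each submission can also be streamed; a minimal sketch, where the job_name variable is illustrative:

# Submit the job and stream its driver log until completion
job_name=$(az ml job create -f job_h100.yml --query name -o tsv)
az ml job stream --name "$job_name"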
Expected behavior
Both jobs should succeed, with the AML job log showing four processes in total (2 nodes with 2 processes each) and the job's monitoring panel showing GPU usage for each of those processes.
Actual behavior
The job on the A100 cluster succeeds without obvious errors.
The job on the H100 cluster fails completely with NCCL errors.
Additional information
The error message from the main process on the H100 cluster is:
854283d6899342f0a2ab3629685461f300000W:92:413 [0] socket.c:540 NCCL WARN Net : Connection closed by remote peer 854283d6899342f0a2ab3629685461f300000x.internal.cloudapp.net<51100>
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO ib_plugin.c:449 -> 6
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO transport/net.cc:869 -> 6
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO misc/socket.cc:47 -> 3
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO misc/socket.cc:58 -> 3
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO misc/socket.cc:773 -> 3
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO proxy.cc:1356 -> 3
854283d6899342f0a2ab3629685461f300000W:92:413 [0] proxy.cc:1505 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection
854283d6899342f0a2ab3629685461f300000W:92:413 [0] proxy.cc:1539 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3
Traceback (most recent call last):
File "/mnt/azureml/cr/j/c6cd24b215ed41daaf6ea5d84fc406ea/exe/wd/train.py", line 194, in <module>
main(args)
File "/mnt/azureml/cr/j/c6cd24b215ed41daaf6ea5d84fc406ea/exe/wd/train.py", line 109, in main
model = nn.parallel.DistributedDataParallel(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 798, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/utils.py", line 263, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, remote process exited or there was a network error, NCCL version 2.19.4
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Connection closed by remote peer 854283d6899342f0a2ab3629685461f300000x.internal.cloudapp.net<47776>
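
The failing calls above go through NCCL's InfiniBand plugin (ib_plugin.c), so one way to narrow the problem down is to rerun the H100 job with verbose NCCL logging and, as an experiment, with the IB transport disabled to see whether plain TCP works. This is a diagnostic sketch, not a fix from the report: NCCL_DEBUG, NCCL_DEBUG_SUBSYS, and NCCL_IB_DISABLE are standard NCCL environment variables, and environment_variables is a field of the AML command job schema. Added to job_h100.yml:

environment_variables:
  NCCL_DEBUG: INFO        # verbose NCCL logging
  NCCL_DEBUG_SUBSYS: NET  # focus on network/transport setup
  NCCL_IB_DISABLE: "1"    # experiment only: force TCP to test whether the IB path is at fault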