
Distributed GPU training works on A100 but not H100 #3561

@ZhiliangWu

Description

Operating System

Linux

Version Information

$ az version --output=yaml
azure-cli: 2.71.0
azure-cli-core: 2.71.0
azure-cli-telemetry: 1.1.0
extensions:
  ml: 2.27.0

Steps to reproduce

  1. Create two compute clusters in the AML workspace: cluster-a100 with at least 2 Standard_NC48ads_A100_v4 nodes, and cluster-h100 with at least 2 Standard_NC80adis_H100_v5 nodes.
  2. Download the CIFAR-10 dataset and upload it as an Azure data asset called cifar10.
  3. Create the working folders with mkdir -p tmp/src.
  4. Download the script train.py from https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/pytorch/distributed-training/src/train.py and put it in the tmp/src/ folder.
  5. Create a YAML file called job_a100.yml, based on the official example job definition and the official command job schema:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python train.py
  --epochs ${{inputs.epochs}}
  --learning-rate ${{inputs.learning_rate}}
  --data-dir ${{inputs.cifar}}
inputs:
  epochs: 1
  learning_rate: 0.2
  cifar:
     type: uri_folder
     path: azureml:cifar10@latest
environment: azureml:AzureML-acpt-pytorch-2.2-cuda12.1@latest
compute: azureml:cluster-a100
distribution:
  type: pytorch
  process_count_per_instance: 2
resources:
  instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.
  6. Create a YAML file called job_h100.yml as above, but replace the compute target with azureml:cluster-h100:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python train.py
  --epochs ${{inputs.epochs}}
  --learning-rate ${{inputs.learning_rate}}
  --data-dir ${{inputs.cifar}}
inputs:
  epochs: 1
  learning_rate: 0.2
  cifar:
     type: uri_folder
     path: azureml:cifar10@latest
environment: azureml:AzureML-acpt-pytorch-2.2-cuda12.1@latest
compute: azureml:cluster-h100
distribution:
  type: pytorch
  process_count_per_instance: 2
resources:
  instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.
  7. Log in to the Azure account and set the default workspace via az login followed by az configure --defaults workspace=xxxx group=xxxx.
  8. Submit the AML jobs (a rough Python SDK equivalent is sketched after these commands) with
az ml job create -f job_a100.yml

and

az ml job create -f job_h100.yml
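
For reference, a roughly equivalent submission through the Python SDK (azure-ai-ml) is sketched below. This is only a sketch under the same assumptions as the CLI steps (same workspace, cifar10 data asset, and compute clusters); placeholders such as <subscription-id> must be filled in, and the call shapes should be checked against the installed SDK version.

# submit_job.py -- sketch of the same job submission via the azure-ai-ml SDK
from azure.ai.ml import MLClient, command, Input
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="src",
    command=(
        "python train.py "
        "--epochs ${{inputs.epochs}} "
        "--learning-rate ${{inputs.learning_rate}} "
        "--data-dir ${{inputs.cifar}}"
    ),
    inputs={
        "epochs": 1,
        "learning_rate": 0.2,
        "cifar": Input(type="uri_folder", path="azureml:cifar10@latest"),
    },
    environment="azureml:AzureML-acpt-pytorch-2.2-cuda12.1@latest",
    compute="cluster-h100",  # swap to "cluster-a100" for the working case
    instance_count=2,
    distribution={"type": "pytorch", "process_count_per_instance": 2},
    display_name="pytorch-cifar-distributed-example",
    experiment_name="pytorch-cifar-distributed-example",
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)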

Expected behavior

Both jobs should succeed, with the AML job logs showing the specified number of processes and GPU usage for each of them, i.e., 4 processes in total for 2 nodes with 2 processes per node. From the monitoring panel of the job, we can see, e.g.:

(Screenshots of GPU utilization from the job monitoring panel.)
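
The 2 x 2 layout can also be confirmed from inside the job. With distribution type pytorch, the AML launcher is expected to populate the standard torch.distributed environment variables, so a few illustrative lines (not part of the original train.py) near the top of the script would log one line per process:

import os

# Illustrative check of the expected 2 nodes x 2 processes layout; RANK,
# LOCAL_RANK and WORLD_SIZE are assumed to be set by the launcher.
rank = int(os.environ.get("RANK", -1))
local_rank = int(os.environ.get("LOCAL_RANK", -1))
world_size = int(os.environ.get("WORLD_SIZE", -1))
print(f"global rank {rank} of {world_size}, local rank {local_rank}")
# Expected across the whole job: world_size == 4, local ranks 0-1 on each node.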

Actual behavior

The job on the A100 cluster succeeds without any obvious errors.
The job on the H100 cluster fails completely due to NCCL errors.

Additional information

The error message from the main process on the H100 cluster is:

854283d6899342f0a2ab3629685461f300000W:92:413 [0] socket.c:540 NCCL WARN Net : Connection closed by remote peer 854283d6899342f0a2ab3629685461f300000x.internal.cloudapp.net<51100>
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO ib_plugin.c:449 -> 6
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO transport/net.cc:869 -> 6
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO misc/socket.cc:47 -> 3
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO misc/socket.cc:58 -> 3
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO misc/socket.cc:773 -> 3
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO proxy.cc:1356 -> 3

854283d6899342f0a2ab3629685461f300000W:92:413 [0] proxy.cc:1505 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection

854283d6899342f0a2ab3629685461f300000W:92:413 [0] proxy.cc:1539 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3
Traceback (most recent call last):
  File "/mnt/azureml/cr/j/c6cd24b215ed41daaf6ea5d84fc406ea/exe/wd/train.py", line 194, in <module>
    main(args)
  File "/mnt/azureml/cr/j/c6cd24b215ed41daaf6ea5d84fc406ea/exe/wd/train.py", line 109, in main
    model = nn.parallel.DistributedDataParallel(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/utils.py", line 263, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, remote process exited or there was a network error, NCCL version 2.19.4
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Connection closed by remote peer 854283d6899342f0a2ab3629685461f300000x.internal.cloudapp.net<47776>
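
The traceback shows the failure inside _verify_param_shape_across_processes, i.e., in the very first NCCL collective before any training starts. A minimal script that only initializes the process group and runs a single all_reduce can help separate the H100 cluster's NCCL/networking problem from train.py itself. The sketch below is a hypothetical debugging aid, not part of the original repro; it would be submitted with the same job definition (optionally with NCCL_DEBUG set to INFO via the job's environment_variables) in place of train.py.

# nccl_check.py -- minimal NCCL sanity check (hypothetical helper, not in the repro)
import os

import torch
import torch.distributed as dist


def main():
    # LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are assumed to
    # be provided by the AML pytorch distribution launcher.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # A single small all_reduce exercises the same inter-node NCCL path that
    # fails in _verify_param_shape_across_processes above.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()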
