
Distributed GPU training works on A100 but not H100 #3561

@ZhiliangWu

Description

Operating System

Linux

Version Information

$ az version --output=yaml
azure-cli: 2.71.0
azure-cli-core: 2.71.0
azure-cli-telemetry: 1.1.0
extensions:
  ml: 2.27.0

Steps to reproduce

  1. Create two compute clusters in the AML workspace: cluster-a100 with at least 2 Standard_NC48ads_A100_v4 nodes, and cluster-h100 with at least 2 Standard_NC80adis_H100_v5 nodes.
  2. Download the CIFAR-10 dataset and upload it as an Azure data asset called cifar10.
  3. Create the working folders with mkdir -p tmp/src.
  4. Download the script train.py from https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/pytorch/distributed-training/src/train.py and put it in the tmp/src/ folder.
  5. Create a YAML file called job_a100.yml, based on the official example job definition and the official command job schema:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python train.py
  --epochs ${{inputs.epochs}}
  --learning-rate ${{inputs.learning_rate}}
  --data-dir ${{inputs.cifar}}
inputs:
  epochs: 1
  learning_rate: 0.2
  cifar:
     type: uri_folder
     path: azureml:cifar10@latest
environment: azureml:AzureML-acpt-pytorch-2.2-cuda12.1@latest
compute: azureml:cluster-a100
distribution:
  type: pytorch
  process_count_per_instance: 2
resources:
  instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.
  6. Create a YAML file called job_h100.yml as above, but replace the compute target with azureml:cluster-h100:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python train.py
  --epochs ${{inputs.epochs}}
  --learning-rate ${{inputs.learning_rate}}
  --data-dir ${{inputs.cifar}}
inputs:
  epochs: 1
  learning_rate: 0.2
  cifar:
     type: uri_folder
     path: azureml:cifar10@latest
environment: azureml:AzureML-acpt-pytorch-2.2-cuda12.1@latest
compute: azureml:cluster-h100
distribution:
  type: pytorch
  process_count_per_instance: 2
resources:
  instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.
  7. Log in to the Azure account and set the default workspace via az login followed by az configure --defaults workspace=xxxx group=xxxx.
  8. Submit the AML jobs (a rough Python SDK equivalent is sketched after these commands) with
az ml job create -f job_a100.yml

and

az ml job create -f job_h100.yml
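
For reference, a roughly equivalent submission through the Python SDK (azure-ai-ml) is sketched below. This is only a sketch under the same assumptions as the CLI steps (same workspace, cifar10 data asset, and compute clusters); placeholders such as <subscription-id> must be filled in, and the call shapes should be checked against the installed SDK version.

# submit_job.py -- sketch of the same job submission via the azure-ai-ml SDK
from azure.ai.ml import MLClient, command, Input
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="src",
    command=(
        "python train.py "
        "--epochs ${{inputs.epochs}} "
        "--learning-rate ${{inputs.learning_rate}} "
        "--data-dir ${{inputs.cifar}}"
    ),
    inputs={
        "epochs": 1,
        "learning_rate": 0.2,
        "cifar": Input(type="uri_folder", path="azureml:cifar10@latest"),
    },
    environment="azureml:AzureML-acpt-pytorch-2.2-cuda12.1@latest",
    compute="cluster-h100",  # swap to "cluster-a100" for the working case
    instance_count=2,
    distribution={"type": "pytorch", "process_count_per_instance": 2},
    display_name="pytorch-cifar-distributed-example",
    experiment_name="pytorch-cifar-distributed-example",
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)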

Expected behavior

Both jobs should succeed, with the AML job logs showing the specified number of processes and GPU usage for each of them, i.e., 4 processes in total for 2 nodes with 2 processes per node. From the monitoring panel of the job, we can see, e.g.:

(Screenshots of GPU utilization from the job monitoring panel.)
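
The 2 x 2 layout can also be confirmed from inside the job. With distribution type pytorch, the AML launcher is expected to populate the standard torch.distributed environment variables, so a few illustrative lines (not part of the original train.py) near the top of the script would log one line per process:

import os

# Illustrative check of the expected 2 nodes x 2 processes layout; RANK,
# LOCAL_RANK and WORLD_SIZE are assumed to be set by the launcher.
rank = int(os.environ.get("RANK", -1))
local_rank = int(os.environ.get("LOCAL_RANK", -1))
world_size = int(os.environ.get("WORLD_SIZE", -1))
print(f"global rank {rank} of {world_size}, local rank {local_rank}")
# Expected across the whole job: world_size == 4, local ranks 0-1 on each node.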

Actual behavior

The job on the A100 cluster succeeds without any obvious errors.
The job on the H100 cluster fails completely due to NCCL errors.

Additional information

The error message from the main process on the H100 cluster is:

854283d6899342f0a2ab3629685461f300000W:92:413 [0] socket.c:540 NCCL WARN Net : Connection closed by remote peer 854283d6899342f0a2ab3629685461f300000x.internal.cloudapp.net<51100>
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO ib_plugin.c:449 -> 6
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO transport/net.cc:869 -> 6
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO misc/socket.cc:47 -> 3
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO misc/socket.cc:58 -> 3
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO misc/socket.cc:773 -> 3
854283d6899342f0a2ab3629685461f300000W:92:413 [0] NCCL INFO proxy.cc:1356 -> 3

854283d6899342f0a2ab3629685461f300000W:92:413 [0] proxy.cc:1505 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection

854283d6899342f0a2ab3629685461f300000W:92:413 [0] proxy.cc:1539 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3
Traceback (most recent call last):
  File "/mnt/azureml/cr/j/c6cd24b215ed41daaf6ea5d84fc406ea/exe/wd/train.py", line 194, in <module>
    main(args)
  File "/mnt/azureml/cr/j/c6cd24b215ed41daaf6ea5d84fc406ea/exe/wd/train.py", line 109, in main
    model = nn.parallel.DistributedDataParallel(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/utils.py", line 263, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, remote process exited or there was a network error, NCCL version 2.19.4
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Connection closed by remote peer 854283d6899342f0a2ab3629685461f300000x.internal.cloudapp.net<47776>
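
The traceback shows the failure inside _verify_param_shape_across_processes, i.e., in the very first NCCL collective before any training starts. A minimal script that only initializes the process group and runs a single all_reduce can help separate the H100 cluster's NCCL/networking problem from train.py itself. The sketch below is a hypothetical debugging aid, not part of the original repro; it would be submitted with the same job definition (optionally with NCCL_DEBUG set to INFO via the job's environment_variables) in place of train.py.

# nccl_check.py -- minimal NCCL sanity check (hypothetical helper, not in the repro)
import os

import torch
import torch.distributed as dist


def main():
    # LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are assumed to
    # be provided by the AML pytorch distribution launcher.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # A single small all_reduce exercises the same inter-node NCCL path that
    # fails in _verify_param_shape_across_processes above.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()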
