Description
This GPT-3 23B pre-training tutorial crashes after 12-18 hours of pre-training, specifically on the Ubuntu 22.04 stack. It works on the Ubuntu 20.04 stack.
Not all processes in the cluster crash, but one or more processes in the 4-node cluster crash with the same error.
The key error stack trace is as follows:
2024-Jan-11 19:01:15.502423 18506:21252 ERROR ENC:ncclNetRegMr failed neuronNetRegMr request to NCCL
2024-Jan-11 19:01:15.502435 18506:21252 ERROR ENC:configure_net_connector [rank 0, channel 0] failed to register channel buffer, addr: 0x0x7fd881000000, len: 50331648
2024-Jan-11 19:01:15.503417 18506:21252 ERROR ENC:alg_ring_init [nec_dev 13] failed to configure network connector for RING
2024-Jan-11 19:01:15.499989 18506:21240 ERROR ENC:ncclNetRegMr failed neuronNetRegMr request to NCCL
2024-Jan-11 19:01:15.503432 18506:21240 ERROR ENC:configure_net_connector [rank 0, channel 0] failed to register channel buffer, addr: 0x0x7fd895000000, len: 50331648
2024-Jan-11 19:01:15.501323 18506:21247 ERROR ENC:ncclNetRegMr failed neuronNetRegMr request to NCCL
2024-Jan-11 19:01:15.503428 18506:21252 ERROR ENC:init_ring_algorithm [nec_dev 13] failed to alg_ring_init for RING
2024-Jan-11 19:01:15.505482 18506:21240 ERROR ENC:alg_ring_init [nec_dev 1] failed to configure network connector for RING
2024-Jan-11 19:01:15.505492 18506:21240 ERROR ENC:init_ring_algorithm [nec_dev 1] failed to alg_ring_init for RING
2024-Jan-11 19:01:15.505497 18506:21240 ERROR ENC:enc_init_comm [rank 0] failed to init ring algorithm
2024-Jan-11 19:01:15.505503 18506:21240 ERROR ENC:enc_init_replica_groups [nec_dev 1] failed to init ENC comm
2024-Jan-11 19:01:15.505509 18506:21240 ERROR ENC:enc_load_operations [nec_dev 1] failed to init replica groups
2024-Jan-11 19:01:15.505514 18506:21240 ERROR TDRV:v2_cc_execute [nec_dev 1] failed to load operations
2024-Jan-11 19:01:15.505519 18506:21240 ERROR NMGR:dlr_infer Failed to prep collectives execution, err: 1
2024-Jan-11 19:01:15.505550 18506:21240 ERROR NMGR:kmgr_async_exec_default_exec_status_callbackExec id 0 for model 10017 on worker 1 failed with fatal status 1... aborting.
python3: /local/p4clients/pkgbuild-Gx12v/workspace/src/KaenaRuntime/kmgr/kmgr_async_exec.cc:27: void kmgr_async_exec_default_exec_status_callback(void*, uint32_t, uint32_t, uint64_t, NRT_STATUS): Assertion `0' failed.
2024-Jan-11 19:01:15.504471 18506:21247 ERROR ENC:configure_net_connector [rank 0, channel 0] failed to register channel buffer, addr: 0x0x7fd88d000000, len: 50331648
2024-Jan-11 19:01:15.508072 18506:21247 ERROR ENC:alg_ring_init [nec_dev 8] failed to configure network connector for RING
2024-Jan-11 19:01:15.505482 18506:21252 ERROR ENC:enc_init_comm [rank 0] failed to init ring algorithm
2024-Jan-11 19:01:15.508081 18506:21247 ERROR ENC:init_ring_algorithm [nec_dev 8] failed to alg_ring_init for RING
OS information
Linux ip-172-31-73-214 6.2.0-1017-aws #17~22.04.1-Ubuntu SMP Fri Nov 17 21:07:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Pip freeze for Neuron
apex @ file:///home/ubuntu/neuronx-nemo-megatron/build/apex-0.1-py3-none-any.whl#sha256=882cc65b94adc92e20864e468d82f072395571a54155472d77f1961b846cd9b2
aws-neuronx-runtime-discovery==2.9
libneuronxla==0.5.669
nemo_toolkit @ file:///home/ubuntu/neuronx-nemo-megatron/build/nemo_toolkit-1.14.0-py3-none-any.whl#sha256=dad4a2ecf0d65d03eb481542cffaaabe58c9960b25e3c58725cd7a0aad516cef
neuronx-cc==2.12.54.0+f631c2365
neuronx-hwm==2.12.0.0+422c9037c
torch-neuronx==1.13.1.1.13.0
torch-xla==1.13.1+torchneurond
Packages for Neuron
aws-neuronx-collectives 2.19.7.0-530fb3064 amd64 neuron_ccom built using CMake
aws-neuronx-dkms 2.15.9.0 amd64 aws-neuronx driver in DKMS format.
aws-neuronx-oci-hook 2.2.45.0 amd64 neuron_oci_hook built using CMake
aws-neuronx-runtime-lib 2.19.5.0-97e2d271b amd64 neuron_runtime built using CMake
aws-neuronx-tools 2.16.1.0 amd64 Neuron profile and debug tools
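For anyone reproducing this, the same version information can be collected on each node with commands along these lines (a sketch; the grep filters are just for convenience):
uname -a
pip freeze | grep -iE 'neuron|nemo|torch|apex'
dpkg -l | grep -i neuron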
Setup
Cluster type
OpenMPI
Head Node
The head node type is trn1.2xlarge and was created using this CFN template, with EFS and FSx file-systems enabled
Cluster Nodes
The cluster nodes were of type trn1.32xlarge and were created using this CFN template
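A quick way to confirm that the shared file-systems, Neuron devices, and EFA provider are visible on a cluster node (assuming the standard Neuron tools and EFA installer, which provide neuron-ls and fi_info):
df -h ~/efs ~/fsx        # shared EFS and FSx for Lustre mounts
neuron-ls                # lists the Neuron devices on the trn1.32xlarge node
fi_info -p efa           # confirms the EFA libfabric provider is available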
Open MPI Launch script
This is the launch script (pretrain_openmpi.sh), after running neuron_parallel_compile:
#!/bin/bash
set -o pipefail
[[ $# -ne 1 ]] && echo "usage: $0 script" && exit 1
SCRIPT=$1
echo "Training script: $SCRIPT"
[[ -z $MASTER_ADDR ]] && echo "MASTER_ADDR is not set" && exit 1
[[ -z $HOSTFILE ]] && echo "HOSTFILE is not set" && exit 1
NUM_PARALLEL=4
JOB_ID="neuron_nemo_megatron_gpt_23b"
export LOGS_DIR="$HOME/fsx/neuronx_logs/$JOB_ID"
mkdir -p $LOGS_DIR
export CACHE_DIR="$HOME/fsx/neuronx_cache/$JOB_ID"
mkdir -p $CACHE_DIR
export XDG_CACHE_HOME="$HOME/efs/.cache/$JOB_ID"
mkdir -p $XDG_CACHE_HOME
export DATA_PATH="$HOME/efs/examples_datasets/gpt2"
[[ ! -d $DATA_PATH ]] && echo "$DATA_PATH not found" && exit 1
export WORK_DIR=$HOME/efs/git/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling
export PATH='/opt/aws/neuron/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin'
LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/aws/neuron/lib"
LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/amazon/efa/lib"
LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/amazon/efa/lib64"
LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/amazon/openmpi/lib64"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/lib"
mpirun -np $NUM_PARALLEL --verbose \
--hostfile $HOSTFILE \
-bind-to none -map-by slot \
--mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 \
--mca hwloc_base_binding_policy none --mca rmaps_base_mapping_policy slot \
--mca orte_keep_fqdn_hostnames t \
--report-child-jobs-separately \
--display-map --tag-output --timestamp-output \
-wdir $WORK_DIR \
-x PATH \
-x LD_LIBRARY_PATH \
-x PYTHONUNBUFFERED=1 \
-x PYTHONIOENCODING=UTF-8 \
-x LANG=C.UTF-8 \
-x LC_ALL=C.UTF-8 \
-x MASTER_ADDR \
-x DATA_PATH \
-x CACHE_DIR \
-x LOGS_DIR \
-x WORK_DIR \
-x SCRIPT \
-x XDG_CACHE_HOME \
-x TOKENIZERS_PARALLELISM=false \
bash -c "source /home/ubuntu/aws_neuron_nemo_megatron/bin/activate && \
./$SCRIPT"
This is the test.sh script:
#!/usr/bin/env bash
source ./train_setup.sh
: ${SEQ_LENGTH:=2048}
: ${HS:=4096}
: ${TP:=8}
: ${PP:=1}
: ${N_LAYERS:=32}
: ${N_AH:=32}
: ${UBS:=1}
: ${ACT_CHKPNT_GRANULARITY:=full}
: ${GBS_MULTIPLE:=32}
GBS=$((NTASKS*GBS_MULTIPLE))
: ${TRAIN_ITERS:=300000}
FFN_HS=$(($HS*4))
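# Sizing note (assumes the 4-node Open MPI launch above, one rank per node, so NTASKS=4):
# GBS = NTASKS * GBS_MULTIPLE = 4 * 32 = 128, and FFN_HS = 4 * HS = 16384 for HS=4096.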
echo "SEQ_LEN=$SEQ_LENGTH, HS=$HS, FFN_HS=$FFN_HS TP=$TP PP=$PP N_LAYERS=$N_LAYERS N_AH=$N_AH GBS=$GBS UBS=$UBS TRAIN_ITERS=$TRAIN_ITERS"
$MAYBE_COMPILE torchrun $DISTRIBUTED_ARGS megatron_gpt_pretraining.py \
--config-path=conf \
--config-name=megatron_gpt_config \
trainer.devices=$PROCESSES_PER_NODE \
trainer.num_nodes=$NTASKS \
trainer.max_epochs=null \
trainer.max_steps=$TRAIN_ITERS \
trainer.val_check_interval=$(($TRAIN_ITERS+1)) \
trainer.log_every_n_steps=1 \
trainer.limit_val_batches=1 \
trainer.limit_test_batches=1 \
trainer.accumulate_grad_batches=1 \
trainer.precision=32 \
model.megatron_amp_O2=$megatron_amp_O2 \
model.micro_batch_size=$UBS \
model.global_batch_size=$GBS \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=$PP \
model.max_position_embeddings=$SEQ_LENGTH \
model.encoder_seq_length=$SEQ_LENGTH \
model.hidden_size=$HS \
model.ffn_hidden_size=$FFN_HS \
model.num_layers=$N_LAYERS \
model.num_attention_heads=$N_AH \
model.init_method_std=0.021 \
model.hidden_dropout=0.1 \
model.layernorm_epsilon=1e-5 \
model.tokenizer.vocab_file=$DATA_PATH/gpt2-vocab.json \
model.tokenizer.merge_file=$DATA_PATH/gpt2-merges.txt \
model.data.data_prefix=[1.0,$DATA_PATH/my-gpt2_text_document] \
model.data.num_workers=1 \
model.data.seq_length=$SEQ_LENGTH \
model.optim.name=$OPTIM_NAME \
model.optim.capturable=True \
model.optim.lr=0.00015 \
model.optim.betas=[0.9,0.95] \
model.optim.weight_decay=0.01 \
model.optim.sched.name=CosineAnnealing \
model.optim.sched.warmup_steps=750 \
model.optim.sched.constant_steps=80000 \
model.optim.sched.min_lr=1.0e-5 \
model.sequence_parallel=True \
model.activations_checkpoint_granularity=$ACT_CHKPNT_GRANULARITY \
model.activations_checkpoint_method=uniform \
model.activations_checkpoint_num_layers=1 \
+model.save_xser=True \
exp_manager.create_tensorboard_logger=$CREATE_TB_LOGGER \
exp_manager.resume_if_exists=False \
exp_manager.resume_ignore_no_checkpoint=False \
exp_manager.create_checkpoint_callback=$CHECKPOINT_CALLBACK \
exp_manager.explicit_log_dir=$EXPLICIT_LOGDIR \
+exp_manager.checkpoint_callback_params.train_time_interval=3600 \
model.use_cpu_initialization=True 2>&1 | tee -a $LOG_PATH/log
exit 0
This is the train_setup.sh script:
#!/usr/bin/env bash
set -o pipefail
set -e
ulimit -n 65535
export FI_EFA_USE_DEVICE_RDMA=1
export FI_PROVIDER=efa
export FI_EFA_FORK_SAFE=1
if [ -v SLURM_NNODES ]
then
# SLURM runs
sudo sysctl -w net.ipv4.ip_local_reserved_ports=41000
IPS=""
for h in $(scontrol show hostname); do
IPS="$IPS $(nslookup $h | awk '/^Address: / { print $2 }')";
done
HOSTS=(${IPS//\ / })
NODEID=$SLURM_NODEID
NTASKS=$SLURM_NTASKS
export MASTER_ADDR=${HOSTS[0]}
export NEMO_EXPM_VERSION=$SLURM_JOB_ID
export EXPLICIT_LOGDIR=null
: ${SLURM_RESTART_COUNT:=0}
LOG_PATH=logs/$SLURM_JOB_ID/$SLURM_RESTART_COUNT/$NODEID/
mkdir -p $LOG_PATH
export NEURON_COMPILE_CACHE_URL="$HOME/neuron_cache" # Place cache on shared storage to reduce redundant compilations
# Make sure to install latest runtime
./setup.sh 2>&1 | tee $LOG_PATH/setup.log
elif [ -v OMPI_COMM_WORLD_RANK ]
then
# MPI
[[ -z $MASTER_ADDR ]] && echo "MASTER_ADDR is not set" && exit 1
TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`
PRIMARY_MAC=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -s http://169.254.169.254/latest/meta-data/mac)
export CCOM_SOCKET_IFNAME=$(ip -o link show | grep -F "link/ether $PRIMARY_MAC" | awk -F'[ :]+' '{print $2}')
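# The three lines above fetch the instance's primary MAC via IMDSv2 and map it to its
# network interface name, which CCOM_SOCKET_IFNAME points the Neuron collectives at.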
NODEID=$OMPI_COMM_WORLD_RANK
NTASKS=$OMPI_COMM_WORLD_SIZE
export EXPLICIT_LOGDIR=$LOGS_DIR
LOG_PATH=$LOGS_DIR/$NODEID/
mkdir -p $LOG_PATH
export NEURON_COMPILE_CACHE_URL=$CACHE_DIR/$NODEID # Place cache on shared storage to reduce redundant compilations
else
# Single-node, non-SLURM, non-MPI runs
HOSTS=(localhost)
NODEID=0
NTASKS=1
export MASTER_ADDR=${HOSTS[0]}
export NEMO_EXPM_VERSION=$(date "+%Y-%m-%d_%H-%M-%S")
export EXPLICIT_LOGDIR=null
LOG_PATH=./nemo_experiments/logs
mkdir -p $LOG_PATH
fi
export HYDRA_FULL_ERROR=1
export PROCESSES_PER_NODE=32
export MASTER_PORT=41000
export NEURON_RT_EXEC_TIMEOUT=10
export DISTRIBUTED_ARGS="--nproc_per_node $PROCESSES_PER_NODE --nnodes $NTASKS --node_rank $NODEID --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
echo $DISTRIBUTED_ARGS
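# For node 0 of this 4-node training run, the echo above prints (MASTER_ADDR shown as a placeholder):
# --nproc_per_node 32 --nnodes 4 --node_rank 0 --master_addr 10.0.0.11 --master_port 41000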
export BUCKET_CAP_MB=1024
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=5
export NEURON_TRANSFER_WITH_STATIC_RING_OPS=""
export MALLOC_ARENA_MAX=128
export TF_NUM_INTEROP_THREADS=1024
export XLA_THREAD_POOL_SIZE=4
export XLA_IO_THREAD_POOL_SIZE=4
export NEURON_RT_STOCHASTIC_ROUNDING_EN=1
#training_precision is one of 'bf16SR', 'megatron_amp_O2', 'fp32_OptStates'
#training_precision = "bf16SR", uses BF16 + Stochastic Rounding
#training_precision = "megatron_amp_O2", master weights and optimizer states are stored in fp32, model weights in bf16
#training_precision = "fp32_OptStates", optimizer states are stored in fp32, model weights in bf16
training_precision="bf16SR"
if [[ $training_precision == "bf16SR" ]];then
echo using BF16 SR
export XLA_USE_BF16=1
export NEURON_CC_FLAGS="--model-type transformer --distribution-strategy=nemo --enable-mixed-precision-accumulation"
export OPTIM_NAME=adamw
export megatron_amp_O2=false
elif [[ $training_precision == "megatron_amp_O2" ]]; then
echo using megatron_amp_O2
export XLA_DOWNCAST_BF16=1
export NEURON_CC_FLAGS="--model-type transformer --distribution-strategy=nemo --enable-mixed-precision-accumulation"
export OPTIM_NAME=adamw
export megatron_amp_O2=true
elif [[ $training_precision == "fp32_OptStates" ]]; then
echo using FP32 Optimizer States
export XLA_DOWNCAST_BF16=1
export NEURON_CC_FLAGS="--model-type transformer --distribution-strategy=nemo --enable-mixed-precision-accumulation"
export OPTIM_NAME=adamw_fp32OptState
export megatron_amp_O2=false
else
echo Incorrect Training Precision Provided
fi
export CREATE_TB_LOGGER=True
export CHECKPOINT_CALLBACK=True
if [ "$COMPILE" = "1" ]; then
echo "compiling only run"
MAYBE_COMPILE="neuron_parallel_compile"
export TRAIN_ITERS=3
CREATE_TB_LOGGER=False
CHECKPOINT_CALLBACK=False
export MASTER_PORT=41001
fi
Steps to reproduce
- Connect to the head node using the DCV client. Verify you have EFS mounted under ~/efs and the FSx for Lustre file-system mounted under ~/fsx.
- Set up an SSH key on the head node so Open MPI can ssh to the cluster nodes. This means adding the ssh key in ~/.ssh/id_rsa and setting ~/.ssh/config as follows:
  Host *
      StrictHostKeyChecking no
- source ~/aws_neuron_nemo_megatron/bin/activate
- sudo mkdir -p ~/efs/git; sudo chown -R ubuntu:ubuntu ~/efs/git
- sudo mkdir -p ~/efs/examples_datasets/gpt2/; sudo chown -R ubuntu:ubuntu ~/efs/examples_datasets/gpt2/
- Prepare the GPT2 data under ~/efs/examples_datasets/gpt2/
- cd ~/efs/git; git clone https://github.com/aws-neuron/neuronx-nemo-megatron.git
- cd ~/efs/git/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling
- Create an Open MPI hostfile for the four cluster nodes with slots=1 for each of the four nodes (see the example hostfile after this list). Set the path to the hostfile in the environment variable (export HOSTFILE=) and set the IP address of one of the cluster nodes in the environment variable (export MASTER_ADDR=).
- After installing the new, or replacing the existing, shell script files noted above, run:
  ./pretrain_openmpi.sh gpt_23b.sh 1>/tmp/a.out 2>&1 &
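For reference, a minimal hostfile for this setup would look like the sketch below; the IP addresses are placeholders for the private IPs of the four trn1.32xlarge cluster nodes:
# hosts.txt -- one MPI slot per cluster node
10.0.0.11 slots=1
10.0.0.12 slots=1
10.0.0.13 slots=1
10.0.0.14 slots=1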