
Shampoo conformer workload hangs #389

@priyakasimbeg

Description

The conformer workload hangs when run with the Shampoo training algorithm.

Traceback

I0505 23:26:00.158526 139795269302080 submission_runner.py:319] Starting training loop.
I0505 23:26:00.373614 139795269302080 input_pipeline.py:20] Loading split = train-clean-100
I0505 23:26:00.410641 139795269302080 input_pipeline.py:20] Loading split = train-clean-360
I0505 23:26:00.817146 139795269302080 input_pipeline.py:20] Loading split = train-other-500
2023-05-05 23:32:21.267134: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:

Steps to Reproduce

Pull the docker image:

$ docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/base_image:timing

Run the container; its entrypoint script launches the submission runner:

$ docker run -t -d -v /home/kasimbeg/data/:/data/ -v /home/kasimbeg/experiment_runs/:/experiment_runs -v /home/kasimbeg/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/base_image:timing -d librispeech -f jax -s baselines/shampoo/jax/submission.py -w librispeech_conformer -t baselines/shampoo/tuning_search_space_conformer.json -e timing_fancy_2_redo/timing_shampoo -m 20000 -c False -o True -r False 

To see the output of submission_runner.py, monitor the container's logs:

$ docker logs -f <container_id printed by previous command>

Source or Possible Fix

I think this may be an XLA memory issue. On a different VM the runs got a little further along and then errored out with what looked like a memory-related failure. After I restarted all the VMs, none of them get past the message above. I may have changed some environment flags on the VM that got further along. I tried setting XLA_PYTHON_CLIENT_PREALLOCATE=false, which had no effect, and XLA_PYTHON_CLIENT_MEM_FRACTION=.80, which made the run error out sooner.
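If the allocator settings turn out to matter, one way to experiment with them from outside the container is to forward the XLA flags as Docker environment variables (a sketch, assuming the image's entrypoint does not override them; the values shown are just the ones tried above), and to watch GPU memory on the host while the job runs:

$ docker run -t -d --env XLA_PYTHON_CLIENT_PREALLOCATE=false --env XLA_PYTHON_CLIENT_MEM_FRACTION=.80 -v /home/kasimbeg/data/:/data/ -v /home/kasimbeg/experiment_runs/:/experiment_runs -v /home/kasimbeg/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/base_image:timing -d librispeech -f jax -s baselines/shampoo/jax/submission.py -w librispeech_conformer -t baselines/shampoo/tuning_search_space_conformer.json -e timing_fancy_2_redo/timing_shampoo -m 20000 -c False -o True -r False
$ watch -n 5 nvidia-smi

If the hang coincides with one or more GPUs sitting at full memory in nvidia-smi, that would support the memory hypothesis.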

For reference, here is the output of the run that got further along:

I0504 04:13:55.604297 139669110556480 submission_runner.py:415] Time since start: 5507.98s, 	Step: 3192, 	{'train/ctc_loss': DeviceArray(1.8280892, dtype=float32), 'train/wer': 0.44060410729030713, 'validation/ctc_loss': DeviceArray(2.347627, dtype=float32), 'validation/wer': 0.4757016469044564, 'validation/num_examples': 5348, 'test/ctc_loss': DeviceArray(1.9786975, dtype=float32), 'test/wer': 0.4258119553957711, 'test/num_examples': 2472, 'score': 5199.0200300216675, 'total_duration': 5507.9753386974335, 'accumulated_submission_time': 5199.0200300216675, 'accumulated_eval_time': 308.8330419063568, 'accumulated_logging_time': 0.07393193244934082}
I0504 04:13:55.626670 139496276358912 logging_writer.py:48] [3192] accumulated_eval_time=308.833042, accumulated_logging_time=0.073932, accumulated_submission_time=5199.020030, global_step=3192, preemption_count=0, score=5199.020030, test/ctc_loss=1.9786975383758545, test/num_examples=2472, test/wer=0.425812, total_duration=5507.975339, train/ctc_loss=1.8280892372131348, train/wer=0.440604, validation/ctc_loss=2.3476269245147705, validation/num_examples=5348, validation/wer=0.475702
I0504 04:14:11.395573 139496267966208 logging_writer.py:48] [3200] global_step=3200, grad_norm=0.9491458535194397, loss=1.8900938034057617
I0504 04:16:31.686281 139496276358912 logging_writer.py:48] [3300] global_step=3300, grad_norm=0.8079001307487488, loss=1.9073154926300049
I0504 04:18:51.079895 139496267966208 logging_writer.py:48] [3400] global_step=3400, grad_norm=0.7481346726417542, loss=1.8942415714263916
I0504 04:21:10.230679 139496276358912 logging_writer.py:48] [3500] global_step=3500, grad_norm=0.8360145092010498, loss=1.8996608257293701
I0504 04:23:28.858766 139496267966208 logging_writer.py:48] [3600] global_step=3600, grad_norm=1.0013821125030518, loss=1.9014889001846313
I0504 04:25:48.481894 139496276358912 logging_writer.py:48] [3700] global_step=3700, grad_norm=0.9406089186668396, loss=1.8606724739074707
I0504 04:28:07.921166 139496267966208 logging_writer.py:48] [3800] global_step=3800, grad_norm=0.8744378089904785, loss=1.8769983053207397
2023-05-04 04:30:00.136705: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2130] Execution of replica 5 failed: INTERNAL: Failed to launch CUDA kernel: fusion_59 with block dimensions: 32x1x1 and grid dimensions: 1x1x1: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2023-05-04 04:30:09.848574: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-05-04 04:30:09.848806: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-05-04 04:30:09.850079: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-05-04 04:30:09.852182: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-05-04 04:30:09.855277: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-05-04 04:30:09.855440: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-05-04 04:30:09.862346: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-05-04 04:30:10.145779: F external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2275] Replicated computation launch failed, but not all replicas terminated. Aborting process to work around deadlock. Failure message (there may have been multiple failures, see the error log for all failures): 

Failed to launch CUDA kernel: fusion_59 with block dimensions: 32x1x1 and grid dimensions: 1x1x1: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
Fatal Python error: Aborted

To debug inside the container:

Run the container without starting the submission runner (i.e., without passing a value for the -s flag):

$ docker run -t -d -v /home/kasimbeg/data/:/data/ -v /home/kasimbeg/experiment_runs/:/experiment_runs -v /home/kasimbeg/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/base_image:timing  -r false -b true

Start an interactive bash session in the running container:

$ docker exec -it <container_id> /bin/bash

Run submission_runner.py in the container:

$ python3 submission_runner.py --framework=jax --workload=librispeech_conformer --submission_path=baselines/shampoo/jax/submission.py --tuning_search_space=baselines/shampoo/tuning_search_space.json --data_dir=/data/librispeech --num_tuning_trials=1 --experiment_dir=/experiment_runs --experiment_name=timing_fancy_2_redo/timing_shampoo --overwrite=True --save_checkpoints=False --max_global_steps=20000 --librispeech_tokenizer_vocab_path=/data/librispeech/spm_model.vocab 2>&1 | tee -a /logs/librispeech_conformer_jax
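
Before launching the full workload, it can also help to confirm that JAX inside the container sees all the GPUs and can run a trivial pmapped computation (a minimal sanity check, assuming the python3/JAX install that ships in the image):

$ python3 -c "import jax; print(jax.devices())"    # should list one device per GPU
$ python3 -c "import jax, jax.numpy as jnp; print(jax.pmap(lambda x: x * 2)(jnp.arange(jax.local_device_count())))"    # trivial per-device computation

If the second command itself hangs or crashes, the problem is likely in the multi-GPU setup rather than anything specific to the Shampoo submission.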

You can also pull the code to the host VM and mount the local repo into the container, so that changes you make to the code are not lost when the container is removed (see the sketch after these steps for using the mounted copy).

Pull the repo and check out the debugging branch:

$ cd $HOME
$ git clone https://github.com/priyakasimbeg/algorithmic-efficiency.git
$ cd algorithmic-efficiency
$ git fetch origin && git checkout shampoo_debugging

Run the container with the mounted directory:

$ docker run -t -d -v /home/kasimbeg/data/:/data/ -v /home/kasimbeg/experiment_runs/:/experiment_runs -v /home/kasimbeg/experiment_runs/logs:/logs -v $HOME/algorithmic-efficiency:/algorithmic-efficiency --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/base_image:timing -r False -b
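
With the repo mounted at /algorithmic-efficiency, edits made on the host are visible inside the container immediately. Depending on how the image installs the package, the mounted copy may need to be installed in editable mode before it is the one Python actually imports (hypothetical commands below; adjust to however the image sets up its environment):

$ docker exec -it <container_id> /bin/bash
$ cd /algorithmic-efficiency && python3 -m pip install -e .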
