Description
The conformer workload hangs when run with the Shampoo training algorithm.
Traceback
I0505 23:26:00.158526 139795269302080 submission_runner.py:319] Starting training loop.
I0505 23:26:00.373614 139795269302080 input_pipeline.py:20] Loading split = train-clean-100
I0505 23:26:00.410641 139795269302080 input_pipeline.py:20] Loading split = train-clean-360
I0505 23:26:00.817146 139795269302080 input_pipeline.py:20] Loading split = train-other-500
2023-05-05 23:32:21.267134: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
Steps to Reproduce
Pull the docker image:
$ docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/base_image:timing
Run the container and the entrypoint script, which will launch the submission runner:
$ docker run -t -d -v /home/kasimbeg/data/:/data/ -v /home/kasimbeg/experiment_runs/:/experiment_runs -v /home/kasimbeg/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/base_image:timing -d librispeech -f jax -s baselines/shampoo/jax/submission.py -w librispeech_conformer -t baselines/shampoo/tuning_search_space_conformer.json -e timing_fancy_2_redo/timing_shampoo -m 20000 -c False -o True -r False
To see the output of submission_runner.py, monitor the container's logs:
$ docker logs -f <container_id printed by previous command>
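If the container ID from the previous command is no longer at hand, a standard docker ps will list the running containers so you can grab it:
$ docker ps   # lists running containers and their IDs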
Source or Possible Fix
I think this may be an XLA memory issue. On a different VM the runs got a little further along and errored out with a seemingly memory-related failure. After restarting all the VMs, the runs don't get any further than the message above. I may have changed some environment flags on the VM that got further along. I tried setting XLA_PYTHON_CLIENT_PREALLOCATE=false, which didn't change anything, and XLA_PYTHON_CLIENT_MEM_FRACTION=.80, which made it error out sooner.
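For reference, these variables just need to be in the environment before submission_runner.py starts, e.g. exported in the container shell (or passed with docker run -e). Roughly:
$ export XLA_PYTHON_CLIENT_PREALLOCATE=false   # disables JAX's upfront GPU memory preallocation
$ export XLA_PYTHON_CLIENT_MEM_FRACTION=.80    # caps preallocation at 80% of GPU memory (only applies when preallocation is enabled)
$ python3 submission_runner.py ...             # same flags as in the debugging steps below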
For reference, here is the output of the run that got further:
I0504 04:13:55.604297 139669110556480 submission_runner.py:415] Time since start: 5507.98s, Step: 3192, {'train/ctc_loss': DeviceArray(1.8280892, dtype=float32), 'train/wer': 0.44060410729030713, 'validation/ctc_loss': DeviceArray(2.347627, dtype=float32), 'validation/wer': 0.4757016469044564, 'validation/num_examples': 5348, 'test/ctc_loss': DeviceArray(1.9786975, dtype=float32), 'test/wer': 0.4258119553957711, 'test/num_examples': 2472, 'score': 5199.0200300216675, 'total_duration': 5507.9753386974335, 'accumulated_submission_time': 5199.0200300216675, 'accumulated_eval_time': 308.8330419063568, 'accumulated_logging_time': 0.07393193244934082}
I0504 04:13:55.626670 139496276358912 logging_writer.py:48] [3192] accumulated_eval_time=308.833042, accumulated_logging_time=0.073932, accumulated_submission_time=5199.020030, global_step=3192, preemption_count=0, score=5199.020030, test/ctc_loss=1.9786975383758545, test/num_examples=2472, test/wer=0.425812, total_duration=5507.975339, train/ctc_loss=1.8280892372131348, train/wer=0.440604, validation/ctc_loss=2.3476269245147705, validation/num_examples=5348, validation/wer=0.475702
I0504 04:14:11.395573 139496267966208 logging_writer.py:48] [3200] global_step=3200, grad_norm=0.9491458535194397, loss=1.8900938034057617
I0504 04:16:31.686281 139496276358912 logging_writer.py:48] [3300] global_step=3300, grad_norm=0.8079001307487488, loss=1.9073154926300049
I0504 04:18:51.079895 139496267966208 logging_writer.py:48] [3400] global_step=3400, grad_norm=0.7481346726417542, loss=1.8942415714263916
I0504 04:21:10.230679 139496276358912 logging_writer.py:48] [3500] global_step=3500, grad_norm=0.8360145092010498, loss=1.8996608257293701
I0504 04:23:28.858766 139496267966208 logging_writer.py:48] [3600] global_step=3600, grad_norm=1.0013821125030518, loss=1.9014889001846313
I0504 04:25:48.481894 139496276358912 logging_writer.py:48] [3700] global_step=3700, grad_norm=0.9406089186668396, loss=1.8606724739074707
I0504 04:28:07.921166 139496267966208 logging_writer.py:48] [3800] global_step=3800, grad_norm=0.8744378089904785, loss=1.8769983053207397
2023-05-04 04:30:00.136705: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2130] Execution of replica 5 failed: INTERNAL: Failed to launch CUDA kernel: fusion_59 with block dimensions: 32x1x1 and grid dimensions: 1x1x1: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2023-05-04 04:30:09.848574: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-05-04 04:30:09.848806: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-05-04 04:30:09.850079: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-05-04 04:30:09.852182: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-05-04 04:30:09.855277: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-05-04 04:30:09.855440: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-05-04 04:30:09.862346: E external/org_tensorflow/tensorflow/compiler/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-05-04 04:30:10.145779: F external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2275] Replicated computation launch failed, but not all replicas terminated. Aborting process to work around deadlock. Failure message (there may have been multiple failures, see the error log for all failures):
Failed to launch CUDA kernel: fusion_59 with block dimensions: 32x1x1 and grid dimensions: 1x1x1: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
Fatal Python error: Aborted
To debug in the container:
Run the container without starting the submission runner (i.e. do not pass a value for the -s flag):
$ docker run -t -d -v /home/kasimbeg/data/:/data/ -v /home/kasimbeg/experiment_runs/:/experiment_runs -v /home/kasimbeg/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/base_image:timing -r false -b true
Start an interactive bash session in the running container:
$ docker exec -it <container_id> /bin/bash
Run submission_runner.py in the container:
$ python3 submission_runner.py --framework=jax --workload=librispeech_conformer --submission_path=baselines/shampoo/jax/submission.py --tuning_search_space=baselines/shampoo/tuning_search_space.json --data_dir=/data/librispeech --num_tuning_trials=1 --experiment_dir=/experiment_runs --experiment_name=timing_fancy_2_redo/timing_shampoo --overwrite=True --save_checkpoints=False --max_global_steps=20000 --librispeech_tokenizer_vocab_path=/data/librispeech/spm_model.vocab 2>&1 | tee -a /logs/librispeech_conformer_jax
You can also clone the repo to the host VM and mount the local checkout into the container, so you can make changes to the code without losing them.
Pull the repo:
$ cd $HOME
$ git clone https://github.com/priyakasimbeg/algorithmic-efficiency.git
$ git fetch origin && git pull && git checkout shampoo_debugging
Run the container with the mounted directory:
$ docker run -t -d -v /home/kasimbeg/data/:/data/ -v /home/kasimbeg/experiment_runs/:/experiment_runs -v /home/kasimbeg/experiment_runs/logs:/logs -v $HOME/algorithmic-efficiency:/algorithmic-efficiency --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/base_image:timing -r False -b
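From there, attach to the running container and launch the runner from the mounted checkout, so edits made on the host take effect inside the container (this assumes the repo is mounted at /algorithmic-efficiency as in the command above):
$ docker exec -it <container_id> /bin/bash
$ cd /algorithmic-efficiency
$ python3 submission_runner.py ...   # same flags as in the debugging steps above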