
Torchtitan CI gap Between ROCm & CUDA #2098

@akashveramd


This issue tracks closing the torchtitan CI gap between ROCm and CUDA.
We need to add ROCm support for the integration tests and for the tests under experiments. The integration tests are the first priority. They comprise features.py, flux.py, ft.py, h100.py, and models.py. The experiment tests cover compiler_toolkit, simple_fsdp, torchcomms, transformers_modeling_backend, and vlm.

  • Implement ciflow/rocm for torchtitan. This will allow ROCm workflows to run from forked PRs.

- Integration Tests-

  • Currently, ROCm CI supports only the features integration test, which was added in "Enable ROCm CI support" (#1786). On both CUDA and ROCm, the features integration test itself runs for ~18 minutes and the Pull Docker step takes ~5 minutes, so the total execution time is ~23 minutes.
  • models integration test
  • flux integration test
  • ft integration test
  • h100 integration test

- Experiment Tests-

  • compiler_toolkit experiment test
  • simple_fsdp and torchcomms experiment tests
  • transformers_modeling_backend experiment test
  • vlm experiment test

There are also CPU unit tests. They don't need ROCm support because they are CPU-only tests that run on the linux.2xlarge runner.
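
The remaining gap described in the lists above can be summarized with a small script. This is only an illustrative sketch for tracking progress, not part of torchtitan: the suite names and timings are taken from this issue, while `rocm_gap`, `total_minutes`, and the constant names are hypothetical helpers introduced here.

```python
# Illustrative tracker for the ROCm CI gap described in this issue.
# Suite names come from the issue text; all identifiers are hypothetical.

INTEGRATION_TESTS = ["features", "flux", "ft", "h100", "models"]
EXPERIMENT_TESTS = [
    "compiler_toolkit",
    "simple_fsdp",
    "torchcomms",
    "transformers_modeling_backend",
    "vlm",
]

# Per the issue, only the features integration test runs on ROCm today.
ROCM_ENABLED = {"features"}


def rocm_gap(suites, enabled):
    """Return the suites that still lack ROCm CI coverage, in listed order."""
    return [name for name in suites if name not in enabled]


def total_minutes(test_minutes, docker_pull_minutes):
    """Approximate job time: test execution plus the Pull Docker step."""
    return test_minutes + docker_pull_minutes


if __name__ == "__main__":
    print("Integration tests missing on ROCm:",
          rocm_gap(INTEGRATION_TESTS, ROCM_ENABLED))
    print("Experiment tests missing on ROCm:",
          rocm_gap(EXPERIMENT_TESTS, ROCM_ENABLED))
    print("features job total (min):", total_minutes(18, 5))
```

As each suite gains ROCm support, adding its name to `ROCM_ENABLED` shrinks the reported gap.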
