This issue tracks closing the torchtitan CI gap between ROCm and CUDA.
We need to add ROCm CI support for the integration tests and for the tests under experiments. The integration tests are the first priority. They comprise the features.py, flux.py, ft.py, h100.py and models.py tests. The experiments tests comprise compiler_toolkit, simple_fsdp, torchcomms, transformers_modeling_backend and vlm.
- Implement ciflow/rocm for torchtitan. This will allow ROCm workflows to run from forked PRs.
- Integration tests:
  - features integration test: currently the only integration test with ROCm CI support, added in PR "Enable ROCm CI support" #1786. On both CUDA and ROCm, the features integration test takes ~18 minutes to execute. The Pull Docker step takes another ~5 minutes, so the total is ~23 minutes.
  - models integration test
  - flux integration test
  - ft integration test
  - h100 integration test
- Experiment tests:
  - compiler_toolkit experiment test
  - simple_fsdp, torchcomms experiment tests
  - transformers_modeling_backend experiment test
  - vlm experiment test
There are also CPU unit tests. However, they do not need ROCm support because they are CPU-only tests that run on a linux.2xlarge runner.
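
To make the remaining gap easy to see at a glance, the checklist above can be sketched as a small coverage table. This is only an illustrative sketch: the suite names come from this issue, but the data structure and the `rocm_gap` helper are hypothetical and are not part of torchtitan or its CI configuration.

```python
# Hypothetical tracking table for this issue; not part of torchtitan.
# Suite names are taken from the checklist above.
INTEGRATION_SUITES = ["features", "models", "flux", "ft", "h100"]
EXPERIMENT_SUITES = [
    "compiler_toolkit",
    "simple_fsdp",
    "torchcomms",
    "transformers_modeling_backend",
    "vlm",
]

# Per the issue, only the features integration test runs on ROCm so far.
ROCM_ENABLED = {"features"}


def rocm_gap(suites, enabled):
    """Return the suites that still lack ROCm CI support, in listed order."""
    return [name for name in suites if name not in enabled]


integration_gap = rocm_gap(INTEGRATION_SUITES, ROCM_ENABLED)
experiment_gap = rocm_gap(EXPERIMENT_SUITES, ROCM_ENABLED)

print("Integration tests missing ROCm CI:", ", ".join(integration_gap))
print("Experiment tests missing ROCm CI:", ", ".join(experiment_gap))
```

As each suite gains ROCm support, adding its name to `ROCM_ENABLED` shrinks the reported gap, mirroring the checkboxes in the list above.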