[torch] Log environment reproduction steps in test workflows #2238

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Draft

ScottTodd wants to merge 6 commits into main from users/scotttodd/torch-repro-logging

Member

ScottTodd commented Nov 20, 2025 •

edited

Loading

Fixes #2173

Note

Posting this for review to see how others feel about the overall approach before I iterate some more.

This writes a step summary for each of our "Test PyTorch Wheels" workflow runs showing how to reproduce the test environment without needing to sift through the logs, find workflow file sources, or even clone TheRock to use our scripts.

I've tried a few variations on the output format and I'm not yet fully satisfied: https://gist.github.com/ScottTodd/6a465a4958fdaea59ede417434ba64b4. Here's a test run with v1: https://github.com/ROCm/TheRock/actions/runs/19553166331. The code in this PR is currently v2.

A few design points I'm working through:

I want the instructions to be easy to copy/paste into issue reports and terminals, but there are two key branches:
1. Starting a Docker container (which is optional). If you copy/paste a full set of instructions and one starts an interactive container the remaining instructions will not be run.
2. If you already have pytorch checked out, you don't need to clone the repository.
Probably also want a few quick bullet points at the start listing the inputs (pytorch version), linking to the github branch, linking to the release index page, etc.
If we have the pytorch tests generate a test report, we could also include that in the summary and even show how to run specific failed test cases individually (maybe limit to the first 10 failures?)

ScottTodd added 4 commits

November 20, 2025 13:12


          Rename "pytorch_version" to "pytorch_ref" to disambiguate further

d73dd8e


          Introduce summarize_test_pytorch_workflow.py.

d6d8ecb


          Call new script from test workflow

75be90f


          Iterate on output format.

fade399

ScottTodd requested a review from HereThereBeDragons

November 20, 2025 23:26

github-project-automation bot added this to TheRock Triage

github-project-automation bot moved this to TODO in TheRock Triage

ScottTodd added 2 commits

November 20, 2025 16:09


          Revert "Rename "pytorch_version" to "pytorch_ref" to disambiguate fur…

7b9d74d

…ther"

This reverts commit d73dd8e.


          Rename arg in script to pytorch_git_ref, revert workflow to old name

a1f7b6e

ScottTodd commented

View reviewed changes

.github/workflows/test_pytorch_wheels.yml

    
                      run: |

                        python build_tools/github_actions/summarize_test_pytorch_workflow.py \

                          --torch-version=${{ inputs.torch_version }} \

                          --pytorch-ref=${{ inputs.pytorch_version }} \

Member Author

ScottTodd Nov 21, 2025

Note: this input name will need to adapt to #2239 if that is merged first.

HereThereBeDragons reviewed

View reviewed changes

Contributor

HereThereBeDragons left a comment

overall a good idea.

here some thoughts for the moment.

i am not sure if we want to have it running all the time in the pipeline?

wouldnt it be less noise if we have a markdown "here are the steps to do and export the variables from the CI run" and then have in the CI just print

export AMDGPU_FAMILY="gfx1151"
export TORCH_VERSION="2.7.1+rocm7.10.0a20251120"
export PYTORCH_GIT_REF="release/2.7"

etc?

in the future, when i have finally finished #1732, we can also easily propagate the container image hash
related to the different markdown versions:
do workflow steps show proper markdown? i always thought it is just plain text.
python venv should be an extra step und not part of docker. maybe have the apt install and setup of it together.

build_tools/github_actions/summarize_test_pytorch_workflow.py

    
                  summary += "See [Running/testing PyTorch](https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch). "

                  summary += "For example:\n\n"

                  summary += "```bash\n"

                  summary += "PYTORCH_TEST_WITH_ROCM=1 python pytorch/test/run_test.py --include test_torch\n"

Contributor

HereThereBeDragons Nov 25, 2025

we are not using this PYTORCH_TEST_WITH_ROCM - not sure if that is needed. as we are already checking out pytorch which takes enternity we could also check out therock. without fetch_sources it should be fast.

alternatively you should defnitely put here also an example with -k as this i would say is the preferred choice for debugging. not to run an entire test file but just selected tests.

Member Author

ScottTodd Nov 25, 2025

we are not using this PYTORCH_TEST_WITH_ROCM - not sure if that is needed. as we are already checking out pytorch which takes enternity we could also check out therock. without fetch_sources it should be fast.

That variable is used in https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_utils.py. These instructions are written from the perspective of upstream PyTorch (https://github.com/pytorch/pytorch), with no dependencies on any sources from TheRock.

Here I figured we could put the most basic instructions that match https://rocm.docs.amd.com/projects/install-on-linux/en/develop/install/3rd-party/pytorch-install.html#test-the-pytorch-installation. If we link to https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch, we can also give more detailed instructions for using our scripts too.

alternatively you should defnitely put here also an example with -k as this i would say is the preferred choice for debugging. not to run an entire test file but just selected tests.

What I want for that is to

generate a test report during testing
parse the test report to get a list of failures
list the failures and show commands to reproduce those specific failures

Contributor

HereThereBeDragons Nov 26, 2025

in that case where you give a specific list of failed commands, i can understand it would be quite cool to list the command to rerun.

but maybe we can reduce it "look here URL to setup the container" and then "here is the pytorch test command to run"

build_tools/github_actions/summarize_test_pytorch_workflow.py

    
                  # TODO: default the index subdir based on the current GPU somehow?

                  #       (share that logic with setup_venv.py if so)

                  parser.add_argument(

                      "--index-subdir",

Contributor

HereThereBeDragons Nov 25, 2025

maybe better: --gpu-arch

Member Author

ScottTodd Nov 26, 2025

I could go either way on this. "Index subdirectory" explains exactly how the data is going to be used, while "gpu arch" is closer to what a user might know.

Contributor

HereThereBeDragons Nov 26, 2025

the question is: are we going to change this directory hierarchy anytime soon? if not, i would change it to gpu and add a comment that this is just an index subdirectory?

build_tools/github_actions/summarize_test_pytorch_workflow.py

    
                  summary += "    ```bash\n"

                  summary += f"    git clone --branch {args.pytorch_git_ref} --origin {pytorch_remote_name} https://github.com/{pytorch_repo_org}/pytorch.git\n"

                  summary += "    ```\n\n"

                  summary += "* (B) Switch to pytorch ref using an existing repository\n\n"

Contributor

HereThereBeDragons Nov 25, 2025

Suggested change

      
                summary += "* (B) Switch to pytorch ref using an existing repository\n\n"
          
                summary += "* or (B) Switch to pytorch ref using an existing repository\n\n"

build_tools/github_actions/summarize_test_pytorch_workflow.py

    
                      summary += "```\n\n"

                  pytorch_remote_name = "upstream" if args.pytorch_git_ref == "nightly" else "rocm"

                  pytorch_repo_org = "pytorch" if args.pytorch_git_ref == "nightly" else "ROCm"

Contributor

HereThereBeDragons Nov 25, 2025

i dont think this is super elegant for the upstream. couldnt be the repo org be empty if we are upstream?

Member Author

ScottTodd Nov 25, 2025 •

edited

Loading

Would you like --origin {pytorch_remote_name} to be omitted here?

This currently logs either:

git clone --branch nightly --origin upstream https://github.com/pytorch/pytorch.git
git clone --branch release/2.9 --origin rocm https://github.com/ROCm/pytorch.git

Would you prefer this or something else?

git clone --branch nightly https://github.com/pytorch/pytorch.git
git clone --branch release/2.9 --origin rocm https://github.com/ROCm/pytorch.git

Contributor

HereThereBeDragons Nov 26, 2025

yeah the 2nd one :)

ScottTodd commented

View reviewed changes

build_tools/github_actions/summarize_test_pytorch_workflow.py

    
                  summary += "See [Running/testing PyTorch](https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch). "

                  summary += "For example:\n\n"

                  summary += "```bash\n"

                  summary += "PYTORCH_TEST_WITH_ROCM=1 python pytorch/test/run_test.py --include test_torch\n"

Member Author

ScottTodd Nov 25, 2025

we are not using this PYTORCH_TEST_WITH_ROCM - not sure if that is needed. as we are already checking out pytorch which takes enternity we could also check out therock. without fetch_sources it should be fast.

That variable is used in https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_utils.py. These instructions are written from the perspective of upstream PyTorch (https://github.com/pytorch/pytorch), with no dependencies on any sources from TheRock.

Here I figured we could put the most basic instructions that match https://rocm.docs.amd.com/projects/install-on-linux/en/develop/install/3rd-party/pytorch-install.html#test-the-pytorch-installation. If we link to https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch, we can also give more detailed instructions for using our scripts too.

alternatively you should defnitely put here also an example with -k as this i would say is the preferred choice for debugging. not to run an entire test file but just selected tests.

What I want for that is to

generate a test report during testing
parse the test report to get a list of failures
list the failures and show commands to reproduce those specific failures

build_tools/github_actions/summarize_test_pytorch_workflow.py

    
                      summary += "```\n\n"

                  pytorch_remote_name = "upstream" if args.pytorch_git_ref == "nightly" else "rocm"

                  pytorch_repo_org = "pytorch" if args.pytorch_git_ref == "nightly" else "ROCm"

Member Author

ScottTodd Nov 25, 2025 •

edited

Loading

Would you like --origin {pytorch_remote_name} to be omitted here?

This currently logs either:

git clone --branch nightly --origin upstream https://github.com/pytorch/pytorch.git
git clone --branch release/2.9 --origin rocm https://github.com/ROCm/pytorch.git

Would you prefer this or something else?

git clone --branch nightly https://github.com/pytorch/pytorch.git
git clone --branch release/2.9 --origin rocm https://github.com/ROCm/pytorch.git

build_tools/github_actions/summarize_test_pytorch_workflow.py

    
                  # TODO: default the index subdir based on the current GPU somehow?

                  #       (share that logic with setup_venv.py if so)

                  parser.add_argument(

                      "--index-subdir",

Member Author

ScottTodd Nov 26, 2025

I could go either way on this. "Index subdirectory" explains exactly how the data is going to be used, while "gpu arch" is closer to what a user might know.

build_tools/github_actions/summarize_test_pytorch_workflow.py

    
                  summary += "PYTORCH_TEST_WITH_ROCM=1 python pytorch/test/run_test.py --include test_torch\n"

                  summary += "```\n"

                  gha_append_step_summary(summary)

Member Author

ScottTodd Nov 26, 2025

3. related to the different markdown versions:
do workflow steps show proper markdown? i always thought it is just plain text.

This writes to GITHUB_STEP_SUMMARY (https://docs.github.com/en/actions/reference/workflows-and-actions/workflow-commands#adding-a-job-summary), which supports GitHub flavored markdown.

It will appear like here: https://github.com/ROCm/TheRock/actions/runs/19553166331#summary-55989478372 . Note that when there are multiple jobs in the same workflow run, we will have a list like here: https://github.com/ROCm/TheRock/actions/runs/19680190301#summary-56371979719.

build_tools/github_actions/summarize_test_pytorch_workflow.py

Comment on lines +3 to +5

    
              """

              This summarizes the environment setup steps for the

              .github/workflows/test_pytorch_wheels.yml workflow.

Member Author

ScottTodd Nov 26, 2025

i am not sure if we want to have it running all the time in the pipeline?

wouldnt it be less noise if we have a markdown "here are the steps to do and export the variables from the CI run" and then have in the CI just print
export AMDGPU_FAMILY="gfx1151"
export TORCH_VERSION="2.7.1+rocm7.10.0a20251120"
export PYTORCH_GIT_REF="release/2.7"

I hear that, yeah. I think for many contributors who aren't as familiar with each CI pipeline, having a nicely formatted summary will make workflow results easier to understand. I know where in the logs to look for reproduction steps across each workflow type, but many developers do not.

Contributor

HereThereBeDragons Nov 26, 2025

see my comments some discussions above: maybe have a generic file in the doc/ how to setup docker for pytorch tests and then in the ci have a referrence to that doc as URL + the pytorch test command

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet