Skip to content

Conversation

@ScottTodd
Copy link
Member

@ScottTodd ScottTodd commented Nov 20, 2025

Fixes #2173

Note

Posting this for review to see how others feel about the overall approach before I iterate some more.

This writes a step summary for each of our "Test PyTorch Wheels" workflow runs showing how to reproduce the test environment without needing to sift through the logs, find workflow file sources, or even clone TheRock to use our scripts.

I've tried a few variations on the output format and I'm not yet fully satisfied: https://gist.github.com/ScottTodd/6a465a4958fdaea59ede417434ba64b4. Here's a test run with v1: https://github.com/ROCm/TheRock/actions/runs/19553166331. The code in this PR is currently v2.

A few design points I'm working through:

  • I want the instructions to be easy to copy/paste into issue reports and terminals, but there are two key branches:
    1. Starting a Docker container (which is optional). If you copy/paste a full set of instructions and one starts an interactive container the remaining instructions will not be run.
    2. If you already have pytorch checked out, you don't need to clone the repository.
  • Probably also want a few quick bullet points at the start listing the inputs (pytorch version), linking to the github branch, linking to the release index page, etc.
  • If we have the pytorch tests generate a test report, we could also include that in the summary and even show how to run specific failed test cases individually (maybe limit to the first 10 failures?)

run: |
python build_tools/github_actions/summarize_test_pytorch_workflow.py \
--torch-version=${{ inputs.torch_version }} \
--pytorch-ref=${{ inputs.pytorch_version }} \
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this input name will need to adapt to #2239 if that is merged first.

Copy link
Contributor

@HereThereBeDragons HereThereBeDragons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall a good idea.

here some thoughts for the moment.

  1. i am not sure if we want to have it running all the time in the pipeline?

wouldnt it be less noise if we have a markdown "here are the steps to do and export the variables from the CI run" and then have in the CI just print

export AMDGPU_FAMILY="gfx1151"
export TORCH_VERSION="2.7.1+rocm7.10.0a20251120"
export PYTORCH_GIT_REF="release/2.7"

etc?

  1. in the future, when i have finally finished #1732, we can also easily propagate the container image hash

  2. related to the different markdown versions:
    do workflow steps show proper markdown? i always thought it is just plain text.

  3. python venv should be an extra step und not part of docker. maybe have the apt install and setup of it together.

summary += "See [Running/testing PyTorch](https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch). "
summary += "For example:\n\n"
summary += "```bash\n"
summary += "PYTORCH_TEST_WITH_ROCM=1 python pytorch/test/run_test.py --include test_torch\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are not using this PYTORCH_TEST_WITH_ROCM - not sure if that is needed. as we are already checking out pytorch which takes enternity we could also check out therock. without fetch_sources it should be fast.

alternatively you should defnitely put here also an example with -k as this i would say is the preferred choice for debugging. not to run an entire test file but just selected tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are not using this PYTORCH_TEST_WITH_ROCM - not sure if that is needed. as we are already checking out pytorch which takes enternity we could also check out therock. without fetch_sources it should be fast.

That variable is used in https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_utils.py. These instructions are written from the perspective of upstream PyTorch (https://github.com/pytorch/pytorch), with no dependencies on any sources from TheRock.

Here I figured we could put the most basic instructions that match https://rocm.docs.amd.com/projects/install-on-linux/en/develop/install/3rd-party/pytorch-install.html#test-the-pytorch-installation. If we link to https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch, we can also give more detailed instructions for using our scripts too.

alternatively you should defnitely put here also an example with -k as this i would say is the preferred choice for debugging. not to run an entire test file but just selected tests.

What I want for that is to

  1. generate a test report during testing
  2. parse the test report to get a list of failures
  3. list the failures and show commands to reproduce those specific failures

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case where you give a specific list of failed commands, i can understand it would be quite cool to list the command to rerun.

but maybe we can reduce it "look here URL to setup the container" and then "here is the pytorch test command to run"

# TODO: default the index subdir based on the current GPU somehow?
# (share that logic with setup_venv.py if so)
parser.add_argument(
"--index-subdir",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe better: --gpu-arch

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could go either way on this. "Index subdirectory" explains exactly how the data is going to be used, while "gpu arch" is closer to what a user might know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the question is: are we going to change this directory hierarchy anytime soon? if not, i would change it to gpu and add a comment that this is just an index subdirectory?

summary += " ```bash\n"
summary += f" git clone --branch {args.pytorch_git_ref} --origin {pytorch_remote_name} https://github.com/{pytorch_repo_org}/pytorch.git\n"
summary += " ```\n\n"
summary += "* (B) Switch to pytorch ref using an existing repository\n\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
summary += "* (B) Switch to pytorch ref using an existing repository\n\n"
summary += "* or (B) Switch to pytorch ref using an existing repository\n\n"

summary += "```\n\n"

pytorch_remote_name = "upstream" if args.pytorch_git_ref == "nightly" else "rocm"
pytorch_repo_org = "pytorch" if args.pytorch_git_ref == "nightly" else "ROCm"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont think this is super elegant for the upstream. couldnt be the repo org be empty if we are upstream?

Copy link
Member Author

@ScottTodd ScottTodd Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like --origin {pytorch_remote_name} to be omitted here?

This currently logs either:

git clone --branch nightly --origin upstream https://github.com/pytorch/pytorch.git
git clone --branch release/2.9 --origin rocm https://github.com/ROCm/pytorch.git

Would you prefer this or something else?

git clone --branch nightly https://github.com/pytorch/pytorch.git
git clone --branch release/2.9 --origin rocm https://github.com/ROCm/pytorch.git

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah the 2nd one :)

summary += "See [Running/testing PyTorch](https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch). "
summary += "For example:\n\n"
summary += "```bash\n"
summary += "PYTORCH_TEST_WITH_ROCM=1 python pytorch/test/run_test.py --include test_torch\n"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are not using this PYTORCH_TEST_WITH_ROCM - not sure if that is needed. as we are already checking out pytorch which takes enternity we could also check out therock. without fetch_sources it should be fast.

That variable is used in https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_utils.py. These instructions are written from the perspective of upstream PyTorch (https://github.com/pytorch/pytorch), with no dependencies on any sources from TheRock.

Here I figured we could put the most basic instructions that match https://rocm.docs.amd.com/projects/install-on-linux/en/develop/install/3rd-party/pytorch-install.html#test-the-pytorch-installation. If we link to https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch, we can also give more detailed instructions for using our scripts too.

alternatively you should defnitely put here also an example with -k as this i would say is the preferred choice for debugging. not to run an entire test file but just selected tests.

What I want for that is to

  1. generate a test report during testing
  2. parse the test report to get a list of failures
  3. list the failures and show commands to reproduce those specific failures

summary += "```\n\n"

pytorch_remote_name = "upstream" if args.pytorch_git_ref == "nightly" else "rocm"
pytorch_repo_org = "pytorch" if args.pytorch_git_ref == "nightly" else "ROCm"
Copy link
Member Author

@ScottTodd ScottTodd Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like --origin {pytorch_remote_name} to be omitted here?

This currently logs either:

git clone --branch nightly --origin upstream https://github.com/pytorch/pytorch.git
git clone --branch release/2.9 --origin rocm https://github.com/ROCm/pytorch.git

Would you prefer this or something else?

git clone --branch nightly https://github.com/pytorch/pytorch.git
git clone --branch release/2.9 --origin rocm https://github.com/ROCm/pytorch.git

# TODO: default the index subdir based on the current GPU somehow?
# (share that logic with setup_venv.py if so)
parser.add_argument(
"--index-subdir",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could go either way on this. "Index subdirectory" explains exactly how the data is going to be used, while "gpu arch" is closer to what a user might know.

summary += "PYTORCH_TEST_WITH_ROCM=1 python pytorch/test/run_test.py --include test_torch\n"
summary += "```\n"

gha_append_step_summary(summary)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3. related to the different markdown versions:
do workflow steps show proper markdown? i always thought it is just plain text.

This writes to GITHUB_STEP_SUMMARY (https://docs.github.com/en/actions/reference/workflows-and-actions/workflow-commands#adding-a-job-summary), which supports GitHub flavored markdown.

It will appear like here: https://github.com/ROCm/TheRock/actions/runs/19553166331#summary-55989478372 . Note that when there are multiple jobs in the same workflow run, we will have a list like here: https://github.com/ROCm/TheRock/actions/runs/19680190301#summary-56371979719.

Comment on lines +3 to +5
"""
This summarizes the environment setup steps for the
.github/workflows/test_pytorch_wheels.yml workflow.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. i am not sure if we want to have it running all the time in the pipeline?

wouldnt it be less noise if we have a markdown "here are the steps to do and export the variables from the CI run" and then have in the CI just print

export AMDGPU_FAMILY="gfx1151"
export TORCH_VERSION="2.7.1+rocm7.10.0a20251120"
export PYTORCH_GIT_REF="release/2.7"

I hear that, yeah. I think for many contributors who aren't as familiar with each CI pipeline, having a nicely formatted summary will make workflow results easier to understand. I know where in the logs to look for reproduction steps across each workflow type, but many developers do not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments some discussions above: maybe have a generic file in the doc/ how to setup docker for pytorch tests and then in the ci have a referrence to that doc as URL + the pytorch test command

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: TODO

Development

Successfully merging this pull request may close these issues.

[Feature] Log simple reproduction steps in PyTorch test workflows

3 participants