-
Notifications
You must be signed in to change notification settings - Fork 133
[torch] Log environment reproduction steps in test workflows #2238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| run: | | ||
| python build_tools/github_actions/summarize_test_pytorch_workflow.py \ | ||
| --torch-version=${{ inputs.torch_version }} \ | ||
| --pytorch-ref=${{ inputs.pytorch_version }} \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this input name will need to adapt to #2239 if that is merged first.
HereThereBeDragons
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall a good idea.
here some thoughts for the moment.
- i am not sure if we want to have it running all the time in the pipeline?
wouldnt it be less noise if we have a markdown "here are the steps to do and export the variables from the CI run" and then have in the CI just print
export AMDGPU_FAMILY="gfx1151"
export TORCH_VERSION="2.7.1+rocm7.10.0a20251120"
export PYTORCH_GIT_REF="release/2.7"
etc?
-
in the future, when i have finally finished #1732, we can also easily propagate the container image hash
-
related to the different markdown versions:
do workflow steps show proper markdown? i always thought it is just plain text. -
python venv should be an extra step und not part of docker. maybe have the
apt installand setup of it together.
| summary += "See [Running/testing PyTorch](https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch). " | ||
| summary += "For example:\n\n" | ||
| summary += "```bash\n" | ||
| summary += "PYTORCH_TEST_WITH_ROCM=1 python pytorch/test/run_test.py --include test_torch\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are not using this PYTORCH_TEST_WITH_ROCM - not sure if that is needed. as we are already checking out pytorch which takes enternity we could also check out therock. without fetch_sources it should be fast.
alternatively you should defnitely put here also an example with -k as this i would say is the preferred choice for debugging. not to run an entire test file but just selected tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are not using this
PYTORCH_TEST_WITH_ROCM- not sure if that is needed. as we are already checking out pytorch which takes enternity we could also check out therock. without fetch_sources it should be fast.
That variable is used in https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_utils.py. These instructions are written from the perspective of upstream PyTorch (https://github.com/pytorch/pytorch), with no dependencies on any sources from TheRock.
Here I figured we could put the most basic instructions that match https://rocm.docs.amd.com/projects/install-on-linux/en/develop/install/3rd-party/pytorch-install.html#test-the-pytorch-installation. If we link to https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch, we can also give more detailed instructions for using our scripts too.
alternatively you should defnitely put here also an example with
-kas this i would say is the preferred choice for debugging. not to run an entire test file but just selected tests.
What I want for that is to
- generate a test report during testing
- parse the test report to get a list of failures
- list the failures and show commands to reproduce those specific failures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in that case where you give a specific list of failed commands, i can understand it would be quite cool to list the command to rerun.
but maybe we can reduce it "look here URL to setup the container" and then "here is the pytorch test command to run"
| # TODO: default the index subdir based on the current GPU somehow? | ||
| # (share that logic with setup_venv.py if so) | ||
| parser.add_argument( | ||
| "--index-subdir", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe better: --gpu-arch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could go either way on this. "Index subdirectory" explains exactly how the data is going to be used, while "gpu arch" is closer to what a user might know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the question is: are we going to change this directory hierarchy anytime soon? if not, i would change it to gpu and add a comment that this is just an index subdirectory?
| summary += " ```bash\n" | ||
| summary += f" git clone --branch {args.pytorch_git_ref} --origin {pytorch_remote_name} https://github.com/{pytorch_repo_org}/pytorch.git\n" | ||
| summary += " ```\n\n" | ||
| summary += "* (B) Switch to pytorch ref using an existing repository\n\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| summary += "* (B) Switch to pytorch ref using an existing repository\n\n" | |
| summary += "* or (B) Switch to pytorch ref using an existing repository\n\n" |
| summary += "```\n\n" | ||
|
|
||
| pytorch_remote_name = "upstream" if args.pytorch_git_ref == "nightly" else "rocm" | ||
| pytorch_repo_org = "pytorch" if args.pytorch_git_ref == "nightly" else "ROCm" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think this is super elegant for the upstream. couldnt be the repo org be empty if we are upstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you like --origin {pytorch_remote_name} to be omitted here?
This currently logs either:
git clone --branch nightly --origin upstream https://github.com/pytorch/pytorch.git
git clone --branch release/2.9 --origin rocm https://github.com/ROCm/pytorch.git
Would you prefer this or something else?
git clone --branch nightly https://github.com/pytorch/pytorch.git
git clone --branch release/2.9 --origin rocm https://github.com/ROCm/pytorch.git
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah the 2nd one :)
| summary += "See [Running/testing PyTorch](https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch). " | ||
| summary += "For example:\n\n" | ||
| summary += "```bash\n" | ||
| summary += "PYTORCH_TEST_WITH_ROCM=1 python pytorch/test/run_test.py --include test_torch\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are not using this
PYTORCH_TEST_WITH_ROCM- not sure if that is needed. as we are already checking out pytorch which takes enternity we could also check out therock. without fetch_sources it should be fast.
That variable is used in https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_utils.py. These instructions are written from the perspective of upstream PyTorch (https://github.com/pytorch/pytorch), with no dependencies on any sources from TheRock.
Here I figured we could put the most basic instructions that match https://rocm.docs.amd.com/projects/install-on-linux/en/develop/install/3rd-party/pytorch-install.html#test-the-pytorch-installation. If we link to https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch, we can also give more detailed instructions for using our scripts too.
alternatively you should defnitely put here also an example with
-kas this i would say is the preferred choice for debugging. not to run an entire test file but just selected tests.
What I want for that is to
- generate a test report during testing
- parse the test report to get a list of failures
- list the failures and show commands to reproduce those specific failures
| summary += "```\n\n" | ||
|
|
||
| pytorch_remote_name = "upstream" if args.pytorch_git_ref == "nightly" else "rocm" | ||
| pytorch_repo_org = "pytorch" if args.pytorch_git_ref == "nightly" else "ROCm" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you like --origin {pytorch_remote_name} to be omitted here?
This currently logs either:
git clone --branch nightly --origin upstream https://github.com/pytorch/pytorch.git
git clone --branch release/2.9 --origin rocm https://github.com/ROCm/pytorch.git
Would you prefer this or something else?
git clone --branch nightly https://github.com/pytorch/pytorch.git
git clone --branch release/2.9 --origin rocm https://github.com/ROCm/pytorch.git
| # TODO: default the index subdir based on the current GPU somehow? | ||
| # (share that logic with setup_venv.py if so) | ||
| parser.add_argument( | ||
| "--index-subdir", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could go either way on this. "Index subdirectory" explains exactly how the data is going to be used, while "gpu arch" is closer to what a user might know.
| summary += "PYTORCH_TEST_WITH_ROCM=1 python pytorch/test/run_test.py --include test_torch\n" | ||
| summary += "```\n" | ||
|
|
||
| gha_append_step_summary(summary) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3. related to the different markdown versions:
do workflow steps show proper markdown? i always thought it is just plain text.
This writes to GITHUB_STEP_SUMMARY (https://docs.github.com/en/actions/reference/workflows-and-actions/workflow-commands#adding-a-job-summary), which supports GitHub flavored markdown.
It will appear like here: https://github.com/ROCm/TheRock/actions/runs/19553166331#summary-55989478372 . Note that when there are multiple jobs in the same workflow run, we will have a list like here: https://github.com/ROCm/TheRock/actions/runs/19680190301#summary-56371979719.
| """ | ||
| This summarizes the environment setup steps for the | ||
| .github/workflows/test_pytorch_wheels.yml workflow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- i am not sure if we want to have it running all the time in the pipeline?
wouldnt it be less noise if we have a markdown "here are the steps to do and export the variables from the CI run" and then have in the CI just print
export AMDGPU_FAMILY="gfx1151" export TORCH_VERSION="2.7.1+rocm7.10.0a20251120" export PYTORCH_GIT_REF="release/2.7"
I hear that, yeah. I think for many contributors who aren't as familiar with each CI pipeline, having a nicely formatted summary will make workflow results easier to understand. I know where in the logs to look for reproduction steps across each workflow type, but many developers do not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comments some discussions above: maybe have a generic file in the doc/ how to setup docker for pytorch tests and then in the ci have a referrence to that doc as URL + the pytorch test command
Fixes #2173
Note
Posting this for review to see how others feel about the overall approach before I iterate some more.
This writes a step summary for each of our "Test PyTorch Wheels" workflow runs showing how to reproduce the test environment without needing to sift through the logs, find workflow file sources, or even clone TheRock to use our scripts.
I've tried a few variations on the output format and I'm not yet fully satisfied: https://gist.github.com/ScottTodd/6a465a4958fdaea59ede417434ba64b4. Here's a test run with v1: https://github.com/ROCm/TheRock/actions/runs/19553166331. The code in this PR is currently v2.
A few design points I'm working through: