
Conversation

@sjarus (Collaborator) commented Aug 21, 2025

Updates the PyTorch version since the old version is no longer downloadable.

  • PyTorch is now at 2.9.0.dev20250820
  • TorchVision is now at 0.24.0.dev20250820

Signed-off-by: Suraj Sudhir <[email protected]>
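
For reference, the bump is roughly equivalent to installing the CPU nightlies directly from the PyTorch nightly index (a minimal sketch; the real pins live in the repo's requirements files, whose exact contents are not reproduced here):

# Minimal sketch: install the bumped CPU nightlies directly.
python -m pip install --pre \
  torch==2.9.0.dev20250820+cpu \
  torchvision==0.24.0.dev20250820+cpu \
  --index-url https://download.pytorch.org/whl/nightly/cpu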
@zjgarvey (Collaborator) commented:

It looks like #4262 is still relevant. Let's try to address this promptly.

If we need to temporarily add these to xfails or no-run sets, let's do this to unblock CI.

@sjarus (Collaborator, Author) commented Aug 25, 2025

@zjgarvey I'm having trouble reproducing the failure "torch._dynamo.exc.InternalTorchDynamoError: TimeoutError: Timeout" locally. It doesn't happen when I build in-tree in a venv, as described in docs/development.md, and then run:
./projects/pt1/tools/e2e_test.sh --config onnx
./projects/pt1/tools/e2e_test.sh --config fx_importer
./projects/pt1/tools/e2e_test.sh --config fx_importer_stablehlo
./projects/pt1/tools/e2e_test.sh --config fx_importer_tosa
This is the case even though the logs indicate the right Torch version, i.e.:
TORCH_VERSION_FOR_COMPARISON = 2.9.0.dev20250820

Is there a pointer somewhere describing how to exactly mimic the CI steps?
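
(One hedged way to approximate the CI steps locally, assuming the workflow lives at .github/workflows/ci.yml and that the matrix torch-version value is "nightly" — neither of which is confirmed in this thread — might be:)

# Sketch of approximating the CI steps locally; workflow file name and the
# install_python_deps.sh argument value are assumptions.
cat .github/workflows/ci.yml                         # inspect the exact steps the runner executes
python3.11 -m venv ci_venv && source ci_venv/bin/activate
bash build_tools/ci/install_python_deps.sh nightly   # torch-version value assumed
# ...build in-tree as in docs/development.md (cmake steps shown later in this thread)...
./projects/pt1/tools/e2e_test.sh --config fx_importer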

@vivekkhandelwal1 (Collaborator) commented Aug 26, 2025

It looks like #4262 is still relevant. Let's try to address this promptly.

If we need to temporarily add these to xfails or no-run sets, let's do this to unblock CI.

@zjgarvey, as per my observation the set of failing tests is not deterministic: different tests fail on different runs. I suspect the runner is using Python 3.10, which could be the reason for the failure, or it could be something else related to the runner. I have never been able to reproduce this issue locally.

@sahas3 (Member) commented Aug 26, 2025

I ran the steps of the CI workflow starting from

bash build_tools/ci/install_python_deps.sh ${{ matrix.torch-version }}

but unfortunately couldn't repro the failure. One thing I noticed is that the error points at python3.10 even though the workflow file specifies python3.11 (locally I have Python 3.11): https://github.com/llvm/torch-mlir/actions/runs/17141703353/job/48630987308?pr=4298#step:9:8705. I am not sure where it's getting 3.10 from and whether that has something to do with the failure.
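
(One way to confirm which interpreter and site-packages a run actually uses — a minimal sketch with standard tooling, run inside the same environment as the job — would be:)

# Print the interpreter in use, its version, and where torch actually got installed.
which python3 && python3 --version
python3 -c "import sys, torch; print(sys.executable); print(torch.__version__, torch.__file__)"
python3 -m pip show torch | grep -E 'Version|Location'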

@sjarus (Collaborator, Author) commented Aug 26, 2025

My working venv uses Python 3.10.12. The pip list is:

cmake             3.31.4
dill              0.3.9
filelock          3.16.1
fsspec            2024.12.0
Jinja2            3.1.5
MarkupSafe        3.0.2
mpmath            1.3.0
multiprocess      0.70.17
nanobind          2.5.0
networkx          3.4.2
ninja             1.11.1.3
numpy             2.2.1
onnx              1.16.1
packaging         24.2
pillow            11.1.0
pip               25.2
protobuf          5.29.3
pybind11          2.13.6
PyYAML            6.0.2
setuptools        59.6.0
sympy             1.13.3
torch             2.9.0.dev20250820+cpu
torchvision       0.24.0.dev20250820+cpu
typing_extensions 4.12.2
wheel             0.45.1

e2e_test.sh works fine:
onnx:

Summary:
    Passed: 941
    Expectedly Failed: 651

fx_importer:

Summary:
    Passed: 1493
    Expectedly Failed: 115

fx_importer_tosa:

Summary:
    Passed: 1173
    Expectedly Failed: 463

Not only did the new torch work, it also removed a bunch of entries from the xfail sets. This is very likely a CI setup problem and not a Torch-MLIR/PyTorch interaction issue as such.
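
(A quick way to spot entries that may now be stale, assuming the usual projects/pt1 layout and using a hypothetical test name, is something like:)

# Hypothetical check: see whether a now-passing test is still listed in the xfail sets.
grep -n "SomeNowPassingTest" projects/pt1/e2e_testing/xfail_sets.py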

@zjgarvey (Collaborator) commented:

Sorry for the late reply. @sjarus, this is a bit strange. It does look like the deps are being installed into a 3.10 site-packages, but then why are we installing python3.11? Something strange is going on here.

@sjarus (Collaborator, Author) commented Aug 27, 2025

Hi @zjgarvey, yes, it seems something about the CI setup scripts mangles the environment to the point that it interferes with the torch fx import step. Internally, we just follow the standard build steps from docs/development.md under Python 3.10.12. For clarity, these are our exact steps:

python -m venv mlir_venv
source mlir_venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m pip install -r torchvision-requirements.txt
 
# Update submodules
git submodule update --init --recursive
 
# Create target:
cmake -GNinja -Bbuild \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DLLVM_ENABLE_ASSERTIONS=ON \
  -DPython_FIND_VIRTUALENV=ONLY \
  -DPython3_FIND_VIRTUALENV=ONLY \
  -DLLVM_ENABLE_PROJECTS=mlir \
  -DLLVM_EXTERNAL_PROJECTS="torch-mlir" \
  -DLLVM_EXTERNAL_TORCH_MLIR_SOURCE_DIR="$PWD" \
  -DMLIR_ENABLE_BINDINGS_PYTHON=ON \
  -DLLVM_TARGETS_TO_BUILD=host \
  -DTORCH_MLIR_ENABLE_PYTORCH_EXTENSIONS=ON \
  -DTORCH_MLIR_ENABLE_STABLEHLO=OFF \
  externals/llvm-project/llvm
 
# Build
cmake --build build

# Run
./projects/pt1/tools/e2e_test.sh --config fx_importer_tosa
./projects/pt1/tools/e2e_test.sh --config onnx_tosa

This works fine, and internally we have bumped up the PyTorch and TorchVision versions as listed in this PR, with clean runs.

There is a secondary problem with the CI scripts: the stable and nightly configurations report different results, and neither appears to behave quite the same as the CI itself does.

The CI scripts and/or docker instance may need a serious review.

@sahas3 (Member) commented Aug 28, 2025

@sjarus, @zjgarvey Looks like setting up Python was the issue in the CI workflow. I've attempted to fix that and now have a successful nightly build: https://github.com/llvm/torch-mlir/actions/runs/17281729733/job/49051279175.

I do see different tests failing in nightly vs stable in the CI. Building stable locally now to see if it's also a CI only issue.

@sahas3 (Member) commented Aug 28, 2025

Stable test failure was reproducible locally too. Got clean CI in #4301.

@zjgarvey closed this Aug 28, 2025
@zjgarvey (Collaborator) commented:

Closed as #4301 merged this change into main. Thanks!
