
Provide parmmg with complete local vector including halos#228

Merged
stephankramer merged 9 commits into main from fix-metric-ordering on Mar 2, 2026
Conversation

@stephankramer
Collaborator

Following on from #226

This changes to_petsc_local_numbering to take in a local vector (i.e. a sequential vector that includes halo DOFs) and returns a local vector in the "natural ordering" of the dmplex topological points. This is what the dmplex parmmg interface expects from the metric input vector. Should not change anything in serial.

Appears to fix the warnings about indefinite metric in parmmg tests - which I think were caused by halo entries that were copied, out-of-bounds, from the global vector we were providing previously

Closes #181

Hopefully also fixes #215
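The reordering described above can be sketched in plain NumPy. This is only an illustration of the idea, not the Firedrake/petsc4py implementation: the toy `section` dict (mapping each topological point to its offset and DOF count in the local vector), the point numbers, and the `to_natural_order` helper are all made up for this example.

```python
import numpy as np

def to_natural_order(local_vec, section):
    """Return the DOFs of a local (owned + halo) vector reordered by
    ascending topological point number ("natural ordering")."""
    out = []
    for point in sorted(section):
        offset, ndofs = section[point]
        out.extend(local_vec[offset:offset + ndofs])
    return np.array(out)

# Toy local vector with 6 DOFs; some of these would be halo entries.
local_vec = np.array([10.0, 11.0, 20.0, 21.0, 30.0, 31.0])
# Toy section: point -> (offset in local_vec, number of DOFs);
# here point 2's DOFs happen to be stored first in the local vector.
section = {2: (0, 2), 0: (2, 2), 1: (4, 2)}

print(to_natural_order(local_vec, section))
# -> [20. 21. 30. 31. 10. 11.]
```

The key point is that the permutation is built over the full local vector, halos included, so no halo DOF is ever read from an out-of-bounds position in a global (owned-only) vector.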

@joewallwork joewallwork changed the base branch from main to 225_to-petsc-local-numbering February 23, 2026 15:27
@stephankramer stephankramer added the bug Something isn't working label Feb 23, 2026
Member

@joewallwork joewallwork left a comment


Thanks @stephankramer, I just have a suggestion and a small request then we can merge this into #226. Once we do that, I'll open a PR to remove to_petsc_local_numbering from Firedrake.

We should try and come up with some useful unit tests but that can be done as a follow-up.

Base automatically changed from 225_to-petsc-local-numbering to main February 24, 2026 15:28
@stephankramer
Collaborator Author

Hm, unfortunately this does not seem to fix #215. It does get rid of the indefinite metric warnings though.

@stephankramer
Collaborator Author

Seems to actually have gotten worse! The timeout now happens consistently. Investigating...

Hangs occur in the python system exit when a final PETSc garbage
clean-up is underway. Using PETSc.garbage_view() to look at places
where things might go out of sync, the serialised checkpointing looks
like a potential culprit.

Add a barrier after the initial collective checkpoint write, because
it is not clear that it is safe for rank 0 to immediately reopen the
file for reading while other ranks may not yet have finished writing.

Also add an explicit gc.collect() after the diverging code path to
ensure consistency.
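The barrier-then-collect pattern in the commit message above can be sketched as follows. This is a hedged illustration, not the actual checkpointing code: `checkpointed_roundtrip`, `write_checkpoint`, and `read_checkpoint` are hypothetical names, and only the `comm.Barrier()`/`gc.collect()` sequencing reflects the fix.

```python
import gc

def checkpointed_roundtrip(comm, write_checkpoint, read_checkpoint):
    """Collectively write a checkpoint, then let rank 0 read it back.

    The barrier ensures every rank has finished writing before rank 0
    reopens the file; the gc.collect() after the rank-divergent branch
    forces garbage collection at a common point on all ranks, so PETSc
    object destruction stays in sync.
    """
    write_checkpoint()   # collective write on all ranks
    comm.Barrier()       # wait until every rank has finished writing...
    if comm.rank == 0:
        data = read_checkpoint()  # ...before rank 0 reads it back
    else:
        data = None
    gc.collect()         # collect at a common point on every rank
    return data
```

In real use `comm` would be an MPI communicator (e.g. an mpi4py `Comm`, whose `Barrier()` and `rank` this sketch assumes).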
@stephankramer
Collaborator Author

Right, this seems to have worked. Unfortunately this is potentially a voodoo fix.

As I described on #215, the hangs occur in the final PETSc garbage collection that is called on python system exit (so after pytest has already finished, with all tests passing). Looking at historical firedrake issues with this (https://github.com/firedrakeproject/firedrake/issues?q=is%3Aissue%20hang%20garbage), these failures can be notoriously hard and irregular to reproduce, as they depend on when and in what order the different processes decide to run garbage collection, which is very unpredictable and may depend on timing issues.

Indeed, trying to reproduce the hangs in the CI container, the issue never occurs when running any of the parallel tests individually. It only occurs after having already run all serial tests (which I suppose affects the cache) and then running all (or a substantial subset of) the parallel tests. In my tests it seemed the serialised mmg3d tests had to be included for the hang to occur. So it may well be a fluke that the changes to to_petsc_local_numbering in this PR appear to make the CI fail consistently (when running all tests in order), as opposed to 50% of the time on main. By the same token my "fix" to make the behaviour in the serialised checkpoints a bit more consistent - by adding a barrier and an explicit garbage collect call - might also be a fluke, and there could still be an issue somewhere in the parallel code paths.

I'll try to run the CI with this fix a few times and, if it now passes consistently, I propose to merge this with the fix as is. I still believe the changes to to_petsc_local_numbering are correct and needed for correct behaviour with parmmg3d. @joewallwork if you could have another look: I've addressed your previous review and expanded the numbering test to parallel.

@stephankramer
Copy link
Collaborator Author

CI has passed successfully four times in a row now, which is better than main.

Member

@joewallwork joewallwork left a comment


Nice, this is great @stephankramer! Thanks for your work on this.

All of my comments and suggestions are minor, mostly to do with wording etc and for consistency with the approach elsewhere in the package.

# Check that this is the case:
owned = f.dat.data_ro.flatten()
halos = f.dat.data_ro_with_halos[owned.size:].flatten()
np.testing.assert_equal(owned, np.arange(owned.size) + rank_fraction)
Member


Is there a reason you didn't use

Suggested change
np.testing.assert_equal(owned, np.arange(owned.size) + rank_fraction)
assert np.allclose(owned, np.arange(owned.size) + rank_fraction)

like we do elsewhere?

Collaborator Author


You get a much more informative message when the test fails. With np.allclose(...) the expression just evaluates to False and the failure just says AssertionError; with np.testing it prints the actual and desired arrays (or part of them if long) and some stats on the differences.

Happy to change to np.allclose for consistency - or here actually np.all(a == b) would be the equivalent of np.testing.assert_equal, which tests strictly - but np.testing is pretty standard in my view. Not a big fan of pytest.approx btw, it seems a little opaque.
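A small demonstration of the point above (made-up arrays, for illustration only): np.testing.assert_equal reports which elements differ and by how much, while a bare assert on np.allclose only raises a plain AssertionError.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 4])

# np.testing.assert_equal: the AssertionError message includes the
# mismatch count and both arrays.
try:
    np.testing.assert_equal(a, b)
except AssertionError as err:
    print(err)

# assert np.allclose(...): the failure carries no detail at all.
try:
    assert np.allclose(a, b)
except AssertionError as err:
    print(repr(err))  # just AssertionError()
```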

Member


Fair enough. Happy to stick with this then.

stephankramer and others added 3 commits March 2, 2026 11:18
Co-authored-by: Joe Wallwork <22053413+joewallwork@users.noreply.github.com>
Member

@joewallwork joewallwork left a comment


Thanks for the updates, Stephan. Bar one typo this is good to go!


Co-authored-by: Joe Wallwork <22053413+joewallwork@users.noreply.github.com>
@stephankramer
Collaborator Author

Ah, thanks for catching that - all addressed now.

@stephankramer stephankramer merged commit a9300ff into main Mar 2, 2026
1 check passed

Labels

bug Something isn't working


Development

Successfully merging this pull request may close these issues.

Occasional hangs in CI in parallel
Incorrect petsc reordering for parmmg interface
