Provide parmmg with complete local vector including halos #228

stephankramer merged 9 commits into main
Conversation
joewallwork
left a comment
Thanks @stephankramer, I just have a suggestion and a small request, then we can merge this into #226. Once we do that, I'll open a PR to remove to_petsc_local_numbering from Firedrake.
We should try and come up with some useful unit tests but that can be done as a follow-up.
Force-pushed from c9fb638 to 1099564 (compare)
This changes to_petsc_local_numbering to take in a local vector (i.e. a sequential vector that includes halo DOFs) and return a local vector in the "natural ordering" of the DMPlex topological points, which is what the DMPlex ParMmg interface expects for the metric input vector. It should not change anything in serial, and appears to fix the warnings about an indefinite metric in the ParMmg tests.

Closes #181. Hopefully also fixes #215.
Force-pushed from 1099564 to 49a5595 (compare)
Hm, unfortunately this does not seem to fix #215.

Seems to have actually gotten worse! The timeout now seems to happen consistently. Investigating...
The hangs occur in the Python system exit, when a final PETSc garbage clean-up is underway. Using PETSc.garbage_view() to look at places where things could go out of sync, the serialised checkpointing looks like a potential culprit. I'm adding a barrier after the initial collective checkpoint write, because it's not clear to me that it is safe for rank 0 to immediately reopen the file for reading while other ranks may not yet have finished writing. I'm also adding an explicit gc.collect() after the diverging code path to ensure consistency.
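A minimal sketch of that synchronisation point, assuming an mpi4py-style communicator; the helper name is hypothetical, and in the real code this would sit right after the collective checkpoint write and after the diverging code path:

```python
import gc

def post_checkpoint_sync(comm=None):
    # Hypothetical helper: make all ranks rendezvous after a collective
    # checkpoint write, then force a garbage collection so that any
    # PETSc-managed temporaries are freed at the same point on every
    # rank, rather than whenever each rank's GC happens to run.
    if comm is not None:
        comm.Barrier()  # e.g. an mpi4py communicator
    return gc.collect()  # number of unreachable objects collected
```

The point of forcing gc.collect() on every rank at the same program point is that PETSc object destruction can involve collective operations, so ranks collecting at different times is exactly the kind of divergence that can deadlock at exit.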
Force-pushed from aed513c to ed0fe44 (compare)
Right, this seems to have worked. Unfortunately it is potentially a voodoo fix. As I described in #215, the hangs occur in the final PETSc garbage collection that is called on Python system exit (so after pytest has already finished, with all tests passing). Looking at historical Firedrake issues with this (https://github.com/firedrakeproject/firedrake/issues?q=is%3Aissue%20hang%20garbage), these failures can be notoriously hard and irregular to reproduce, as they depend on when, and in what order, the different processes decide to run garbage collection, which is very unpredictable and may depend on timing.

Indeed, trying to reproduce the hangs in the CI container, the issue never occurs when running any of the parallel tests individually. It only occurs after having already run all the serial tests (which I suppose affects the cache) and then running all (or a substantial subset) of the parallel tests. In my tests it seemed the serialised mmg3d test needed to be included for the hang to occur. So it may well be a fluke that the changes to to_petsc_local_numbering in this PR appear to make the CI fail consistently (when running all tests in order), as opposed to 50% of the time on main. By the same token, my "fix" to make the behaviour of the serialised checkpoints a bit more consistent - by adding a barrier and an explicit garbage collect call - might also be a fluke, and there could still be an issue somewhere in the parallel code paths.

I'll run the CI with this fix a few times and, if it now passes consistently, I propose to merge this with the fix as is. I still believe the changes to to_petsc_local_numbering are correct and needed for correct behaviour with parmmg3d. @joewallwork if you could have another look: I've addressed your previous review and expanded the numbering test to run in parallel.
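As a debugging aid for this kind of hang, the PETSc.garbage_view() inspection mentioned earlier can be wrapped like this; a sketch assuming a petsc4py recent enough (roughly 3.18+) to expose garbage_view, with a hypothetical helper name:

```python
def show_petsc_garbage():
    # Print the PETSc objects awaiting collective destruction on each
    # rank; ranks whose lists diverge are candidates for the
    # out-of-sync garbage clean-up that causes the hang at exit.
    # Assumes petsc4py exposes PETSc.garbage_view (petsc4py >= 3.18).
    try:
        from petsc4py import PETSc
    except ImportError:
        return False  # petsc4py not available in this environment
    PETSc.garbage_view()
    return True
```

Comparing this output across ranks just before system exit is one way to see whether some ranks still hold objects that others have already queued for collective destruction.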
CI has now passed four times in a row, which is better than main.
joewallwork
left a comment
Nice, this is great @stephankramer! Thanks for your work on this.
All of my comments and suggestions are minor, mostly to do with wording etc., and with consistency with the approach used elsewhere in the package.
    # Check that this is the case:
    owned = f.dat.data_ro.flatten()
    halos = f.dat.data_ro_with_halos[owned.size:].flatten()
    np.testing.assert_equal(owned, np.arange(owned.size) + rank_fraction)
Is there a reason you didn't use
    - np.testing.assert_equal(owned, np.arange(owned.size) + rank_fraction)
    + assert np.allclose(owned, np.arange(owned.size) + rank_fraction)
like we do elsewhere?
You get a much more informative message when the test fails. With np.allclose(...) the expression just evaluates to False and you just get a bare AssertionError; with np.testing it prints the actual and desired arrays (or part of them, if long) and some statistics on the differences.

Happy to change to np.allclose for consistency - or here, actually, np.all(a == b) would be the strict equivalent of np.testing.assert_equal - but np.testing is pretty standard in my view. Not a big fan of pytest.approx btw, it seems a little opaque.
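A quick standalone illustration of the difference, using a small toy array rather than the Function data from the test:

```python
import numpy as np

a = np.arange(5)
b = a.copy()
b[3] = 99  # introduce a single mismatching entry

# `assert np.allclose(a, b)` would raise a bare AssertionError with no
# detail. np.testing.assert_equal instead reports the mismatch count
# and prints both arrays:
try:
    np.testing.assert_equal(a, b)
except AssertionError as err:
    message = str(err)

print(message)
```

The printed message names both the "ACTUAL" and "DESIRED" arrays and summarises how many elements differ, which is what makes parallel test failures much easier to diagnose from CI logs.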
Fair enough. Happy to stick with this then.
Co-authored-by: Joe Wallwork <22053413+joewallwork@users.noreply.github.com>
joewallwork
left a comment
There was a problem hiding this comment.
Thanks for the updates, Stephan. Bar one typo this is good to go!
Ah, thanks for catching that - all addressed now.
Following on from #226
This changes to_petsc_local_numbering to take in a local vector (i.e. a sequential vector that includes halo DOFs) and return a local vector in the "natural ordering" of the DMPlex topological points, which is what the DMPlex ParMmg interface expects for the metric input vector. It should not change anything in serial.

Appears to fix the warnings about an indefinite metric in the ParMmg tests, which I think were caused by halo entries that were copied, out of bounds, from the global vector we were providing previously.
Closes #181
Hopefully also fixes #215
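The reordering this PR implements can be sketched as a plain permutation of the process-local (owned + halo) DOF vector. The permutation array plex_to_local below is hypothetical and stands in for whatever to_petsc_local_numbering derives from the DMPlex section; the helper name is likewise illustrative:

```python
import numpy as np

def to_plex_ordering(local_with_halos, plex_to_local):
    # local_with_halos: DOF values in the process-local ordering
    #   (owned DOFs first, then halo DOFs), as a NumPy array.
    # plex_to_local: hypothetical permutation; plex_to_local[p] is the
    #   local index of the DOF attached to DMPlex point p.
    # Returns the values in the "natural" DMPlex point ordering that
    # the DMPlex/ParMmg interface expects for the metric vector.
    return np.asarray(local_with_halos)[plex_to_local]

# Toy example: 3 owned + 2 halo DOFs and an arbitrary permutation.
vals = np.array([10.0, 11.0, 12.0, 20.0, 21.0])
perm = np.array([3, 0, 4, 1, 2])
reordered = to_plex_ordering(vals, perm)  # [20., 10., 21., 11., 12.]
```

The key point of the PR is that the input must be the full local vector: if only the owned (global) part is supplied, the halo slots of the permuted output are filled from out-of-bounds reads, which is consistent with the indefinite-metric warnings described above.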