Fix partial cancel for allocations that span higher-level vertices #1406
Conversation
Force-pushed from 488d970 to 9771b91.
Note that there's a lot of code churn because I updated the …
Force-pushed from 9771b91 to cb2248c.
I'm going to work through the code shortly, starting with the discussion points though:
I hadn't realized we were doing this, so an implicit exclusive allocation is just set at the root of the exclusive tree and we don't track the elements under it? I can see why for efficiency, but if we're removing all the markings how do we end up with things remaining tagged, or is it a case of cutting out too early so we still leave it on that root vertex?
The fact that that works makes me think it's a timeout or other delay-related issue. @garlick, is there a way we can wait for an event or similar to know that a node has been fully disconnected/removed? If there isn't something specifically for that, we might be able to have something set up a streaming notify RPC with resource and wait for the update from there after the disconnect rather than using a sleep? I think it would show up on that stream as a drain.
@trws wrote
Yeah, there is a chain of asynchronous events there, so the test is inherently racy. The …
Although the test could call a helper like the following:

```sh
# Usage: wait_for_node_count N
wait_for_node_count() {
    retries=5
    while test $retries -ge 0; do
        test $(flux resource list -s all -no {nnodes}) -eq $1 && return 0
        retries=$(($retries-1))
        sleep 0.1
    done
    return 1
}
```
Thanks for the early reviews and feedback @trws and @garlick! I'll reply to the other comments later tonight, but will start here:
That's correct.
This may be a bug. If the behavior were motivated by efficiency, I'd expect implicit and explicit exclusivity to behave the same. For partial cancel in the scenario above (implicit, exclusive node allocation), the reason all job tags associated with a jobid don't get removed is that (due to the implicit behavior) the `rem_tag` …

We could adjust the behavior of implicit …
Force-pushed from c970f55 to 077df00.
I've given this several looks through, and it seems reasonable! I'm still familiarizing myself with the partial cancel logic though.
Force-pushed from 077df00 to 923466a.
Force-pushed from 6b15c08 to ff0ed59.
I think I've addressed your edits and feedback @trws and @jameshcorbett. The PR is ready for another review. Note that I added error outputs that are hard to test. I'm not sure if I can improve the patch code coverage a lot.
Force-pushed from 5c08ca5 to 18811fd.
Thanks for the feedback, @jameshcorbett! I think I addressed all your comments. Can you take a quick look at the changes? @trws, this is ready for a final review. Let me know if you have suggestions for improving the code coverage.
Force-pushed from 9c0884c to b914bd5.
LGTM, one question about something not really directly related to the changes in this PR.
Problem: reinitializing the traverser can result in errors and a non-zero return code. The current shrink_resources function does not return the reinitialization return code to the caller. Assign and return the code.
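A minimal sketch of the fix this commit describes, using toy stand-in types rather than flux-sched's actual context and traverser classes (all names and signatures below are illustrative): the reinitialization code is assigned and returned instead of being dropped.

```cpp
#include <iostream>

// Toy stand-ins for the resource context and traverser (hypothetical;
// the real flux-sched classes and signatures differ).
struct toy_traverser {
    int init_rc = 0;                       // simulated reinitialization result
    int initialize () { return init_rc; }
};

struct toy_ctx {
    toy_traverser traverser;
};

// The point of the fix: capture the reinitialization code and hand it
// back to the caller rather than discarding it and reporting success.
int shrink_resources (toy_ctx &ctx)
{
    // ... subgraph removal would happen here ...
    int rc = ctx.traverser.initialize ();  // reinitialize over the shrunken graph
    return rc;                             // propagate any error upward
}

int main ()
{
    toy_ctx ctx;
    ctx.traverser.init_rc = -1;            // simulate a failed reinitialization
    std::cout << shrink_resources (ctx) << "\n";  // prints -1, not 0
}
```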
Problem: the resource type index getter does not check whether the iterator returned by find() is the end of the res_type hashed index, possibly leading to undefined behavior. Return the size of the hashed index if the resource type isn't found. That allows the client to check if it's the end and thus not found.
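A self-contained sketch of that getter pattern, using a toy index built on STL containers rather than the real hashed res_type index (names here are illustrative): find() misses are mapped to size(), so the caller can test for "not found" without touching an end iterator.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Toy model of a hashed resource-type index (illustrative only).
struct toy_type_index {
    std::vector<std::string> types;                  // position -> type name
    std::unordered_map<std::string, size_t> lookup;  // type name -> position

    // Return the position of `type`, or size() when it is absent, instead
    // of exposing an iterator that might be end().
    size_t index_of (const std::string &type) const
    {
        auto it = lookup.find (type);
        return (it == lookup.end ()) ? types.size () : it->second;
    }
};

int main ()
{
    toy_type_index idx;
    idx.types = {"node", "core"};
    idx.lookup = {{"node", 0}, {"core", 1}};

    size_t i = idx.index_of ("gpu");
    if (i == idx.types.size ())              // client-side end/not-found check
        std::cout << "gpu: not found\n";
}
```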
Problem: planner_multi_span_planned_at currently can return negative planned resources for planner subspans that have been deleted during a prior partial cancel. Check for the -1 sentinel value and return 0, indicating that there are no resources planned at that time for the corresponding sub planner.
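A toy sketch of the sentinel handling (the lookup below is a stand-in, not the planner C API): a sub-span deleted by an earlier partial cancel reports -1, which the getter converts to 0 planned resources.

```cpp
#include <cstdint>
#include <iostream>

// Stand-in for looking up a planner_multi sub-span's planned count
// (hypothetical; the real call lives in the planner C API). A sub-span
// deleted by a prior partial cancel is modeled with the -1 sentinel.
static int64_t lookup_planned (bool sub_span_deleted)
{
    return sub_span_deleted ? -1 : 2;   // e.g., 2 resources still planned
}

// The fix: map the -1 sentinel to 0, i.e., "no resources planned at that
// time for the corresponding sub-planner," rather than a negative count.
static int64_t planned_at_or_zero (bool sub_span_deleted)
{
    int64_t planned = lookup_planned (sub_span_deleted);
    return (planned == -1) ? 0 : planned;
}

int main ()
{
    std::cout << planned_at_or_zero (false) << "\n";  // 2
    std::cout << planned_at_or_zero (true) << "\n";   // 0, not -1
}
```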
Problem: there is no unit test to check for correct behavior of planner_multi_span_planned_at. Add a test.
Problem: the traverser doesn't handle error values returned by planner_multi_span_planned_at. Add error handling to ensure `find` output is sensible and allocations are modified as required.
Force-pushed from b914bd5 to aa9dc21.
This review makes me wish really, really hard we had a good way to do a stacked PRs workflow. There are several tiny really nice obvious fix commits in here we could have landed almost instantly. Anyway, I'm working on a full review over the next hour or so.
```diff
-static subsystem_t containment_sub{"containment"};
+const subsystem_t containment_sub{"containment"};
```
This should be both const and static if possible. No reason to look it up every call, the storage is too small to worry about it.
INFO: PARTIAL CANCEL PREORDER COUNT="0"
INFO: PARTIAL CANCEL POSTORDER COUNT="0"
How is it that both of these come out as zero?
Since partial cancel now just directly removes allocation counts from the parents, there's no traversal and the counts should be zero. I could drop the stats option in the tests to suppress the counts.
Got it. We can do that but I'm not worried about it at the moment now that I know why it comes out this way.
Problem: the current JGF partial cancel logic checks for the available resources at time=0 rather than the resources used by the span. This is a bug. Moreover, the error checking and handling is incomplete. Fix the bug in JGF partial removal by checking the available resources at the start of the span, and add more thorough error handling.
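A toy illustration of the accounting difference, with made-up numbers and one plausible shape of the miscount (the real fix consults the span's own resource count, e.g., via planner_span_resource_count, rather than availability at time 0):

```cpp
#include <cstdint>
#include <iostream>

int main ()
{
    // Made-up planner state for one JGF vertex: total capacity, what is
    // free at time=0, and what the span being cancelled actually occupies.
    int64_t total = 4;
    int64_t available_at_time0 = 1;   // other jobs hold 3 of the 4
    int64_t span_count = 2;           // the cancelled job's span holds 2

    // Buggy accounting: deriving the removal from availability at time=0
    // mixes in every job on the vertex, not just the cancelled span.
    int64_t removed_buggy = total - available_at_time0;   // 3 (wrong)

    // Fixed accounting: remove exactly what the span occupied, i.e., what
    // planner_span_resource_count reports for that span.
    int64_t removed_fixed = span_count;                    // 2 (right)

    std::cout << removed_buggy << " vs " << removed_fixed << "\n";
}
```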
Problems: multiple occurrences of resource state divergence between core and sched have been reported on production systems with JGF when partial cancel is enabled. A debugging session on one such system revealed the cause: allocations spanning higher-level vertices (e.g., an allocation of 3 nodes across two racks) cause errors when reducing planner_multi spans. The source of the problem is that accumulated counts of lower-level resources are removed from all higher-level resources associated with the job. In a cross-rack job with 3 nodes, partially cancelling 2 nodes under 1 rack will result in planner_multi attempting to remove 2 nodes from a span on the rack with only 1 node allocated to the job. The attempt results in an error which leaves the vertices in an inconsistent, unallocatable state.

Furthermore, the current rank-based removal prior to subgraph deletion also accumulates counts without tracking the counts associated with jobids or ranks. That oversight can lead to even more problems.

Correctly executing partial cancel requires tracking the root of the subtree corresponding to a rank and the counts of resources accumulated in the rank-rooted subtree. Add tracking data to the reader base and JGF to enable correct partial cancel behavior. In the traverser, accumulate resource counts to be removed based on rank (and jobid where appropriate), and pass them to the ancestor vertices in the resource graph. This approach obviates a DFS job removal.
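A rough sketch of the per-rank accumulation idea in plain STL containers (the container shapes, names, and the 4-cores-per-node figure are illustrative, not the PR's actual data structures): counts are keyed by rank and resource type, and an ancestor vertex is decremented only by the counts originating in the rank-rooted subtrees beneath it.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// rank -> (resource type -> count removed from that rank's subtree)
using rank_counts_t = std::map<int64_t, std::map<std::string, int64_t>>;

int main ()
{
    // Cross-rack job with 3 nodes: 1 under rack0 (rank 0) and 2 under
    // rack1 (ranks 1 and 2). A partial cancel releases the rack1 nodes.
    rank_counts_t removed = {
        {1, {{"core", 4}, {"node", 1}}},
        {2, {{"core", 4}, {"node", 1}}},
    };

    // Ancestor update for rack1 only: sum the counts of the ranks in its
    // subtree. rack0's spans are left untouched, instead of having the
    // job's full accumulated total subtracted from them as before.
    std::map<std::string, int64_t> rack1_delta;
    for (const auto &[rank, by_type] : removed)
        for (const auto &[type, count] : by_type)
            rack1_delta[type] += count;

    for (const auto &[type, count] : rack1_delta)
        std::cout << "rack1 -= " << count << " " << type << "\n";
}
```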
Problem: there is no reproducer for the cross-rack partial cancel bug observed on production clusters using JGF. Add the reproducer and test for both low and lonodex policies with various aggregate filter configurations.
Problem: there are no tests for correct behavior of partial cancel with cross-rack allocations observed on production clusters using JGF. Add the reproducer and tests for both low and lonodex policies and different resource specificities, with various aggregate filter configurations.
Force-pushed from aa9dc21 to d876a35.
Thanks again for the suggestions @trws. I think I've addressed them all. This is ready for a hopefully final look.
This looks great, thanks for all the work @milroy!
Noting for posterity that the changes here can be optimized by implementing a tree reduction. That implementation is more complex and can be done as needed in a follow-up PR. Setting MWP.
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## master #1406 +/- ##
======================================
Coverage 77.3% 77.4%
======================================
Files 117 116 -1
Lines 16603 16691 +88
======================================
+ Hits 12850 12920 +70
- Misses 3753 3771 +18
Problem summary and solution
Multiple occurrences of resource state divergence between core and sched have been reported on production systems when partial cancel is enabled and JGF is in use. A recent debugging session with @trws and @jameshcorbett on one such system revealed the cause: allocations spanning higher-level vertices (e.g., an allocation of 3 nodes spanning two racks) experience errors when reducing `planner_multi` spans during partial cancel. The source of the problem is that accumulated counts of lower-level resources are removed from all higher-level resources associated with the job, regardless of whether those counts correspond to the number of job resources allocated in the subtree rooted at the vertex in question.

For example, in a rack-spanning job with 3 nodes (1 in `rack0` and 2 in `rack1`), partially cancelling the 2 nodes in `rack1` causes `planner_multi` to attempt to remove 2 nodes from the job span on `rack0`. The attempt results in an error which leaves the vertices in an inconsistent, unallocatable state. Specifically, vertices can retain job tags and spans.

In the course of investigating this problem, I discovered two more issues. First, the current rank-based removal (used prior to subgraph deletion) also accumulated counts without tracking the counts associated with jobids or higher-level graph resources. That oversight can lead to even greater problems. Second, the JGF partial cancel logic was computing the number of resources removed by referencing an `at` time of 0 rather than the start time of the span. Fixing the second issue simply entails getting the span's occupied resources with `planner_span_resource_count`.

Fixing partial cancel for higher-level, vertex-spanning allocations requires accumulating counts per rank and jobid and propagating those counts to the rank subgraph root's ancestors in the resource graph. That approach obviates a DFS cancellation.
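To make the rack0/rack1 example above concrete, here is a small sketch of the span arithmetic (illustrative numbers only, not the PR's code): subtracting the job-wide accumulated count from every rack drives rack0 negative, while per-rank accumulation removes only what each rack's subtree actually held.

```cpp
#include <cstdint>
#include <iostream>

int main ()
{
    // Rack-spanning job: 1 node allocated under rack0, 2 under rack1.
    int64_t rack0_span = 1, rack1_span = 2;

    // The partial cancel releases the 2 rack1 nodes.
    int64_t accumulated = 2;   // job-wide accumulated removal count
    int64_t under_rack0 = 0;   // released nodes rooted under rack0
    int64_t under_rack1 = 2;   // released nodes rooted under rack1

    // Previous behavior: the same accumulated count hits every rack span.
    std::cout << "rack0 (old): " << rack0_span - accumulated << "\n";  // -1 -> planner error
    std::cout << "rack1 (old): " << rack1_span - accumulated << "\n";  //  0

    // Fixed behavior: each rack is reduced by its own subtree's counts.
    std::cout << "rack0 (new): " << rack0_span - under_rack0 << "\n";  // 1, still allocated
    std::cout << "rack1 (new): " << rack1_span - under_rack1 << "\n";  // 0, fully released
}
```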
Remaining items for discussion
When implementing tests using different pruning filter configurations and jobspecs, I noticed that one configuration did not result in a complete removal of all allocation vertex metadata: https://github.com/flux-framework/flux-sched/blob/992b8b18b7eb72be51ba72955f7aef1183b3494c/t/data/resource/expected/cancel/029.R.err. (Note the empty file: it means the partial cancel didn't remove all job metadata and thus the follow-up full cancel didn't error.) This is caused by partial cancel not removing the job tag due to this condition being met: https://github.com/flux-framework/flux-sched/blob/master/resource/traversers/dfu_impl_update.cpp#L427-L430. The cause is a jobspec like this:
The `slot` implicitly allocates children exclusively; however, the traverser does not accumulate their counts and set them in the agfilter upon allocation (i.e., `dfu` is empty): flux-sched/resource/traversers/dfu_impl_update.cpp, line 75 in 8e1718d.

A follow-up full cancel fixes this issue, but it warrants further discussion since implicit slot exclusivity causes problems elsewhere.
The second remaining discussion point is the necessity for a `sleep` here: https://github.com/milroy/flux-sched/blob/992b8b18b7eb72be51ba72955f7aef1183b3494c/t/t2317-resource-shrink.t#L46

This wasn't necessary before, and I'd like a better way to deal with the asynchronous behavior.