
Conversation


milroy commented Sep 21, 2025

Problem summary and solution

Multiple occurrences of resource state divergence between core and sched have been reported on production systems when partial cancel is enabled and JGF is in use. A recent debugging session with @trws and @jameshcorbett on one such system revealed the cause: allocations spanning higher-level vertices (e.g., an allocation of 3 nodes spanning two racks) experience errors when reducing planner_multi spans during partial cancel.

The source of the problem is that accumulated counts of lower-level resources are removed from all higher-level resources associated with the job, regardless of whether those counts correspond to the number of job resources allocated in the subtree rooted at the vertex in question.

For example, in a rack-spanning job with 3 nodes (1 in rack0 and 2 in rack1), partially cancelling the 2 nodes in rack1 causes planner_multi to attempt to remove 2 nodes from the job span on rack0. The attempt results in an error which leaves the vertices in an inconsistent, unallocatable state. Specifically, vertices can retain job tags and spans.

In the course of investigating this problem, I discovered two more issues. First, the current rank-based removal (used prior to subgraph deletion) also accumulated counts without tracking which counts were associated with jobids or higher-level graph resources. That oversight can lead to even greater problems. Second, the JGF partial cancel logic was computing the number of resources to remove by querying at time 0 rather than at the start time of the span. Fixing the second issue simply entails getting the span's occupied resources with planner_span_resource_count at the span's start time.
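
To make the second issue concrete, here is a toy, self-contained illustration (hypothetical span bookkeeping, not the planner API): a span that starts after t=0 occupies nothing at t=0, so computing the removal count there yields 0 rather than the span's actual occupancy.

// Toy model of the time=0 bug; the real fix queries the span's occupied
// resources (planner_span_resource_count) at the span's start time.
#include <cstdint>
#include <iostream>
#include <vector>

struct span {
    int64_t start;  // span start time
    int64_t end;    // span end time
    int64_t count;  // resources occupied by the span
};

// Resources planned at time t across all spans.
static int64_t planned_at (const std::vector<span> &spans, int64_t t)
{
    int64_t total = 0;
    for (const auto &s : spans)
        if (s.start <= t && t < s.end)
            total += s.count;
    return total;
}

int main ()
{
    std::vector<span> spans = {{3600, 7200, 2}};  // the job's span starts at t=3600
    std::cout << "count at t=0: " << planned_at (spans, 0)  // 0: wrong removal count
              << ", count at span start: " << planned_at (spans, 3600)  // 2: correct
              << "\n";
    return 0;
}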

Fixing partial cancel for higher-level, vertex-spanning allocations requires accumulating counts per rank and jobid and propagating those counts to the rank subgraph root's ancestors in the resource graph. That approach obviates the need for a DFS cancellation traversal.
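
The sketch below models that approach with hypothetical types and names (it is not the traverser code): counts accumulated while cancelling a rank are subtracted only from the ancestors of that rank's subtree root, so in the example above rack0 is never asked to give up rack1's two nodes.

// Minimal model of per-rank count removal (hypothetical types; not flux-sched code).
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using type_counts = std::map<std::string, int64_t>;

struct ancestor {
    std::string name;
    type_counts planned;  // resources of the job planned at this vertex
};

// Subtract only the counts accumulated under a cancelled rank from the
// ancestors of that rank's subtree root.
static bool remove_rank_counts (std::vector<ancestor> &ancestors, const type_counts &rank_counts)
{
    for (auto &anc : ancestors) {
        for (const auto &[type, count] : rank_counts) {
            auto it = anc.planned.find (type);
            if (it == anc.planned.end () || it->second < count)
                return false;  // would reproduce the divergence error
            it->second -= count;
        }
    }
    return true;
}

int main ()
{
    // 3-node job: 1 node in rack0, 2 nodes in rack1; cancel the rank(s) in rack1.
    std::vector<ancestor> rack1_ancestors = {{"rack1", {{"node", 2}}},
                                             {"cluster0", {{"node", 3}}}};
    type_counts rank_counts{{"node", 2}};  // accumulated for the cancelled rank(s)
    std::cout << (remove_rank_counts (rack1_ancestors, rank_counts) ? "ok" : "error") << "\n";
    return 0;
}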

Remaining items for discussion

When implementing tests using different pruning filter configurations and jobspecs, I noticed that one configuration did not result in complete removal of all allocation vertex metadata: https://github.com/flux-framework/flux-sched/blob/992b8b18b7eb72be51ba72955f7aef1183b3494c/t/data/resource/expected/cancel/029.R.err. (Note the empty file: it means the partial cancel didn't remove all job metadata, and thus the follow-up full cancel didn't error.) Partial cancel fails to remove the job tag because this condition is met: https://github.com/flux-framework/flux-sched/blob/master/resource/traversers/dfu_impl_update.cpp#L427-L430. The trigger is a jobspec like this:

version: 9999
resources:
  - type: slot
    count: 3
    label: default
    with:
      - type: node
        count: 1

# a comment
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "app" ]
    slot: default
    count:
      per_slot: 1

The slot implicitly allocates children exclusively; however, the traverser does not accumulate their counts and set them in the agfilter upon allocation (i.e., dfu is empty):

if (subtree_plan && !dfu.empty ()) {

A follow-up full cancel fixes this issue, but it warrants further discussion since implicit slot exclusivity causes problems elsewhere.

The second remaining discussion point is the necessity for a sleep here: https://github.com/milroy/flux-sched/blob/992b8b18b7eb72be51ba72955f7aef1183b3494c/t/t2317-resource-shrink.t#L46
This wasn't necessary before, and I'd like a better way to deal with the asynchronous behavior.

milroy force-pushed the partial-cancel-fixes branch 2 times, most recently from 488d970 to 9771b91 on September 21, 2025 09:15

milroy commented Sep 21, 2025

Note that there's a lot of code churn because I updated the power.json JGF to include a correctly enumerated rank in each vertex below the rack level. That was necessary to enable cross-rack allocation testing.

milroy force-pushed the partial-cancel-fixes branch from 9771b91 to cb2248c on September 22, 2025 23:54

trws commented Sep 23, 2025

I'm going to work through the code shortly, but I'll start with the discussion points:

> The slot implicitly allocates children exclusively; however, the traverser does not accumulate their counts and set them in the agfilter upon allocation (i.e., dfu is empty):

I hadn't realized we were doing this, so an implicit exclusive allocation is just set at the root of the exclusive tree and we don't track the elements under it? I can see why for efficiency, but if we're removing all the markings how do we end up with things remaining tagged, or is it a case of cutting out too early so we still leave it on that root vertex?

> The second remaining discussion point is the necessity for a sleep here: https://github.com/milroy/flux-sched/blob/992b8b18b7eb72be51ba72955f7aef1183b3494c/t/t2317-resource-shrink.t#L46
> This wasn't necessary before, and I'd like a better way to deal with the asynchronous behavior.

The fact that this works makes me think it's a timeout or other delay-related issue. @garlick, is there a way we can wait for an event or similar to know that a node has been fully disconnected/removed? If there isn't something specifically for that, we might be able to have something set up a streaming notify RPC with resource and wait for the update from there after the disconnect rather than using a sleep? I think it would show up on that stream as a drain.


garlick commented Sep 23, 2025

@trws wrote

> The fact that this works makes me think it's a timeout or other delay-related issue. @garlick, is there a way we can wait for an event or similar to know that a node has been fully disconnected/removed?

Yeah there is a chain of asynchronous events there so the test is inherently racy. The flux overlay disconnect command returns immediately after sending a disconnect control message to the child instructing it to drop its connection. Then

  • broker wakes up to handle EHOSTUNREACH and marks peer down
  • groups module watching overlay wakes up and removes peer from broker.online list
  • resource module watching broker.online wakes up and changes the resource set

Although the test could call flux resource eventlog --wait offline or similar, I'd just add something simple like this that doesn't depend on the implementation as much (untested):

# Usage: wait_for_node_count N
wait_for_node_count() {
    retries=5
    while test $retries -ge 0; do
        test $(flux resource list -s all -no {nnodes}) -eq $1 && return 0
        retries=$(($retries-1))
        sleep 0.1
    done
    return 1
}


milroy commented Sep 24, 2025

Thanks for the early reviews and feedback @trws and @garlick!

I'll reply to the other comments later tonight, but will start here:

> I hadn't realized we were doing this, so an implicit exclusive allocation is just set at the root of the exclusive tree and we don't track the elements under it?

That's correct.

> I can see why for efficiency, but if we're removing all the markings how do we end up with things remaining tagged, or is it a case of cutting out too early so we still leave it on that root vertex?

This may be a bug. If the behavior were motivated by efficiency, I'd expect implicit and explicit exclusivity to behave the same. For partial cancel in the scenario above (implicit, exclusive node allocation), the reason all job tags associated with a jobid don't get removed is that (due to the implicit behavior) the job2span map never gets a (jobid, span) entry inserted here:

(*m_graph)[u].idata.job2span[jobmeta.jobid] = span;
Then during partial cancel, this condition gets met: https://github.com/flux-framework/flux-sched/blob/master/resource/traversers/dfu_impl_update.cpp#L427-L430, which ends up skipping the rem_tag.

We could adjust the behavior of implicit slot exclusivity, but at that point we might as well revisit whether we actually want children of slots to be exclusive, as I believe there were rabbit use cases for allowing non-exclusive slots: #1189 (comment). Alternatively, I can drop the !dfu.empty () condition here:

if (subtree_plan && !dfu.empty ()) {
That change allows all other CI tests to pass, but I'm not sure if there are side effects.
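
For reference, here is a small self-contained model of that guard's effect (hypothetical types; not the dfu_impl code): when the accumulated child counts (dfu) are empty, the guarded path never records a job2span entry for the jobid, which is what later satisfies the condition above and skips rem_tag during partial cancel.

// Model of the guard `if (subtree_plan && !dfu.empty ())` (hypothetical types;
// not flux-sched code).
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct vertex_meta {
    std::map<int64_t, int64_t> job2span;  // jobid -> span id
};

static void update_subtree_plan (vertex_meta &meta,
                                 const std::map<std::string, int64_t> &dfu,
                                 int64_t jobid, int64_t span,
                                 bool require_nonempty_dfu)
{
    // require_nonempty_dfu == true models keeping the !dfu.empty () condition;
    // false models dropping it.
    if (require_nonempty_dfu && dfu.empty ())
        return;                   // span never recorded; rem_tag skipped later
    meta.job2span[jobid] = span;  // partial cancel can find and remove the span
}

int main ()
{
    std::map<std::string, int64_t> empty_dfu;  // implicit slot exclusivity: no child counts
    vertex_meta guarded, unguarded;
    update_subtree_plan (guarded, empty_dfu, 42, 7, true);
    update_subtree_plan (unguarded, empty_dfu, 42, 7, false);
    std::cout << "guarded job2span entries: " << guarded.job2span.size ()
              << ", unguarded: " << unguarded.job2span.size () << "\n";
    return 0;
}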


milroy commented Sep 24, 2025

> Although the test could call flux resource eventlog --wait offline or similar, I'd just add something simple like this that doesn't depend on the implementation as much (untested):

wait_for_node_count works great, thanks @garlick!

milroy force-pushed the partial-cancel-fixes branch 2 times, most recently from c970f55 to 077df00 on September 24, 2025 08:22

@jameshcorbett jameshcorbett left a comment

I've given this several looks through, and it seems reasonable! I'm still familiarizing myself with the partial cancel logic though.


milroy commented Sep 29, 2025

I think I've addressed your edits and feedback @trws and @jameshcorbett. The PR is ready for another review.

Note that I added error outputs that are hard to test. I'm not sure if I can improve the patch code coverage a lot.

milroy force-pushed the partial-cancel-fixes branch 2 times, most recently from 5c08ca5 to 18811fd on September 29, 2025 07:36

milroy commented Oct 2, 2025

Thanks for the feedback, @jameshcorbett! I think I addressed all your comments. Can you take a quick look at the changes?

@trws, this is ready for a final review. Let me know if you have suggestions for improving the code coverage.

milroy force-pushed the partial-cancel-fixes branch from 9c0884c to b914bd5 on October 2, 2025 00:43

@jameshcorbett jameshcorbett left a comment

LGTM, one question about something not really directly related to the changes in this PR.

milroy added 5 commits October 2, 2025 15:30
Problem: reinitializing the traverser can result in errors and a
non-zero return code. The current shrink_resources function does not
return the reinitialization return code to the caller.

Assign and return the code.
Problem: the resource type index getter does not check whether the
iterator returned by find() is the end of the res_type hashed index,
possibly leading to undefined behavior.

Return the size of the hashed index if the resource type isn't found.
That allows the client to check if it's the end and thus not found.
Problem: planner_multi_span_planned_at currently can return negative
planned resources for planner subspans that have been deleted during a
prior partial cancel.

Check for the -1 sentinel value and return 0, indicating that there are
no resources planned at that time for the corresponding sub planner.
Problem: there is no unit test to check for correct behavior of
planner_multi_span_planned_at.

Add a test.
Problem: the traverser doesn't handle error values returned by
planner_multi_span_planned_at.

Add error handling to ensure `find` output is sensible and allocations
are modified as required.

trws commented Oct 7, 2025

This review makes me wish really, really hard we had a good way to do a stacked PRs workflow. There are several tiny really nice obvious fix commits in here we could have landed almost instantly. Anyway, I'm working on a full review over the next hour or so.

Comment on lines 812 to 811
- static subsystem_t containment_sub{"containment"};
+ const subsystem_t containment_sub{"containment"};
Member

This should be both const and static if possible. No reason to look it up every call, the storage is too small to worry about it.

Comment on lines +107 to +108
INFO: PARTIAL CANCEL PREORDER COUNT="0"
INFO: PARTIAL CANCEL POSTORDER COUNT="0"
Member

How is it that both of these come out as zero?

Member Author

Since partial cancel now just directly removes allocation counts from the parents, there's no traversal and the counts should be zero. I could drop the stats option in the tests to suppress the counts.

Member

Got it. We can do that but I'm not worried about it at the moment now that I know why it comes out this way.

milroy added 4 commits October 7, 2025 23:21
Problem: the current JGF partial cancel logic checks for the
available resources at time=0 rather than the resources used by the
span. This is a bug. Moreover, the error checking and handling is
incomplete.

Fix the bug in JGF partial removal by checking the available resources
at the start of the span, and add more thorough error handling.
Problems: multiple occurrences of resource state divergence between core
and sched have been reported on production systems with JGF when
partial cancel is enabled. A debugging session on one such system
revealed the cause: allocations spanning higher-level vertices (e.g., an
allocation of 3 nodes across two racks) cause errors when reducing
planner_multi spans. The source of the problem is that accumulated
counts of lower-level resources are removed from all higher-level
resources associated with the job. In a cross-rack job with 3 nodes,
partially cancelling 2 nodes under 1 rack will result in planner_multi
attempting to remove 2 nodes from a span on the rack with only 1 node
allocated to the job. The attempt results in an error which leaves the
vertices in an inconsistent, unallocatable state.

Furthermore, the current rank-based removal prior to subgraph deletion
also accumulated counts without tracking counts associated with jobids
or ranks. That oversight can lead to even more problems.

Correctly executing partial cancel requires tracking the root of the
subtree corresponding to a rank and the counts of resources accumulated
in the rank-rooted subtree.

Add tracking data to the reader base and JGF to enable correct partial
cancel behavior. In the traverser, accumulate resource counts to be
removed based on rank (and jobid where appropriate), and pass them to
the ancestor vertices in the resource graph. This approach obviates a
DFS job removal.
Problem: there is no reproducer for the cross-rack partial cancel bug
observed on production clusters using JGF.

Add the reproducer and test for both low and lonodex policies with
various aggregate filter configurations.
Problem: there are no tests for correct behavior of partial cancel with
cross-rack allocations observed on production clusters using JGF.

Add the reproducer and test for both low and lonodex policies, and
different specificities of resources with various aggregate filter
configurations.
milroy force-pushed the partial-cancel-fixes branch from aa9dc21 to d876a35 on October 8, 2025 06:21

milroy commented Oct 8, 2025

Thanks again for the suggestions @trws. I think I've addressed them all. This is ready for a hopefully final look.

@trws trws left a comment

This looks great, thanks for all the work @milroy!


milroy commented Oct 8, 2025

Noting for posterity that the changes here can be optimized by implementing a tree reduction. That implementation is more complex and can be done as needed in a follow-up PR.

Setting MWP.

milroy added the merge-when-passing label (mergify.io - merge PR automatically once CI passes) on Oct 8, 2025
mergify bot added the queued label on Oct 8, 2025
mergify bot merged commit 312e29f into flux-framework:master on Oct 8, 2025
42 of 44 checks passed
mergify bot removed the queued label on Oct 8, 2025

codecov bot commented Oct 8, 2025

Codecov Report

❌ Patch coverage is 57.48503% with 71 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.4%. Comparing base (9908d7d) to head (d876a35).
⚠️ Report is 11 commits behind head on master.

Files with missing lines Patch % Lines
resource/traversers/dfu_impl_update.cpp 61.5% 40 Missing ⚠️
resource/readers/resource_reader_jgf.cpp 37.5% 25 Missing ⚠️
resource/traversers/dfu_impl.cpp 42.8% 4 Missing ⚠️
resource/planner/c/planner_multi_c_interface.cpp 71.4% 2 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##           master   #1406   +/-   ##
======================================
  Coverage    77.3%   77.4%           
======================================
  Files         117     116    -1     
  Lines       16603   16691   +88     
======================================
+ Hits        12850   12920   +70     
- Misses       3753    3771   +18     
Files with missing lines Coverage Δ
resource/modules/resource_match.cpp 67.9% <100.0%> (ø)
resource/planner/c++/planner_multi.cpp 71.8% <100.0%> (+0.2%) ⬆️
resource/planner/test/planner_test02.cpp 100.0% <100.0%> (ø)
resource/readers/resource_reader_jgf.hpp 100.0% <100.0%> (ø)
resource/readers/resource_reader_rv1exec.cpp 72.0% <100.0%> (+0.1%) ⬆️
resource/planner/c/planner_multi_c_interface.cpp 63.5% <71.4%> (-0.2%) ⬇️
resource/traversers/dfu_impl.cpp 84.1% <42.8%> (-0.3%) ⬇️
resource/readers/resource_reader_jgf.cpp 68.5% <37.5%> (-0.1%) ⬇️
resource/traversers/dfu_impl_update.cpp 79.6% <61.5%> (-0.6%) ⬇️

... and 3 files with indirect coverage changes
