
Conversation


milroy commented Sep 21, 2025

Problem summary and solution

Multiple occurrences of resource state divergence between core and sched have been reported on production systems when partial cancel is enabled and JGF is in use. A recent debugging session with @trws and @jameshcorbett on one such system revealed the cause: allocations spanning higher-level vertices (e.g., an allocation of 3 nodes spanning two racks) experience errors when reducing planner_multi spans during partial cancel.

The source of the problem is that accumulated counts of lower-level resources are removed from all higher-level resources associated with the job, regardless of whether those counts correspond to the number of job resources allocated in the subtree rooted at the vertex in question.

For example, in a rack-spanning job with 3 nodes (1 in rack0 and 2 in rack1), partially cancelling the 2 nodes in rack1 causes planner_multi to attempt to remove 2 nodes from the job span on rack0. The attempt results in an error which leaves the vertices in an inconsistent, unallocatable state. Specifically, vertices can retain job tags and spans.

In the course of investigating this problem, I discovered two more issues. First, the current rank-based removal (used prior to subgraph deletion) also accumulated counts without tracking which counts were associated with jobids or higher-level graph resources. That oversight can lead to even greater problems. Second, the JGF partial cancel logic was computing the number of resources to remove by querying at time 0 rather than at the start time of the span. Fixing the second issue simply entails getting the span's occupied resources with planner_span_resource_count at the span's start time.
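
To make the second issue concrete, here is a toy, self-contained illustration (hypothetical span bookkeeping, not the planner API): a span that starts after t=0 occupies nothing at t=0, so computing the removal count there yields 0 rather than the span's actual occupancy.

// Toy model of the time=0 bug; the real fix queries the span's occupied
// resources (planner_span_resource_count) at the span's start time.
#include <cstdint>
#include <iostream>
#include <vector>

struct span {
    int64_t start;  // span start time
    int64_t end;    // span end time
    int64_t count;  // resources occupied by the span
};

// Resources planned at time t across all spans.
static int64_t planned_at (const std::vector<span> &spans, int64_t t)
{
    int64_t total = 0;
    for (const auto &s : spans)
        if (s.start <= t && t < s.end)
            total += s.count;
    return total;
}

int main ()
{
    std::vector<span> spans = {{3600, 7200, 2}};  // the job's span starts at t=3600
    std::cout << "count at t=0: " << planned_at (spans, 0)  // 0: wrong removal count
              << ", count at span start: " << planned_at (spans, 3600)  // 2: correct
              << "\n";
    return 0;
}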

Fixing partial cancel for higher-level, vertex-spanning allocations requires accumulating counts per rank and jobid and propagating those counts to the rank subgraph root's ancestors in the resource graph. That approach obviates the need for a DFS cancellation traversal.
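
The sketch below models that approach with hypothetical types and names (it is not the traverser code): counts accumulated while cancelling a rank are subtracted only from the ancestors of that rank's subtree root, so in the example above rack0 is never asked to give up rack1's two nodes.

// Minimal model of per-rank count removal (hypothetical types; not flux-sched code).
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using type_counts = std::map<std::string, int64_t>;

struct ancestor {
    std::string name;
    type_counts planned;  // resources of the job planned at this vertex
};

// Subtract only the counts accumulated under a cancelled rank from the
// ancestors of that rank's subtree root.
static bool remove_rank_counts (std::vector<ancestor> &ancestors, const type_counts &rank_counts)
{
    for (auto &anc : ancestors) {
        for (const auto &[type, count] : rank_counts) {
            auto it = anc.planned.find (type);
            if (it == anc.planned.end () || it->second < count)
                return false;  // would reproduce the divergence error
            it->second -= count;
        }
    }
    return true;
}

int main ()
{
    // 3-node job: 1 node in rack0, 2 nodes in rack1; cancel the rank(s) in rack1.
    std::vector<ancestor> rack1_ancestors = {{"rack1", {{"node", 2}}},
                                             {"cluster0", {{"node", 3}}}};
    type_counts rank_counts{{"node", 2}};  // accumulated for the cancelled rank(s)
    std::cout << (remove_rank_counts (rack1_ancestors, rank_counts) ? "ok" : "error") << "\n";
    return 0;
}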

Remaining items for discussion

When implementing tests using different pruning filter configurations and jobspecs, I noticed that one configuration did not result in complete removal of all allocation vertex metadata: https://github.com/flux-framework/flux-sched/blob/992b8b18b7eb72be51ba72955f7aef1183b3494c/t/data/resource/expected/cancel/029.R.err. (Note the empty file: it means the partial cancel didn't remove all job metadata, and thus the follow-up full cancel didn't error.) Partial cancel fails to remove the job tag because this condition is met: https://github.com/flux-framework/flux-sched/blob/master/resource/traversers/dfu_impl_update.cpp#L427-L430. The trigger is a jobspec like this:

version: 9999
resources:
  - type: slot
    count: 3
    label: default
    with:
      - type: node
        count: 1

# a comment
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "app" ]
    slot: default
    count:
      per_slot: 1

The slot implicitly allocates children exclusively; however, the traverser does not accumulate their counts and set them in the agfilter upon allocation (i.e., dfu is empty):

if (subtree_plan && !dfu.empty ()) {

A follow-up full cancel fixes this issue, but it warrants further discussion since implicit slot exclusivity causes problems elsewhere.

The second remaining discussion point is the necessity for a sleep here: https://github.com/milroy/flux-sched/blob/992b8b18b7eb72be51ba72955f7aef1183b3494c/t/t2317-resource-shrink.t#L46
This wasn't necessary before, and I'd like a better way to deal with the asynchronous behavior.

milroy force-pushed the partial-cancel-fixes branch 2 times, most recently from 488d970 to 9771b91 on September 21, 2025 09:15

milroy commented Sep 21, 2025

Note that there's a lot of code churn because I updated the power.json JGF to include a correctly enumerated rank in each vertex below the rack level. That was necessary to enable cross-rack allocation testing.

milroy force-pushed the partial-cancel-fixes branch from 9771b91 to cb2248c on September 22, 2025 23:54

trws commented Sep 23, 2025

I'm going to work through the code shortly, but I'll start with the discussion points:

> The slot implicitly allocates children exclusively; however, the traverser does not accumulate their counts and set them in the agfilter upon allocation (i.e., dfu is empty):

I hadn't realized we were doing this, so an implicit exclusive allocation is just set at the root of the exclusive tree and we don't track the elements under it? I can see why for efficiency, but if we're removing all the markings how do we end up with things remaining tagged, or is it a case of cutting out too early so we still leave it on that root vertex?

> The second remaining discussion point is the necessity for a sleep here: https://github.com/milroy/flux-sched/blob/992b8b18b7eb72be51ba72955f7aef1183b3494c/t/t2317-resource-shrink.t#L46
> This wasn't necessary before, and I'd like a better way to deal with the asynchronous behavior.

The fact that this works makes me think it's a timeout or other delay-related issue. @garlick, is there a way we can wait for an event or similar to know that a node has been fully disconnected/removed? If there isn't something specifically for that, we might be able to have something set up a streaming notify RPC with resource and wait for the update from there after the disconnect rather than using a sleep? I think it would show up on that stream as a drain.


garlick commented Sep 23, 2025

@trws wrote

> The fact that this works makes me think it's a timeout or other delay-related issue. @garlick, is there a way we can wait for an event or similar to know that a node has been fully disconnected/removed?

Yeah there is a chain of asynchronous events there so the test is inherently racy. The flux overlay disconnect command returns immediately after sending a disconnect control message to the child instructing it to drop its connection. Then

  • broker wakes up to handle EHOSTUNREACH and marks peer down
  • groups module watching overlay wakes up and removes peer from broker.online list
  • resource module watching broker.online wakes up and changes the resource set

Although the test could call flux resource eventlog --wait offline or similar, I'd just add something simple like this that doesn't depend on the implementation as much (untested):

# Usage: wait_for_node_count N
wait_for_node_count() {
    retries=5
    while test $retries -ge 0; do
        test $(flux resource list -s all -no {nnodes}) -eq $1 && return 0
        retries=$(($retries-1))
        sleep 0.1
    done
    return 1
}


milroy commented Sep 24, 2025

Thanks for the early reviews and feedback @trws and @garlick!

I'll reply to the other comments later tonight, but will start here:

> I hadn't realized we were doing this, so an implicit exclusive allocation is just set at the root of the exclusive tree and we don't track the elements under it?

That's correct.

> I can see why for efficiency, but if we're removing all the markings how do we end up with things remaining tagged, or is it a case of cutting out too early so we still leave it on that root vertex?

This may be a bug. If the behavior were motivated by efficiency, I'd expect implicit and explicit exclusivity to behave the same. For partial cancel in the scenario above (implicit, exclusive node allocation), the reason all job tags associated with a jobid don't get removed is that (due to the implicit behavior) the job2span map never gets a (jobid, span) entry inserted here:

(*m_graph)[u].idata.job2span[jobmeta.jobid] = span;
Then during partial cancel, this condition gets met: https://github.com/flux-framework/flux-sched/blob/master/resource/traversers/dfu_impl_update.cpp#L427-L430, which ends up skipping the rem_tag.

We could adjust the behavior of implicit slot exclusivity, but at that point we might as well revisit whether we actually want children of slots to be exclusive, as I believe there were rabbit use cases for allowing non-exclusive slots: #1189 (comment). Alternatively, I can drop the !dfu.empty () condition here:

if (subtree_plan && !dfu.empty ()) {
That change allows all other CI tests to pass, but I'm not sure if there are side effects.
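
For reference, here is a small self-contained model of that guard's effect (hypothetical types; not the dfu_impl code): when the accumulated child counts (dfu) are empty, the guarded path never records a job2span entry for the jobid, which is what later satisfies the condition above and skips rem_tag during partial cancel.

// Model of the guard `if (subtree_plan && !dfu.empty ())` (hypothetical types;
// not flux-sched code).
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct vertex_meta {
    std::map<int64_t, int64_t> job2span;  // jobid -> span id
};

static void update_subtree_plan (vertex_meta &meta,
                                 const std::map<std::string, int64_t> &dfu,
                                 int64_t jobid, int64_t span,
                                 bool require_nonempty_dfu)
{
    // require_nonempty_dfu == true models keeping the !dfu.empty () condition;
    // false models dropping it.
    if (require_nonempty_dfu && dfu.empty ())
        return;                   // span never recorded; rem_tag skipped later
    meta.job2span[jobid] = span;  // partial cancel can find and remove the span
}

int main ()
{
    std::map<std::string, int64_t> empty_dfu;  // implicit slot exclusivity: no child counts
    vertex_meta guarded, unguarded;
    update_subtree_plan (guarded, empty_dfu, 42, 7, true);
    update_subtree_plan (unguarded, empty_dfu, 42, 7, false);
    std::cout << "guarded job2span entries: " << guarded.job2span.size ()
              << ", unguarded: " << unguarded.job2span.size () << "\n";
    return 0;
}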


milroy commented Sep 24, 2025

> Although the test could call flux resource eventlog --wait offline or similar, I'd just add something simple like this that doesn't depend on the implementation as much (untested):

wait_for_node_count works great, thanks @garlick!

milroy force-pushed the partial-cancel-fixes branch 2 times, most recently from c970f55 to 077df00 on September 24, 2025 08:22

@jameshcorbett jameshcorbett left a comment

I've given this several looks through, and it seems reasonable! I'm still familiarizing myself with the partial cancel logic though.


milroy commented Sep 29, 2025

I think I've addressed your edits and feedback @trws and @jameshcorbett. The PR is ready for another review.

Note that I added error outputs that are hard to test. I'm not sure if I can improve the patch code coverage a lot.

milroy force-pushed the partial-cancel-fixes branch 2 times, most recently from 5c08ca5 to 18811fd on September 29, 2025 07:36

milroy commented Oct 2, 2025

Thanks for the feedback, @jameshcorbett! I think I addressed all your comments. Can you take a quick look at the changes?

@trws, this is ready for a final review. Let me know if you have suggestions for improving the code coverage.

milroy force-pushed the partial-cancel-fixes branch from 9c0884c to b914bd5 on October 2, 2025 00:43

@jameshcorbett jameshcorbett left a comment

LGTM, one question about something not really directly related to the changes in this PR.

milroy added 5 commits October 2, 2025 15:30
Problem: reinitializing the traverser can result in errors and a
non-zero return code. The current shrink_resources function does not
return the reinitialization return code to the caller.

Assign and return the code.
Problem: the resource type index getter does not check whether the
iterator returned by find() is the end of the res_type hashed index,
possibly leading to undefined behavior.

Return the size of the hashed index if the resource type isn't found.
That allows the client to check if it's the end and thus not found.
Problem: planner_multi_span_planned_at currently can return negative
planned resources for planner subspans that have been deleted during a
prior partial cancel.

Check for the -1 sentinel value and return 0, indicating that there are
no resources planned at that time for the corresponding sub planner.
Problem: there is no unit test to check for correct behavior of
planner_multi_span_planned_at.

Add a test.
Problem: the traverser doesn't handle error values returned by
planner_multi_span_planned_at.

Add error handling to ensure `find` output is sensible and allocations
are modified as required.

trws commented Oct 7, 2025

This review makes me wish really, really hard we had a good way to do a stacked PRs workflow. There are several tiny really nice obvious fix commits in here we could have landed almost instantly. Anyway, I'm working on a full review over the next hour or so.

Comment on lines 812 to 811
- static subsystem_t containment_sub{"containment"};
+ const subsystem_t containment_sub{"containment"};
Member

This should be both const and static if possible. No reason to look it up every call, the storage is too small to worry about it.

Comment on lines +107 to +108
INFO: PARTIAL CANCEL PREORDER COUNT="0"
INFO: PARTIAL CANCEL POSTORDER COUNT="0"
Member

How is it that both of these come out as zero?

Member Author

Since partial cancel now just directly removes allocation counts from the parents, there's no traversal and the counts should be zero. I could drop the stats option in the tests to suppress the counts.

Member

Got it. We can do that but I'm not worried about it at the moment now that I know why it comes out this way.

milroy added 4 commits October 7, 2025 23:21
Problem: the current JGF partial cancel logic checks for the
available resources at time=0 rather than the resources used by the
span. This is a bug. Moreover, the error checking and handling is
incomplete.

Fix the bug in JGF partial removal by checking the available resources
at the start of the span, and add more thorough error handling.
Problems: multiple occurrences of resource state divergence between core
and sched have been reported on production systems with JGF when
partial cancel is enabled. A debugging session on one such system
revealed the cause: allocations spanning higher-level vertices (e.g., an
allocation of 3 nodes across two racks) cause errors when reducing
planner_multi spans. The source of the problem is that accumulated
counts of lower-level resources are removed from all higher-level
resources associated with the job. In a cross-rack job with 3 nodes,
partially cancelling 2 nodes under 1 rack will result in planner_multi
attempting to remove 2 nodes from a span on the rack with only 1 node
allocated to the job. The attempt results in an error which leaves the
vertices in an inconsistent, unallocatable state.

Furthermore, the current rank-based removal prior to subgraph deletion
also accumulated counts without tracking counts associated with jobids
or ranks. That oversight can lead to even more problems.

Correctly executing partial cancel requires tracking the root of the
subtree corresponding to a rank and the counts of resources accumulated
in the rank-rooted subtree.

Add tracking data to the reader base and JGF to enable correct partial
cancel behavior. In the traverser, accumulate resource counts to be
removed based on rank (and jobid where appropriate), and pass them to
the ancestor vertices in the resource graph. This approach obviates a
DFS job removal.
Problem: there is no reproducer for the cross-rack partial cancel bug
observed on production clusters using JGF.

Add the reproducer and test for both low and lonodex policies with
various aggregate filter configurations.
Problem: there are no tests for correct behavior of partial cancel with
cross-rack allocations observed on production clusters using JGF.

Add the reproducer and test for both low and lonodex policies, and
different specificities of resources with various aggregate filter
configurations.
milroy force-pushed the partial-cancel-fixes branch from aa9dc21 to d876a35 on October 8, 2025 06:21

milroy commented Oct 8, 2025

Thanks again for the suggestions @trws. I think I've addressed them all. This is ready for a hopefully final look.

@trws trws left a comment

This looks great, thanks for all the work @milroy!


milroy commented Oct 8, 2025

Noting for posterity that the changes here can be optimized by implementing a tree reduction. That implementation is more complex and can be done as needed in a follow-up PR.

Setting MWP.

milroy added the merge-when-passing label (mergify.io - merge PR automatically once CI passes) on Oct 8, 2025
mergify bot added the queued label on Oct 8, 2025
mergify bot merged commit 312e29f into flux-framework:master on Oct 8, 2025
42 of 44 checks passed
mergify bot removed the queued label on Oct 8, 2025

codecov bot commented Oct 8, 2025

Codecov Report

❌ Patch coverage is 57.48503% with 71 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.4%. Comparing base (9908d7d) to head (d876a35).
⚠️ Report is 11 commits behind head on master.

Files with missing lines Patch % Lines
resource/traversers/dfu_impl_update.cpp 61.5% 40 Missing ⚠️
resource/readers/resource_reader_jgf.cpp 37.5% 25 Missing ⚠️
resource/traversers/dfu_impl.cpp 42.8% 4 Missing ⚠️
resource/planner/c/planner_multi_c_interface.cpp 71.4% 2 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##           master   #1406   +/-   ##
======================================
  Coverage    77.3%   77.4%           
======================================
  Files         117     116    -1     
  Lines       16603   16691   +88     
======================================
+ Hits        12850   12920   +70     
- Misses       3753    3771   +18     
Files with missing lines Coverage Δ
resource/modules/resource_match.cpp 67.9% <100.0%> (ø)
resource/planner/c++/planner_multi.cpp 71.8% <100.0%> (+0.2%) ⬆️
resource/planner/test/planner_test02.cpp 100.0% <100.0%> (ø)
resource/readers/resource_reader_jgf.hpp 100.0% <100.0%> (ø)
resource/readers/resource_reader_rv1exec.cpp 72.0% <100.0%> (+0.1%) ⬆️
resource/planner/c/planner_multi_c_interface.cpp 63.5% <71.4%> (-0.2%) ⬇️
resource/traversers/dfu_impl.cpp 84.1% <42.8%> (-0.3%) ⬇️
resource/readers/resource_reader_jgf.cpp 68.5% <37.5%> (-0.1%) ⬇️
resource/traversers/dfu_impl_update.cpp 79.6% <61.5%> (-0.6%) ⬇️

... and 3 files with indirect coverage changes
