Skip to content

Conversation

milroy
Copy link
Member

@milroy milroy commented Feb 14, 2025

This PR adds grow_allocate functionality, which attempts to allocate new resources specified by a jobspec to an existing jobid. The PR adds necessary functionality to the traverser, resource module, and REAPI, and adds testsuite tests for correct behavior.

This PR is WIP pending client testing in FluxQueue, Fluence, and additional testsuite tests.

Copy link
Member

@vsoch vsoch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🍔

goto error;
}
} else {
if (strcmp ("grow_allocate", cmd) == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to solidify my understanding here:

  1. When there is a call to the match request callback, the only condition (match op) this is allowed for is grow_allocate
  2. In the cast that the jobid exists but it's not grow_allocate, that is an error case. Only that one supports an existing id.
  3. In the case we don't have a job id and it is grow_allocate that's another error because we require one!
  4. Shrinking an allocation is done via partial cancel (thus not in this function)

One case I'm interested to test is if it's possible to shrink down to 0 and then go back up, and also what happens if you try to grow or shrink in unreasonable ways (where those ways are TBA!).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right on all four items.

One case I'm interested to test is if it's possible to shrink down to 0 and then go back up

It should be possible to do that in sched but not in core.

MATCH_ALLOCATE,
MATCH_ALLOCATE_W_SATISFIABILITY,
MATCH_ALLOCATE_ORELSE_RESERVE,
MATCH_GROW_ALLOCATION,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion for naming consistency:

Suggested change
MATCH_GROW_ALLOCATION,
MATCH_GROW_ALLOCATE,

I shouldn't be reviewing this early, I've read that word less than 10 times and am getting semantic satiation. "ALLOCATE" looks like some kind of fruit name to me right now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I waffled on this and am back to preferring MATCH_GROW_ALLOCATION. My current thinking is that the pattern is verb_verb_{noun/verb/conjunction}, and GROW_ALLOCATION is clearer.

// only the initial time matters for allocate
const int64_t target_time = op == match_op_t::MATCH_ALLOCATE ? meta.at : graph_end - 1;
const int64_t target_time =
(op == match_op_t::MATCH_ALLOCATE || op == match_op_t::MATCH_GROW_ALLOCATION)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this syntax say "if the statement is true (if it's allocate or grow) default to the meta.at, otherwise the graph end -1 (or is it the other way around)? I know this syntax exists in JavaScript and I always get it wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if that is the case, it's saying that if we grow or match allocate, we want the time to be the job metadata "at" which I assume is the time matched at? Just to check - for grow do we have a new at time for "at" set here in the job metadata (otherwise it would be the previous at associated with the jobid when the first allocation was done)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for grow do we have a new at time for "at" set here

That's correct, the at time is relative to when the job request (grow or allocate) is submitted.

// no chance, don't even try
if (op == match_op_t::MATCH_ALLOCATE_ORELSE_RESERVE || op == match_op_t::MATCH_ALLOCATE) {
if (op == match_op_t::MATCH_ALLOCATE_ORELSE_RESERVE || op == match_op_t::MATCH_ALLOCATE
|| op == match_op_t::MATCH_GROW_ALLOCATION) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EEEEEEEEEEEEBUSY continues doing nodey things ⌨️

break;
}
case match_op_t::MATCH_ALLOCATE:
case match_op_t::MATCH_GROW_ALLOCATION:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand this syntax - does this mean do nothing for match allocate, or that match allocate is the same as match grow allocate? I would guess the latter since one line was just added (so we want EBUSY for both cases).

In that case, this is saying if no resources are available, just return busy. I think this answers my question about reserve - we can't reserve with an update. It's a yes / no, and right now, if we don't have it ask again later!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this mean do nothing for match allocate

Kind of, yes. It's fall-through syntax, see: https://en.cppreference.com/w/cpp/language/switch.html or https://learn.microsoft.com/en-us/cpp/cpp/switch-statement-cpp?view=msvc-170

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be extra clear, maybe add the standard fallthrough annotation here. That way if we turn on fallthrough warnings it will not come back and bug us later.

https://en.cppreference.com/w/cpp/language/attributes/fallthrough

Suggested change
case match_op_t::MATCH_GROW_ALLOCATION:
[[fallthrough]];
case match_op_t::MATCH_GROW_ALLOCATION:

} else {
jobid = static_cast<int64_t> (std::strtoll (args[3].c_str (), NULL, 10));
if (!ctx->jobs.contains (jobid)) {
std::cerr << "ERROR: nonexistent jobid " << jobid << std::endl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is (I think) a third check for existence - I am guessing this is done because the state might change, and better to be safe than sorry.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This existence check is on a different code path. Currently, resource-query isn't a client of all the C++ API functions but it's going to transition to being a full client.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh maybe this is it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, my browser didn't update the thread and I missed your comment - gotcha.

@milroy milroy force-pushed the alloc-grow branch 4 times, most recently from eff17d2 to 0b51ab9 Compare July 21, 2025 07:02
@milroy milroy marked this pull request as ready for review July 21, 2025 07:04
@milroy milroy changed the title [WIP] Add support for growing allocations Add support for growing allocations Jul 21, 2025
milroy added 13 commits July 21, 2025 18:41
Problem: adding support for growing an allocation requires a new
match_op.

Add match_op_t::MATCH_GROW_ALLOCATION.
Problem: adding support for growing an allocation requires a new
match_op.

Add match_op_t::MATCH_GROW_ALLOCATION to the traverser. While the new
match operation behaves identically to MATCH_ALLOCATE in the traverser,
creating a new operation is clearer in intent.
Problem: adding support for growing an allocation requires a new match
command in the resource module.

Add the grow_allocation command to the module and add error handling for
existent or nonexistent jobids.
Problem: adding support for growing an allocation requires a new match
operation in the REAPI CLI.

Add the MATCH_GROW_ALLOCATION operation to the CLI and add error
handling for nonexistent jobids.
Problem: adding support for growing an allocation requires a new match
operation in the C REAPI CLI.

Add the MATCH_GROW_ALLOCATION operation to the C CLI and add error
handling for nonexistent jobids.
Problem: the new MATCH_GROW_ALLOCATION isn't documented in the C REAPI
module.

Add an entry for MATCH_GROW_ALLOCATION.
Problem: adding support for growing an allocation requires handing the
grow_allocation command in resource-query.

Add the grow_allocation command to resource-query and add error handling
for nonexistent jobids.
Problem: adding support for growing an allocation requires handing the
grow_allocation command in rq2.

Add the grow_allocation command to rq2 and add error handling for
nonexistent jobids.
Problem: growing an allocation is untested in the sharness testsuite.

Add tests to the existing grow sharness test.
Problem: growing an allocation does not have any unit testing.

Add unit tests of initialization and allocation growth via Catch2.
Problem: shrinking an allocation does not have any unit testing.

Add unit tests of initialization and allocation shrink via Catch2.
Problem: growing an allocation is not supported in flux ion-resource.

Add the ability to send match RPCs to grow an existing allocation.
Problem: growing an allocation via flux ion-resource is untested.

Add tests to grow an existing allocation via flux ion-resource.
@milroy
Copy link
Member Author

milroy commented Jul 22, 2025

I've added Catch2-based unit testing and improved testsuite tests by adding support for allocation growth in flux ion-resource. This PR is ready for review.

That said, there are two issues that are open for discussion:

  1. should it be MATCH_GROW_ALLOCATION or MATCH_GROW_ALLOCATE?
  2. What is the expected integration with flux-core and qmanager? What happens if the request gets stuck in the queue for so long the at time isn't useful or the base job has completed?

@milroy milroy requested review from trws and vsoch July 22, 2025 02:20
Copy link

codecov bot commented Jul 22, 2025

Codecov Report

Attention: Patch coverage is 86.22222% with 31 lines in your changes missing coverage. Please review.

Project coverage is 76.2%. Comparing base (fc7621f) to head (0f6afd4).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
resource/modules/resource_match.cpp 50.0% 10 Missing ⚠️
resource/reapi/test/reapi_cli_test_c.cpp 92.6% 6 Missing ⚠️
resource/reapi/test/reapi_cli_test_cxx.cpp 92.8% 6 Missing ⚠️
resource/reapi/bindings/c++/reapi_cli_impl.hpp 33.3% 4 Missing ⚠️
resource/policies/base/match_op.h 50.0% 2 Missing ⚠️
resource/traversers/dfu.cpp 87.5% 1 Missing ⚠️
resource/utilities/command.cpp 92.3% 1 Missing ⚠️
resource/utilities/rq2.cpp 83.3% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1340     +/-   ##
========================================
- Coverage    76.4%   76.2%   -0.2%     
========================================
  Files         112     115      +3     
  Lines       16700   17048    +348     
========================================
+ Hits        12765   13002    +237     
- Misses       3935    4046    +111     
Files with missing lines Coverage Δ
resource/reapi/bindings/c/reapi_cli.cpp 23.9% <100.0%> (ø)
resource/traversers/dfu.cpp 88.9% <87.5%> (-0.2%) ⬇️
resource/utilities/command.cpp 79.7% <92.3%> (+0.1%) ⬆️
resource/utilities/rq2.cpp 62.2% <83.3%> (+0.1%) ⬆️
resource/policies/base/match_op.h 58.8% <50.0%> (-5.5%) ⬇️
resource/reapi/bindings/c++/reapi_cli_impl.hpp 56.6% <33.3%> (+5.0%) ⬆️
resource/reapi/test/reapi_cli_test_c.cpp 92.6% <92.6%> (ø)
resource/reapi/test/reapi_cli_test_cxx.cpp 92.8% <92.8%> (ø)
resource/modules/resource_match.cpp 70.9% <50.0%> (-0.1%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@trws
Copy link
Member

trws commented Jul 25, 2025

I'm good with MATCH_GROW_ALLOCATION as long as you are.

As to the timing, I think we need to have something in free that does a cancel on any in-flight grow requests. Speaking of that actually, I haven't searched for it but this fails if the job matching its ID is in a state other than "allocated/running" right? Stacking grows on a pending job sounds like a recipe for strange, strange cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants