
Conversation


@chu11 chu11 commented Oct 17, 2025

Per discussion in #6782. This largely mirrors the "parallelization" done for valrefs in flux-fsck. The maximum number of RPCs that can be sent out in parallel is hard-coded to 1000.

We have some circumstances (e.g. flux-dump on a live broker) in which we may intentionally want to be slower, to avoid causing performance issues for the running system. So we don't want to parallelize all the time. I elected to support this as an option --async (name can be debated) rather than the opposite --no-async, so that the current behavior stays the default.
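
For context, the core pattern is a sliding window of in-flight RPCs: send up to the cap, and each response triggers the next request. A minimal sketch (assuming flux-core's content_load_byblobref()/content_load_get(), the treeobj accessors, and log_err_exit(); illustrative only, not the PR's actual code):

/* sketch: assumes flux/core.h, libcontent/content.h,
 * libkvs/treeobj.h, and libutil/log.h are included
 */
struct lookup {
    flux_t *h;
    json_t *treeobj;    /* valref tree object */
    int count;          /* total number of blobrefs */
    int index;          /* next blobref to request */
    int in_flight;      /* RPCs currently outstanding */
};

static void refill (struct lookup *lk);

static void continuation (flux_future_t *f, void *arg)
{
    struct lookup *lk = arg;
    const void *buf;
    size_t len;

    if (content_load_get (f, &buf, &len) < 0)
        log_err_exit ("content.load");
    /* ...consume the blob... */
    flux_future_destroy (f);
    lk->in_flight--;
    refill (lk);        /* keep the window full */
}

static void refill (struct lookup *lk)
{
    while (lk->in_flight < 1000 && lk->index < lk->count) {
        const char *blobref = treeobj_get_blobref (lk->treeobj, lk->index);
        flux_future_t *f = content_load_byblobref (lk->h, blobref, 0);
        if (!f || flux_future_then (f, -1., continuation, lk) < 0)
            log_err_exit ("cannot send content.load request");
        lk->in_flight++;
        lk->index++;
    }
}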

Against a 9000-entry KVS valref array, I put a few date calls around a flux-dump.

Before

17:58:23.882822734
flux-dump: archived 1 keys
17:58:24.912048310

nanoseconds = 1029225576

After

18:15:57.390297976
flux-dump: archived 1 keys
18:15:57.616373318

nanoseconds = 226075342

So, about a 78% improvement! Against a 4000-entry valref, the improvement was about 67%.
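
(That is, (1029225576 - 226075342) / 1029225576 ≈ 0.78.)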

Side note: assuming we like this change, I have not yet updated rc3 to use this.

@chu11 chu11 force-pushed the issue6782_flux_dump_perf branch from ea251e9 to a2876f1 on October 17, 2025 22:01

garlick commented Oct 17, 2025

Yeah maybe --fast would be more intuitive than --async, making it more obvious why you'd want to use the option?

I haven't looked at this yet, but remember I was asking for code sharing between dump and fsck, for fetching large value arrays under one future. Does that seem possible now? I believe before you didn't want to pause for that because of the urgent need for fsck at the time.


grondo commented Oct 17, 2025

Just for my own curiosity, when would we not want the "fast" behavior?

Nevermind. After I typed that I remembered that we were worried about impact to running systems when doing nightly dumps.

@chu11 chu11 force-pushed the issue6782_flux_dump_perf branch from a2876f1 to 916d008 on October 18, 2025 06:57

chu11 commented Oct 18, 2025

haven't looked at this yet but remember I was asking for code sharing between dump and fsck, for fetching large value arrays under one future.. Does that seem possible now?

It's doable, but I felt it wasn't going to be much of a win. fsck & dump have begun diverging, so there'd have to be special handling of all the differences between them: e.g. validate vs. load, saving response messages vs. not, continuing with all blobrefs on error vs. giving up after the first error (I just realized I did not handle this one correctly on the dump side; re-pushed), etc.

@chu11 chu11 force-pushed the issue6782_flux_dump_perf branch 2 times, most recently from d0012c7 to b4b7fca on October 18, 2025 14:46
@chu11 chu11 changed the title flux-dump: support --async option for faster dumping flux-dump: support --fast option for faster dumping Oct 18, 2025

chu11 commented Oct 18, 2025

re-pushed, converting --async to --fast.

@chu11 chu11 force-pushed the issue6782_flux_dump_perf branch from b4b7fca to cb9b72a on October 20, 2025 23:57

garlick commented Oct 21, 2025

Could we just pause for a second and make sure that's not a viable approach? If we could abstract all that it would make the tools far simpler to work on down the road.

validate vs. load

Same request payload. As far as an interface goes, it could just be a VALIDATE_ONLY flag that causes the topic string to change.

save response messages vs not

Easily handled, I think? If the flag is VALIDATE_ONLY, then no save; otherwise save in an array, I guess?

continue with all blobrefs on error vs give up after first error

Not sure if that really matters. Errors are unlikely, so if you continued and fetched blobrefs after encountering an error, it wouldn't be the end of the world. But it's probably also doable with flux_future_continue_error()? (I would need to refresh my memory on composite futures to be sure.)

It's possible I'm spouting nonsense, and you've actually been working on the code, so if you're pretty sure that's the case then say so and we can move on :-)


chu11 commented Oct 21, 2025

Could we just pause for a second and make sure that's not a viable approach? If we could abstract all that it would make the tools far simpler to work on down the road.

It's certainly viable, I just wasn't convinced that it was a big enough "win" to make it worthwhile.

I'll go ahead and give it a shot. The devil's in the details; we'll see how it pans out.

@chu11 chu11 force-pushed the issue6782_flux_dump_perf branch from cb9b72a to 03a1ce0 on October 23, 2025 18:29

chu11 commented Oct 23, 2025

re-pushed, adding common code via a library in src/cmd/util (new commits are on top of what I did before).

Went through several iterations; one attempt to create a "true library function" in libkvs was dropped b/c the code became ridiculous (i.e. way more complex than it should have been for some ~50 lines of common code between flux-dump and flux-fsck). So I instead made a generalized "common code function" just for this.

Is it a net win? I think a bit, although it's not as clean as I would have loved. I could go either way on it, really.

/* return -1 to discontinue getting more blobrefs */
typedef int (*get_blobrefs_f)(const flux_msg_t *msg,
                              int index,
                              void *arg);

/* return -1 to discontinue getting more blobrefs */
typedef int (*err_blobrefs_f)(int errnum,
                              int index,
                              void *arg);

/* asynchronously look up blobrefs
 * - get_cb is optional; set it only if you want the data
 * - err_cb is required; it is called when a blobref lookup error occurs
 */
int valref_blobrefs (flux_t *h,
                     const char *topic,
                     json_t *treeobj,
                     get_blobrefs_f get_cb,
                     err_blobrefs_f err_cb,
                     void *arg);

The idea was that flux-dump sets both callbacks and flux-fsck sets only the error callback, b/c it doesn't care about the data. Passing the topic string was less than ideal, but it felt like the easiest way to deal with content.load vs content-backing.load vs content-backing.validate between the two tools. I had to leave more code in flux-dump than I would have liked, b/c we still deal with the "serial" path there.
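
To illustrate (hypothetical usage of the interface above; the context struct and the dump callbacks are made up for the sketch):

/* flux-fsck only cares about failures, so it sets err_cb alone */
struct fsck_ctx {
    int errorcount;
};

static int fsck_err_cb (int errnum, int index, void *arg)
{
    struct fsck_ctx *ctx = arg;
    log_msg ("blobref index=%d: %s", index, strerror (errnum));
    ctx->errorcount++;
    return 0;    /* fsck keeps going past errors */
}

/* flux-fsck: validate only, no data callback */
valref_blobrefs (h, "content-backing.validate", treeobj,
                 NULL, fsck_err_cb, &fsck_ctx);

/* flux-dump: fetch the data too (dump callbacks elided) */
valref_blobrefs (h, "content.load", treeobj,
                 dump_get_cb, dump_err_cb, &dvd);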

@chu11 chu11 force-pushed the issue6782_flux_dump_perf branch from 03a1ce0 to 8147cc1 on October 23, 2025 18:42

garlick commented Oct 23, 2025

What I was hoping for was something like

typedef enum { LOOKUP_VALREF_VALIDATE_ONLY=1 } lookup_valref_flags_t;

// Future is fulfilled after all blobrefs have been fetched/validated
// If any blobref is invalid, future is fulfilled with error
flux_future_t *lookup_valref (flux_t *h, json_t *treeobj, int flags);

// Functions to access individual blobs
// get_blob() blocks until future is fulfilled
int lookup_valref_get_blobcount (flux_future_t *f);
int lookup_valref_get_blob (flux_future_t *f, size_t index, const void **buf, size_t *size);
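
Usage under that hypothetical interface might look like:

// Hypothetical flux-dump consumer of lookup_valref()
flux_future_t *f;
if (!(f = lookup_valref (h, treeobj, 0)))
    log_err_exit ("lookup_valref");
int count = lookup_valref_get_blobcount (f);   // blocks until fulfilled
for (int i = 0; i < count; i++) {
    const void *buf;
    size_t size;
    if (lookup_valref_get_blob (f, i, &buf, &size) < 0)
        log_err_exit ("lookup_valref_get_blob");
    // write blob to archive
}
flux_future_destroy (f);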

Is that approach where the implementation got absurdly complex?


chu11 commented Oct 23, 2025

As I was prototyping, it began moving towards an async_valref_t *async_valref_create (json_t *treeobj, int flags) kind of style with a context, and I felt it just wasn't worth the energy moving in that direction. I didn't think of it in your "composite future" style API though.

One example: flux-fsck needs all the errors and the indexes of the blobrefs with errors. Errors can arrive out of order with flux-dump (b/c it can choose to go through content.load instead of content-backing.load).

So, we need to store all the lookup error indexes and errnos on a list somewhere for later retrieval. But that just seemed like a lot of effort for the amount of gain. (thus my eventual "err_cb" mechanism instead)

It's worth noting that I went through the original PR push with the mentality that behavior should not change ("behavior" here can mean error messages, etc.). If we were ok with behavior changing, then that changes the mental math about how to do this.

Given what I just wrote here, I'd like to propose we drop the shared function for the time being. I'll write up an issue for this for a later time. There are flux-fsck issues (#7121, #7122) in the pipeline, so we'd perhaps want to see how all that falls out first.


garlick commented Oct 23, 2025

So, we need to store all the lookup error indexes and errnos on a list somewhere for later retrieval. But that just seemed like a lot of effort for the amount of gain. (thus my eventual "err_cb" mechanism instead)

It's worth noting that I went through the original PR push with the mentality that behavior should not change

We could probably change that to have the composite future be fulfilled with the first error. I'm not sure it's important to call out each failed blobref and its index, or even count them. That the value is corrupt is the main thing that needs reporting, IMHO.

I'd like to propose we drop the shared function for the time being

OK

@chu11 chu11 force-pushed the issue6782_flux_dump_perf branch from 8147cc1 to 8f27050 on October 24, 2025 00:02

chu11 commented Oct 24, 2025

re-pushed, removing the "common function" discussed above. Fixed up a few commit messages I noticed still said --async.


@garlick garlick left a comment


I made a few suggestions inline. None of them is a huge deal, so I'll tentatively approve and let you sort through them and take what's useful.

Comment on lines 262 to 263
rm -f content.sqlite &&
flux module load content-sqlite
Member

Suggestion: use

flux module load content-sqlite truncate

Comment on lines 280 to 283
test_expect_success LONGTEST 'unload content-sqlite, remove backing, reload content-sqlite' '
flux module remove content-sqlite &&
rm -f content.sqlite &&
flux module load content-sqlite
Member

truncate option


# Cover value with a very large number of appends

test_expect_success LONGTEST 'remove backing file and load content-sqlite' '
Member

I don't think the LONGTEST tests are likely to be run automatically anywhere (correct me if I'm wrong).
Could LONGTEST be dropped if the number of appends were reduced?

Member Author

nope, I just run LONGTEST when I run tests :-)

Hmmmm, I think having "quite large" is important, even if it's only exercised when I run the tests personally.

Perhaps we add something "medium", like 50-100 appends, that can run all the time?

Comment on lines 175 to 185
/* N.B. first attempt was to save the futures in an array, but ran into:
* flux: ev_epoll.c:134: epoll_modify: Assertion `("libev: I/O watcher
* with invalid fd found in epoll_ctl", errno != EBADF && errno != ELOOP
* && errno != EINVAL)' failed.
* while archiving a resource.eventlog with 781 entries. Instead of
* retaining the futures for a second pass, just retain references to the
* content.load response messages.
*/
Member

Seems like this comment is still relevant?

Member Author

since we're moving to an array of msgs I thought it was no longer relevant? But we could certainly leave it as an explanation for why we use an array of msgs vs futures.

Member

IMHO an array of futures is the more straightforward way, so I thought the comment explaining why an array of msgs was informative.

#include "src/common/libutil/blobref.h"
#include "src/common/libcontent/content.h"
#include "ccan/str/str.h"

Member

Commit message:

Rate cap the lookups at 1000 RPCs.

This isn't really a rate cap - it's a concurrency cap, right?

Comment on lines 224 to 226
if (!(indexp = (int *)malloc (sizeof (int))))
log_err_exit ("cannot allocate index memory");
(*indexp) = dvd->index;
Member

Seems like a nice place to use a ccan ptrint.
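
For reference, ptrint stores the integer directly in the pointer value, so the malloc goes away entirely (a sketch, assuming ccan/ptrint is available in the tree):

#include "ccan/ptrint/ptrint.h"

/* stash the index in the future aux hash without an allocation */
if (flux_future_aux_set (f, "index", int2ptr (dvd->index), NULL) < 0)
    log_err_exit ("could not save index value");

/* ...and recover it in the continuation */
int index = ptr2int (flux_future_aux_get (f, "index"));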

Comment on lines 233 to 238
while (dvd->in_flight < BLOBREF_ASYNC_MAX
&& dvd->index < dvd->count) {
get_blobref (dvd);
dvd->in_flight++;
dvd->index++;
}
Member

Just a thought but maybe instead of --fast with a hard coded limit, it would be nicer to have a --max-concurrent-requests=N option with a low default?

Comment on lines 300 to 307
if (fast) {
if (dump_valref_async (&dvd) < 0)
goto cleanup;
}
else {
if (dump_valref_serial (&dvd) < 0)
goto cleanup;
}
Member

See previous suggestion about --max-concurrent-requests. That would allow these two code paths to be collapsed.

Comment on lines 288 to 292
/* We need the total size before we start writing archive data,
* so make a first pass, saving the data for writing later.
*/
if (!(msgs = calloc (count, sizeof (msgs[0]))))
log_err_exit ("could not create messages array");
Member

This comment is a bit confusing. Maybe just say something like

Load all data comprising the valref before starting the archive entry. This is because the total size of the value must be calculated and written before any data, and any load errors must be detected so the whole entry can be skipped if --ignore-failed-read.

Member Author

These comments were from before my work :-) I'll update.

Comment on lines 79 to 82
.. option:: --fast

Speed up :program:`flux-dump` by running some operations asynchronously

Member

If I'm a user, I may wonder why I wouldn't always want --fast.
Probably should say something like

When the dump is large, this could create a heavy burden on the KVS and should not be used unless the system is quiescent or it is an emergency.


garlick commented Nov 3, 2025

If you are considering my --max-concurrent-request=N suggestion, I think we could come up with a shorter name. Like --maxreqs=N 🤷


grondo commented Nov 3, 2025

I really like the idea of --maxreqs=N FWIW


chu11 commented Nov 4, 2025

If you are considering my --max-concurrent-request=N suggestion, I think we could come up with a shorter name. Like --maxreqs=N 🤷

Are you thinking of this as a replacement for --fast or as a supplement ... i.e.

flux dump --fast --maxreqs=3000, or just flux dump --maxreqs=3000 where the default is 1?

I do like the idea of the latter. My only "hmmmm" thought is that we didn't do that with flux fsck. Might be good to do that over there too (possibly before this PR, if we do not want --fast).


garlick commented Nov 4, 2025

I was thinking just the one option.

fsck seems less important since the KVS is normally offline when that runs.

@chu11 chu11 changed the title flux-dump: support --fast option for faster dumping flux-dump: support --maxreqs option for faster dumping Nov 5, 2025
@chu11 chu11 force-pushed the issue6782_flux_dump_perf branch from 8f27050 to 343e4dc on November 5, 2025 23:43

chu11 commented Nov 5, 2025

re-pushed, going with --maxreqs instead of --fast. The cleanups are now in #7186.
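
(For reference, a hypothetical optparse declaration for the new option; the actual code may differ:)

static struct optparse_option dump_opts[] = {
    { .name = "maxreqs", .has_arg = 1, .arginfo = "N",
      .usage = "Set maximum number of concurrent blobref lookup requests",
    },
    OPTPARSE_TABLE_END,
};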

One idea I had ... would we like to keep --fast as a shortcut for --maxreqs=1000?? ehhhh


chu11 commented Nov 10, 2025

oh, I know this is approved, but since I did rename the option to --maxreqs and it takes an argument, I figured the PR should get another once-over before I set MWP.


@garlick garlick left a comment


LGTM! I did have a couple of minor suggestions but nothing much.

Comment on lines 186 to 187
if (!(msgs = calloc (count, sizeof (msg))))
log_err_exit ("could not create messages array");
Member

shouldn't that be (technically) sizeof (msgs[0])?

Member Author

+    const flux_msg_t **msgs;
     const flux_msg_t *msg;

I went with msg b/c it happened to be defined here, but I agree, msgs[0] is a bit more clear.

}
dvd->in_flight--;
dvd->total_size += len;
dvd->msgs[index] = flux_msg_incref (msg);
Member

It might be a good idea to bounds check index before using it here.
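
E.g. something like (a sketch; dvd->count and dvd->path as in the surrounding code):

if (index < 0 || index >= dvd->count)
    log_msg_exit ("%s: blobref index %d out of range", dvd->path, index);
dvd->msgs[index] = flux_msg_incref (msg);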

Comment on lines 217 to 221
if (!(f = content_load_byblobref (dvd->h, blobref, content_flags))
|| flux_future_then (f, -1, get_blobref_continuation, dvd) < 0)
log_err_exit ("%s: cannot load blobref %d", dvd->path, dvd->index);
if (flux_future_aux_set (f, "index", int2ptr (dvd->index), NULL) < 0)
log_err_exit ("could not save index value");
Member

This could probably just have one error message instead of two, since it likely won't be helpful to distinguish which one failed.
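
E.g. collapsed into a single check (same calls as above, one message):

if (!(f = content_load_byblobref (dvd->h, blobref, content_flags))
    || flux_future_then (f, -1, get_blobref_continuation, dvd) < 0
    || flux_future_aux_set (f, "index", int2ptr (dvd->index), NULL) < 0)
    log_err_exit ("%s: cannot load blobref %d", dvd->path, dvd->index);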

Comment on lines 257 to 258
while (dvd.in_flight < async_max
&& dvd.index < dvd.count) {
Member

Style: one 4-space indent before && or don't break the line at all.
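
I.e. either

while (dvd.in_flight < async_max
    && dvd.index < dvd.count) {

or

while (dvd.in_flight < async_max && dvd.index < dvd.count) {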

Comment on lines 500 to 503
/* default = 1, no parallelism */
async_max = optparse_get_int (p, "maxreqs", 1);
if (async_max <= 0)
log_err_exit ("invalid value for maxreqs");
Member

I wonder if the default should be something like 8?

It feels like a case where a little parallelism is likely to help a lot and yet not be too hard on the system, but I guess we can always raise it later.

Contributor

It might be good to do some experiments and report back here. Might be good to get a reasonable default on the first go.

Comment on lines 191 to 194
flux_future_destroy (f);
dvd->errorcount++;
dvd->errnum = errno; /* we'll report the last errno */
Contributor

Should in_flight be decremented here as well?


@chu11 chu11 Nov 12, 2025


good catch! I should do it in flux-fsck too. N/m, it's not needed in flux-fsck; it falls through and doesn't return immediately.



chu11 commented Nov 12, 2025

I wonder if the default should be something like 8?

It feels like a case where a little parallelism is likely to help a lot and yet not be too hard on the system, but I guess we can always raise it later.

I did wonder about having the default be like 4 or 8, but I left it at 1 out of a desire to not change current behavior.

I guess the question in my mind is: what use case is there where we want flux-dump to be a little faster by default, but not so fast that it burdens a live running system? My understanding is we have one current use case where we run flux-dump on a live running system, and we do it at 3am or something for a "nightly just in case" backup, so at that time performance probably does not matter?


garlick commented Nov 12, 2025

I think the other potential use case is a sys admin taking a backup when something is not working right. In that case, it's probably better if the tool doesn't take 40m out of their day.

I wouldn't expect someone in the hot seat like that to be reading man pages to discover optimizations either. They would rightly expect the tool to have a workable default.


chu11 commented Nov 12, 2025

I think the other potential use case is a sys admin taking a backup when something is not working right. In that case, it's probably better if the tool doesn't take 40m out of their day.

I wouldn't expect someone in the hot seat like that to be reading man pages to discover optimizations either. They would rightly expect the tool to have a workable default.

That's a good example! Lemme run some experiments, see if I can come up with a number that seems pretty good.


chu11 commented Nov 13, 2025

So after some experimentation, I found that performance gains with --maxreqs plateau around 8-16 or so. I suspect on a real system, with mountains more data than my experiments, the plateau might instead happen around 32-64.

With a bit of profiling, I suspect this is b/c flux-dump actually post-processes the data it receives (i.e. the real data is sent, then it gets written to the archive). This is in contrast to flux-fsck, which only validates that the blobref is correct (i.e. did validate return an error or not) and immediately moves on.

Going from maxreqs of 1 to 2 is a big win, going from 2 to 4 is too, then gains begin to slow down.

Given that, I'm going to go with a default of 2. It gets us a pretty big win while only doubling the requests sent to the broker. I think 4 might be ok as well, but I was intentionally conservative.

Problem: In the near future we may wish to asynchronously
load blobrefs to increase parallelization.  This can lead to
RPCs responding out of order, so using a list and appending to
it will not work for storing responses.

Convert from using a msglist to a simple array to store responses.

Problem: The blobrefs within a KVS valref tree object are looked
up synchronously one by one.  This can be slow.

Support a new option --maxreqs that will parallelize the valref lookups
up to the specified concurrency level.

flux-dump may have use cases where we want the dump to be slow, such as
on a running system, and do not want it to interfere with other broker
activities.  So we want a default that will give us noticeable speedup
but not overwhelm the broker.  After some experimentation, we go with a
default of 2.

Fixes flux-framework#6782

Problem: There is no coverage for the new flux-dump --maxreqs
option.

Add coverage in t2807-dump-cmd.t.

Problem: The new --maxreqs option in flux-dump is not documented.

Add it to flux-dump(1).
@chu11 chu11 force-pushed the issue6782_flux_dump_perf branch from 7c2bfbf to ec62a8b on November 13, 2025 19:51

codecov bot commented Nov 13, 2025

Codecov Report

❌ Patch coverage is 92.42424% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.71%. Comparing base (4b5a517) to head (ec62a8b).
⚠️ Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
src/cmd/builtin/dump.c 92.42% 5 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7158   +/-   ##
=======================================
  Coverage   83.70%   83.71%           
=======================================
  Files         553      553           
  Lines       92350    92390   +40     
=======================================
+ Hits        77304    77346   +42     
+ Misses      15046    15044    -2     
Files with missing lines Coverage Δ
src/cmd/builtin/dump.c 88.46% <92.42%> (+0.59%) ⬆️

... and 9 files with indirect coverage changes
