flux-dump: support --maxreqs option for faster dumping #7158
base: master
Conversation
(force-pushed ea251e9 to a2876f1)
Yeah, maybe. I haven't looked at this yet, but remember I was asking for code sharing between dump and fsck, for fetching large value arrays under one future. Does that seem possible now? I believe before you didn't want to pause for that because of the urgent need for fsck at the time.
Just for my own curiosity, when would we not want the "fast" behavior? Never mind. After I typed that I remembered that we were worried about impact to running systems when doing nightly dumps.
(force-pushed a2876f1 to 916d008)
It's doable, but I felt it wasn't going to be much of a win. fsck and dump have begun diverging, so there'd have to be special handling of all the differences between them: e.g. validate vs. load, save response messages vs. not, continue with all blobrefs on error vs. give up after the first error (I just realized I did not handle this one correctly on the dump side, re-pushed), etc.
(force-pushed d0012c7 to b4b7fca)
re-pushed, converting --async to --fast.
(force-pushed b4b7fca to cb9b72a)
Could we just pause for a second and make sure that's not a viable approach? If we could abstract all that, it would make the tools far simpler to work on down the road.

- Same request payload. As far as an interface goes, it could just be a VALIDATE_ONLY flag that causes the topic string to change.
- Easily handled, I think? If the flag is VALIDATE_ONLY, then no save; otherwise save in an array, I guess?
- Not sure if that really matters. Errors are unlikely, so if you continued and fetched blobrefs after encountering an error it wouldn't be the end of the world. But probably also doable with

It's possible I'm spouting nonsense, and you've actually been working on the code, so if you're pretty sure that's the case then say so and we can move on :-)
It's certainly viable, I just wasn't convinced that it was a big enough "win" to make it worthwhile. I'll go ahead and give it a shot. Devil's in the details and we'll see how it pans out.
(force-pushed cb9b72a to 03a1ce0)
re-pushed, adding common code via a library in

Went through several iterations, one attempt to create a "true library function" in

Is it a net win? I think a bit, although it's not as clean as I would have loved. I could go either way on it really. The idea was that
(force-pushed 03a1ce0 to 8147cc1)
What I was hoping for was something like:

```c
typedef enum { LOOKUP_VALREF_VALIDATE_ONLY = 1 } lookup_valref_flags_t;

/* Future is fulfilled after all blobrefs have been fetched/validated.
 * If any blobref is invalid, future is fulfilled with error.
 */
flux_future_t *lookup_valref (flux_t *h, json_t *treeobj, int flags);

/* Functions to access individual blobs.
 * get_blob() blocks until future is fulfilled.
 */
int lookup_valref_get_blobcount (flux_future_t *f);
int lookup_valref_get_blob (flux_future_t *f,
                            size_t index,
                            const void **buf,
                            size_t *size);
```

Is that approach where the implementation got absurdly complex?
As I was prototyping, it began moving towards a

One example,

So, we need to store all the lookup error indexes and errnos on a list somewhere for later retrieval. But that just seemed like a lot of effort for the amount of gain (thus my eventual "err_cb" mechanism instead).

It's worth noting that I went through the original PR push with the mentality that behavior should not change ("behavior" here can mean error messages, etc.). If we were OK with behavior changing, then that changes the mental math about how to do this.

Given what I just wrote here, I'd like to propose we drop the shared function for the time being. I'll write up an issue for this for a later time. There are
We could probably change that to have the composite future be fulfilled with the first error. I'm not sure it's important to call out each failed blobref and its index, or even count them. That the value is corrupt is the main thing that needs reporting, IMHO.
OK
(force-pushed 8147cc1 to 8f27050)
re-pushed, removing the "common function" discussed above. Fixed up a few commit messages I noticed still said
garlick
left a comment
I made a few suggestions inline. None of them are a huge deal, so I'll tentatively approve and let you sort through them and take what's useful.
t/t2807-dump-cmd.t
Outdated
```
rm -f content.sqlite &&
flux module load content-sqlite
```
Suggestion: use `flux module load content-sqlite truncate`
t/t2807-dump-cmd.t
Outdated
```
test_expect_success LONGTEST 'unload content-sqlite, remove backing, reload content-sqlite' '
	flux module remove content-sqlite &&
	rm -f content.sqlite &&
	flux module load content-sqlite
```
truncate option
t/t2807-dump-cmd.t
Outdated
```
# Cover value with a very large number of appends

test_expect_success LONGTEST 'remove backing file and load content-sqlite' '
```
I don't think the LONGTEST tests are likely to be run automatically anywhere (correct me if I'm wrong).
Could LONGTEST be dropped if the number of appends were reduced?
Nope, I just run LONGTEST when I run tests :-)

Hmmmm, I think having a "quite large" case is important, even if it's only run when I test personally. Perhaps we add something "medium", like 50-100 appends, that can run all the time?
```c
/* N.B. first attempt was to save the futures in an array, but ran into:
 * flux: ev_epoll.c:134: epoll_modify: Assertion `("libev: I/O watcher
 * with invalid fd found in epoll_ctl", errno != EBADF && errno != ELOOP
 * && errno != EINVAL)' failed.
 * while archiving a resource.eventlog with 781 entries. Instead of
 * retaining the futures for a second pass, just retain references to the
 * content.load response messages.
 */
```
Seems like this comment is still relevant?
Since we're moving to an array of msgs I thought it no longer relevant? But we could certainly leave it as an explanation for why we use an array of msgs vs. futures.
IMHO an array of futures is the more straightforward way, so I thought the comment explaining why an array of msgs was informative.
```c
#include "src/common/libutil/blobref.h"
#include "src/common/libcontent/content.h"
#include "ccan/str/str.h"
```
Commit message:
Rate cap the lookups at 1000 RPCs.
This isn't really a rate cap - it's a concurrency cap, right?
src/cmd/builtin/dump.c
Outdated
```c
if (!(indexp = (int *)malloc (sizeof (int))))
    log_err_exit ("cannot allocate index memory");
(*indexp) = dvd->index;
```
Seems like a nice place to use a ccan ptrint.
src/cmd/builtin/dump.c
Outdated
```c
while (dvd->in_flight < BLOBREF_ASYNC_MAX
       && dvd->index < dvd->count) {
    get_blobref (dvd);
    dvd->in_flight++;
    dvd->index++;
}
```
Just a thought but maybe instead of --fast with a hard coded limit, it would be nicer to have a --max-concurrent-requests=N option with a low default?
src/cmd/builtin/dump.c
Outdated
```c
if (fast) {
    if (dump_valref_async (&dvd) < 0)
        goto cleanup;
}
else {
    if (dump_valref_serial (&dvd) < 0)
        goto cleanup;
}
```
See previous suggestion about --max-concurrent-requests. That would allow these two code paths to be collapsed.
src/cmd/builtin/dump.c
Outdated
```c
/* We need the total size before we start writing archive data,
 * so make a first pass, saving the data for writing later.
 */
if (!(msgs = calloc (count, sizeof (msgs[0]))))
    log_err_exit ("could not create messages array");
```
This comment is a bit confusing. Maybe just say something like:

Load all data comprising the valref before starting the archive entry. This is because the total size of the value must be calculated and written before any data, and any load errors must be detected so the whole entry can be skipped with --ignore-failed-read.
These were comments from before my work :-) I'll update.
doc/man1/flux-dump.rst
Outdated
```rst
.. option:: --fast

   Speed up :program:`flux-dump` by running some operations asynchronously
```
If I'm a user, I may wonder why I wouldn't always want --fast.
Probably should say something like
When the dump is large, this could create a heavy burden on the KVS and should not be used unless the system is quiescent or it is an emergency.
If you are considering my
I really like the idea of
Are you thinking as a replacement of
I do like the idea of the latter. My only "hmmmm" thought is that we didn't do that with
I was thinking just the one option. fsck seems less important since the KVS is normally offline when that runs.
(force-pushed 8f27050 to 343e4dc)
re-pushed, going with

One idea I had ... would we like to keep
(force-pushed 343e4dc to 7c2bfbf)
Oh, I know this is approved, but since I did rename the option to
garlick
left a comment
LGTM! I did have a couple of minor suggestions but nothing much.
src/cmd/builtin/dump.c
Outdated
```c
if (!(msgs = calloc (count, sizeof (msg))))
    log_err_exit ("could not create messages array");
```
Shouldn't that (technically) be sizeof (msgs[0])?
```diff
+ const flux_msg_t **msgs;
  const flux_msg_t *msg;
```

I went with msg b/c it happened to be defined here, but I agree, msgs[0] is a bit more clear.
```c
    }
    dvd->in_flight--;
    dvd->total_size += len;
    dvd->msgs[index] = flux_msg_incref (msg);
```
It might be a good idea to bounds check index before using it here.
src/cmd/builtin/dump.c
Outdated
```c
if (!(f = content_load_byblobref (dvd->h, blobref, content_flags))
    || flux_future_then (f, -1, get_blobref_continuation, dvd) < 0)
    log_err_exit ("%s: cannot load blobref %d", dvd->path, dvd->index);
if (flux_future_aux_set (f, "index", int2ptr (dvd->index), NULL) < 0)
    log_err_exit ("could not save index value");
```
This could probably just have one error message instead of two, since it likely won't be helpful to distinguish which one failed.
src/cmd/builtin/dump.c
Outdated
```c
while (dvd.in_flight < async_max
    && dvd.index < dvd.count) {
```
Style: one 4-space indent before && or don't break the line at all.
src/cmd/builtin/dump.c
Outdated
```c
/* default = 1, no parallelism */
async_max = optparse_get_int (p, "maxreqs", 1);
if (async_max <= 0)
    log_err_exit ("invalid value for maxreqs");
```
I wonder if the default should be something like 8?
It feels like a case where a little parallelism is likely to help a lot and yet not be too hard on the system, but I guess we can always raise it later.
It might be good to do some experiments and report back here. Might be good to get a reasonable default on the first go.
```c
flux_future_destroy (f);
dvd->errorcount++;
dvd->errnum = errno; /* we'll report the last errno */
```
Should in_flight be decremented here as well?
Good catch! I should do it in flux-fsck too. N/m, not in flux-fsck: there it falls through and doesn't return immediately.
I did wonder about having the default be like 4 or 8, but I left it at 1 out of a desire to not change current behavior. I guess the question in my mind is what use case is there where we want to have
I think the other potential use case is a sysadmin taking a backup when something is not working right. In that case, it's probably better if the tool doesn't take 40 minutes out of their day. I wouldn't expect someone in the hot seat like that to be reading man pages to discover optimizations either. They would rightly expect the tool to have a workable default.
That's a good example! Lemme run some experiments, see if I can come up with a number that seems pretty good.
So after some experimentation, I found that performance gains with

With a bit of profiling, I suspect this is b/c

Going from maxreqs of 1 to 2 is a big win, going from 2 to 4 is too, then gains begin to slow down. Given that, I'm going to go with a default of 2. It gets us a pretty big win while only doubling the requests sent to the broker. I think 4 might be OK as well, but I was intentionally conservative.
Problem: In the near future we may wish to asynchronously load blobrefs to increase parallelization. This can lead to RPCs responding out of order, so using a list and appending to it will not work for storing responses. Convert from using a msglist to a simple array to store responses.
Problem: The blobrefs within a KVS valref tree object are looked up synchronously one by one. This can be slow. Support a new option --maxreqs that will parallelize the valref lookups up to the specified concurrency level. flux-dump may have use cases where we want the dump to be slow, such as on a running system, and do not want it to interfere with other broker activities. So we want a default that will give us noticeable speedup but not overwhelm the broker. After some experimentation, we go with a default of 2. Fixes flux-framework#6782
Problem: There is no coverage for the new flux-dump --maxreqs option. Add coverage in t2807-dump-cmd.t.
Problem: The new --maxreqs option in flux-dump is not documented. Add it to flux-dump(1).
(force-pushed 7c2bfbf to ec62a8b)
Codecov Report

❌ Patch coverage is

```
@@           Coverage Diff           @@
##           master    #7158   +/- ##
=======================================
  Coverage   83.70%   83.71%
=======================================
  Files         553      553
  Lines       92350    92390     +40
=======================================
+ Hits        77304    77346     +42
+ Misses      15046    15044      -2
```
Per discussion in #6782. This largely mirrors the "parallelization" done for valrefs in flux-fsck. The maximum number of RPCs that can be sent out in parallel is hard coded to 1000.

We have some circumstances (i.e. flux-dump on a live broker) where we may intentionally want to be slower, to not cause performance issues for the running system. So we don't want to parallelize all the time. I elected to support this as an option --async (name can be debated) vs. the opposite --no-async, so that the current behavior could stay the same.

Against a 9000 entry KVS valref array, I put a few date calls around a flux-dump.

Before
nanoseconds = 1029225576
After
nanoseconds = 226075342
So about a 78% improvement! Against a 4000 entry valref I also tried, it was about 67%.
Side note: assuming we like this change, I have not yet updated rc3 to use this.