Change default max partition bytes in DR consumers #28417
Conversation
Pull Request Overview
This PR addresses a throughput limitation in disaster recovery (DR) consumers by increasing the default maximum partition bytes from 1MiB to 5MiB. The change resolves a scenario where consumers were artificially limited to 2MiB/s throughput when consuming from a single active partition, caused by the mismatch between the 1MiB max partition bytes and 5MiB min fetch bytes settings combined with a 500ms max wait time.
Key Changes:
- Increased `default_fetch_partition_max_bytes` from 1MiB to 5MiB to match the default min fetch bytes setting (see the sketch below)
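For intuition, here is a minimal back-of-the-envelope sketch (not code from this PR; the function name and values are illustrative) of the single-hot-partition throughput ceiling implied by these three settings:

```python
# Rough model of a fetch loop against a single hot partition. When the
# per-partition cap is smaller than the fetch min bytes, one partition can
# never satisfy min bytes, so every fetch waits the full max_wait_ms and
# then returns at most max_partition_bytes.
MiB = 1024 * 1024

def single_partition_ceiling(max_partition_bytes: int,
                             min_fetch_bytes: int,
                             max_wait_ms: int) -> float:
    """Approximate consumable bytes/s from one backlogged partition."""
    if max_partition_bytes < min_fetch_bytes:
        # Min bytes can never be met from one partition: pay the full wait each fetch.
        return max_partition_bytes / (max_wait_ms / 1000.0)
    # Otherwise the fetch returns as soon as enough data accumulates, so the
    # wait no longer dominates; treat the ceiling as unbounded in this rough model.
    return float("inf")

print(single_partition_ceiling(1 * MiB, 5 * MiB, 500))  # old defaults -> ~2 MiB/s
print(single_partition_ceiling(5 * MiB, 5 * MiB, 500))  # new defaults -> not wait-bound
print(single_partition_ceiling(1 * MiB, 1 * MiB, 500))  # alternative: lower min fetch bytes
```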
Another option here is lowering the fetch min bytes to 1MiB. Either would prevent the issue from occurring.
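Either remedy amounts to keeping the per-partition cap at least as large as the fetch min bytes. As an aside, the same three knobs exist on ordinary Kafka fetch clients; a hedged kafka-python sketch of the two equivalent configurations (the client library, topic name, and broker address are assumptions, not part of this PR, and the DR consumer configures these internally rather than through a client):

```python
from kafka import KafkaConsumer

MiB = 1024 * 1024

# Option taken in this PR: raise the per-partition cap to match min fetch bytes.
consumer_a = KafkaConsumer(
    "example-topic",                     # hypothetical topic
    bootstrap_servers="localhost:9092",  # hypothetical broker
    max_partition_fetch_bytes=5 * MiB,
    fetch_min_bytes=5 * MiB,
    fetch_max_wait_ms=500,
)

# Alternative raised in this comment: lower min fetch bytes to the per-partition cap.
consumer_b = KafkaConsumer(
    "example-topic",
    bootstrap_servers="localhost:9092",
    max_partition_fetch_bytes=1 * MiB,
    fetch_min_bytes=1 * MiB,
    fetch_max_wait_ms=500,
)
```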
Force-pushed 97d02da to 58fac1e
Generated stub change for the page_size docstring (old vs new):

ORDER_BY_FIELD_NUMBER: builtins.int
page_size: builtins.int
- 'The maximum number of connections to return. If unspecified or 0, a\n default value may be applied. Note that paging is currently not fully\n supported, and this field only acts as a limit for the first page of data\n returned. Subsequent pages of data cannot be requested.\n '
+ 'The maximum number of connections to return. If unspecified or 0, a\n default value may be applied. The server may return fewer connections\n than requested due to memory constraints; the limit is set to allow\n listing all connections for a single broker. Consider filtering by\n node_id to view connections for specific brokers. Note that paging is\n currently not fully supported, and this field only acts as a limit for\n the first page of data returned. Subsequent pages of data cannot be\n requested.\n '
I'm guessing someone forgot to run `task rp:generate-ducktape-protos` after modifying the cluster proto at some point. Happy to split this out into a separate commit if folks want.
Should we add a note to proto/redpanda/README.md about running the command after updating a proto?
I will just add a GitHub Action to check this in another PR.
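A minimal sketch of what such a check might run, assuming `task rp:generate-ducktape-protos` regenerates the stubs in place and that a dirty tracked tree afterwards means the committed stubs are stale; this script is illustrative, not the action referenced here:

```python
# Regenerate the ducktape proto stubs and fail if any tracked files change,
# i.e. the committed generated files are out of date.
import subprocess
import sys

subprocess.run(["task", "rp:generate-ducktape-protos"], check=True)
diff = subprocess.run(["git", "diff", "--exit-code", "--stat"])
if diff.returncode != 0:
    sys.exit("Generated ducktape protos are stale; run "
             "`task rp:generate-ducktape-protos` and commit the result.")
```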
CI test results: build#75806, build#75995
Force-pushed df99985 to 2ded4b0
Prior to this commit, the default max partition bytes was 1MiB, the default min fetch bytes was 5MiB, and the max wait was 500ms. In cases where only one partition on the shard was being produced to, these defaults limited consumer throughput to 2MiB/s, since we could only read 1MiB every 500ms from the one partition with new data. Hence this commit sets max partition bytes equal to min fetch bytes to prevent this issue.
The metrics test checks for a non-zero value for lag. Whether the metric is non-zero by the time the check occurs depends on whether all messages have been produced and whether the shadow cluster has caught up, which may not always be the case depending on how the test is run (e.g., in CDT things run much faster due to the greater resources). Before the default max partition bytes was raised to 5MiB in the direct consumer, the shadow cluster could consume at most 1MiB/s from the single-partition topic in the test, so more likely than not it would still be catching up by the time the metric was checked. Now that the default max partition bytes is 5MiB, that throughput limit no longer applies and the shadow cluster is more likely than not to be caught up by the time the metric is checked. This commit increases the number of messages produced well beyond the previous amount to ensure the shadow cluster is still catching up when the lag metric is checked. This change also increases the test runtime from 1min to 2min.
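A rough way to size the produce volume for this kind of check; every number below is a placeholder for illustration, not a value used in the test:

```python
# Back-of-the-envelope sizing: produce enough data that the shadow cluster is
# still behind when the lag metric is sampled. All values are hypothetical.
MiB = 1024 * 1024

consume_ceiling_bytes_per_s = 10 * MiB   # assumed shadow-cluster consume rate
check_delay_s = 30                       # assumed time until the lag check runs
message_size_bytes = 1024                # assumed message size

min_bytes = consume_ceiling_bytes_per_s * check_delay_s
min_messages = min_bytes // message_size_bytes
print(f"produce at least ~{min_messages} messages of {message_size_bytes} B")
```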
Force-pushed 2ded4b0 to 9921473
Prior to this PR, the default max partition bytes was 1MiB, the default min fetch bytes was 5MiB, and the max wait was 500ms.
In cases where only one partition on the shadow cluster shard was being produced to, these defaults limited consumer throughput to 2MiB/s, since we could only read 1MiB every 500ms from the one partition with new data. In cases where the throughput to that partition exceeded 2MiB/s, lag between the source and shadow cluster would grow indefinitely because of this.
Hence this PR sets max partition bytes equal to min fetch bytes to prevent this issue.
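A small simulation (not taken from the codebase; the produce rate is an assumed example) of why lag grows without bound under the old defaults once production to the single hot partition exceeds the ~2MiB/s ceiling:

```python
# Simulate 60s of a fetch loop against one hot partition under the old defaults:
# each fetch waits the full 500ms (min fetch bytes can never be met by one
# partition capped at 1MiB) and then drains at most 1MiB.
MiB = 1024 * 1024
max_partition_bytes = 1 * MiB
max_wait_s = 0.5
produce_rate = 3 * MiB                             # bytes/s into the partition, above the ceiling

backlog = 0.0
for _ in range(int(60 / max_wait_s)):              # 60 seconds of fetches
    backlog += produce_rate * max_wait_s           # data produced while the fetch waits
    backlog -= min(backlog, max_partition_bytes)   # at most 1MiB drained per fetch
print(f"lag after 60s: {backlog / MiB:.1f} MiB")   # grows ~1 MiB/s and never recovers
```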
Backports Required
Release Notes