Change default max partition bytes in DR consumers #28417

ballard26 · 2025-11-07T01:21:05Z

Prior to this PR, the default max partition bytes was 1MiB, the default min fetch bytes was 5MiB, and the max wait was 500ms.

When only one partition on a shadow cluster shard was being produced to, these defaults limited our consumer throughput to 2MiB/s. A single partition could never satisfy the 5MiB min fetch bytes, so each fetch waited out the full 500ms and returned at most 1MiB from the one partition with new data. When the throughput to that partition exceeded 2MiB/s, lag between the source and shadow cluster would increase indefinitely.

Hence this commit sets max partition bytes equal to min fetch bytes to prevent this issue.
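The arithmetic behind the 2MiB/s figure can be sketched as follows. This is a simplified model of the fetch behavior described above, not the broker's or consumer's actual code; the function name `wait_bound_ceiling` and its parameters are hypothetical and exist only for illustration.

```python
MIB = 1024 * 1024


def wait_bound_ceiling(partition_max_bytes, fetch_min_bytes, max_wait_ms):
    """Throughput ceiling in bytes/s for one active partition, or None.

    Assumed model: if a single partition can never satisfy fetch_min_bytes,
    every fetch waits the full max_wait_ms and returns at most
    partition_max_bytes.
    """
    if partition_max_bytes >= fetch_min_bytes:
        # One partition can satisfy the fetch, so the wait is not the limit.
        return None
    return partition_max_bytes / (max_wait_ms / 1000)


# Old defaults: 1MiB per partition, 5MiB min fetch bytes, 500ms max wait.
old_ceiling = wait_bound_ceiling(1 * MIB, 5 * MIB, 500)
# New default: max partition bytes raised to match min fetch bytes.
new_ceiling = wait_bound_ceiling(5 * MIB, 5 * MIB, 500)

print(old_ceiling / MIB)  # 2.0 (MiB/s)
print(new_ceiling)        # None (no wait-imposed ceiling in this model)
```

Under this model, any sustained produce rate above the old ceiling leaves a positive difference that accumulates as lag.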

Backports Required

Release Notes

none

Copilot

Pull Request Overview

This PR addresses a throughput limitation in disaster recovery (DR) consumers by increasing the default maximum partition bytes from 1MiB to 5MiB. The change resolves a scenario where consumers were artificially limited to 2MiB/s throughput when consuming from a single active partition, caused by the mismatch between the 1MiB max partition bytes and 5MiB min fetch bytes settings combined with a 500ms max wait time.

Key Changes:

Increased default_fetch_partition_max_bytes from 1MiB to 5MiB to match the default min fetch bytes setting

ballard26 · 2025-11-07T01:22:25Z

Another option here is lowering the fetch min bytes to 1MiB. Either would prevent the issue from occurring.
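As a rough check that either option removes the bottleneck, the condition can be written as a predicate. This is a sketch under the same simplified assumption (a fetch waits out the full max wait whenever the active partitions together cannot reach min fetch bytes); `is_wait_bound` is a hypothetical helper, not a Redpanda API.

```python
MIB = 1024 * 1024


def is_wait_bound(active_partitions, partition_max_bytes, fetch_min_bytes):
    """True if the active partitions together cannot reach fetch_min_bytes."""
    return active_partitions * partition_max_bytes < fetch_min_bytes


# (max partition bytes, min fetch bytes) for each configuration.
configs = {
    "old defaults": (1 * MIB, 5 * MIB),
    "raise max partition bytes (this PR)": (5 * MIB, 5 * MIB),
    "lower fetch min bytes (alternative)": (1 * MIB, 1 * MIB),
}

# Evaluate each configuration with a single active partition.
results = {
    name: is_wait_bound(1, partition_max, fetch_min)
    for name, (partition_max, fetch_min) in configs.items()
}

for name, bound in results.items():
    print(f"{name}: wait-bound={bound}")
```

In this model the old defaults are wait-bound only when fewer than five partitions have new data, which matches the single-active-partition case described in the PR.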

vbotbuildovich · 2025-11-07T03:10:57Z

Retry command for Build#75796

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkBasicTests.test_create_default_link

ballard26 · 2025-11-07T03:44:53Z

tests/rptest/clients/admin/proto/redpanda/core/admin/v2/cluster_pb2.pyi

    ORDER_BY_FIELD_NUMBER: builtins.int
    page_size: builtins.int
-    'The maximum number of connections to return. If unspecified or 0, a\n    default value may be applied. Note that paging is currently not fully\n    supported, and this field only acts as a limit for the first page of data\n    returned. Subsequent pages of data cannot be requested.\n    '
+    'The maximum number of connections to return. If unspecified or 0, a\n    default value may be applied. The server may return fewer connections\n    than requested due to memory constraints; the limit is set to allow\n    listing all connections for a single broker. Consider filtering by\n    node_id to view connections for specific brokers. Note that paging is\n    currently not fully supported, and this field only acts as a limit for\n    the first page of data returned. Subsequent pages of data cannot be\n    requested.\n    '


I'm guessing someone forgot to run task rp:generate-ducktape-protos after modifying the cluster proto at some point. Happy to split this into a separate commit if folks want.

Should we add a note to proto/redpanda/README.md about running the command after updating a proto?

I will just add a GitHub Action to check this in another PR.

vbotbuildovich · 2025-11-07T05:47:06Z

Retry command for Build#75806

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkingMetricsTests.test_link_metrics

vbotbuildovich · 2025-11-07T05:54:44Z

CI test results

test results on build#75806

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
ShadowLinkingMetricsTests	test_link_metrics	null	integration	https://buildkite.com/redpanda/redpanda/builds/75806#019a5c86-e28c-43a9-af83-d3aab349fc6f	FAIL	0/21	The test has failed across all retries	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingMetricsTests&test_method=test_link_metrics
ShadowLinkingReplicationTests	test_replication_basic	{"shuffle_leadership": true, "source_cluster_spec": {"cluster_type": "redpanda"}}	integration	https://buildkite.com/redpanda/redpanda/builds/75806#019a5c8a-faff-4ae0-a7dc-22d63efd7c4a	FLAKY	17/21	upstream reliability is '95.91836734693877'. current run reliability is '80.95238095238095'. drift is 14.96599 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic

test results on build#75995

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
WriteCachingFailureInjectionTest	test_unavoidable_data_loss	null	integration	https://buildkite.com/redpanda/redpanda/builds/75995#019a7236-c4a7-4ded-9abf-b932222eca27	FLAKY	19/21	upstream reliability is '94.76510067114094'. current run reliability is '90.47619047619048'. drift is 4.28891 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionTest&test_method=test_unavoidable_data_loss

Prior to this commit, the default max partition bytes was 1MiB, the default min fetch bytes was 5MiB, and the max wait was 500ms. When only one partition on the shard was being produced to, these defaults limited our consumer throughput to 2MiB/s, since we could read only 1MiB every 500ms from the one partition with new data. Hence this commit sets max partition bytes equal to min fetch bytes to prevent this issue.

The metrics test checks for a non-zero lag value. Whether the metric is non-zero when the check occurs depends on whether all messages have been produced and whether the shadow cluster has caught up. This varies with how the test is run (i.e., in CDT things run much faster due to the greater resources). Before the default max partition bytes in the direct consumer was raised to 5MiB, the shadow cluster could consume at most 2MiB/s from the single-partition topic in the test, so more likely than not it was still catching up when the metric was checked. Now that the default max partition bytes is 5MiB, that throughput limit no longer applies, and the shadow cluster is more likely than not to have caught up by the time the metric is checked. This commit therefore produces far more messages than before, to ensure that the shadow cluster is still catching up when the lag metric is checked. This change also increases the test runtime from 1min to 2min.
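The timing argument can be illustrated with a small back-of-the-envelope model. All numbers below (data volume, consume rates, check time) are hypothetical and are not taken from the test; the model also assumes all data is already produced at time zero and that the shadow cluster consumes at a constant rate.

```python
MIB = 1024 * 1024


def lag_at_check(total_produced_bytes, consume_rate_bytes_per_s, check_after_s):
    """Bytes not yet replicated when the lag metric is checked."""
    consumed = min(total_produced_bytes, consume_rate_bytes_per_s * check_after_s)
    return total_produced_bytes - consumed


# Throttled consumer (old 2MiB/s ceiling): still far behind at check time.
slow = lag_at_check(100 * MIB, 2 * MIB, 10)
# Unthrottled consumer (hypothetical 50MiB/s): already caught up, lag is zero.
fast = lag_at_check(100 * MIB, 50 * MIB, 10)
# Same consumer with ten times the data: lag is non-zero again.
fast_more_data = lag_at_check(1000 * MIB, 50 * MIB, 10)

print(slow // MIB, fast // MIB, fast_more_data // MIB)  # 80 0 500
```

This is why producing more data, rather than checking earlier, keeps the non-zero lag assertion meaningful once the throughput ceiling is gone.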

ballard26 requested review from Copilot, michael-redpanda and mmaslankaprv November 7, 2025 01:21

github-actions bot added the area/redpanda label Nov 7, 2025

Copilot AI reviewed Nov 7, 2025

ballard26 force-pushed the dr-consumer-config-defaults branch from 97d02da to 58fac1e Compare November 7, 2025 03:42

ballard26 requested review from a team and rockwotj as code owners November 7, 2025 03:42

ballard26 commented Nov 7, 2025

rockwotj previously approved these changes Nov 7, 2025

mmaslankaprv previously approved these changes Nov 7, 2025

ballard26 dismissed stale reviews from mmaslankaprv and rockwotj via df99985 November 11, 2025 08:01

ballard26 requested a review from mmaslankaprv November 11, 2025 08:01

ballard26 force-pushed the dr-consumer-config-defaults branch from df99985 to 2ded4b0 Compare November 11, 2025 08:05

ballard26 added this to the v25.3.1-rc4 milestone Nov 11, 2025

ballard26 added 2 commits November 11, 2025 03:10

ballard26 force-pushed the dr-consumer-config-defaults branch from 2ded4b0 to 9921473 Compare November 11, 2025 08:10

ballard26 requested review from a team, kbatuigas and r-vasquez as code owners November 11, 2025 08:10

github-actions bot added the area/rpk label Nov 11, 2025

ballard26 removed request for a team and kbatuigas November 11, 2025 08:10

ballard26 removed the request for review from r-vasquez November 11, 2025 08:10

michael-redpanda approved these changes Nov 11, 2025

ballard26 merged commit 26855b0 into redpanda-data:dev Nov 11, 2025
29 checks passed
