@ballard26 ballard26 commented Nov 7, 2025

Prior to this PR, the defaults were 1MiB for max partition bytes, 5MiB for min fetch bytes, and 500ms for max wait ms.

When only one partition on the shadow cluster shard was being produced to, these defaults limited our consumer throughput to 2MiB/s, because we could read only 1MiB every 500ms from the one partition with new data. When the throughput to that partition exceeded 2MiB/s, lag between the source and shadow cluster increased indefinitely.

Hence this commit sets max partition bytes to equal min fetch bytes to prevent this issue.
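
To make the arithmetic concrete, here is a minimal sketch of the fetch behaviour described above. It assumes a simplified model in which a fetch is answered as soon as min fetch bytes can be collected and otherwise waits out the full max wait. The function name and parameters are illustrative only and are not Redpanda identifiers.

```python
MIB = 1024 * 1024


def single_partition_fetch_cap(max_partition_bytes, min_fetch_bytes, max_wait_ms):
    """Upper bound (bytes/s) on reads from one busy partition, or None if uncapped.

    Simplified model (an assumption, not the broker's actual code): a fetch
    returns immediately once min_fetch_bytes can be collected; otherwise it
    returns whatever is available after max_wait_ms.
    """
    if max_partition_bytes >= min_fetch_bytes:
        # One partition alone can satisfy min_fetch_bytes, so the fetch
        # never has to sit out the full wait.
        return None
    # At most max_partition_bytes come back per max_wait_ms interval.
    return max_partition_bytes * 1000 / max_wait_ms


old_cap = single_partition_fetch_cap(1 * MIB, 5 * MIB, 500)
new_cap = single_partition_fetch_cap(5 * MIB, 5 * MIB, 500)
print(old_cap / MIB)  # 2.0 (MiB/s) under the old defaults
print(new_cap)        # None: the wait no longer limits throughput
```

In this model the cap disappears whenever max partition bytes is at least min fetch bytes, whichever of the two settings is changed to get there.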

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • none


Copilot AI left a comment

Pull Request Overview

This PR addresses a throughput limitation in disaster recovery (DR) consumers by increasing the default maximum partition bytes from 1MiB to 5MiB. The change resolves a scenario where consumers were artificially limited to 2MiB/s throughput when consuming from a single active partition, caused by the mismatch between the 1MiB max partition bytes and 5MiB min fetch bytes settings combined with a 500ms max wait time.

Key Changes:

  • Increased default_fetch_partition_max_bytes from 1MiB to 5MiB to match the default min fetch bytes setting

@ballard26
Contributor Author

Another option here is lowering the fetch min bytes to 1MiB. Either would prevent the issue from occurring.

@vbotbuildovich
Collaborator

vbotbuildovich commented Nov 7, 2025

Retry command for Build#75796

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkBasicTests.test_create_default_link

@ballard26 ballard26 force-pushed the dr-consumer-config-defaults branch from 97d02da to 58fac1e Compare November 7, 2025 03:42
@ballard26 ballard26 requested review from a team and rockwotj as code owners November 7, 2025 03:42
ORDER_BY_FIELD_NUMBER: builtins.int
page_size: builtins.int
'The maximum number of connections to return. If unspecified or 0, a\n default value may be applied. Note that paging is currently not fully\n supported, and this field only acts as a limit for the first page of data\n returned. Subsequent pages of data cannot be requested.\n '
'The maximum number of connections to return. If unspecified or 0, a\n default value may be applied. The server may return fewer connections\n than requested due to memory constraints; the limit is set to allow\n listing all connections for a single broker. Consider filtering by\n node_id to view connections for specific brokers. Note that paging is\n currently not fully supported, and this field only acts as a limit for\n the first page of data returned. Subsequent pages of data cannot be\n requested.\n '
Contributor Author

I'm guessing someone forgot to run task rp:generate-ducktape-protos after modifying the cluster proto at some point. Happy to split this into a separate commit if folks want.

Contributor Author

Should we add a note to proto/redpanda/README.md about running the command after updating a proto?

Contributor

I will just add a GitHub Action to check this in another PR
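
As an illustration of what such a check might do, here is a minimal sketch of the logic only, written in Python rather than as a workflow file. It assumes the generator command quoted earlier in this thread is run from the repository root; the function name and the injectable `run` parameter are hypothetical and not part of the repository.

```python
import subprocess


def generated_protos_are_stale(run=subprocess.run):
    """Regenerate the stubs, then report whether the working tree changed."""
    # Command name taken from the thread above; assumes `task` is installed
    # and the current directory is the repository root.
    run(["task", "rp:generate-ducktape-protos"], check=True)
    # `git status --porcelain` prints one line per modified or untracked file,
    # so any output means the committed stubs were out of date.
    status = run(
        ["git", "status", "--porcelain"],
        check=True,
        capture_output=True,
        text=True,
    )
    return bool(status.stdout.strip())
```

A CI job could call this function and fail the build when it returns True.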

rockwotj
rockwotj previously approved these changes Nov 7, 2025
@vbotbuildovich
Collaborator

Retry command for Build#75806

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkingMetricsTests.test_link_metrics

@vbotbuildovich
Collaborator

vbotbuildovich commented Nov 7, 2025

CI test results

test results on build#75806
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ShadowLinkingMetricsTests | test_link_metrics | null | integration | https://buildkite.com/redpanda/redpanda/builds/75806#019a5c86-e28c-43a9-af83-d3aab349fc6f | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingMetricsTests&test_method=test_link_metrics |
| ShadowLinkingReplicationTests | test_replication_basic | {"shuffle_leadership": true, "source_cluster_spec": {"cluster_type": "redpanda"}} | integration | https://buildkite.com/redpanda/redpanda/builds/75806#019a5c8a-faff-4ae0-a7dc-22d63efd7c4a | FLAKY | 17/21 | upstream reliability is '95.91836734693877'. current run reliability is '80.95238095238095'. drift is 14.96599 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic |
test results on build#75995
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WriteCachingFailureInjectionTest | test_unavoidable_data_loss | null | integration | https://buildkite.com/redpanda/redpanda/builds/75995#019a7236-c4a7-4ded-9abf-b932222eca27 | FLAKY | 19/21 | upstream reliability is '94.76510067114094'. current run reliability is '90.47619047619048'. drift is 4.28891 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionTest&test_method=test_unavoidable_data_loss |

mmaslankaprv
mmaslankaprv previously approved these changes Nov 7, 2025
@ballard26 ballard26 dismissed stale reviews from mmaslankaprv and rockwotj via df99985 November 11, 2025 08:01
@ballard26 ballard26 force-pushed the dr-consumer-config-defaults branch from df99985 to 2ded4b0 Compare November 11, 2025 08:05
@ballard26 ballard26 added this to the v25.3.1-rc4 milestone Nov 11, 2025
Prior to this commit, the defaults were 1MiB for max partition bytes,
5MiB for min fetch bytes, and 500ms for max wait ms.

When only one partition on the shard was being produced to, these
defaults limited our consumer throughput to 2MiB/s, because we could
read only 1MiB every 500ms from the one partition with new data.

Hence this commit sets max partition bytes to equal min fetch bytes to
prevent this issue.
The metrics test checks for a non-zero value for lag. Whether the metric
is non-zero by the time the check occurs depends on whether all messages
have been produced and whether the shadow cluster has caught up. This may
not always be the case depending on how the test is run (i.e., in CDT
things will run much faster due to the greater resources).

Before the default max partition bytes was raised to 5MiB in the direct
consumer, the shadow cluster could consume at most 1MiB/s from the
single-partition topic in the test. This meant that, more likely than
not, the shadow cluster would still be catching up by the time the
metric was checked. However, now that the default max partition bytes is
5MiB, that throughput limit no longer applies and the shadow cluster is
more likely than not to be caught up by the time the metric is checked.

This commit increases the messages being produced to be much higher than
before to ensure that the shadow cluster is still catching up by the
time the lag metric is checked. This change also increases the test
runtime from 1min to 2min.
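
The timing argument can be illustrated with a rough calculation. This is
a sketch under simplifying assumptions (constant rates, all data produced
up front, a single partition); the function and the figures in the usage
lines are hypothetical and are not taken from the test.

```python
MIB = 1024 * 1024


def lag_bytes_at_check(produced_bytes, consume_rate_bytes_per_s, check_after_s):
    """Bytes still unreplicated when the lag metric is read.

    Assumes everything was produced before replication started and that
    the shadow cluster consumes at a constant rate.
    """
    consumed = min(produced_bytes, consume_rate_bytes_per_s * check_after_s)
    return produced_bytes - consumed


# Hypothetical numbers: metric read 5s after replication starts.
slow = lag_bytes_at_check(20 * MIB, 2 * MIB, 5)    # capped consumer: still behind
fast = lag_bytes_at_check(20 * MIB, 50 * MIB, 5)   # uncapped consumer: caught up
more = lag_bytes_at_check(500 * MIB, 50 * MIB, 5)  # more data: behind again
print(slow // MIB, fast // MIB, more // MIB)  # 10 0 250
```

With a faster consumer, lag at check time stays non-zero only if enough
data is produced that it cannot all be consumed before the check.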
@ballard26 ballard26 force-pushed the dr-consumer-config-defaults branch from 2ded4b0 to 9921473 Compare November 11, 2025 08:10
@ballard26 ballard26 requested review from a team, kbatuigas and r-vasquez as code owners November 11, 2025 08:10
@ballard26 ballard26 removed request for a team and kbatuigas November 11, 2025 08:10
@ballard26 ballard26 removed the request for review from r-vasquez November 11, 2025 08:10
@ballard26 ballard26 merged commit 26855b0 into redpanda-data:dev Nov 11, 2025
29 checks passed