@andrwng andrwng commented Nov 6, 2025

We've seen a shutdown hang that appears to be caused by a datalake translator getting stuck repeatedly trying to query the schema registry.

This PR attempts to fix the hang by moving schema registry client shutdown ahead of datalake shutdown.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • None

Copilot AI review requested due to automatic review settings November 6, 2025 02:19

Copilot AI left a comment

Pull Request Overview

This PR addresses a shutdown hang caused by datalake translators repeatedly attempting to query the schema registry during shutdown. The fix ensures the schema registry is stopped before datalake subsystems begin their shutdown sequence.

Key Changes:

  • Schema registry shutdown is now invoked earlier in the shutdown sequence, before datalake services
  • Schema registry lifecycle management is modified to support early shutdown while maintaining proper destruction order
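
To make the ordering argument concrete, here is a minimal single-threaded sketch of why stopping the client first lets a retry loop finish. Everything in it is an assumption made for illustration: `fake_schema_client`, `fake_translator`, `lookup_result`, and the `max_attempts` cap are hypothetical names, not Redpanda or Seastar APIs, and the cap merely stands in for "would hang".

```cpp
#include <cassert>

// Hypothetical stand-ins; none of these names come from the Redpanda codebase.
enum class lookup_result { unavailable, shutting_down };

class fake_schema_client {
public:
    // The registry is never reachable in this sketch. A stopped client
    // returns a terminal error instead of a retriable one.
    lookup_result get_schema() const {
        return _stopped ? lookup_result::shutting_down
                        : lookup_result::unavailable;
    }
    void stop() { _stopped = true; }

private:
    bool _stopped{false};
};

class fake_translator {
public:
    explicit fake_translator(const fake_schema_client& client)
      : _client(client) {}

    // Models waiting for the in-flight retry loop during shutdown. Returns
    // true if the loop ended, false if it was still retrying after
    // max_attempts (a stand-in for the hang).
    bool stop(int max_attempts) const {
        assert(max_attempts > 0);
        for (int i = 0; i < max_attempts; ++i) {
            if (_client.get_schema() == lookup_result::shutting_down) {
                return true;
            }
        }
        return false;
    }

private:
    const fake_schema_client& _client;
};

// Old order: translator first, while the client is still live.
bool shutdown_translator_first() {
    fake_schema_client client;
    fake_translator translator(client);
    bool finished = translator.stop(1000);
    client.stop();
    return finished;
}

// New order: client first, so the retry loop sees a terminal error.
bool shutdown_client_first() {
    fake_schema_client client;
    fake_translator translator(client);
    client.stop();
    return translator.stop(1000);
}
```

Under these assumptions, `shutdown_translator_first()` returns `false` because the loop never sees a terminal error, while `shutdown_client_first()` returns `true` on the first attempt.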

Comment on lines 377 to 380
if (_schema_registry) {
    stop_service(*_schema_registry, "pandaproxy::schema_registry::api");
}

Copilot AI Nov 6, 2025

[nitpick] Adding early shutdown logic for schema_registry while keeping its destruction in the normal order creates a split lifecycle pattern that could be confusing. Consider documenting this pattern more prominently (e.g., in a comment near the _schema_registry member declaration) to prevent future developers from inadvertently breaking this ordering.
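
The split lifecycle described in this comment can be sketched roughly as follows. This is an illustrative model only: the `service` class, the event log, and the assumed initialization order (registry before datalake) are hypothetical and do not reflect the actual application code.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical service type; records lifecycle events into a shared log.
class service {
public:
    service(std::string name, std::vector<std::string>& log)
      : _name(std::move(name)), _log(log) {}

    ~service() {
        assert(_stopped); // destroying a running service would be a bug
        _log.push_back("destroy " + _name);
    }

    // Idempotent, so the early out-of-band stop and the regular pass can
    // both call it safely.
    void stop() {
        if (!_stopped) {
            _stopped = true;
            _log.push_back("stop " + _name);
        }
    }

private:
    std::string _name;
    std::vector<std::string>& _log;
    bool _stopped{false};
};

std::vector<std::string> run_split_lifecycle() {
    std::vector<std::string> log;
    {
        // Assumed initialization order: registry first, then datalake.
        std::vector<std::unique_ptr<service>> services;
        services.push_back(std::make_unique<service>("schema_registry", log));
        services.push_back(std::make_unique<service>("datalake", log));

        // Out-of-band stop ahead of the dependent subsystem.
        services[0]->stop();

        // Regular pass in reverse initialization order; the registry's
        // second stop() is a no-op.
        for (auto it = services.rbegin(); it != services.rend(); ++it) {
            (*it)->stop();
        }

        // Destruction still happens in reverse initialization order.
        while (!services.empty()) {
            services.pop_back();
        }
    }
    return log;
}
```

In this model the log reads `stop schema_registry`, `stop datalake`, `destroy datalake`, `destroy schema_registry`: only the stop is reordered, while destruction follows the usual order.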

Copilot uses AI. Check for mistakes.
Comment on lines 1364 to 1366
// NOTE: we'll stop the schema registry out of band before some other
// subsystems, but destruct it in the "normal" order as it was
// initialized.
Copilot AI Nov 6, 2025

[nitpick] The comment should specify which subsystems require schema registry to be stopped first (specifically mentioning datalake subsystems) to make the reasoning clearer and help future maintainers understand the ordering constraint.

Suggested change
- // NOTE: we'll stop the schema registry out of band before some other
- // subsystems, but destruct it in the "normal" order as it was
- // initialized.
+ // NOTE: we'll stop the schema registry out of band before certain
+ // subsystems that depend on it, specifically datalake-related subsystems
+ // such as data transforms and WASM. This ensures proper shutdown ordering
+ // and avoids issues with dangling references. The schema registry is
+ // destructed in the "normal" order as it was initialized.

@vbotbuildovich

CI test results

test results on build#75717
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| ReplicatedMetastoreTest | TestBasicRemoveTopics | | unit | https://buildkite.com/redpanda/redpanda/builds/75717#019a56f6-771a-46ea-b09c-108fe98aad9a | FAIL | 0/1 | | |
| ShadowLinkingReplicationTests | test_topic_delete | {"source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}} | integration | https://buildkite.com/redpanda/redpanda/builds/75717#019a5716-f6b5-4a81-9657-1801612e4af1 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_topic_delete |
| NodePostRestartProbeTest | post_restart_probe_test | null | integration | https://buildkite.com/redpanda/redpanda/builds/75717#019a5716-f6b8-4539-96b6-e5dffd08529b | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodePostRestartProbeTest&test_method=post_restart_probe_test |
| SIPartitionMovementTest | test_shadow_indexing | {"cloud_storage_type": 2, "num_to_upgrade": 2, "with_cloud_topics": false} | integration | https://buildkite.com/redpanda/redpanda/builds/75717#019a5716-f6b7-410d-9827-03509d5e493c | FLAKY | 20/21 | upstream reliability is '98.86363636363636'. current run reliability is '95.23809523809523'. drift is 3.62554 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_shadow_indexing |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": true} | integration | https://buildkite.com/redpanda/redpanda/builds/75717#019a5716-f6be-4810-b0ad-293479e3b5c5 | FLAKY | 18/21 | upstream reliability is '90.19384264538198'. current run reliability is '85.71428571428571'. drift is 4.47956 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |

@andrwng andrwng force-pushed the schema-registry-shutdown branch from 93a572c to 88e7917 Compare November 6, 2025 08:03
@andrwng andrwng changed the title from "WIP application: stop schema_registry before datalake subsystems" to "application: stop schema_registry clients before datalake subsystems" Nov 6, 2025
We've seen a shutdown hang that appears to be caused by a datalake
translator getting stuck repeatedly trying to query the schema registry.

This attempts to fix this by moving schema registry client shutdown to
before datalake shutdown.
@andrwng andrwng force-pushed the schema-registry-shutdown branch from 88e7917 to e96b3ae Compare November 10, 2025 20:52