
Handle stale fetch streams after worker restart #3632

Closed
mzhou-oai wants to merge 10 commits into apache:main from
mzhou-oai:dev/mzhou/dpcdi-3260-stale-stream-reopen-upstream

Conversation


mzhou-oai commented Mar 17, 2026

Problem Statement

A worker restart can leave an in-flight reducer holding a stale streamId even though the worker comes back on the same stable hostname and still has the shuffle data on disk. In that case the worker correctly reports:

  • Stream <id> is not registered with worker. This can happen if the worker was restart recently.

That is a restart-specific condition, not necessarily a hard worker failure. Live stream registrations are process-local and are not reconstructed by recoverPath, so the old streamId cannot be used after the worker process comes back.

Before this change, CelebornInputStream treated that response like a generic fetch failure. It excluded the worker, consumed normal retry budget and backoff, and entered the usual peer-failover or retry path instead of reopening the stream on the same worker. As a result, a Celeborn worker restart can strand active Spark tasks in Celeborn fetch retry loops even when the worker is already back and the shuffle data is still available.

Proposal

This change makes stale-stream handling explicit in the client retry path:

  • mark the stale-stream case with a stable coded failure marker on the existing ChunkFetchFailure error string
  • parse that marker into ChunkFetchFailureException and treat it as the structured signal for same-worker reopen
  • keep a fallback to the legacy human-readable message so new clients still recover from older workers that do not send the code yet
  • do not classify that specific failure as a critical fetch cause
  • do not exclude the restarted worker before retrying
  • recreate the reader on the same PartitionLocation with pbStreamHandler = null so the client issues a fresh OPEN_STREAM
  • reuse checkpoint metadata so already returned chunks are skipped on the reopened stream
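The marker-and-fallback logic in the first three bullets can be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: `StaleStreamMarker`, `STALE_STREAM_CODE`, `tagStaleStream`, `isStaleStream`, and the marker format are all hypothetical names.

```java
// Illustrative sketch of the coded-marker scheme described above (names and
// the marker format are assumptions, not Celeborn's actual identifiers).
class StaleStreamMarker {
  // Stable code the worker would append to the ChunkFetchFailure error string.
  static final String STALE_STREAM_CODE = "[STALE_STREAM]";
  // Legacy human-readable fragment emitted by older workers without the code.
  static final String LEGACY_FRAGMENT = "is not registered with worker";

  // Worker side: tag the failure message with the stable marker.
  static String tagStaleStream(String message) {
    return STALE_STREAM_CODE + " " + message;
  }

  // Client side: prefer the structured marker, fall back to the legacy
  // message so new clients still recover against older workers.
  static boolean isStaleStream(String errorString) {
    return errorString != null
        && (errorString.contains(STALE_STREAM_CODE)
            || errorString.contains(LEGACY_FRAGMENT));
  }
}
```

When the check matches, the remaining bullets apply: skip worker exclusion and recreate the reader on the same PartitionLocation with a null stream handle so a fresh OPEN_STREAM is issued.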

Validation

  • mvn -pl client -am -DskipTests compile
  • mvn -pl common -am -Dtest=ExceptionUtilsSuiteJ,TransportResponseHandlerSuiteJ test
  • Verified the fix end-to-end on a running Celeborn cluster in Kubernetes, including a TPC-DS 3 TB benchmark run.

(cherry picked from commit 8664706461c398ab48b541075a7ee11e2717a155)
Contributor

@eolivelli eolivelli left a comment


Can you please add an end-to-end test with a worker that is restarted?

```java
Throwable current = throwable;
while (current != null) {
  String message = current.getMessage();
  if (message != null && message.contains("is not registered with worker")) {
```
Contributor

This check is too fragile; we need a better way.
Can we add some error code?
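For comparison, a minimal sketch of what an error-code-based signal could look like if a code were carried on the exception type itself; all names here (`CodedFetchFailure`, `STALE_STREAM`) are hypothetical, and per the PR description the change ultimately encodes the code into the error string instead.

```java
// Illustrative alternative: carry a stable error code on the exception and
// walk the cause chain for it, instead of matching a free-form substring.
// CodedFetchFailure and STALE_STREAM are hypothetical names.
class CodedFetchFailure extends RuntimeException {
  static final int STALE_STREAM = 1;
  final int errorCode;

  CodedFetchFailure(int errorCode, String message) {
    super(message);
    this.errorCode = errorCode;
  }

  // Walk the cause chain looking for the stale-stream code.
  static boolean isStaleStream(Throwable throwable) {
    for (Throwable current = throwable; current != null; current = current.getCause()) {
      if (current instanceof CodedFetchFailure
          && ((CodedFetchFailure) current).errorCode == STALE_STREAM) {
        return true;
      }
    }
    return false;
  }
}
```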

Author

OK, I just made an attempt: ChunkFetchFailureException doesn't carry an error code, and it's hard to wire one through the RPC layer. See the commit history for the attempts.

Author

@eolivelli Please let me know which revision is preferred


mzhou-oai commented Mar 17, 2026

Can you please add an end-to-end test with a worker that is restarted?

Yes, it's verified end-to-end; let me update the PR body to mention that.

@mzhou-oai mzhou-oai closed this Mar 25, 2026

2 participants