[CELEBORN-2291] Support fsync on commit to ensure shuffle data durability#3635

Open
kaybhutani wants to merge 2 commits into apache:main from kaybhutani:kartikay/fsync-on-commit

kaybhutani commented Mar 24, 2026

What changes were proposed in this pull request?

Add a new configuration, celeborn.worker.commitFiles.fsync (default false). When enabled, LocalTierWriter.closeStreams() calls FileChannel.force(false) (a data-only sync, comparable to fdatasync) before closing the channel.
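
For readers unfamiliar with the pattern, here is a minimal standalone sketch of forcing data to disk before closing a channel. It is not the code in this PR: the class FsyncOnCommitSketch, the method writeAndCommit, and the fsyncOnCommit parameter are hypothetical names used only for illustration, and the real change lives in LocalTierWriter.closeStreams().

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FsyncOnCommitSketch {

    // Hypothetical helper (not Celeborn code): writes data to a temp file,
    // optionally forces it to the storage device, then closes the channel.
    static Path writeAndCommit(byte[] data, boolean fsyncOnCommit) throws IOException {
        Path file = Files.createTempFile("shuffle-sketch", ".data");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            // A single write() call may not consume the whole buffer.
            while (buf.hasRemaining()) {
                channel.write(buf);
            }
            if (fsyncOnCommit) {
                // force(false) flushes file content but does not require
                // metadata updates to be flushed as well.
                channel.force(false);
            }
        } // try-with-resources closes the channel after the optional force
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path file = writeAndCommit("shuffle bytes".getBytes(StandardCharsets.UTF_8), true);
        System.out.println("Committed " + Files.size(file) + " bytes to " + file);
        Files.delete(file);
    }
}
```

The force call has to come before close, because closing a channel alone gives no guarantee that the data has left the OS page cache.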

Why are the changes needed?

Without this, committed shuffle data can sit in the OS page cache before the kernel flushes it to disk. A hard crash in that window loses data even though Celeborn considers it committed. This option lets operators opt into stronger durability guarantees.

Does this PR resolve a correctness bug?

No. It adds an optional durability enhancement.

Does this PR introduce any user-facing change?

Yes. New configuration key celeborn.worker.commitFiles.fsync (boolean, default false).
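
As an illustration, an operator who wants the stronger guarantee could opt in on each worker with an entry like the one below. This assumes the setting is placed in the worker's celeborn-defaults.conf using the usual whitespace-separated key/value form. Only the key name and its default come from this PR.

```
celeborn.worker.commitFiles.fsync true
```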

How was this patch tested?

Existing unit tests. The new configuration was verified via ConfigurationSuite. For LocalTierWriter, a new test with fsync enabled was added and TierWriterSuite was run.

Additional context: slack
