Skip to content

bcachefs: don't hold srcu lock across blocking operations#1069

Open
matthiasgoergens wants to merge 1 commit intokoverstreet:masterfrom
matthiasgoergens:fix/drop-srcu-pr
Open

bcachefs: don't hold srcu lock across blocking operations#1069
matthiasgoergens wants to merge 1 commit intokoverstreet:masterfrom
matthiasgoergens:fix/drop-srcu-pr

Conversation

@matthiasgoergens
Copy link

@matthiasgoergens matthiasgoergens commented Feb 27, 2026

Audit and fix 25 sites where bch2_trans_unlock() is used before a
potentially long-blocking operation, leaving the SRCU read lock held.
Under memory pressure this creates a reclaim deadlock: the shrinker
needs an SRCU grace period to free old btree nodes, but the grace period
can't complete because we're holding the read lock.

The fix at each site is to use bch2_trans_unlock_long() instead. Also
introduces a drop_locks_long_do() macro — the SRCU-dropping variant of
drop_locks_do() — and converts the callers that pass blocking
operations (mutex_lock, wait_event, journal reservation, etc.).

Also cleans up bch2_fsck_ask_yn(): the deferred unlock_long_at
workaround is no longer needed since we drop SRCU immediately before
waiting for user input.

Triggered by a full system deadlock on a 128 GB machine with bcachefs as
root on tiered SSD+HDD storage, where the reconcile thread held SRCU for
13+ seconds during GFP_KERNEL bio allocation in
bch2_data_update_init().

15 files changed, 25 sites fixed, net zero lines changed.

Investigation notes, full audit, GitHub issue survey, and a QEMU-based
reproducer are on the stacked branch.

Related: #934
Related: #936
Related: #882
Related: #636
Related: #605
Related: #826
Related: #811
Related: #1016

Copilot AI review requested due to automatic review settings February 27, 2026 17:13
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request addresses a critical deadlock issue in bcachefs where SRCU read locks were held across long-blocking operations, preventing memory reclaim under pressure. Under low memory conditions, the shrinker needs SRCU grace periods to complete before freeing old btree nodes, but these grace periods cannot complete while read locks are held, creating a deadlock.

Changes:

  • Introduces drop_locks_long_do() macro as an SRCU-dropping variant of drop_locks_do()
  • Systematically replaces bch2_trans_unlock() with bch2_trans_unlock_long() at 25 sites before blocking operations
  • Simplifies bch2_fsck_ask_yn() by removing deferred unlock logic since SRCU is now dropped immediately

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.

Show a summary per file
File Description
fs/bcachefs/btree/iter.h Adds drop_locks_long_do() macro for SRCU-aware lock dropping
fs/bcachefs/vfs/fs.c Drops SRCU before waiting on freeing inode (up to 10 seconds)
fs/bcachefs/vfs/buffered.c Drops SRCU before bio allocation with GFP_KERNEL
fs/bcachefs/init/error.c Drops SRCU before waiting for user input; simplifies timeout logic
fs/bcachefs/debug/tests.c Drops SRCU before journal flush
fs/bcachefs/data/write.c Drops SRCU before nocow bucket locking
fs/bcachefs/data/update.c Drops SRCU before closure_sync (2 sites) and GFP_KERNEL allocation
fs/bcachefs/data/migrate.c Drops SRCU before btree interior updates flush
fs/bcachefs/data/ec/io.c Drops SRCU before stripe buffer allocation
fs/bcachefs/btree/read.c Drops SRCU before btree node read IO
fs/bcachefs/btree/locking.c Drops SRCU in mutex lock helper
fs/bcachefs/btree/interior.c Drops SRCU before journal wait, gc.lock, btree interior updates (5 sites)
fs/bcachefs/btree/commit.c Drops SRCU before journal reservation, GFP_KERNEL allocation, journal reclaim wait
fs/bcachefs/btree/cache.c Drops SRCU before GFP_KERNEL allocations and btree node read (4 sites)
fs/bcachefs/alloc/foreground.c Drops SRCU before mutex lock

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Audit and fix 25 sites where bch2_trans_unlock() (which only drops btree
locks) is used before a potentially long-blocking operation, leaving the
SRCU read lock held.  If that blocking operation triggers page reclaim,
and the shrinker needs to free old btree nodes, it must wait for the
SRCU grace period -- which can't complete because we're holding the read
lock.  This creates a reclaim deadlock under memory pressure.

The fix at each site is to use bch2_trans_unlock_long() (which also
drops the SRCU read lock) instead of bch2_trans_unlock().

Also introduce a drop_locks_long_do() macro -- the SRCU-dropping variant
of drop_locks_do() -- and convert the callers that pass blocking
operations (mutex_lock, wait_event, journal reservation, etc.).

Also clean up bch2_fsck_ask_yn(): the deferred unlock_long workaround
(wait 2s then upgrade to unlock_long) is no longer needed since we now
drop SRCU immediately before waiting for user input.

Sites fixed:

  btree/cache.c        4  GFP_KERNEL allocs, sync I/O, wait_on_bit_io
  btree/commit.c       3  GFP_KERNEL alloc, journal reclaim wait, journal res
  btree/interior.c     5  mutex, closure_sync, wait_event, down_read
  btree/locking.c      1  mutex_lock
  btree/read.c         1  sync disk I/O
  alloc/foreground.c   1  mutex_lock
  data/update.c        3  GFP_KERNEL bio alloc, closure_sync (x2)
  data/ec/io.c         1  kvmalloc + multi-device I/O
  data/migrate.c       1  closure_wait_event for in-flight writes
  data/write.c         1  nocow lock + GFP_KERNEL
  vfs/buffered.c       1  filemap_alloc_folio + GFP_KERNEL
  vfs/fs.c             1  __wait_on_freeing_inode
  init/error.c         1  user input wait (+ cleanup)
  debug/tests.c        1  journal flush

Reported on a 128 GB desktop (bcachefs root, tiered SSD+HDD): the
reconcile thread held SRCU for 13+ seconds during GFP_KERNEL bio
allocation in bch2_data_update_init(), blocking memory reclaim and
causing a full system deadlock.

Related: koverstreet#934
Related: koverstreet#636

Signed-off-by: Matthias Goergens <matthias.goergens@gmail.com>
@matthiasgoergens
Copy link
Author

I'm using the fixes from this branch together with the swap file support from the other PR on m desktop now. It's working well, and performance is great. Large background writes don't block my foreground activity anymore.

@koverstreet
Copy link
Owner

The approach of blanket-replacing bch2_trans_unlock() with bch2_trans_unlock_long() at all 25 sites is too aggressive. unlock_long() drops the SRCU read lock, which forces a full transaction restart on relock. In the common case, the blocking operation completes quickly and you don't need to drop SRCU at all.

What we want instead is a two-phase approach:

  1. bch2_trans_unlock() first (drops btree locks, keeps SRCU)
  2. Do the blocking operation
  3. Only if we're still blocked after a couple seconds, escalate to bch2_trans_unlock_long() to let SRCU grace periods complete

This way the fast path doesn't pay the transaction restart cost, but the pathological case (blocked for 13+ seconds during a writeback storm) still resolves the SRCU deadlock.

Ironically, bch2_fsck_ask_yn() already had something like this with its unlock_long_at = jiffies + HZ * 2 pattern — the PR removes it and calls it a cleanup, but it was actually the right idea, just with a clunky implementation. A drop_locks_do variant that starts with unlock() and escalates to unlock_long() after a timeout would be the clean version of that pattern.

— ProofOfConcept

@koverstreet
Copy link
Owner

I'm also watching the pull requests too - can give me a shout if anything's unclear

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants