bcachefs: don't hold srcu lock across blocking operations #1069
matthiasgoergens wants to merge 1 commit into koverstreet:master
Conversation
Pull request overview
This pull request addresses a critical deadlock issue in bcachefs where SRCU read locks were held across long-blocking operations, preventing memory reclaim under pressure. Under low memory conditions, the shrinker needs SRCU grace periods to complete before freeing old btree nodes, but these grace periods cannot complete while read locks are held, creating a deadlock.
Changes:
- Introduces drop_locks_long_do() macro as an SRCU-dropping variant of drop_locks_do()
- Systematically replaces bch2_trans_unlock() with bch2_trans_unlock_long() at 25 sites before blocking operations
- Simplifies bch2_fsck_ask_yn() by removing deferred unlock logic since SRCU is now dropped immediately
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| fs/bcachefs/btree/iter.h | Adds drop_locks_long_do() macro for SRCU-aware lock dropping |
| fs/bcachefs/vfs/fs.c | Drops SRCU before waiting on freeing inode (up to 10 seconds) |
| fs/bcachefs/vfs/buffered.c | Drops SRCU before bio allocation with GFP_KERNEL |
| fs/bcachefs/init/error.c | Drops SRCU before waiting for user input; simplifies timeout logic |
| fs/bcachefs/debug/tests.c | Drops SRCU before journal flush |
| fs/bcachefs/data/write.c | Drops SRCU before nocow bucket locking |
| fs/bcachefs/data/update.c | Drops SRCU before closure_sync (2 sites) and GFP_KERNEL allocation |
| fs/bcachefs/data/migrate.c | Drops SRCU before btree interior updates flush |
| fs/bcachefs/data/ec/io.c | Drops SRCU before stripe buffer allocation |
| fs/bcachefs/btree/read.c | Drops SRCU before btree node read IO |
| fs/bcachefs/btree/locking.c | Drops SRCU in mutex lock helper |
| fs/bcachefs/btree/interior.c | Drops SRCU before journal wait, gc.lock, btree interior updates (5 sites) |
| fs/bcachefs/btree/commit.c | Drops SRCU before journal reservation, GFP_KERNEL allocation, journal reclaim wait |
| fs/bcachefs/btree/cache.c | Drops SRCU before GFP_KERNEL allocations and btree node read (4 sites) |
| fs/bcachefs/alloc/foreground.c | Drops SRCU before mutex lock |
Audit and fix 25 sites where bch2_trans_unlock() (which only drops btree locks) is used before a potentially long-blocking operation, leaving the SRCU read lock held. If that blocking operation triggers page reclaim, and the shrinker needs to free old btree nodes, it must wait for the SRCU grace period -- which can't complete because we're holding the read lock. This creates a reclaim deadlock under memory pressure.

The fix at each site is to use bch2_trans_unlock_long() (which also drops the SRCU read lock) instead of bch2_trans_unlock(). Also introduce a drop_locks_long_do() macro -- the SRCU-dropping variant of drop_locks_do() -- and convert the callers that pass blocking operations (mutex_lock, wait_event, journal reservation, etc.).

Also clean up bch2_fsck_ask_yn(): the deferred unlock_long workaround (wait 2s then upgrade to unlock_long) is no longer needed since we now drop SRCU immediately before waiting for user input.

Sites fixed:

| File | Sites | Blocking operations |
|---|---|---|
| btree/cache.c | 4 | GFP_KERNEL allocs, sync I/O, wait_on_bit_io |
| btree/commit.c | 3 | GFP_KERNEL alloc, journal reclaim wait, journal res |
| btree/interior.c | 5 | mutex, closure_sync, wait_event, down_read |
| btree/locking.c | 1 | mutex_lock |
| btree/read.c | 1 | sync disk I/O |
| alloc/foreground.c | 1 | mutex_lock |
| data/update.c | 3 | GFP_KERNEL bio alloc, closure_sync (x2) |
| data/ec/io.c | 1 | kvmalloc + multi-device I/O |
| data/migrate.c | 1 | closure_wait_event for in-flight writes |
| data/write.c | 1 | nocow lock + GFP_KERNEL |
| vfs/buffered.c | 1 | filemap_alloc_folio + GFP_KERNEL |
| vfs/fs.c | 1 | __wait_on_freeing_inode |
| init/error.c | 1 | user input wait (+ cleanup) |
| debug/tests.c | 1 | journal flush |

Reported on a 128 GB desktop (bcachefs root, tiered SSD+HDD): the reconcile thread held SRCU for 13+ seconds during GFP_KERNEL bio allocation in bch2_data_update_init(), blocking memory reclaim and causing a full system deadlock.

Related: koverstreet#934
Related: koverstreet#636

Signed-off-by: Matthias Goergens <matthias.goergens@gmail.com>
Force-pushed from bcd1641 to e7a3376
I'm using the fixes from this branch together with the swap file support from the other PR on my desktop now. It's working well, and performance is great. Large background writes don't block my foreground activity anymore.
The approach of blanket-replacing bch2_trans_unlock() with bch2_trans_unlock_long() means every call pays the transaction restart cost, even when nothing blocks. What we want instead is a two-phase approach: drop only the btree locks up front, and upgrade to dropping SRCU only when the operation actually blocks for a long time.

This way the fast path doesn't pay the transaction restart cost, but the pathological case (blocked for 13+ seconds during a writeback storm) still resolves the SRCU deadlock.

— ProofOfConcept
I'm also watching the pull requests - give me a shout if anything's unclear.
Audit and fix 25 sites where bch2_trans_unlock() is used before a potentially long-blocking operation, leaving the SRCU read lock held.

Under memory pressure this creates a reclaim deadlock: the shrinker needs an SRCU grace period to free old btree nodes, but the grace period can't complete because we're holding the read lock.

The fix at each site is to use bch2_trans_unlock_long() instead. Also introduces a drop_locks_long_do() macro — the SRCU-dropping variant of drop_locks_do() — and converts the callers that pass blocking operations (mutex_lock, wait_event, journal reservation, etc.).

Also cleans up bch2_fsck_ask_yn(): the deferred unlock_long_at workaround is no longer needed since we drop SRCU immediately before waiting for user input.

Triggered by a full system deadlock on a 128 GB machine with bcachefs as root on tiered SSD+HDD storage, where the reconcile thread held SRCU for 13+ seconds during GFP_KERNEL bio allocation in bch2_data_update_init().

15 files changed, 25 sites fixed, net zero lines changed.
Investigation notes, full audit, GitHub issue survey, and a QEMU-based
reproducer are on the stacked branch.
Related: #934
Related: #936
Related: #882
Related: #636
Related: #605
Related: #826
Related: #811
Related: #1016