bcachefs: implement online filesystem shrinking by jullanggit · Pull Request #1073 · koverstreet/bcachefs

jullanggit · 2026-03-02T19:58:00Z

Implement online filesystem shrinking through reconcile. Closes #781 once done.
This is hopefully complementary to #1070, which targets offline shrink.

Goal

A robust online shrinking implementation that automatically resumes after restarts and crashes, since shrinking is a potentially long-running operation, and that supports changing the target size mid-shrink.

Current state

Data movement, including stripes, seems to work
Not all metadata (especially around bucket gens) seems to be handled correctly yet
Journals are not explicitly handled, but may need to be
Resuming and changing the target size are not yet implemented
Uses a whole-device reconcile scan; I plan to make it scan only the affected area

Implementation

Reuses large parts of the device remove/evacuate paths

Documentation

Not yet written

Testing

See https://github.com/jullanggit/ktest/tree/shrink for the tests used

koverstreet · 2026-03-09T19:04:15Z

Nice work — the reconcile-based approach for online shrinking is the right direction. Some feedback:

On-disk format:

Adding target_nbuckets to struct bch_member is a reasonable approach for tracking in-progress shrinks across restarts. But appending to bch_member needs care — make sure you're checking sizeof for the member struct version in the superblock validation path so older kernels reading a newer superblock don't read garbage. (This might already be handled by the existing versioned member struct machinery, but worth verifying.)
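A minimal userspace sketch of the size-checked read this comment asks about. It assumes each member entry carries an on-disk size and that fields beyond that size must read as zero. `struct member_v1`, `struct member_v2`, and `member_read()` are hypothetical names for illustration, not the actual bcachefs definitions or validation path.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical older layout: what an older kernel knows about. */
struct member_v1 {
	uint64_t nbuckets;
	uint16_t bucket_size;
	uint16_t pad[3];
};

/* Hypothetical newer layout: one field appended at the end. */
struct member_v2 {
	uint64_t nbuckets;
	uint16_t bucket_size;
	uint16_t pad[3];
	uint64_t target_nbuckets;	/* 0 == no pending resize */
};

/*
 * Copy an on-disk member entry of on_disk_bytes into the in-memory
 * struct. Fields the on-disk entry does not have stay zero; bytes
 * beyond what this code understands are ignored.
 */
static void member_read(struct member_v2 *dst, const void *src,
			size_t on_disk_bytes)
{
	size_t n = on_disk_bytes < sizeof(*dst) ? on_disk_bytes : sizeof(*dst);

	assert(dst && src);
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, n);
}
```

With this shape, an entry written in the older layout yields `target_nbuckets == 0`, which matches the "no pending resize" meaning of zero stated in the commit message for this field.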

The evacuation loop:

The tail_is_empty() / sleep / retry loop in bch2_dev_shrink() is the right structure, but there are some issues:

The TODO about cached data is a real problem. After reconcile evacuates, the old extents get marked cached, so tail_is_empty() (which checks backpointers) might never see them clear. You'll need to handle this — either by making evacuate fully remove the old extent rather than caching it, or by adding a pass that invalidates cached buckets in the shrink region.
schedule_timeout_killable(HZ/2) is a reasonable poll interval, but you should probably also check c->reconcile progress or wait on an event rather than polling blindly.
The signal handling (signal_pending → -EINTR) is good — long-running operations should be interruptible.
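The cached-data concern can be illustrated with a toy model, assuming evacuation leaves the old copy behind as cached data and that the emptiness check treats any non-empty bucket as occupied. The bucket states and the helpers `evacuate_tail()` and `invalidate_cached_tail()` are hypothetical stand-ins; only the name `tail_is_empty()` comes from the review, and its real implementation checks backpointers rather than an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum bucket_state { BUCKET_EMPTY, BUCKET_DIRTY, BUCKET_CACHED };

/* True only if every bucket at or past the cutoff holds nothing at all. */
static bool tail_is_empty(const enum bucket_state *b, size_t nbuckets,
			  size_t cutoff)
{
	assert(cutoff <= nbuckets);
	for (size_t i = cutoff; i < nbuckets; i++)
		if (b[i] != BUCKET_EMPTY)
			return false;
	return true;
}

/* Model of evacuation that leaves the old copy behind as cached data. */
static void evacuate_tail(enum bucket_state *b, size_t nbuckets, size_t cutoff)
{
	for (size_t i = cutoff; i < nbuckets; i++)
		if (b[i] == BUCKET_DIRTY)
			b[i] = BUCKET_CACHED;
}

/* Extra pass: drop cached data in the tail. Returns buckets invalidated. */
static size_t invalidate_cached_tail(enum bucket_state *b, size_t nbuckets,
				     size_t cutoff)
{
	size_t n = 0;

	for (size_t i = cutoff; i < nbuckets; i++)
		if (b[i] == BUCKET_CACHED) {
			b[i] = BUCKET_EMPTY;
			n++;
		}
	return n;
}
```

In this model, a loop that only calls `evacuate_tail()` and then polls `tail_is_empty()` never terminates. Adding the invalidation pass is the second of the two options suggested above.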

Allocation cutoff:

if (unlikely(ca->mi.target_nbuckets && bucket >= ca->mi.target_nbuckets)) {

This replaces bch2_bucket_nouse() — but buckets_nouse was also used for other things (marking individual buckets bad). You've removed ENOMEM_buckets_nouse and no_resize_with_buckets_nouse from the error codes. Make sure nothing else was relying on that bitmap.

bch2_ptr_bad_or_evacuating_rcu:

This inline function does a division (div_u64(ptr->offset, ca->mi.bucket_size)) in what can be a hot path. Consider whether you can precompute the cutoff sector instead.
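A sketch of the precomputation, assuming the cutoff can be cached per device whenever the target changes. `struct dev_shrink` and its helpers are hypothetical, and the sketch assumes `target_nbuckets * bucket_size` does not overflow and that pointer offsets are always below `UINT64_MAX`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device state; field names are illustrative. */
struct dev_shrink {
	uint32_t bucket_size;		/* sectors per bucket */
	uint64_t target_nbuckets;	/* 0 == no shrink pending */
	uint64_t cutoff_sector;		/* cached target_nbuckets * bucket_size */
};

/* Recompute the cached cutoff whenever the target changes. */
static void dev_shrink_set_target(struct dev_shrink *d, uint64_t target_nbuckets)
{
	assert(d->bucket_size != 0);
	d->target_nbuckets = target_nbuckets;
	/* No shrink pending: pick a cutoff no real offset can reach. */
	d->cutoff_sector = target_nbuckets
		? target_nbuckets * d->bucket_size
		: UINT64_MAX;
}

/* Reference check: one division per pointer. */
static bool ptr_evacuating_div(const struct dev_shrink *d, uint64_t offset)
{
	return d->target_nbuckets &&
	       offset / d->bucket_size >= d->target_nbuckets;
}

/* Hot-path check: a single compare against the precomputed sector. */
static bool ptr_evacuating_fast(const struct dev_shrink *d, uint64_t offset)
{
	return offset >= d->cutoff_sector;
}
```

The two checks agree because, for unsigned integers, `offset / bucket_size >= n` holds exactly when `offset >= n * bucket_size`.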

Commented-out code:

The __bch2_dev_resize_alloc block is commented out with a TODO. Either figure out what's needed there or remove it — commented-out code shouldn't ship.

Style nits:

Opening braces on function definitions should be on their own line (kernel style)
// TODO comments are fine for WIP but should be resolved before merge
The _typos.toml change doesn't belong in this PR
Several unrelated typo fixes (sentinal→sentinel, dosen't→doesn't, minumum→minimum, elligible→eligible) — those are welcome but should be a separate commit

Testing:

Good that you have ktest tests for this: https://github.com/jullanggit/ktest/tree/shrink. Consider adding cases for:

Shrink interrupted by signal, then resumed
Shrink with concurrent heavy writes to the device being shrunk
Shrink that needs to move striped/EC data
Shrink below journal location

Overall this is solid WIP. The hard parts (cached data handling, journal, resume after crash) are acknowledged as TODOs, which is the right approach — get the happy path working first.

— ProofOfConcept

jullanggit · 2026-03-15T20:32:58Z

Thank you for the review! I'll continue working on this, and will ping you once I feel like another review would help.


Keep the shrink cutoff in force until tail alloc metadata is removed, move journal buckets out of the truncated tail before the final commit, and clear tail need_discard state so fsck/accounting do not retain bookkeeping for removed buckets. Also fix journal bucket deletion so the updated per-device journal superblock state is what gets written back.

Serialize duplicate data-update ownership in the update table so phys reconcile workers cannot race the same logical extent. Also tolerate device-scan pending data_replicas work when other reconcile bits are still set, stop background movers before globally disabling writes during RO shutdown, and avoid counting transient reconcile ENOSPC/ec_alloc_failed retries as hard data-update failures. This fixes the targeted three-device shrink tests, including the EC case.

During device shrink, an extent can become temporarily under-replicated as soon as one pointer lands on a bad or evacuating location, before the device scan has had a chance to rewrite the reconcile entry. Treating that window as a missing io-opts propagation cookie raises a false extent_io_opts_not_set fsck error. Skip that check when the new data_replicas work is explained by existing bad or evacuating pointers so shrink can hand the extent off to reconcile without tripping fsck.

Reconcile-only updates can change the phys-work class encoded in rotational-data backpointers without moving any extent pointers. The fast path in bch2_trigger_extent() treated identical pointer arrays as a no-op, which left stale or missing phys backpointers behind during three-device EC shrink. Teach that path to recognize when only the backpointer reconcile flags are changing, update the existing backpointer entries in place, and adjust the reconcile_work_phys bits without routing the unchanged key through a delete+insert cycle in the write buffer.

Shrink cleanup deletes alloc/freespace state in bucket coordinates, but backpointer keys are indexed by device+sector. Using the raw bucket cutoff for BTREE_ID_backpointers can delete live backpointers that still belong to buckets below the shrink point, which leaves offline fsck repairing missing phys backpointers in the three-device EC shrink tests. Translate the cutoff into backpointer key space before deleting the tail of the truncated device.
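A toy illustration of the coordinate mismatch, assuming backpointer keys are ordered by a value derived from the device sector. The `BP_SHIFT` scaling and both helpers are hypothetical and stand in for whatever encoding the real key space uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scaling between sectors and backpointer key offsets. */
#define BP_SHIFT 10

/* Translate a bucket number into the (assumed) backpointer key space. */
static uint64_t bucket_to_bp_offset(uint64_t bucket, uint32_t bucket_size)
{
	assert(bucket_size != 0);
	return (bucket * bucket_size) << BP_SHIFT;
}

/*
 * Delete every key at or past cutoff from a sorted array, compacting
 * in place. Returns the number of keys kept.
 */
static size_t delete_from(uint64_t *keys, size_t nr, uint64_t cutoff)
{
	size_t kept = 0;

	for (size_t i = 0; i < nr; i++)
		if (keys[i] < cutoff)
			keys[kept++] = keys[i];
	return kept;
}
```

With 8-sector buckets and a shrink target of 4 buckets, backpointers for buckets 1, 3, and 5 all have key offsets far above 4. Passing the raw bucket number as the cutoff removes all three, whereas passing `bucket_to_bp_offset(4, 8)` removes only the one in bucket 5.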

Repeated online shrink runs can leave copygc blocked in a move write while waiting for allocator space from its dedicated write point. Stopping copygc before those open buckets are dropped can then hang the read-only transition inside kthread_stop(). Close write points before stopping the background data movers so copygc does not stay wedged behind allocator state owned by the filesystem itself.

Reconcile always updates the extent it looks up, either to rewrite reconcile state or to queue a move. Starting those iterators in read mode forces a restart_upgrade during commit for each key, and the three-device EC shrink test can trip the slowpath counter threshold as those restarts add up. Open the logical and physical extent iterators with intent locks from the start so reconcile does not burn one upgrade restart per extent.

Variable-bucket shrink can fail before reconcile moves any data when the filesystem metadata target only contains the device being shrunk. The initial reconcile_scan update, later btree node rewrites, and journal allocations all kept preferring that shrinking target, so once the remaining below-cutoff buckets ran short the shrink aborted with no_buckets_found. Teach internal btree and journal metadata allocations to fall back to the full filesystem when every rw member of the preferred metadata target is currently shrinking. Also treat allocator no_buckets_found as the same transient ENOSPC-class condition reconcile already demotes to pending work, so later shrink-triggered retries do not escalate it into a hard reconcile failure.
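The fallback rule can be sketched with device sets as bitmasks. This is a simplified model: `metadata_alloc_devs()` is a hypothetical helper, and "the full filesystem" is assumed here to mean all rw members.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Device sets as bitmasks (bit i == device i). Returns the set that
 * internal metadata allocations should draw from.
 */
static uint64_t metadata_alloc_devs(uint64_t target, uint64_t rw,
				    uint64_t shrinking)
{
	uint64_t usable = target & rw & ~shrinking;

	/* A filesystem with no rw members cannot allocate at all. */
	assert(rw != 0);

	/* At least one rw target member is not shrinking: keep the preference. */
	if (usable)
		return target;

	/* Every rw member of the target is shrinking: use all rw devices. */
	return rw;
}
```

For a three-device filesystem where the metadata target is only device 0 and device 0 is shrinking, the helper returns all three devices; if the target also contains a device that is not shrinking, the preference is kept.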

`target_nbuckets` already persists an interrupted shrink target, but mount left that state stranded after a restart. Refactor the shrink path so recovery can reuse it, and resume pending shrinks from the resize-on-mount hook once btrees and normal journal reservations are available. The early resize-on-mount call still handles grow-on-mount image expansion before btrees are running. A second call after early recovery setup now requeues reconcile's device scan and finishes any pending shrink synchronously during mount.

Interpret `target_nbuckets` as the single requested-size field for both shrink and grow, with helpers that normalize `0` to the current device size and make shrink-only users consult the current resize direction. This lets allocator and metadata fallback paths stop treating every nonzero target as an active shrink. Move resize execution to a per-device background kthread. Requests now persist the latest target, bump a sequence number, and wait synchronously for that sequence to finish unless a newer request supersedes it. The worker re-reads the current target at explicit restart points so stale shrink passes bail out instead of finishing obsolete work, and recovery reuses the same path to resume either shrink or grow. Also stop resize workers during read-only shutdown before clean shutdown is marked. Without that, an interrupted shrink can keep issuing transactional alloc/accounting updates during unmount and trip write-path assertions.
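A rough sketch of the two ideas in this message: normalizing a zero target to the current size, and using a sequence number so that stale passes and waiters can notice a newer request. All names below are hypothetical, and the real code runs in a kthread with locking that this single-threaded model leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum resize_dir { RESIZE_NONE, RESIZE_GROW, RESIZE_SHRINK };

/* A stored target of 0 means "no request": normalize to the current size. */
static uint64_t resize_target(uint64_t nbuckets, uint64_t target_nbuckets)
{
	return target_nbuckets ? target_nbuckets : nbuckets;
}

static enum resize_dir resize_direction(uint64_t nbuckets,
					uint64_t target_nbuckets)
{
	uint64_t t = resize_target(nbuckets, target_nbuckets);

	if (t < nbuckets)
		return RESIZE_SHRINK;
	if (t > nbuckets)
		return RESIZE_GROW;
	return RESIZE_NONE;
}

/* Hypothetical request state shared between callers and the worker. */
struct resize_req {
	uint64_t target_nbuckets;
	uint64_t seq;		/* bumped on every new request */
	uint64_t done_seq;	/* last sequence the worker finished */
};

/* Persist the latest target and hand the caller its sequence number. */
static uint64_t resize_request(struct resize_req *r, uint64_t target_nbuckets)
{
	r->target_nbuckets = target_nbuckets;
	return ++r->seq;
}

/* Worker side: a pass started for my_seq is stale once a newer one exists. */
static bool resize_pass_is_stale(const struct resize_req *r, uint64_t my_seq)
{
	return r->seq != my_seq;
}

/* Caller side: stop waiting when our request finished or was superseded. */
static bool resize_wait_done(const struct resize_req *r, uint64_t my_seq)
{
	assert(my_seq <= r->seq);
	return r->done_seq >= my_seq || r->seq > my_seq;
}
```

Shrink-only users would call `resize_direction()` rather than testing `target_nbuckets != 0`, which is what lets a pending grow stop looking like an active shrink.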

bch2_do_discards() now clears need_discard buckets via alloc updates that commit the transaction and remove the corresponding need_discard entry. Holding the need_discard iterator open across that commit can deadlock against the alloc/freespace updates while shrink is reclaiming truncated tails. Fetch one need_discard key at a time, remember the successor position, and reopen the iterator after each bucket is processed. This keeps the upstream need_discard worker model intact while avoiding the lock cycle seen in shrink tests after the rebase.
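The "fetch one key, remember the successor, reopen" pattern can be modeled in userspace with a sorted array standing in for the need_discard index. `struct keyset` and its helpers are hypothetical; the point is only that no lookup state is held across the step that mutates the set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define KEYSET_MAX 16

/* Sorted array of positions, standing in for the need_discard btree. */
struct keyset {
	uint64_t keys[KEYSET_MAX];
	size_t   nr;
};

/* "Open an iterator at pos and peek": first key >= pos, if any. */
static bool keyset_peek(const struct keyset *s, uint64_t pos, uint64_t *out)
{
	for (size_t i = 0; i < s->nr; i++)
		if (s->keys[i] >= pos) {
			*out = s->keys[i];
			return true;
		}
	return false;
}

/* "Commit the alloc update": removes the key, shifting later entries. */
static void keyset_delete(struct keyset *s, uint64_t key)
{
	for (size_t i = 0; i < s->nr; i++)
		if (s->keys[i] == key) {
			for (size_t j = i + 1; j < s->nr; j++)
				s->keys[j - 1] = s->keys[j];
			s->nr--;
			return;
		}
}

/* Process every pending key; returns how many were handled. */
static size_t discard_all(struct keyset *s)
{
	uint64_t pos = 0, key;
	size_t done = 0;

	assert(s->nr <= KEYSET_MAX);

	while (keyset_peek(s, pos, &key)) {
		/* Remember the successor before anything can change. */
		pos = key + 1;
		/* No iterator state is held across the "commit". */
		keyset_delete(s, key);
		done++;
	}
	return done;
}
```

Each loop iteration starts a fresh lookup from the remembered position, so deleting the current key cannot invalidate anything the loop still depends on.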

Do not keep reconcile_scan iterators or other cached search paths alive across the actual move/rewrite operations. Return the next reconcile_scan entry without exiting from inside the iterator helper, and explicitly restart the transaction before the normal and phys reconcile workers begin mutating extents. This avoids deadlocks where cached reconcile paths stay pinned while move work needs alloc/freespace updates and btree node rewrites.

Shrink still needs discard progress on retained buckets, but discards in the tail being evacuated can deadlock against resize/reconcile and journal pin flushing. Only defer need_discard entries at or beyond the current shrink target, drain any in-flight discard workers before the shrink path starts, and re-kick discards once the current resize request reaches a terminal state. Also clear tail need_discard state by deleting the derived index entries directly instead of mutating alloc keys that are about to be truncated anyway.

Track completed reconcile kicks so shrink can wait for a full reconcile pass instead of polling forever for the tail to become empty. Shrink now queues a device scan plus a pending pass, retries pending work once more, and returns `-ENOSPC` if the tail is still occupied after both passes. Before returning that failure, clear the persisted shrink target and wake pending reconcile work so remount no longer retries a known-impossible shrink under the stale cutoff.

Shrink was queuing a full-device RECONCILE_SCAN_device pass and then starting the final journal flush as soon as the requested kick drained. In online_three_device_variable_buckets_shrink that could requeue metadata below the retained region for tens of seconds, and once the kick completed the resize worker could still collide with reconcile's cached paths during key-cache pin flushing. Start shrink-triggered backpointer scans at target_nbuckets translated into backpointer key space, and wait for reconcile to report idle before the final shrink flush. That keeps the resize worker focused on the tail that will actually be truncated and avoids the long post-resize stalls.

A shrink-triggered reconcile kick can keep draining unrelated global reconcile work long after the truncating tail has already been evacuated. In the variable-bucket shrink ktest that turned occasional runs into multi-minute stalls even though the tail was already empty. Poll tail emptiness while waiting for reconcile so shrink can move on as soon as evacuation has actually cleared the tail. Keep the final cutoff path as a separate helper so the control flow stays structured and does not need gotos.

Shrink's final commit path only needs to flush journal pins that were already outstanding when it proved the tail empty. Waiting for bch2_journal_flush_all_pins() lets unrelated reconcile/key-cache work keep adding newer pins, which turns the resize ioctl into a multi-minute stall even though the shrinking tail is already clear. Flush only journal_cur_seq() before the device-specific pin flush. That still drains the journal state shrink must fence before it commits the smaller nbuckets, without waiting on unrelated future journal traffic.
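A small model of why fencing on a snapshot of the current sequence terminates while unrelated traffic keeps arriving. `struct journal_model` and its helpers are hypothetical and only mimic the idea of flushing up to a fixed sequence number; they are not the bcachefs journal API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct journal_model {
	uint64_t cur_seq;	/* newest sequence number handed out */
	uint64_t flushed_seq;	/* everything <= this has been flushed */
};

/* Unrelated work adding a new pin: takes the next sequence number. */
static uint64_t journal_add_pin(struct journal_model *j)
{
	return ++j->cur_seq;
}

/* Flush one entry; false once nothing at or below fence remains. */
static bool journal_flush_one(struct journal_model *j, uint64_t fence)
{
	assert(j->flushed_seq <= j->cur_seq);
	if (j->flushed_seq >= fence)
		return false;
	j->flushed_seq++;
	return true;
}

/*
 * Fence on the sequence that was current when the caller started,
 * while new_pins_per_step unrelated pins arrive after every flush.
 * Returns the number of flush steps taken.
 */
static unsigned flush_to_snapshot(struct journal_model *j,
				  unsigned new_pins_per_step)
{
	uint64_t fence = j->cur_seq;	/* snapshot, never re-read */
	unsigned steps = 0;

	while (journal_flush_one(j, fence)) {
		for (unsigned i = 0; i < new_pins_per_step; i++)
			journal_add_pin(j);
		steps++;
	}
	return steps;
}
```

If the loop re-read `j->cur_seq` as its fence on every iteration, any arrival rate of one or more pins per step would keep it running indefinitely in this model, which mirrors the stall described above.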

…check restart check is done below after acquiring the lock anyway. The journal move is not needed, as we already move the journal at the start of the shrink, and new allocations (and thus journal buckets) are blocked in the tail, so it can't reappear.


Shrink still used a wall-clock no-progress deadline after moving to tail head plus aggregate backpointer tracking. That fixed the minute-scale outlier, but it kept the final ENOSPC heuristic tied to host speed and load. Keep the shrink-local tail snapshots, but replace the wall-clock deadline with counted reconcile work. Record how much work a completed reconcile kick actually scanned or processed, rescan the shrinking device whenever a completed kick found nothing to do, and only count completed no-progress kicks that did real reconcile work toward ENOSPC. Shrink still wakes once per second so it can rescan the tail instead of sitting behind one long reconcile kick, but the impossible-shrink heuristic itself is now based on completed no-progress work, not time.

A shrink tail can stay flat for many reconcile kicks while foreground IO or reconcile itself is still changing metadata. Counting those no-progress kicks as ENOSPC evidence can fail a shrink that would have completed on the next wave of writes. Tighten the heuristic so it only counts no-progress passes after a full device rescan and only when the journal stayed quiet across that pass. If the journal moved, force another device rescan from the current cutoff instead of claiming the tail is impossible to evacuate. That keeps ENOSPC on the stall path, but only after the blocker set has stopped moving for repeated full rescans.
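The tightened heuristic reads like a small state machine, sketched below. The struct fields, the `SHRINK_STALL_LIMIT` threshold, and the choice to leave the counter untouched when the journal moved are all assumptions made for illustration; the commit message does not state them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical threshold for "repeated full rescans". */
#define SHRINK_STALL_LIMIT 3

/* What one completed reconcile kick observed (illustrative fields). */
struct kick_result {
	bool full_rescan;	/* pass rescanned the device from the cutoff */
	bool tail_progress;	/* tail occupancy dropped during the pass */
	bool journal_moved;	/* journal advanced during the pass */
};

struct shrink_stall {
	unsigned no_progress;	/* counted stalled passes so far */
	bool	 want_rescan;	/* caller should force another device rescan */
};

/* Account one completed kick; true means give up with -ENOSPC. */
static bool shrink_account_kick(struct shrink_stall *s,
				const struct kick_result *r)
{
	assert(s && r);
	s->want_rescan = false;

	if (r->tail_progress) {
		s->no_progress = 0;
		return false;
	}

	/*
	 * Not a quiet full rescan: this pass is not evidence of a stall.
	 * Assumption: the counter is left as-is rather than reset.
	 */
	if (r->journal_moved || !r->full_rescan) {
		s->want_rescan = true;
		return false;
	}

	return ++s->no_progress >= SHRINK_STALL_LIMIT;
}
```

Only passes that were full rescans, made no tail progress, and saw a quiet journal move the counter toward the limit.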

koverstreet force-pushed the master branch 2 times, most recently from 6303f5b to 990d039, on March 14, 2026 03:32

jullanggit force-pushed the shrink branch from 369a184 to 47b0235 on March 15, 2026 20:46

koverstreet force-pushed the master branch from a437918 to 113cc1f on March 24, 2026 04:34

jullanggit force-pushed the shrink branch from 47b0235 to 52ca9a6 on March 24, 2026 16:18

koverstreet force-pushed the master branch 2 times, most recently from a6d79f5 to 82c906f, on April 1, 2026 04:57

jullanggit added 21 commits April 15, 2026 18:45

bcachefs: validate_member(): fix device not enough buckets message

f47e0bd

bcachefs: validate_member(): extract variables

b5873eb

bcachefs: validate_member(): not enough buckets: print actual amount …

609a1af

…of buckets

bcachefs: validate_members(): fix bad search-and-replace

8157d95

| Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202602181006.rLTgu86r-lkp@intel.com/

bcachefs: add target_nbuckets to bch_member(_cpu)

289e32b

this addition is backwards compatible because new fields are initialized to zero, which means no pending resize, and are not read by older kernels

bcachefs: print target_nbuckets = 0 as inactive instead of 0

5456cdc

bcachefs: consider pointers after target_nbuckets as evacuating

151f43d

bcachefs: use dev-based evacuating check for may_reuse_stripe

b96ea41

avoids stripe reshuffling

bcachefs: sharpen target_nbuckets semantics to only include (triviall…

8c67419

…y) legal shrinks

bcachefs: split bch2_dev_resze into grow & shrink paths

24a0d16

also comment in outline of shrink path

bcachefs: implement shrink superblock & alloc nouse interactions

76bb037

bcachefs: refactor out per-device buckets_nouse alloc/free

c9714f5

bcachefs: implement filesystem shrinking mvp

1b2c742

bcachefs: shrink: avoid holding two transactions at once

75cc265

bcachefs: alloc: directly check against target_nbuckets for skipping …

ce82b76

…allocation

bcachefs: remove unused buckets_nouse code

6dab5dd

bcachefs: shrink: print error message if reconcile scan fails

7c80388

bcachefs: shrink: close open buckets before evacuating data

6e47a06

bcachefs: shrink: explicitly pass tail_cutoff instead of implicitly r…

5285b74

…elying on ca->mi.target_nbuckets

avoids possible edge cases if device is being removed mid-shrink etc.

bcachefs: dev_resize_alloc: cast new - old buckets calculation to sig…

fae49d1

…ned integer
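A brief illustration of why the signed cast matters once shrinking is possible. `nbuckets_delta()` is a hypothetical helper; it assumes both counts fit in `int64_t` and relies on the modular unsigned-to-signed conversion that GCC and Clang provide.

```c
#include <assert.h>
#include <stdint.h>

/* Signed change in bucket count: negative when the device shrinks. */
static int64_t nbuckets_delta(uint64_t old_nbuckets, uint64_t new_nbuckets)
{
	assert(old_nbuckets <= (uint64_t)INT64_MAX &&
	       new_nbuckets <= (uint64_t)INT64_MAX);
	/* Unsigned subtraction wraps on shrink; the cast recovers the sign. */
	return (int64_t)(new_nbuckets - old_nbuckets);
}
```

Without the cast, shrinking from 1000 to 600 buckets produces a huge positive unsigned value instead of -400.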

bcachefs: shrink: truncate alloc info

0c38949

This is done analogous to the remove alloc info path

jullanggit added 17 commits April 19, 2026 19:34

bcachefs: shrink: properly remove bucket gens

34abba4

bcachefs: shrink: bch2_dev_resize_wait_done: improve readability

15c35f6

bcachefs: shrink: small code cleanups

8386a41

jullanggit force-pushed the shrink branch from ab100fc to ca44f04 on April 19, 2026 20:29

jullanggit added 12 commits April 19, 2026 23:09

bcachefs: add comment

19cfd75

bcachefs: rename bch2_dev_shrink_finish -> bch2_dev_shrink_finalize

9f3a8b5

imo more descriptive

bcachefs: shorten comment about waking pending reconcile on shrink ca…

11c8da5

…ncel

bcachefs: shrink: add TODO

5b98324

bcachefs: shrink: change not-enough-space error message

3f605d3

bcachefs: shrink: document journal as imperfect progress signal

d2a62b0

jullanggit wants to merge 62 commits into koverstreet:master from jullanggit:shrink