bcachefs: implement online filesystem shrinking#1073

Draft
jullanggit wants to merge 62 commits intokoverstreet:masterfrom
jullanggit:shrink

@jullanggit jullanggit commented Mar 2, 2026

Implement online filesystem shrinking through reconcile. Closes #781 once done.
This is hopefully complementary to #1070, which targets offline shrink.

Goal

A robust online shrinking implementation that automatically resumes after restarts or crashes (shrinking is a potentially long-running operation) and supports changing the target size mid-shrink.

Current state

  • Data movement, including stripes, seems to work
  • Not all metadata (especially surrounding bucket gens) seems to be correctly handled yet
  • Journals are not explicitly handled, but may need to be
  • Resuming and changing target size is not yet implemented
  • Uses a whole-device reconcile scan; I plan to make it scan only the affected area

Implementation

Reuses large parts of the device remove/evacuate paths

Documentation

Not yet written

Testing

See https://github.com/jullanggit/ktest/tree/shrink for the tests used

@koverstreet
Owner

Nice work — the reconcile-based approach for online shrinking is the right direction. Some feedback:

On-disk format:

Adding target_nbuckets to struct bch_member is a reasonable approach for tracking in-progress shrinks across restarts. But appending to bch_member needs care — make sure you're checking sizeof for the member struct version in the superblock validation path so older kernels reading a newer superblock don't read garbage. (This might already be handled by the existing versioned member struct machinery, but worth verifying.)
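
A minimal sketch of the size check being asked for, under stated assumptions: the reader copies only the bytes both sides know about and zero-fills the rest, so a missing `target_nbuckets` reads as 0 ("no pending resize"). `struct member_v1`, `struct member_v2`, and `member_read()` are hypothetical stand-ins, not the real `bch_member` layout or the existing bcachefs helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for an old and a new on-disk member layout. */
struct member_v1 {
	uint64_t nbuckets;
};

struct member_v2 {
	uint64_t nbuckets;
	uint64_t target_nbuckets;	/* appended field; 0 = no pending resize */
};

/*
 * Copy an on-disk member of member_bytes bytes into the in-memory v2 layout.
 * Fields the on-disk copy lacks stay zero; trailing bytes this reader does
 * not know about are ignored.
 */
static struct member_v2 member_read(const void *disk, size_t member_bytes)
{
	struct member_v2 m;
	size_t n = member_bytes < sizeof(m) ? member_bytes : sizeof(m);

	assert(disk != NULL);
	memset(&m, 0, sizeof(m));
	memcpy(&m, disk, n);
	return m;
}
```

Reading a `member_v1`-sized entry through `member_read()` yields `target_nbuckets == 0`, which is the behaviour the backwards-compatibility argument depends on.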

The evacuation loop:

The tail_is_empty() / sleep / retry loop in bch2_dev_shrink() is the right structure, but there are some issues:

  • The TODO about cached data is a real problem. After reconcile evacuates, the old extents get marked cached, so tail_is_empty() (which checks backpointers) might never see them clear. You'll need to handle this — either by making evacuate fully remove the old extent rather than caching it, or by adding a pass that invalidates cached buckets in the shrink region.

  • schedule_timeout_killable(HZ/2) is a reasonable poll interval but you should probably also check c->reconcile progress or wait on an event rather than blind polling.

  • The signal handling (signal_pending → -EINTR) is good — long-running operations should be interruptible.

Allocation cutoff:

if (unlikely(ca->mi.target_nbuckets && bucket >= ca->mi.target_nbuckets)) {

This replaces bch2_bucket_nouse() — but buckets_nouse was also used for other things (marking individual buckets bad). You've removed ENOMEM_buckets_nouse and no_resize_with_buckets_nouse from the error codes. Make sure nothing else was relying on that bitmap.

bch2_ptr_bad_or_evacuating_rcu:

This inline function does a division (div_u64(ptr->offset, ca->mi.bucket_size)) in what can be a hot path. Consider whether you can precompute the cutoff sector instead.
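
A sketch of the precomputed-cutoff idea, assuming the cutoff is cached whenever the target changes. `struct dev_cutoff` and both helper names are hypothetical and only illustrate the arithmetic: `offset / bucket_size >= target` is equivalent to `offset >= target * bucket_size` for unsigned integers (overflow of the product is not handled here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device state; the field names are illustrative only. */
struct dev_cutoff {
	uint64_t bucket_size;		/* sectors per bucket */
	uint64_t target_nbuckets;	/* 0 = no shrink in progress */
	uint64_t cutoff_sector;		/* cached target_nbuckets * bucket_size */
};

/* Slow path: recompute the cached cutoff whenever the target changes. */
static void dev_set_target(struct dev_cutoff *d, uint64_t target_nbuckets)
{
	d->target_nbuckets = target_nbuckets;
	d->cutoff_sector   = target_nbuckets * d->bucket_size;
}

/* Hot path: one compare instead of a 64-bit division. */
static bool ptr_evacuating(const struct dev_cutoff *d, uint64_t offset)
{
	return d->cutoff_sector && offset >= d->cutoff_sector;
}

/* Reference version with the division, kept only for comparison. */
static bool ptr_evacuating_div(const struct dev_cutoff *d, uint64_t offset)
{
	assert(d->bucket_size != 0);
	return d->target_nbuckets &&
	       offset / d->bucket_size >= d->target_nbuckets;
}
```

With `bucket_size = 128` and a target of 10 buckets, both versions flip from false to true at sector 1280.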

Commented-out code:

The __bch2_dev_resize_alloc block is commented out with a TODO. Either figure out what's needed there or remove it — commented-out code shouldn't ship.

Style nits:

  • Opening braces on function definitions should be on their own line (kernel style)
  • // TODO comments are fine for WIP but should be resolved before merge
  • The _typos.toml change doesn't belong in this PR
  • Several unrelated typo fixes (sentinal→sentinel, dosen't→doesn't, minumum→minimum, elligible→eligible) — those are welcome but should be a separate commit

Testing:

Good that you have ktest tests for this: https://github.com/jullanggit/ktest/tree/shrink. Consider adding cases for:

  • Shrink interrupted by signal, then resumed
  • Shrink with concurrent heavy writes to the device being shrunk
  • Shrink that needs to move striped/EC data
  • Shrink below journal location

Overall this is solid WIP. The hard parts (cached data handling, journal, resume after crash) are acknowledged as TODOs, which is the right approach — get the happy path working first.

— ProofOfConcept

@koverstreet koverstreet force-pushed the master branch 2 times, most recently from 6303f5b to 990d039 Compare March 14, 2026 03:32
@jullanggit
Author

Thank you for the review! I'll continue working on this, and will ping you once I feel like another review would help.

this addition is backwards compatible because new fields are initialized
to zero, which means no pending resize, and are not read by older
kernels
also comment in outline of shrink path

…elying on ca->mi.target_nbuckets

avoids possible edge cases if the device is being removed mid-shrink etc.
This is done analogously to the remove alloc info path

Keep the shrink cutoff in force until tail alloc metadata is removed, move journal buckets out of the truncated tail before the final commit, and clear tail need_discard state so fsck/accounting do not retain bookkeeping for removed buckets. Also fix journal bucket deletion so the updated per-device journal superblock state is what gets written back.

Serialize duplicate data-update ownership in the update table so phys reconcile
workers cannot race the same logical extent. Also tolerate device-scan pending
data_replicas work when other reconcile bits are still set, stop background
movers before globally disabling writes during RO shutdown, and avoid counting
transient reconcile ENOSPC/ec_alloc_failed retries as hard data-update
failures.

This fixes the targeted three-device shrink tests, including the EC case.

During device shrink, an extent can become temporarily under-replicated as
soon as one pointer lands on a bad or evacuating location, before the
device scan has had a chance to rewrite the reconcile entry. Treating that
window as a missing io-opts propagation cookie raises a false
extent_io_opts_not_set fsck error.

Skip that check when the new data_replicas work is explained by existing
bad or evacuating pointers so shrink can hand the extent off to reconcile
without tripping fsck.

Reconcile-only updates can change the phys-work class encoded in
rotational-data backpointers without moving any extent pointers. The fast
path in bch2_trigger_extent() treated identical pointer arrays as a no-op,
which left stale or missing phys backpointers behind during three-device EC
shrink.

Teach that path to recognize when only the backpointer reconcile flags are
changing, update the existing backpointer entries in place, and adjust the
reconcile_work_phys bits without routing the unchanged key through a
delete+insert cycle in the write buffer.

Shrink cleanup deletes alloc/freespace state in bucket coordinates, but
backpointer keys are indexed by device+sector. Using the raw bucket cutoff
for BTREE_ID_backpointers can delete live backpointers that still belong to
buckets below the shrink point, which leaves offline fsck repairing missing
phys backpointers in the three-device EC shrink tests.

Translate the cutoff into backpointer key space before deleting the tail of
the truncated device.
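
The coordinate mismatch described in this commit can be sketched as follows. This assumes backpointer offsets are device sector offsets scaled by a fixed shift; `BP_SHIFT`, `struct pos`, and both helpers are illustrative names and values, not the actual bcachefs definitions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption for illustration: backpointer offsets are sector offsets
 * scaled by a fixed shift. The real constant may differ.
 */
#define BP_SHIFT 10

struct pos {
	uint64_t dev;
	uint64_t offset;
};

/* First backpointer position that belongs to `bucket` on device `dev`. */
static struct pos bucket_to_bp_start(uint64_t dev, uint64_t bucket,
				     uint64_t bucket_size)
{
	assert(bucket_size != 0);
	return (struct pos) { dev, (bucket * bucket_size) << BP_SHIFT };
}

/* Inverse mapping: the bucket that owns a given backpointer position. */
static uint64_t bp_to_bucket(struct pos bp, uint64_t bucket_size)
{
	assert(bucket_size != 0);
	return (bp.offset >> BP_SHIFT) / bucket_size;
}
```

With 128-sector buckets and a cutoff of 10 buckets, the tail starts at backpointer offset `1280 << BP_SHIFT`. Deleting from raw offset 10 instead would start inside bucket 0, which is still live.
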
Repeated online shrink runs can leave copygc blocked in a move write while
waiting for allocator space from its dedicated write point. Stopping copygc
before those open buckets are dropped can then hang the read-only transition
inside kthread_stop().

Close write points before stopping the background data movers so copygc does
not stay wedged behind allocator state owned by the filesystem itself.

Reconcile always updates the extent it looks up, either to rewrite
reconcile state or to queue a move. Starting those iterators in read mode
forces a restart_upgrade during commit for each key, and the three-device
EC shrink test can trip the slowpath counter threshold as those restarts add
up.

Open the logical and physical extent iterators with intent locks from the
start so reconcile does not burn one upgrade restart per extent.

Variable-bucket shrink can fail before reconcile moves any data when the
filesystem metadata target only contains the device being shrunk. The
initial reconcile_scan update, later btree node rewrites, and journal
allocations all kept preferring that shrinking target, so once the
remaining below-cutoff buckets ran short the shrink aborted with
no_buckets_found.

Teach internal btree and journal metadata allocations to fall back to the
full filesystem when every rw member of the preferred metadata target is
currently shrinking. Also treat allocator no_buckets_found as the same
transient ENOSPC-class condition reconcile already demotes to pending
work, so later shrink-triggered retries do not escalate it into a hard
reconcile failure.
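
The fallback rule can be sketched with device bitmasks. `dev_mask_t` and `metadata_alloc_devs()` are hypothetical; treating an empty preferred set the same as an all-shrinking one is an assumption of this sketch, not something the commit states.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device-set representation: bit i set = device i. */
typedef uint64_t dev_mask_t;

/*
 * Pick the device set for an internal btree or journal allocation.
 *   preferred: members of the preferred metadata target
 *   shrinking: devices with a shrink in progress
 *   all_rw:    every rw device in the filesystem
 */
static dev_mask_t metadata_alloc_devs(dev_mask_t preferred,
				      dev_mask_t shrinking,
				      dev_mask_t all_rw)
{
	dev_mask_t usable = preferred & all_rw;

	assert(all_rw != 0);

	/* Every rw member of the target is shrinking: use the whole fs. */
	if (!(usable & ~shrinking))
		return all_rw;
	return usable;
}
```

If the target holds only device 0 and device 0 is shrinking, the allocation widens to all rw devices; if the target also holds a non-shrinking device, the target is kept as is.
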
`target_nbuckets` already persists an interrupted shrink target, but
mount left that state stranded after a restart. Refactor the shrink path
so recovery can reuse it, and resume pending shrinks from the resize-on-
mount hook once btrees and normal journal reservations are available.

The early resize-on-mount call still handles grow-on-mount image
expansion before btrees are running. A second call after early recovery
setup now requeues reconcile's device scan and finishes any pending
shrink synchronously during mount.

Interpret `target_nbuckets` as the single requested-size field for both
shrink and grow, with helpers that normalize `0` to the current device
size and make shrink-only users consult the current resize direction.
This lets allocator and metadata fallback paths stop treating every
nonzero target as an active shrink.
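
A minimal sketch of such helpers, with hypothetical names; only the "0 means current size" convention and the direction check come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Requested size in buckets; a stored 0 means "no request" (current size). */
static uint64_t resize_target(uint64_t nbuckets, uint64_t target_nbuckets)
{
	return target_nbuckets ? target_nbuckets : nbuckets;
}

static bool resize_is_shrink(uint64_t nbuckets, uint64_t target_nbuckets)
{
	return resize_target(nbuckets, target_nbuckets) < nbuckets;
}

static bool resize_is_grow(uint64_t nbuckets, uint64_t target_nbuckets)
{
	return resize_target(nbuckets, target_nbuckets) > nbuckets;
}

/* Shrink-only users (e.g. an allocator cutoff) ask this, not "target != 0". */
static bool bucket_in_shrink_tail(uint64_t nbuckets, uint64_t target_nbuckets,
				  uint64_t bucket)
{
	assert(bucket < nbuckets);
	return resize_is_shrink(nbuckets, target_nbuckets) &&
	       bucket >= target_nbuckets;
}
```

A pending grow (target above the current size) then no longer fences off any bucket, which is the behaviour change the paragraph above describes.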

Move resize execution to a per-device background kthread. Requests now
persist the latest target, bump a sequence number, and wait synchronously
for that sequence to finish unless a newer request supersedes it. The
worker re-reads the current target at explicit restart points so stale
shrink passes bail out instead of finishing obsolete work, and recovery
reuses the same path to resume either shrink or grow.

Also stop resize workers during read-only shutdown before clean shutdown
is marked. Without that, an interrupted shrink can keep issuing
transactional alloc/accounting updates during unmount and trip
write-path assertions.
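
The request/supersede protocol from this commit might look roughly like the sketch below. `struct resize_state` and the three functions are hypothetical, and all locking and waiting is omitted; the sketch only shows how a sequence number lets a stale pass detect that a newer request has replaced it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device resize state; locking is omitted in this sketch. */
struct resize_state {
	uint64_t target_nbuckets;	/* latest requested size */
	uint64_t req_seq;		/* bumped on every request */
	uint64_t done_seq;		/* last request that finished */
};

/* Record a new request; returns the sequence number the caller waits on. */
static uint64_t resize_request(struct resize_state *s, uint64_t target)
{
	s->target_nbuckets = target;
	return ++s->req_seq;
}

/* Checked by the worker at each restart point: is its pass stale? */
static bool resize_superseded(const struct resize_state *s, uint64_t seq)
{
	return s->req_seq != seq;
}

/* Worker completion; records nothing and returns false for a stale pass. */
static bool resize_complete(struct resize_state *s, uint64_t seq)
{
	assert(seq <= s->req_seq);
	if (resize_superseded(s, seq))
		return false;
	s->done_seq = seq;
	return true;
}
```

If a second request arrives while the first is running, the first pass sees `resize_superseded()` at its next restart point and bails out; only the newer request can complete.
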
bch2_do_discards() now clears need_discard buckets via alloc updates that
commit the transaction and remove the corresponding need_discard entry.
Holding the need_discard iterator open across that commit can deadlock
against the alloc/freespace updates while shrink is reclaiming truncated
tails.

Fetch one need_discard key at a time, remember the successor position,
and reopen the iterator after each bucket is processed. This keeps the
upstream need_discard worker model intact while avoiding the lock cycle
seen in shrink tests after the rebase.
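
The "one key at a time" pattern can be illustrated without any btree code. In this sketch a sorted array stands in for the need_discard index, and `keyset_peek()` plays the role of opening a fresh iterator; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sorted key set standing in for the need_discard index. */
struct keyset {
	const uint64_t *keys;	/* sorted ascending */
	size_t nr;
};

/* Fresh lookup: first key >= pos, or false when none remains. */
static bool keyset_peek(const struct keyset *s, uint64_t pos, uint64_t *out)
{
	for (size_t i = 0; i < s->nr; i++)
		if (s->keys[i] >= pos) {
			*out = s->keys[i];
			return true;
		}
	return false;
}

/* Visit every key, holding no lookup state while a key is processed. */
static size_t process_one_at_a_time(const struct keyset *s,
				    uint64_t *processed, size_t max)
{
	uint64_t pos = 0, k;
	size_t n = 0;

	assert(processed != NULL);

	while (n < max && keyset_peek(s, pos, &k)) {
		/* The lookup has ended; real code would commit here. */
		processed[n++] = k;

		if (k == UINT64_MAX)
			break;
		pos = k + 1;	/* remembered successor position */
	}
	return n;
}
```

Because each iteration starts a new lookup from the saved position, nothing from the previous lookup is still held when the per-bucket update commits.
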
Do not keep reconcile_scan iterators or other cached search paths alive
across the actual move/rewrite operations.

Return the next reconcile_scan entry without exiting from inside the
iterator helper, and explicitly restart the transaction before the
normal and phys reconcile workers begin mutating extents.

This avoids deadlocks where cached reconcile paths stay pinned while
move work needs alloc/freespace updates and btree node rewrites.

Shrink still needs discard progress on retained buckets, but discards in
the tail being evacuated can deadlock against resize/reconcile and journal
pin flushing.

Only defer need_discard entries at or beyond the current shrink target,
drain any in-flight discard workers before the shrink path starts, and
re-kick discards once the current resize request reaches a terminal state.

Also clear tail need_discard state by deleting the derived index entries
directly instead of mutating alloc keys that are about to be truncated
anyway.

Track completed reconcile kicks so shrink can wait for a full reconcile
pass instead of polling forever for the tail to become empty. Shrink now
queues a device scan plus a pending pass, retries pending work once more,
and returns `-ENOSPC` if the tail is still occupied after both passes.

Before returning that failure, clear the persisted shrink target and wake
pending reconcile work so remount no longer retries a known-impossible
shrink under the stale cutoff.
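
The bounded wait described in this commit can be sketched as a small simulation. `struct shrink_sim` and `reconcile_pass()` are hypothetical stand-ins for the real reconcile machinery; only the "two passes, then `-ENOSPC` and clear the target" shape is taken from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical simulation state, not a bcachefs structure. */
struct shrink_sim {
	unsigned tail_used;		/* occupied buckets in the tail */
	unsigned moved_per_pass;	/* buckets one reconcile pass can move */
	uint64_t target_nbuckets;	/* persisted shrink target, 0 = none */
};

/* Stand-in for "kick reconcile and wait for the kick to complete". */
static void reconcile_pass(struct shrink_sim *s)
{
	unsigned moved = s->moved_per_pass < s->tail_used
		? s->moved_per_pass : s->tail_used;

	s->tail_used -= moved;
}

/* Device scan pass, then one retry of pending work, then give up. */
static int shrink_wait_for_tail(struct shrink_sim *s)
{
	assert(s->target_nbuckets != 0);

	for (int pass = 0; pass < 2; pass++) {
		reconcile_pass(s);
		if (!s->tail_used)
			return 0;
	}

	/* Clear the persisted target so a remount does not retry it. */
	s->target_nbuckets = 0;
	return -ENOSPC;
}
```

A tail that empties within two passes returns 0 and keeps its target; one that does not returns `-ENOSPC` with the target cleared.
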
Shrink was queuing a full-device RECONCILE_SCAN_device pass and then
starting the final journal flush as soon as the requested kick drained.
In online_three_device_variable_buckets_shrink that could requeue metadata
below the retained region for tens of seconds, and once the kick completed
the resize worker could still collide with reconcile's cached paths during
key-cache pin flushing.

Start shrink-triggered backpointer scans at target_nbuckets translated into
backpointer key space, and wait for reconcile to report idle before the
final shrink flush. That keeps the resize worker focused on the tail that
will actually be truncated and avoids the long post-resize stalls.

A shrink-triggered reconcile kick can keep draining unrelated global
reconcile work long after the truncating tail has already been
evacuated. In the variable-bucket shrink ktest that turned occasional
runs into multi-minute stalls even though the tail was already empty.

Poll tail emptiness while waiting for reconcile so shrink can move on as
soon as evacuation has actually cleared the tail. Keep the final cutoff
path as a separate helper so the control flow stays structured and does
not need gotos.

Shrink's final commit path only needs to flush journal pins that were
already outstanding when it proved the tail empty. Waiting for
bch2_journal_flush_all_pins() lets unrelated reconcile/key-cache work
keep adding newer pins, which turns the resize ioctl into a multi-minute
stall even though the shrinking tail is already clear.

Flush only journal_cur_seq() before the device-specific pin flush. That
still drains the journal state shrink must fence before it commits the
smaller nbuckets, without waiting on unrelated future journal traffic.

…check

restart check is done below after acquiring the lock anyway.

the journal move is not needed as we already move the journal at the
start of the shrink, and new allocations (and thus journal buckets)
are blocked in the tail, so it can't re-appear.

Shrink still used a wall-clock no-progress deadline after moving to tail
head plus aggregate backpointer tracking. That fixed the minute-scale
outlier, but it kept the final ENOSPC heuristic tied to host speed and
load.

Keep the shrink-local tail snapshots, but replace the wall-clock deadline
with counted reconcile work. Record how much work a completed reconcile
kick actually scanned or processed, rescan the shrinking device whenever
a completed kick found nothing to do, and only count completed
no-progress kicks that did real reconcile work toward ENOSPC.

Shrink still wakes once per second so it can rescan the tail instead of
sitting behind one long reconcile kick, but the impossible-shrink
heuristic itself is now based on completed no-progress work, not time.

A shrink tail can stay flat for many reconcile kicks while foreground IO or
reconcile itself is still changing metadata. Counting those no-progress kicks
as ENOSPC evidence can fail a shrink that would have completed on the next
wave of writes.

Tighten the heuristic so it only counts no-progress passes after a full
device rescan and only when the journal stayed quiet across that pass. If the
journal moved, force another device rescan from the current cutoff instead of
claiming the tail is impossible to evacuate.

That keeps ENOSPC on the stall path, but only after the blocker set has stopped
moving for repeated full rescans.
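
A sketch of the tightened heuristic, under stated assumptions: `STALL_LIMIT`, `struct shrink_progress`, and `shrink_note_pass()` are hypothetical, and the choice to reset the counter on journal movement (rather than merely not incrementing it) is this sketch's reading of "repeated full rescans".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STALL_LIMIT 3	/* hypothetical threshold */

struct shrink_progress {
	uint64_t tail_used;	/* tail occupancy at the last pass */
	uint64_t journal_seq;	/* journal sequence at the last pass */
	unsigned stalled;	/* consecutive counted no-progress passes */
};

/* Record one completed pass; true once the tail looks impossible to clear. */
static bool shrink_note_pass(struct shrink_progress *p, uint64_t tail_used,
			     uint64_t journal_seq, bool full_rescan)
{
	bool progress = tail_used < p->tail_used;
	bool journal_quiet = journal_seq == p->journal_seq;

	assert(p != NULL);

	if (progress || !journal_quiet)
		p->stalled = 0;		/* something is still moving */
	else if (full_rescan)
		p->stalled++;		/* quiet journal, full rescan, flat tail */

	p->tail_used = tail_used;
	p->journal_seq = journal_seq;
	return p->stalled >= STALL_LIMIT;
}
```

Only a run of full rescans with a flat tail and an unchanged journal sequence reaches the limit; any journal movement or tail progress starts the count over.
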
Development

Successfully merging this pull request may close these issues.

Support for shrinking filesystem

2 participants