bcachefs: implement online filesystem shrinking #1073
jullanggit wants to merge 62 commits into koverstreet:master
Nice work — the reconcile-based approach for online shrinking is the right direction. Some feedback: On-disk format: Adding The evacuation loop: The
Allocation cutoff: `if (unlikely(ca->mi.target_nbuckets && bucket >= ca->mi.target_nbuckets)) {` This replaces
This inline function does a division ( Commented-out code: The Style nits:
Testing: Good that you have ktest tests for this: https://github.com/jullanggit/ktest/tree/shrink. Consider adding cases for:
Overall this is solid WIP. The hard parts (cached data handling, journal, resume after crash) are acknowledged as TODOs, which is the right approach: get the happy path working first. -- ProofOfConcept
6303f5b to 990d039
Thank you for the review! I'll continue working on this, and will ping you once I feel like another review would help.
a6d79f5 to 82c906f
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602181006.rLTgu86r-lkp@intel.com/
this addition is backwards compatible because new fields are initialized to zero, which means no pending resize, and are not read by older kernels
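The compatibility argument above can be illustrated with a minimal sketch. It assumes a record format where a reader copies however many bytes the writer produced into a zeroed copy of the newest layout. The struct names `member_v1` and `member_v2` and the helpers are hypothetical stand-ins, not the bcachefs superblock code, and endianness handling is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for an old and a new on-disk member record. */
struct member_v1 {
	uint64_t nbuckets;
};

struct member_v2 {
	uint64_t nbuckets;
	uint64_t target_nbuckets;	/* new field: 0 means "no pending resize" */
};

/*
 * Load a record of 'len' bytes, written by any version, into the newest
 * in-memory layout. Fields the writer did not know about stay zero.
 */
static struct member_v2 member_load(const void *rec, size_t len)
{
	struct member_v2 m;

	assert(len <= sizeof(m));
	memset(&m, 0, sizeof(m));
	memcpy(&m, rec, len);
	return m;
}

/* A zero target is read as "nothing to do", so old records are inert. */
static int resize_pending(const struct member_v2 *m)
{
	return m->target_nbuckets != 0;
}
```

In this model a record written by an older version loads with `target_nbuckets == 0`, so `resize_pending()` reports no pending resize for it.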
avoids stripe reshuffling
also comment in outline of shrink path
…elying on ca->mi.target_nbuckets avoids possible edge cases if device is being removed mid-shrink etc.
This is done analogously to the remove alloc info path
Keep the shrink cutoff in force until tail alloc metadata is removed, move journal buckets out of the truncated tail before the final commit, and clear tail need_discard state so fsck/accounting do not retain bookkeeping for removed buckets. Also fix journal bucket deletion so the updated per-device journal superblock state is what gets written back.
Serialize duplicate data-update ownership in the update table so phys reconcile workers cannot race the same logical extent. Also tolerate device-scan pending data_replicas work when other reconcile bits are still set, stop background movers before globally disabling writes during RO shutdown, and avoid counting transient reconcile ENOSPC/ec_alloc_failed retries as hard data-update failures. This fixes the targeted three-device shrink tests, including the EC case.
During device shrink, an extent can become temporarily under-replicated as soon as one pointer lands on a bad or evacuating location, before the device scan has had a chance to rewrite the reconcile entry. Treating that window as a missing io-opts propagation cookie raises a false extent_io_opts_not_set fsck error. Skip that check when the new data_replicas work is explained by existing bad or evacuating pointers so shrink can hand the extent off to reconcile without tripping fsck.
Reconcile-only updates can change the phys-work class encoded in rotational-data backpointers without moving any extent pointers. The fast path in bch2_trigger_extent() treated identical pointer arrays as a no-op, which left stale or missing phys backpointers behind during three-device EC shrink. Teach that path to recognize when only the backpointer reconcile flags are changing, update the existing backpointer entries in place, and adjust the reconcile_work_phys bits without routing the unchanged key through a delete+insert cycle in the write buffer.
Shrink cleanup deletes alloc/freespace state in bucket coordinates, but backpointer keys are indexed by device+sector. Using the raw bucket cutoff for BTREE_ID_backpointers can delete live backpointers that still belong to buckets below the shrink point, which leaves offline fsck repairing missing phys backpointers in the three-device EC shrink tests. Translate the cutoff into backpointer key space before deleting the tail of the truncated device.
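The coordinate mismatch described above can be sketched as follows. This is an illustration only: `BP_SECTOR_SHIFT`, `struct dev_geom`, and both helpers are hypothetical, and the real mapping from buckets to backpointer positions in bcachefs may differ. The only point is that the cutoff has to be converted into the same key space as the keys it is compared against.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scale factor between device sectors and backpointer offsets. */
#define BP_SECTOR_SHIFT 5

struct dev_geom {
	uint32_t bucket_size;	/* sectors per bucket */
};

/* First backpointer offset that belongs to 'bucket'. */
static uint64_t bucket_to_bp_offset(const struct dev_geom *g, uint64_t bucket)
{
	assert(g->bucket_size > 0);
	return (bucket * g->bucket_size) << BP_SECTOR_SHIFT;
}

/* Does a backpointer at 'bp_offset' fall in the tail being truncated? */
static int bp_in_truncated_tail(const struct dev_geom *g,
				uint64_t bp_offset, uint64_t new_nbuckets)
{
	return bp_offset >= bucket_to_bp_offset(g, new_nbuckets);
}
```

With 256-sector buckets and a new size of 100 buckets, a backpointer in the last sector of bucket 99 sits at offset 819168 in this model. Comparing that offset against the raw bucket number 100 would wrongly place it in the tail, while the translated cutoff of 819200 keeps it.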
Repeated online shrink runs can leave copygc blocked in a move write while waiting for allocator space from its dedicated write point. Stopping copygc before those open buckets are dropped can then hang the read-only transition inside kthread_stop(). Close write points before stopping the background data movers so copygc does not stay wedged behind allocator state owned by the filesystem itself.
Reconcile always updates the extent it looks up, either to rewrite reconcile state or to queue a move. Starting those iterators in read mode forces a restart_upgrade during commit for each key, and the three-device EC shrink test can trip the slowpath counter threshold as those restarts add up. Open the logical and physical extent iterators with intent locks from the start so reconcile does not burn one upgrade restart per extent.
Variable-bucket shrink can fail before reconcile moves any data when the filesystem metadata target only contains the device being shrunk. The initial reconcile_scan update, later btree node rewrites, and journal allocations all kept preferring that shrinking target, so once the remaining below-cutoff buckets ran short the shrink aborted with no_buckets_found. Teach internal btree and journal metadata allocations to fall back to the full filesystem when every rw member of the preferred metadata target is currently shrinking. Also treat allocator no_buckets_found as the same transient ENOSPC-class condition reconcile already demotes to pending work, so later shrink-triggered retries do not escalate it into a hard reconcile failure.
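A minimal sketch of the fallback rule described above, using device bitmasks. The function name and the mask representation are assumptions made for illustration; the actual allocator interfaces are not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the set of devices metadata may be allocated from.
 *
 * target:    devices in the preferred metadata target
 * rw:        devices that are currently read-write
 * shrinking: devices with an active shrink
 *
 * If every rw member of the target is shrinking, fall back to all rw
 * devices; otherwise keep using the rw members of the target.
 */
static uint64_t metadata_alloc_devs(uint64_t target, uint64_t rw,
				    uint64_t shrinking)
{
	uint64_t preferred = target & rw;

	assert(rw != 0);	/* sketch assumes at least one rw device */
	if ((preferred & ~shrinking) == 0)
		return rw;
	return preferred;
}
```

Note that in this sketch a target with a mix of shrinking and non-shrinking members is left unchanged; excluding the shrinking members in that case would be a separate policy decision.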
`target_nbuckets` already persists an interrupted shrink target, but mount left that state stranded after a restart. Refactor the shrink path so recovery can reuse it, and resume pending shrinks from the resize-on-mount hook once btrees and normal journal reservations are available. The early resize-on-mount call still handles grow-on-mount image expansion before btrees are running. A second call after early recovery setup now requeues reconcile's device scan and finishes any pending shrink synchronously during mount.
Interpret `target_nbuckets` as the single requested-size field for both shrink and grow, with helpers that normalize `0` to the current device size and make shrink-only users consult the current resize direction. This lets allocator and metadata fallback paths stop treating every nonzero target as an active shrink. Move resize execution to a per-device background kthread. Requests now persist the latest target, bump a sequence number, and wait synchronously for that sequence to finish unless a newer request supersedes it. The worker re-reads the current target at explicit restart points so stale shrink passes bail out instead of finishing obsolete work, and recovery reuses the same path to resume either shrink or grow. Also stop resize workers during read-only shutdown before clean shutdown is marked. Without that, an interrupted shrink can keep issuing transactional alloc/accounting updates during unmount and trip write-path assertions.
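The normalization helpers described above might look roughly like this. The struct and function names are hypothetical; only `target_nbuckets`, its `0` convention, and the shrink/grow distinction come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-device resize state; field names follow the text. */
struct dev_resize {
	uint64_t nbuckets;		/* current device size in buckets */
	uint64_t target_nbuckets;	/* requested size; 0 = no request */
};

enum resize_dir { RESIZE_NONE, RESIZE_SHRINK, RESIZE_GROW };

/* Normalize 0 to the current device size. */
static uint64_t resize_target(const struct dev_resize *d)
{
	return d->target_nbuckets ? d->target_nbuckets : d->nbuckets;
}

static enum resize_dir resize_direction(const struct dev_resize *d)
{
	uint64_t t = resize_target(d);

	if (t < d->nbuckets)
		return RESIZE_SHRINK;
	if (t > d->nbuckets)
		return RESIZE_GROW;
	return RESIZE_NONE;
}

/* Shrink-only user: the cutoff applies only while a shrink is active. */
static int bucket_allocatable(const struct dev_resize *d, uint64_t bucket)
{
	assert(d->nbuckets > 0);
	if (bucket >= d->nbuckets)
		return 0;
	if (resize_direction(d) == RESIZE_SHRINK && bucket >= resize_target(d))
		return 0;
	return 1;
}
```

The design point is that a nonzero target larger than the current size no longer blocks any allocation, because the cutoff check consults the direction instead of testing `target_nbuckets` for nonzero.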
bch2_do_discards() now clears need_discard buckets via alloc updates that commit the transaction and remove the corresponding need_discard entry. Holding the need_discard iterator open across that commit can deadlock against the alloc/freespace updates while shrink is reclaiming truncated tails. Fetch one need_discard key at a time, remember the successor position, and reopen the iterator after each bucket is processed. This keeps the upstream need_discard worker model intact while avoiding the lock cycle seen in shrink tests after the rebase.
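The iteration pattern described above (look up one key, remember its successor, commit, then look up again) can be sketched against a toy index. The sorted array and all function names here are hypothetical stand-ins for the need_discard btree and its iterators; nothing below is the bcachefs iterator API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_KEYS 16

/* Toy stand-in for the need_discard index: a sorted array of bucket numbers. */
struct discard_index {
	uint64_t keys[MAX_KEYS];
	size_t   nr;
};

/* Fresh lookup: find the first key >= pos. Returns 0 if there is none. */
static int index_peek_from(const struct discard_index *idx, uint64_t pos,
			   uint64_t *out)
{
	for (size_t i = 0; i < idx->nr; i++)
		if (idx->keys[i] >= pos) {
			*out = idx->keys[i];
			return 1;
		}
	return 0;
}

/* "Commit" the discard: removes the key, invalidating any held position. */
static void index_discard_one(struct discard_index *idx, uint64_t key)
{
	for (size_t i = 0; i < idx->nr; i++)
		if (idx->keys[i] == key) {
			for (size_t j = i + 1; j < idx->nr; j++)
				idx->keys[j - 1] = idx->keys[j];
			idx->nr--;
			return;
		}
}

/* Process every entry without holding a cursor across a commit. */
static size_t discard_all(struct discard_index *idx)
{
	uint64_t pos = 0, key;
	size_t done = 0;

	assert(idx->nr <= MAX_KEYS);
	while (index_peek_from(idx, pos, &key)) {
		pos = key + 1;			/* remember the successor first */
		index_discard_one(idx, key);	/* no lookup state is held here */
		done++;
	}
	return done;
}
```

Each loop iteration starts from a plain position rather than a live cursor, which is what lets the commit in the middle take whatever locks it needs.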
Do not keep reconcile_scan iterators or other cached search paths alive across the actual move/rewrite operations. Return the next reconcile_scan entry without exiting from inside the iterator helper, and explicitly restart the transaction before the normal and phys reconcile workers begin mutating extents. This avoids deadlocks where cached reconcile paths stay pinned while move work needs alloc/freespace updates and btree node rewrites.
Shrink still needs discard progress on retained buckets, but discards in the tail being evacuated can deadlock against resize/reconcile and journal pin flushing. Only defer need_discard entries at or beyond the current shrink target, drain any in-flight discard workers before the shrink path starts, and re-kick discards once the current resize request reaches a terminal state. Also clear tail need_discard state by deleting the derived index entries directly instead of mutating alloc keys that are about to be truncated anyway.
Track completed reconcile kicks so shrink can wait for a full reconcile pass instead of polling forever for the tail to become empty. Shrink now queues a device scan plus a pending pass, retries pending work once more, and returns `-ENOSPC` if the tail is still occupied after both passes. Before returning that failure, clear the persisted shrink target and wake pending reconcile work so remount no longer retries a known-impossible shrink under the stale cutoff.
Shrink was queuing a full-device RECONCILE_SCAN_device pass and then starting the final journal flush as soon as the requested kick drained. In online_three_device_variable_buckets_shrink that could requeue metadata below the retained region for tens of seconds, and once the kick completed the resize worker could still collide with reconcile's cached paths during key-cache pin flushing. Start shrink-triggered backpointer scans at target_nbuckets translated into backpointer key space, and wait for reconcile to report idle before the final shrink flush. That keeps the resize worker focused on the tail that will actually be truncated and avoids the long post-resize stalls.
A shrink-triggered reconcile kick can keep draining unrelated global reconcile work long after the truncating tail has already been evacuated. In the variable-bucket shrink ktest that turned occasional runs into multi-minute stalls even though the tail was already empty. Poll tail emptiness while waiting for reconcile so shrink can move on as soon as evacuation has actually cleared the tail. Keep the final cutoff path as a separate helper so the control flow stays structured and does not need gotos.
Shrink's final commit path only needs to flush journal pins that were already outstanding when it proved the tail empty. Waiting for bch2_journal_flush_all_pins() lets unrelated reconcile/key-cache work keep adding newer pins, which turns the resize ioctl into a multi-minute stall even though the shrinking tail is already clear. Flush only journal_cur_seq() before the device-specific pin flush. That still drains the journal state shrink must fence before it commits the smaller nbuckets, without waiting on unrelated future journal traffic.
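To illustrate why fencing at a snapshot of the current sequence terminates while "flush everything" may not, here is a toy model. `struct toy_journal` and its helpers are hypothetical; the model simply assumes that unrelated work adds one newer pin every time an old pin is flushed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PINS 64

/* Toy journal: each outstanding pin is recorded by its sequence number. */
struct toy_journal {
	uint64_t cur_seq;
	uint64_t pins[MAX_PINS];
	size_t   nr_pins;
};

static void journal_add_pin(struct toy_journal *j)
{
	assert(j->nr_pins < MAX_PINS);
	j->pins[j->nr_pins++] = ++j->cur_seq;
}

/*
 * Flush every pin with seq <= upto. Each flush is followed by one new pin
 * from simulated background traffic. Returns the number of pins flushed.
 */
static size_t journal_flush_upto(struct toy_journal *j, uint64_t upto)
{
	size_t flushed = 0;

	for (size_t i = 0; i < j->nr_pins; ) {
		if (j->pins[i] <= upto) {
			j->pins[i] = j->pins[--j->nr_pins];	/* swap-remove */
			flushed++;
			journal_add_pin(j);	/* unrelated newer pin appears */
		} else {
			i++;
		}
	}
	return flushed;
}
```

Starting from three pins and calling `journal_flush_upto(j, j->cur_seq)` flushes exactly those three and returns, even though three newer pins now exist. A loop that waits for `nr_pins == 0` would never finish in this model.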
…check restart check is done below after acquiring the lock anyway. The journal move is not needed: the journal is already moved at the start of the shrink, and new allocations (and thus journal buckets) are blocked in the tail, so journal buckets cannot reappear there.
imo more descriptive
Shrink still used a wall-clock no-progress deadline after moving to tail head plus aggregate backpointer tracking. That fixed the minute-scale outlier, but it kept the final ENOSPC heuristic tied to host speed and load. Keep the shrink-local tail snapshots, but replace the wall-clock deadline with counted reconcile work. Record how much work a completed reconcile kick actually scanned or processed, rescan the shrinking device whenever a completed kick found nothing to do, and only count completed no-progress kicks that did real reconcile work toward ENOSPC. Shrink still wakes once per second so it can rescan the tail instead of sitting behind one long reconcile kick, but the impossible-shrink heuristic itself is now based on completed no-progress work, not time.
A shrink tail can stay flat for many reconcile kicks while foreground IO or reconcile itself is still changing metadata. Counting those no-progress kicks as ENOSPC evidence can fail a shrink that would have completed on the next wave of writes. Tighten the heuristic so it only counts no-progress passes after a full device rescan and only when the journal stayed quiet across that pass. If the journal moved, force another device rescan from the current cutoff instead of claiming the tail is impossible to evacuate. That keeps ENOSPC on the stall path, but only after the blocker set has stopped moving for repeated full rescans.
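The tightened heuristic can be read as a small state machine over completed reconcile passes. The sketch below is one possible reading, not the code from this PR: the threshold, the type names, the choice to reset the counter on progress, and the choice to leave it unchanged when the journal moved are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define SHRINK_MAX_STALLED_PASSES 3	/* hypothetical threshold */

enum shrink_verdict {
	SHRINK_DONE,		/* tail is empty */
	SHRINK_CONTINUE,	/* keep waiting */
	SHRINK_RESCAN,		/* force another full device rescan */
	SHRINK_ENOSPC,		/* tail looks impossible to evacuate */
};

struct shrink_progress {
	uint64_t last_tail_used;
	uint64_t last_journal_seq;
	unsigned stalled_passes;
};

static void shrink_progress_init(struct shrink_progress *p,
				 uint64_t tail_used, uint64_t journal_seq)
{
	p->last_tail_used = tail_used;
	p->last_journal_seq = journal_seq;
	p->stalled_passes = 0;
}

/* Account for one completed reconcile pass. */
static enum shrink_verdict shrink_account_pass(struct shrink_progress *p,
					       uint64_t tail_used,
					       uint64_t journal_seq,
					       int full_rescan)
{
	enum shrink_verdict v = SHRINK_CONTINUE;

	assert(p);
	if (!tail_used)
		return SHRINK_DONE;

	if (tail_used < p->last_tail_used) {
		p->stalled_passes = 0;		/* progress: start over */
	} else if (journal_seq != p->last_journal_seq || !full_rescan) {
		v = SHRINK_RESCAN;		/* not evidence of ENOSPC */
	} else if (++p->stalled_passes >= SHRINK_MAX_STALLED_PASSES) {
		v = SHRINK_ENOSPC;		/* quiet, full, and stuck */
	}

	p->last_tail_used = tail_used;
	p->last_journal_seq = journal_seq;
	return v;
}
```

Only a pass that was a full rescan, saw no journal movement, and made no progress increments the counter, so a busy filesystem keeps triggering rescans instead of failing the shrink.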
Implement online filesystem shrinking through reconcile. Closes #781 once done.
This is hopefully complementary to #1070, which targets offline shrink.
Goal
A robust online shrinking implementation that resumes automatically after restarts or crashes, since shrinking can be a long-running operation, and that supports changing the target size mid-shrink.
Current state
Implementation
Reuses large parts of the device remove/evacuate paths
Documentation
Not yet written
Testing
See https://github.com/jullanggit/ktest/tree/shrink for the tests used