RFC - Store inflight state in-memory and flush to sqlite periodically #487

markstory · 2025-09-15T20:32:48Z

⚠️ This is a rough proof of concept and will not be merged ⚠️

Currently all state changes are made in SQLite, and because of how taskbroker's logic works, the application is almost entirely write operations. Because SQLite has only a single write lock on the database, we often see gRPC, upkeep, and consumer latency increase at the same time as contention piles up in SQLite.

These changes move much of the activation state machine into a set of in-memory heaps/sets that are wrapped with a mutex. This allows gRPC operations to become detached from SQLite writes, which should reduce contention on the write lock. As activations are mutated by gRPC and upkeep, modified records are added to the dirty_ids set and periodically flushed to SQLite during ingest and upkeep.
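A minimal sketch of the dirty-tracking idea, not the code in this PR. Only `dirty_ids` comes from the description above; `MetadataStore` is borrowed from the file name `metadata_store.rs`, and `Status`, `set_status`, and `take_dirty` are hypothetical names. A plain `HashMap` stands in for the heaps/sets, and the actual SQLite write is left to the caller.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

/// Hypothetical subset of activation states; the real state machine has more.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Pending,
    Processing,
    Complete,
}

#[derive(Default)]
struct Inner {
    /// Current in-memory state of each activation, keyed by id.
    statuses: HashMap<String, Status>,
    /// Ids mutated since the last flush.
    dirty_ids: HashSet<String>,
}

/// In-memory metadata guarded by a single mutex (assumed shape).
#[derive(Default)]
pub struct MetadataStore {
    inner: Mutex<Inner>,
}

impl MetadataStore {
    /// Insert or update an activation and mark it dirty.
    /// No SQLite write happens on this path.
    pub fn set_status(&self, id: &str, status: Status) {
        let mut inner = self.inner.lock().unwrap();
        inner.statuses.insert(id.to_string(), status);
        inner.dirty_ids.insert(id.to_string());
    }

    /// Drain the dirty set and return the rows a caller would write to
    /// SQLite in one batch. Rows are sorted by id for determinism.
    pub fn take_dirty(&self) -> Vec<(String, Status)> {
        let mut inner = self.inner.lock().unwrap();
        let ids: Vec<String> = inner.dirty_ids.drain().collect();
        let mut rows = Vec::with_capacity(ids.len());
        for id in ids {
            if let Some(status) = inner.statuses.get(&id).copied() {
                rows.push((id, status));
            }
        }
        rows.sort_by(|a, b| a.0.cmp(&b.0));
        rows
    }
}
```

In this sketch, repeated mutations of the same activation between flushes collapse into a single row, so the number of SQLite writes scales with the number of distinct dirty ids rather than the number of gRPC calls.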

These changes mean that inflight state is no longer fully durable. Instead, state changes can be lost between commit calls. This could lead to tasks being executed multiple times, but shouldn't result in tasks being lost or dropped. We already have the opportunity for duplicate execution (through processing deadlines), and we'd be expanding the scope of that problem but not creating new durability or data-loss scenarios (that I'm aware of).
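To make the durability window concrete, here is a toy simulation of a crash between flushes. It is a sketch under stated assumptions: `Durable`, `Memory`, `claim`, `flush`, `restore`, and `executions_across_crash` are all hypothetical names, a `HashMap` stands in for SQLite, and processing deadlines are ignored.

```rust
use std::collections::HashMap;

/// Hypothetical two-state subset of the activation state machine.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status {
    Pending,
    Processing,
}

/// Stand-in for the SQLite tables: only changes when `flush` runs.
#[derive(Default)]
struct Durable {
    rows: HashMap<String, Status>,
}

/// Stand-in for the in-memory state that serves gRPC requests.
struct Memory {
    rows: HashMap<String, Status>,
}

impl Memory {
    /// Rebuild in-memory state from whatever was last flushed.
    fn restore(durable: &Durable) -> Self {
        Memory { rows: durable.rows.clone() }
    }

    /// Hand a pending activation to a worker.
    /// Returns true if the task was handed out.
    fn claim(&mut self, id: &str) -> bool {
        if self.rows.get(id) == Some(&Status::Pending) {
            self.rows.insert(id.to_string(), Status::Processing);
            true
        } else {
            false
        }
    }

    /// Copy the current in-memory state to durable storage.
    fn flush(&self, durable: &mut Durable) {
        durable.rows = self.rows.clone();
    }
}

/// Count how many times one task is handed to a worker across a
/// simulated crash and restart.
fn executions_across_crash(flush_before_crash: bool) -> u32 {
    let mut durable = Durable::default();
    durable.rows.insert("task-1".to_string(), Status::Pending);

    let mut executions = 0;
    let mut memory = Memory::restore(&durable);
    if memory.claim("task-1") {
        executions += 1;
    }
    if flush_before_crash {
        memory.flush(&mut durable);
    }
    drop(memory); // simulated crash: in-memory state is gone

    // Restart: state comes back from the last flush only.
    let mut memory = Memory::restore(&durable);
    if memory.claim("task-1") {
        executions += 1;
    }
    executions
}
```

Without a flush before the crash, the task reverts to pending on restart and is handed out a second time. With a flush, the restored state already records it as processing. In neither case does the task disappear, which matches the claim that this widens the duplicate-execution window without adding a data-loss path.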

I've also separated the 'blob storage' and 'metadata storage' into separate tables. We tried this in #369 and didn't move forward then, as we weren't able to see noticeable improvements. My hope is that by separating the tables again and removing write traffic, we can reduce fragmentation in the database, as rows containing activation blobs will no longer be mutated. Splitting storage in SQLite is also a step toward storing large activations on the filesystem (which is also in our future plans).
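The split can be pictured as two record types, one written once and one updated often. This is an illustrative sketch only: the struct names, field names, and helper functions below are hypothetical and are not taken from the migration in this PR.

```rust
/// Hypothetical two-state subset of the activation state machine.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Pending,
    Processing,
}

/// Large row holding the serialized activation.
/// Written once at ingest and never updated afterwards (assumed layout).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ActivationBlob {
    pub id: String,
    pub payload: Vec<u8>,
}

/// Small row holding the mutable state (assumed fields).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ActivationMetadata {
    pub id: String,
    pub status: Status,
    pub processing_attempts: u32,
}

/// Split an incoming activation into its immutable and mutable parts.
pub fn split_activation(id: &str, payload: Vec<u8>) -> (ActivationBlob, ActivationMetadata) {
    let blob = ActivationBlob { id: id.to_string(), payload };
    let meta = ActivationMetadata {
        id: id.to_string(),
        status: Status::Pending,
        processing_attempts: 0,
    };
    (blob, meta)
}

/// A state transition touches only the metadata record.
pub fn mark_processing(meta: &mut ActivationMetadata) {
    meta.status = Status::Processing;
    meta.processing_attempts += 1;
}
```

Because `mark_processing` takes only the metadata record, a state change in this sketch cannot rewrite the row that holds the payload.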

Next steps

I'd like to get this onto sandboxes and validate:

That contention on SQLite has been reduced, and that these changes unlock additional broker throughput by letting us go above 24 * 32 workers per broker.
That gRPC latency isn't impacted by slowdowns in writes to SQLite.
That the shutdown/startup state flush/restore process behaves correctly.

If this prototype succeeds, I'll put together a more complete plan on how we could incrementally and safely ship these changes.


codecov · 2025-09-16T15:32:27Z

Codecov Report

❌ Patch coverage is 83.35832% with 111 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.42%. Comparing base (66748f2) to head (96f123c).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/store/metadata_store.rs	85.81%	62 Missing ⚠️
src/store/records.rs	74.63%	35 Missing ⚠️
src/store/inflight_activation.rs	88.57%	8 Missing ⚠️
src/main.rs	0.00%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #487      +/-   ##
==========================================
- Coverage   88.15%   86.42%   -1.73%     
==========================================
  Files          20       22       +2     
  Lines        5359     5789     +430     
==========================================
+ Hits         4724     5003     +279     
- Misses        635      786     +151


markstory added 5 commits September 15, 2025 16:36

Import rough code from other machine

5cc4c04

Add migration for new tables.

5ee6fd1

Fix failing killswitch test

eae1c75

comment out some tests for upkeep

af0a9b5

this logic moved to get_pending_activation now.

Add load and flush during startup/shutdown

7c3fb12

markstory force-pushed the memory-storage branch from 043bf65 to 7c3fb12 Compare September 15, 2025 20:37

markstory added 2 commits September 16, 2025 11:26

Fix lint

d9b052c

Remove tests for no-op methods

2434150

Fix whitespace

96f123c
