markstory

⚠️ This is a rough proof of concept and will not be merged ⚠️

Currently all state changes are made in SQLite, and because of how taskbroker's logic works, the application is almost entirely write operations. Since SQLite has only a single write lock on the database, we often see gRPC, upkeep, and consumer latency all increase at the same time as contention piles up in SQLite.

These changes move much of the activation state machine into a set of in-memory heaps/sets that are wrapped with a mutex. This detaches gRPC operations from SQLite writes, which should reduce contention on the write lock. As activations are mutated by gRPC and upkeep, modified records are added to the dirty_ids set and periodically flushed to SQLite during ingest and upkeep.
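
To make the shape of this concrete, here is a minimal sketch of a mutex-wrapped in-memory store with dirty tracking. It is not the code in this PR: `MetadataStore`, `Status`, `claim_next`, `take_dirty`, the `u64` ids and the three-state lifecycle are all simplified assumptions for illustration, and only the `dirty_ids` name comes from the description above.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap, HashSet};
use std::sync::Mutex;

/// Hypothetical, simplified activation states (the real state machine has more).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Pending,
    Processing,
    Complete,
}

#[derive(Default)]
struct Inner {
    statuses: HashMap<u64, Status>,
    /// Min-heap of (added_at, id), so the oldest pending activation pops first.
    pending: BinaryHeap<Reverse<(u64, u64)>>,
    /// Ids mutated since the last flush to SQLite.
    dirty_ids: HashSet<u64>,
}

/// All state lives behind one mutex; no method here touches SQLite.
pub struct MetadataStore {
    inner: Mutex<Inner>,
}

impl MetadataStore {
    pub fn new() -> Self {
        Self { inner: Mutex::new(Inner::default()) }
    }

    /// Ingest path: register a new pending activation.
    pub fn add(&self, id: u64, added_at: u64) {
        let mut g = self.inner.lock().unwrap();
        g.statuses.insert(id, Status::Pending);
        g.pending.push(Reverse((added_at, id)));
        g.dirty_ids.insert(id);
    }

    /// gRPC path: hand out the oldest pending activation, in memory only.
    pub fn claim_next(&self) -> Option<u64> {
        let mut g = self.inner.lock().unwrap();
        let Reverse((_, id)) = g.pending.pop()?;
        g.statuses.insert(id, Status::Processing);
        g.dirty_ids.insert(id);
        Some(id)
    }

    /// gRPC path: mark a claimed activation complete; false if it was not claimed.
    pub fn complete(&self, id: u64) -> bool {
        let mut g = self.inner.lock().unwrap();
        if g.statuses.get(&id) != Some(&Status::Processing) {
            return false;
        }
        g.statuses.insert(id, Status::Complete);
        g.dirty_ids.insert(id);
        true
    }

    /// Flush path: drain the dirty set and return the rows to write in one
    /// SQLite transaction. The write itself happens outside the mutex.
    pub fn take_dirty(&self) -> Vec<(u64, Status)> {
        let mut g = self.inner.lock().unwrap();
        let ids = std::mem::take(&mut g.dirty_ids);
        let mut rows: Vec<(u64, Status)> = ids
            .into_iter()
            .filter_map(|id| g.statuses.get(&id).map(|s| (id, *s)))
            .collect();
        rows.sort_by_key(|(id, _)| *id);
        rows
    }
}
```

In this sketch the gRPC-facing methods only hold the mutex for a few map and heap operations, and an id that changes several times between flushes costs one row write instead of several.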

These changes mean that inflight state is no longer fully durable. Instead, state changes can be lost between commit calls. This could lead to tasks being executed multiple times, but shouldn't result in tasks being lost or dropped. We already have the opportunity for duplicate execution (through processing deadlines), so we'd be expanding the scope of that problem but not creating new durability or data-loss scenarios (that I'm aware of).
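
As a rough illustration of that failure mode, the sketch below compares the last flushed state with the in-memory state at the moment of a crash. The names (`Status`, `rerun_candidates`) and the three-state model are hypothetical, and it assumes each activation was already written to SQLite as pending when it was ingested.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified activation states for illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Pending,
    Processing,
    Complete,
}

/// Ids whose progress would be lost in a crash: the durable copy still says
/// `Pending` although memory had already moved them forward. After a restart
/// they look like fresh work, so they may be executed a second time.
pub fn rerun_candidates(
    durable: &HashMap<u64, Status>,
    in_memory: &HashMap<u64, Status>,
) -> Vec<u64> {
    let mut ids: Vec<u64> = in_memory
        .iter()
        .filter(|(id, mem)| {
            **mem != Status::Pending && durable.get(*id) == Some(&Status::Pending)
        })
        .map(|(id, _)| *id)
        .collect();
    ids.sort_unstable();
    ids
}
```

Under these assumptions an unflushed transition only rolls an activation back to an earlier state, so the task runs again rather than disappearing.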

I've also separated the 'blob storage' and 'metadata storage' into separate tables. We tried this in #369 and didn't move forward then, as we weren't able to see noticeable improvements. My hope is that by separating the tables again and removing write traffic, we can reduce fragmentation in the database, as rows containing activation blobs will not be mutated anymore. Splitting storage in SQLite is also a step towards storing large activations on the filesystem (which is also in our future plans).

Next steps

I'd like to get this onto sandboxes and validate:

  1. That contention on SQLite has been reduced, and that these changes unlock additional broker throughput by letting us go above 24 * 32 workers per broker.
  2. That gRPC latency isn't impacted by slowdowns in writes to SQLite.
  3. That the shutdown/startup state flush/restore process behaves correctly.

If this prototype succeeds, I'll put together a more complete plan on how we could incrementally and safely ship these changes.

codecov bot commented Sep 16, 2025

Codecov Report

❌ Patch coverage is 83.35832% with 111 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.42%. Comparing base (66748f2) to head (96f123c).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/store/metadata_store.rs 85.81% 62 Missing ⚠️
src/store/records.rs 74.63% 35 Missing ⚠️
src/store/inflight_activation.rs 88.57% 8 Missing ⚠️
src/main.rs 0.00% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #487      +/-   ##
==========================================
- Coverage   88.15%   86.42%   -1.73%     
==========================================
  Files          20       22       +2     
  Lines        5359     5789     +430     
==========================================
+ Hits         4724     5003     +279     
- Misses        635      786     +151     
