-
Notifications
You must be signed in to change notification settings - Fork 267
feat: optimistic scheduling #2258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
The latest updates on your projects. Learn more about Vercel for GitHub.
|
var localScheduler *schedulerv1.Scheduler | ||
|
||
if sc.HasService("all") || sc.HasService("scheduler") { | ||
partitionCleanup, err := p.StartSchedulerPartition(ctx) | ||
|
||
if err != nil { | ||
return nil, fmt.Errorf("could not create create scheduler partition: %w", err) | ||
} | ||
|
||
teardown = append(teardown, Teardown{ | ||
Name: "scheduler partition", | ||
Fn: partitionCleanup, | ||
}) | ||
|
||
// create the dispatcher | ||
s, err := scheduler.New( | ||
scheduler.WithAlerter(sc.Alerter), | ||
scheduler.WithMessageQueue(sc.MessageQueue), | ||
scheduler.WithRepository(sc.EngineRepository), | ||
scheduler.WithLogger(sc.Logger), | ||
scheduler.WithPartition(p), | ||
scheduler.WithQueueLoggerConfig(&sc.AdditionalLoggers.Queue), | ||
scheduler.WithSchedulerPool(sc.SchedulingPool), | ||
) | ||
|
||
if err != nil { | ||
return nil, fmt.Errorf("could not create dispatcher: %w", err) | ||
} | ||
|
||
cleanup, err := s.Start() | ||
|
||
if err != nil { | ||
return nil, fmt.Errorf("could not start dispatcher: %w", err) | ||
} | ||
|
||
teardown = append(teardown, Teardown{ | ||
Name: "scheduler", | ||
Fn: cleanup, | ||
}) | ||
|
||
sv1, err := schedulerv1.New( | ||
schedulerv1.WithAlerter(sc.Alerter), | ||
schedulerv1.WithMessageQueue(sc.MessageQueueV1), | ||
schedulerv1.WithRepository(sc.EngineRepository), | ||
schedulerv1.WithV2Repository(sc.V1), | ||
schedulerv1.WithLogger(sc.Logger), | ||
schedulerv1.WithPartition(p), | ||
schedulerv1.WithQueueLoggerConfig(&sc.AdditionalLoggers.Queue), | ||
schedulerv1.WithSchedulerPool(sc.SchedulingPoolV1), | ||
) | ||
|
||
if err != nil { | ||
return nil, fmt.Errorf("could not create scheduler (v1): %w", err) | ||
} | ||
|
||
cleanup, err = sv1.Start() | ||
|
||
if err != nil { | ||
return nil, fmt.Errorf("could not start scheduler (v1): %w", err) | ||
} | ||
|
||
teardown = append(teardown, Teardown{ | ||
Name: "schedulerv1", | ||
Fn: cleanup, | ||
}) | ||
|
||
localScheduler = sv1 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just moved this up and added the localScheduler
assignment, but otherwise no code changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to be updated before final merge, as we'll probably get this deployed after payloads and partitions.
rec RECORD; | ||
BEGIN | ||
-- Only insert if there's a single task with initial_state = 'QUEUED' and concurrency_strategy_ids is not null | ||
IF (SELECT COUNT(*) FROM new_table WHERE initial_state = 'QUEUED' AND concurrency_strategy_ids[1] IS NOT NULL) > 0 THEN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this IF statement to avoid cascading triggers on concurrency tables
WHERE initial_state = 'QUEUED' AND concurrency_strategy_ids[1] IS NULL; | ||
|
||
-- Only insert into v1_dag and v1_dag_to_task if dag_id and dag_inserted_at are not null | ||
IF (SELECT COUNT(*) FROM new_table WHERE dag_id IS NOT NULL AND dag_inserted_at IS NOT NULL) = 0 THEN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, added this IF statement to avoid extra statements in this trigger for single-task workflows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to consolidate signaling logic between the local scheduler (which creates tasks) and the tasks controller, and this was the cleanest way to do it. I also figured it'd give us an easier way to hook into two-phase commits to keep the tasks controller and OLAP controller in sync.
} | ||
|
||
func (s *DispatcherImpl) Register(ctx context.Context, request *contracts.WorkerRegisterRequest) (*contracts.WorkerRegisterResponse, error) { | ||
func (d *DispatcherImpl) Register(ctx context.Context, request *contracts.WorkerRegisterRequest) (*contracts.WorkerRegisterResponse, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive-by, but the golint rules were complaining about using s *DispatcherImpl
in some places and d *DispatcherImpl
in other places, and I agree it makes the code harder to read, so changed all locations to use d *DispatcherImpl
.
res = append(res, &contracts.WorkflowRunEvent{ | ||
WorkflowRunId: payload.WorkflowRunId, | ||
EventType: contracts.WorkflowRunEventType_WORKFLOW_RUN_EVENT_TYPE_FINISHED, | ||
EventTimestamp: timestamppb.New(time.Now()), | ||
Results: []*contracts.StepRunResult{ | ||
{ | ||
StepRunId: payload.ExternalId, | ||
StepReadableId: payload.StepReadableId, | ||
JobRunId: payload.ExternalId, | ||
Output: &output, | ||
}, | ||
}, | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the core change in this file, if we have a single-task workflow and we get a task-completed event, we send it immediately to the worker.
I think there's a chance of race conditions here, because we don't want to send a message to a worker before it's been fully written/acknowledged by the database, which we risk here.
So if we were to fail to write the task-completed
event to the database, we wouldn't successfully release the task, and it may have a different status after we hit a timeout/reassignment than the parent workflow would see.
Yet another case where 2PC is necessary - or perhaps just moving the processing of the task-completed event to the gRPC layer like we used to have. I need to think a bit more about it.
tenantIdWorkflowNameCache *expirable.LRU[string, *sqlcv1.ListWorkflowsByNamesRow] | ||
stepsInWorkflowVersionCache *expirable.LRU[string, []*sqlcv1.ListStepsByWorkflowVersionIdsRow] | ||
stepIdLabelsCache *expirable.LRU[string, []*sqlcv1.GetDesiredLabelsRow] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to start moving towards usage of the expirable
package instead of our home-built cache, because of nicer typing/LRU support - I think we can fully make this transition after we deprecate v0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some questions! I'm a little nervous about the amount of complexity this adds to some already-complex parts of the codebase, but if it's worth the performance gains then 🤷
func (i *AdminServiceImpl) ingest(ctx context.Context, tenantId string, opts ...*v1.WorkflowNameTriggerOpts) error { | ||
if i.localScheduler != nil { | ||
localWorkerIds := map[string]struct{}{} | ||
|
||
if i.localDispatcher != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this an exact copy of the implementation in internal/services/admin/server_v1.go
?
dagCp := dag | ||
msg, err := tasktypes.CreatedDAGMessage(tenantId, dagCp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we copying this because it gets mutated? if so, I'd much prefer to not mutate it 😅
for _, task := range tasks { | ||
taskExternalId := sqlchelpers.UUIDToStr(task.ExternalID) | ||
|
||
dataBytes := v1.NewCancelledTaskOutputEventFromTask(task).Bytes() | ||
|
||
internalEvents = append(internalEvents, v1.InternalTaskEvent{ | ||
TenantID: tenantId, | ||
TaskID: task.ID, | ||
TaskExternalID: taskExternalId, | ||
RetryCount: task.RetryCount, | ||
EventType: sqlcv1.V1TaskEventTypeCANCELLED, | ||
Data: dataBytes, | ||
}) | ||
} | ||
|
||
err := s.sendInternalEvents(ctx, tenantId, internalEvents) | ||
|
||
if err != nil { | ||
return err | ||
} | ||
|
||
// notify that tasks have been cancelled | ||
// TODO: make this transactionally safe? | ||
for _, task := range tasks { | ||
msg, err := tasktypes.MonitoringEventMessageFromInternal(tenantId, tasktypes.CreateMonitoringEventPayload{ | ||
TaskId: task.ID, | ||
RetryCount: task.RetryCount, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think it's worth consolidating these loops for performance?
// Note: this is very similar to handleTaskBulkAssignedTask, with some differences in what's sync vs run in a goroutine | ||
// In this method, we wait until all tasks have been sent to the worker before returning | ||
func (d *DispatcherImpl) HandleLocalAssignments(ctx context.Context, tenantId, workerId string, tasks []*schedulingv1.AssignedItemWithTask) error { | ||
// we set a timeout of 25 seconds because we don't want to hold the semaphore for longer than the visibility timeout (30 seconds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this 30 second visibility timeout hard-coded somewhere? it seems risky to hard-code this 25 sec if that's modifiable
for _, task := range bulkDatas { | ||
if parentData, ok := parentDataMap[task.ID]; ok { | ||
currInput := &v1.V1StepRunData{} | ||
|
||
if task.Input != nil { | ||
err := json.Unmarshal(task.Input, currInput) | ||
|
||
if err != nil { | ||
d.l.Warn().Err(err).Msg("failed to unmarshal input") | ||
continue | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just FYI, a bunch of this will conflict with the payloads changes
) | ||
} else { | ||
success = true | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we break
here on success?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're technically looping over individual worker sessions, and only one of those will be valid, the others will fail when trying to send to the worker (or they should). this is an edge case when a worker reconnects and we haven't determined that the old connection has been interrupted yet.
func eventToTaskV1(tenantId, eventExternalId, key string, data, additionalMeta []byte, priority *int32, scope *string, triggeringWebhookName *string) (*msgqueue.Message, error) { | ||
payloadTyped := tasktypes.UserEventTaskPayload{ | ||
EventExternalId: eventExternalId, | ||
func eventToPayload(tenantId, key string, data, additionalMeta []byte, priority *int32, scope *string, triggeringWebhookName *string) tasktypes.UserEventTaskPayload { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think this naming is a little confusing even if technically correct given our abuse of the term "payload"
return r.PrepareOptimisticTx(ctx) | ||
} | ||
|
||
func (r *optimisticSchedulingRepositoryImpl) TriggerFromEvents(ctx context.Context, tx *OptimisticTx, tenantId string, opts []EventTriggerOpts) ([]*sqlcv1.V1QueueItem, *TriggerFromEventsResult, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this and TriggerFromNames
are all copy and paste right? are there any changes here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah there are some small changes in how we're handling pre/post commits and treating the transactions
// look up the workflow versions for the workflow names | ||
workflowVersions, err := r.queries.ListWorkflowsByNames(ctx, tx, sqlcv1.ListWorkflowsByNamesParams{ | ||
Tenantid: sqlchelpers.UUIDFromStr(tenantId), | ||
Workflownames: workflowNamesToLookup, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we check if workflowNamesToLookup
is empty before running this query?
|
||
steps, err := r.queries.ListStepsByWorkflowVersionIds(ctx, tx, sqlcv1.ListStepsByWorkflowVersionIdsParams{ | ||
Tenantid: sqlchelpers.UUIDFromStr(tenantId), | ||
Ids: workflowVersionsToLookup, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here maybe?
Description
Adds support for "optimistic" scheduling, meaning that we can create tasks from the gRPC engine with transactional safety, and schedule tasks on workers which are connected to the current gRPC session (these are two separate concepts, referred to in code by
localScheduler
andlocalDispatcher
). We allocate a small set of semaphores for that.Features:
task-completed
. We can similarly addtask-failed
andtask-cancelled
in the future.Drawbacks:
Limitations:
n
engines are horizontally scaled, the chances of optimistic scheduling reduce by1/n
- we only use local schedulers when they have a lease on a tenant. We will need to build out a sticky load balancing strategy to take advantage of optimistic scheduling in HA setups.Type of change