
IX-1782: Que health check surfaces unhealthy state if postgres_error on any worker#118

Draft
jcardozagc wants to merge 2 commits into master from ix-1782-healthcheck-checks-workers-for-pg-error

Conversation


@jcardozagc jcardozagc commented Feb 25, 2026

During a recent incident, after a CloudSQL maintenance event, Que worker pods ended up with stale database connections. Workers handle this internally: on a PG::Error, the work loop returns :postgres_error, sleeps for the wake interval, and retries. The health check, however, is a hardcoded lambda that always returns 200, so Kubernetes gets no signal to restart the affected pods, and the worker in the incident ended up perpetually retrying a faulty connection every 5 seconds.
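To make the failure mode concrete, here is a minimal, self-contained sketch (not Que's real internals; Worker, work, and the job block are illustrative stand-ins) of the behaviour described above: a PG::Error inside one work cycle is swallowed and surfaces only as the symbol :postgres_error, which nothing outside the worker currently inspects.

```ruby
# Stand-in for PG::Error so the sketch runs without the pg gem.
module PG; class Error < StandardError; end; end

class Worker
  attr_reader :last_result

  def initialize(&job)
    @job = job
  end

  # One cycle of the work loop: run the job and translate any
  # PG::Error into the :postgres_error result symbol. In the real
  # worker this result only decides how long to sleep before retrying.
  def work
    @last_result =
      begin
        @job.call
        :job_worked
      rescue PG::Error
        :postgres_error
      end
  end
end

healthy_worker = Worker.new { :ok }
broken_worker  = Worker.new { raise PG::Error, "stale connection" }

healthy_worker.work # => :job_worked
broken_worker.work  # => :postgres_error
```

Because the error never escapes the work method, the pod stays Running and the liveness probe keeps passing.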

Let's track the result of each work cycle on the worker itself and expose it through a healthy? predicate (which simply checks whether the worker hit a postgres_error on its last work_loop cycle). The health check endpoint now delegates to a new WorkerHealthCheck that checks every worker in the group and returns 503 if any of them is in an unhealthy state; otherwise it returns the 200 it has always returned.
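A hedged sketch of the proposed shape: the healthy? predicate and WorkerHealthCheck names come from the description above, but the internals (last_result tracking, the Rack-style response triple) are assumptions, not the actual diff.

```ruby
class Worker
  attr_accessor :last_result

  # Unhealthy only when the most recent work_loop cycle
  # ended in :postgres_error.
  def healthy?
    last_result != :postgres_error
  end
end

class WorkerHealthCheck
  def initialize(workers)
    @workers = workers
  end

  # Rack-style response: 503 if any worker in the group is
  # unhealthy, otherwise the 200 the endpoint has always returned.
  def call(_env = nil)
    if @workers.all?(&:healthy?)
      [200, {}, ["OK"]]
    else
      [503, {}, ["Unhealthy: worker in postgres_error state"]]
    end
  end
end

ok    = Worker.new.tap { |w| w.last_result = :job_worked }
stale = Worker.new.tap { |w| w.last_result = :postgres_error }

WorkerHealthCheck.new([ok]).call.first        # => 200
WorkerHealthCheck.new([ok, stale]).call.first # => 503
```

One worker in postgres_error is enough to flip the whole pod to 503, which is the point: the pod's connections are shared state, so any stuck worker is treated as a pod-level failure.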

Yes, this health check can now return a non-200 response, and that is the only 'breaking' change, but it means Kubernetes has a signal to restart pods once the liveness probe failure threshold is breached.

