
IX-1782: Que health check surfaces unhealthy state if postgres_error on any worker#118

Draft
jcardozagc wants to merge 2 commits into master from ix-1782-healthcheck-checks-workers-for-pg-error

Conversation


@jcardozagc jcardozagc commented Feb 25, 2026

During a recent incident, after a CloudSQL maintenance event, Que worker pods ended up with stale database connections. Workers handle this internally: on a PG::Error, the work loop returns :postgres_error, sleeps for the wake interval, and retries. The health check, however, is a hardcoded lambda that always returns 200, so Kubernetes gets no signal to restart the affected pods, and the worker in the incident ended up perpetually retrying a faulty connection every 5 seconds.
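To make the failure mode concrete, here is a minimal, self-contained sketch (not Que's real internals; Worker, work, and the job block are illustrative stand-ins) of the behaviour described above: a PG::Error inside one work cycle is swallowed and surfaces only as the symbol :postgres_error, which nothing outside the worker currently inspects.

```ruby
# Stand-in for PG::Error so the sketch runs without the pg gem.
module PG; class Error < StandardError; end; end

class Worker
  attr_reader :last_result

  def initialize(&job)
    @job = job
  end

  # One cycle of the work loop: run the job and translate any
  # PG::Error into the :postgres_error result symbol. In the real
  # worker this result only decides how long to sleep before retrying.
  def work
    @last_result =
      begin
        @job.call
        :job_worked
      rescue PG::Error
        :postgres_error
      end
  end
end

healthy_worker = Worker.new { :ok }
broken_worker  = Worker.new { raise PG::Error, "stale connection" }

healthy_worker.work # => :job_worked
broken_worker.work  # => :postgres_error
```

Because the error never escapes the work method, the pod stays Running and the liveness probe keeps passing.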

Let's track the result of each work cycle on the worker itself and expose it through a healthy? predicate (which simply checks whether the worker hit a postgres_error on its last work_loop cycle). The health check endpoint now delegates to a new WorkerHealthCheck that checks every worker in the group and returns 503 if any of them is in an unhealthy state; otherwise it returns the 200 it has always returned.
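A hedged sketch of the proposed shape: the healthy? predicate and WorkerHealthCheck names come from the description above, but the internals (last_result tracking, the Rack-style response triple) are assumptions, not the actual diff.

```ruby
class Worker
  attr_accessor :last_result

  # Unhealthy only when the most recent work_loop cycle
  # ended in :postgres_error.
  def healthy?
    last_result != :postgres_error
  end
end

class WorkerHealthCheck
  def initialize(workers)
    @workers = workers
  end

  # Rack-style response: 503 if any worker in the group is
  # unhealthy, otherwise the 200 the endpoint has always returned.
  def call(_env = nil)
    if @workers.all?(&:healthy?)
      [200, {}, ["OK"]]
    else
      [503, {}, ["Unhealthy: worker in postgres_error state"]]
    end
  end
end

ok    = Worker.new.tap { |w| w.last_result = :job_worked }
stale = Worker.new.tap { |w| w.last_result = :postgres_error }

WorkerHealthCheck.new([ok]).call.first        # => 200
WorkerHealthCheck.new([ok, stale]).call.first # => 503
```

One worker in postgres_error is enough to flip the whole pod to 503, which is the point: the pod's connections are shared state, so any stuck worker is treated as a pod-level failure.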

Yes, this health check can now return a non-200 response, and that is the only 'breaking' change, but it means Kubernetes has a signal to restart pods once the liveness probe failure threshold is breached.

