Proposal
Problem
When a notification receiver or the network path to it is unavailable for a long time, Alertmanager keeps retrying delivery. After connectivity returns, those retries can produce a large burst of stale notifications to webhooks, on-call systems, and other integrations. In practice this can overwhelm the receiver (rate limits, overload) and create noise from alerts that are no longer actionable in the same form.
This is distinct from normal group_wait / group_interval behaviour: it is about prolonged total delivery failure, not steady-state grouping.
Proposal
Add an opt-in mechanism (global configuration) that:
- Tracks the wall-clock time since the first failed delivery attempt for a given notification key.
- After a configurable duration, stops calling the integration for that key (“abandon”) instead of retrying indefinitely.
- While abandoned, suppresses further delivery attempts for the same firing alert set without calling the integration, so recovery does not turn into a storm of HTTP calls.
- Clears or resets state when it is no longer relevant (e.g. firing set changes, group no longer firing), so legitimate new or changed alerts can still be delivered.
Default behaviour when the feature is disabled should remain identical to today.
Configuration (sketch)
Proposed global settings (names can be bikeshedded in the PR):
- abandon_undelivered_notifications: <bool> (enables the feature).
- abandon_undelivered_after: <duration> (time from first failure after which delivery is abandoned; must be positive when enabled).
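The "must be positive when enabled" rule could be enforced at config load time. The following is only a sketch of that check: the struct name `GlobalAbandonConfig` is hypothetical, the fields would presumably live on the existing global config rather than a separate type, and `time.Duration` is used here for simplicity where Alertmanager's config may use a different duration type.

```go
package main

import (
	"errors"
	"time"
)

// GlobalAbandonConfig is a hypothetical holder for the two proposed settings.
type GlobalAbandonConfig struct {
	AbandonUndeliveredNotifications bool          `yaml:"abandon_undelivered_notifications"`
	AbandonUndeliveredAfter         time.Duration `yaml:"abandon_undelivered_after"`
}

// Validate rejects an enabled feature without a positive duration.
// When the feature is disabled the duration is ignored, so existing
// configurations keep loading unchanged.
func (c GlobalAbandonConfig) Validate() error {
	if c.AbandonUndeliveredNotifications && c.AbandonUndeliveredAfter <= 0 {
		return errors.New("abandon_undelivered_after must be positive when abandon_undelivered_notifications is enabled")
	}
	return nil
}
```

Rejecting the configuration up front, rather than silently falling back to a default, keeps the opt-in explicit.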
Observability
New or extended metrics should make it possible to see abandons and suppressions after abandon, e.g. counters along the lines of:
- notifications_abandoned_total
- notifications_abandoned_suppressed_total
Use case
Datacenter or regional outage, receiver maintenance, or long integration outage → Alertmanager cannot deliver → when the path is healthy again, operators want to avoid DDoS-ing their own receivers with a backlog of stale notifications.
I am willing to submit a PR implementing this if the approach is acceptable to maintainers.