Optional abandon of long-undelivered notifications to protect receivers after outages #5134

@runnart

Description

Proposal

Problem

When a notification receiver or the network path to it is unavailable for a long time, Alertmanager keeps retrying delivery. After connectivity returns, those retries can produce a large burst of stale notifications to webhooks, on-call systems, and other integrations. In practice this can overwhelm the receiver (rate limits, overload) and creates noise from alerts that are no longer actionable in the same form.

This is distinct from normal group_wait / group_interval behaviour: it is about prolonged total delivery failure, not steady-state grouping.

Proposal

Add an opt-in mechanism (global configuration) that:

  1. Tracks the wall-clock time since the first failed delivery attempt for a given notification key.
  2. After a configurable duration, stops calling the integration for that key (“abandon”) instead of retrying indefinitely.
  3. While abandoned, suppresses further delivery attempts for the same firing alert set without calling the integration, so recovery does not turn into a storm of HTTP calls.
  4. Clears or resets state when it is no longer relevant (e.g. firing set changes, group no longer firing), so legitimate new or changed alerts can still be delivered.

Default behaviour when the feature is disabled should remain identical to today.

Configuration (sketch)

Proposed global settings (names can be bikeshedded in the PR):

  • abandon_undelivered_notifications: <bool> — enable the feature.
  • abandon_undelivered_after: <duration> — time from first failure after which delivery is abandoned (must be positive when enabled).
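Under these (bikesheddable) names, the global block might look like the following sketch; the field names and placement are assumptions of this proposal, not existing Alertmanager configuration:

```yaml
global:
  # Opt in to abandoning notifications that have failed for too long.
  abandon_undelivered_notifications: true
  # Stop retrying a notification key 6h after its first failed attempt.
  abandon_undelivered_after: 6h
```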

Observability

New or extended metrics should make it possible to see abandons and suppressions after abandon, e.g. counters along the lines of:

  • notifications_abandoned_total
  • notifications_abandoned_suppressed_total
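With counters like these (the final exported names would presumably carry the usual `alertmanager_` prefix), operators could alert on abandons starting, e.g. with a PromQL expression along these lines:

```
# Fires if any notifications were abandoned in the last hour.
increase(alertmanager_notifications_abandoned_total[1h]) > 0
```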

Use case

A datacenter or regional outage, a receiver maintenance window, or a prolonged integration outage leaves Alertmanager unable to deliver; once the path is healthy again, operators want to avoid DDoS-ing their own receivers with a backlog of stale notifications.


I am willing to submit a PR implementing this if the approach is acceptable to maintainers.
