Optional abandon of long-undelivered notifications to protect receivers after outages #5134

@runnart

Description

Proposal

Problem

When a notification receiver or the network path to it is unavailable for a long time, Alertmanager keeps retrying delivery. After connectivity returns, those retries can produce a large burst of stale notifications to webhooks, on-call systems, and other integrations. In practice this can overwhelm the receiver (rate limits, overload) and creates noise from alerts that are no longer actionable in the same form.

This is distinct from normal group_wait / group_interval behaviour: it is about prolonged total delivery failure, not steady-state grouping.

Proposal

Add an opt-in mechanism (global configuration) that:

  1. Tracks the wall-clock time since the first failed delivery attempt for a given notification key.
  2. After a configurable duration, stops calling the integration for that key (“abandon”) instead of retrying indefinitely.
  3. While abandoned, suppresses further delivery attempts for the same firing alert set without calling the integration, so recovery does not turn into a storm of HTTP calls.
  4. Clears or resets state when it is no longer relevant (e.g. firing set changes, group no longer firing), so legitimate new or changed alerts can still be delivered.

Default behaviour when the feature is disabled should remain identical to today.

Configuration (sketch)

Proposed global settings (names can be bikeshedded in the PR):

  • abandon_undelivered_notifications: <bool> — enable the feature.
  • abandon_undelivered_after: <duration> — time from first failure after which delivery is abandoned (must be positive when enabled).
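Under these (bikesheddable) names, the global block might look like the following sketch; the field names and placement are assumptions of this proposal, not existing Alertmanager configuration:

```yaml
global:
  # Opt in to abandoning notifications that have failed for too long.
  abandon_undelivered_notifications: true
  # Stop retrying a notification key 6h after its first failed attempt.
  abandon_undelivered_after: 6h
```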

Observability

New or extended metrics should make it possible to see abandons and suppressions after abandon, e.g. counters along the lines of:

  • notifications_abandoned_total
  • notifications_abandoned_suppressed_total
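With counters like these (the final exported names would presumably carry the usual `alertmanager_` prefix), operators could alert on abandons starting, e.g. with a PromQL expression along these lines:

```
# Fires if any notifications were abandoned in the last hour.
increase(alertmanager_notifications_abandoned_total[1h]) > 0
```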

Use case

A datacenter or regional outage, a receiver maintenance window, or a prolonged integration outage leaves Alertmanager unable to deliver; once the path is healthy again, operators want to avoid DDoS-ing their own receivers with a backlog of stale notifications.


I am willing to submit a PR implementing this if the approach is acceptable to maintainers.
