Commit b01699c

Support multiple teams per alert (#195)
You can now set TEAM to a comma-separated list of teams, and a separate `TEAM:abc` label is applied to the GitHub issue for each team. The TEAM and TEAMS keywords are accepted interchangeably, so you can set `TEAMS=abc,xyc` or `TEAM=abc`.
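
A minimal TypeScript sketch of the behaviour described in this message, assuming the value is split on commas, trimmed, and de-duplicated. The helper names `parseTeamsValue` and `teamLabels` are hypothetical and are not taken from the commit; only the `TEAM:` label form comes from the message.

```typescript
// Hypothetical helpers: the names and the de-duplication step are assumptions,
// not the commit's actual implementation.

// Split a TEAM/TEAMS value such as "abc, xyc" into clean team identifiers.
function parseTeamsValue(raw: string): string[] {
  const teams: string[] = [];
  for (const part of raw.split(",")) {
    const team = part.trim();
    // Skip empty entries (for example from a trailing comma) and repeated teams.
    if (team !== "" && teams.indexOf(team) === -1) {
      teams.push(team);
    }
  }
  return teams;
}

// Build one GitHub issue label per team, in the `TEAM:abc` form.
function teamLabels(teams: string[]): string[] {
  return teams.map((team) => `TEAM:${team}`);
}

console.log(teamLabels(parseTeamsValue("abc,xyc")));
```

Under these assumptions, `TEAMS=abc,xyc` would produce the labels `TEAM:abc` and `TEAM:xyc`, while `TEAM=abc` would produce the single label `TEAM:abc`.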
1 parent 85a90a3 commit b01699c

21 files changed: 734 additions & 126 deletions

OPERATIONS.md

Lines changed: 11 additions & 11 deletions
@@ -14,8 +14,8 @@ We map extra fields using what Grafana calls "Custom annotation name and content
 These must be present on every alert for it to fire:

 They're basically key/value pairs:
-- **`team`** - The owning team identifier (e.g., `pytorch-dev-infra`, `pytorch-benchmarking`)
-- **`priority`** - Alert severity level: `P0`, `P1`, `P2`, or `P3`
+- **`Teams`** - The owning team identifier(s). Supports multiple teams separated by commas (e.g., `pytorch-dev-infra, pytorch-benchmarking`)
+- **`Priority`** - Alert severity level: `P0`, `P1`, `P2`, or `P3`

 ### Optional Annotations

@@ -30,20 +30,20 @@ These fileds will also be populated in the alerts
 ### Configuration

 1. Create your alert rule in Grafana. Give the outputs of the query meaningful names, otherwise Grafana will default to A, B, C
-2. Add the required fields in labels:
+2. Add the required fields in annotations:
 ```
-team = dev-infra
-priority = P1
+Teams = dev-infra, platform
+Priority = P1
 runbook_url = https://wiki.example.com/runbooks/disk-space
 ```
 3. Under "Configure Notification" enable "advanced options". Alerts will now get routed to our Dev and Prod channels

 ### Example Configuration

 ```yaml
-labels:
-  team: "dev-infra"
-  priority: "P1"
+annotations:
+  Teams: "dev-infra, platform"
+  Priority: "P1"
   runbook_url: "https://wiki.pytorch.org/runbooks/disk-space"
 ```

@@ -55,7 +55,7 @@ CloudWatch alerts use the AlarmDescription field to pass metadata in a specific

 These must be present in the AlarmDescription:

-- **`TEAM`** - Owning team identifier
+- **`TEAMS`** - Owning team identifier(s). Supports multiple teams separated by commas
 - **`PRIORITY`** - Priority level (`P0`, `P1`, `P2`, `P3`)

 ### Optional Fields
@@ -68,7 +68,7 @@ The AlarmDescription should contain your alert description, followed by metadata

 ```
 High CPU usage detected on production instances
-TEAM=dev-infra
+TEAMS=dev-infra, platform
 PRIORITY=P1
 RUNBOOK=https://wiki.pytorch.org/runbooks/high-cpu
 ```
@@ -91,7 +91,7 @@ When properly configured alerts fire:

 The resulting GitHub issue includes:
 - Normalized title and description
-- Team and priority labels
+- Multiple team labels (Team:dev-infra, Team:platform, etc.) and priority labels
 - Links to runbooks if provided
 - Debug information from original alert payload
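
The AlarmDescription layout documented in the OPERATIONS.md hunks above (free-text description followed by `KEY=VALUE` metadata lines) could be parsed along these lines. This is a sketch under stated assumptions: the function and interface names are hypothetical, metadata is assumed to sit one entry per line, and `TEAM` is accepted as an alias for `TEAMS` as the commit message describes.

```typescript
// Hypothetical parser sketch; not the collector's actual code.
interface ParsedAlarmDescription {
  description: string; // Free text that precedes the metadata lines
  teams: string[];     // From TEAMS= (or TEAM=), split on commas
  priority?: string;   // From PRIORITY=
  runbook?: string;    // From RUNBOOK=
}

function parseAlarmDescription(text: string): ParsedAlarmDescription {
  const result: ParsedAlarmDescription = { description: "", teams: [] };
  const descriptionLines: string[] = [];

  for (const line of text.split(/\r?\n/)) {
    // Recognise only the metadata keys named in the documentation.
    const match = /^\s*(TEAMS?|PRIORITY|RUNBOOK)\s*=\s*(.*)$/.exec(line);
    if (match === null) {
      descriptionLines.push(line);
      continue;
    }
    const [, key, rawValue = ""] = match;
    const value = rawValue.trim();
    if (key === "TEAM" || key === "TEAMS") {
      result.teams = value
        .split(",")
        .map((team) => team.trim())
        .filter((team) => team !== "");
    } else if (key === "PRIORITY") {
      result.priority = value;
    } else {
      result.runbook = value;
    }
  }

  result.description = descriptionLines.join("\n").trim();
  return result;
}

const sample = [
  "High CPU usage detected on production instances",
  "TEAMS=dev-infra, platform",
  "PRIORITY=P1",
  "RUNBOOK=https://wiki.pytorch.org/runbooks/high-cpu",
].join("\n");

console.log(parseAlarmDescription(sample));
```

For the documented example, this sketch yields the teams `dev-infra` and `platform`, priority `P1`, and the runbook URL, with the first line kept as the description.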

README.md

Lines changed: 24 additions & 14 deletions
@@ -386,7 +386,7 @@ interface AlertEvent {
   reason?: string; // Provider-specific reason/message
   priority: "P0" | "P1" | "P2" | "P3"; // Canonical priority
   occurred_at: string; // ISO8601 timestamp of state change
-  team: string; // Owning team identifier
+  teams: string[]; // Owning team identifiers (supports multiple teams)
   resource: { // Resource information
     type: "runner" | "instance" | "job" | "service" | "generic";
     id?: string; // Resource identifier
@@ -421,7 +421,7 @@ interface AlertEvent {
   "summary": "Critical CPU alert on production web server",
   "priority": "P1",
   "occurred_at": "2024-01-15T10:30:00Z",
-  "team": "platform-team",
+  "teams": ["platform-team"],
   "resource": {
     "type": "instance",
     "id": "i-1234567890abcdef0",
@@ -512,20 +512,25 @@ aws secretsmanager update-secret \

 **CloudWatch Alarms** - Add to AlarmDescription:
 ```
-TEAM=dev-infra | PRIORITY=P1 | RUNBOOK=https://runbook.example.com
 High CPU usage detected on production instances.
+TEAMS=pytorch-dev-infra, pytorch-platform
+PRIORITY=P1
+RUNBOOK=https://runbook.example.com
 ```

-**Grafana Alerts** - Use labels:
+**Grafana Alerts** - Use annotations:
 ```yaml
-labels:
-  team: dev-infra
-  priority: P2
 annotations:
+  Teams: pytorch-dev-infra, pytorch-platform
+  Priority: P2
   runbook_url: https://runbook.example.com
   description: Database connection pool exhausted
 ```

+**Multi-Team Support**: Use comma-separated teams:
+- Single team: `TEAMS=dev-infra` or `Teams: dev-infra`
+- Multiple teams: `TEAMS=dev-infra, platform, security` or `Teams: dev-infra, platform, security`
+
 ## 📋 Operations Guide

 For detailed instructions on configuring new alerts in Grafana and CloudWatch, see [OPERATIONS.md](OPERATIONS.md).
@@ -610,7 +615,7 @@ All logs use structured JSON with correlation IDs:
   "messageId": "12345-abcde",
   "fingerprint": "abc123...",
   "action": "CREATE",
-  "team": "dev-infra",
+  "teams": ["pytorch-dev-infra", "pytorch-platform"],
   "priority": "P1",
   "source": "grafana"
 }
@@ -627,14 +632,19 @@ All logs use structured JSON with correlation IDs:
 4. Look for circuit breaker or rate limiting logs

 **Missing required fields error:**
+
+CloudWatch alerts need TEAMS and PRIORITY in AlarmDescription
 ```bash
-# CloudWatch alerts need TEAM and PRIORITY in AlarmDescription
-TEAM=dev-infra | PRIORITY=P1 | RUNBOOK=https://...
+TEAMS=dev-infra, platform
+PRIORITY=P1
+RUNBOOK=https://...
+```

-# Grafana alerts need team and priority labels
-labels:
-  team: dev-infra
-  priority: P2
+Grafana alerts need Teams and Priority annotations
+```
+annotations:
+  Teams: dev-infra, platform
+  Priority: P2
 ```

 **High DLQ depth:**
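
The "Missing required fields" entry in the README hunk above implies a validation step on incoming Grafana annotations. A hedged sketch of such a check follows. The function name, the case-insensitive key lookup, the `Team` alias for annotations, and the exact error text are all assumptions for illustration; only the `Teams` and `Priority` annotation names and the `P0` to `P3` values come from the diff.

```typescript
// Hypothetical validation sketch; names and error text are assumptions.
type AlertPriority = "P0" | "P1" | "P2" | "P3";
const ALERT_PRIORITIES: readonly AlertPriority[] = ["P0", "P1", "P2", "P3"];

interface RequiredAlertFields {
  teams: string[];
  priority: AlertPriority;
}

function requiredFieldsFromAnnotations(
  annotations: Record<string, string>
): RequiredAlertFields {
  // Index annotations by lower-cased name so `Teams`, `teams`, and `TEAMS` all match.
  const byName: Record<string, string> = {};
  for (const name of Object.keys(annotations)) {
    byName[name.toLowerCase()] = annotations[name] ?? "";
  }

  // Split the comma-separated team list and drop empty entries.
  const rawTeams = byName["teams"] ?? byName["team"] ?? "";
  const teams = rawTeams
    .split(",")
    .map((team) => team.trim())
    .filter((team) => team !== "");

  // Accept only the four canonical priority values.
  const rawPriority = (byName["priority"] ?? "").trim();
  const priority = ALERT_PRIORITIES.find((p) => p === rawPriority);

  const missing: string[] = [];
  if (teams.length === 0) missing.push("Teams");
  if (priority === undefined) missing.push("Priority");
  if (priority === undefined || teams.length === 0) {
    throw new Error(`Missing required fields: ${missing.join(", ")}`);
  }
  return { teams, priority };
}

console.log(
  requiredFieldsFromAnnotations({ Teams: "dev-infra, platform", Priority: "P2" })
);
```

Reporting every missing field in one error, rather than failing on the first, is a design choice of this sketch that makes a misconfigured alert fixable in a single pass.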

lambdas/collector/__tests__/basic-functionality.test.ts

Lines changed: 2 additions & 2 deletions
@@ -31,7 +31,7 @@ describe("Basic Functionality Tests", () => {
     expect(result.source).toBe("grafana");
     expect(result.state).toBe("FIRING");
     expect(result.title).toBe("Test Alert");
-    expect(result.team).toBe("dev-infra");
+    expect(result.teams).toEqual(["dev-infra"]);
     expect(result.priority).toBe("P1");
   });

@@ -63,7 +63,7 @@ describe("Basic Functionality Tests", () => {
     expect(result.source).toBe("cloudwatch");
     expect(result.state).toBe("FIRING");
     expect(result.title).toBe("High CPU Usage");
-    expect(result.team).toBe("platform");
+    expect(result.teams).toEqual(["platform"]);
     expect(result.priority).toBe("P2");
   });
