Merged
22 changes: 11 additions & 11 deletions OPERATIONS.md
@@ -14,8 +14,8 @@ We map extra fields using what Grafana calls "Custom annotation name and content
These must be present on every alert for it to fire:

They're basically key/value pairs:
- **`team`** - The owning team identifier (e.g., `pytorch-dev-infra`, `pytorch-benchmarking`)
- **`priority`** - Alert severity level: `P0`, `P1`, `P2`, or `P3`
- **`Teams`** - The owning team identifier(s). Supports multiple teams separated by commas (e.g., `pytorch-dev-infra, pytorch-benchmarking`)
- **`Priority`** - Alert severity level: `P0`, `P1`, `P2`, or `P3`

### Optional Annotations

@@ -30,20 +30,20 @@ These fileds will also be populated in the alerts
### Configuration

1. Create your alert rule in Grafana. Give the outputs of the query meaningful names, otherwise Grafana will default to A, B, C
2. Add the required fields in labels:
2. Add the required fields in annotations:
```
team = dev-infra
priority = P1
Teams = dev-infra, platform
Priority = P1
runbook_url = https://wiki.example.com/runbooks/disk-space
```
3. Under "Configure Notification" enable "advanced options". Alerts will now get routed to our Dev and Prod channels

### Example Configuration

```yaml
labels:
  team: "dev-infra"
  priority: "P1"
annotations:
  Teams: "dev-infra, platform"
  Priority: "P1"
  runbook_url: "https://wiki.pytorch.org/runbooks/disk-space"
```
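
As an illustration of how a collector might read these annotations, here is a minimal TypeScript sketch. It assumes the annotations arrive as a flat string map; the function name `extractGrafanaMetadata`, the `GrafanaMetadata` shape, and the error message are hypothetical and are not taken from this PR.

```typescript
type Priority = "P0" | "P1" | "P2" | "P3";

// Hypothetical result shape, for illustration only.
interface GrafanaMetadata {
  teams: string[];
  priority: Priority;
  runbookUrl?: string;
}

function isPriority(value: string): value is Priority {
  return ["P0", "P1", "P2", "P3"].indexOf(value) !== -1;
}

// Reads the required `Teams` and `Priority` annotations and the optional
// `runbook_url`. Throws if a required annotation is missing or invalid.
function extractGrafanaMetadata(
  annotations: Record<string, string | undefined>
): GrafanaMetadata {
  // `Teams` is a comma-separated list; drop empty entries and stray spaces.
  const teams = (annotations["Teams"] ?? "")
    .split(",")
    .map((team) => team.trim())
    .filter((team) => team.length > 0);
  const priority = (annotations["Priority"] ?? "").trim();

  const missing: string[] = [];
  if (teams.length === 0) missing.push("Teams");
  if (!isPriority(priority)) missing.push("Priority");
  if (missing.length > 0) {
    throw new Error(
      `Missing or invalid required annotations: ${missing.join(", ")}`
    );
  }

  return {
    teams,
    priority: priority as Priority,
    runbookUrl: annotations["runbook_url"],
  };
}
```

With the example configuration above, this sketch would yield `teams` of `["dev-infra", "platform"]` and a `priority` of `"P1"`.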

@@ -55,7 +55,7 @@ CloudWatch alerts use the AlarmDescription field to pass metadata in a specific

These must be present in the AlarmDescription:

- **`TEAM`** - Owning team identifier
- **`TEAMS`** - Owning team identifier(s). Supports multiple teams separated by commas
- **`PRIORITY`** - Priority level (`P0`, `P1`, `P2`, `P3`)

### Optional Fields
@@ -68,7 +68,7 @@ The AlarmDescription should contain your alert description, followed by metadata

```
High CPU usage detected on production instances
TEAM=dev-infra
TEAMS=dev-infra, platform
PRIORITY=P1
RUNBOOK=https://wiki.pytorch.org/runbooks/high-cpu
```
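
The AlarmDescription layout shown here (free-text description followed by `KEY=value` lines) could be parsed along the following lines. This is a sketch under stated assumptions, not the collector's actual parser: the name `parseAlarmDescription`, the `AlarmMetadata` shape, and the choice to ignore unrecognized priorities are all hypothetical.

```typescript
type Priority = "P0" | "P1" | "P2" | "P3";

// Hypothetical result shape, for illustration only.
interface AlarmMetadata {
  description: string;
  teams: string[];
  priority?: Priority;
  runbook?: string;
}

const PRIORITIES: readonly string[] = ["P0", "P1", "P2", "P3"];

function parseAlarmDescription(text: string): AlarmMetadata {
  const result: AlarmMetadata = { description: "", teams: [] };
  const descriptionLines: string[] = [];

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    const match = /^(TEAMS|PRIORITY|RUNBOOK)\s*=\s*(.*)$/.exec(line);
    if (!match) {
      // Anything that is not a known KEY=value line is description text.
      if (line) descriptionLines.push(line);
      continue;
    }
    const [, key, value] = match;
    if (key === "TEAMS") {
      // Comma-separated list; drop empty entries and stray spaces.
      result.teams = value
        .split(",")
        .map((team) => team.trim())
        .filter((team) => team.length > 0);
    } else if (key === "PRIORITY") {
      // Assumption: values outside P0..P3 are ignored rather than rejected.
      if (PRIORITIES.indexOf(value) !== -1) result.priority = value as Priority;
    } else {
      result.runbook = value;
    }
  }

  result.description = descriptionLines.join("\n");
  return result;
}
```

A caller would still need to check that `teams` is non-empty and `priority` is set, since both fields are required.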
@@ -91,7 +91,7 @@ When properly configured alerts fire:

The resulting GitHub issue includes:
- Normalized title and description
- Team and priority labels
- Multiple team labels (Team:dev-infra, Team:platform, etc.) and priority labels
- Links to runbooks if provided
- Debug information from original alert payload
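
To make the multi-team labelling concrete, here is a small TypeScript sketch of how labels might be derived from an alert's `teams` and `priority`. The `Team:<name>` format follows the examples in the list above; the `Priority:<level>` format and the helper name `buildIssueLabels` are assumptions made for illustration.

```typescript
// Hypothetical helper: one `Team:` label per owning team plus one priority label.
function buildIssueLabels(teams: string[], priority: string): string[] {
  const labels = teams
    .map((team) => team.trim())
    .filter((team) => team.length > 0)
    .map((team) => `Team:${team}`);

  // Assumption: the priority label uses a `Priority:` prefix.
  labels.push(`Priority:${priority}`);

  // Drop duplicates while keeping first-seen order.
  return labels.filter((label, index) => labels.indexOf(label) === index);
}
```

Deduplicating keeps a team that is listed twice in the alert metadata from producing a repeated label.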

38 changes: 24 additions & 14 deletions README.md
@@ -386,7 +386,7 @@ interface AlertEvent {
reason?: string; // Provider-specific reason/message
priority: "P0" | "P1" | "P2" | "P3"; // Canonical priority
occurred_at: string; // ISO8601 timestamp of state change
team: string; // Owning team identifier
teams: string[]; // Owning team identifiers (supports multiple teams)
resource: { // Resource information
type: "runner" | "instance" | "job" | "service" | "generic";
id?: string; // Resource identifier
@@ -421,7 +421,7 @@ interface AlertEvent {
"summary": "Critical CPU alert on production web server",
"priority": "P1",
"occurred_at": "2024-01-15T10:30:00Z",
"team": "platform-team",
"teams": ["platform-team"],
"resource": {
"type": "instance",
"id": "i-1234567890abcdef0",
@@ -512,20 +512,25 @@ aws secretsmanager update-secret \

**CloudWatch Alarms** - Add to AlarmDescription:
```
TEAM=dev-infra | PRIORITY=P1 | RUNBOOK=https://runbook.example.com
High CPU usage detected on production instances.
TEAMS=pytorch-dev-infra, pytorch-platform
PRIORITY=P1
RUNBOOK=https://runbook.example.com
```

**Grafana Alerts** - Use labels:
**Grafana Alerts** - Use annotations:
```yaml
labels:
  team: dev-infra
  priority: P2
annotations:
  Teams: pytorch-dev-infra, pytorch-platform
  Priority: P2
  runbook_url: https://runbook.example.com
  description: Database connection pool exhausted
```

**Multi-Team Support**: Use comma-separated teams:
- Single team: `TEAMS=dev-infra` or `Teams: dev-infra`
- Multiple teams: `TEAMS=dev-infra, platform, security` or `Teams: dev-infra, platform, security`

## 📋 Operations Guide

For detailed instructions on configuring new alerts in Grafana and CloudWatch, see [OPERATIONS.md](OPERATIONS.md).
@@ -610,7 +615,7 @@ All logs use structured JSON with correlation IDs:
"messageId": "12345-abcde",
"fingerprint": "abc123...",
"action": "CREATE",
"team": "dev-infra",
"teams": ["pytorch-dev-infra", "pytorch-platform"],
"priority": "P1",
"source": "grafana"
}
@@ -627,14 +632,19 @@ All logs use structured JSON with correlation IDs:
4. Look for circuit breaker or rate limiting logs

**Missing required fields error:**

CloudWatch alerts need TEAMS and PRIORITY in AlarmDescription
```bash
# CloudWatch alerts need TEAM and PRIORITY in AlarmDescription
TEAM=dev-infra | PRIORITY=P1 | RUNBOOK=https://...
TEAMS=dev-infra, platform
PRIORITY=P1
RUNBOOK=https://...
```

# Grafana alerts need team and priority labels
labels:
  team: dev-infra
  priority: P2
Grafana alerts need Teams and Priority annotations
```
annotations:
  Teams: dev-infra, platform
  Priority: P2
```

**High DLQ depth:**
4 changes: 2 additions & 2 deletions lambdas/collector/__tests__/basic-functionality.test.ts
@@ -31,7 +31,7 @@ describe("Basic Functionality Tests", () => {
expect(result.source).toBe("grafana");
expect(result.state).toBe("FIRING");
expect(result.title).toBe("Test Alert");
expect(result.team).toBe("dev-infra");
expect(result.teams).toEqual(["dev-infra"]);
expect(result.priority).toBe("P1");
});

@@ -63,7 +63,7 @@ describe("Basic Functionality Tests", () => {
expect(result.source).toBe("cloudwatch");
expect(result.state).toBe("FIRING");
expect(result.title).toBe("High CPU Usage");
expect(result.team).toBe("platform");
expect(result.teams).toEqual(["platform"]);
expect(result.priority).toBe("P2");
});
