Commit 0e8ce7e

Implement alerts per bundle and remove core alerts
1 parent 963e3f9 commit 0e8ce7e

17 files changed: +283 -361 lines changed

Tiltfile

Lines changed: 0 additions & 3 deletions
@@ -51,17 +51,14 @@ dep_charts = {
         ('dist/chart', 'cortex'),
     ],
     'cortex-nova': [
-        ('helm/library/cortex-alerts', 'cortex-alerts'),
         ('helm/library/cortex-postgres', 'cortex-postgres'),
         ('dist/chart', 'cortex'),
     ],
     'cortex-manila': [
-        ('helm/library/cortex-alerts', 'cortex-alerts'),
         ('helm/library/cortex-postgres', 'cortex-postgres'),
         ('dist/chart', 'cortex'),
     ],
     'cortex-cinder': [
-        ('helm/library/cortex-alerts', 'cortex-alerts'),
         ('helm/library/cortex-postgres', 'cortex-postgres'),
         ('dist/chart', 'cortex'),
     ],
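
The effect of this hunk can be sketched in plain Python (the Tiltfile itself is Starlark). The dictionary below mirrors only the three bundle entries visible in the hunk, in their state after the commit; entries outside the hunk are omitted, and `bundles_depending_on` is a hypothetical helper used for illustration, not something the repository defines.

```python
# Mirror of the three bundle entries visible in the hunk, after this commit.
# Entries outside the hunk are intentionally left out.
dep_charts = {
    "cortex-nova": [
        ("helm/library/cortex-postgres", "cortex-postgres"),
        ("dist/chart", "cortex"),
    ],
    "cortex-manila": [
        ("helm/library/cortex-postgres", "cortex-postgres"),
        ("dist/chart", "cortex"),
    ],
    "cortex-cinder": [
        ("helm/library/cortex-postgres", "cortex-postgres"),
        ("dist/chart", "cortex"),
    ],
}


def bundles_depending_on(charts, chart_name):
    """Return the sorted bundle names whose dependency list includes chart_name.

    Hypothetical helper: each dependency is a (path, name) tuple as in the hunk.
    """
    return sorted(
        bundle
        for bundle, deps in charts.items()
        if any(name == chart_name for _path, name in deps)
    )


# No bundle in the hunk references the removed library chart any more.
print(bundles_depending_on(dep_charts, "cortex-alerts"))
# All three still pull in the postgres library chart.
print(bundles_depending_on(dep_charts, "cortex-postgres"))
```

A check like this could guard against a stale `cortex-alerts` reference being reintroduced, assuming the dependency map keeps the same tuple shape.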

helm/README.md

Lines changed: 6 additions & 2 deletions
@@ -25,7 +25,6 @@ helm/
 │   ├── cortex-ironcore/            # IronCore scheduling domain
 │   └── cortex-crds/                # CRDs for all operators
 ├── library/                        # Shared library charts
-│   ├── cortex-alerts/              # Common alerting infrastructure
 │   └── cortex-postgres/            # PostgreSQL database
 ├── dev/                            # Development-only charts
 │   └── cortex-prometheus-operator/ # Local monitoring stack
@@ -39,6 +38,7 @@ helm/
 Bundle charts are **umbrella charts** that represent complete deployments for specific scheduling domains. They aggregate operator charts and library charts into deployable units.
 
 **Available bundles:**
+
 - `cortex-nova` - Nova compute scheduling domain
 - `cortex-cinder` - Cinder block storage scheduling domain
 - `cortex-manila` - Manila shared filesystem scheduling domain
@@ -54,10 +54,11 @@ The operator chart contains the core Kubernetes operators built from the Go modu
 Library charts provide **shared, reusable components** that are consumed by bundle charts as dependencies.
 
 **Available library charts:**
-- `cortex-alerts` - Common alerting infrastructure and templates
+
 - `cortex-postgres` - PostgreSQL database deployment with monitoring
 
 **Integration with bundles:**
+
 - Library charts are **included as dependencies** in bundle Chart.yaml files
 - Provide common infrastructure components used across multiple domains
 - Reduce duplication of common services like databases and monitoring
@@ -68,15 +69,18 @@ Library charts provide **shared, reusable components** that are consumed by bund
 Dev charts support **local development and testing** but are not included in production releases.
 
 **Available dev charts:**
+
 - `cortex-prometheus-operator` - Prometheus operator setup for local development
 
 ## Usage Patterns
 
 ### Production Deployment
+
 1. Deploy CRDs first: `helm install cortex-crds bundles/cortex-crds/`
 2. Deploy domain-specific bundle: `helm install cortex-nova bundles/cortex-nova/`
 
 ### Development Setup
+
 1. Deploy monitoring: `helm install prometheus dev/cortex-prometheus-operator/`
 2. Deploy CRDs: `helm install cortex-crds bundles/cortex-crds/`
 3. Deploy and test bundles: `helm install cortex-nova bundles/cortex-nova/`

helm/bundles/cortex-cinder/Chart.yaml

Lines changed: 0 additions & 4 deletions
@@ -8,10 +8,6 @@ type: application
 version: 0.0.10
 appVersion: 0.1.0
 dependencies:
-  # from: file://../../library/cortex-alerts
-  - name: cortex-alerts
-    repository: oci://ghcr.io/cobaltcore-dev/cortex/charts
-    version: 0.0.1
   # from: file://../../library/cortex-postgres
   - name: cortex-postgres
     repository: oci://ghcr.io/cobaltcore-dev/cortex/charts
Lines changed: 100 additions & 6 deletions
@@ -1,10 +1,10 @@
 groups:
 - name: cortex-cinder-alerts
   rules:
-  - alert: CortexCinderInitialPlacementDown
+  - alert: CortexCinderSchedulingDown
     expr: |
-      up{component="cortex-cinder-scheduler", namespace="cortex-cinder"} != 1 or
-      absent(up{component="cortex-cinder-scheduler", namespace="cortex-cinder"})
+      up{pod=~"cortex-cinder-scheduling-.*"} != 1 or
+      absent(up{pod=~"cortex-cinder-scheduling-.*"})
     for: 5m
     labels:
       context: liveness
@@ -14,8 +14,102 @@ groups:
       support_group: workload-management
       playbook: docs/support/playbook/cortex/down
     annotations:
-      summary: "Cortex initial placement for Cinder is down"
+      summary: "Cortex Scheduling for Cinder is down"
       description: >
-        The Cortex initial placement is down. Initial placement requests from Cinder will
+        The Cortex scheduling service is down. Scheduling requests from Cinder will
         not be served. This is no immediate problem, since Cinder will continue
-        placing new volumes. However, the placement will be less desirable.
+        placing new VMs. However, the placement will be less desirable.
+  - alert: CortexCinderKnowledgeDown
+    expr: |
+      up{pod=~"cortex-cinder-knowledge-.*"} != 1 or
+      absent(up{pod=~"cortex-cinder-knowledge-.*"})
+    for: 5m
+    labels:
+      context: liveness
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+      playbook: docs/support/playbook/cortex/down
+    annotations:
+      summary: "Cortex Knowledge for Cinder is down"
+      description: >
+        The Cortex Knowledge service is down. This is no immediate problem,
+        since cortex is still able to process requests,
+        but the quality of the responses may be affected.
+  - alert: CortexCinderHttpRequest400sTooHigh
+    expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-cinder-metrics", status=~"4.+"}[5m]) > 0.1
+    for: 5m
+    labels:
+      context: api
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "Cinder Scheduler HTTP request 400 errors too high"
+      description: >
+        Cinder Scheduler is responding to placement requests with HTTP 4xx
+        errors. This is expected when the scheduling request cannot be served
+        by Cortex. However, it could also indicate that the request format has
+        changed and Cortex is unable to parse it.
+  - alert: CortexCinderSchedulingHttpRequest500sTooHigh
+    expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-cinder-metrics", status=~"5.+" }[5m]) > 0.1
+    for: 5m
+    labels:
+      context: api
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "Cinder Scheduler HTTP request 500 errors too high"
+      description: >
+        Cinder Scheduler is responding to placement requests with HTTP 5xx errors.
+        This is not expected and indicates that Cortex is having some internal problem.
+        Cinder will continue to place new VMs, but the placement will be less desirable.
+        Thus, no immediate action is needed.
+  - alert: CortexCinderHighMemoryUsage
+    expr: process_resident_memory_bytes{service="cortex-cinder-metrics"} > 6000 * 1024 * 1024
+    for: 5m
+    labels:
+      context: memory
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "`{{$labels.component}}` uses too much memory"
+      description: >
+        `{{$labels.component}}` should not be using more than 6000 MiB of memory. Usually it
+        should use much less, so there may be a memory leak or other changes
+        that are causing the memory usage to increase significantly.
+  - alert: CortexCinderHighCPUUsage
+    expr: rate(process_cpu_seconds_total{service="cortex-cinder-metrics"}[1m]) > 0.5
+    for: 5m
+    labels:
+      context: cpu
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "`{{$labels.component}}` uses too much CPU"
+      description: >
+        `{{$labels.component}}` should not be using more than 50% of a single CPU core. Usually
+        it should use much less, so there may be a CPU leak or other changes
+        that are causing the CPU usage to increase significantly.
+  - alert: CortexCinderTooManyDBConnectionAttempts
+    expr: rate(cortex_db_connection_attempts_total{service="cortex-cinder-metrics"}[5m]) > 0.1
+    for: 5m
+    labels:
+      context: db
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "`{{$labels.component}}` is trying to connect to the database too often"
+      description: >
+        `{{$labels.component}}` is trying to connect to the database too often. This may happen
+        when the database is down or the connection parameters are misconfigured.
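
The liveness rules above now select targets by pod name instead of by `component` and `namespace`. A rough way to see which pods such a selector would cover is to emulate PromQL's `=~` matcher, which is fully anchored, with Python's `re.fullmatch`. This is only a sketch: Prometheus uses RE2 syntax rather than Python's `re` (the two agree for these simple patterns), and the pod names below are hypothetical examples, not names taken from the deployment.

```python
import re

# Patterns copied from the `up{pod=~"..."}` selectors in the rules above.
SCHEDULING_PODS = r"cortex-cinder-scheduling-.*"
KNOWLEDGE_PODS = r"cortex-cinder-knowledge-.*"


def promql_regex_match(pattern: str, value: str) -> bool:
    """Emulate PromQL's fully anchored `=~` matcher using re.fullmatch."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical pod names, for illustration only.
pods = [
    "cortex-cinder-scheduling-7d9f8-abcde",
    "cortex-cinder-knowledge-0",
    "cortex-nova-scheduling-5c6b7-xyz12",
    "cortex-cinder-postgresql-0",
]

for pod in pods:
    print(
        pod,
        "scheduling:", promql_regex_match(SCHEDULING_PODS, pod),
        "knowledge:", promql_regex_match(KNOWLEDGE_PODS, pod),
    )
```

Because the new selectors carry no `namespace` label, they would match any pod with the given name prefix in any namespace that Prometheus scrapes; whether that matters depends on how the bundles are deployed.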

helm/bundles/cortex-cinder/values.yaml

Lines changed: 0 additions & 6 deletions
@@ -111,9 +111,3 @@ cortex-knowledge-controllers:
 # Custom configuration for the cortex postgres chart.
 cortex-postgres:
   fullnameOverride: cortex-cinder-postgresql
-
-# Custom configuration for the cortex core chart.
-cortex-alerts:
-  fullnameOverride: cortex-cinder
-  alerts:
-    componentPrefix: cortex-cinder

helm/bundles/cortex-manila/Chart.yaml

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -8,10 +8,6 @@ type: application
88
version: 0.0.10
99
appVersion: 0.1.0
1010
dependencies:
11-
# from: file://../../library/cortex-alerts
12-
- name: cortex-alerts
13-
repository: oci://ghcr.io/cobaltcore-dev/cortex/charts
14-
version: 0.0.1
1511
# from: file://../../library/cortex-postgres
1612
- name: cortex-postgres
1713
repository: oci://ghcr.io/cobaltcore-dev/cortex/charts
Lines changed: 100 additions & 7 deletions
@@ -1,10 +1,10 @@
 groups:
 - name: cortex-manila-alerts
   rules:
-  - alert: CortexManilaInitialPlacementDown
+  - alert: CortexManilaSchedulingDown
     expr: |
-      up{component="cortex-manila-scheduler", namespace="cortex-manila"} != 1 or
-      absent(up{component="cortex-manila-scheduler", namespace="cortex-manila"})
+      up{pod=~"cortex-manila-scheduling-.*"} != 1 or
+      absent(up{pod=~"cortex-manila-scheduling-.*"})
     for: 5m
     labels:
       context: liveness
@@ -14,9 +14,102 @@ groups:
       support_group: workload-management
       playbook: docs/support/playbook/cortex/down
     annotations:
-      summary: "Cortex initial placement for Manila is down"
+      summary: "Cortex Scheduling for Manila is down"
       description: >
-        The Cortex initial placement is down. Initial placement requests from Manila will
+        The Cortex scheduling service is down. Scheduling requests from Manila will
         not be served. This is no immediate problem, since Manila will continue
-        placing new shares. However, the placement will be less desirable.
-
+        placing new VMs. However, the placement will be less desirable.
+  - alert: CortexManilaKnowledgeDown
+    expr: |
+      up{pod=~"cortex-manila-knowledge-.*"} != 1 or
+      absent(up{pod=~"cortex-manila-knowledge-.*"})
+    for: 5m
+    labels:
+      context: liveness
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+      playbook: docs/support/playbook/cortex/down
+    annotations:
+      summary: "Cortex Knowledge for Manila is down"
+      description: >
+        The Cortex Knowledge service is down. This is no immediate problem,
+        since cortex is still able to process requests,
+        but the quality of the responses may be affected.
+  - alert: CortexManilaHttpRequest400sTooHigh
+    expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-manila-metrics", status=~"4.+"}[5m]) > 0.1
+    for: 5m
+    labels:
+      context: api
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "Manila Scheduler HTTP request 400 errors too high"
+      description: >
+        Manila Scheduler is responding to placement requests with HTTP 4xx
+        errors. This is expected when the scheduling request cannot be served
+        by Cortex. However, it could also indicate that the request format has
+        changed and Cortex is unable to parse it.
+  - alert: CortexManilaSchedulingHttpRequest500sTooHigh
+    expr: rate(cortex_scheduler_api_request_duration_seconds_count{service="cortex-manila-metrics", status=~"5.+" }[5m]) > 0.1
+    for: 5m
+    labels:
+      context: api
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "Manila Scheduler HTTP request 500 errors too high"
+      description: >
+        Manila Scheduler is responding to placement requests with HTTP 5xx errors.
+        This is not expected and indicates that Cortex is having some internal problem.
+        Manila will continue to place new VMs, but the placement will be less desirable.
+        Thus, no immediate action is needed.
+  - alert: CortexManilaHighMemoryUsage
+    expr: process_resident_memory_bytes{service="cortex-manila-metrics"} > 6000 * 1024 * 1024
+    for: 5m
+    labels:
+      context: memory
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "`{{$labels.component}}` uses too much memory"
+      description: >
+        `{{$labels.component}}` should not be using more than 6000 MiB of memory. Usually it
+        should use much less, so there may be a memory leak or other changes
+        that are causing the memory usage to increase significantly.
+  - alert: CortexManilaHighCPUUsage
+    expr: rate(process_cpu_seconds_total{service="cortex-manila-metrics"}[1m]) > 0.5
+    for: 5m
+    labels:
+      context: cpu
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "`{{$labels.component}}` uses too much CPU"
+      description: >
+        `{{$labels.component}}` should not be using more than 50% of a single CPU core. Usually
+        it should use much less, so there may be a CPU leak or other changes
+        that are causing the CPU usage to increase significantly.
+  - alert: CortexManilaTooManyDBConnectionAttempts
+    expr: rate(cortex_db_connection_attempts_total{service="cortex-manila-metrics"}[5m]) > 0.1
+    for: 5m
+    labels:
+      context: db
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "`{{$labels.component}}` is trying to connect to the database too often"
+      description: >
+        `{{$labels.component}}` is trying to connect to the database too often. This may happen
+        when the database is down or the connection parameters are misconfigured.
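
To make the numeric thresholds in these rules easier to read, the arithmetic can be worked through in Python. This is a sketch under simple assumptions: `rate(...[5m]) > 0.1` is read as an average of more than 0.1 events per second over the window (ignoring Prometheus' extrapolation at window edges), and the `status=~` patterns are emulated with `re.fullmatch` because PromQL regex matchers are fully anchored. The constant and function names are made up for this example.

```python
import re
from fractions import Fraction

# Thresholds copied from the rule expressions above.
MEMORY_LIMIT_BYTES = 6000 * 1024 * 1024       # "6000 MiB" in the description
RATE_THRESHOLD_PER_SECOND = Fraction(1, 10)   # "> 0.1" on rate(...[5m])
WINDOW_SECONDS = 5 * 60                       # the [5m] range
CPU_THRESHOLD_CORES = Fraction(1, 2)          # "> 0.5" on rate(process_cpu_seconds_total[1m])


def events_per_window(rate_per_second, window_seconds):
    """Number of events in the window that corresponds to a per-second rate."""
    return rate_per_second * window_seconds


def status_matches(pattern: str, status: str) -> bool:
    """Emulate the fully anchored PromQL `=~` matcher for a status label."""
    return re.fullmatch(pattern, status) is not None


print(MEMORY_LIMIT_BYTES)                                            # 6291456000
print(events_per_window(RATE_THRESHOLD_PER_SECOND, WINDOW_SECONDS))  # 30
print(float(CPU_THRESHOLD_CORES) * 100)                              # 50.0

for status in ["400", "404", "500", "503", "200", "4"]:
    print(status, status_matches(r"4.+", status), status_matches(r"5.+", status))
```

Read this way, the HTTP and database alerts would need a sustained average of more than roughly 30 matching events per five minutes, held for the additional `for: 5m` period, before firing. The patterns `4.+` and `5.+` require at least one character after the leading digit, so a bare `4` would not match.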

helm/bundles/cortex-manila/values.yaml

Lines changed: 0 additions & 6 deletions
@@ -111,9 +111,3 @@ cortex-knowledge-controllers:
 # Custom configuration for the cortex postgres chart.
 cortex-postgres:
   fullnameOverride: cortex-manila-postgresql
-
-# Custom configuration for the cortex core chart.
-cortex-alerts:
-  fullnameOverride: cortex-manila
-  alerts:
-    componentPrefix: cortex-manila

helm/bundles/cortex-nova/Chart.yaml

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -8,10 +8,6 @@ type: application
88
version: 0.0.10
99
appVersion: 0.1.0
1010
dependencies:
11-
# from: file://../../library/cortex-alerts
12-
- name: cortex-alerts
13-
repository: oci://ghcr.io/cobaltcore-dev/cortex/charts
14-
version: 0.0.1
1511
# from: file://../../library/cortex-postgres
1612
- name: cortex-postgres
1713
repository: oci://ghcr.io/cobaltcore-dev/cortex/charts
