Commit 045c77d

Update Prometheus rule file
1 parent 9035e76 commit 045c77d

1 file changed

+111
-119
lines changed

observability/prometheus/rule-file.yml

Lines changed: 111 additions & 119 deletions
@@ -22,12 +22,38 @@ groups:
  severity: warning
  - name: rabbitmq
  rules:
+ - alert: InsufficientEstablishedErlangDistributionLinks
+ # erlang_vm_dist_node_state: 1=pending, 2=up_pending, 3=up
+ expr: |
+ count by (namespace, rabbitmq_cluster) (erlang_vm_dist_node_state * on(instance) group_left(rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info) == 3)
+ <
+ count by (namespace, rabbitmq_cluster) (rabbitmq_build_info * on(instance) group_left(rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info))
+ *
+ (count by (namespace, rabbitmq_cluster) (rabbitmq_build_info * on(instance) group_left(rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)) -1 )
+ for: 10m
+ annotations:
+ description: |
+ There are only `{{ $value }}` established Erlang distribution links
+ in RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}` in namespace `{{ $labels.namespace }}`.
+ summary: |
+ RabbitMQ clusters have a full mesh topology.
+ All RabbitMQ nodes connect to all other RabbitMQ nodes in both directions.
+ The expected number of established Erlang distribution links is therefore `n*(n-1)` where `n` is the number of RabbitMQ nodes in the cluster.
+ Therefore, the expected number of distribution links are `0` for a 1-node cluster, `6` for a 3-node cluster, and `20` for a 5-node cluster.
+ This alert reports that the number of established distributions links is less than the expected number.
+ Some reasons for this alert include failed network links, network partitions, failed clustering (i.e. nodes can't join the cluster).
+ Check the panels `All distribution links`, `Established distribution links`, `Connecting distributions links`, `Waiting distribution links`, and `distribution links`
+ of the Grafana dashboard `Erlang-Distribution`.
+ Check the logs of the RabbitMQ nodes: `kubectl -n {{ $labels.namespace }} logs -l app.kubernetes.io/component=rabbitmq,app.kubernetes.io/name={{ $labels.rabbitmq_cluster }}`
+ labels:
+ rulesgroup: rabbitmq
+ severity: warning
  - alert: UnroutableMessages
  expr: |
- sum by(namespace, rabbitmq_cluster) (increase(rabbitmq_channel_messages_unroutable_dropped_total[5m]) * on(instance) group_left(rabbitmq_cluster) rabbitmq_identity_info)
+ sum by(namespace, rabbitmq_cluster) (increase(rabbitmq_channel_messages_unroutable_dropped_total[5m]) * on (instance) group_left(rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info))
  >= 1
  or
- sum by(namespace, rabbitmq_cluster) (increase(rabbitmq_channel_messages_unroutable_returned_total[5m]) * on(instance) group_left(rabbitmq_cluster) rabbitmq_identity_info)
+ sum by(namespace, rabbitmq_cluster) (increase(rabbitmq_channel_messages_unroutable_returned_total[5m]) * on (instance) group_left(rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info))
  >= 1
  annotations:
  description: |
@@ -66,118 +92,11 @@ groups:
  rabbitmq_cluster: '{{ $labels.label_app_kubernetes_io_name }}'
  rulesgroup: rabbitmq
  severity: warning
- - alert: ContainerRestarts
- expr: |
- increase(kube_pod_container_status_restarts_total[10m]) * on(namespace, pod, container) group_left(rabbitmq_cluster) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)
- >=
- 1
- for: 5m
- annotations:
- description: |
- Over the last 10 minutes, container `{{ $labels.container }}`
- restarted `{{ $value | printf "%.0f" }}` times in pod `{{ $labels.pod }}` of RabbitMQ cluster
- `{{ $labels.rabbitmq_cluster }}` in namespace `{{ $labels.namespace }}`.
- summary: |
- Investigate why the container got restarted.
- Check the logs of the current container: `kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }}`
- Check the logs of the previous container: `kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }} --previous`
- Check the last state of the container: `kubectl -n {{ $labels.namespace }} get pod {{ $labels.pod }} -o jsonpath='{.status.containerStatuses[].lastState}'`
- labels:
- rabbitmq_cluster: '{{ $labels.rabbitmq_cluster }}'
- rulesgroup: rabbitmq
- severity: warning
- - alert: FileDescriptorsNearLimit
- expr: |
- sum by(namespace, rabbitmq_cluster, pod, rabbitmq_node) (max_over_time(rabbitmq_process_open_fds[5m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node, pod) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster))
- /
- sum by(namespace, rabbitmq_cluster, pod, rabbitmq_node) (rabbitmq_process_max_fds * on(instance) group_left(rabbitmq_cluster, rabbitmq_node, pod) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster))
- > 0.8
- for: 10m
- annotations:
- description: |
- `{{ $value | humanizePercentage }}` file descriptors of file
- descriptor limit are used in RabbitMQ node `{{ $labels.rabbitmq_node }}`,
- pod `{{ $labels.pod }}`, RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}`,
- namespace `{{ $labels.namespace }}`.
- summary: |
- More than 80% of file descriptors are used on the RabbitMQ node.
- When this value reaches 100%, new connections will not be accepted and disk write operations may fail.
- Client libraries, peer nodes and CLI tools will not be able to connect when the node runs out of available file descriptors.
- See https://www.rabbitmq.com/production-checklist.html#resource-limits-file-handle-limit.
- labels:
- rulesgroup: rabbitmq
- severity: warning
- - alert: TCPSocketsNearLimit
- expr: |
- sum by(namespace, rabbitmq_cluster, pod, rabbitmq_node) (max_over_time(rabbitmq_process_open_tcp_sockets[5m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node, pod) rabbitmq_identity_info)
- /
- sum by(namespace, rabbitmq_cluster, pod, rabbitmq_node) (rabbitmq_process_max_tcp_sockets * on(instance) group_left(rabbitmq_cluster, rabbitmq_node, pod) rabbitmq_identity_info)
- > 0.8
- for: 10m
- annotations:
- description: |
- `{{ $value | humanizePercentage }}` TCP sockets of TCP socket
- limit are open in RabbitMQ node `{{ $labels.rabbitmq_node }}`, pod `{{ $labels.pod }}`,
- RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}`, namespace `{{ $labels.namespace }}`.
- summary: |
- More than 80% of TCP sockets are open on the RabbitMQ node.
- When this value reaches 100%, new connections will not be accepted.
- Client libraries, peer nodes and CLI tools will not be able to connect when the node runs out of available TCP sockets.
- See https://www.rabbitmq.com/networking.html.
- labels:
- rulesgroup: rabbitmq
- severity: warning
- - alert: PersistentVolumeMissing
- expr: |
- kube_persistentvolumeclaim_status_phase{phase="Bound"} * on (namespace, persistentvolumeclaim) group_left(label_app_kubernetes_io_name) kube_persistentvolumeclaim_labels{label_app_kubernetes_io_component="rabbitmq"}
- ==
- 0
- for: 10m
- annotations:
- description: |
- PersistentVolumeClaim `{{ $labels.persistentvolumeclaim }}` of
- RabbitMQ cluster `{{ $labels.label_app_kubernetes_io_name }}` in namespace
- `{{ $labels.namespace }}` is not bound.
- summary: |
- RabbitMQ needs a PersistentVolume for its data.
- However, there is no PersistentVolume bound to the PersistentVolumeClaim.
- This means the requested storage could not be provisioned.
- Check the status of the PersistentVolumeClaim: `kubectl -n {{ $labels.namespace }} describe pvc {{ $labels.persistentvolumeclaim }}`.
- labels:
- rabbitmq_cluster: '{{ $labels.label_app_kubernetes_io_name }}'
- rulesgroup: rabbitmq
- severity: critical
- - alert: InsufficientEstablishedErlangDistributionLinks
- # erlang_vm_dist_node_state: 1=pending, 2=up_pending, 3=up
- expr: |
- count by (namespace, rabbitmq_cluster) (erlang_vm_dist_node_state * on(instance) group_left(rabbitmq_cluster) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster) == 3)
- <
- count by (namespace, rabbitmq_cluster) (rabbitmq_build_info * on(instance) group_left(rabbitmq_cluster) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster))
- *
- (count by (namespace, rabbitmq_cluster) (rabbitmq_build_info * on(instance) group_left(rabbitmq_cluster) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)) -1 )
- for: 10m
- annotations:
- description: |
- There are only `{{ $value }}` established Erlang distribution links
- in RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}` in namespace `{{ $labels.namespace }}`.
- summary: |
- RabbitMQ clusters have a full mesh topology.
- All RabbitMQ nodes connect to all other RabbitMQ nodes in both directions.
- The expected number of established Erlang distribution links is therefore `n*(n-1)` where `n` is the number of RabbitMQ nodes in the cluster.
- Therefore, the expected number of distribution links are `0` for a 1-node cluster, `6` for a 3-node cluster, and `20` for a 5-node cluster.
- This alert reports that the number of established distributions links is less than the expected number.
- Some reasons for this alert include failed network links, network partitions, failed clustering (i.e. nodes can't join the cluster).
- Check the panels `All distribution links`, `Established distribution links`, `Connecting distributions links`, `Waiting distribution links`, and `distribution links`
- of the Grafana dashboard `Erlang-Distribution`.
- Check the logs of the RabbitMQ nodes: `kubectl -n {{ $labels.namespace }} logs -l app.kubernetes.io/component=rabbitmq,app.kubernetes.io/name={{ $labels.rabbitmq_cluster }}`
- labels:
- rulesgroup: rabbitmq
- severity: warning
  - alert: MemoryAlarm
  expr: |
  max by(rabbitmq_cluster) (
  max_over_time(rabbitmq_alarms_memory_used_watermark[5m])
- * on(instance) group_left(rabbitmq_cluster, rabbitmq_node, pod) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)
+ * on(instance) group_left(rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)
  ) > 0
  keep_firing_for: 5m
  annotations:
@@ -193,7 +112,7 @@ groups:
  expr: |
  max by(rabbitmq_cluster) (
  max_over_time(rabbitmq_alarms_free_disk_space_watermark[5m])
- * on(instance) group_left(rabbitmq_cluster, rabbitmq_node, pod) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)
+ * on(instance) group_left(rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)
  ) > 0
  keep_firing_for: 5m
  annotations:
@@ -209,7 +128,7 @@ groups:
  expr: |
  max by(rabbitmq_cluster) (
  max_over_time(rabbitmq_alarms_file_descriptor_limit[5m])
- * on(instance) group_left(rabbitmq_cluster, rabbitmq_node, pod) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)
+ * on(instance) group_left(rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)
  ) > 0
  keep_firing_for: 5m
  annotations:
@@ -221,18 +140,71 @@ groups:
  labels:
  rulesgroup: rabbitmq
  severity: warning
+ - alert: FileDescriptorsNearLimit
+ expr: |
+ sum by(namespace, rabbitmq_cluster, pod, rabbitmq_node) (max_over_time(rabbitmq_process_open_fds[5m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max by (instance, rabbitmq_node, rabbitmq_cluster) (rabbitmq_identity_info))
+ /
+ sum by(namespace, rabbitmq_cluster, pod, rabbitmq_node) (rabbitmq_process_max_fds * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max by (instance, rabbitmq_node, rabbitmq_cluster) (rabbitmq_identity_info))
+ > 0.8
+ for: 10m
+ annotations:
+ description: |
+ `{{ $value | humanizePercentage }}` file descriptors of file
+ descriptor limit are used in RabbitMQ node `{{ $labels.rabbitmq_node }}`,
+ pod `{{ $labels.pod }}`, RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}`,
+ namespace `{{ $labels.namespace }}`.
+ summary: |
+ More than 80% of file descriptors are used on the RabbitMQ node.
+ When this value reaches 100%, new connections will not be accepted and disk write operations may fail.
+ Client libraries, peer nodes and CLI tools will not be able to connect when the node runs out of available file descriptors.
+ See https://www.rabbitmq.com/production-checklist.html#resource-limits-file-handle-limit.
+ labels:
+ rulesgroup: rabbitmq
+ severity: warning
+ - alert: ContainerRestarts
+ expr: |
+ increase(kube_pod_container_status_restarts_total[10m])
+ * on(namespace, pod, container) group_left(rabbitmq_cluster) max by (namespace, pod, container, rabbitmq_cluster) (rabbitmq_identity_info)
+ >= 1
+ for: 5m
+ annotations:
+ description: |
+ Over the last 10 minutes, container `{{ $labels.container }}`
+ restarted `{{ $value | printf "%.0f" }}` times in pod `{{ $labels.pod }}` of RabbitMQ cluster
+ `{{ $labels.rabbitmq_cluster }}` in namespace `{{ $labels.namespace }}`.
+ summary: |
+ Investigate why the container got restarted.
+ Check the logs of the current container: `kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }}`
+ Check the logs of the previous container: `kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }} --previous`
+ Check the last state of the container: `kubectl -n {{ $labels.namespace }} get pod {{ $labels.pod }} -o jsonpath='{.status.containerStatuses[].lastState}'`
+ labels:
+ rabbitmq_cluster: '{{ $labels.rabbitmq_cluster }}'
+ rulesgroup: rabbitmq
+ severity: warning
  - alert: HighConnectionChurn
  expr: |
  (
- sum(rate(rabbitmq_connections_closed_total[5m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)) by(namespace, rabbitmq_cluster)
+ sum by (namespace, rabbitmq_cluster) (
+ rate(rabbitmq_connections_closed_total[5m])
+ * on (instance) group_left (rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)
+ )
  +
- sum(rate(rabbitmq_connections_opened_total[5m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)) by(namespace, rabbitmq_cluster)
+ sum by (namespace, rabbitmq_cluster) (
+ rate(rabbitmq_connections_opened_total[5m])
+ * on (instance) group_left (rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)
+ )
  )
  /
- sum (rabbitmq_connections * on(instance) group_left(rabbitmq_cluster) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)) by (namespace, rabbitmq_cluster)
+ sum by (namespace, rabbitmq_cluster) (
+ rabbitmq_connections
+ * on (instance) group_left (rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)
+ )
  > 0.1
  unless
- sum (rabbitmq_connections * on(instance) group_left(rabbitmq_cluster) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)) by (namespace, rabbitmq_cluster)
+ sum by (namespace, rabbitmq_cluster) (
+ rabbitmq_connections
+ * on (instance) group_left (rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)
+ )
  < 100
  for: 10m
  annotations:
@@ -251,13 +223,13 @@ groups:
  # The 2nd condition ensures that data points are available until 24 hours ago such that no false positive alerts are triggered for newly created RabbitMQ clusters.
  expr: |
  (
- predict_linear(rabbitmq_disk_space_available_bytes[24h], 60*60*24) * on (instance, pod) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)
+ predict_linear(rabbitmq_disk_space_available_bytes[24h], 60*60*24) * on (instance) group_left(rabbitmq_cluster, rabbitmq_node) max by (instance, rabbitmq_node, rabbitmq_cluster) (rabbitmq_identity_info)
  <
- rabbitmq_disk_space_available_limit_bytes * on (instance, pod) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)
+ rabbitmq_disk_space_available_limit_bytes * on (instance) group_left(rabbitmq_cluster, rabbitmq_node) max by (instance, rabbitmq_node, rabbitmq_cluster) (rabbitmq_identity_info)
  )
  and
  (
- count_over_time(rabbitmq_disk_space_available_limit_bytes[2h] offset 22h) * on (instance, pod) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)
+ count_over_time(rabbitmq_disk_space_available_limit_bytes[2h] offset 22h) * on (instance) group_left(rabbitmq_cluster, rabbitmq_node) max by (instance, rabbitmq_node, rabbitmq_cluster) (rabbitmq_identity_info)
  >
  0
  )
@@ -280,6 +252,26 @@ groups:
  labels:
  rulesgroup: rabbitmq
  severity: warning
+ - alert: PersistentVolumeMissing
+ expr: |
+ kube_persistentvolumeclaim_status_phase{phase="Bound"} * on (namespace, persistentvolumeclaim) group_left(label_app_kubernetes_io_name) kube_persistentvolumeclaim_labels{label_app_kubernetes_io_component="rabbitmq"}
+ ==
+ 0
+ for: 10m
+ annotations:
+ description: |
+ PersistentVolumeClaim `{{ $labels.persistentvolumeclaim }}` of
+ RabbitMQ cluster `{{ $labels.label_app_kubernetes_io_name }}` in namespace
+ `{{ $labels.namespace }}` is not bound.
+ summary: |
+ RabbitMQ needs a PersistentVolume for its data.
+ However, there is no PersistentVolume bound to the PersistentVolumeClaim.
+ This means the requested storage could not be provisioned.
+ Check the status of the PersistentVolumeClaim: `kubectl -n {{ $labels.namespace }} describe pvc {{ $labels.persistentvolumeclaim }}`.
+ labels:
+ rabbitmq_cluster: '{{ $labels.label_app_kubernetes_io_name }}'
+ rulesgroup: rabbitmq
+ severity: critical
  # The first 2 rules create a metric ALERTS:rabbitmq_alert_state_numeric which has value 1 for alertstate pending and value 2 for alertstate firing
  - expr: |
  ALERTS{rulesgroup="rabbitmq", alertstate="pending"} * 0 + 1
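
Note on the `InsufficientEstablishedErlangDistributionLinks` threshold: the rule's summary gives the expected link count as `n*(n-1)` for an `n`-node full mesh. The Python sketch below only reproduces that arithmetic and the rule's `<` comparison so the figures quoted in the summary (`0`, `6`, `20`) can be checked by hand. The function names `expected_distribution_links` and `alert_condition` are hypothetical and not part of the rule file; this is an illustration, not how Prometheus evaluates the expression.

```python
def expected_distribution_links(node_count: int) -> int:
    """Expected established links in a full mesh.

    Each of the n nodes holds a link to each of the other n-1 nodes,
    and links are counted in both directions, giving n * (n - 1).
    """
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    return node_count * (node_count - 1)


def alert_condition(established_links: int, node_count: int) -> bool:
    """Mirror the rule's comparison: established links < n * (n - 1)."""
    return established_links < expected_distribution_links(node_count)


if __name__ == "__main__":
    # Cluster sizes taken from the alert summary text.
    for n in (1, 3, 5):
        print(f"{n}-node cluster: {expected_distribution_links(n)} expected links")
```

Under this sketch, a 3-node cluster reporting 4 established links satisfies the comparison, and the rule would then fire only if that stays true for the `for: 10m` window.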
