+          count by (namespace, rabbitmq_cluster) (erlang_vm_dist_node_state * on(instance) group_left(rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info) == 3)
+          <
+          count by (namespace, rabbitmq_cluster) (rabbitmq_build_info * on(instance) group_left(rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info))
+          *
+          (count by (namespace, rabbitmq_cluster) (rabbitmq_build_info * on(instance) group_left(rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)) - 1)
+        for: 10m
+        annotations:
+          description: |
+            There are only `{{ $value }}` established Erlang distribution links
+            in RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}` in namespace `{{ $labels.namespace }}`.
+          summary: |
+            RabbitMQ clusters have a full mesh topology.
+            All RabbitMQ nodes connect to all other RabbitMQ nodes in both directions.
+            The expected number of established Erlang distribution links is therefore `n*(n-1)`, where `n` is the number of RabbitMQ nodes in the cluster.
+            Therefore, the expected number of distribution links is `0` for a 1-node cluster, `6` for a 3-node cluster, and `20` for a 5-node cluster.
+            This alert reports that the number of established distribution links is less than the expected number.
+            Some reasons for this alert include failed network links, network partitions, and failed clustering (i.e. nodes can't join the cluster).
+            Check the panels `All distribution links`, `Established distribution links`, `Connecting distributions links`, `Waiting distribution links`, and `distribution links`
+            of the Grafana dashboard `Erlang-Distribution`.
+            Check the logs of the RabbitMQ nodes: `kubectl -n {{ $labels.namespace }} logs -l app.kubernetes.io/component=rabbitmq,app.kubernetes.io/name={{ $labels.rabbitmq_cluster }}`
+        labels:
+          rulesgroup: rabbitmq
+          severity: warning
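
The `n*(n-1)` arithmetic in the summary can be checked ad hoc in the Prometheus expression browser. A minimal sketch, reusing the metric and label names from the rule above; the trailing `== 3` keeps only links reported as established, exactly as in the alert expression:

    # Established Erlang distribution links per RabbitMQ cluster.
    # The info-metric join copies the rabbitmq_cluster label onto each link series.
    count by (namespace, rabbitmq_cluster) (
      erlang_vm_dist_node_state
        * on(instance) group_left(rabbitmq_cluster)
          max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)
      == 3
    )

Compare the result against `n*(n-1)` for the node count `n` of that cluster; a 3-node cluster should report `6`.
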
       - alert: UnroutableMessages
         expr: |
-          sum by(namespace, rabbitmq_cluster) (increase(rabbitmq_channel_messages_unroutable_dropped_total[5m]) * on(instance) group_left(rabbitmq_cluster) rabbitmq_identity_info)
+          sum by(namespace, rabbitmq_cluster) (increase(rabbitmq_channel_messages_unroutable_dropped_total[5m]) * on(instance) group_left(rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info))
           >= 1
           or
-          sum by(namespace, rabbitmq_cluster) (increase(rabbitmq_channel_messages_unroutable_returned_total[5m]) * on(instance) group_left(rabbitmq_cluster) rabbitmq_identity_info)
+          sum by(namespace, rabbitmq_cluster) (increase(rabbitmq_channel_messages_unroutable_returned_total[5m]) * on(instance) group_left(rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info))
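
Both changed lines follow the same join pattern: `on(instance) group_left(rabbitmq_cluster)` copies the `rabbitmq_cluster` label from the info metric onto each sample, and aggregating the info metric with `max by (instance, rabbitmq_cluster)` first collapses any duplicate `rabbitmq_identity_info` series per instance, so the right-hand side of the many-to-one match is guaranteed to be unique. A minimal sketch of the pattern; `some_metric` is a placeholder, not a metric from this rule set:

    # Enrich an arbitrary per-instance metric with the rabbitmq_cluster label.
    # max by (...) deduplicates the info metric so group_left sees exactly one
    # series per instance and the query cannot fail with a many-to-many match error.
    some_metric
      * on(instance) group_left(rabbitmq_cluster)
        max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)
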
-            Over the last 10 minutes, container `{{ $labels.container }}`
-            restarted `{{ $value | printf "%.0f" }}` times in pod `{{ $labels.pod }}` of RabbitMQ cluster
-            `{{ $labels.rabbitmq_cluster }}` in namespace `{{ $labels.namespace }}`.
-          summary: |
-            Investigate why the container got restarted.
-            Check the logs of the current container: `kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }}`
-            Check the logs of the previous container: `kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }} --previous`
-            Check the last state of the container: `kubectl -n {{ $labels.namespace }} get pod {{ $labels.pod }} -o jsonpath='{.status.containerStatuses[].lastState}'`

-          count by (namespace, rabbitmq_cluster) (erlang_vm_dist_node_state * on(instance) group_left(rabbitmq_cluster) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster) == 3)
-          <
-          count by (namespace, rabbitmq_cluster) (rabbitmq_build_info * on(instance) group_left(rabbitmq_cluster) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster))
-          *
-          (count by (namespace, rabbitmq_cluster) (rabbitmq_build_info * on(instance) group_left(rabbitmq_cluster) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)) -1 )
-        for: 10m
-        annotations:
-          description: |
-            There are only `{{ $value }}` established Erlang distribution links
-            in RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}` in namespace `{{ $labels.namespace }}`.
-          summary: |
-            RabbitMQ clusters have a full mesh topology.
-            All RabbitMQ nodes connect to all other RabbitMQ nodes in both directions.
-            The expected number of established Erlang distribution links is therefore `n*(n-1)` where `n` is the number of RabbitMQ nodes in the cluster.
-            Therefore, the expected number of distribution links are `0` for a 1-node cluster, `6` for a 3-node cluster, and `20` for a 5-node cluster.
-            This alert reports that the number of established distributions links is less than the expected number.
-            Some reasons for this alert include failed network links, network partitions, failed clustering (i.e. nodes can't join the cluster).
-            Check the panels `All distribution links`, `Established distribution links`, `Connecting distributions links`, `Waiting distribution links`, and `distribution links`
-            of the Grafana dashboard `Erlang-Distribution`.
-            Check the logs of the RabbitMQ nodes: `kubectl -n {{ $labels.namespace }} logs -l app.kubernetes.io/component=rabbitmq,app.kubernetes.io/name={{ $labels.rabbitmq_cluster }}`

+          * on(namespace, pod, container) group_left(rabbitmq_cluster) max by (namespace, pod, container, rabbitmq_cluster) (rabbitmq_identity_info)
+          >= 1
+        for: 5m
+        annotations:
+          description: |
+            Over the last 10 minutes, container `{{ $labels.container }}`
+            restarted `{{ $value | printf "%.0f" }}` times in pod `{{ $labels.pod }}` of RabbitMQ cluster
+            `{{ $labels.rabbitmq_cluster }}` in namespace `{{ $labels.namespace }}`.
+          summary: |
+            Investigate why the container got restarted.
+            Check the logs of the current container: `kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }}`
+            Check the logs of the previous container: `kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }} --previous`
+            Check the last state of the container: `kubectl -n {{ $labels.namespace }} get pod {{ $labels.pod }} -o jsonpath='{.status.containerStatuses[].lastState}'`
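
The opening lines of this `expr` are outside the hunk, so only the join and the threshold are visible here. As an illustration only, a restart-count vector from kube-state-metrics that would fit the `on(namespace, pod, container)` match and the ten-minute window mentioned in the description might look like the sketch below; the metric choice is an assumption, not taken from this rule:

    # Hypothetical left-hand side (assumption): container restarts over the last
    # 10 minutes, joined to the RabbitMQ identity metric to pick up rabbitmq_cluster.
    increase(kube_pod_container_status_restarts_total[10m])
      * on(namespace, pod, container) group_left(rabbitmq_cluster)
        max by (namespace, pod, container, rabbitmq_cluster) (rabbitmq_identity_info)
    >= 1
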
+              * on (instance) group_left (rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)
+            )
           )
           /
-          sum (rabbitmq_connections * on(instance) group_left(rabbitmq_cluster) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)) by (namespace, rabbitmq_cluster)
+          sum by (namespace, rabbitmq_cluster) (
+            rabbitmq_connections
+            * on (instance) group_left (rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)
+          )
           > 0.1
           unless
-          sum (rabbitmq_connections * on(instance) group_left(rabbitmq_cluster) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)) by (namespace, rabbitmq_cluster)
+          sum by (namespace, rabbitmq_cluster) (
+            rabbitmq_connections
+            * on (instance) group_left (rabbitmq_cluster) max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)
+          )
           < 100
         for: 10m
         annotations:
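
The `unless ... < 100` clause removes clusters whose total connection count is below 100, so the 10% threshold only applies to clusters with a meaningful number of connections. The denominator can be inspected on its own, exactly as written in the hunk above:

    # Current connection count per RabbitMQ cluster (denominator of the ratio,
    # and the series the "unless ... < 100" guard is evaluated against).
    sum by (namespace, rabbitmq_cluster) (
      rabbitmq_connections
        * on (instance) group_left (rabbitmq_cluster)
          max by (instance, rabbitmq_cluster) (rabbitmq_identity_info)
    )
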
@@ -251,13 +223,13 @@ groups:
         # The 2nd condition ensures that data points are available until 24 hours ago such that no false positive alerts are triggered for newly created RabbitMQ clusters.
         expr: |
           (
-            predict_linear(rabbitmq_disk_space_available_bytes[24h], 60*60*24) * on (instance, pod) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)
+            predict_linear(rabbitmq_disk_space_available_bytes[24h], 60*60*24) * on (instance) group_left(rabbitmq_cluster, rabbitmq_node) max by (instance, rabbitmq_node, rabbitmq_cluster) (rabbitmq_identity_info)
             <
-            rabbitmq_disk_space_available_limit_bytes * on (instance, pod) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)
+            rabbitmq_disk_space_available_limit_bytes * on (instance) group_left(rabbitmq_cluster, rabbitmq_node) max by (instance, rabbitmq_node, rabbitmq_cluster) (rabbitmq_identity_info)
           )
           and
           (
-            count_over_time(rabbitmq_disk_space_available_limit_bytes[2h] offset 22h) * on (instance, pod) group_left(rabbitmq_cluster, rabbitmq_node) max(rabbitmq_identity_info) by (namespace, pod, container, rabbitmq_cluster)
+            count_over_time(rabbitmq_disk_space_available_limit_bytes[2h] offset 22h) * on (instance) group_left(rabbitmq_cluster, rabbitmq_node) max by (instance, rabbitmq_node, rabbitmq_cluster) (rabbitmq_identity_info)
             >
             0
           )
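
`predict_linear` fits a simple linear regression over the 24-hour range vector and extrapolates it `60*60*24` seconds (one day) into the future; the first condition fires when that projection drops below the node's configured free-disk-space limit. Stripped of the label-enrichment join, the core condition is a simplified sketch like this:

    # Simplified core of the condition above (without the rabbitmq_identity_info join):
    # alert when available disk space, extrapolated one day ahead from the last
    # 24 hours of samples, is predicted to fall below the configured limit.
    predict_linear(rabbitmq_disk_space_available_bytes[24h], 60 * 60 * 24)
    <
    rabbitmq_disk_space_available_limit_bytes
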
@@ -280,6 +252,26 @@ groups:
         labels:
           rulesgroup: rabbitmq
           severity: warning
+      - alert: PersistentVolumeMissing
+        expr: |
+          kube_persistentvolumeclaim_status_phase{phase="Bound"} * on (namespace, persistentvolumeclaim) group_left(label_app_kubernetes_io_name) kube_persistentvolumeclaim_labels{label_app_kubernetes_io_component="rabbitmq"}
+          ==
+          0
+        for: 10m
+        annotations:
+          description: |
+            PersistentVolumeClaim `{{ $labels.persistentvolumeclaim }}` of
+            RabbitMQ cluster `{{ $labels.label_app_kubernetes_io_name }}` in namespace
+            `{{ $labels.namespace }}` is not bound.
+          summary: |
+            RabbitMQ needs a PersistentVolume for its data.
+            However, there is no PersistentVolume bound to the PersistentVolumeClaim.
+            This means the requested storage could not be provisioned.
+            Check the status of the PersistentVolumeClaim: `kubectl -n {{ $labels.namespace }} describe pvc {{ $labels.persistentvolumeclaim }}`.
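
The expression multiplies the `Bound`-phase time series by the PVC labels metric so that only claims labeled `app.kubernetes.io/component=rabbitmq` are considered; the product is `0` exactly when such a claim is not in phase `Bound`. The same metrics can be queried directly to list the affected claims, for example:

    # RabbitMQ PersistentVolumeClaims that are currently not Bound.
    kube_persistentvolumeclaim_status_phase{phase="Bound"} == 0
    and on (namespace, persistentvolumeclaim)
      kube_persistentvolumeclaim_labels{label_app_kubernetes_io_component="rabbitmq"}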