Closed
Labels
Pri:P2, Source:grafana, Team:nvidia-infra, Team:pytorch-dev-infra, area:alerting
Description
Nvidia B200 jobs are queueing, please investigate. Max queue time: 241 mins, Max queue size: 8 runners. Please visit http://hud.pytorch.org/metrics to see which runners are queueing.
Alert Details
- Occurred At: Dec 2, 3:41pm PST
- State: FIRING
- Teams: pytorch-dev-infra, nvidia-infra
- Priority: P2
- Description: Alerts when the B200 runners are queuing for a long time or when many of them are queuing.
- Reason: max_queue_size=8, max_queue_time_mins=241, queue_size_threshold=0, queue_time_threshold=1, threshold_breached=1
- Runbook: https://hud.pytorch.org/metrics
- View Alert: https://pytorchci.grafana.net/alerting/grafana/eez5ua39adslce/view?orgId=1
- Silence Alert: https://pytorchci.grafana.net/alerting/silence/new?alertmanager=grafana&matcher=__alert_rule_uid__%3Deez5ua39adslce&matcher=type%3Dalerting-infra&orgId=1
- Source: grafana
- Fingerprint: c13639384b5c8c24bb89711a42bbdc76c3eee865c371a943b41e875a13103436
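The Reason field above is a comma-separated list of `key=value` pairs with integer values. For triage scripts that need those numbers, a minimal parsing sketch might look like the following. The function name `parse_reason` is hypothetical and not part of any alerting tooling, and the sketch only extracts the values; it makes no assumption about how the Grafana rule derives `threshold_breached`.

```python
def parse_reason(reason: str) -> dict[str, int]:
    """Parse a 'k1=v1, k2=v2' string into a dict of integer values.

    Hypothetical helper: assumes every value is an integer, as in the
    Reason line of this alert.
    """
    fields: dict[str, int] = {}
    for pair in reason.split(","):
        key, _, value = pair.strip().partition("=")
        fields[key] = int(value)
    return fields


# Reason string copied verbatim from the alert details.
reason = (
    "max_queue_size=8, max_queue_time_mins=241, queue_size_threshold=0, "
    "queue_time_threshold=1, threshold_breached=1"
)
fields = parse_reason(reason)
print(fields["max_queue_size"], fields["max_queue_time_mins"])
```

The parsed values match the summary in the description: a maximum queue size of 8 runners and a maximum queue time of 241 minutes.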