Description
Describe the Bug
By default, the planner expects the Prometheus endpoint at
<dynamo namespace from DGD>-prometheus.<dynamo's namespace>
However, the docs (which contain a broken kube-prometheus-stack link) say Prometheus should be deployed into the monitoring namespace:
https://docs.nvidia.com/dynamo/latest/planner/sla_planner_quickstart.html#prerequisites
That puts the endpoint at http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
https://docs.nvidia.com/dynamo/latest/kubernetes/observability/metrics.html
Error getting avg input sequence tokens: HTTPConnectionPool(host='trtllm-disagg-prometheus.dynamo-system.svc.cluster.local', port=9090): Max retries exceeded with url: /api/v1/query?query=increase%28dynamo_frontend_input_sequence_tokens_sum%5B180s%5D%29%2Fincrease%28dynamo_frontend_input_sequence_tokens_count%5B180s%5D%29 (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f3a8b93d550>: Failed to resolve 'trtllm-disagg-prometheus.dynamo-system.svc.cluster.local' ([Errno -2] Name or service not known)"))
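For readability, the percent-encoded query in the error above can be decoded back into PromQL. This is a minimal sketch that uses only the Python standard library. The variable names are illustrative, and the encoded string is copied from the log line.

```python
from urllib.parse import unquote

# Query string copied from the planner error message above.
encoded_query = (
    "increase%28dynamo_frontend_input_sequence_tokens_sum%5B180s%5D%29"
    "%2Fincrease%28dynamo_frontend_input_sequence_tokens_count%5B180s%5D%29"
)

# Decode the percent-escapes to recover the PromQL expression.
promql = unquote(encoded_query)
print(promql)
```

The decoded expression divides the increase of the token sum by the increase of the token count over a 180s window, which matches the "avg input sequence tokens" wording in the error.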
The workaround is to manually set PROMETHEUS_ENDPOINT in the planner deployment to http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090, or to create a new Prometheus deployment at the endpoint the planner expects.
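The mismatch and the override can be illustrated with a small sketch. This is not the planner's actual code. The helper names `default_endpoint` and `resolve_endpoint` are hypothetical, and the default pattern is only inferred from the hostname in the error message. The one name taken from this report is the PROMETHEUS_ENDPOINT variable.

```python
import os
from typing import Mapping, Optional

# Endpoint that results from following the docs (monitoring namespace).
DOCS_ENDPOINT = (
    "http://prometheus-kube-prometheus-prometheus"
    ".monitoring.svc.cluster.local:9090"
)


def default_endpoint(dynamo_namespace: str, k8s_namespace: str) -> str:
    """Hypothetical default, inferred from the hostname in the error log."""
    return (
        f"http://{dynamo_namespace}-prometheus"
        f".{k8s_namespace}.svc.cluster.local:9090"
    )


def resolve_endpoint(
    dynamo_namespace: str,
    k8s_namespace: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Prefer PROMETHEUS_ENDPOINT when set, else fall back to the default."""
    env = os.environ if env is None else env
    override = env.get("PROMETHEUS_ENDPOINT", "").strip()
    return override or default_endpoint(dynamo_namespace, k8s_namespace)


# Without the override, the inferred default points at a service that
# does not exist when Prometheus was installed as the docs describe.
print(resolve_endpoint("trtllm-disagg", "dynamo-system", env={}))

# With the workaround applied, the docs endpoint is used instead.
print(
    resolve_endpoint(
        "trtllm-disagg",
        "dynamo-system",
        env={"PROMETHEUS_ENDPOINT": DOCS_ENDPOINT},
    )
)
```

With an empty environment the sketch yields the hostname seen in the error. With the override set it yields the endpoint from the docs.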
Steps to Reproduce
Deploy a DGDR from one of the examples.
Check the planner logs.
Expected Behavior
The docs and the implementation defaults are consistent, giving a cohesive experience.
Actual Behavior
The planner cannot reach the Prometheus endpoint with the default settings, so the endpoint must be set manually.
Environment
dynamo 0.6.1
Additional Context
No response
Screenshots
No response