
[BUG]: planner does not set correct prom endpoint #4412

@sozercan

Describe the Bug

By default, the planner expects the Prometheus endpoint at

<dynamo namespace from DGD>-prometheus.<dynamo's namespace>
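For illustration, the default naming pattern can be sketched as below. This is a reconstruction from the hostname in the planner log, not the planner's actual code; the function name `default_prometheus_endpoint` and its parameters are hypothetical.

```python
def default_prometheus_endpoint(dgd_namespace: str, k8s_namespace: str, port: int = 9090) -> str:
    """Build the endpoint the planner appears to default to.

    Assumption: the pattern is inferred from the failing hostname in the log
    (trtllm-disagg-prometheus.dynamo-system.svc.cluster.local:9090).
    """
    return f"http://{dgd_namespace}-prometheus.{k8s_namespace}.svc.cluster.local:{port}"


# Values taken from the log in this issue.
endpoint = default_prometheus_endpoint("trtllm-disagg", "dynamo-system")
print(endpoint)
```

With the values from this report, the result points at a service in dynamo-system, not at the kube-prometheus-stack service in monitoring.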

However, the docs (whose kube-prometheus-stack link is broken) say Prometheus should be deployed into the monitoring namespace:
https://docs.nvidia.com/dynamo/latest/planner/sla_planner_quickstart.html#prerequisites

In that setup the endpoint is http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090:
https://docs.nvidia.com/dynamo/latest/kubernetes/observability/metrics.html

Error getting avg input sequence tokens: HTTPConnectionPool(host='trtllm-disagg-prometheus.dynamo-system.svc.cluster.local', port=9090): Max retries exceeded with url: /api/v1/query?query=increase%28dynamo_frontend_input_sequence_tokens_sum%5B180s%5D%29%2Fincrease%28dynamo_frontend_input_sequence_tokens_count%5B180s%5D%29 (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f3a8b93d550>: Failed to resolve 'trtllm-disagg-prometheus.dynamo-system.svc.cluster.local' ([Errno -2] Name or service not known)"))
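To make the log easier to read, the percent-encoded URL can be decoded with the Python standard library. The sketch below only parses the URL from the error message; the `http://` scheme is assumed from the `HTTPConnection` in the log.

```python
from urllib.parse import parse_qs, urlsplit

# URL reassembled from the host, port, and path in the error message above.
failing_url = (
    "http://trtllm-disagg-prometheus.dynamo-system.svc.cluster.local:9090"
    "/api/v1/query?query=increase%28dynamo_frontend_input_sequence_tokens_sum%5B180s%5D%29"
    "%2Fincrease%28dynamo_frontend_input_sequence_tokens_count%5B180s%5D%29"
)

parts = urlsplit(failing_url)
# parse_qs percent-decodes the query string and returns a dict of lists.
promql = parse_qs(parts.query)["query"][0]

print(parts.hostname)  # the name that failed DNS resolution
print(promql)          # the PromQL expression the planner sent
```

The decoded query is an ordinary average of input sequence tokens over a 180s window, so the failure is in resolving the hostname, not in the query itself.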

Workaround: manually set PROMETHEUS_ENDPOINT in the planner deployment to http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090, or create a new Prometheus deployment at the endpoint the planner expects.
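A minimal sketch of the override logic this workaround relies on, assuming the planner reads PROMETHEUS_ENDPOINT from its environment and otherwise falls back to its default. The helper names `resolve_endpoint` and `can_resolve` are hypothetical and not part of dynamo; `can_resolve` is just a quick DNS check that can be run from inside the planner pod.

```python
import os
import socket
from urllib.parse import urlsplit

# Endpoint of the kube-prometheus-stack service in the monitoring namespace.
DOCS_ENDPOINT = "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"


def resolve_endpoint(default: str, env=os.environ) -> str:
    """Prefer PROMETHEUS_ENDPOINT when it is set and non-empty, else use the default."""
    return env.get("PROMETHEUS_ENDPOINT") or default


def can_resolve(url: str) -> bool:
    """Return True if the URL's hostname resolves via DNS from this machine."""
    parts = urlsplit(url)
    try:
        socket.getaddrinfo(parts.hostname, parts.port or 80)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    default = "http://trtllm-disagg-prometheus.dynamo-system.svc.cluster.local:9090"
    endpoint = resolve_endpoint(default)
    print(endpoint, "resolves" if can_resolve(endpoint) else "does not resolve")
```

Run inside the cluster, this should report whether the endpoint the planner ends up with is resolvable before any queries are attempted.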

Steps to Reproduce

Deploy a DGDR from one of the examples.
Look at the planner logs.

Expected Behavior

The docs and the implementation defaults are consistent: the planner's default Prometheus endpoint matches the deployment the docs describe.

Actual Behavior

The planner cannot resolve its default Prometheus endpoint, so the endpoint has to be set manually.

Environment

dynamo 0.6.1

Additional Context

No response

Screenshots

No response


Labels

bug (Something isn't working)
