Commit afccc9d

refactor: consolidate Observability files (e.g. OTEL docker-compose, md files) (#4173)
Signed-off-by: Keiven Chang <[email protected]> Co-authored-by: Keiven Chang <[email protected]>
1 parent 3577b5c commit afccc9d

52 files changed: +1148 -1348 lines changed

README.md

Lines changed: 1 addition & 2 deletions
@@ -101,9 +101,8 @@ To coordinate across a data center, Dynamo relies on etcd and NATS. To run Dynam
 
 To quickly setup etcd & NATS, you can also run:
 
-```
+```bash
 # At the root of the repository:
-# Edit deploy/docker-compose.yml to comment out "runtime: nvidia" of the dcgm-exporter service if the nvidia container runtime isn't deployed or to be used.
 docker compose -f deploy/docker-compose.yml up -d
 ```

components/src/dynamo/sglang/publisher.py

Lines changed: 1 addition & 1 deletion
@@ -204,7 +204,7 @@ def setup_prometheus_registry(
     SGLang uses multiprocess architecture where metrics are stored in shared memory.
     MultiProcessCollector aggregates metrics from all worker processes. The Prometheus
     registry collects sglang:* metrics which are exposed via the metrics server endpoint
-    (typically port 8081) when DYN_SYSTEM_ENABLED=true.
+    (set DYN_SYSTEM_PORT to a positive value to enable, e.g., DYN_SYSTEM_PORT=8081).
 
     Args:
         engine: The SGLang engine instance.
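
The updated docstring says the metrics endpoint is enabled by setting `DYN_SYSTEM_PORT` to a positive value. A minimal sketch of that rule follows. It assumes the variable is read from the process environment and that an unset, non-numeric, or non-positive value means "disabled". The helper name `system_metrics_port` is hypothetical and is not part of Dynamo.

```python
import os
from typing import Mapping, Optional


def system_metrics_port(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    """Return the metrics port if DYN_SYSTEM_PORT is a positive integer, else None.

    Hypothetical helper for illustration only. Treating invalid values as
    "disabled" is an assumption, not confirmed Dynamo behavior.
    """
    env = os.environ if env is None else env
    raw = env.get("DYN_SYSTEM_PORT", "").strip()
    try:
        port = int(raw)
    except ValueError:
        # Unset or non-numeric value: treat the endpoint as disabled.
        return None
    return port if port > 0 else None
```

With `DYN_SYSTEM_PORT=8081` exported, the helper returns `8081`; with the variable unset or set to `0`, it returns `None`.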

deploy/docker-compose.yml

Lines changed: 3 additions & 122 deletions
@@ -1,26 +1,13 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
 
-# IMPORT NOTE: Make sure this is in sync with lib/runtime/docker-compose.yml
+# Bare minimum infrastructure services for Dynamo.
+# For observability (metrics, tracing, dashboards), use docker-observability.yml
+
 networks:
   server:
     driver: bridge
-  monitoring:
-    driver: bridge
 
-# Note that the images are pinned to specific versions to avoid breaking changes.
 services:
   nats-server:
     image: nats:2.11.4
@@ -31,7 +18,6 @@ services:
       - 8222:8222 # the endpoints include /varz, /healthz, ...
     networks:
       - server
-      - monitoring
 
   etcd-server:
     image: bitnamilegacy/etcd:3.6.1
@@ -42,108 +28,3 @@ services:
       - 2380:2380
     networks:
       - server
-      - monitoring
-
-  # All the services below are part of the metrics profile and monitoring network.
-
-  # The exporter translates from /varz and other stats to Prometheus metrics
-  nats-prometheus-exporter:
-    image: natsio/prometheus-nats-exporter:0.17.3
-    command: ["-varz", "-connz", "-routez", "-subz", "-gatewayz", "-leafz", "-jsz=all", "http://nats-server:8222"]
-    ports:
-      - 7777:7777
-    networks:
-      - monitoring
-    profiles: [metrics]
-    depends_on:
-      - nats-server
-
-  # DCGM stands for Data Center GPU Manager: https://developer.nvidia.com/dcgm
-  # dcgm-exporter is a tool from NVIDIA that exposes DCGM metrics in Prometheus format.
-  dcgm-exporter:
-    image: nvidia/dcgm-exporter:4.2.3-4.1.3-ubi9
-    ports:
-      # Expose dcgm-exporter on port 9401 both inside and outside the container
-      # to avoid conflicts with other dcgm-exporter instances in distributed environments.
-      # To access DCGM metrics:
-      # Outside the container: curl http://localhost:9401/metrics (or the host IP)
-      # Inside the container (container-to-container): curl http://dcgm-exporter:9401/metrics
-      - 9401:9401
-    cap_add:
-      - SYS_ADMIN
-    deploy:
-      resources:
-        reservations:
-          devices:
-            - driver: nvidia
-              count: all
-              capabilities: [gpu]
-    environment:
-      # dcgm uses NVIDIA_VISIBLE_DEVICES variable but normally it is CUDA_VISIBLE_DEVICES
-      - NVIDIA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-all}
-      - DCGM_EXPORTER_LISTEN=:9401
-    runtime: nvidia # Specify the NVIDIA runtime
-    networks:
-      - monitoring
-
-  # To access Prometheus from another machine, you may need to disable te firewall on your host. On Ubuntu:
-  # sudo ufw allow 9090/tcp
-  prometheus:
-    image: prom/prometheus:v3.4.1
-    container_name: prometheus
-    volumes:
-      - ./metrics/prometheus.yml:/etc/prometheus/prometheus.yml
-    command:
-      - '--config.file=/etc/prometheus/prometheus.yml'
-      - '--storage.tsdb.path=/prometheus'
-      # These provide the web console functionality
-      - '--web.console.libraries=/etc/prometheus/console_libraries'
-      - '--web.console.templates=/etc/prometheus/consoles'
-      - '--web.enable-lifecycle'
-    restart: unless-stopped
-    # Example to pull from the /query endpoint:
-    # {__name__=~"DCGM.*", job="dcgm-exporter"}
-    networks:
-      - monitoring
-    ports:
-      - "9090:9090"
-    profiles: [metrics]
-    extra_hosts:
-      - "host.docker.internal:host-gateway"
-    depends_on:
-      - dcgm-exporter
-      - nats-prometheus-exporter
-      - etcd-server
-
-  # grafana connects to prometheus via the /query endpoint.
-  # Default credentials are dynamo/dynamo.
-  # To access Grafana from another machine, you may need to disable te firewall on your host. On Ubuntu:
-  # sudo ufw allow 3001/tcp
-  grafana:
-    image: grafana/grafana-enterprise:12.0.1
-    container_name: grafana
-    volumes:
-      - ./metrics/grafana_dashboards:/etc/grafana/provisioning/dashboards
-      - ./metrics/grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
-    environment:
-      - GF_SERVER_HTTP_PORT=3001
-      # do not make it admin/admin, because you will be prompted to change the password every time
-      - GF_SECURITY_ADMIN_USER=dynamo
-      - GF_SECURITY_ADMIN_PASSWORD=dynamo
-      - GF_USERS_ALLOW_SIGN_UP=false
-      - GF_INSTALL_PLUGINS=grafana-piechart-panel
-      # Default min interval is 5s, but can be configured lower
-      - GF_DASHBOARDS_MIN_REFRESH_INTERVAL=2s
-      # Disable password change requirement
-      - GF_SECURITY_DISABLE_INITIAL_ADMIN_CREATION=false
-      - GF_SECURITY_ADMIN_PASSWORD_POLICY=false
-      - GF_AUTH_DISABLE_LOGIN_FORM=false
-      - GF_AUTH_DISABLE_SIGNOUT_MENU=false
-    restart: unless-stopped
-    ports:
-      - "3001:3001"
-    networks:
-      - monitoring
-    profiles: [metrics]
-    depends_on:
-      - prometheus
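
After this change, `deploy/docker-compose.yml` only starts NATS and etcd. The retained context line notes that the NATS monitoring port 8222 serves endpoints such as `/varz` and `/healthz`, so one way to confirm the slimmed-down stack is up is to probe that port. The sketch below uses only the Python standard library. The helper names are hypothetical, and `localhost` assumes the compose stack runs on the same machine.

```python
import urllib.error
import urllib.request


def nats_monitoring_url(path: str = "/healthz", host: str = "localhost", port: int = 8222) -> str:
    """Build a URL for the NATS monitoring port published by the compose file."""
    return f"http://{host}:{port}{path}"


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False


if __name__ == "__main__":
    url = nats_monitoring_url()
    print(f"{url}: {'up' if is_up(url) else 'down'}")
```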

deploy/docker-observability.yml

Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Observability stack for Dynamo: metrics, tracing, and visualization.
+# Requires deploy/docker-compose.yml to be running for NATS and etcd connectivity.
+#
+# Usage:
+# docker compose -f deploy/docker-observability.yml up -d
+
+version: '3.8'
+
+networks:
+  server:
+    external: true
+    name: deploy_server
+
+volumes:
+  grafana-data:
+  tempo-data:
+
+services:
+  # DCGM stands for Data Center GPU Manager: https://developer.nvidia.com/dcgm
+  # dcgm-exporter is a tool from NVIDIA that exposes DCGM metrics in Prometheus format.
+  dcgm-exporter:
+    image: nvidia/dcgm-exporter:4.2.3-4.1.3-ubi9
+    ports:
+      # Expose dcgm-exporter on port 9401 both inside and outside the container
+      # to avoid conflicts with other dcgm-exporter instances in distributed environments.
+      # To access DCGM metrics:
+      # Outside the container: curl http://localhost:9401/metrics (or the host IP)
+      # Inside the container (container-to-container): curl http://dcgm-exporter:9401/metrics
+      - 9401:9401
+    cap_add:
+      - SYS_ADMIN
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+    environment:
+      # dcgm uses NVIDIA_VISIBLE_DEVICES variable but normally it is CUDA_VISIBLE_DEVICES
+      - NVIDIA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-all}
+      - DCGM_EXPORTER_LISTEN=:9401
+    runtime: nvidia # Specify the NVIDIA runtime
+    networks:
+      - server
+
+  # The exporter translates from /varz and other stats to Prometheus metrics
+  nats-prometheus-exporter:
+    image: natsio/prometheus-nats-exporter:0.17.3
+    command: ["-varz", "-connz", "-routez", "-subz", "-gatewayz", "-leafz", "-jsz=all", "http://nats-server:8222"]
+    ports:
+      - 7777:7777
+    networks:
+      - server
+
+  # To access Prometheus from another machine, you may need to disable te firewall on your host. On Ubuntu:
+  # sudo ufw allow 9090/tcp
+  prometheus:
+    image: prom/prometheus:v3.4.1
+    container_name: prometheus
+    volumes:
+      - ./observability/prometheus.yml:/etc/prometheus/prometheus.yml
+    command:
+      - '--config.file=/etc/prometheus/prometheus.yml'
+      - '--storage.tsdb.path=/prometheus'
+      # These provide the web console functionality
+      - '--web.console.libraries=/etc/prometheus/console_libraries'
+      - '--web.console.templates=/etc/prometheus/consoles'
+      - '--web.enable-lifecycle'
+    restart: unless-stopped
+    # Example to pull from the /query endpoint:
+    # {__name__=~"DCGM.*", job="dcgm-exporter"}
+    ports:
+      - "9090:9090"
+    networks:
+      - server
+    extra_hosts:
+      - "host.docker.internal:host-gateway"
+    depends_on:
+      - dcgm-exporter
+      - nats-prometheus-exporter
+
+  # Tempo - Distributed tracing backend
+  tempo:
+    image: grafana/tempo:2.8.2
+    command: [ "-config.file=/etc/tempo.yaml" ]
+    user: root
+    volumes:
+      - ./observability/tempo.yaml:/etc/tempo.yaml
+      - tempo-data:/tmp/tempo
+    ports:
+      - "3200:3200" # Tempo HTTP
+      - "4317:4317" # OTLP gRPC receiver (accessible from host)
+      - "4318:4318" # OTLP HTTP receiver (accessible from host)
+    networks:
+      - server
+
+  # Grafana - Visualization and dashboards
+  # Supports both Prometheus (metrics) and Tempo (tracing) datasources
+  # Default credentials: dynamo/dynamo
+  # To access Grafana from another machine, you may need to disable te firewall on your host. On Ubuntu:
+  # sudo ufw allow 3000/tcp
+  grafana:
+    image: grafana/grafana:12.2.0
+    container_name: grafana
+    volumes:
+      - grafana-data:/var/lib/grafana
+      - ./observability/grafana_dashboards:/etc/grafana/provisioning/dashboards
+      - ./observability/grafana-datasources.yml:/etc/grafana/provisioning/datasources/prometheus.yml
+      - ./observability/tempo-datasource.yml:/etc/grafana/provisioning/datasources/tempo.yml
+    environment:
+      - GF_SERVER_HTTP_PORT=3000
+      # do not make it admin/admin, because you will be prompted to change the password every time
+      - GF_SECURITY_ADMIN_USER=dynamo
+      - GF_SECURITY_ADMIN_PASSWORD=dynamo
+      - GF_USERS_ALLOW_SIGN_UP=false
+      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor
+      - GF_INSTALL_PLUGINS=grafana-piechart-panel
+      # Default min interval is 5s, but can be configured lower
+      - GF_DASHBOARDS_MIN_REFRESH_INTERVAL=2s
+      # Disable password change requirement
+      - GF_SECURITY_DISABLE_INITIAL_ADMIN_CREATION=false
+      - GF_SECURITY_ADMIN_PASSWORD_POLICY=false
+      - GF_AUTH_DISABLE_LOGIN_FORM=false
+      - GF_AUTH_DISABLE_SIGNOUT_MENU=false
+    restart: unless-stopped
+    ports:
+      - "3000:3000"
+    networks:
+      - server
+    depends_on:
+      - prometheus
+      - tempo
+
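
The new file joins an external network named `deploy_server`, which is why its header says `deploy/docker-compose.yml` must already be running. Docker Compose names a network `<project>_<network>` by default, and the project name defaults to the name of the directory holding the compose file. The sketch below reproduces that derivation and the resulting start order. It assumes no `-p` flag, `COMPOSE_PROJECT_NAME` variable, or top-level `name:` overrides the project name, and it only lowercases the directory name rather than applying Compose's full normalization. Both function names are hypothetical.

```python
from pathlib import Path
from typing import List


def default_network_name(compose_file: str, network: str) -> str:
    """Approximate Compose's default network name: <project>_<network>.

    Assumes the project name is the lowercased name of the directory that
    contains the compose file (no -p / COMPOSE_PROJECT_NAME override).
    """
    project = Path(compose_file).resolve().parent.name.lower()
    return f"{project}_{network}"


def startup_commands() -> List[List[str]]:
    """Commands in dependency order: infrastructure first, then observability."""
    return [
        ["docker", "compose", "-f", "deploy/docker-compose.yml", "up", "-d"],
        ["docker", "compose", "-f", "deploy/docker-observability.yml", "up", "-d"],
    ]


if __name__ == "__main__":
    print(default_network_name("deploy/docker-compose.yml", "server"))
    for cmd in startup_commands():
        print(" ".join(cmd))
```

If the infrastructure stack is started under a different project name, the external network name in `docker-observability.yml` would no longer match and would need adjusting.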

deploy/metrics/k8s/frontend-podmonitor.yaml

Lines changed: 0 additions & 25 deletions
This file was deleted.

deploy/metrics/k8s/planner-podmonitor.yaml

Lines changed: 0 additions & 20 deletions
This file was deleted.

deploy/metrics/k8s/worker-podmonitor.yaml

Lines changed: 0 additions & 20 deletions
This file was deleted.

deploy/observability/README.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# Dynamo Observability
+
+For detailed documentation on Observability (Prometheus metrics, tracing, and logging), please refer to [docs/observability/](../../docs/observability/).
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# Example Grafana Dashboards
+
+This directory contains example Grafana dashboards for Dynamo observability. These are starter files that you can use as references for building your own custom dashboards.
+
+- `dynamo.json` - General Dynamo dashboard showing software and hardware metrics
+- `dcgm-metrics.json` - GPU metrics dashboard using DCGM exporter data
+- `kvbm.json` - KV Block Manager metrics dashboard
+- `temp-loki.json` - Logging dashboard for Loki integration
+- `dashboard-providers.yml` - Configuration file for dashboard provisioning
+
+For setup instructions and usage, see [Observability Documentation](../../../docs/observability/).
