chore: unify 2 grafana's and add some docs around observability #3720
Conversation
Walkthrough
This PR adds OpenTelemetry tracing support across the sglang deployment infrastructure.

Estimated code review effort
🎯 2 (Simple) | ⏱️ ~12 minutes
The changes follow a consistent, repetitive pattern across shell scripts (flag parsing and conditional OTEL environment setup with service naming). Docker-compose and configuration additions are straightforward and homogeneous. Documentation updates are self-contained. While multiple files are affected, the homogeneous nature of changes across deployment scripts reduces cognitive load.

Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/backends/sglang/launch/multimodal_disagg.sh (1)
72-99: Inconsistent OTEL_SERVICE_NAME handling: remove the redundant export and standardize on inline assignment.
Line 72 exports OTEL_SERVICE_NAME=dynamo-frontend globally, but subsequent processes (lines 77, 80, 85, 99) override it with inline assignments. The export is unnecessary and creates confusion about which service each process actually belongs to. Standardize by using inline assignment for all processes:

```diff
 # run ingress
-export OTEL_SERVICE_NAME=dynamo-frontend
-python3 -m dynamo.frontend --http-port=8000 &
+OTEL_SERVICE_NAME=dynamo-frontend python3 -m dynamo.frontend --http-port=8000 &
```
♻️ Duplicate comments (8)
components/backends/sglang/launch/agg_router.sh (2)
29-36: Duplicate OTEL endpoint configuration; see the related comment in agg.sh.
Same hardcoded endpoint issue. Apply the same recommended refactor (use an environment variable with a default).

14-27: Duplicate OTEL parsing; see the related comment in agg.sh.
This block is identical to agg.sh (lines 14–27). Once you extract this into a shared script per the recommendation in agg.sh, apply the same refactor here.
components/backends/sglang/launch/agg_embed.sh (2)
29-36: Duplicate OTEL endpoint; apply the environment variable override as recommended in agg.sh.
Use `${OTEL_EXPORT_ENDPOINT:-http://localhost:4317}` for consistency and flexibility.

14-27: Duplicate OTEL parsing; extract into a shared script as noted in agg.sh.
Apply the same refactoring recommended for agg.sh to eliminate this duplication.
components/backends/sglang/launch/disagg_dp_attn.sh (2)
29-36: Duplicate OTEL endpoint; apply the environment variable override per the agg.sh recommendation.
Use `${OTEL_EXPORT_ENDPOINT:-http://localhost:4317}` for consistency.

14-27: Duplicate OTEL parsing; extract into a shared script per the agg.sh recommendation.
Apply the common refactoring approach described in agg.sh.
components/backends/sglang/launch/disagg_router.sh (2)
29-36: Duplicate OTEL endpoint; apply the environment variable override per the agg.sh recommendation.
Use `${OTEL_EXPORT_ENDPOINT:-http://localhost:4317}` for consistency.

14-27: Duplicate OTEL parsing; extract into a shared script per the agg.sh recommendation.
Apply the common refactoring approach; a sketch of such a shared helper follows this list.
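All eight duplicate comments above point at the same fix: pull the repeated `--enable-otel` parsing and OTEL environment setup into one shared helper that every launch script sources. A minimal sketch of what that could look like; the file name `otel_env.sh`, the function name, and the endpoint default are illustrative assumptions, not part of the PR:

```bash
#!/usr/bin/env bash
# otel_env.sh - hypothetical shared helper for the sglang launch scripts.

# Enable OTEL export when the caller passed --enable-otel.
setup_otel() {
    local enabled="${1:-false}"
    if [[ "$enabled" == "true" ]]; then
        # Default to the local collector, but let the environment override it.
        export OTEL_EXPORT_ENDPOINT="${OTEL_EXPORT_ENDPOINT:-http://localhost:4317}"
    fi
}
```

Each launch script would then `source` the helper and call `setup_otel "$ENABLE_OTEL"` instead of repeating the block.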
🧹 Nitpick comments (3)
components/backends/sglang/launch/agg.sh (1)
29-36: Make OTEL_EXPORT_ENDPOINT configurable via environment variable.
The OTEL endpoint is hardcoded as `http://localhost:4317`, which only works for localhost deployments. For flexibility in distributed or cloud environments, make it configurable:

```diff
-export OTEL_EXPORT_ENDPOINT=http://localhost:4317
+export OTEL_EXPORT_ENDPOINT=${OTEL_EXPORT_ENDPOINT:-http://localhost:4317}
```

This preserves the default for local setups while allowing override for other scenarios.
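With that refactor in place, callers could select the collector per environment; a brief usage sketch (the remote host name is hypothetical):

```bash
# Default: export traces to the local collector on localhost:4317.
./agg.sh --enable-otel

# Override for a remote collector.
OTEL_EXPORT_ENDPOINT=http://otel-collector.internal:4317 ./agg.sh --enable-otel
```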
components/backends/sglang/launch/disagg.sh (1)
14-27: Add help text and usage information to argument parsing.
Unlike similar scripts (multimodal_disagg.sh, multimodal_agg.sh), this script lacks a -h|--help option and provides no usage guidance when unknown options are encountered. Add help documentation to improve usability and consistency.

```diff
 # Parse command line arguments
 ENABLE_OTEL=false
 while [[ $# -gt 0 ]]; do
     case $1 in
         --enable-otel)
             ENABLE_OTEL=true
             shift
             ;;
+        -h|--help)
+            echo "Usage: $0 [OPTIONS]"
+            echo "Options:"
+            echo "  --enable-otel    Enable OpenTelemetry tracing"
+            echo "  -h, --help       Show this help message"
+            exit 0
+            ;;
         *)
             echo "Unknown option: $1"
+            echo "Use --help for usage information"
             exit 1
             ;;
     esac
 done
```

docs/observability/tracing.md (1)
99-111: Update the documentation to reference all launch scripts supporting --enable-otel.
Lines 101–108 mention agg.sh as an example, but the documentation should note that other launch scripts (e.g., disagg.sh, multimodal_agg.sh, multimodal_disagg.sh) similarly support the --enable-otel flag. This will help users discover the feature across the deployment ecosystem.

````diff
 Many launch scripts support the `--enable-otel` flag to automatically configure tracing. For example:

 ```bash
 # Navigate to SGLang launch directory
 cd components/backends/sglang/launch

-# Run aggregated deployment with tracing enabled
+# Run aggregated deployment with tracing enabled (also supported in disagg.sh, multimodal_agg.sh, etc.)
 ./agg.sh --enable-otel
 ```
````

📜 Review details

Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits
Reviewing files that changed from the base of the PR and between 66fd6f84ad7d10933aafd9fd1ef768447aa91b36 and b389db06bb42a57bbc0e186388f4c5e87f4cc416.

⛔ Files ignored due to path filters (1)
* `docs/images/trace.png` is excluded by `!**/*.png`

📒 Files selected for processing (14)
* `components/backends/sglang/launch/agg.sh` (1 hunk)
* `components/backends/sglang/launch/agg_embed.sh` (1 hunk)
* `components/backends/sglang/launch/agg_router.sh` (2 hunks)
* `components/backends/sglang/launch/disagg.sh` (3 hunks)
* `components/backends/sglang/launch/disagg_dp_attn.sh` (3 hunks)
* `components/backends/sglang/launch/disagg_router.sh` (6 hunks)
* `components/backends/sglang/launch/multimodal_agg.sh` (4 hunks)
* `components/backends/sglang/launch/multimodal_disagg.sh` (5 hunks)
* `deploy/docker-compose.yml` (3 hunks)
* `deploy/metrics/README.md` (4 hunks)
* `deploy/metrics/grafana-datasources.yml` (1 hunk)
* `deploy/tracing/docker-compose.yml` (0 hunks)
* `deploy/tracing/grafana/provisioning/datasources/tempo.yaml` (0 hunks)
* `docs/observability/tracing.md` (10 hunks)

💤 Files with no reviewable changes (2)
* deploy/tracing/docker-compose.yml
* deploy/tracing/grafana/provisioning/datasources/tempo.yaml

🧰 Additional context used

🪛 LanguageTool (docs/observability/tracing.md)
[uncategorized] ~179: If this is a compound adjective that modifies the following noun, use a hyphen. Context: ...on tracing environment variables at the top level and service-specific names in each comp... (EN_COMPOUND_ADJECTIVE_INTERNAL)

🪛 markdownlint-cli2 (0.18.1) (docs/observability/tracing.md)
147: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🪛 Shellcheck (0.11.0)
components/backends/sglang/launch/multimodal_agg.sh
[warning] 68: SGLANG_BACKEND_DIR appears unused. Verify use (or export if used externally). (SC2034)
components/backends/sglang/launch/multimodal_disagg.sh
[warning] 68: SGLANG_BACKEND_DIR appears unused. Verify use (or export if used externally). (SC2034)

⏰ Context from checks skipped due to timeout of 90000ms (8). You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).
* GitHub Check: vllm (amd64)
* GitHub Check: vllm (arm64)
* GitHub Check: sglang
* GitHub Check: trtllm (arm64)
* GitHub Check: operator (arm64)
* GitHub Check: trtllm (amd64)
* GitHub Check: operator (amd64)
* GitHub Check: Build and Test - dynamo

🔇 Additional comments (17)

components/backends/sglang/launch/agg.sh (1)
39-49: LGTM: OTEL service naming and tracing setup is correct.
The service naming strategy (dynamo-frontend and dynamo-worker) follows observability best practices, and the conditional OTEL setup ensures tracing is only enabled when explicitly requested.

components/backends/sglang/launch/agg_router.sh (2)
44-59: Per-worker OTEL service naming and CUDA device assignment look good.
The naming scheme (dynamo-worker-1, dynamo-worker-2) and CUDA_VISIBLE_DEVICES per worker are appropriate for multi-worker deployments.

8-9: Verify: cleanup trap may not track all spawned worker PIDs.
The cleanup trap only references `$DYNAMO_PID $WORKER_PID`, but two workers are spawned (lines 44 and 53). If only the last worker's PID is tracked, earlier workers may not be properly cleaned up. Confirm whether this is a pre-existing issue or if PID tracking should be updated to handle multiple workers (e.g., using process groups or tracking both worker PIDs).

components/backends/sglang/launch/agg_embed.sh (1)
44-51: Embedding-specific service naming is clear and appropriate.
The service name `dynamo-worker-embedding` provides good trace granularity for embedding workloads, and embedding-specific flags (--embedding-worker, --use-sglang-tokenizer) are correctly preserved.

components/backends/sglang/launch/disagg_dp_attn.sh (1)
44-70: OTEL service naming for the disaggregation architecture is well designed.
The prefill/decode distinction (dynamo-worker-prefill, dynamo-worker-decode) enables clear trace isolation between pipeline stages, which is valuable for performance analysis in disaggregated workloads.

deploy/docker-compose.yml (3)
48-60: Tempo service configuration looks good; verify tempo.yaml exists.
The Tempo service is properly configured with correct ports and profiles. The volume mount references `./tracing/tempo.yaml`; verify this file exists in the repository and contains valid Tempo configuration.

155-156: TraceQL editor toggle is appropriate for trace exploration.
Enabling `GF_FEATURE_TOGGLES_ENABLE=traceqlEditor` provides users with Grafana's query builder for Tempo traces, improving usability.

163-165: Grafana correctly depends on Tempo.
Adding Tempo to `depends_on` ensures proper startup order. This is necessary for Grafana datasource provisioning to find Tempo when the container starts.

components/backends/sglang/launch/disagg_router.sh (2)
39-104: Hierarchical service naming strategy is excellent for complex disaggregated pipelines.
The naming scheme (dynamo-frontend → dynamo-router-prefill → dynamo-worker-prefill-1/2 → dynamo-worker-decode-1/2) provides clear trace boundaries and enables precise performance analysis across pipeline stages.

8-9: Verify: cleanup trap may not capture all spawned worker PIDs.
The trap only references `$DYNAMO_PID $PREFILL_PID $PREFILL_ROUTER_PID`, but multiple prefill and decode workers are spawned. Confirm whether additional PIDs should be tracked, or if this is an acceptable architectural trade-off for this launcher.

deploy/metrics/grafana-datasources.yml (1)
25-44: Tempo datasource configuration is comprehensive and well configured.
The datasource includes all necessary options for trace exploration (service map, node graph, search, TraceQL). The service URL `http://tempo:3200` correctly resolves within the Docker Compose network.

deploy/metrics/README.md (3)
1-8: Documentation title and overview clearly integrate tracing with metrics.
The updated title "Observability with Prometheus, Tempo, and Grafana" and the new Tracing Guide reference provide clear context for users unfamiliar with the observability stack.

234-236: Tempo service documentation is clear and includes the necessary port/protocol info.
Users are provided with Tempo's HTTP endpoint (:3200) and OTLP receiver ports (gRPC :4317, HTTP :4318), enabling both exploration and integration.

267-267: Clear instructions for accessing traces via Grafana.
Users are directed to the Explore → Tempo datasource and given a reference to the detailed Tracing Guide, lowering the barrier to entry for trace exploration.

docs/observability/tracing.md (3)
32-44: Documentation clearly explains tracing environment variables and service naming.
The environment variables table (lines 38–43) and example configuration (lines 47–59) provide clear guidance on enabling tracing. Service naming conventions are well documented and align with the shell script implementations (`dynamo-frontend`, `dynamo-worker-prefill`, etc.).

177-223: Kubernetes deployment examples are clear and well structured.
The `DynamoGraphDeployment` YAML examples (lines 181–223) correctly demonstrate service-specific `OTEL_SERVICE_NAME` environment variables for `SglangDecodeWorker` and `SglangPrefillWorker`, with common tracing configuration at the top level. This aligns well with the docker-compose setup.

235-243: Unified observability section clearly explains the consolidated tracing stack.
The new "Unified Observability" section (lines 235–243) effectively communicates the shift from separate tracing and metrics stacks to a single Grafana instance. This reflects the PR objective to consolidate infrastructure and improves the user experience.
Referenced code in components/backends/sglang/launch/disagg.sh:

```bash
fi

# run ingress
export OTEL_SERVICE_NAME=dynamo-frontend
```
Standardize OTEL_SERVICE_NAME to inline assignment (no export).
Line 39 exports OTEL_SERVICE_NAME=dynamo-frontend, but subsequent workers (lines 44, 57) use inline assignment without export. Standardize by removing the export and using inline assignment for all processes, consistent with the pattern in multimodal_disagg.sh.
```diff
 # run ingress
-export OTEL_SERVICE_NAME=dynamo-frontend
-python3 -m dynamo.frontend --http-port=8000 &
+OTEL_SERVICE_NAME=dynamo-frontend python3 -m dynamo.frontend --http-port=8000 &
```

Also applies to: 44-44, 57-57
🤖 Prompt for AI Agents
In components/backends/sglang/launch/disagg.sh around lines 39, 44 and 57, the
OTEL_SERVICE_NAME is exported on line 39 but used as inline assignments (no
export) on lines 44 and 57; change the exported form to match the inline pattern
used elsewhere by removing the export and using plain inline assignment
(OTEL_SERVICE_NAME=...) for the process invocations so all three occurrences are
consistent with multimodal_disagg.sh.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
@coderabbitai - please trace through the code and provide a very succinct summary on why it was included in the first place and what happens if we remove it
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Check how OTEL_SERVICE_NAME is set across all launch scripts
echo "=== Checking OTEL_SERVICE_NAME usage patterns in all launch scripts ==="
rg -n "OTEL_SERVICE_NAME" components/backends/sglang/launch/ -A 1
echo ""
echo "=== Checking for 'export OTEL_SERVICE_NAME' specifically ==="
rg -n "export OTEL_SERVICE_NAME" components/backends/sglang/launch/
echo ""
echo "=== Checking inline OTEL_SERVICE_NAME assignment pattern ==="
rg -n "OTEL_SERVICE_NAME=\S+ python" components/backends/sglang/launch/
```
Length of output: 8893
@ishandhanani - The export was likely included initially to make OTEL_SERVICE_NAME available to the frontend process. However:
Why inline is better:
- `export` persists the variable in the shell environment for all subsequent commands
- Inline assignment (`OTEL_SERVICE_NAME=value python3 ...`) only sets it for that specific process
- Since each worker sets its own `OTEL_SERVICE_NAME`, there's no need for the frontend's to persist
What happens if we remove export (use inline):
- Frontend process still gets `OTEL_SERVICE_NAME=dynamo-frontend` via its environment
- Variable doesn't leak to subsequent commands or interfere with worker assignments
- More precise scope control
Current state:
The file already uses the inline pattern consistently (lines 39, 43, and 56), matching all other launch scripts in the codebase. The search confirms that zero instances of `export OTEL_SERVICE_NAME` exist anywhere, so you're already following the correct pattern. ✓
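A short, self-contained sketch of the scoping difference (using `bash -c` and `echo` as stand-ins for the real dynamo processes):

```bash
#!/usr/bin/env bash
# Inline assignment: the variable exists only in that one child's environment.
OTEL_SERVICE_NAME=dynamo-frontend bash -c 'echo "child sees: $OTEL_SERVICE_NAME"'
echo "shell still sees: '${OTEL_SERVICE_NAME:-<unset>}'"

# Exported assignment: the variable persists and is inherited by every
# subsequent command unless each one explicitly overrides it.
export OTEL_SERVICE_NAME=dynamo-frontend
bash -c 'echo "any later process inherits: $OTEL_SERVICE_NAME"'
```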
🐰
🧬 Code Graph Analysis Results
[Relevant code snippets]
- File: components/backends/sglang/launch/disagg_dp_attn.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: components/backends/sglang/launch/agg_embed.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID 2>/dev/null || true
wait $DYNAMO_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: components/backends/sglang/launch/agg.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID 2>/dev/null || true
wait $DYNAMO_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: components/backends/sglang/launch/disagg_router.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $PREFILL_PID $PREFILL_ROUTER_PID 2>/dev/null || true
wait $DYNAMO_PID $PREFILL_PID $PREFILL_ROUTER_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: components/backends/sglang/launch/multimodal_disagg.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: components/backends/sglang/launch/multimodal_agg.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: tests/serve/launch/sglang_agg.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: components/backends/sglang/launch/agg_router.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $WORKER_PID 2>/dev/null || true
wait $DYNAMO_PID $WORKER_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: (Python) components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py, lines 408-411
def cleanup(self):
self.engine.shutdown()
logger.info("Multimodal worker engine shutdown")
super().cleanup()
- File: (Python) components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py, lines 523-526
def cleanup(self):
self.engine.shutdown()
logger.info("Multimodal prefill engine shutdown")
super().cleanup()
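For the "cleanup trap may not track all spawned worker PIDs" concern raised above, one minimal alternative is to collect PIDs in an array rather than a fixed set of variables. A sketch only; the variable names and the `sleep` stand-ins are illustrative, not from the PR:

```bash
#!/usr/bin/env bash
WORKER_PIDS=()

cleanup() {
    echo "Cleaning up background processes..."
    kill "${WORKER_PIDS[@]}" 2>/dev/null || true
    wait "${WORKER_PIDS[@]}" 2>/dev/null || true
    echo "Cleanup complete."
}
trap cleanup EXIT INT TERM

# Stand-ins for the frontend and each worker; every spawned process is recorded.
sleep 30 & WORKER_PIDS+=($!)
sleep 30 & WORKER_PIDS+=($!)
sleep 30 & WORKER_PIDS+=($!)

wait
```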
keivenchang left a comment
- OTEL != Prometheus, but I see that if OTEL is enabled, Prometheus is also enabled. So perhaps the name should be --enable-all-metrics-and-logging or something to that effect. @nnshah1 what do you think?
- Do you plan to sync this to other frameworks as well, maybe in the next PR? It'd be weird if this only exists for sglang.
Referenced diff context from the launch script:

```bash
    esac
done

# Enable tracing if requested
```
How about renaming this to ENABLE_ALL_METRICS?
OTEL and Prometheus are separate systems. So enabling one doesn't necessarily mean enabling the other.
Good catch here.
I like --enable-metrics-and-logging
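If the flag is renamed along those lines, only the case label in the parse loop changes; a quick sketch (the final flag and variable names are still under discussion, so treat them as placeholders):

```bash
# Hypothetical rename of the CLI flag; the underlying behavior stays the same.
ENABLE_OBSERVABILITY=false
while [[ $# -gt 0 ]]; do
    case $1 in
        --enable-metrics-and-logging)   # proposed name, not final
            ENABLE_OBSERVABILITY=true
            shift
            ;;
        *)
            echo "Unknown option: $1"
            exit 1
            ;;
    esac
done
```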
Referenced diff context from the launch script:

```bash
    export DYN_SYSTEM_PORT=8081
fi

# run ingress
```
Note that the frontend doesn't currently use the DYN_SYSTEM_ variables, only the workers do.
I'll add a comment. If the frontend just ignores it, it's probably fine.
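A possible shape for that comment in the launch script (the wording is only a suggestion):

```bash
# NOTE: the DYN_SYSTEM_* variables (e.g. DYN_SYSTEM_PORT) are currently only
# read by the workers; the frontend ignores them, so setting them before the
# frontend launch is harmless but has no effect on that process.
```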
Closing this PR as the majority of the work has been completed by other PRs.

Grafana/Tempo Consolidation (main goal of this PR):

Related Infrastructure:

Remaining Unique Value:

Thanks for the consolidation work - it made the codebase cleaner!
Summary
Consolidates two separate Grafana instances (one for metrics, one for tracing) into a single unified observability stack. All metrics and traces are now accessible from one Grafana instance at `localhost:3001`.

Motivation

Previously, we had:

- a metrics Grafana at `localhost:3001` (in `deploy/docker-compose.yml`)
- a tracing Grafana at `localhost:3000` (in `deploy/tracing/docker-compose.yml`)

This made me sad 😢
Grafana natively supports multiple datasources, so consolidating into one instance provides a better user experience without sacrificing functionality.
Changes
- Tempo added to `deploy/docker-compose.yml` under the `metrics` profile, so the whole stack starts with `docker compose --profile metrics up -d`
- `docs/observability/tracing.md` updated with instructions for the unified setup
Start the observability stack:

```bash
cd deploy
docker compose --profile metrics up -d
```

Access Grafana:

- URL: http://localhost:3001
- Credentials: dynamo/dynamo

View metrics:

View traces:
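As an end-to-end illustration of the tracing path (this assumes the `--enable-otel` flag added by this PR and the default OTLP gRPC port 4317):

```bash
# With the unified stack running (see above), launch an sglang aggregated
# deployment with tracing enabled.
cd components/backends/sglang/launch
./agg.sh --enable-otel

# Traces are exported to the local OTLP endpoint (localhost:4317) and can be
# browsed in Grafana at http://localhost:3001 via Explore -> Tempo.
```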
Some more notes