chore: unify 2 grafana's and add some docs around observability #3720
Conversation
Walkthrough
This PR adds OpenTelemetry tracing support across the sglang deployment infrastructure.

Estimated code review effort
🎯 2 (Simple) | ⏱️ ~12 minutes
The changes follow a consistent, repetitive pattern across shell scripts (flag parsing and conditional OTEL environment setup with service naming). Docker-compose and configuration additions are straightforward and homogeneous. Documentation updates are self-contained. While multiple files are affected, the homogeneous nature of changes across deployment scripts reduces cognitive load.

Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/backends/sglang/launch/multimodal_disagg.sh (1)
72-99: Inconsistent OTEL_SERVICE_NAME handling: remove the redundant export and standardize on inline assignment.
Line 72 exports OTEL_SERVICE_NAME=dynamo-frontend globally, but subsequent processes (lines 77, 80, 85, 99) override it with inline assignments. The export is unnecessary and creates confusion about which service each process actually belongs to. Standardize by using inline assignment for all processes:

```diff
 # run ingress
-export OTEL_SERVICE_NAME=dynamo-frontend
-python3 -m dynamo.frontend --http-port=8000 &
+OTEL_SERVICE_NAME=dynamo-frontend python3 -m dynamo.frontend --http-port=8000 &
```
♻️ Duplicate comments (8)
components/backends/sglang/launch/agg_router.sh (2)
29-36: Duplicate OTEL endpoint configuration; see the related comment in agg.sh.
Same hardcoded endpoint issue. Apply the same recommended refactor (use an environment variable with a default).

14-27: Duplicate OTEL parsing; see the related comment in agg.sh.
This block is identical to agg.sh (lines 14–27). Once you extract this into a shared script per the recommendation in agg.sh, apply the same refactor here.
components/backends/sglang/launch/agg_embed.sh (2)
29-36: Duplicate OTEL endpoint; apply the environment variable override as recommended in agg.sh.
Use `${OTEL_EXPORT_ENDPOINT:-http://localhost:4317}` for consistency and flexibility.

14-27: Duplicate OTEL parsing; extract into a shared script as noted in agg.sh.
Apply the same refactoring recommended for agg.sh to eliminate this duplication.
components/backends/sglang/launch/disagg_dp_attn.sh (2)
29-36: Duplicate OTEL endpoint; apply the environment variable override per the agg.sh recommendation.
Use `${OTEL_EXPORT_ENDPOINT:-http://localhost:4317}` for consistency.

14-27: Duplicate OTEL parsing; extract into a shared script per the agg.sh recommendation.
Apply the common refactoring approach described in agg.sh.
components/backends/sglang/launch/disagg_router.sh (2)
29-36: Duplicate OTEL endpoint; apply the environment variable override per the agg.sh recommendation.
Use `${OTEL_EXPORT_ENDPOINT:-http://localhost:4317}` for consistency.

14-27: Duplicate OTEL parsing; extract into a shared script per the agg.sh recommendation.
Apply the common refactoring approach; a sketch of such a shared helper follows this list.
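All eight duplicate comments above point at the same fix: pull the repeated `--enable-otel` parsing and OTEL environment setup into one shared helper that every launch script sources. A minimal sketch of what that could look like; the file name `otel_env.sh`, the function name, and the endpoint default are illustrative assumptions, not part of the PR:

```bash
#!/usr/bin/env bash
# otel_env.sh - hypothetical shared helper for the sglang launch scripts.

# Enable OTEL export when the caller passed --enable-otel.
setup_otel() {
    local enabled="${1:-false}"
    if [[ "$enabled" == "true" ]]; then
        # Default to the local collector, but let the environment override it.
        export OTEL_EXPORT_ENDPOINT="${OTEL_EXPORT_ENDPOINT:-http://localhost:4317}"
    fi
}
```

Each launch script would then `source` the helper and call `setup_otel "$ENABLE_OTEL"` instead of repeating the block.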
🧹 Nitpick comments (3)
components/backends/sglang/launch/agg.sh (1)
29-36: Make OTEL_EXPORT_ENDPOINT configurable via environment variable.
The OTEL endpoint is hardcoded as `http://localhost:4317`, which only works for localhost deployments. For flexibility in distributed or cloud environments, make it configurable:

```diff
-export OTEL_EXPORT_ENDPOINT=http://localhost:4317
+export OTEL_EXPORT_ENDPOINT=${OTEL_EXPORT_ENDPOINT:-http://localhost:4317}
```

This preserves the default for local setups while allowing override for other scenarios.
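With that refactor in place, callers could select the collector per environment; a brief usage sketch (the remote host name is hypothetical):

```bash
# Default: export traces to the local collector on localhost:4317.
./agg.sh --enable-otel

# Override for a remote collector.
OTEL_EXPORT_ENDPOINT=http://otel-collector.internal:4317 ./agg.sh --enable-otel
```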
components/backends/sglang/launch/disagg.sh (1)
14-27: Add help text and usage information to argument parsing.
Unlike similar scripts (multimodal_disagg.sh, multimodal_agg.sh), this script lacks a -h|--help option and provides no usage guidance when unknown options are encountered. Add help documentation to improve usability and consistency.

```diff
 # Parse command line arguments
 ENABLE_OTEL=false
 while [[ $# -gt 0 ]]; do
     case $1 in
         --enable-otel)
             ENABLE_OTEL=true
             shift
             ;;
+        -h|--help)
+            echo "Usage: $0 [OPTIONS]"
+            echo "Options:"
+            echo "  --enable-otel    Enable OpenTelemetry tracing"
+            echo "  -h, --help       Show this help message"
+            exit 0
+            ;;
         *)
             echo "Unknown option: $1"
+            echo "Use --help for usage information"
             exit 1
             ;;
     esac
 done
```

docs/observability/tracing.md (1)
99-111: Update the documentation to reference all launch scripts supporting --enable-otel.
Lines 101–108 mention agg.sh as an example, but the documentation should note that other launch scripts (e.g., disagg.sh, multimodal_agg.sh, multimodal_disagg.sh) similarly support the --enable-otel flag. This will help users discover the feature across the deployment ecosystem.

````diff
 Many launch scripts support the `--enable-otel` flag to automatically configure tracing. For example:

 ```bash
 # Navigate to SGLang launch directory
 cd components/backends/sglang/launch

-# Run aggregated deployment with tracing enabled
+# Run aggregated deployment with tracing enabled (also supported in disagg.sh, multimodal_agg.sh, etc.)
 ./agg.sh --enable-otel
 ```
````

📜 Review details

Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits
Reviewing files that changed from the base of the PR and between 66fd6f84ad7d10933aafd9fd1ef768447aa91b36 and b389db06bb42a57bbc0e186388f4c5e87f4cc416.

⛔ Files ignored due to path filters (1)
* `docs/images/trace.png` is excluded by `!**/*.png`

📒 Files selected for processing (14)
* `components/backends/sglang/launch/agg.sh` (1 hunk)
* `components/backends/sglang/launch/agg_embed.sh` (1 hunk)
* `components/backends/sglang/launch/agg_router.sh` (2 hunks)
* `components/backends/sglang/launch/disagg.sh` (3 hunks)
* `components/backends/sglang/launch/disagg_dp_attn.sh` (3 hunks)
* `components/backends/sglang/launch/disagg_router.sh` (6 hunks)
* `components/backends/sglang/launch/multimodal_agg.sh` (4 hunks)
* `components/backends/sglang/launch/multimodal_disagg.sh` (5 hunks)
* `deploy/docker-compose.yml` (3 hunks)
* `deploy/metrics/README.md` (4 hunks)
* `deploy/metrics/grafana-datasources.yml` (1 hunk)
* `deploy/tracing/docker-compose.yml` (0 hunks)
* `deploy/tracing/grafana/provisioning/datasources/tempo.yaml` (0 hunks)
* `docs/observability/tracing.md` (10 hunks)

💤 Files with no reviewable changes (2)
* deploy/tracing/docker-compose.yml
* deploy/tracing/grafana/provisioning/datasources/tempo.yaml

🧰 Additional context used

🪛 LanguageTool (docs/observability/tracing.md)
[uncategorized] ~179: If this is a compound adjective that modifies the following noun, use a hyphen. Context: ...on tracing environment variables at the top level and service-specific names in each comp... (EN_COMPOUND_ADJECTIVE_INTERNAL)

🪛 markdownlint-cli2 (0.18.1) (docs/observability/tracing.md)
147: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🪛 Shellcheck (0.11.0)
components/backends/sglang/launch/multimodal_agg.sh
[warning] 68: SGLANG_BACKEND_DIR appears unused. Verify use (or export if used externally). (SC2034)
components/backends/sglang/launch/multimodal_disagg.sh
[warning] 68: SGLANG_BACKEND_DIR appears unused. Verify use (or export if used externally). (SC2034)

⏰ Context from checks skipped due to timeout of 90000ms (8). You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).
* GitHub Check: vllm (amd64)
* GitHub Check: vllm (arm64)
* GitHub Check: sglang
* GitHub Check: trtllm (arm64)
* GitHub Check: operator (arm64)
* GitHub Check: trtllm (amd64)
* GitHub Check: operator (amd64)
* GitHub Check: Build and Test - dynamo

🔇 Additional comments (17)

components/backends/sglang/launch/agg.sh (1)
39-49: LGTM: OTEL service naming and tracing setup is correct.
The service naming strategy (dynamo-frontend and dynamo-worker) follows observability best practices, and the conditional OTEL setup ensures tracing is only enabled when explicitly requested.

components/backends/sglang/launch/agg_router.sh (2)
44-59: Per-worker OTEL service naming and CUDA device assignment look good.
The naming scheme (dynamo-worker-1, dynamo-worker-2) and CUDA_VISIBLE_DEVICES per worker are appropriate for multi-worker deployments.

8-9: Verify: cleanup trap may not track all spawned worker PIDs.
The cleanup trap only references `$DYNAMO_PID $WORKER_PID`, but two workers are spawned (lines 44 and 53). If only the last worker's PID is tracked, earlier workers may not be properly cleaned up. Confirm whether this is a pre-existing issue or if PID tracking should be updated to handle multiple workers (e.g., using process groups or tracking both worker PIDs).

components/backends/sglang/launch/agg_embed.sh (1)
44-51: Embedding-specific service naming is clear and appropriate.
The service name `dynamo-worker-embedding` provides good trace granularity for embedding workloads, and embedding-specific flags (--embedding-worker, --use-sglang-tokenizer) are correctly preserved.

components/backends/sglang/launch/disagg_dp_attn.sh (1)
44-70: OTEL service naming for the disaggregation architecture is well designed.
The prefill/decode distinction (dynamo-worker-prefill, dynamo-worker-decode) enables clear trace isolation between pipeline stages, which is valuable for performance analysis in disaggregated workloads.

deploy/docker-compose.yml (3)
48-60: Tempo service configuration looks good; verify tempo.yaml exists.
The Tempo service is properly configured with correct ports and profiles. The volume mount references `./tracing/tempo.yaml`; verify this file exists in the repository and contains valid Tempo configuration.

155-156: TraceQL editor toggle is appropriate for trace exploration.
Enabling `GF_FEATURE_TOGGLES_ENABLE=traceqlEditor` provides users with Grafana's query builder for Tempo traces, improving usability.

163-165: Grafana correctly depends on Tempo.
Adding Tempo to `depends_on` ensures proper startup order. This is necessary for Grafana datasource provisioning to find Tempo when the container starts.

components/backends/sglang/launch/disagg_router.sh (2)
39-104: Hierarchical service naming strategy is excellent for complex disaggregated pipelines.
The naming scheme (dynamo-frontend → dynamo-router-prefill → dynamo-worker-prefill-1/2 → dynamo-worker-decode-1/2) provides clear trace boundaries and enables precise performance analysis across pipeline stages.

8-9: Verify: cleanup trap may not capture all spawned worker PIDs.
The trap only references `$DYNAMO_PID $PREFILL_PID $PREFILL_ROUTER_PID`, but multiple prefill and decode workers are spawned. Confirm whether additional PIDs should be tracked, or if this is an acceptable architectural trade-off for this launcher.

deploy/metrics/grafana-datasources.yml (1)
25-44: Tempo datasource configuration is comprehensive and well configured.
The datasource includes all necessary options for trace exploration (service map, node graph, search, TraceQL). The service URL `http://tempo:3200` correctly resolves within the Docker Compose network.

deploy/metrics/README.md (3)
1-8: Documentation title and overview clearly integrate tracing with metrics.
The updated title "Observability with Prometheus, Tempo, and Grafana" and the new Tracing Guide reference provide clear context for users unfamiliar with the observability stack.

234-236: Tempo service documentation is clear and includes the necessary port/protocol info.
Users are provided with Tempo's HTTP endpoint (:3200) and OTLP receiver ports (gRPC :4317, HTTP :4318), enabling both exploration and integration.

267-267: Clear instructions for accessing traces via Grafana.
Users are directed to the Explore → Tempo datasource and given a reference to the detailed Tracing Guide, lowering the barrier to entry for trace exploration.

docs/observability/tracing.md (3)
32-44: Documentation clearly explains tracing environment variables and service naming.
The environment variables table (lines 38–43) and example configuration (lines 47–59) provide clear guidance on enabling tracing. Service naming conventions are well documented and align with the shell script implementations (`dynamo-frontend`, `dynamo-worker-prefill`, etc.).

177-223: Kubernetes deployment examples are clear and well structured.
The `DynamoGraphDeployment` YAML examples (lines 181–223) correctly demonstrate service-specific `OTEL_SERVICE_NAME` environment variables for `SglangDecodeWorker` and `SglangPrefillWorker`, with common tracing configuration at the top level. This aligns well with the docker-compose setup.

235-243: Unified observability section clearly explains the consolidated tracing stack.
The new "Unified Observability" section (lines 235–243) effectively communicates the shift from separate tracing and metrics stacks to a single Grafana instance. This reflects the PR objective to consolidate infrastructure and improves the user experience.
Referenced code in components/backends/sglang/launch/disagg.sh:

```bash
fi

# run ingress
export OTEL_SERVICE_NAME=dynamo-frontend
```
Standardize OTEL_SERVICE_NAME to inline assignment (no export).
Line 39 exports OTEL_SERVICE_NAME=dynamo-frontend, but subsequent workers (lines 44, 57) use inline assignment without export. Standardize by removing the export and using inline assignment for all processes, consistent with the pattern in multimodal_disagg.sh.
```diff
 # run ingress
-export OTEL_SERVICE_NAME=dynamo-frontend
-python3 -m dynamo.frontend --http-port=8000 &
+OTEL_SERVICE_NAME=dynamo-frontend python3 -m dynamo.frontend --http-port=8000 &
```

Also applies to: 44-44, 57-57
🤖 Prompt for AI Agents
In components/backends/sglang/launch/disagg.sh around lines 39, 44 and 57, the
OTEL_SERVICE_NAME is exported on line 39 but used as inline assignments (no
export) on lines 44 and 57; change the exported form to match the inline pattern
used elsewhere by removing the export and using plain inline assignment
(OTEL_SERVICE_NAME=...) for the process invocations so all three occurrences are
consistent with multimodal_disagg.sh.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
@coderabbitai - please trace through the code and provide a very succinct summary on why it was included in the first place and what happens if we remove it
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Check how OTEL_SERVICE_NAME is set across all launch scripts
echo "=== Checking OTEL_SERVICE_NAME usage patterns in all launch scripts ==="
rg -n "OTEL_SERVICE_NAME" components/backends/sglang/launch/ -A 1
echo ""
echo "=== Checking for 'export OTEL_SERVICE_NAME' specifically ==="
rg -n "export OTEL_SERVICE_NAME" components/backends/sglang/launch/
echo ""
echo "=== Checking inline OTEL_SERVICE_NAME assignment pattern ==="
rg -n "OTEL_SERVICE_NAME=\S+ python" components/backends/sglang/launch/
```
Length of output: 8893
@ishandhanani - The export was likely included initially to make OTEL_SERVICE_NAME available to the frontend process. However:
Why inline is better:
- `export` persists the variable in the shell environment for all subsequent commands
- Inline assignment (`OTEL_SERVICE_NAME=value python3 ...`) only sets it for that specific process
- Since each worker sets its own `OTEL_SERVICE_NAME`, there's no need for the frontend's to persist
What happens if we remove export (use inline):
- Frontend process still gets `OTEL_SERVICE_NAME=dynamo-frontend` via its environment
- Variable doesn't leak to subsequent commands or interfere with worker assignments
- More precise scope control
Current state:
The file already uses the inline pattern consistently (lines 39, 43, and 56), matching all other launch scripts in the codebase. The search confirms that zero instances of `export OTEL_SERVICE_NAME` exist anywhere, so you're already following the correct pattern. ✓
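A short, self-contained sketch of the scoping difference (using `bash -c` and `echo` as stand-ins for the real dynamo processes):

```bash
#!/usr/bin/env bash
# Inline assignment: the variable exists only in that one child's environment.
OTEL_SERVICE_NAME=dynamo-frontend bash -c 'echo "child sees: $OTEL_SERVICE_NAME"'
echo "shell still sees: '${OTEL_SERVICE_NAME:-<unset>}'"

# Exported assignment: the variable persists and is inherited by every
# subsequent command unless each one explicitly overrides it.
export OTEL_SERVICE_NAME=dynamo-frontend
bash -c 'echo "any later process inherits: $OTEL_SERVICE_NAME"'
```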
🐰
🧬 Code Graph Analysis Results
[Relevant code snippets]
- File: components/backends/sglang/launch/disagg_dp_attn.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: components/backends/sglang/launch/agg_embed.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID 2>/dev/null || true
wait $DYNAMO_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: components/backends/sglang/launch/agg.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID 2>/dev/null || true
wait $DYNAMO_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: components/backends/sglang/launch/disagg_router.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $PREFILL_PID $PREFILL_ROUTER_PID 2>/dev/null || true
wait $DYNAMO_PID $PREFILL_PID $PREFILL_ROUTER_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: components/backends/sglang/launch/multimodal_disagg.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: components/backends/sglang/launch/multimodal_agg.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: tests/serve/launch/sglang_agg.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: components/backends/sglang/launch/agg_router.sh, lines 6-11
cleanup() {
echo "Cleaning up background processes..."
kill $DYNAMO_PID $WORKER_PID 2>/dev/null || true
wait $DYNAMO_PID $WORKER_PID 2>/dev/null || true
echo "Cleanup complete."
}
- File: (Python) components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py, lines 408-411
def cleanup(self):
self.engine.shutdown()
logger.info("Multimodal worker engine shutdown")
super().cleanup()
- File: (Python) components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py, lines 523-526
def cleanup(self):
self.engine.shutdown()
logger.info("Multimodal prefill engine shutdown")
super().cleanup()
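For the "cleanup trap may not track all spawned worker PIDs" concern raised above, one minimal alternative is to collect PIDs in an array rather than a fixed set of variables. A sketch only; the variable names and the `sleep` stand-ins are illustrative, not from the PR:

```bash
#!/usr/bin/env bash
WORKER_PIDS=()

cleanup() {
    echo "Cleaning up background processes..."
    kill "${WORKER_PIDS[@]}" 2>/dev/null || true
    wait "${WORKER_PIDS[@]}" 2>/dev/null || true
    echo "Cleanup complete."
}
trap cleanup EXIT INT TERM

# Stand-ins for the frontend and each worker; every spawned process is recorded.
sleep 30 & WORKER_PIDS+=($!)
sleep 30 & WORKER_PIDS+=($!)
sleep 30 & WORKER_PIDS+=($!)

wait
```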
keivenchang left a comment
- OTEL != Prometheus, but I see that if OTEL is enabled, Prometheus is also enabled. So perhaps the name should be --enable-all-metrics-and-logging or something to that effect. @nnshah1 what do you think?
- Do you plan to sync this to other frameworks as well, maybe in the next PR? It'd be weird if this only exists for sglang.
Referenced diff context from the launch script:

```bash
    esac
done

# Enable tracing if requested
```
How about renaming this to ENABLE_ALL_METRICS?
OTEL and Prometheus are separate systems. So enabling one doesn't necessarily mean enabling the other.
Good catch here.
I like --enable-metrics-and-logging
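If the flag is renamed along those lines, only the case label in the parse loop changes; a quick sketch (the final flag and variable names are still under discussion, so treat them as placeholders):

```bash
# Hypothetical rename of the CLI flag; the underlying behavior stays the same.
ENABLE_OBSERVABILITY=false
while [[ $# -gt 0 ]]; do
    case $1 in
        --enable-metrics-and-logging)   # proposed name, not final
            ENABLE_OBSERVABILITY=true
            shift
            ;;
        *)
            echo "Unknown option: $1"
            exit 1
            ;;
    esac
done
```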
Referenced diff context from the launch script:

```bash
    export DYN_SYSTEM_PORT=8081
fi

# run ingress
```
Note that the frontend doesn't currently use the DYN_SYSTEM_ variables, only the workers do.
I'll add a comment. If the frontend just ignores it, it's probably fine.
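A possible shape for that comment in the launch script (the wording is only a suggestion):

```bash
# NOTE: the DYN_SYSTEM_* variables (e.g. DYN_SYSTEM_PORT) are currently only
# read by the workers; the frontend ignores them, so setting them before the
# frontend launch is harmless but has no effect on that process.
```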
Closing this PR as the majority of the work has been completed by other PRs.

Grafana/Tempo Consolidation (main goal of this PR):

Related Infrastructure:

Remaining Unique Value:

Thanks for the consolidation work - it made the codebase cleaner!
Summary
Consolidates two separate Grafana instances (one for metrics, one for tracing) into a single unified observability stack. All metrics and traces are now accessible from one Grafana instance at `localhost:3001`.

Motivation

Previously, we had:

- a metrics Grafana at `localhost:3001` (in `deploy/docker-compose.yml`)
- a tracing Grafana at `localhost:3000` (in `deploy/tracing/docker-compose.yml`)

This made me sad 😢
Grafana natively supports multiple datasources, so consolidating into one instance provides a better user experience without sacrificing functionality.
Changes
- Tempo added to `deploy/docker-compose.yml` under the `metrics` profile, so the whole stack starts with `docker compose --profile metrics up -d`
- `docs/observability/tracing.md` updated with instructions for the unified setup
Start the observability stack:

```bash
cd deploy
docker compose --profile metrics up -d
```

Access Grafana:

- URL: http://localhost:3001
- Credentials: dynamo/dynamo

View metrics:

View traces:
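As an end-to-end illustration of the tracing path (this assumes the `--enable-otel` flag added by this PR and the default OTLP gRPC port 4317):

```bash
# With the unified stack running (see above), launch an sglang aggregated
# deployment with tracing enabled.
cd components/backends/sglang/launch
./agg.sh --enable-otel

# Traces are exported to the local OTLP endpoint (localhost:4317) and can be
# browsed in Grafana at http://localhost:3001 via Explore -> Tempo.
```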
Some more notes