
Conversation


@ishandhanani ishandhanani commented Oct 17, 2025

Summary

Consolidates two separate Grafana instances (one for metrics, one for tracing) into a single unified observability stack. All metrics and traces are now accessible from one Grafana instance at localhost:3001.

Motivation

Previously, we had:

  • Metrics Grafana at localhost:3001 (in deploy/docker-compose.yml)
  • Tracing Grafana at localhost:3000 (in deploy/tracing/docker-compose.yml)

This made me sad 😢

Grafana natively supports multiple datasources, so consolidating into one instance provides a better user experience without sacrificing functionality.

Changes

  • Added Tempo service to main deploy/docker-compose.yml under the metrics profile
  • Removed ephemeral volume for Tempo (traces are cleared on restart for dev use)
  • All observability services now start with: docker compose --profile metrics up -d
  • Created docs/observability/tracing.md with updated instructions for unified setup

Usage

Start the observability stack:

cd deploy
docker compose --profile metrics up -d
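
A quick sanity check after startup (a sketch; the `/ready` path assumes Tempo's default readiness endpoint on its HTTP port):

```bash
# List the services started under the metrics profile
docker compose --profile metrics ps

# Probe Tempo's readiness endpoint (HTTP port 3200 from the compose file)
curl -s http://localhost:3200/ready
```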

Access Grafana:

  • URL: http://localhost:3001
  • Login: dynamo / dynamo

View metrics:

  • Navigate to Dashboards → select a dashboard
  • Datasource: Prometheus (default)

View traces:

  • Navigate to Explore (compass icon)
  • Select Tempo from datasource dropdown
  • Use Search tab or TraceQL queries
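
For example, a TraceQL query such as `{ resource.service.name = "dynamo-frontend" }` (assuming the service names set by the launch scripts) narrows the results to frontend spans only.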

Some more notes

  1. I am not enjoying Grafana's trace viewer (though I am definitely spoiled by my previous Datadog experience)
  2. We need to figure out how to make it easier to view traces and metrics. Right now I have to go to Explore > traces, which is not intuitive.

@ishandhanani ishandhanani requested a review from a team as a code owner October 17, 2025 22:41
@github-actions github-actions bot added the chore label Oct 17, 2025

coderabbitai bot commented Oct 17, 2025

Walkthrough

This PR adds OpenTelemetry tracing support across the sglang deployment infrastructure. It introduces a --enable-otel CLI flag to launch scripts, integrates Tempo as a distributed tracing backend in docker-compose, configures Grafana datasources for trace visualization, and consolidates the tracing stack into the main deployment setup.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Shell Scripts - OTEL Flag Implementation**: `components/backends/sglang/launch/agg.sh`, `agg_embed.sh`, `agg_router.sh`, `disagg.sh`, `disagg_dp_attn.sh`, `disagg_router.sh`, `multimodal_agg.sh`, `multimodal_disagg.sh` | Added `--enable-otel` command-line flag parsing. When enabled, exports OTEL environment variables (`DYN_LOGGING_JSONL`, `OTEL_EXPORT_ENABLED`, `OTEL_EXPORT_ENDPOINT`, `DYN_SYSTEM_ENABLED`, `DYN_SYSTEM_PORT`) and prefixes launched processes with a service-specific `OTEL_SERVICE_NAME` for distributed tracing context. Behavior unchanged when the flag is absent (see the sketch below this table). |
| **Docker Compose Infrastructure**: `deploy/docker-compose.yml` | Added Tempo service (`grafana/tempo:2.8.2`) for distributed tracing with ports 3200, 4317, 4318. Updated Grafana with TraceQL editor toggle and dependency on Tempo. |
| **Docker Tracing Stack Consolidation**: `deploy/tracing/docker-compose.yml`, `deploy/tracing/grafana/provisioning/datasources/tempo.yaml` | Removed dedicated tracing docker-compose and Tempo datasource provisioning file; functionality consolidated into main `deploy/docker-compose.yml`. |
| **Observability Configuration**: `deploy/metrics/grafana-datasources.yml` | Added Tempo datasource entry with TraceQL support, service map, and span time shift configuration. |
| **Observability Documentation**: `deploy/metrics/README.md` | Updated scope to include Tempo and distributed tracing; expanded sections to reference Tempo integration and trace visualization guidance. |
| **Tracing Documentation**: `docs/observability/tracing.md` | Comprehensive updates: changed endpoint references to localhost, updated deployment commands to docker-compose syntax, revised deployment examples from vLLM to sglang, added instructions for enabling tracing via `--enable-otel` flag, updated Grafana trace viewing steps and TraceQL filtering guidance. |
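
For reference, the per-script pattern described in the first row reduces to roughly the following sketch (the exact values of `DYN_LOGGING_JSONL`, `OTEL_EXPORT_ENABLED`, and `DYN_SYSTEM_ENABLED` are assumptions, and only the frontend launch line is shown):

```bash
# Parse the new flag (same pattern across the launch scripts)
ENABLE_OTEL=false
while [[ $# -gt 0 ]]; do
    case $1 in
        --enable-otel)
            ENABLE_OTEL=true
            shift
            ;;
        *)
            echo "Unknown option: $1"
            exit 1
            ;;
    esac
done

# Conditionally export the tracing/system environment (values here are illustrative)
if [ "$ENABLE_OTEL" = true ]; then
    export DYN_LOGGING_JSONL=true               # assumed value
    export OTEL_EXPORT_ENABLED=1                # assumed value
    export OTEL_EXPORT_ENDPOINT=http://localhost:4317
    export DYN_SYSTEM_ENABLED=true              # assumed value
    export DYN_SYSTEM_PORT=8081
fi

# Each launched process gets a service-specific OTEL_SERVICE_NAME via inline assignment
OTEL_SERVICE_NAME=dynamo-frontend python3 -m dynamo.frontend --http-port=8000 &
```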

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

The changes follow a consistent, repetitive pattern across shell scripts (flag parsing and conditional OTEL environment setup with service naming). Docker-compose and configuration additions are straightforward and homogeneous. Documentation updates are self-contained. While multiple files are affected, the homogeneous nature of changes across deployment scripts reduces cognitive load.

Poem

🐰 Hops of tracing, springs of sight,
Tempo keeps the traces bright,
OTEL flags leap through the land,
Service names from my pawprint hand,
Distributed dreams now unified—
Observability nationwide! 🔍✨

Pre-merge checks

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description Check | ⚠️ Warning | The PR description is entirely empty—all required sections from the template contain only placeholder comments and no actual content. The Overview, Details, and "Where should the reviewer start?" sections are blank, and the Related Issues section lists only a generic "#xxx" placeholder rather than referencing an actual issue. This provides reviewers with no information about the purpose, scope, or rationale for the changes made across the eight launch scripts, docker-compose configurations, and documentation files. | The author should fill in all required sections with substantive content: a clear overview of the unification and documentation improvements, specific details about each category of changes (launch scripts, docker-compose, documentation), guidance on which files reviewers should prioritize, and a reference to the actual GitHub issue being resolved. Without this information, reviewers cannot efficiently assess the PR's correctness and impact. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title "chore: unify 2 grafana's and add some docs around observability" clearly captures the main objectives of this pull request. The changes consolidate the observability infrastructure by integrating Tempo into the main docker-compose configuration, removing the separate tracing stack, and updating documentation to reflect this unified approach. The title accurately reflects these consolidation efforts and documentation updates without being vague or overly broad. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/backends/sglang/launch/multimodal_disagg.sh (1)

72-99: Inconsistent OTEL_SERVICE_NAME handling—remove redundant export and standardize to inline assignment.

Line 72 exports OTEL_SERVICE_NAME=dynamo-frontend globally, but subsequent processes (lines 77, 80, 85, 99) override it with inline assignment. The export is unnecessary and creates confusion about which service each process actually belongs to. Standardize by using inline assignment for all processes.

```diff
 # run ingress
-export OTEL_SERVICE_NAME=dynamo-frontend
-python3 -m dynamo.frontend --http-port=8000 &
+OTEL_SERVICE_NAME=dynamo-frontend python3 -m dynamo.frontend --http-port=8000 &
```
♻️ Duplicate comments (8)
components/backends/sglang/launch/agg_router.sh (2)

29-36: Duplicate OTEL endpoint configuration—see related comment in agg.sh.

Same hardcoded endpoint issue. Apply the same recommended refactor (use environment variable with default).


14-27: Duplicate OTEL parsing—see related comment in agg.sh.

This block is identical to agg.sh (lines 14–27). Once you extract this into a shared script per the recommendation in agg.sh, apply the same refactor here.

components/backends/sglang/launch/agg_embed.sh (2)

29-36: Duplicate OTEL endpoint—apply environment variable override as recommended in agg.sh.

Use ${OTEL_EXPORT_ENDPOINT:-http://localhost:4317} for consistency and flexibility.


14-27: Duplicate OTEL parsing—extract into shared script as noted in agg.sh.

Apply the same refactoring recommended for agg.sh to eliminate this duplication.

components/backends/sglang/launch/disagg_dp_attn.sh (2)

29-36: Duplicate OTEL endpoint—apply environment variable override per agg.sh recommendation.

Use ${OTEL_EXPORT_ENDPOINT:-http://localhost:4317} for consistency.


14-27: Duplicate OTEL parsing—extract into shared script per agg.sh recommendation.

Apply the common refactoring approach described in agg.sh.

components/backends/sglang/launch/disagg_router.sh (2)

29-36: Duplicate OTEL endpoint—apply environment variable override per agg.sh recommendation.

Use ${OTEL_EXPORT_ENDPOINT:-http://localhost:4317} for consistency.


14-27: Duplicate OTEL parsing—extract into shared script per agg.sh recommendation.

Apply the common refactoring approach.

🧹 Nitpick comments (3)
components/backends/sglang/launch/agg.sh (1)

29-36: Make OTEL_EXPORT_ENDPOINT configurable via environment variable.

The OTEL endpoint is hardcoded as http://localhost:4317, which only works for localhost deployments. For flexibility in distributed or cloud environments, make it configurable:

```diff
- export OTEL_EXPORT_ENDPOINT=http://localhost:4317
+ export OTEL_EXPORT_ENDPOINT=${OTEL_EXPORT_ENDPOINT:-http://localhost:4317}
```

This preserves the default for local setups while allowing override for other scenarios.

components/backends/sglang/launch/disagg.sh (1)

14-27: Add help text and usage information to argument parsing.

Unlike similar scripts (multimodal_disagg.sh, multimodal_agg.sh), this script lacks a -h|--help option and provides no usage guidance when unknown options are encountered. Add help documentation to improve usability and consistency.

```diff
 # Parse command line arguments
 ENABLE_OTEL=false
 while [[ $# -gt 0 ]]; do
     case $1 in
         --enable-otel)
             ENABLE_OTEL=true
             shift
             ;;
+        -h|--help)
+            echo "Usage: $0 [OPTIONS]"
+            echo "Options:"
+            echo "  --enable-otel  Enable OpenTelemetry tracing"
+            echo "  -h, --help     Show this help message"
+            exit 0
+            ;;
         *)
             echo "Unknown option: $1"
+            echo "Use --help for usage information"
             exit 1
             ;;
     esac
 done
```
docs/observability/tracing.md (1)

99-111: Update documentation to reference all launch scripts supporting --enable-otel.

Lines 101–108 mention agg.sh as an example, but the documentation should note that other launch scripts (e.g., disagg.sh, multimodal_agg.sh, multimodal_disagg.sh) similarly support the --enable-otel flag. This will help users discover the feature across the deployment ecosystem.

 Many launch scripts support the `--enable-otel` flag to automatically configure tracing. For example:
 
 ```bash
 # Navigate to SGLang launch directory
 cd components/backends/sglang/launch
 
-# Run aggregated deployment with tracing enabled
+# Run aggregated deployment with tracing enabled (also supported in disagg.sh, multimodal_agg.sh, etc.)
 ./agg.sh --enable-otel
 ```

<details>
<summary>📜 Review details</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: CHILL

**Plan**: Pro

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 66fd6f84ad7d10933aafd9fd1ef768447aa91b36 and b389db06bb42a57bbc0e186388f4c5e87f4cc416.

</details>

<details>
<summary>⛔ Files ignored due to path filters (1)</summary>

* `docs/images/trace.png` is excluded by `!**/*.png`

</details>

<details>
<summary>📒 Files selected for processing (14)</summary>

* `components/backends/sglang/launch/agg.sh` (1 hunks)
* `components/backends/sglang/launch/agg_embed.sh` (1 hunks)
* `components/backends/sglang/launch/agg_router.sh` (2 hunks)
* `components/backends/sglang/launch/disagg.sh` (3 hunks)
* `components/backends/sglang/launch/disagg_dp_attn.sh` (3 hunks)
* `components/backends/sglang/launch/disagg_router.sh` (6 hunks)
* `components/backends/sglang/launch/multimodal_agg.sh` (4 hunks)
* `components/backends/sglang/launch/multimodal_disagg.sh` (5 hunks)
* `deploy/docker-compose.yml` (3 hunks)
* `deploy/metrics/README.md` (4 hunks)
* `deploy/metrics/grafana-datasources.yml` (1 hunks)
* `deploy/tracing/docker-compose.yml` (0 hunks)
* `deploy/tracing/grafana/provisioning/datasources/tempo.yaml` (0 hunks)
* `docs/observability/tracing.md` (10 hunks)

</details>

<details>
<summary>💤 Files with no reviewable changes (2)</summary>

* deploy/tracing/docker-compose.yml
* deploy/tracing/grafana/provisioning/datasources/tempo.yaml

</details>

<details>
<summary>🧰 Additional context used</summary>

<details>
<summary>🪛 LanguageTool</summary>

<details>
<summary>docs/observability/tracing.md</summary>

[uncategorized] ~179-~179: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...on tracing environment variables at the top level and service-specific names in each comp...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

</details>

</details>
<details>
<summary>🪛 markdownlint-cli2 (0.18.1)</summary>

<details>
<summary>docs/observability/tracing.md</summary>

147-147: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>
<details>
<summary>🪛 Shellcheck (0.11.0)</summary>

<details>
<summary>components/backends/sglang/launch/multimodal_agg.sh</summary>

[warning] 68-68: SGLANG_BACKEND_DIR appears unused. Verify use (or export if used externally).

(SC2034)

</details>
<details>
<summary>components/backends/sglang/launch/multimodal_disagg.sh</summary>

[warning] 68-68: SGLANG_BACKEND_DIR appears unused. Verify use (or export if used externally).

(SC2034)

</details>

</details>

</details>

<details>
<summary>⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)</summary>

* GitHub Check: vllm (amd64)
* GitHub Check: vllm (arm64)
* GitHub Check: sglang
* GitHub Check: trtllm (arm64)
* GitHub Check: operator (arm64)
* GitHub Check: trtllm (amd64)
* GitHub Check: operator (amd64)
* GitHub Check: Build and Test - dynamo

</details>

<details>
<summary>🔇 Additional comments (17)</summary><blockquote>

<details>
<summary>components/backends/sglang/launch/agg.sh (1)</summary><blockquote>

`39-49`: **LGTM: OTEL service naming and tracing setup is correct.**

The service naming strategy (dynamo-frontend and dynamo-worker) follows observability best practices, and the conditional OTEL setup ensures tracing is only enabled when explicitly requested.

</blockquote></details>
<details>
<summary>components/backends/sglang/launch/agg_router.sh (2)</summary><blockquote>

`44-59`: **Per-worker OTEL service naming and CUDA device assignment look good.**

The naming scheme (dynamo-worker-1, dynamo-worker-2) and CUDA_VISIBLE_DEVICES per worker are appropriate for multi-worker deployments.

---

`8-9`: **Verify: Cleanup trap may not track all spawned worker PIDs.**

The cleanup trap only references `$DYNAMO_PID $WORKER_PID`, but two workers are spawned (lines 44 and 53). If only the last worker's PID is tracked, earlier workers may not be properly cleaned up. Confirm whether this is a pre-existing issue or if PID tracking should be updated to handle multiple workers (e.g., using process groups or tracking both worker PIDs).
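
A minimal sketch of one possible fix (hypothetical; `sleep` stands in for the real worker commands), tracking every worker PID in an array so the trap can reach them all:

```bash
#!/usr/bin/env bash
# Sketch: collect all background worker PIDs so cleanup can kill every one of them.
WORKER_PIDS=()

cleanup() {
    echo "Cleaning up background processes..."
    kill $DYNAMO_PID "${WORKER_PIDS[@]}" 2>/dev/null || true
    wait $DYNAMO_PID "${WORKER_PIDS[@]}" 2>/dev/null || true
    echo "Cleanup complete."
}
trap cleanup EXIT INT TERM

sleep 300 &            # stand-in for worker 1
WORKER_PIDS+=($!)
sleep 300 &            # stand-in for worker 2
WORKER_PIDS+=($!)
sleep 300 &            # stand-in for the frontend
DYNAMO_PID=$!

wait
```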

</blockquote></details>
<details>
<summary>components/backends/sglang/launch/agg_embed.sh (1)</summary><blockquote>

`44-51`: **Embedding-specific service naming is clear and appropriate.**

The service name `dynamo-worker-embedding` provides good trace granularity for embedding workloads, and embedding-specific flags (--embedding-worker, --use-sglang-tokenizer) are correctly preserved.

</blockquote></details>
<details>
<summary>components/backends/sglang/launch/disagg_dp_attn.sh (1)</summary><blockquote>

`44-70`: **OTEL service naming for disaggregation architecture is well-designed.**

The prefill/decode distinction (dynamo-worker-prefill, dynamo-worker-decode) enables clear trace isolation between pipeline stages, which is valuable for performance analysis in disaggregated workloads.

</blockquote></details>
<details>
<summary>deploy/docker-compose.yml (3)</summary><blockquote>

`48-60`: **Tempo service configuration looks good—verify tempo.yaml exists.**

The Tempo service is properly configured with correct ports and profiles. The volume mount references `./tracing/tempo.yaml`; verify this file exists in the repository and contains valid Tempo configuration.

---

`155-156`: **TraceQL editor toggle is appropriate for trace exploration.**

Enabling `GF_FEATURE_TOGGLES_ENABLE=traceqlEditor` provides users with Grafana's query builder for Tempo traces, improving usability.

---

`163-165`: **Grafana correctly depends on Tempo.**

Adding Tempo to `depends_on` ensures proper startup order. This is necessary for Grafana datasource provisioning to find Tempo when the container starts.

</blockquote></details>
<details>
<summary>components/backends/sglang/launch/disagg_router.sh (2)</summary><blockquote>

`39-104`: **Hierarchical service naming strategy is excellent for complex disaggregated pipelines.**

The naming scheme (dynamo-frontend → dynamo-router-prefill → dynamo-worker-prefill-1/2 → dynamo-worker-decode-1/2) provides clear trace boundaries and enables precise performance analysis across pipeline stages.

---

`8-9`: **Verify: Cleanup trap may not capture all spawned worker PIDs.**

The trap only references `$DYNAMO_PID $PREFILL_PID $PREFILL_ROUTER_PID`, but multiple prefill and decode workers are spawned. Confirm whether additional PIDs should be tracked, or if this is an acceptable architectural trade-off for this launcher.

</blockquote></details>
<details>
<summary>deploy/metrics/grafana-datasources.yml (1)</summary><blockquote>

`25-44`: **Tempo datasource configuration is comprehensive and well-configured.**

The datasource includes all necessary options for trace exploration (service map, node graph, search, traceQL). The service URL `http://tempo:3200` correctly resolves within the Docker Compose network.

</blockquote></details>
<details>
<summary>deploy/metrics/README.md (3)</summary><blockquote>

`1-8`: **Documentation title and overview clearly integrate tracing with metrics.**

The updated title "Observability with Prometheus, Tempo, and Grafana" and the new Tracing Guide reference provide clear context for users unfamiliar with the observability stack.

---

`234-236`: **Tempo service documentation is clear and includes necessary port/protocol info.**

Users are provided with Tempo's HTTP endpoint (:3200) and OTLP receiver ports (gRPC :4317, HTTP :4318), enabling both exploration and integration.

---

`267-267`: **Clear instructions for accessing traces via Grafana.**

Users are directed to Explore → Tempo datasource and provided with reference to the detailed Tracing Guide, lowering the barrier to entry for trace exploration.

</blockquote></details>
<details>
<summary>docs/observability/tracing.md (3)</summary><blockquote>

`32-44`: **Documentation clearly explains tracing environment variables and service naming.**

The environment variables table (lines 38–43) and example configuration (lines 47–59) provide clear guidance on enabling tracing. Service naming conventions are well documented and align with the shell script implementations (`dynamo-frontend`, `dynamo-worker-prefill`, etc.).

---

`177-223`: **Kubernetes deployment examples are clear and well-structured.**

The `DynamoGraphDeployment` YAML examples (lines 181–223) correctly demonstrate service-specific `OTEL_SERVICE_NAME` environment variables for `SglangDecodeWorker` and `SglangPrefillWorker`, with common tracing configuration at the top level. Aligns well with docker-compose setup.

---

`235-243`: **Unified observability section clearly explains the consolidated tracing stack.**

The new "Unified Observability" section (lines 235–243) effectively communicates the shift from separate tracing and metrics stacks to a single Grafana instance. This reflects the PR objective to consolidate infrastructure and improves user experience.

</blockquote></details>

</blockquote></details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

fi

# run ingress
export OTEL_SERVICE_NAME=dynamo-frontend

@coderabbitai coderabbitai bot Oct 17, 2025


⚠️ Potential issue | 🟡 Minor

Standardize OTEL_SERVICE_NAME to inline assignment (no export).

Line 39 exports OTEL_SERVICE_NAME=dynamo-frontend, but subsequent workers (lines 44, 57) use inline assignment without export. Standardize by removing the export and using inline assignment for all processes, consistent with the pattern in multimodal_disagg.sh.

```diff
 # run ingress
-export OTEL_SERVICE_NAME=dynamo-frontend
-python3 -m dynamo.frontend --http-port=8000 &
+OTEL_SERVICE_NAME=dynamo-frontend python3 -m dynamo.frontend --http-port=8000 &
```

Also applies to: 44-44, 57-57

🤖 Prompt for AI Agents
In components/backends/sglang/launch/disagg.sh around lines 39, 44 and 57, the
OTEL_SERVICE_NAME is exported on line 39 but used as inline assignments (no
export) on lines 44 and 57; change the exported form to match the inline pattern
used elsewhere by removing the export and using plain inline assignment
(OTEL_SERVICE_NAME=...) for the process invocations so all three occurrences are
consistent with multimodal_disagg.sh.


@coderabbitai coderabbitai bot commented

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

@ishandhanani ishandhanani (Contributor Author) commented

@coderabbitai - please trace through the code and provide a very succinct summary on why it was included in the first place and what happens if we remove it

@coderabbitai coderabbitai bot commented

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash

# Check how OTEL_SERVICE_NAME is set across all launch scripts
echo "=== Checking OTEL_SERVICE_NAME usage patterns in all launch scripts ==="
rg -n "OTEL_SERVICE_NAME" components/backends/sglang/launch/ -A 1

echo ""
echo "=== Checking for 'export OTEL_SERVICE_NAME' specifically ==="
rg -n "export OTEL_SERVICE_NAME" components/backends/sglang/launch/

echo ""
echo "=== Checking inline OTEL_SERVICE_NAME assignment pattern ==="
rg -n "OTEL_SERVICE_NAME=\S+ python" components/backends/sglang/launch/
```

Length of output: 8893


@ishandhanani - The export was likely included initially to make OTEL_SERVICE_NAME available to the frontend process. However:

Why inline is better:

  • export persists the variable in the shell environment for all subsequent commands
  • Inline assignment (OTEL_SERVICE_NAME=value python3 ...) only sets it for that specific process
  • Since each worker sets its own OTEL_SERVICE_NAME, there's no need for the frontend's to persist

What happens if we remove export (use inline):

  • Frontend process still gets OTEL_SERVICE_NAME=dynamo-frontend via its environment
  • Variable doesn't leak to subsequent commands or interfere with worker assignments
  • More precise scope control

Current state:
The file already uses the inline pattern consistently (lines 39, 43, and 56), matching all other launch scripts in the codebase. The search confirms zero instances of export OTEL_SERVICE_NAME exist anywhere, so you're already following the correct pattern. ✓
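
A standalone illustration of the scoping difference (not from the repo; `python3 -c` is only used here to print what the child process sees):

```bash
# Exported: the value persists for every subsequent command in this shell
export OTEL_SERVICE_NAME=dynamo-frontend
python3 -c 'import os; print(os.environ.get("OTEL_SERVICE_NAME"))'   # -> dynamo-frontend
python3 -c 'import os; print(os.environ.get("OTEL_SERVICE_NAME"))'   # -> dynamo-frontend (still set)
unset OTEL_SERVICE_NAME

# Inline: the value is visible only to that one process
OTEL_SERVICE_NAME=dynamo-worker python3 -c 'import os; print(os.environ.get("OTEL_SERVICE_NAME"))'   # -> dynamo-worker
python3 -c 'import os; print(os.environ.get("OTEL_SERVICE_NAME"))'   # -> None (did not leak)
```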

🐰


🧬 Code Graph Analysis Results

[Relevant code snippets]

* File: `components/backends/sglang/launch/disagg_dp_attn.sh`, lines 6-11

  ```bash
  cleanup() {
      echo "Cleaning up background processes..."
      kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
      wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
      echo "Cleanup complete."
  }
  ```

* File: `components/backends/sglang/launch/agg_embed.sh`, lines 6-11

  ```bash
  cleanup() {
      echo "Cleaning up background processes..."
      kill $DYNAMO_PID 2>/dev/null || true
      wait $DYNAMO_PID 2>/dev/null || true
      echo "Cleanup complete."
  }
  ```

* File: `components/backends/sglang/launch/agg.sh`, lines 6-11

  ```bash
  cleanup() {
      echo "Cleaning up background processes..."
      kill $DYNAMO_PID 2>/dev/null || true
      wait $DYNAMO_PID 2>/dev/null || true
      echo "Cleanup complete."
  }
  ```

* File: `components/backends/sglang/launch/disagg_router.sh`, lines 6-11

  ```bash
  cleanup() {
      echo "Cleaning up background processes..."
      kill $DYNAMO_PID $PREFILL_PID $PREFILL_ROUTER_PID 2>/dev/null || true
      wait $DYNAMO_PID $PREFILL_PID $PREFILL_ROUTER_PID 2>/dev/null || true
      echo "Cleanup complete."
  }
  ```

* File: `components/backends/sglang/launch/multimodal_disagg.sh`, lines 6-11

  ```bash
  cleanup() {
      echo "Cleaning up background processes..."
      kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
      wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
      echo "Cleanup complete."
  }
  ```

* File: `components/backends/sglang/launch/multimodal_agg.sh`, lines 6-11

  ```bash
  cleanup() {
      echo "Cleaning up background processes..."
      kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
      wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
      echo "Cleanup complete."
  }
  ```

* File: `tests/serve/launch/sglang_agg.sh`, lines 6-11

  ```bash
  cleanup() {
      echo "Cleaning up background processes..."
      kill $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
      wait $DYNAMO_PID $PREFILL_PID 2>/dev/null || true
      echo "Cleanup complete."
  }
  ```

* File: `components/backends/sglang/launch/agg_router.sh`, lines 6-11

  ```bash
  cleanup() {
      echo "Cleaning up background processes..."
      kill $DYNAMO_PID $WORKER_PID 2>/dev/null || true
      wait $DYNAMO_PID $WORKER_PID 2>/dev/null || true
      echo "Cleanup complete."
  }
  ```

* File: (Python) `components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py`, lines 408-411

  ```python
  def cleanup(self):
      self.engine.shutdown()
      logger.info("Multimodal worker engine shutdown")
      super().cleanup()
  ```

* File: (Python) `components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py`, lines 523-526

  ```python
  def cleanup(self):
      self.engine.shutdown()
      logger.info("Multimodal prefill engine shutdown")
      super().cleanup()
  ```


@keivenchang keivenchang left a comment


  1. OTEL != Prometheus, but I see that if OTEL is enabled, Prometheus is also enabled. So perhaps the name should be --enable-all-metrics-and-logging or something to that effect. @nnshah1 what do you think?
  2. Do you plan to sync this to other frameworks as well, maybe in the next PR? It'd be weird if this only exists for sglang.

esac
done

# Enable tracing if requested
Contributor

How about renaming this to ENABLE_ALL_METRICS?

OTEL and Prometheus are separate systems. So enabling one doesn't necessarily mean enabling the other.

Contributor Author

Good catch here.

I like --enable-metrics-and-logging

export DYN_SYSTEM_PORT=8081
fi

# run ingress
Contributor

Note that the frontend doesn't currently use the DYN_SYSTEM_ variables, only the workers do.

Contributor Author

I'll add a comment. If the frontend just ignores it, it's probably fine.

@ishandhanani
Contributor Author

Closing this PR as the majority of the work has been completed by other PRs:

Grafana/Tempo Consolidation (main goal of this PR):

Related Infrastructure:

Remaining Unique Value:
The --enable-otel flag functionality from this PR has been extracted and submitted as PR #4243, which adds the convenient one-flag enablement to SGLang launch scripts.

Thanks for the consolidation work - it made the codebase cleaner!
