
Conversation

@namanlalitnyu (Contributor) commented Oct 11, 2025

Purpose

In this PR, we implement a lightweight profiling framework to capture the overhead of critical system components. The framework is controlled by the environment variable VLLM_LITE_PROFILER_LOG_PATH, which specifies where the logs are stored. The logs are then used to generate a summary of the time spent in each function, which helps us identify areas for optimization.

The implementation is explained in more detail in issue #26606.
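
As a rough illustration of the mechanism (not the exact implementation in this PR), here is a minimal sketch of a timing context manager gated by the same environment variable; the name lite_profile and the one-record-per-line format are assumptions for illustration, and the real code caches the file handle rather than reopening it for every record.

import os
import time
from contextlib import contextmanager

_LOG_PATH = os.environ.get("VLLM_LITE_PROFILER_LOG_PATH")

@contextmanager
def lite_profile(name: str):
    """Time a code section and append a "name elapsed_ms" record to the log."""
    if _LOG_PATH is None:
        # Profiling disabled: yield with no measurement so the hot path is untouched.
        yield
        return
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        with open(_LOG_PATH, "a") as f:  # illustration only; the PR caches the handle
            f.write(f"{name} {elapsed_ms:.3f}\n")

Usage would then look like wrapping the section being measured in `with lite_profile("model_forward"): ...`, with the summary generated from the log afterwards.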

The final generated output looks like this:
[Screenshot: generated lite-profiler summary report]

TODO:

  1. Implement support for the bench serve command as well, making sure that there is no data leakage.
  2. Try to optimize the logging approach.

Test Plan

  1. Verify that there is no regression when the environment variable is not set: there should be no performance overhead in benchmark results such as throughput and latency.
  2. Verify that there is no impact on eval results with the new changes.
  3. Verify that the profiler report is generated correctly once lite-profiling is enabled.

Test Result

The framework was tested on three different LLM models with different configurations: facebook/opt-125m, meta-llama/Meta-Llama-3.1-8B, and meta-llama/Llama-4-Maverick-17B-128E.

  1. Benchmark throughput tests across three scenarios (main branch, vllm_lite_profiler branch with profiling disabled, and vllm_lite_profiler branch with profiling enabled). Each configuration was run 5 times and the results averaged.
    -> From the results, we observed no performance regression when profiling is disabled. When profiling is enabled, the overhead is minimal, which lets us easily identify the time-consuming sections of the code.

    [Screenshot: benchmark throughput comparison across the three scenarios]

    -> Commands for different models, with configurations:

    1. meta-llama/Meta-Llama-3.1-8B: VLLM_LITE_PROFILER_LOG_PATH=/tmp/vllm-lite-profiler-llama3-8b.log vllm bench throughput --backend vllm --model meta-llama/Meta-Llama-3.1-8B-Instruct --dataset-name sharegpt --dataset-path ~/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 200
    2. facebook/opt-125m: VLLM_LITE_PROFILER_LOG_PATH=/tmp/vllm-lite-profiler-facebook-125m.log vllm bench throughput --backend vllm --model facebook/opt-125m --dataset-name sharegpt --dataset-path ~/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 200
    3. meta-llama/Llama-4-Maverick-17B-128E: VLLM_LITE_PROFILER_LOG_PATH=/tmp/vllm-lite-profiler-llama4.log vllm bench throughput --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --tensor-parallel-size 8 --load-format dummy --dataset ~/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 200 --backend vllm --max-model-len 8192
  2. Eval testing
    (Results are the same, no change)
    lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1 --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250

a) vllm_lite_profiler branch

Tasks  Version  Filter            n-shot  Metric       Value  Stderr
gsm8k  3        flexible-extract  5       exact_match  0.768  ± 0.0268
                strict-match      5       exact_match  0.740  ± 0.0278

b) Main branch

Tasks  Version  Filter            n-shot  Metric       Value  Stderr
gsm8k  3        flexible-extract  5       exact_match  0.768  ± 0.0268
                strict-match      5       exact_match  0.740  ± 0.0278


Signed-off-by: Naman Lalit <[email protected]>
@namanlalitnyu (Contributor Author) commented:

cc: @yeqcharlotte @Jialin @linzebing
Thank you!

@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review: Here are some automated review suggestions for this pull request.


mergify bot commented Oct 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @namanlalitnyu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 12, 2025
Signed-off-by: Naman Lalit <[email protected]>
os.makedirs(directory, exist_ok=True)
# ruff: noqa: SIM115 - intentionally keeping file handle cached globally
_log_file = open(_LOG_PATH, "w", buffering=50000)
atexit.register(_log_file.close)
Contributor:
This only closes the file when the Python interpreter shuts down. This makes me wonder whether it's possible for us to read incomplete data if we don't shut down the server (since buffered data may not have been flushed yet).

This is a bit tricky. One way is to explicitly close in maybe_emit_lite_profiler_report, but this only works for benchmark_latency and benchmark_throughput. benchmark serve runs server and client separately.

If we can't find a good way, we can skip benchmark serve for now.

Contributor Author:

We can try to flush the file data before generating the summary, but I think that needs a bit of testing for the serving part.
I'll skip the serving part in this PR and take it up in a new PR with a few more changes.

Comment on lines +30 to +31
# Cache for log file handle
_log_file: TextIO | None = None
Contributor:

Hmmm, as we might write logs from different processes, I'm wondering if there's anything we need to do to synchronize the writes accordingly.

I might feel safer reusing the logger instead, and for now just having a knob to control whether the logging is enabled.
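
To illustrate this suggestion (a sketch only, not what the PR currently does), the records could be routed through Python's standard logging module with the environment variable acting as the on/off knob; the logger name vllm.lite_profiler and the log_timing helper are hypothetical.

import logging
import os

_LITE_PROFILER_LOG_PATH = os.environ.get("VLLM_LITE_PROFILER_LOG_PATH")

# Dedicated logger so profiler records do not mix with regular vLLM logs.
lite_logger = logging.getLogger("vllm.lite_profiler")
lite_logger.propagate = False

if _LITE_PROFILER_LOG_PATH:
    handler = logging.FileHandler(_LITE_PROFILER_LOG_PATH)
    handler.setFormatter(logging.Formatter("%(processName)s %(message)s"))
    lite_logger.addHandler(handler)
    lite_logger.setLevel(logging.INFO)
else:
    # Knob off: records are dropped cheaply.
    lite_logger.addHandler(logging.NullHandler())

def log_timing(name: str, elapsed_ms: float) -> None:
    lite_logger.info("%s %.3f", name, elapsed_ms)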

Comment on lines +20 to +27
def _should_log_results() -> bool:
    """Check if the current process should log results.

    Only the data-parallel rank 0 engine core and worker 0 should emit logs in
    multi-process deployments so that we avoid duplicating identical timing
    data.
    """
    process = multiprocessing.current_process()
    return process.name in ("EngineCore_DP0", "VllmWorker-0")
Contributor:

Maybe we should just let all ranks log for now and iterate on it later.

The logging overhead should be relatively small per rank, and we could leave the heavy-lifting filtering to the reporting step, reducing runtime branching and computation as much as possible.
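
A sketch of what moving the filtering into the reporting step could look like, assuming each record were prefixed with the emitting process name (that prefix is an assumption for illustration, not the PR's current log format):

from collections.abc import Iterator

def iter_rank0_records(log_path: str) -> Iterator[str]:
    """Yield only records written by the DP-rank-0 engine core or worker 0.

    With every rank logging unconditionally, the duplicate filtering that
    _should_log_results() performs at runtime can happen here instead.
    """
    wanted = ("EngineCore_DP0", "VllmWorker-0")
    with open(log_path) as f:
        for line in f:
            process_name, _, rest = line.partition(" ")
            if process_name in wanted:
                yield rest.rstrip("\n")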

@Jialin (Contributor) commented Oct 14, 2025

Discussed with @namanlalitnyu and @linzebing offline. The overall design looks good now. Let's try to address the following comments:

  1. Ensure the log file is closed before report generation
  2. Add a quick runbook for the usage in profiling.md

And please ignore other conflicting comments.


mergify bot commented Oct 14, 2025

Documentation preview: https://vllm--26648.org.readthedocs.build/en/26648/

@mergify mergify bot added the documentation Improvements or additions to documentation label Oct 14, 2025
try:
    # Generate and display the summary report
    lite_profiler_report.summarize_log(log_path)
    os.remove(log_path)
Contributor Author:

Since the file is opened in append mode during profiling (to avoid having processes overwrite each other's data), we need to remove it once report generation completes, so that there is no data duplication in future runs.
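
For illustration, a self-contained sketch of this summarize-then-remove flow, assuming each log line is a "section elapsed_ms" record (the actual report format and lite_profiler_report internals are not reproduced here):

import os
from collections import Counter

def emit_report_and_cleanup(log_path: str) -> None:
    """Print total time per section from the append-mode log, then delete it."""
    totals: Counter[str] = Counter()
    counts: Counter[str] = Counter()
    with open(log_path) as f:
        for line in f:
            try:
                section, elapsed_str = line.rsplit(" ", 1)
                elapsed_ms = float(elapsed_str)
            except ValueError:
                continue  # tolerate a malformed or truncated line
            totals[section] += elapsed_ms
            counts[section] += 1
    for section, total in totals.most_common():
        print(f"{section}: {total:.1f} ms total over {counts[section]} calls")
    # Delete the log so the next run's append-mode writes start from a clean file.
    os.remove(log_path)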

@namanlalitnyu (Contributor Author) commented Oct 15, 2025

Thanks @Jialin for the comments!

  1. Ensure the log file is closed before report generation
    I have implemented line-level log flushing to avoid data loss, and the file is now explicitly closed before the report is generated (a sketch of this approach follows below). I'll follow up with further improvements to the logging in coming PRs to reduce the performance overhead and improve safety.
  2. Add a quick runbook for the usage in profiling.md
    I have added documentation for lite-profiling with an example: https://vllm--26648.org.readthedocs.build/en/26648/contributing/profiling.html#lite-profiler
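
A minimal sketch of what this line-level flushing plus explicit close could look like; the helper names here are illustrative, and only the overall approach (flush after every record, close before the report is generated) reflects the change described above.

import atexit
import os

_log_file = None

def _write_record(record: str) -> None:
    """Append one record and flush immediately so readers never see partial data."""
    global _log_file
    if _log_file is None:
        log_path = os.environ["VLLM_LITE_PROFILER_LOG_PATH"]
        os.makedirs(os.path.dirname(log_path) or ".", exist_ok=True)
        _log_file = open(log_path, "a")
        atexit.register(_log_file.close)  # safety net; close() is idempotent
    _log_file.write(record + "\n")
    _log_file.flush()

def _close_log_file() -> None:
    """Explicitly close the log before report generation instead of waiting for exit."""
    global _log_file
    if _log_file is not None and not _log_file.closed:
        _log_file.close()
        _log_file = None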


Labels

  • documentation (Improvements or additions to documentation)
  • needs-rebase
  • performance (Performance-related issues)
  • v1
