[Core] Lite weight profiler implementation #26648
Conversation
Signed-off-by: Naman Lalit <[email protected]>
cc: @yeqcharlotte @Jialin @linzebing
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Naman Lalit <[email protected]>
os.makedirs(directory, exist_ok=True)
# ruff: noqa: SIM115 - intentionally keeping file handle cached globally
_log_file = open(_LOG_PATH, "w", buffering=50000)
atexit.register(_log_file.close)
This only closes the file when the Python interpreter shuts down, which makes me wonder whether we could read incomplete data if the server is not shut down (because buffered data has not been flushed yet).
This is a bit tricky. One option is to explicitly close the file in maybe_emit_lite_profiler_report, but that only works for benchmark_latency and benchmark_throughput; benchmark serve runs the server and client separately.
If we can't find a good way, we can skip benchmark serve for now.
We can try to flush the file data before generating the summary, but I think that needs a bit of testing for the serving path.
I'll skip the serving part in this PR and take it up in a new PR with a few more changes.
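A minimal sketch of the flush-before-report idea, assuming the module-level _log_file cache shown above; the helper name _flush_lite_profiler_log is hypothetical and not part of this PR:

def _flush_lite_profiler_log() -> None:
    # Best-effort flush of the cached handle so that summarize_log() sees all
    # buffered records even when the process keeps running.
    if _log_file is not None and not _log_file.closed:
        _log_file.flush()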
# Cache for log file handle
_log_file: TextIO | None = None
Hmmm, since we might write logs from different processes, I'm wondering whether we need to do anything to synchronize the writes.
I'd feel safer reusing the logger instead, and for now just having a knob to control whether the logging is enabled.
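For illustration only (not what the PR does), one common way to keep concurrent writers from corrupting each other's records is to have each process append complete, newline-terminated lines and flush per record; the helper name here is hypothetical:

def _append_record(log_path: str, record: str) -> None:
    # Each process opens the log in append mode and writes one complete,
    # newline-terminated line per record, flushing immediately so records
    # from different processes are far less likely to interleave.
    with open(log_path, "a") as f:
        f.write(record.rstrip("\n") + "\n")
        f.flush()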
def _should_log_results() -> bool:
    """Check if the current process should log results.

    Only the data-parallel rank 0 engine core and worker 0 should emit logs in
    multi-process deployments so that we avoid duplicating identical timing
    data.
    """
    process = multiprocessing.current_process()
    return process.name in ("EngineCore_DP0", "VllmWorker-0")
Maybe we should just let all ranks log for now and iterate on it later.
The logging overhead should be relatively small per rank, and we could leave the heavy filtering work to the reporting step, keeping runtime branching and computation to a minimum.
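A sketch of that alternative, assuming each record is tagged with the name of the emitting process (the record format and function name are assumptions, not code from this PR):

def _dedupe_by_rank(records: list[dict]) -> list[dict]:
    # Filter at reporting time instead of at runtime: keep records from one
    # representative engine-core / worker process and drop the duplicates.
    keep = {"EngineCore_DP0", "VllmWorker-0"}
    return [r for r in records if r.get("process") in keep]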
Discussed with @namanlalitnyu and @linzebing offline. The overall design looks good now. Let's try to address the following comments:
And please ignore other conflicting comments.
Signed-off-by: Naman Lalit <[email protected]>
Documentation preview: https://vllm--26648.org.readthedocs.build/en/26648/
try:
    # Generate and display the summary report
    lite_profiler_report.summarize_log(log_path)
    os.remove(log_path)
Since the file is opened in append mode while profiling (to avoid having processes overwrite each other's data), we need to remove the file once report generation completes so that future runs do not see duplicated data.
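A sketch of that flow using the names from the snippet above; the wrapper function and error handling are assumptions, not code from the PR:

import logging
import os

logger = logging.getLogger(__name__)

def _emit_report_and_cleanup(log_path: str) -> None:
    # `lite_profiler_report` is assumed to be the PR's reporting module.
    try:
        # Aggregate the appended records into the summary table, then delete
        # the append-mode log so the next run does not double-count records.
        lite_profiler_report.summarize_log(log_path)
        os.remove(log_path)
    except OSError:
        logger.warning("Could not summarize or remove %s", log_path)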
Signed-off-by: Naman Lalit <[email protected]>
Signed-off-by: Naman Lalit <[email protected]>
Signed-off-by: Naman Lalit <[email protected]>
Thanks @Jialin for the comments!
Purpose
In this PR, we are implementing a lightweight profiling framework to capture the overhead of critical system components. The framework is controlled by the environment variable VLLM_LITE_PROFILER_LOG_PATH, which specifies where the logs are stored. The logs are then used to generate a summary of the time spent in each function, which helps us identify areas for optimization. The implementation is explained in more detail in issue 26606.
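For intuition only, a minimal sketch of what such a lightweight timer can look like; this is not the PR's actual API, and the name lite_profile is hypothetical:

import os
import time
from contextlib import contextmanager

_LOG_PATH = os.environ.get("VLLM_LITE_PROFILER_LOG_PATH")

@contextmanager
def lite_profile(name: str):
    if _LOG_PATH is None:
        # Profiling disabled: nothing is timed or written.
        yield
        return
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1e3
        # Append one "name,elapsed_ms" record for the reporter to aggregate.
        with open(_LOG_PATH, "a") as f:
            f.write(f"{name},{elapsed_ms:.3f}\n")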
The final generated output looks like this:

TODO:
Test Plan
Test Result
The framework was tested on three different LLM models with different configurations: facebook/opt-125m, meta-llama/Meta-Llama-3.1-8B, and meta-llama/Llama-4-Maverick-17B-128E.
Benchmark throughput tests were run in three scenarios (main branch, vllm_lite_profiler branch with profiling disabled, and vllm_lite_profiler branch with profiling enabled), with 5 runs each, averaged.
-> From the results, we observed no performance regression when profiling is disabled. When profiling is enabled, the overhead is minimal, making it easy to identify potentially time-consuming sections of the code.
-> Commands for the different models and configurations:
VLLM_LITE_PROFILER_LOG_PATH=/tmp/vllm-lite-profiler-llama3-8b.log vllm bench throughput --backend vllm --model meta-llama/Meta-Llama-3.1-8B-Instruct --dataset-name sharegpt --dataset-path ~/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 200
VLLM_LITE_PROFILER_LOG_PATH=/tmp/vllm-lite-profiler-facebook-125m.log vllm bench throughput --backend vllm --model facebook/opt-125m --dataset-name sharegpt --dataset-path ~/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 200
VLLM_LITE_PROFILER_LOG_PATH=/tmp/vllm-lite-profiler-llama4.log vllm bench throughput --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --tensor-parallel-size 8 --load-format dummy --dataset ~/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 200 --backend vllm --max-model-len 8192
Eval testing
(Results are the same, no change)
lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1 --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
a) vllm_lite_profiler branch
b) Main branch