[Core] Lite weight profiler implementation #26648
Conversation
Signed-off-by: Naman Lalit <[email protected]>
cc: @yeqcharlotte @Jialin @linzebing
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Naman Lalit <[email protected]>
os.makedirs(directory, exist_ok=True)
# ruff: noqa: SIM115 - intentionally keeping file handle cached globally
_log_file = open(_LOG_PATH, "w", buffering=50000)
atexit.register(_log_file.close)
This only closes the file when the Python interpreter shuts down, which makes me wonder whether we could read incomplete data if the server is not shut down (because buffered data has not been flushed yet).
This is a bit tricky. One option is to explicitly close the file in maybe_emit_lite_profiler_report, but that only works for benchmark_latency and benchmark_throughput; benchmark serve runs the server and client separately.
If we can't find a good way, we can skip benchmark serve for now.
We can try to flush the file data before generating the summary, but I think that needs a bit of testing for the serving path.
I'll skip the serving part in this PR and take it up in a new PR with a few more changes.
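A minimal sketch of the flush-before-report idea, assuming the module-level _log_file cache shown above; the helper name _flush_lite_profiler_log is hypothetical and not part of this PR:

def _flush_lite_profiler_log() -> None:
    # Best-effort flush of the cached handle so that summarize_log() sees all
    # buffered records even when the process keeps running.
    if _log_file is not None and not _log_file.closed:
        _log_file.flush()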
# Cache for log file handle
_log_file: TextIO | None = None
Hmmm, since we might write logs from different processes, I'm wondering whether we need to do anything to synchronize the writes.
I'd feel safer reusing the logger instead, and for now just having a knob to control whether the logging is enabled.
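For illustration only (not what the PR does), one common way to keep concurrent writers from corrupting each other's records is to have each process append complete, newline-terminated lines and flush per record; the helper name here is hypothetical:

def _append_record(log_path: str, record: str) -> None:
    # Each process opens the log in append mode and writes one complete,
    # newline-terminated line per record, flushing immediately so records
    # from different processes are far less likely to interleave.
    with open(log_path, "a") as f:
        f.write(record.rstrip("\n") + "\n")
        f.flush()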
def _should_log_results() -> bool:
    """Check if the current process should log results.

    Only the data-parallel rank 0 engine core and worker 0 should emit logs in
    multi-process deployments so that we avoid duplicating identical timing
    data.
    """
    process = multiprocessing.current_process()
    return process.name in ("EngineCore_DP0", "VllmWorker-0")
Maybe we should just let all ranks log for now and iterate on it later.
The logging overhead should be relatively small per rank, and we could leave the heavy filtering work to the reporting step, keeping runtime branching and computation to a minimum.
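A sketch of that alternative, assuming each record is tagged with the name of the emitting process (the record format and function name are assumptions, not code from this PR):

def _dedupe_by_rank(records: list[dict]) -> list[dict]:
    # Filter at reporting time instead of at runtime: keep records from one
    # representative engine-core / worker process and drop the duplicates.
    keep = {"EngineCore_DP0", "VllmWorker-0"}
    return [r for r in records if r.get("process") in keep]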
Discussed with @namanlalitnyu and @linzebing offline. The overall design looks good now. Let's try to address the following comments:
And please ignore other conflicting comments.
Signed-off-by: Naman Lalit <[email protected]>
Documentation preview: https://vllm--26648.org.readthedocs.build/en/26648/
try:
    # Generate and display the summary report
    lite_profiler_report.summarize_log(log_path)
    os.remove(log_path)
Since the file is opened in append mode while profiling (to avoid having processes overwrite each other's data), we need to remove the file once report generation completes so that future runs do not see duplicated data.
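A sketch of that flow using the names from the snippet above; the wrapper function and error handling are assumptions, not code from the PR:

import logging
import os

logger = logging.getLogger(__name__)

def _emit_report_and_cleanup(log_path: str) -> None:
    # `lite_profiler_report` is assumed to be the PR's reporting module.
    try:
        # Aggregate the appended records into the summary table, then delete
        # the append-mode log so the next run does not double-count records.
        lite_profiler_report.summarize_log(log_path)
        os.remove(log_path)
    except OSError:
        logger.warning("Could not summarize or remove %s", log_path)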
Signed-off-by: Naman Lalit <[email protected]>
Signed-off-by: Naman Lalit <[email protected]>
Signed-off-by: Naman Lalit <[email protected]>
Thanks @Jialin for the comments!
Purpose
In this PR, we are implementing a lightweight profiling framework to capture the overhead of critical system components. The framework is controlled by the environment variable VLLM_LITE_PROFILER_LOG_PATH, which specifies where the logs are stored. The logs are then used to generate a summary of the time spent in each function, which helps us identify areas for optimization. The implementation is explained in more detail in issue 26606.
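For intuition only, a minimal sketch of what such a lightweight timer can look like; this is not the PR's actual API, and the name lite_profile is hypothetical:

import os
import time
from contextlib import contextmanager

_LOG_PATH = os.environ.get("VLLM_LITE_PROFILER_LOG_PATH")

@contextmanager
def lite_profile(name: str):
    if _LOG_PATH is None:
        # Profiling disabled: nothing is timed or written.
        yield
        return
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1e3
        # Append one "name,elapsed_ms" record for the reporter to aggregate.
        with open(_LOG_PATH, "a") as f:
            f.write(f"{name},{elapsed_ms:.3f}\n")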
The final generated output looks like this:

TODO:
Test Plan
Test Result
The framework was tested on three different LLM models with different configurations: facebook/opt-125m, meta-llama/Meta-Llama-3.1-8B, and meta-llama/Llama-4-Maverick-17B-128E.
Benchmark throughput tests were run in three scenarios (main branch, vllm_lite_profiler branch with profiling disabled, and vllm_lite_profiler branch with profiling enabled), with 5 runs each, averaged.
-> From the results, we observed no performance regression when profiling is disabled. When profiling is enabled, the overhead is minimal, making it easy to identify potentially time-consuming sections of the code.
-> Commands for the different models and configurations:
VLLM_LITE_PROFILER_LOG_PATH=/tmp/vllm-lite-profiler-llama3-8b.log vllm bench throughput --backend vllm --model meta-llama/Meta-Llama-3.1-8B-Instruct --dataset-name sharegpt --dataset-path ~/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 200
VLLM_LITE_PROFILER_LOG_PATH=/tmp/vllm-lite-profiler-facebook-125m.log vllm bench throughput --backend vllm --model facebook/opt-125m --dataset-name sharegpt --dataset-path ~/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 200
VLLM_LITE_PROFILER_LOG_PATH=/tmp/vllm-lite-profiler-llama4.log vllm bench throughput --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --tensor-parallel-size 8 --load-format dummy --dataset ~/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 200 --backend vllm --max-model-len 8192
Eval testing
(Results are the same, no change)
lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1 --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
a) vllm_lite_profiler branch
b) Main branch