Motivation.
Hidden states are a valuable feature for many downstream applications, and their support has been requested and discussed in the vLLM community for a while (#15434, #4435, #6165).
Currently, users who need both the prompt hidden states and the generated text must make two separate calls: `LLM.encode` (to obtain hidden states) and `LLM.generate` (to obtain generated text). This workflow is inconvenient and inefficient, especially for applications that require both outputs. The inability to return hidden states alongside generated text has also been highlighted in #12249.
Proposed Change.
This is an early draft PR: #24202. It will be polished after the RFC discussion.
Configuration
Add a `return_prompt_hidden_states` flag to the sampling config.
Prior discussion (#15434 (comment)) noted that such a request-level parameter may be hard to handle well. In local testing, this approach appears feasible for prompt hidden states; feedback on potential disadvantages is welcome.
How to build up `prompt_hidden_states_tensor`
The process for building the prompt hidden states tensor is similar to the existing `_get_prompt_logprobs_dict` implementation in `vllm/v1/worker/gpu_model_runner.py` (line 203 at commit d0944b2).
The main difference is that we construct a CPU tensor of shape `(num_prompt_tokens, hidden_size)`. During chunked prefill, we copy the relevant chunk slices into it.
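The copy pattern can be sketched as follows, using NumPy as a stand-in for the runner's tensors (the chunk boundaries and sizes here are toy values for illustration, not vLLM internals):

```python
import numpy as np

num_prompt_tokens, hidden_size = 7, 4  # toy sizes for illustration

# Preallocated CPU buffer that accumulates the full prompt's hidden states.
prompt_hidden_states = np.zeros((num_prompt_tokens, hidden_size), dtype=np.float32)

# Simulate chunked prefill: each step yields hidden states for one chunk of the prompt.
chunks = [(0, 3), (3, 5), (5, 7)]  # (start, end) token offsets per prefill step
for start, end in chunks:
    # In the real runner this slice would be copied off the device for the current step;
    # here we fabricate it so each chunk is recognizable by its start offset.
    chunk_hidden = np.full((end - start, hidden_size), float(start), dtype=np.float32)
    prompt_hidden_states[start:end] = chunk_hidden  # copy the chunk slice into place
```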
Hidden States Aggregation
To support flexible downstream use, we propose adding a plugin interface that allows users to perform custom aggregations on the hidden states (e.g., mean or last-token pooling).
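One possible shape for such a plugin interface (the registry and function names here are hypothetical, not an existing vLLM API): users register a named aggregation callable that reduces the `(num_prompt_tokens, hidden_size)` matrix to a single vector.

```python
from typing import Callable, Dict, List

# Hypothetical registry: name -> aggregation over a (num_tokens, hidden_size) matrix.
HIDDEN_STATE_AGGREGATORS: Dict[str, Callable[[List[List[float]]], List[float]]] = {}

def register_aggregator(name: str):
    """Decorator registering a custom hidden-state aggregation under a name."""
    def wrap(fn):
        HIDDEN_STATE_AGGREGATORS[name] = fn
        return fn
    return wrap

@register_aggregator("mean_pool")
def mean_pool(hidden: List[List[float]]) -> List[float]:
    # Average over the token dimension.
    n = len(hidden)
    return [sum(col) / n for col in zip(*hidden)]

@register_aggregator("last_token")
def last_token(hidden: List[List[float]]) -> List[float]:
    # Keep only the final prompt token's hidden state.
    return hidden[-1]

# Toy matrix with num_prompt_tokens=3, hidden_size=2.
h = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```

With a registry like this, a request could name its aggregator and the engine would apply it before serializing the output, shrinking the payload from `(num_prompt_tokens, hidden_size)` to `(hidden_size,)`.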
Output
When `return_prompt_hidden_states` is enabled, the output will include:
- `prompt_hidden_states`: a tensor of shape `(num_prompt_tokens, hidden_size)`
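End-to-end usage could then look like the following sketch; note that the `return_prompt_hidden_states` flag and the `prompt_hidden_states` output field are the additions proposed in this RFC, not a shipped vLLM API:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# Proposed per-request flag (not yet in vLLM).
params = SamplingParams(max_tokens=16, return_prompt_hidden_states=True)

outputs = llm.generate(["Hello, my name is"], params)
for out in outputs:
    # Proposed field: (num_prompt_tokens, hidden_size) tensor alongside the text.
    print(out.outputs[0].text, out.prompt_hidden_states.shape)
```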
Perf Concerns
- Default (disabled): No performance regression is expected.
- Enabled: Storing and serializing a large (num_prompt_tokens, hidden_size) tensor may impact performance, especially with long prompts. We will benchmark performance with large input lengths to assess the impact and determine if the regression is acceptable.
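To make the enabled-case cost concrete, the buffer size is simply `num_prompt_tokens * hidden_size * bytes_per_element`. A quick back-of-the-envelope helper (the model dimensions below are arbitrary examples, not benchmark results):

```python
def prompt_hidden_states_bytes(num_prompt_tokens: int,
                               hidden_size: int,
                               dtype_bytes: int = 2) -> int:
    """Size in bytes of the (num_prompt_tokens, hidden_size) buffer.

    dtype_bytes=2 corresponds to fp16/bf16 hidden states.
    """
    return num_prompt_tokens * hidden_size * dtype_bytes

# e.g. a 32k-token prompt with hidden_size 4096 in fp16:
size = prompt_hidden_states_bytes(32_768, 4096, 2)
print(size / 2**20, "MiB")  # 256.0 MiB per request
```

Numbers of this magnitude per request suggest that serialization and host-memory pressure, not compute, are the main costs to benchmark.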
Feedback Period.
1-3 weeks
CC List.
Any Other Things.
No response