Motivation.
Hidden states are a valuable feature for many downstream applications, and their support has been requested and discussed in the vLLM community for a while (#15434, #4435, #6165).
Currently, users who need both the prompt hidden states and the generated text must make two separate calls: `LLM.encode` (to obtain hidden states) and `LLM.generate` (to obtain generated text). This workflow is inconvenient and inefficient, especially for applications that require both outputs. The inability to return hidden states alongside generated text has also been highlighted in #12249.
Proposed Change.
This is an early draft PR: #24202. It will be polished after the RFC discussion.
Configuration
Add a `return_prompt_hidden_states` flag to the sampling config.
Prior discussion (#15434 (comment)) noted that such a request-level parameter may be hard to handle well. In local testing, this approach appears feasible for prompt hidden states; feedback on potential disadvantages is welcome.
How to build up `prompt_hidden_states_tensor`
The process for building the prompt hidden states tensor is similar to the existing `_get_prompt_logprobs_dict` implementation in `vllm/v1/worker/gpu_model_runner.py` (line 203 at commit d0944b2).
The main difference is that we construct a CPU tensor of shape `(num_prompt_tokens, hidden_size)`. During chunked prefill, we copy the relevant chunk slices into it.
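The copy pattern can be sketched as follows, using NumPy as a stand-in for the runner's tensors (the chunk boundaries and sizes here are toy values for illustration, not vLLM internals):

```python
import numpy as np

num_prompt_tokens, hidden_size = 7, 4  # toy sizes for illustration

# Preallocated CPU buffer that accumulates the full prompt's hidden states.
prompt_hidden_states = np.zeros((num_prompt_tokens, hidden_size), dtype=np.float32)

# Simulate chunked prefill: each step yields hidden states for one chunk of the prompt.
chunks = [(0, 3), (3, 5), (5, 7)]  # (start, end) token offsets per prefill step
for start, end in chunks:
    # In the real runner this slice would be copied off the device for the current step;
    # here we fabricate it so each chunk is recognizable by its start offset.
    chunk_hidden = np.full((end - start, hidden_size), float(start), dtype=np.float32)
    prompt_hidden_states[start:end] = chunk_hidden  # copy the chunk slice into place
```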
Hidden States Aggregation
To support flexible downstream use, we propose adding a plugin interface that allows users to perform custom aggregations on the hidden states (e.g., mean or last-token pooling).
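One possible shape for such a plugin interface (the registry and function names here are hypothetical, not an existing vLLM API): users register a named aggregation callable that reduces the `(num_prompt_tokens, hidden_size)` matrix to a single vector.

```python
from typing import Callable, Dict, List

# Hypothetical registry: name -> aggregation over a (num_tokens, hidden_size) matrix.
HIDDEN_STATE_AGGREGATORS: Dict[str, Callable[[List[List[float]]], List[float]]] = {}

def register_aggregator(name: str):
    """Decorator registering a custom hidden-state aggregation under a name."""
    def wrap(fn):
        HIDDEN_STATE_AGGREGATORS[name] = fn
        return fn
    return wrap

@register_aggregator("mean_pool")
def mean_pool(hidden: List[List[float]]) -> List[float]:
    # Average over the token dimension.
    n = len(hidden)
    return [sum(col) / n for col in zip(*hidden)]

@register_aggregator("last_token")
def last_token(hidden: List[List[float]]) -> List[float]:
    # Keep only the final prompt token's hidden state.
    return hidden[-1]

# Toy matrix with num_prompt_tokens=3, hidden_size=2.
h = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```

With a registry like this, a request could name its aggregator and the engine would apply it before serializing the output, shrinking the payload from `(num_prompt_tokens, hidden_size)` to `(hidden_size,)`.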
Output
When `return_prompt_hidden_states` is enabled, the output will include:
- `prompt_hidden_states`: a tensor of shape `(num_prompt_tokens, hidden_size)`
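End-to-end usage could then look like the following sketch; note that the `return_prompt_hidden_states` flag and the `prompt_hidden_states` output field are the additions proposed in this RFC, not a shipped vLLM API:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# Proposed per-request flag (not yet in vLLM).
params = SamplingParams(max_tokens=16, return_prompt_hidden_states=True)

outputs = llm.generate(["Hello, my name is"], params)
for out in outputs:
    # Proposed field: (num_prompt_tokens, hidden_size) tensor alongside the text.
    print(out.outputs[0].text, out.prompt_hidden_states.shape)
```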
Perf Concerns
- Default (disabled): No performance regression is expected.
- Enabled: Storing and serializing a large (num_prompt_tokens, hidden_size) tensor may impact performance, especially with long prompts. We will benchmark performance with large input lengths to assess the impact and determine if the regression is acceptable.
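To make the enabled-case cost concrete, the buffer size is simply `num_prompt_tokens * hidden_size * bytes_per_element`. A quick back-of-the-envelope helper (the model dimensions below are arbitrary examples, not benchmark results):

```python
def prompt_hidden_states_bytes(num_prompt_tokens: int,
                               hidden_size: int,
                               dtype_bytes: int = 2) -> int:
    """Size in bytes of the (num_prompt_tokens, hidden_size) buffer.

    dtype_bytes=2 corresponds to fp16/bf16 hidden states.
    """
    return num_prompt_tokens * hidden_size * dtype_bytes

# e.g. a 32k-token prompt with hidden_size 4096 in fp16:
size = prompt_hidden_states_bytes(32_768, 4096, 2)
print(size / 2**20, "MiB")  # 256.0 MiB per request
```

Numbers of this magnitude per request suggest that serialization and host-memory pressure, not compute, are the main costs to benchmark.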
Feedback Period.
1-3 weeks
CC List.
Any Other Things.
No response