🚀 The feature, motivation and pitch
In August, after GPT-OSS was released, a lot of work went into enabling MCP tool calling in vLLM. GPT-OSS is the first model that supported this, but it is a special case: GPT-OSS ships with OpenAI harmony (https://github.com/openai/harmony/) as its own parser, whereas most models ship with a chat template (e.g. MiniMax M2: https://huggingface.co/MiniMaxAI/MiniMax-M2/blob/main/chat_template.jinja), which we use via the _preprocess_chat() function.
This feature request focuses on MCP support for non-harmony models -- basically any model that uses a chat template.
At a high level, MCP works as follows in the responses API:
- a client provides an input
- the input is converted to tokens via _preprocess_chat()
- the main generate_with_builtin_tools loop drives the token generation / tool calling cycle
- tokens are generated by the engine and stored in our context
- if a tool call is needed, we run https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/serving_engine.py#L1281
- via render_for_completion(), we generate new tokens, and the loop continues until no more tool calls are needed
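The loop above can be sketched roughly as follows. Everything here (Context, parse_tool_call, run_tool, the engine callable, the "CALL:" syntax) is a hypothetical stand-in for illustration, not actual vLLM code:

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Accumulates the prompt, model outputs, and tool outputs across turns."""
    messages: list = field(default_factory=list)


def parse_tool_call(text: str):
    """Hypothetical parser: treat output starting with 'CALL:' as a tool call."""
    if text.startswith("CALL:"):
        return text[len("CALL:"):]
    return None


def run_tool(arg: str) -> str:
    """Stand-in for an MCP tool invocation (e.g. the python tool)."""
    return str(eval(arg))  # toy evaluator, e.g. "2*3" -> "6"


def generate_with_builtin_tools(engine, prompt: str, max_turns: int = 5) -> Context:
    ctx = Context(messages=[prompt])
    for _ in range(max_turns):
        # render_for_completion(): flatten the context back into a prompt
        rendered = "\n".join(ctx.messages)
        output = engine(rendered)      # engine generates new tokens
        ctx.messages.append(output)    # stored in our context
        call = parse_tool_call(output)
        if call is None:               # no more tool calls -> loop ends
            return ctx
        # tool output is appended to the context, and the loop continues
        ctx.messages.append(run_tool(call))
    return ctx
```

With a fake engine that first emits a tool call and then a final answer, the context ends up holding the prompt, the call, the tool output, and the final message in order.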
Example
For an example MCP client / server, see #29798.
MiniMax M2 serve command
```shell
VLLM_GPT_OSS_SYSTEM_TOOL_MCP_LABELS=web_search_preview,container,code_interpreter \
VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT=1 \
vllm serve MiniMaxAI/MiniMax-M2 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice \
  --trust-remote-code \
  --tool-server=localhost:8081/container,localhost:8081/browser,localhost:8081/python
```
client request
```shell
curl -X POST "http://localhost:8000/v1/responses" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy-api-key" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2",
    "input": "Multiply 64548*15151 using the python tool.",
    "tools": [
      {
        "type": "mcp",
        "server_label": "code_interpreter",
        "headers": {"test": "test"},
        "server_url": "IGNORED"
      }
    ]
  }'
```
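The same request can also be built from Python with only the standard library (the URL, dummy API key, and payload mirror the curl command above; uncomment the last line to actually hit a running server):

```python
import json
import urllib.request

payload = {
    "model": "MiniMaxAI/MiniMax-M2",
    "input": "Multiply 64548*15151 using the python tool.",
    "tools": [
        {
            "type": "mcp",
            "server_label": "code_interpreter",
            "headers": {"test": "test"},
            "server_url": "IGNORED",
        }
    ],
}
req = urllib.request.Request(
    "http://localhost:8000/v1/responses",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer dummy-api-key",
    },
)
# resp = json.load(urllib.request.urlopen(req))  # requires the server above
```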
response
```json
{
  "id": "resp_a42bc867864795cd",
  "created_at": 1764137463,
  "incomplete_details": null,
  "instructions": null,
  "metadata": null,
  "model": "moonshotai/Kimi-K2-Thinking",
  "object": "response",
  "output": [
    {
      "id": "rs_a59c0ff3d139f3ad",
      "summary": [],
      "type": "reasoning",
      "content": [
        {
          "text": " The user wants me to multiply two numbers: 64548 and 15151. I should use the Python tool to compute this accurately.\n\nLet me set up the calculation. I'll use the arithmetic multiplication operator (*) in Python. ",
          "type": "reasoning_text"
        }
      ],
      "encrypted_content": null,
      "status": null
    },
    {
      "id": "lol",
      "arguments": "{\"code\": \"result = 64548 * 15151\\nresult\", \"restart\": false}",
      "name": "code_interpreter",
      "server_label": "code_interpreter",
      "type": "mcp_call",
      "approval_request_id": null,
      "error": null,
      "output": "977966748\n",
      "status": "completed"
    },
    {
      "id": "rs_818e3eeeb7e9efa7",
      "summary": [],
      "type": "reasoning",
      "content": [
        {
          "text": " The result of multiplying 64548 by 15151 is **977,966,748**. ",
          "type": "reasoning_text"
        }
      ],
      "encrypted_content": null,
      "status": null
    },
    {
      "id": "msg_bf62d1a50301381c",
      "content": [
        {
          "annotations": [],
          "text": " The result of multiplying 64548 by 15151 is **977,966,748**.",
          "type": "output_text",
          "logprobs": null
        }
      ],
      "role": "assistant",
      "status": "completed",
      "type": "message"
    }
  ],
  "parallel_tool_calls": true,
  "temperature": 1.0,
  "tool_choice": "auto",
  "tools": [
    {
      "server_label": "code_interpreter",
      "type": "mcp",
      "allowed_tools": null,
      "authorization": null,
      "connector_id": null,
      "headers": {
        "test": "test"
      },
      "require_approval": null,
      "server_description": null,
      "server_url": "IGNORED"
    }
  ],
  "top_p": 1.0,
  "background": false,
  "max_output_tokens": 261990,
  "max_tool_calls": null,
  "previous_response_id": null,
  "prompt": null,
  "reasoning": null,
  "service_tier": "auto",
  "status": "completed",
  "text": null,
  "top_logprobs": null,
  "truncation": "disabled",
  "usage": {
    "input_tokens": 154,
    "input_tokens_details": {
      "cached_tokens": 64,
      "input_tokens_per_turn": [],
      "cached_tokens_per_turn": []
    },
    "output_tokens": 121,
    "output_tokens_details": {
      "reasoning_tokens": 0,
      "tool_output_tokens": 0,
      "output_tokens_per_turn": [],
      "tool_output_tokens_per_turn": []
    },
    "total_tokens": 275
  },
  "user": null,
  "input_messages": null,
  "output_messages": null
}
```
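A client consuming this response walks the output list and filters by item type. A minimal sketch, using a trimmed-down copy of the response above:

```python
# Trimmed copy of the response's "output" list shown above.
response = {
    "output": [
        {
            "type": "reasoning",
            "content": [{"text": "...", "type": "reasoning_text"}],
        },
        {
            "type": "mcp_call",
            "name": "code_interpreter",
            "output": "977966748\n",
        },
        {
            "type": "message",
            "content": [
                {
                    "text": "The result of multiplying 64548 by 15151 is **977,966,748**.",
                    "type": "output_text",
                }
            ],
        },
    ]
}

# Collect every MCP tool result in order.
tool_outputs = [
    item["output"] for item in response["output"] if item["type"] == "mcp_call"
]

# Pull the first assistant-visible text from the final message item.
final_text = next(
    part["text"]
    for item in response["output"] if item["type"] == "message"
    for part in item["content"] if part["type"] == "output_text"
)
```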
What we've done so far
- Minor fixes / feature additions: [responsesAPI][4] fix responseOutputItem Kimi K2 thinking bug #29555, [responsesAPI][2] parse ResponseFunctionToolCallOutputItem #29383, [responsesAPI][1] refactor construct_input_messages #29359, [Frontend] split append tool output #28333
- Set up the ResponsesParser class with a ParsableContext. These two PRs allow models that use a chat template (MiniMax M2, Kimi K2 Thinking, Qwen3) to call the python MCP tool: [responsesAPI][3] ResponsesParser to set up non harmony MCP #29413, [responsesAPI][5] ResponsesParser with tools for full MCP python loop #29798
- Built support for the browser & container tools: [responsesAPI][7] Browser, Container MCP tools for non harmony models #29989
What's next
- Build proper logging in the ResponsesParser
- Add support for passing input_messages / output_messages in the response (similar to [responsesAPI] support input output messages for non harmony models #29549) to help debug the ParsableContext. I have a WIP PR in [responsesAPI][6] input/output messages for ResponsesParser qandrew/vllm#12
- Right now we store internal state in responses API form, since it converts easily to the chat template and is more expressive than chat completions (it has reasoning, etc.). However, the responses API assumes a single output item carries only one of reasoning / tool call / output, while in some models a single turn can contain both reasoning and a tool call. We need to fix this mapping properly.
- Build support for generic MCP tools, for both harmony and non harmony models
- support streaming for non harmony MCP calls
- support other entrypoints such as Anthropic MessagesAPI
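For the state-mapping bullet above, a toy sketch of what the required split could look like: one parsed model turn fans out into multiple response output items, since each item holds only one kind of content. All names and the dict shapes here are hypothetical:

```python
def split_turn(turn: dict) -> list[dict]:
    """Map one parsed model turn (which may mix reasoning, a tool call,
    and final text) onto a list of single-purpose response output items."""
    items = []
    if turn.get("reasoning"):
        items.append({"type": "reasoning", "content": turn["reasoning"]})
    if turn.get("tool_call"):
        items.append({"type": "mcp_call", "arguments": turn["tool_call"]})
    if turn.get("text"):
        items.append({"type": "message", "content": turn["text"]})
    return items
```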
If anyone is interested in helping here please let us know!
cc @chaunceyjiang @yeqcharlotte @daniel-salib @alecsolder @heheda12345 @mgoin @njhill