
[Feature]: MCP Support for Non Harmony Models #30115

@qandrew

Description


🚀 The feature, motivation and pitch

In August, after GPT-OSS was released, a lot of work went into enabling MCP tool calling in vLLM. GPT-OSS was the first model to support this, but it is a special case: GPT-OSS ships with OpenAI Harmony (https://github.com/openai/harmony/) as its own parser, whereas most models ship with a chat template (e.g. MiniMax: https://huggingface.co/MiniMaxAI/MiniMax-M2/blob/main/chat_template.jinja) that we apply via the _preprocess_chat() function.

This feature request focuses on MCP support for non-Harmony models, i.e. any model that uses a chat template.

At a high level, MCP in the Responses API works like this: the model emits an MCP tool call, vLLM forwards it to the configured tool server, and the tool output is fed back into the model's context before generation resumes. See the example MCP client / server in #29798 for more details.
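The high-level flow above can be sketched as a generate/tool-call loop. This is an illustrative sketch only, not vLLM's actual internals; `generate` and `call_mcp_tool` are hypothetical callables standing in for the model step and the MCP tool-server round trip.

```python
def run_mcp_loop(generate, call_mcp_tool, prompt_messages, max_turns=8):
    """Alternate model generation with MCP tool invocations until the
    model produces a final answer instead of a tool call (sketch)."""
    messages = list(prompt_messages)
    for _ in range(max_turns):
        turn = generate(messages)          # model output, parsed via the chat template
        messages.append(turn)
        if turn.get("tool_call") is None:  # no tool call -> final answer, stop looping
            return messages
        # Forward the call to the MCP tool server and feed the result back in.
        result = call_mcp_tool(turn["tool_call"]["name"],
                               turn["tool_call"]["arguments"])
        messages.append({"role": "tool", "content": result})
    return messages
```

The key point is that the loop lives server-side: the client sends one Responses API request and receives the whole trajectory (reasoning, mcp_call, message) back.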

Minimax M2

VLLM_GPT_OSS_SYSTEM_TOOL_MCP_LABELS=web_search_preview,container,code_interpreter \
VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT=1 \
vllm serve MiniMaxAI/MiniMax-M2 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice \
  --trust-remote-code \
  --tool-server=localhost:8081/container,localhost:8081/browser,localhost:8081/python

client request

curl -X POST "http://localhost:8000/v1/responses"   -H "Content-Type: application/json"   -H "Authorization: Bearer dummy-api-key"   -d '{
        "model": "MiniMaxAI/MiniMax-M2",
        "input": "Multiply 64548*15151 using the python tool.",
        "tools": [
          {
            "type": "mcp",
            "server_label": "code_interpreter",
            "headers": {"test": "test"},
            "server_url": "IGNORED"
          }
        ]
      }'
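The same request can be issued from Python with only the standard library; a minimal sketch mirroring the curl above (the endpoint and dummy API key are the ones from this example setup):

```python
import json
import urllib.request

# Same payload as the curl example above.
payload = {
    "model": "MiniMaxAI/MiniMax-M2",
    "input": "Multiply 64548*15151 using the python tool.",
    "tools": [
        {
            "type": "mcp",
            "server_label": "code_interpreter",
            "headers": {"test": "test"},
            "server_url": "IGNORED",
        }
    ],
}

def build_request(url="http://localhost:8000/v1/responses"):
    """Build the POST request; sending it requires a running vLLM server."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer dummy-api-key"},
    )

# resp = urllib.request.urlopen(build_request())  # only works with the server up
```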

response

{
    "id": "resp_a42bc867864795cd",
    "created_at": 1764137463,
    "incomplete_details": null,
    "instructions": null,
    "metadata": null,
    "model": "MiniMaxAI/MiniMax-M2",
    "object": "response",
    "output": [
        {
            "id": "rs_a59c0ff3d139f3ad",
            "summary": [],
            "type": "reasoning",
            "content": [
                {
                    "text": " The user wants me to multiply two numbers: 64548 and 15151. I should use the Python tool to compute this accurately.\n\nLet me set up the calculation. I'll use the arithmetic multiplication operator (*) in Python. ",
                    "type": "reasoning_text"
                }
            ],
            "encrypted_content": null,
            "status": null
        },
        {
            "id": "lol",
            "arguments": "{\"code\": \"result = 64548 * 15151\\nresult\", \"restart\": false}",
            "name": "code_interpreter",
            "server_label": "code_interpreter",
            "type": "mcp_call",
            "approval_request_id": null,
            "error": null,
            "output": "977966748\n",
            "status": "completed"
        },
        {
            "id": "rs_818e3eeeb7e9efa7",
            "summary": [],
            "type": "reasoning",
            "content": [
                {
                    "text": " The result of multiplying 64548 by 15151 is **977,966,748**. ",
                    "type": "reasoning_text"
                }
            ],
            "encrypted_content": null,
            "status": null
        },
        {
            "id": "msg_bf62d1a50301381c",
            "content": [
                {
                    "annotations": [],
                    "text": " The result of multiplying 64548 by 15151 is **977,966,748**.",
                    "type": "output_text",
                    "logprobs": null
                }
            ],
            "role": "assistant",
            "status": "completed",
            "type": "message"
        }
    ],
    "parallel_tool_calls": true,
    "temperature": 1.0,
    "tool_choice": "auto",
    "tools": [
        {
            "server_label": "code_interpreter",
            "type": "mcp",
            "allowed_tools": null,
            "authorization": null,
            "connector_id": null,
            "headers": {
                "test": "test"
            },
            "require_approval": null,
            "server_description": null,
            "server_url": "IGNORED"
        }
    ],
    "top_p": 1.0,
    "background": false,
    "max_output_tokens": 261990,
    "max_tool_calls": null,
    "previous_response_id": null,
    "prompt": null,
    "reasoning": null,
    "service_tier": "auto",
    "status": "completed",
    "text": null,
    "top_logprobs": null,
    "truncation": "disabled",
    "usage": {
        "input_tokens": 154,
        "input_tokens_details": {
            "cached_tokens": 64,
            "input_tokens_per_turn": [],
            "cached_tokens_per_turn": []
        },
        "output_tokens": 121,
        "output_tokens_details": {
            "reasoning_tokens": 0,
            "tool_output_tokens": 0,
            "output_tokens_per_turn": [],
            "tool_output_tokens_per_turn": []
        },
        "total_tokens": 275
    },
    "user": null,
    "input_messages": null,
    "output_messages": null
}
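For reference, the mcp_call result can be pulled out of the output list shown above like this (a sketch over the JSON shape of the response, assuming it has been loaded as a dict):

```python
def extract_mcp_outputs(response):
    """Collect (tool name, output) pairs from completed mcp_call items."""
    return [(item["name"], item["output"])
            for item in response["output"]
            if item["type"] == "mcp_call" and item["status"] == "completed"]

# Trimmed-down version of the response above.
response = {"output": [
    {"type": "reasoning", "content": []},
    {"type": "mcp_call", "name": "code_interpreter",
     "output": "977966748\n", "status": "completed"},
    {"type": "message", "role": "assistant"},
]}
```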

What we've done so far

What's next

  • Build proper logging in the ResponsesParser
  • Add support for passing input_messages / output_messages in the response (similar to [responsesAPI] support input output messages for non harmony models #29549) to help debug the ParsableContext. I have a WIP PR in [responsesAPI][6] input/output messages for ResponsesParser qandrew/vllm#12
  • Right now we store internal state in the Responses API format, as it converts easily to the chat template and is more expressive than Chat Completions (it has reasoning, etc.). However, the Responses API assumes each output item contains only one of reasoning / tool call / output, while in other models a single turn can contain both reasoning and a tool call. We need to fix this mapping properly.
  • Build support for generic MCP tools, for both harmony and non harmony models
  • support streaming for non harmony MCP calls
  • support other entrypoints such as Anthropic MessagesAPI
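The mapping issue in the third bullet amounts to a split step: one model turn may carry several kinds of content that must become separate Responses-style output items. A sketch with illustrative dict shapes (not vLLM's actual types):

```python
def split_turn(turn):
    """Split one model turn that may carry reasoning, a tool call, and/or
    text into separate Responses-style output items (illustrative only)."""
    items = []
    if turn.get("reasoning"):
        items.append({"type": "reasoning", "content": turn["reasoning"]})
    if turn.get("tool_call"):
        # Flatten the tool call into an mcp_call item.
        items.append({"type": "mcp_call", **turn["tool_call"]})
    if turn.get("text"):
        items.append({"type": "message", "content": turn["text"]})
    return items
```

A turn with both reasoning and a tool call then yields two items instead of forcing one field to be dropped.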

If anyone is interested in helping here please let us know!

cc @chaunceyjiang @yeqcharlotte @daniel-salib @alecsolder @heheda12345 @mgoin @njhill

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
