@qgallouedec qgallouedec commented Sep 21, 2025

import random

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

factors = [(random.randint(0, 9999), random.randint(0, 9999)) for _ in range(100)]
dataset = Dataset.from_dict(
    {
        "prompt": [[{"role": "user", "content": f"Multiply {a} and {b}."}] for a, b in factors],
        "result": [a * b for a, b in factors],
    }
)


def multiply(a: int, b: int) -> int:
    """Multiply two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b

def accuracy_reward(completions, result, **kwargs):
    # Reward 1 if the ground-truth product appears in the completion's first message.
    return [int(str(r) in c[0]["content"]) for c, r in zip(completions, result)]


trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(use_vllm=True, vllm_mode="colocate", vllm_importance_sampling_correction=False),
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
    tools=[multiply],
)
trainer.train()

@qgallouedec changed the title from "a bit messy!" to "Multi-turn tool calling support" on Sep 21, 2025
"""
Given a list of strings, extract all <tool_call> JSON blocks and return them as a list of dictionaries.
"""
pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

There is unfortunately no standardisation of the tool call tags across model families, but Matt is working on extending the chat templates so they can auto-parse tool calls (internal Slack thread): https://huggingface.slack.com/archives/C06JKEMK6BZ/p1757691450090859

Note that vLLM works around this by providing a dedicated set of parsers that can be set when spinning up the server: https://docs.vllm.ai/en/stable/features/tool_calling.html

I'm not sure we want to go down this route, since in my experience it's quite messy to match the parser to the desired model (e.g. some Qwen models use the hermes parser, others don't).

So in the meantime, we might want to give users the ability to provide their own parsing function and default to yours (which is the most common one I've seen).
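A minimal sketch of what that could look like, assuming a pluggable `parser` callable with a default for the common `<tool_call> ... </tool_call>` convention. The names `default_tool_call_parser` and `parse_completion` are illustrative, not existing TRL API:

```python
import json
import re
from typing import Callable, Optional

_TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def default_tool_call_parser(text: str) -> Optional[dict]:
    """Return the first well-formed <tool_call> JSON block, or None."""
    for match in _TOOL_CALL_PATTERN.findall(text):
        try:
            return json.loads(match)
        except json.JSONDecodeError:
            continue
    return None

def parse_completion(text: str, parser: Callable = default_tool_call_parser):
    # Users targeting a model family with different tags pass their own parser.
    return parser(text)
```

Model families with other tag conventions would then only need to swap the callable, without touching the trainer.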


That is the approach we usually followed on smolagents: to provide a sensible default, but allow users to fully customize the function/object instance.

[prompt_mask[needs_tool], completion_mask[needs_tool]], dim=1
).sum(-1)
tool_ids = [ids[-num:] for ids, num in zip(new_prompt_ids, num_tool_ids)]
tool_mask = [torch.ones_like(ids) for ids in tool_ids]

Would be cool to have a unit test for this masking so we're confident it is behaving as expected
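A framework-free sketch of such a test, using plain lists in place of tensors (the slicing semantics are identical); `extract_tool_ids` is a stand-in name for the suffix-slicing logic quoted above:

```python
def extract_tool_ids(new_prompt_ids, num_tool_ids):
    # Tool tokens are assumed to sit at the end of each new prompt, so a
    # negative slice recovers exactly the appended tool span.
    return [ids[-num:] for ids, num in zip(new_prompt_ids, num_tool_ids)]

def test_tool_id_extraction():
    # two sequences, with 2 and 3 tool tokens appended respectively
    new_prompt_ids = [[10, 11, 12, 90, 91], [20, 21, 95, 96, 97]]
    num_tool_ids = [2, 3]
    tool_ids = extract_tool_ids(new_prompt_ids, num_tool_ids)
    assert tool_ids == [[90, 91], [95, 96, 97]]
    # the mask is all-ones over exactly the tool span
    tool_mask = [[1] * len(ids) for ids in tool_ids]
    assert [len(m) for m in tool_mask] == num_tool_ids

test_tool_id_extraction()
```

One edge worth covering explicitly: `ids[-0:]` returns the whole sequence in Python, so a completion with zero tool tokens would need special handling in both the implementation and the test.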

@albertvillanova left a comment

Awesome!! Looking forward to having this feature!

Some comments, suggestions, and questions below.

"""
Given a list of strings, extract all <tool_call> JSON blocks and return them as a list of dictionaries.
"""
pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the approach we usually followed on smolagents: to provide a sensible default, but allow users to fully customize the function/object instance.

RewardFunc = Union[str, PreTrainedModel, Callable[[list, list], list[float]]]


def extract_tool_calls(text: str) -> dict[str, Any]:

What about moving this function to a non-specific trainer module, so it can be used by any trainer in the future?

"""
Given a list of strings, extract all <tool_call> JSON blocks and return them as a list of dictionaries.
"""
pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
@albertvillanova commented Sep 23, 2025

Note that re.compile will run on every call to this function. If performance matters here, we should hoist the compiled pattern to a module-level constant so it is built only once, at import time.
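A sketch of the hoisting being suggested. (CPython's `re` module also caches recently compiled patterns internally, so the practical speedup is usually small; the module-level constant mainly avoids the per-call cache lookup and makes the intent explicit.)

```python
import re

def extract_with_local_compile(text):
    # compiled (or fetched from re's internal cache) on every call
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return pattern.findall(text)

# Module-level constant: compiled exactly once, at import time.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_with_module_constant(text):
    return TOOL_CALL_PATTERN.findall(text)
```

Both variants return identical results; only the compilation cost moves.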

callbacks: Optional[list[TrainerCallback]] = None,
optimizers: tuple[Optional[torch.optim.Optimizer], Optional[torch.optim.lr_scheduler.LambdaLR]] = (None, None),
peft_config: Optional["PeftConfig"] = None,
tools=None,

You will need to add the tools param to the trainer docstring, and give it a type hint.


for match in pattern.findall(text):
    try:
        return json.loads(match)

You only return the first match?
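For comparison, a sketch of a variant that collects every well-formed block instead of stopping at the first; the function name is illustrative, not the PR's actual implementation:

```python
import json
import re

TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_all_tool_calls(text):
    """Collect every well-formed <tool_call> JSON block, not just the first."""
    calls = []
    for match in TOOL_CALL_PATTERN.findall(text):
        try:
            calls.append(json.loads(match))
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than aborting
    return calls
```

Downstream code would then iterate over the list, executing each call in order.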

Comment on lines +1430 to +1435
tool_calls = [extract_tool_calls(completion) for completion in completions]
tool_results = [self._tool_dict[tc["name"]](**tc["arguments"]) if tc else None for tc in tool_calls]
tool_messages = [
    [{"role": "tool", "name": tc["name"], "content": str(tr)}] if tc else None
    for tc, tr in zip(tool_calls, tool_results)
]

Not sure if this handles potential multiple tool calls in a single completion...

Comment on lines +1430 to +1435
tool_calls = [extract_tool_calls(completion) for completion in completions]
tool_results = [self._tool_dict[tc["name"]](**tc["arguments"]) if tc else None for tc in tool_calls]
tool_messages = [
    [{"role": "tool", "name": tc["name"], "content": str(tr)}] if tc else None
    for tc, tr in zip(tool_calls, tool_results)
]

Before the messages with the tool results ("role": "tool"), shouldn't we prepend the messages with the tool calls themselves ("role": "assistant", "tool_calls": ...)? I'm not sure about this though... a real question! 😅
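For reference, a sketch of the conversation shape many chat templates expect, with the assistant turn that issued the call preceding the `tool` turn. Field names follow the common Hugging Face / OpenAI-style convention; whether the trainer must prepend the assistant turn depends on the chat template in use:

```python
tool_call = {"name": "multiply", "arguments": {"a": 3, "b": 4}}
tool_result = 12

messages = [
    {"role": "user", "content": "Multiply 3 and 4."},
    # the assistant turn that issued the call
    {"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]},
    # the tool turn carrying the execution result
    {"role": "tool", "name": tool_call["name"], "content": str(tool_result)},
]
```

Some templates raise or silently misrender when a `tool` turn appears without a preceding assistant turn containing `tool_calls`, which is what makes the question above worth settling.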

@August-murr (Contributor) commented

I suggest that we remove any references to including tools as a parameter in the GRPOTrainer:

trainer = GRPOTrainer(
    …,
    tools=[tool_1, tool_2],
)

While I am not necessarily against chat templates or tool-parsing implementations, I believe the tools attribute should be eliminated.

To maximize scalability, a better approach would be to create an all-in-one tools sandbox. The only tool the training script would interact with is a sandboxed code executor, with all the necessary tools defined within it, for example by initializing the sandbox from a Docker image that contains the tools' dependencies.

Therefore, I propose that we only add an Environment parameter, which would encompass the code executor along with all the initialized tools.

The environment would be responsible for handling tool usage and execution.

You could then create built-in environments using your preferred parsing and chat templates.

qgallouedec commented Sep 23, 2025

Yes, I have seen some work suggesting that when scaling up, multiplying tools works less well than a smaller, more generic set of tools, such as the one you describe. However, the approach proposed here seems compatible with that:

trainer = GRPOTrainer(
    …,
    tools=[my_big_containerized_all_in_one_tool],
)

In the end it's up to the user to decide how to design it; the most important thing here is to allow for this flexibility.

@August-murr (Contributor) commented

But the name tools as a param in GRPOTrainer is misleading. What I suggest is more like:

MyEnv = DefaultEnv(code_executor=my_big_containerized_all_in_one_tool)

trainer = GRPOTrainer(
    …,
    Environment=MyEnv,
)

The user can then customize tool use in their own environment.
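To make the proposal concrete, here is a hypothetical sketch; none of these names exist in TRL, they only illustrate the shape of the suggested abstraction:

```python
class DefaultEnv:
    """Owns tool registration and execution, so the trainer sees one object."""

    def __init__(self, tools):
        # register each tool under its function name
        self._tools = {tool.__name__: tool for tool in tools}

    def execute(self, name, arguments):
        # dispatch a parsed tool call to the matching registered tool
        return self._tools[name](**arguments)

def multiply(a, b):
    return a * b

env = DefaultEnv(tools=[multiply])
```

Parsing and chat-template handling would live on the environment as well, keeping the trainer agnostic to how tool calls are formatted.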

Base automatically changed from generate-method to main September 26, 2025 02:48