perf: math rubric skip overlong answers #1046
Merged
…nses
Add strict_extract_boxed_answer that returns empty string on no \boxed{}
match (instead of returning the full text). Add max_verify_chars guard
to MathRubric to skip math_verify for responses exceeding 50k chars,
preventing thread pool starvation from pathologically long expressions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
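The length guard described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `guarded_verify`, the `verify` callback, and the choice to score skipped responses as 0.0 are assumptions made for the example. Only the 50k-character default and the idea of skipping verification for overlong parsed responses come from the commit message.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)

# Default limit taken from the commit message (50k characters).
MAX_VERIFY_CHARS = 50_000


def guarded_verify(
    parsed: str,
    answer: str,
    verify: Callable[[str, str], bool],
    max_verify_chars: int = MAX_VERIFY_CHARS,
) -> float:
    """Hypothetical helper: run `verify` only on reasonably sized input.

    `verify` stands in for the expensive symbolic check; it is passed in
    so the sketch does not depend on any particular library.
    """
    if not parsed:
        # Nothing was extracted, so there is nothing to verify.
        return 0.0
    if len(parsed) > max_verify_chars:
        # Assumption: a skipped response is scored as incorrect.
        logger.warning(
            "Skipping verification: parsed response has %d chars (limit %d)",
            len(parsed),
            max_verify_chars,
        )
        return 0.0
    return 1.0 if verify(parsed, answer) else 0.0
```

Checking the length before dispatching to the verifier means an overlong expression never occupies a worker in the thread pool at all, which is the starvation scenario the commit message refers to.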
7f37a0f to 9aeb520
Problem:
When a completion contains no \boxed{} tag, extract_boxed_answer returns
the entire input text. This is passed to math_verify, which matches any
number in the text — allowing a model to get correct-answer credit by
mentioning the answer anywhere without using \boxed{}.
During RL training, this means a model can skip the \boxed{} format
entirely and still score 1.0 by embedding the correct number in its
reasoning text. The strategy scoreboard from rewardprobe shows the
impact: "correct_lazy" (just outputting the answer) scores 1.0, while
"perfect" (full reasoning + boxed answer) scores only 0.67.
Fix:
Add a `strict` parameter to extract_boxed_answer (default: False).
When strict=True, returns "" on no match instead of the full text.
MathRubric now uses strict=True via functools.partial.
This is backwards compatible:
- extract_boxed_answer(text) still returns text (default strict=False)
- Only MathRubric's parser uses strict=True
- Other callers (rlm_env.py, etc.) are unaffected
- Tests updated to use \boxed{} format in completions
Found using rewardprobe (https://github.com/chopratejas/rewardprobe).
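The behavior change described in this commit can be illustrated with a small stand-in. The function below is a hypothetical sketch, not the library's implementation: the name `extract_boxed_answer_sketch`, the choice of the last `\boxed{}` occurrence, and the brace-matching details are all assumptions. Only the `strict` flag, its default of False, and the use of `functools.partial` come from the commit message.

```python
from functools import partial

BOXED_PREFIX = "\\boxed{"


def extract_boxed_answer_sketch(text: str, strict: bool = False) -> str:
    """Hypothetical stand-in for a boxed-answer extractor.

    Returns the contents of the last boxed expression in `text`.
    With no match, returns `text` unchanged (strict=False) or an
    empty string (strict=True).
    """
    start = text.rfind(BOXED_PREFIX)
    if start == -1:
        return "" if strict else text

    content_start = start + len(BOXED_PREFIX)
    depth = 1  # we are already inside the opening brace
    for pos in range(content_start, len(text)):
        ch = text[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:pos]

    # Unbalanced braces: treat as "no match" (an assumption of this sketch).
    return "" if strict else text


# A rubric can bind strict=True once and use the result as its extractor.
strict_extract = partial(extract_boxed_answer_sketch, strict=True)
```

With the default, `extract_boxed_answer_sketch("the answer is 42")` hands the whole sentence to whatever runs next, while `strict_extract("the answer is 42")` returns an empty string, so an unboxed mention of the answer cannot reach the verifier.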
Wrap the timeout test completion in \boxed{} so the strict parser can
extract it, and raise max_verify_chars to allow the 100k-char string
through the length check to actually exercise the timeout logic.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
67eb88b to 823541a
Now redundant since extract_boxed_answer supports strict=True directly via
the cherry-picked fix from chopratejas.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
The invalid-answer tests were still passing raw completions without
\boxed{}, so the strict parser returned "" before math_verify ran.
Wrapping in \boxed{} ensures the tests exercise actual math verification.
Also update docs/reference.md with the new extract_boxed_answer(strict)
signature.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
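The test issue described in this commit can be reproduced with a small sketch. Everything here is hypothetical and simplified: `strict_extract` is a regex stand-in without nested-brace support, `score` is an illustrative scoring function, and `spy_verify` records calls in place of the real verifier. The point is only that a raw invalid completion never reaches verification under a strict parser, while a boxed one does.

```python
import re


def strict_extract(text: str) -> str:
    # Simplified stand-in: matches a boxed expression without nested braces.
    match = re.search(r"\\boxed\{([^{}]*)\}", text)
    return match.group(1) if match else ""


def score(completion: str, answer: str, verify) -> float:
    parsed = strict_extract(completion)
    if not parsed:
        # Strict parser found nothing, so the verifier is never called.
        return 0.0
    return 1.0 if verify(parsed, answer) else 0.0


calls = []


def spy_verify(parsed: str, answer: str) -> bool:
    # Records every input so a test can tell whether verification ran.
    calls.append(parsed)
    return parsed == answer


# Raw invalid completion: scores 0.0, but only because extraction failed.
raw_score = score("not a number", "42", spy_verify)
calls_after_raw = len(calls)

# Boxed invalid completion: scores 0.0 because verification rejected it.
boxed_score = score("\\boxed{not a number}", "42", spy_verify)
calls_after_boxed = len(calls)
```

Both completions score 0.0, so a test that asserts only on the score passes either way. Only the boxed variant exercises the verification path.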

Description
Two fixes to improve the performance of the math rubric at high concurrency:
- use `extract_boxed_answer` in strict mode to avoid symbolic parsing of the full model response (and also false positives, as mentioned in "fix: extract_boxed_answer returns full text when no \boxed{} found" #1028)
- skip `math_verify` when the parsed response exceeds `max_verify_chars` (default 50_000)
Type of Change
Testing
`uv run pytest` locally.
Checklist
Additional Notes
Note
Medium Risk
Moderate risk because it changes `MathRubric` scoring behavior (unboxed answers now score 0 and long parsed responses are skipped), which can affect training/eval metrics; changes are localized and guarded by configurable limits.
Overview
`MathRubric` now enforces boxed-format answers by default and avoids expensive verification on huge outputs. It switches the default parser extractor to `extract_boxed_answer(strict=True)`, so responses without a `\boxed{}` final answer no longer get passed through to symbolic parsing.
Adds a configurable `max_verify_chars` (default `50_000`) and skips `math_verify` when the parsed response exceeds this limit (with a warning), improving throughput under high concurrency.
Updates `extract_boxed_answer` API/docs to support `strict` mode, and adjusts tests to require boxed completions and to cover the new length limit behavior.
Written by Cursor Bugbot for commit 8d5dbbe.