TOML delegator filters (just/mise/task) truncate wrapped tool output, hiding test failures #1065

@shaswatshah

Summary

Built-in TOML filters for task-runner delegators — just, mise, task, and the in-flight poe (#1062) — apply max_lines = 50 to the wrapped tool's raw output. For tasks that wrap pytest / cargo test / eslint / pyright / etc., this routinely truncates the summary line and failure tracebacks — exactly the part the LLM needs to act on.

In some cases the filter also performs zero useful stripping on the wrapped output and only contributes the 50-line cap, making it a pure regression vs. running with RTK_NO_TOML=1.

Concrete repro: rtk just test wrapping pytest

Given a Justfile like:

test:
    pytest -n auto -m 'not integration and not flaky'

Real pytest output for a project with a few hundred tests and one failure:

============================= test session starts ==============================
platform darwin -- Python 3.13.x, pytest-8.x.x, pluggy-1.x.x
rootdir: /Users/me/proj
configfile: pyproject.toml
plugins: xdist-3.x, asyncio-0.21, mock-3.14, ...
8 workers [342 items]
........................................                                 [ 11%]
........................................                                 [ 23%]
........................................                                 [ 35%]
... (35+ more lines of progress dots) ...
=================================== FAILURES ===================================
___________________________ test_pacing_calculation ____________________________
[20-line traceback]
========================== short test summary info =============================
FAILED tests/budget_pacer/test_pacer.py::test_pacing_calculation - AssertionError
======================= 1 failed, 341 passed in 14.52s =========================

After rtk just test is processed by src/filters/just.toml:

  1. strip_ansi
  2. strip_lines_matching — matches nothing in pytest output (the patterns only match just --list's own help banner)
  3. truncate_lines_at = 150 — harmless
  4. max_lines = 50 — keeps the header (~6 lines) + ~44 lines of progress dots, then cuts off
  5. The LLM never sees FAILURES, the traceback, the FAILED line, or the 1 failed, 341 passed summary

The result is worse than no filter at all: pytest looks like it hung mid-run instead of failing. The agent has no signal to act on.
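To make the failure mode concrete, here is a minimal model of the cap. `apply_max_lines` and `fake_pytest_output` are hypothetical helpers written for this issue, not RTK's actual code; the only assumption is that `max_lines` keeps the head of the output and drops the rest, as described above.

```rust
/// Simplified stand-in for the `max_lines` stage: keep the first `max_lines`
/// lines and drop everything after them. (Hypothetical, not RTK's code.)
fn apply_max_lines(output: &str, max_lines: usize) -> String {
    output.lines().take(max_lines).collect::<Vec<_>>().join("\n")
}

/// Build pytest-like output: a 6-line header, `progress_lines` rows of dots,
/// then the failure summary that the agent actually needs.
fn fake_pytest_output(progress_lines: usize) -> String {
    let mut lines: Vec<String> = vec![
        "===== test session starts =====".into(),
        "platform darwin -- Python 3.13.x".into(),
        "rootdir: /Users/me/proj".into(),
        "configfile: pyproject.toml".into(),
        "plugins: xdist-3.x".into(),
        "8 workers [342 items]".into(),
    ];
    for _ in 0..progress_lines {
        lines.push(".".repeat(40));
    }
    lines.push(
        "FAILED tests/budget_pacer/test_pacer.py::test_pacing_calculation - AssertionError".into(),
    );
    lines.push("===== 1 failed, 341 passed in 14.52s =====".into());
    lines.join("\n")
}
```

With 60 progress lines and a cap of 50, everything from the FAILED line onward is gone, and nothing in the surviving output hints that a failure occurred.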

Why this happens

When rtk just test falls into run_fallback (src/main.rs:1054), the first TOML filter whose match_command regex hits is selected — ^just\b. The TOML pipeline (src/core/toml_filter.rs) then applies generic line filtering with no awareness that just test actually executed pytest underneath. There is no routing from a delegator's output back to the wrapped tool's dedicated filter.
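A rough model of that selection step, to show why the recipe name never influences the choice. This is a sketch under assumptions: `TomlFilter` and `first_matching` are illustrative names, and the `^just\b` regex is approximated with a prefix check plus a word-boundary test so the example needs no regex crate.

```rust
/// Illustrative stand-in for a TOML filter entry.
struct TomlFilter {
    name: &'static str,
    /// Approximates a `^word\b` match_command regex.
    command_word: &'static str,
}

/// True when `command_line` starts with the filter's word followed by a
/// word boundary (end of string or a non-word character).
fn matches(filter: &TomlFilter, command_line: &str) -> bool {
    match command_line.strip_prefix(filter.command_word) {
        Some(rest) => !rest.starts_with(|c: char| c.is_alphanumeric() || c == '_'),
        None => false,
    }
}

/// First match wins; nothing after the delegator's name is inspected.
fn first_matching<'a>(filters: &'a [TomlFilter], command_line: &str) -> Option<&'a TomlFilter> {
    filters.iter().find(|f| matches(f, command_line))
}
```

In this model, `just test` and `just lint-fix` both resolve to the same `just` entry, so whatever the recipe runs is filtered identically.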

Affected filters

| File | max_lines | Stripping useful for wrapped output? |
|---|---|---|
| src/filters/just.toml | 50 | No (patterns only match just --list) |
| src/filters/mise.toml | 50 | Partial (strips mise install noise, useless for mise run <task>) |
| src/filters/task.toml | 50 | Partial (strips task: [name] cmd headers) |
| src/filters/poe.toml (#1062) | 50 | Partial (strips Poe => headers) |

All four follow the same delegator pattern and share the same architectural problem. poe is on track to inherit it via #1062.

Proposed direction (sketch, not prescriptive)

A delegator filter type that, after stripping the wrapper's own preamble, routes the remaining stdout through whatever filter would have applied to the wrapped command. Concretely for rtk just test running pytest:

  1. TOML filter matches ^just\b
  2. Strip just-specific preamble (currently a no-op)
  3. Detect or declare the wrapped tool (pytest)
  4. Re-apply RTK's pytest filter (Rust module, src/cmds/python/) to the remaining output
  5. Return final filtered output

Recommended primary approach: parse the project file

The cleanest path is parsing the delegator's own project file to resolve task → wrapped command before the task runs. Each delegator has exactly one well-known config file format:

| Delegator | Project file | Where the wrapped command lives |
|---|---|---|
| just | Justfile (or justfile, .justfile) | recipe body lines |
| mise | .mise.toml / mise.toml / .config/mise.toml | [tasks.<name>] run = "..." |
| task | Taskfile.yml / Taskfile.yaml | tasks.<name>.cmds |
| poe (#1062) | pyproject.toml | [tool.poe.tasks] <name> = "..." or {cmd = "..."} |

Why this is the right primary path:

  1. Unambiguous and authoritative. The project file is the source of truth for what a task does. No guessing, no parsing tool stdout, no relying on the wrapper to echo the command.
  2. Resolved before execution. RTK can decide which downstream filter to apply before spawning the child, which means it can pick the right Stdio strategy (streamed vs. buffered) based on the wrapped tool. This also fixes the streaming-server problem (uvicorn --reload etc.) for free — if the resolved command is a streaming server, skip TOML buffering.
  3. No new TOML schema. No need to add wraps = ... to every filter or every task. Filters stay declarative and small.
  4. Resilient to missing project files. If parsing fails or the file isn't found, fall back to the current line-stripping behavior — strictly no regression.
  5. Works for chained commands. A recipe such as just lint-fix that runs ruff check --fix && ruff format resolves to two commands; RTK picks the dominant filter (or applies both in sequence) instead of giving up.
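The task-to-command resolution this approach depends on could look roughly like the following for just. It is a deliberately naive sketch: `recipe_body` and `is_recipe_header` are hypothetical names, and it ignores attributes, line continuations, shebang recipes, and blank lines inside a recipe body. A real implementation would need a proper Justfile parser.

```rust
/// True when `line` looks like the header of recipe `name`,
/// e.g. `test:`, `test: build`, or `test arg:`.
fn is_recipe_header(line: &str, name: &str) -> bool {
    // Indented lines are recipe bodies; `#` lines are comments.
    if line.starts_with(char::is_whitespace) || line.starts_with('#') {
        return false;
    }
    match line.split_once(':') {
        // `x := 1` is a variable assignment, not a recipe.
        Some((head, rest)) if !rest.starts_with('=') => {
            head.split_whitespace().next() == Some(name)
        }
        _ => false,
    }
}

/// Return the body lines of recipe `name`, or None if it is not defined.
/// The body is every indented line that directly follows the header.
fn recipe_body(justfile: &str, name: &str) -> Option<Vec<String>> {
    let mut lines = justfile.lines();
    while let Some(line) = lines.next() {
        if is_recipe_header(line, name) {
            let body: Vec<String> = lines
                .by_ref()
                .take_while(|l| l.starts_with(' ') || l.starts_with('\t'))
                // Drop indentation and just's leading `@` quiet marker.
                .map(|l| l.trim().trim_start_matches('@').to_string())
                .collect();
            return Some(body);
        }
    }
    None
}
```

Returning `None` for an unknown recipe maps directly onto point 4: the caller falls back to the current line-stripping behavior.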

Sketch of the flow for rtk just test:

1. run_fallback sees `just test`
2. Match TOML filter `^just\b` (existing behavior)
3. NEW: parse ./Justfile, find recipe `test`, extract command line `pytest -n auto -m '...'`
4. NEW: classify the resolved command via the existing `find_matching_filter` / Clap dispatch
5. NEW: if a Rust filter matches (`pytest` → src/cmds/python/), spawn `just test` and pipe output through that filter instead of the generic `just` TOML pipeline
6. NEW: if no specific filter matches, fall back to the current `just` TOML pipeline (today's behavior)

The new logic is additive: existing filter behavior is the fallback, so it's a strict improvement.
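Steps 4 to 6 could be sketched as below. Everything here is an assumption for illustration: `FilterChoice`, `DEDICATED`, `classify`, and `choose_filter` are made-up names, the table of dedicated filters is hypothetical, and "first recognized command wins" is only one possible reading of "dominant filter" for chained commands.

```rust
#[derive(Debug, PartialEq)]
enum FilterChoice {
    /// A dedicated filter for the wrapped tool (e.g. the pytest module).
    Dedicated(&'static str),
    /// Today's generic delegator TOML pipeline, used as the fallback.
    DelegatorToml,
}

/// Hypothetical table of programs that have a dedicated filter.
const DEDICATED: &[&str] = &["pytest", "ruff"];

/// Classify a single command by its program name (first token).
fn classify(command: &str) -> Option<FilterChoice> {
    let program = command.split_whitespace().next()?;
    DEDICATED
        .iter()
        .copied()
        .find(|known| *known == program)
        .map(FilterChoice::Dedicated)
}

/// Pick a filter for a resolved recipe body. Lines are split on `&&` so
/// chained commands are considered one by one; the first recognized
/// command wins, and an unrecognized recipe keeps today's behavior.
fn choose_filter(recipe_lines: &[&str]) -> FilterChoice {
    recipe_lines
        .iter()
        .flat_map(|line| line.split("&&"))
        .find_map(|cmd| classify(cmd))
        .unwrap_or(FilterChoice::DelegatorToml)
}
```

The `unwrap_or` at the end is what keeps the change additive: a recipe with no recognized command takes exactly the path it takes today.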

Alternative detection paths (for reference, not recommended as primary)

  • Parse the delegator's own stdout. Some delegators echo the command they're about to run (task: [build] go build ./..., Poe => pytest ...). Works after the fact but can't inform Stdio choices, and not all delegators echo (just doesn't by default).
  • Declarative TOML config. Add wraps = "pytest" per filter. Requires authors to manually map every task — doesn't scale across projects with custom recipes.
  • Output-format heuristics. Sniff for ===== test session starts ===== etc. Fragile and order-dependent.

These can supplement the primary path (e.g. as fallbacks for unparseable project files), but should not be the main mechanism.
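As a fallback, the stdout-parsing option is small. This sketch assumes the two echo formats quoted above (`task: [name] cmd` and `Poe => cmd`) appear verbatim after ANSI stripping; `echoed_command` is a hypothetical helper, and the real header formats should be checked against each tool before relying on them.

```rust
/// Extract the wrapped command from a delegator's echo line, if the line
/// matches one of the two formats quoted in this issue.
fn echoed_command(line: &str) -> Option<&str> {
    // `task: [build] go build ./...`
    if let Some(rest) = line.strip_prefix("task: [") {
        let (_task_name, cmd) = rest.split_once("] ")?;
        return Some(cmd.trim());
    }
    // `Poe => pytest ...`
    line.strip_prefix("Poe => ").map(str::trim)
}
```

Because the header only appears once the child is already running, this can select a filter for buffered output but cannot influence the Stdio strategy.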

Other open questions

  • Where does routing live? A second pass through find_matching_filter after parsing the project file would suffice, but it breaks the current "one filter per command" mental model in src/core/toml_filter.rs. Probably wants its own dispatch helper.
  • Caching. Parsing the project file on every invocation adds startup cost. A cheap mtime-based cache (parse once, invalidate when the file changes) would keep RTK within its <10ms startup target.
  • Trust boundary. Justfile/Taskfile/.mise.toml/pyproject.toml are checked into the repo and are not under RTK's existing .rtk/filters.toml trust gate. Reading them to decide which RTK filter to apply is safe (no execution, no replace/match_output rules), but worth noting in SECURITY.md so it's intentional.
  • Interaction with the rewrite hook. just / mise / task / poe aren't in src/discover/rules.rs RULES, so they're never auto-rewritten — only invoked when an agent explicitly types rtk just test. The fix should not change rewrite behavior.
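For the caching question, the invalidation rule itself is simple. The sketch below is in-memory only and uses hypothetical names (`ParseCache`, `get`); since each RTK invocation is a separate process, a real cache would have to persist entries to disk, which is omitted here. Reading the file stands in for parsing it.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Caches file contents keyed by path, invalidated by modification time.
struct ParseCache {
    entries: HashMap<PathBuf, (SystemTime, String)>,
    /// Number of times the file was actually read (cache misses).
    parses: usize,
}

impl ParseCache {
    fn new() -> Self {
        ParseCache { entries: HashMap::new(), parses: 0 }
    }

    /// Return the cached content if the file's mtime is unchanged,
    /// otherwise re-read the file and refresh the entry.
    fn get(&mut self, path: &Path) -> io::Result<String> {
        let mtime = fs::metadata(path)?.modified()?;
        if let Some((cached_mtime, content)) = self.entries.get(path) {
            if *cached_mtime == mtime {
                return Ok(content.clone());
            }
        }
        // Cache miss or stale entry: read again (stand-in for parsing).
        let content = fs::read_to_string(path)?;
        self.parses += 1;
        self.entries.insert(path.to_path_buf(), (mtime, content.clone()));
        Ok(content)
    }
}
```

An mtime comparison can miss edits that land within the filesystem's timestamp granularity, so a persisted version might also want to compare file size or a content hash.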

Interim workarounds (please document if delegator routing is out of scope)

  1. Don't go through the delegator. Call the wrapped tool directly: rtk pytest -n auto -m '...' instead of rtk just test. This is the right answer today but isn't documented anywhere — agents will reach for the task-runner alias.
  2. Project-local override in .rtk/filters.toml with a much higher max_lines (e.g. 500). This gives up most of the compression in exchange for output that is no longer actively harmful, and it requires rtk trust from each user.
  3. Bypass. RTK_NO_TOML=1 just test or rtk proxy just test.

Why this matters

The whole point of RTK is to give agents useful compressed output. Truncating before the FAILED summary line silently inverts that goal — the agent sees a passing-looking truncated run and moves on, when in reality the build is broken. This is the worst failure mode for an LLM proxy.

Happy to take a stab at the fix if there's agreement on direction (parsing the project file, with the current TOML pipeline as the fallback, seems lowest-risk to me). Flagging the architectural question first since it affects four filters and would shape how future delegator filters are written.

cc related: #1062 (poe filter) — would inherit the same fix for free.
