Skip to content

fix: task cancelation race + RLM sandbox workers#1035

Open
rasdani wants to merge 2 commits intomainfrom
daniel/fix-rlm-and-cancel-race
Open

fix: task cancelation race + RLM sandbox workers#1035
rasdani wants to merge 2 commits intomainfrom
daniel/fix-rlm-and-cancel-race

Conversation

@rasdani
Copy link
Contributor

@rasdani rasdani commented Mar 19, 2026

Description

Fix a missed-cancel race in the ZMQ env server and make RLM sandbox upload+download failures fail as vf.SandboxError.

Changes

  • register request tasks at creation time so cancels cannot miss the pre-registration window
  • add a regression test for that cancel race
  • add shared retry predicates for current sandbox timeout/error types
  • wire RLM sandbox client pool knobs through to the real executor client
  • add sandbox_transfer_max_retries to RLMEnv
  • wrap exhausted RLM upload/download retries as vf.SandboxError

Why

This stops stale cancelled rollouts from continuing invisibly and makes exhausted sandbox transfer failures follow the normal rescheduleable sandbox-error path.

Testing

  • added targeted unit tests
  • uv run ruff check passed

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Changes request-task tracking and cancellation behavior in ZMQEnvServer, plus adds retry/wrapping around RLM sandbox upload/download operations; mistakes could lead to leaked tasks or masked sandbox transfer failures.

Overview
Fixes a client-cancel race in ZMQEnvServer by registering per-request tasks at creation time and centralizing task cleanup, ensuring cancel signals can’t miss the pre-registration window.

Hardens RLM sandbox file transfers by introducing shared retry predicates for transient sandbox/API read errors, threading sandbox client pool settings from RLMEnv into RLMExecutor, and retrying upload/download operations with failures surfaced as vf.SandboxError after sandbox_transfer_max_retries.

Adds targeted regression tests covering the pre-processing cancel race, retryable sandbox error classification, executor client pool configuration wiring, and upload-timeout retry behavior.

Written by Cursor Bugbot for commit 0ee0546. This will update automatically on new commits. Configure here.

@rasdani rasdani marked this pull request as ready for review March 19, 2026 05:23
Comment on lines +147 to 148
self.request_tasks[frame_request_id] = task
self.pending_tasks.add(task)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know not part of this pr but would it make sense to merge those two dicts to avoid future races?

except Exception:
pass

async def _upload_file_with_retry(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make this part of sandbox nixon so all envs get the benefits?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants