get_free_port_pair() TOCTOU race causes port collision with multiple env servers #1012

@zch42

Description:

Spawning multiple ZMQEnvServer instances in a loop (e.g., an orchestrator with 2+ [[env]] entries) intermittently fails with:

zmq.error.ZMQError: Address already in use (addr='tcp://127.0.0.1:xxxxx')

Root cause

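verifiers/utils/worker_utils.py::get_free_port_pair() closes both sockets before returning (they are opened in a with block), so the ports are already free again by the time the caller receives them. The OS can then assign the same ephemeral port to the next call before the first ZMQ server has bound it. This is a time-of-check to time-of-use (TOCTOU) race.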
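For illustration, the racy pattern described above looks roughly like the sketch below. This is a hypothetical reconstruction using only the standard library, not the actual source of worker_utils.py; the name get_free_port_pair_racy and the 127.0.0.1 default are assumptions.

```python
import socket


def get_free_port_pair_racy(host: str = "127.0.0.1") -> tuple[int, int]:
    """Hypothetical reconstruction of the racy pattern (not the real helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s1, \
            socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s2:
        # Port 0 asks the OS to pick any free ephemeral port.
        s1.bind((host, 0))
        s2.bind((host, 0))
        ports = (s1.getsockname()[1], s2.getsockname()[1])
    # Both sockets are closed here, so nothing reserves the ports anymore.
    # Any bind(port 0) issued before the ZMQ server binds may receive them.
    return ports
```

The two ports within one pair never collide because both sockets are open at the same time. The collision can only happen across calls, which matches the failure appearing with 2+ env servers.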

Affects both verifiers/envs/environment.py::Environment.start_server() and prime_rl/orchestrator/vf_utils.py::spawn_env_server().

Proposed fix

Hold the reservation sockets open with SO_REUSEADDR until the parent process exits. This prevents the OS from recycling ports while child processes are starting, and SO_REUSEADDR allows ZMQ in the child to bind the same address.
