
Frontier Evals

Code for evals measuring frontier model capabilities.

Usage

Requirements

We manage environments with uv. Install uv once, then run uv sync (or uv pip install -r ...) inside the project of interest to create its virtual environment from the checked-in uv.lock.
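
As a minimal sketch, assuming the standalone installer and PaperBench as the project of interest (any eval directory works the same way):

    # Install uv once (see https://docs.astral.sh/uv/ for other install methods)
    curl -LsSf https://astral.sh/uv/install.sh | sh

    # Create the project's virtual environment from its checked-in uv.lock
    cd project/paperbench
    uv sync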

Running Evals

Each eval directory documents how to reproduce runs, configure models, and interpret results. Start with the suite's README.md, then consult the scripts under scripts/ or runtime_*/ for orchestration details. When in doubt:

  1. cd into the eval directory.
  2. uv sync to install dependencies.
  3. Follow the local instructions in the README.md (a worked example is sketched after this list).
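
For example, a run might look like the following; the entry point shown here is hypothetical, so substitute the commands from the eval's own README.md:

    cd project/swelancer
    uv sync

    # Hypothetical entry point -- the real invocation is documented per eval
    uv run python run_eval.py --model <model-name>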

Contributing

Layout

.
├── pyproject.toml             # Shared tooling configuration (Ruff, Black, etc.)
└── project/
    ├── common/               # Shared libraries
    ├── paperbench/           # PaperBench eval
    └── swelancer/            # SWE-Lancer eval

Each eval directory is an isolated project with its own README.md, pyproject.toml, and uv.lock.
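
An individual eval project therefore looks roughly like this (any additional code, scripts/, or runtime_*/ directories vary per eval):

    project/paperbench/
    ├── README.md          # How to reproduce runs and interpret results
    ├── pyproject.toml     # Eval-specific dependencies and tooling
    └── uv.lock            # Pinned environment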

Development Workflow

  • Use uv to create or activate the environment for the project you are working on. Example for PaperBench:
    • cd project/paperbench
    • uv sync
    • uv run pytest
  • Code style and linting use Ruff (with autofix profiles in pyproject.toml and project/common/tooling/ruff_autofix_minimal.toml) and Black. Run uv run ruff check --fix or use the provided Poe/make tasks where available; see the example commands after this list.
  • Shared utilities live under project/common; changes there may affect multiple evals. Bump the relevant editable dependencies if you create new shared subpackages.
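
A typical lint-and-format pass, assuming Black is available in the project's environment and run via uv (prefer any Poe/make tasks the project defines):

    cd project/paperbench
    # Lint with Ruff, applying autofixes
    uv run ruff check --fix
    # Format with Black
    uv run black .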