Code for evals measuring frontier model capabilities.
- PaperBench: End-to-end replication of state-of-the-art AI papers. Paper | Blog
- SWE-Lancer: Real freelance software engineering tasks with end-to-end tests. Paper | Blog
We manage environments with `uv`. Install `uv` once, then run `uv sync` (or `uv pip install -r ...`) inside the project of interest to create its virtual environment from the checked-in `uv.lock`.
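If you have not installed `uv` yet, a typical one-time setup looks like the sketch below (this assumes the standalone installer script from the uv docs; `pipx install uv` or `pip install uv` work as well):

```bash
# One-time install of uv (standalone installer script)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Confirm the install
uv --version
```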
Each eval directory documents how to reproduce runs, configure models, and interpret results. Start with the suite `README.md`, then consult any scripts under `scripts/` or `runtime_*/` directories for orchestration details. When in doubt:

- `cd` into the eval directory.
- `uv sync` to install dependencies.
- Follow the local instructions in the `README.md`.
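A minimal sketch of that flow, using SWE-Lancer as the example (PaperBench works the same way; the exact commands to run afterwards are documented in each eval's `README.md`):

```bash
cd project/swelancer
uv sync         # creates the .venv from the checked-in uv.lock
uv run pytest   # example; any command prefixed with `uv run` executes inside that environment
```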
```
.
├── pyproject.toml      # Shared tooling configuration (Ruff, Black, etc.)
└── project/
    ├── common/         # Shared libraries
    ├── paperbench/     # PaperBench eval
    └── swelancer/      # SWE-Lancer eval
```
Each eval directory is its own isolated project with a `README.md`, `pyproject.toml`, and `uv.lock`.
- Create or activate the environment for the project you are working on with `uv`. Example for PaperBench:

  ```bash
  cd project/paperbench
  uv sync
  uv run pytest
  ```
- Code style and linting use Ruff (with autofix profiles in `pyproject.toml` and `project/common/tooling/ruff_autofix_minimal.toml`) and Black. Run `uv run ruff check --fix` or use the provided Poe/`make` tasks where available (see the command sketch after this list).
tasks where available. - Shared utilities live under
project/common
; changes there may affect multiple evals. Bump the relevant editable dependencies if you create new shared subpackages.
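For the linting bullet above, a typical direct invocation might look like the following; treat it as illustrative rather than the canonical entry point, since the exact Poe/`make` task names (and whether Black is in a given project's dev dependencies) vary per eval:

```bash
cd project/paperbench

# Ruff autofix using the shared minimal profile (path per the layout above)
uv run ruff check --fix --config ../common/tooling/ruff_autofix_minimal.toml .

# Black formatting
uv run black .
```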
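For the shared-utilities bullet, adding a new subpackage under `project/common` means each eval that consumes it needs an editable path dependency. A sketch using a hypothetical subpackage name, `new_shared_utils` (assuming `uv add --editable`, which records the path dependency in the eval's `pyproject.toml` and updates its `uv.lock`):

```bash
cd project/paperbench

# `new_shared_utils` is a hypothetical example package name
uv add --editable ../common/new_shared_utils
```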