Code for evals measuring frontier model capabilities.
- PaperBench: End-to-end replication of state-of-the-art AI papers. Paper | Blog
- SWE-Lancer: Real freelance software engineering tasks with end-to-end tests. Paper | Blog
We manage environments with `uv`. Install `uv` once, then run `uv sync` (or `uv pip install -r ...`) inside the project of interest to create its virtual environment from the checked-in `uv.lock`.
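If you have not installed `uv` yet, a typical one-time setup looks like the sketch below (this assumes the standalone installer script from the uv docs; `pipx install uv` or `pip install uv` work as well):

```bash
# One-time install of uv (standalone installer script)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Confirm the install
uv --version
```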
Each eval directory documents how to reproduce runs, configure models, and interpret results. Start with the suite `README.md`, then consult any scripts under `scripts/` or `runtime_*/` directories for orchestration details. When in doubt:

- `cd` into the eval directory.
- `uv sync` to install dependencies.
- Follow the local instructions in the `README.md`.
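A minimal sketch of that flow, using SWE-Lancer as the example (PaperBench works the same way; the exact commands to run afterwards are documented in each eval's `README.md`):

```bash
cd project/swelancer
uv sync         # creates the .venv from the checked-in uv.lock
uv run pytest   # example; any command prefixed with `uv run` executes inside that environment
```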
```
.
├── pyproject.toml      # Shared tooling configuration (Ruff, Black, etc.)
└── project/
    ├── common/         # Shared libraries
    ├── paperbench/     # PaperBench eval
    └── swelancer/      # SWE-Lancer eval
```
Each eval directory is its own isolated project with a `README.md`, `pyproject.toml`, and `uv.lock`.
- Create or activate the environment for the project you are working on with `uv`. Example for PaperBench:

  ```bash
  cd project/paperbench
  uv sync
  uv run pytest
  ```
- Code style and linting use Ruff (with autofix profiles in `pyproject.toml` and `project/common/tooling/ruff_autofix_minimal.toml`) and Black. Run `uv run ruff check --fix` or use the provided Poe/`make` tasks where available (see the command sketch after this list).
tasks where available. - Shared utilities live under
project/common
; changes there may affect multiple evals. Bump the relevant editable dependencies if you create new shared subpackages.
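For the linting bullet above, a typical direct invocation might look like the following; treat it as illustrative rather than the canonical entry point, since the exact Poe/`make` task names (and whether Black is in a given project's dev dependencies) vary per eval:

```bash
cd project/paperbench

# Ruff autofix using the shared minimal profile (path per the layout above)
uv run ruff check --fix --config ../common/tooling/ruff_autofix_minimal.toml .

# Black formatting
uv run black .
```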
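For the shared-utilities bullet, adding a new subpackage under `project/common` means each eval that consumes it needs an editable path dependency. A sketch using a hypothetical subpackage name, `new_shared_utils` (assuming `uv add --editable`, which records the path dependency in the eval's `pyproject.toml` and updates its `uv.lock`):

```bash
cd project/paperbench

# `new_shared_utils` is a hypothetical example package name
uv add --editable ../common/new_shared_utils
```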