
Conversation

yudhiesh commented Nov 8, 2025

What does this PR do?

  • This PR adds a new benchmarking framework, as detailed in this discussion, covering points 5 & 6.
  • Adds the benchmark_v2/ package with a dataset registry from the Set Similarity Search Benchmark, a cache/downloader, a NumPy memmap builder, a Typer CLI, and ground-truth helpers. Every long-running step goes through tqdm, so downloads/extracts/memmaps show progress.
  • Introduces datasets list, datasets sync, and datasets sync-all commands that share a single sync helper (see the sketch after this list).
  • Brings in uv-managed dependencies (typer, numpy, datasketch, tqdm, ruff, pytest) plus a Justfile for install/lint/test/CLI sync recipes.
  • Adds pytest test cases for the code.
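To make the intended CLI shape concrete, here is a minimal sketch (not the actual benchmark_v2 code) of a dataset registry plus a single sync helper shared by the Typer commands, with tqdm progress on downloads. The registry entry, URL, and cache path below are placeholders.

# Hypothetical sketch: registry + shared sync helper behind Typer commands.
# Dataset names, URLs, and the cache location are illustrative placeholders.
from pathlib import Path
from urllib.request import urlopen

import typer
from tqdm import tqdm

app = typer.Typer()

DATASETS = {
    "kosarak": "https://example.com/kosarak.tar.gz",  # placeholder URL
}
CACHE_ROOT = Path.home() / ".cache" / "datasketch" / "benchmark_v2"


def _sync(name: str, cache_root: Path = CACHE_ROOT) -> Path:
    """Download one dataset into the cache, streaming through a tqdm bar."""
    url = DATASETS[name]
    dest = cache_root / f"{name}.tar.gz"
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists():
        return dest
    with urlopen(url) as resp, open(dest, "wb") as out:
        total = int(resp.headers.get("Content-Length", 0))
        with tqdm(total=total, unit="B", unit_scale=True, desc=name) as bar:
            for chunk in iter(lambda: resp.read(1 << 20), b""):
                out.write(chunk)
                bar.update(len(chunk))
    return dest


@app.command("list")
def list_datasets() -> None:
    """Print the names of all registered datasets."""
    for name in DATASETS:
        typer.echo(name)


@app.command()
def sync(name: str) -> None:
    """Sync a single dataset by name."""
    typer.echo(f"cached at {_sync(name)}")


@app.command("sync-all")
def sync_all() -> None:
    """Sync every registered dataset through the same helper."""
    for name in DATASETS:
        _sync(name)


if __name__ == "__main__":
    app()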

Why do it this way?

  • The structure mirrors the NeurIPS Big ANN Benchmark (the same approach I used when building the vector DB benchmark in this write-up, based on the official instructions in the harsha-simhadri/big-ann-benchmarks repo).
  • In simple terms, we want to follow the same pattern: run the algorithms inside Docker containers so benchmarks stay consistent, and in the future extend this to storage backends.

TODO:

This is a draft PR that covers the initial step of dataset generation & caching. I still need to work on the following steps, but I wanted to get some feedback on the approach before investing more time into it:

  1. Metrics + reporting layer

    • Build the metrics package to compute recall/precision, latency percentiles, build cost, etc., from the memmapped sets + ground truth (a minimal sketch follows this list).
    • Emit structured JSON with schema/versioning and add comparison/report commands so we can diff runs (similar to the Big ANN benchmark output pipeline).
  2. Benchmark runners & Docker harness

    • Implement the sketch/index benchmark drivers, parameter sweeps, and structured result collection.
    • Package everything in a Docker image with controlled CPU/mem limits, following the NeurIPS Big ANN Dockerfile pattern for reproducible environments.
    • Should also include a benchmark comparison across all experiments (covering algorithms for now; storage backends will come in another PR) to generate an image like this:
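For the metrics layer described in item 1, here is a minimal sketch of how recall, latency percentiles, and a schema-versioned JSON record could be computed from per-query result IDs and ground-truth IDs; the field names and schema version are assumptions, not a settled format.

# Hypothetical sketch of the metrics/reporting layer: recall against ground
# truth, latency percentiles, and a versioned JSON record for diffing runs.
import json
from statistics import quantiles


def recall(results: list[list[int]], ground_truth: list[list[int]]) -> float:
    """Mean fraction of true matches recovered per query."""
    scores = []
    for found, truth in zip(results, ground_truth):
        # Queries with no true matches count as fully recalled.
        scores.append(len(set(found) & set(truth)) / len(truth) if truth else 1.0)
    return sum(scores) / len(scores)


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99 from per-query latencies (quantiles with 99 cut points)."""
    cuts = quantiles(latencies_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def emit_report(recall_value: float, latencies_ms: list[float]) -> str:
    """Serialize one run into a schema-versioned JSON blob for later comparison."""
    record = {
        "schema_version": 1,
        "recall": recall_value,
        "latency_ms": latency_percentiles(latencies_ms),
    }
    return json.dumps(record, indent=2)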

CLI quickstart

# Install uv (single binary)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install just
curl -fsSL https://just.systems/install.sh | bash -s -- --to /usr/local/bin

cd benchmark_v2

# Install project deps via uv + the Justfile
just install

# List available datasets
just list-datasets

# Sync a single dataset (defaults to ~/.cache/datasketch/benchmark_v2)
just sync kosarak

# Sync everything (runs datasets in parallel, shows tqdm progress)
just sync-all cache_root=/tmp/benchmark-cache with_ground_truth="--with-ground-truth" memmap="--memmap"

# Run tests
just test
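For reference, a minimal sketch of what the memmap build behind the --memmap flag could look like, assuming variable-length sets are stored as a flat int32 token array plus an int64 offsets array; the layout and file names are assumptions, not the PR's actual on-disk format.

# Hypothetical memmap layout: one flat token array plus offsets marking set
# boundaries, so individual sets can be read without loading the whole file.
import numpy as np


def build_memmap(sets: list[list[int]], tokens_path: str, offsets_path: str) -> None:
    """Write all sets into a single int32 memmap; offsets_path should end in .npy."""
    offsets = np.zeros(len(sets) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum([len(s) for s in sets])
    tokens = np.memmap(tokens_path, dtype=np.int32, mode="w+", shape=(int(offsets[-1]),))
    for i, s in enumerate(sets):
        tokens[offsets[i]:offsets[i + 1]] = s
    tokens.flush()
    np.save(offsets_path, offsets)


def read_set(tokens_path: str, offsets_path: str, i: int) -> np.ndarray:
    """Read the i-th set back from the memmap without loading the whole file."""
    offsets = np.load(offsets_path)
    tokens = np.memmap(tokens_path, dtype=np.int32, mode="r", shape=(int(offsets[-1]),))
    return np.array(tokens[offsets[i]:offsets[i + 1]])


# Example: two small sets round-tripped through the memmap files.
build_memmap([[1, 2, 3], [4, 5]], "tokens.dat", "offsets.npy")
assert read_set("tokens.dat", "offsets.npy", 1).tolist() == [4, 5]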

ekzhu (Owner) commented Nov 10, 2025

Thanks for the PR. As a general comment about benchmarks, I believe it is important for users to have a transparent understanding of what is being benchmarked and how to add new ones. The other thing is that I think we should move in smaller steps, focusing on continuity rather than a big new version.

Can we solve each of these problems in separate pieces of work:

  1. Dataset sourcing: this could be a separate module in itself. We should make it really simple to set up and add new sources, with clear documentation on how to do it. It should include random dataset generation with a seed. We should update existing benchmark scripts to use this new module.
  2. Benchmark harnesses: this deals with how to add new algorithms and scenarios to be tested, and how to run them (see the sketch after this list).
  3. Containerization: this is about deployment of the harnesses.
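A minimal sketch of what such a harness interface could look like, with datasketch's MinHashLSH wrapped as one example plug-in; the Algorithm base class and its method names are assumptions, not an agreed API.

# Hypothetical harness interface: each algorithm implements build() over the
# indexed sets and query() per probe set, so new algorithms plug in uniformly.
from abc import ABC, abstractmethod

from datasketch import MinHash, MinHashLSH


class Algorithm(ABC):
    @abstractmethod
    def build(self, sets: list[set[str]]) -> None: ...

    @abstractmethod
    def query(self, s: set[str]) -> list[int]: ...


class MinHashLSHAlgorithm(Algorithm):
    """Example plug-in wrapping datasketch's MinHashLSH."""

    def __init__(self, threshold: float = 0.5, num_perm: int = 128):
        self.num_perm = num_perm
        self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)

    def _minhash(self, s: set[str]) -> MinHash:
        m = MinHash(num_perm=self.num_perm)
        for token in s:
            m.update(token.encode("utf8"))
        return m

    def build(self, sets: list[set[str]]) -> None:
        for i, s in enumerate(sets):
            self.lsh.insert(i, self._minhash(s))

    def query(self, s: set[str]) -> list[int]:
        return self.lsh.query(self._minhash(s))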

I am not sure whether we should have a separate package for all of these or introduce new packages. I think it is possible to start without committing to a new package and use the current pyproject.toml file for dependency management -- we just need to ensure we don't include the benchmark code in the PyPI release.

yudhiesh (Author) commented

Sure, no issue. I can move the current dataset sourcing into the benchmark directory. I can make it simpler to add new datasets by creating a datasets.yaml where you just add links to new datasets under the relevant category, and the benchmark script will pick them up. I will also add another category of datasets covering seeded random dataset generation, such as uniformly random sets across different sizes (a minimal sketch follows below). For documentation I will write a clear README about the entire process and how to add new datasets.
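A minimal sketch of the seeded random-set generation mentioned above, assuming NumPy's Generator API; the universe size, set size, and set count are placeholder parameters.

# Hypothetical seeded generator for the "random" dataset category:
# uniform random sets drawn without replacement from a fixed universe.
import numpy as np


def generate_random_sets(n_sets: int, set_size: int, universe: int, seed: int) -> list[np.ndarray]:
    """Draw n_sets uniform random sets (without replacement) from [0, universe)."""
    rng = np.random.default_rng(seed)
    return [
        np.sort(rng.choice(universe, size=set_size, replace=False))
        for _ in range(n_sets)
    ]


# Same seed -> identical sets, so runs are reproducible across machines.
a = generate_random_sets(n_sets=1000, set_size=64, universe=100_000, seed=42)
b = generate_random_sets(n_sets=1000, set_size=64, universe=100_000, seed=42)
assert all(np.array_equal(x, y) for x, y in zip(a, b))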
