
Conversation

yudhiesh commented Nov 8, 2025

What does this PR do?

  • This PR adds a new benchmarking framework, as detailed in this discussion, covering points 5 & 6.
  • Adds the benchmark_v2/ package with a dataset registry from the Set Similarity Search Benchmark, a cache/downloader, a NumPy memmap builder, a Typer CLI, and ground-truth helpers. Every long-running step goes through tqdm, so downloads/extracts/memmaps show progress.
  • Introduces datasets list, datasets sync, and datasets sync-all commands that share a single sync helper (see the sketch after this list).
  • Brings in uv-managed dependencies (typer, numpy, datasketch, tqdm, ruff, pytest) plus a Justfile for install/lint/test/CLI sync recipes.
  • Adds pytest test cases for the code.
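To make the intended CLI shape concrete, here is a minimal sketch (not the actual benchmark_v2 code) of a dataset registry plus a single sync helper shared by the Typer commands, with tqdm progress on downloads. The registry entry, URL, and cache path below are placeholders.

# Hypothetical sketch: registry + shared sync helper behind Typer commands.
# Dataset names, URLs, and the cache location are illustrative placeholders.
from pathlib import Path
from urllib.request import urlopen

import typer
from tqdm import tqdm

app = typer.Typer()

DATASETS = {
    "kosarak": "https://example.com/kosarak.tar.gz",  # placeholder URL
}
CACHE_ROOT = Path.home() / ".cache" / "datasketch" / "benchmark_v2"


def _sync(name: str, cache_root: Path = CACHE_ROOT) -> Path:
    """Download one dataset into the cache, streaming through a tqdm bar."""
    url = DATASETS[name]
    dest = cache_root / f"{name}.tar.gz"
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists():
        return dest
    with urlopen(url) as resp, open(dest, "wb") as out:
        total = int(resp.headers.get("Content-Length", 0))
        with tqdm(total=total, unit="B", unit_scale=True, desc=name) as bar:
            for chunk in iter(lambda: resp.read(1 << 20), b""):
                out.write(chunk)
                bar.update(len(chunk))
    return dest


@app.command("list")
def list_datasets() -> None:
    """Print the names of all registered datasets."""
    for name in DATASETS:
        typer.echo(name)


@app.command()
def sync(name: str) -> None:
    """Sync a single dataset by name."""
    typer.echo(f"cached at {_sync(name)}")


@app.command("sync-all")
def sync_all() -> None:
    """Sync every registered dataset through the same helper."""
    for name in DATASETS:
        _sync(name)


if __name__ == "__main__":
    app()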

Why do it this way?

  • The structure mirrors the NeurIPS Big ANN Benchmark (the same approach I used when building the vector DB benchmark in this write-up, based on the official instructions in the harsha-simhadri/big-ann-benchmarks repo).
  • In simple terms, we want to follow the same pattern: run the algorithms inside Docker containers so benchmarks stay consistent, and in the future extend this to storage backends.

TODO:

This is a draft PR that covers the initial step of dataset generation & caching. I still need to work on the following steps, but I wanted to get some feedback on the approach before investing more time into it:

  1. Metrics + reporting layer

    • Build the metrics package to compute recall/precision, latency percentiles, build cost, etc., from the memmapped sets + ground truth (a minimal sketch follows this list).
    • Emit structured JSON with schema/versioning and add comparison/report commands so we can diff runs (similar to the Big ANN benchmark output pipeline).
  2. Benchmark runners & Docker harness

    • Implement the sketch/index benchmark drivers, parameter sweeps, and structured result collection.
    • Package everything in a Docker image with controlled CPU/mem limits, following the NeurIPS Big ANN Dockerfile pattern for reproducible environments.
    • Should also include a benchmark comparison across all experiments (covering algorithms for now; storage backends will come in another PR) to generate an image like this:
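For the metrics layer described in item 1, here is a minimal sketch of how recall, latency percentiles, and a schema-versioned JSON record could be computed from per-query result IDs and ground-truth IDs; the field names and schema version are assumptions, not a settled format.

# Hypothetical sketch of the metrics/reporting layer: recall against ground
# truth, latency percentiles, and a versioned JSON record for diffing runs.
import json
from statistics import quantiles


def recall(results: list[list[int]], ground_truth: list[list[int]]) -> float:
    """Mean fraction of true matches recovered per query."""
    scores = []
    for found, truth in zip(results, ground_truth):
        # Queries with no true matches count as fully recalled.
        scores.append(len(set(found) & set(truth)) / len(truth) if truth else 1.0)
    return sum(scores) / len(scores)


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99 from per-query latencies (quantiles with 99 cut points)."""
    cuts = quantiles(latencies_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def emit_report(recall_value: float, latencies_ms: list[float]) -> str:
    """Serialize one run into a schema-versioned JSON blob for later comparison."""
    record = {
        "schema_version": 1,
        "recall": recall_value,
        "latency_ms": latency_percentiles(latencies_ms),
    }
    return json.dumps(record, indent=2)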

CLI quickstart

# Install uv (single binary)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install just
curl -fsSL https://just.systems/install.sh | bash -s -- --to /usr/local/bin

cd benchmark_v2

# Install project deps via uv + the Justfile
just install

# List available datasets
just list-datasets

# Sync a single dataset (defaults to ~/.cache/datasketch/benchmark_v2)
just sync kosarak

# Sync everything (runs datasets in parallel, shows tqdm progress)
just sync-all cache_root=/tmp/benchmark-cache with_ground_truth="--with-ground-truth" memmap="--memmap"

# Run tests
just test
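For reference, a minimal sketch of what the memmap build behind the --memmap flag could look like, assuming variable-length sets are stored as a flat int32 token array plus an int64 offsets array; the layout and file names are assumptions, not the PR's actual on-disk format.

# Hypothetical memmap layout: one flat token array plus offsets marking set
# boundaries, so individual sets can be read without loading the whole file.
import numpy as np


def build_memmap(sets: list[list[int]], tokens_path: str, offsets_path: str) -> None:
    """Write all sets into a single int32 memmap; offsets_path should end in .npy."""
    offsets = np.zeros(len(sets) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum([len(s) for s in sets])
    tokens = np.memmap(tokens_path, dtype=np.int32, mode="w+", shape=(int(offsets[-1]),))
    for i, s in enumerate(sets):
        tokens[offsets[i]:offsets[i + 1]] = s
    tokens.flush()
    np.save(offsets_path, offsets)


def read_set(tokens_path: str, offsets_path: str, i: int) -> np.ndarray:
    """Read the i-th set back from the memmap without loading the whole file."""
    offsets = np.load(offsets_path)
    tokens = np.memmap(tokens_path, dtype=np.int32, mode="r", shape=(int(offsets[-1]),))
    return np.array(tokens[offsets[i]:offsets[i + 1]])


# Example: two small sets round-tripped through the memmap files.
build_memmap([[1, 2, 3], [4, 5]], "tokens.dat", "offsets.npy")
assert read_set("tokens.dat", "offsets.npy", 1).tolist() == [4, 5]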

ekzhu (Owner) commented Nov 10, 2025

Thanks for the PR. As a general comment about benchmarks, I believe it is important for users to have a transparent understanding of what is being benchmarked and how to add new ones. The other thing is that I think we should move in smaller steps, focusing on continuity rather than a big new version.

Can we solve each of these problems in separate pieces of work:

  1. Dataset sourcing: this could be a separate module in itself. We should make it really simple to set up and add new sources, with clear documentation on how to do it. It should include random dataset generation with a seed. We should update existing benchmark scripts to use this new module.
  2. Benchmark harnesses: this deals with how to add new algorithms and scenarios to be tested, and how to run them (see the sketch after this list).
  3. Containerization: this is about deployment of the harnesses.
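A minimal sketch of what such a harness interface could look like, with datasketch's MinHashLSH wrapped as one example plug-in; the Algorithm base class and its method names are assumptions, not an agreed API.

# Hypothetical harness interface: each algorithm implements build() over the
# indexed sets and query() per probe set, so new algorithms plug in uniformly.
from abc import ABC, abstractmethod

from datasketch import MinHash, MinHashLSH


class Algorithm(ABC):
    @abstractmethod
    def build(self, sets: list[set[str]]) -> None: ...

    @abstractmethod
    def query(self, s: set[str]) -> list[int]: ...


class MinHashLSHAlgorithm(Algorithm):
    """Example plug-in wrapping datasketch's MinHashLSH."""

    def __init__(self, threshold: float = 0.5, num_perm: int = 128):
        self.num_perm = num_perm
        self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)

    def _minhash(self, s: set[str]) -> MinHash:
        m = MinHash(num_perm=self.num_perm)
        for token in s:
            m.update(token.encode("utf8"))
        return m

    def build(self, sets: list[set[str]]) -> None:
        for i, s in enumerate(sets):
            self.lsh.insert(i, self._minhash(s))

    def query(self, s: set[str]) -> list[int]:
        return self.lsh.query(self._minhash(s))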

I am not sure whether we should have a separate package for all of these or introduce new packages. I think it is possible to start without committing to a new package and use the current pyproject.toml file for dependency management -- we just need to ensure we don't include the benchmark code in the PyPI release.

yudhiesh (Author) commented

Sure, no issue. I can move the current dataset sourcing into the benchmark directory. I can make it simpler to add new datasets by creating a datasets.yaml where you just add links to new datasets under the relevant category, and the benchmark script will pick them up. I will also add another category of datasets covering seeded random dataset generation, such as uniformly random sets across different sizes (a minimal sketch follows below). For documentation I will write a clear README about the entire process and how to add new datasets.
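A minimal sketch of the seeded random-set generation mentioned above, assuming NumPy's Generator API; the universe size, set size, and set count are placeholder parameters.

# Hypothetical seeded generator for the "random" dataset category:
# uniform random sets drawn without replacement from a fixed universe.
import numpy as np


def generate_random_sets(n_sets: int, set_size: int, universe: int, seed: int) -> list[np.ndarray]:
    """Draw n_sets uniform random sets (without replacement) from [0, universe)."""
    rng = np.random.default_rng(seed)
    return [
        np.sort(rng.choice(universe, size=set_size, replace=False))
        for _ in range(n_sets)
    ]


# Same seed -> identical sets, so runs are reproducible across machines.
a = generate_random_sets(n_sets=1000, set_size=64, universe=100_000, seed=42)
b = generate_random_sets(n_sets=1000, set_size=64, universe=100_000, seed=42)
assert all(np.array_equal(x, y) for x, y in zip(a, b))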
