
Conversation


@rlratzel rlratzel commented Oct 23, 2025

Initial version of Curator benchmarking framework, based off of PR1011.


copy-pr-bot bot commented Oct 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rlratzel rlratzel changed the title Adds initial benchmarking framework [WIP] Adds initial benchmarking framework Oct 23, 2025
…ons, minor cleanup of results dict.

Signed-off-by: rlratzel <[email protected]>
Signed-off-by: rlratzel <[email protected]>
```
results_dir: "./results"
entries:
  - name: quick_test
    script: test_benchmark.py
```


I think there is a mismatch between the arguments passed to test_benchmark.py in this README.md and the arguments the script actually requires:

```
parser.add_argument("--input-path", required=True, help="Path to input data")
parser.add_argument("--output-path", required=True, help="Path to output data")
```

test_benchmark.py requires the two arguments above. --benchmark-results-path is also required by the script, but the framework appends it automatically in matrix.py (L44:L47), which I believe is why you have the FIXME there.
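
For reference, a rough sketch of what I understand the framework to be doing when it builds each command; the function and variable names here are hypothetical, only the --benchmark-results-path flag comes from the code under review:

```python
import sys
from pathlib import Path

# Hypothetical sketch: the framework builds the benchmark command itself,
# so the README examples do not need to list --benchmark-results-path.
def build_command(entry: dict, results_dir: Path) -> list[str]:
    cmd = [sys.executable, entry["script"]]
    if "args" in entry:
        cmd += entry["args"].split()
    # Appended automatically for every entry, mirroring what matrix.py
    # does around L44:L47; the exact placement is an assumption.
    cmd += ["--benchmark-results-path", str(results_dir / entry["name"])]
    return cmd
```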


```
entries:
  - name: benchmark_v1
    script: my_benchmark.py
```


my_benchmark.py is not included in this repo, so it might be odd to have test_benchmark.py in Example 1 and a different benchmark script that doesn't exist in Example 2. Maybe you meant test_benchmark.py?

```
  - name: benchmark_v2
    script: my_benchmark.py
    args: --input {dataset:sample_data,parquet} --algorithm v2
```


```
python benchmarking/run.py --config config.yaml
```

You forgot the line above; adding it would keep this example consistent with Example 1.

```
  - name: cc_large
    formats:
      - type: parquet
        path: /data/common_crawl_large.parquet
```


I assume the user is expected to provide a path to a .parquet file?


```
entries:
  - name: cc_extraction
    script: common_crawl_benchmark.py
```


Is the user expected to create their own common_crawl_benchmark.py, or should it be test_benchmark.py?
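
If the intent is for users to write their own script, it might be worth showing the expected CLI contract somewhere. A minimal sketch, assuming the three flags discussed above are the whole interface; everything else in this skeleton is illustrative:

```python
import argparse

def main() -> None:
    # The flag names come from test_benchmark.py as quoted above;
    # the rest of this skeleton is a hypothetical illustration.
    parser = argparse.ArgumentParser(description="Example user-supplied benchmark script")
    parser.add_argument("--input-path", required=True, help="Path to input data")
    parser.add_argument("--output-path", required=True, help="Path to output data")
    # Appended automatically by the framework (matrix.py), so users normally
    # do not pass it via the config's args.
    parser.add_argument("--benchmark-results-path", required=True, help="Where to write results")
    args = parser.parse_args()
    # ... run the benchmark and write results to args.benchmark_results_path ...

if __name__ == "__main__":
    main()
```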

…rs present, adds convenience options to run.sh for debugging.

Signed-off-by: rlratzel <[email protected]>
… deps, updates config.yaml to use the same env vars as used in tools/run.sh, updates tools/run.sh to expose env vars used for volume mounts so they can be used in config.yaml

Signed-off-by: rlratzel <[email protected]>
… no args, makes default config specified in run.py instead of run.sh

Signed-off-by: rlratzel <[email protected]>
Signed-off-by: rlratzel <[email protected]>
…n for using local Curator for benchmark/debug without rebuilding image, allows for env to override each of the LOCAL_ paths.

Signed-off-by: rlratzel <[email protected]>
Signed-off-by: rlratzel <[email protected]>
…nal sinks may be explained differently.

Signed-off-by: rlratzel <[email protected]>
…rs args for sinks and MatrixConfig for clarity.

Signed-off-by: rlratzel <[email protected]>
…, uses sys.executable for Python exe, searches parent dirs when looking for repo commit hash.

Signed-off-by: rlratzel <[email protected]>