
Conversation


@rlratzel rlratzel commented Oct 23, 2025

Initial version of Curator benchmarking framework, based off of PR1011.


copy-pr-bot bot commented Oct 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rlratzel rlratzel changed the title Adds initial benchmarking framework [WIP] Adds initial benchmarking framework Oct 23, 2025
…ons, minor cleanup of results dict.

Signed-off-by: rlratzel <[email protected]>
Signed-off-by: rlratzel <[email protected]>
```
results_dir: "./results"
entries:
  - name: quick_test
    script: test_benchmark.py
```


I think there is a mismatch between the arguments passed to test_benchmark.py in this README.md and the arguments the script actually requires:

```
parser.add_argument("--input-path", required=True, help="Path to input data")
parser.add_argument("--output-path", required=True, help="Path to output data")
```

test_benchmark.py requires the two arguments above. --benchmark-results-path is also required by the script, but the framework appends it automatically in matrix.py (L44:L47), which I believe is why you have the FIXME there.
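
For reference, a rough sketch of what I understand the framework to be doing when it builds each command; the function and variable names here are hypothetical, only the --benchmark-results-path flag comes from the code under review:

```python
import sys
from pathlib import Path

# Hypothetical sketch: the framework builds the benchmark command itself,
# so the README examples do not need to list --benchmark-results-path.
def build_command(entry: dict, results_dir: Path) -> list[str]:
    cmd = [sys.executable, entry["script"]]
    if "args" in entry:
        cmd += entry["args"].split()
    # Appended automatically for every entry, mirroring what matrix.py
    # does around L44:L47; the exact placement is an assumption.
    cmd += ["--benchmark-results-path", str(results_dir / entry["name"])]
    return cmd
```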


```
entries:
  - name: benchmark_v1
    script: my_benchmark.py
```


my_benchmark.py is not included in this repo, so it might be odd to have test_benchmark.py in Example 1 and a different benchmark script that doesn't exist in Example 2. Maybe you meant test_benchmark.py?

```
  - name: benchmark_v2
    script: my_benchmark.py
    args: --input {dataset:sample_data,parquet} --algorithm v2
```


```
python benchmarking/run.py --config config.yaml
```

You forgot the line above; adding it would keep this example consistent with Example 1.

```
  - name: cc_large
    formats:
      - type: parquet
        path: /data/common_crawl_large.parquet
```


I assume the user is expected to provide a path to a .parquet file?


```
entries:
  - name: cc_extraction
    script: common_crawl_benchmark.py
```


Is the user expected to create their own common_crawl_benchmark.py, or should it be test_benchmark.py?
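
If the intent is for users to write their own script, it might be worth showing the expected CLI contract somewhere. A minimal sketch, assuming the three flags discussed above are the whole interface; everything else in this skeleton is illustrative:

```python
import argparse

def main() -> None:
    # The flag names come from test_benchmark.py as quoted above;
    # the rest of this skeleton is a hypothetical illustration.
    parser = argparse.ArgumentParser(description="Example user-supplied benchmark script")
    parser.add_argument("--input-path", required=True, help="Path to input data")
    parser.add_argument("--output-path", required=True, help="Path to output data")
    # Appended automatically by the framework (matrix.py), so users normally
    # do not pass it via the config's args.
    parser.add_argument("--benchmark-results-path", required=True, help="Where to write results")
    args = parser.parse_args()
    # ... run the benchmark and write results to args.benchmark_results_path ...

if __name__ == "__main__":
    main()
```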

…rs present, adds convenience options to run.sh for debugging.

Signed-off-by: rlratzel <[email protected]>
… deps, updates config.yaml to use the same env vars as used in tools/run.sh, updates tools/run.sh to expose env vars used for volume mounts so they can be used in config.yaml

Signed-off-by: rlratzel <[email protected]>
… no args, makes default config specified in run.py instead of run.sh

Signed-off-by: rlratzel <[email protected]>
Signed-off-by: rlratzel <[email protected]>
…n for using local Curator for benchmark/debug without rebuilding image, allows for env to override each of the LOCAL_ paths.

Signed-off-by: rlratzel <[email protected]>
Signed-off-by: rlratzel <[email protected]>
…nal sinks may be explained differently.

Signed-off-by: rlratzel <[email protected]>
…rs args for sinks and MatrixConfig for clarity.

Signed-off-by: rlratzel <[email protected]>
…, uses sys.executable for Python exe, searches parent dirs when looking for repo commit hash.

Signed-off-by: rlratzel <[email protected]>