@goutamvenkat-anyscale goutamvenkat-anyscale commented Sep 5, 2025

Install dependencies: uv sync --extra text
Command:

python nightly_benchmarking/common_crawl_benchmark.py \
  --download_path {session}/scratch/downloads \
  --output_path {session}/scratch/output \
  --output_format parquet \
  --crawl_type main \
  --start_snapshot 2023-01 \
  --end_snapshot 2023-10 \
  --html_extraction justext \
  --url_limit 768 \
  --add_filename_column \
  --executor ray_data \
  --ray_data_cast_as_actor \
  --benchmark_results_path {session}/results

Run on 64 CPUs.
The --ray_data_cast_as_actor flag casts all stages in the pipeline to actors.
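
To illustrate what casting a stage to an actor can mean, here is a minimal sketch that wraps a stateless batch function in a callable class. This is not the code from this PR: `cast_as_actor`, `ActorStage`, and `add_url_length` are hypothetical names made up for the example, and the sketch uses only the standard library so it runs without a Ray cluster.

```python
from typing import Any, Callable, Dict, List

Batch = Dict[str, List[Any]]


def cast_as_actor(fn: Callable[[Batch], Batch]) -> type:
    """Wrap a stateless batch function in a callable class (hypothetical helper)."""

    class ActorStage:
        def __init__(self) -> None:
            # Runs once per instance; one-time setup for a stage would go here.
            self.calls = 0

        def __call__(self, batch: Batch) -> Batch:
            # Delegate each batch to the wrapped function.
            self.calls += 1
            return fn(batch)

    ActorStage.__name__ = f"{fn.__name__}_actor"
    return ActorStage


def add_url_length(batch: Batch) -> Batch:
    # Hypothetical stage: record the length of each URL in the batch.
    return {**batch, "url_len": [len(u) for u in batch["url"]]}


StageCls = cast_as_actor(add_url_length)
stage = StageCls()
result = stage({"url": ["https://example.com", "https://a.org"]})
```

In Ray Data, a callable class like this can be passed to `Dataset.map_batches` together with a `concurrency` setting, in which case it runs on a pool of long-lived actors rather than as short-lived tasks. Whether the benchmark flag does exactly this is an assumption.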

Signed-off-by: Goutam V. <[email protected]>
copy-pr-bot bot commented Sep 5, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
