Add benchmarking ability #1011

praateekmahajan · 2025-09-02T22:15:20Z

Description

Nightly benchmarking runs a configurable matrix of benchmark scripts. Each matrix entry gets its own isolated Ray cluster, writes results in a standardized format, captures the environment it ran in, and can optionally log to MLflow/W&B. The goal is reproducible, comparable performance measurements across datasets and executors, with separate artifacts for every run and failure isolation so that one failing entry does not affect the others.

Usage

python -m nightly_benchmarking.run \
  --matrix nightly_benchmarking/matrix.yaml \
  --datasets nightly_benchmarking/dataset_paths.json
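
To illustrate the per-entry failure isolation and environment capture described above, here is a minimal sketch, not the code in this PR. It assumes each matrix entry reduces to a name and a command line; the `name` and `cmd` keys, the helper names, and the result fields are all hypothetical and do not reflect the actual `matrix.yaml` schema. A plain subprocess stands in for the per-entry Ray cluster, and MLflow/W&B logging is omitted.

```python
import json
import platform
import subprocess
import sys
import time


def capture_environment() -> dict:
    """Record basic host facts so results from different runs can be compared."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }


def run_entry(entry: dict, timeout_s: float = 3600.0) -> dict:
    """Run one (hypothetical) matrix entry in its own process.

    A crash or timeout is recorded in the result instead of being raised,
    so the remaining entries still run.
    """
    result = {"name": entry["name"], "env": capture_environment()}
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            entry["cmd"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
        result["returncode"] = proc.returncode
        result["success"] = proc.returncode == 0
        result["stderr_tail"] = proc.stderr[-2000:]
    except subprocess.TimeoutExpired:
        result["returncode"] = None
        result["success"] = False
        result["stderr_tail"] = f"timed out after {timeout_s}s"
    except OSError as exc:
        # For example, the command's executable could not be found.
        result["returncode"] = None
        result["success"] = False
        result["stderr_tail"] = str(exc)
    result["elapsed_s"] = round(time.perf_counter() - start, 3)
    return result


def run_matrix(entries: list[dict]) -> list[dict]:
    """Run every entry in order and return one result record per entry."""
    return [run_entry(entry) for entry in entries]


if __name__ == "__main__":
    # Two toy entries: one succeeds, one exits with a non-zero code.
    matrix = [
        {"name": "ok", "cmd": [sys.executable, "-c", "print('hello')"]},
        {"name": "fails", "cmd": [sys.executable, "-c", "raise SystemExit(3)"]},
    ]
    print(json.dumps(run_matrix(matrix), indent=2))
```

Returning a record for every entry, including failed ones, keeps the output shape uniform, which is what makes results comparable across nights.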

Future TODOs (not ranked by priority)

Upload results and artifacts to Google Drive
Post results to Slack
Create charts from logged results over time
See if we can run each entry in the matrix on a separate cluster (i.e. increase parallelism)
Figure out how to run in CI and, if we do, how to supply environment variables

Signed-off-by: Praateek <[email protected]>

* New API Spec with Ray Backend (#726) * Create package + reorganize (#2) * fc Signed-off-by: Praateek <[email protected]> * remove per file ignore Signed-off-by: Praateek <[email protected]> * sc Signed-off-by: Praateek <[email protected]> * ruff Signed-off-by: Praateek <[email protected]> * use curator_id_str Signed-off-by: Praateek <[email protected]> --------- Signed-off-by: Praateek <[email protected]> * fc Signed-off-by: Praateek <[email protected]> * kmeans works Signed-off-by: Praateek <[email protected]> * Fuzzy dedup fixes (#11) * high level method for each step Signed-off-by: Ayush Dattagupta <[email protected]> * Fixes/changes after testing Signed-off-by: Ayush Dattagupta <[email protected]> * Updates to existing fuzzy_dedup modules Signed-off-by: Ayush Dattagupta <[email protected]> * Add high level fuzzy dedup api and e2e example Signed-off-by: Ayush Dattagupta <[email protected]> * Add e2e example Signed-off-by: Ayush Dattagupta <[email protected]> * Add config Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]> * fc Signed-off-by: Praateek <[email protected]> * fc Signed-off-by: Praateek <[email protected]> * removal works Signed-off-by: Praateek <[email protected]> * bug fix Signed-off-by: Praateek <[email protected]> * working streaming embedding with id generator Signed-off-by: Praateek <[email protected]> * Dump high level skeleton Signed-off-by: Ayush Dattagupta <[email protected]> * update xenna executor Signed-off-by: Ayush Dattagupta <[email protected]> * More changes Signed-off-by: Ayush Dattagupta <[email protected]> * working example Signed-off-by: Praateek <[email protected]> * Revert "working example" This reverts commit 7b3e65173dd1df92b0de9431fcfebdbc0b93d6c9. 
* [WIP] Add reader + utf modifier (#31) * Dump high level skeleton Signed-off-by: Ayush Dattagupta <[email protected]> * update xenna executor Signed-off-by: Ayush Dattagupta <[email protected]> * More changes Signed-off-by: Ayush Dattagupta <[email protected]> * Updates for utfModifier+ high level updates Signed-off-by: Ayush Dattagupta <[email protected]> * Remove old examples and add new modifier and stages Signed-off-by: Ayush Dattagupta <[email protected]> * Add modify stage Signed-off-by: Ayush Dattagupta <[email protected]> * More updates Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]> * Revert "[WIP] Add reader + utf modifier (#31)" (#32) This reverts commit ef25e3eff6502cb9bfc4a57ba48f0939284fd49b. * rebase Signed-off-by: Praateek <[email protected]> * rebase continue Signed-off-by: Praateek <[email protected]> * Remove older file versions Signed-off-by: Sarah Yurick <[email protected]> * Final changes as per the meeting * refactor Signed-off-by: Praateek <[email protected]> * example works Signed-off-by: Praateek <[email protected]> * add base classes Signed-off-by: Praateek <[email protected]> * example works Signed-off-by: Praateek <[email protected]> * .. 
Signed-off-by: Praateek <[email protected]> * more google style Signed-off-by: Praateek <[email protected]> * add init for backends Signed-off-by: Praateek <[email protected]> * Update example script * add impl Signed-off-by: Sarah Yurick <[email protected]> * ruff Signed-off-by: Sarah Yurick <[email protected]> * add suggestions Signed-off-by: Sarah Yurick <[email protected]> * add another check Signed-off-by: Sarah Yurick <[email protected]> * Move changes one level deeper in ray-curator, add pyproject toml Signed-off-by: Ayush Dattagupta <[email protected]> * Update dependencies to include cosmos-xenna and pyarrow explicitly Signed-off-by: Ayush Dattagupta <[email protected]> * Update python upper bound Signed-off-by: Ayush Dattagupta <[email protected]> * Add a simple contributing file with instructions Signed-off-by: Ayush Dattagupta <[email protected]> * Remove pyarrow check since it's an explicit dependency Signed-off-by: Ayush Dattagupta <[email protected]> * Remove unusued file utils Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Praateek <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Sarah Yurick <[email protected]> Co-authored-by: Praateek Mahajan <[email protected]> Co-authored-by: Praateek <[email protected]> Co-authored-by: Sarah Yurick <[email protected]> Co-authored-by: Abhinav Garg <[email protected]> Co-authored-by: Sarah Yurick <[email protected]> * [Ray] Allow loguru to be serialized #729 * [Ray] Add Jsonl / Parquet Writer Stage (#730) * Update CI testing workflow for ray branch (#739) * Update ci workflow to build ray-curator package instead Signed-off-by: Ayush Dattagupta <[email protected]> * Split out CPU and GPU modules Signed-off-by: Ayush Dattagupta <[email protected]> * Update pytest command Signed-off-by: Ayush Dattagupta <[email protected]> * update crossfit dep to use pinned version (avoiding absl dep issues) Signed-off-by: Ayush Dattagupta <[email protected]> * 
Explicitly add absl-py dependency to avoid python 3.10 errors Signed-off-by: Ayush Dattagupta <[email protected]> * Update paths for codecov Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]> * Initial API desing doc (#737) * Intial APi desing doc Signed-off-by: Abhinav Garg <[email protected]> * Update ray-curator/api-design.md Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: Abhinav Garg <[email protected]> * Update ray-curator/api-design.md Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: Abhinav Garg <[email protected]> * Update ray-curator/api-design.md Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: Abhinav Garg <[email protected]> * Update ray-curator/api-design.md Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: Abhinav Garg <[email protected]> * Update ray-curator/api-design.md Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: Abhinav Garg <[email protected]> * Update ray-curator/api-design.md Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: Abhinav Garg <[email protected]> * Update ray-curator/api-design.md Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: Abhinav Garg <[email protected]> * Update ray-curator/api-design.md Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: Abhinav Garg <[email protected]> * Update ray-curator/api-design.md Co-authored-by: Ayush Dattagupta <[email protected]> Signed-off-by: Abhinav Garg <[email protected]> * Refine map-style execution description in API design document to clarify task transformation and mapping flexibility. * Remove redundant sections on Tasks, Stages, and Pipelines from the API design document to streamline content and improve clarity. 
* Add quickstart example and update API design documentation - Introduced a new quickstart example in `ray_curator/examples/quickstart.py` demonstrating a sentiment analysis pipeline with three stages: TaskCreationStage, WordCountStage, and SentimentStage. - Updated `api-design.md` to include a new section for examples, linking to the quickstart for user reference. - Clarified resource requirements in `resources.py` documentation for GPU usage and constraints. * Ruff related changes Signed-off-by: Abhinav Garg <[email protected]> * PR changes Signed-off-by: Abhinav Garg <[email protected]> * Update DocumentTask to DocumentBatch in API design for improved type flexibility Signed-off-by: Abhinav Garg <[email protected]> * Add fault tolerance requirements to API design documentation - Introduced a new section outlining the necessity for fault tolerance and retry safety in all stages. - Highlighted critical aspects such as task preemption and handling of partial operations to ensure robustness during execution. Signed-off-by: Abhinav Garg <[email protected]> --------- Signed-off-by: Abhinav Garg <[email protected]> Co-authored-by: Praateek Mahajan <[email protected]> Co-authored-by: Ayush Dattagupta <[email protected]> * Refactor XennaExecutor by removing the cluster initialization function and deleting the associated ray_cluster_init.py file. This streamlines the execution process by eliminating unnecessary setup code. 
(#768) Signed-off-by: Abhinav Garg <[email protected]> * [Ray] Add Ray Data as an experimental backend (#740) * [Ray] Add integration test to test backends for a specified pipeline (#770) * Adding with_ for options in ProcessingStage and CompositeStage (#764) * [Ray] `DocumentFilter` and `Filter`/`Score`/`ScoreFilter` (#746) * add documentfilter implementation Signed-off-by: Sarah Yurick <[email protected]> * fix nits and ruff Signed-off-by: Sarah Yurick <[email protected]> * add additional logic for setup, setup_on_node, and process_batch Signed-off-by: Sarah Yurick <[email protected]> * add pytests Signed-off-by: Sarah Yurick <[email protected]> * add dep Signed-off-by: Sarah Yurick <[email protected]> * more dep edits Signed-off-by: Sarah Yurick <[email protected]> * another dep Signed-off-by: Sarah Yurick <[email protected]> * add fasttext dep Signed-off-by: Sarah Yurick <[email protected]> * add jieba and mecab Signed-off-by: Sarah Yurick <[email protected]> * add default None params for setup_on_node and setup functions Signed-off-by: Sarah Yurick <[email protected]> * add praateek's suggestions Signed-off-by: Sarah Yurick <[email protected]> * organize imports Signed-off-by: Sarah Yurick <[email protected]> * remove process_batch Signed-off-by: Sarah Yurick <[email protected]> * add _metadata to result Signed-off-by: Sarah Yurick <[email protected]> * add praateek's suggestions Signed-off-by: Sarah Yurick <[email protected]> * ruff and post init for _name Signed-off-by: Sarah Yurick <[email protected]> * modify test Signed-off-by: Sarah Yurick <[email protected]> --------- Signed-off-by: Sarah Yurick <[email protected]> * [Ray] Add Download Extract Base Class + Common Crawl Stage (#738) * [Ray] Use Ray Actors where viable (#792) * Extract And download for WIkipedia (#795) * copy over Signed-off-by: Praateek <[email protected]> * copy over Signed-off-by: Praateek <[email protected]> * add init to download Signed-off-by: Praateek <[email protected]> * move 
justext Signed-off-by: Praateek <[email protected]> * move resiliparse Signed-off-by: Praateek <[email protected]> * move trafilatura Signed-off-by: Praateek <[email protected]> * move get_stop_list_dict Signed-off-by: Praateek <[email protected]> * move download_utils.py to utils/download_utils.py Signed-off-by: Praateek <[email protected]> * move out to download.py Signed-off-by: Praateek <[email protected]> * move WarcIterator towarc_reader.py Signed-off-by: Praateek <[email protected]> * move CommonCrawlWARCExtractor to html_extractor Signed-off-by: Praateek <[email protected]> * remove commoncrawl.py Signed-off-by: Praateek <[email protected]> * create url_generation.py from download_utils Signed-off-by: Praateek <[email protected]> * tests dir Signed-off-by: Praateek <[email protected]> * copy over test_download.py as test_common_crawl.py Signed-off-by: Praateek <[email protected]> * add html_extractors/__init__ Signed-off-by: Praateek <[email protected]> * move html_extractor to ProcessingStage Signed-off-by: Praateek <[email protected]> * update WarcReader to use ProecssingStage Signed-off-by: Praateek <[email protected]> * move to classes for url generation Signed-off-by: Praateek <[email protected]> * typo in name Signed-off-by: Praateek <[email protected]> * bug fixes in justext; rename resiliparse func; utils modular Signed-off-by: Praateek <[email protected]> * init file in for download/text Signed-off-by: Praateek <[email protected]> * justtext minor change Signed-off-by: Praateek <[email protected]> * support str in htmlextractor Signed-off-by: Praateek <[email protected]> * add a working example Signed-off-by: Praateek <[email protected]> * set source_files so that write can be hashed Signed-off-by: Praateek <[email protected]> * use pprint in example Signed-off-by: Praateek <[email protected]> * update comment Signed-off-by: Praateek <[email protected]> * all tests migrated + work Signed-off-by: Praateek <[email protected]> * update defaults in 
example; comments in stage Signed-off-by: Praateek <[email protected]> * add tests for url generation + PR review Signed-off-by: Praateek <[email protected]> * update download for aws Signed-off-by: Praateek <[email protected]> * rename aws to use_aws_to_donwload Signed-off-by: Praateek <[email protected]> * update resources Signed-off-by: Praateek <[email protected]> * change url generation to have ray-stage-spec Signed-off-by: Praateek <[email protected]> * make download fault tolerant Signed-off-by: Praateek <[email protected]> * refactor as per pr reviews; with tests Signed-off-by: Praateek <[email protected]> * add readme Signed-off-by: Praateek <[email protected]> * bug fix; update tests Signed-off-by: Praateek <[email protected]> * update record limit to None Signed-off-by: Praateek <[email protected]> * bug fixes Signed-off-by: Praateek <[email protected]> * pr comments Signed-off-by: Praateek <[email protected]> * add back test html extractor implementations Signed-off-by: Praateek <[email protected]> * remove cc example Signed-off-by: Praateek <[email protected]> * add column utils Signed-off-by: Praateek <[email protected]> * add todos Signed-off-by: Praateek <[email protected]> * Add Wikipedia download and extract stage This commit introduces a comprehensive pipeline for downloading and processing Wikipedia dump files within the ray-curator framework. Key components include: - **WikipediaUrlGenerator**: Generates URLs for Wikipedia dump files. - **WikipediaDownloader**: Downloads .bz2 dump files using wget. - **WikipediaIterator**: Parses Wikipedia XML dumps and extracts article content. - **WikipediaExtractor**: Cleans Wikipedia markup and extracts meaningful text. Additionally, an example script demonstrating the usage of the new stage is included, along with tests for each component to ensure functionality and reliability. Documentation for the new stage is also provided to guide users in implementation and usage. 
Signed-off-by: Abhinav Garg <[email protected]> * merge from main Signed-off-by: Praateek <[email protected]> * move deps to text Signed-off-by: Praateek <[email protected]> * update dev Signed-off-by: Praateek <[email protected]> * update pyproject and test.yml Signed-off-by: Praateek <[email protected]> * remove cugraph extra pyproject Signed-off-by: Praateek <[email protected]> * move text to optional deps Signed-off-by: Praateek <[email protected]> * Refactor pyproject.toml: Remove unused dependencies and clean up dev section Signed-off-by: Abhinav Garg <[email protected]> * Remove unused Wikipedia example and related README documentation from the download text stages. Signed-off-by: Abhinav Garg <[email protected]> * Add method to fetch JSON dump data for Wikipedia and refactor dump date retrieval logic - Introduced `_get_data_for_dump` method to handle fetching and parsing JSON dump data. - Refactored logic in `_get_wikipedia_urls` to iterate through available dumps and check their status. - Improved error handling for cases where dump data cannot be loaded or is not finished. Signed-off-by: Abhinav Garg <[email protected]> * Add README for custom download pipelines and remove Wikipedia stage documentation - Introduced a new README.md file detailing the structure and implementation of custom download pipelines. - Removed the outdated README.md for the Wikipedia download and extract stage to streamline documentation. Signed-off-by: Abhinav Garg <[email protected]> * Add num_workers_per_node method to DocumentDownloader and WikipediaDownloader - Implemented num_workers_per_node method in DocumentDownloader to define the number of workers per node for downloading tasks. - Overridden num_workers_per_node in WikipediaDownloader to return a fixed value of 1. - Updated xenna_stage_spec method in DocumentDownloadStage to include the number of workers per node. 
Signed-off-by: Abhinav Garg <[email protected]> * Update WikipediaDownloader to use 2 workers and change logging level in WikipediaIterator - Modified num_workers_per_node in WikipediaDownloader to return 2, allowing for increased parallelism during downloads. - Changed logging from info to debug level in WikipediaIterator for extracted articles to reduce log verbosity. Signed-off-by: Abhinav Garg <[email protected]> --------- Signed-off-by: Praateek <[email protected]> Signed-off-by: Abhinav Garg <[email protected]> Co-authored-by: Praateek <[email protected]> * Fixing tests (#827) * Refactor Wikipedia extraction and URL generation logic - Removed redundant return statement in `WikipediaExtractor` class. - Simplified status check for dump data in `WikipediaUrlGenerator` by directly accessing the dictionary keys. - Updated logging level in tests to ensure accurate assertions on log calls. - Enhanced test cases for URL generation to cover various dump statuses. These changes improve code clarity and maintainability while ensuring robust error handling in the Wikipedia download and extraction process. Signed-off-by: Abhinav Garg <[email protected]> * Add mwparserfromhell dependency to pyproject.toml - Included `mwparserfromhell==0.6.5` in the text dependencies section of `pyproject.toml` to support parsing Wikipedia markup. This addition enhances the functionality of the project by ensuring the necessary tools for processing Wikipedia data are available. 
Signed-off-by: [Your Name] <[email protected]> Signed-off-by: Abhinav Garg <[email protected]> --------- Signed-off-by: Abhinav Garg <[email protected]> Signed-off-by: [Your Name] <[email protected]> * Update ray version to 2.48 #839 * Re-enable CI/CD for Ray API branch (#840) * CI/CD for Ray API branch Signed-off-by: Sarah Yurick <[email protected]> * add text dependencies Signed-off-by: Sarah Yurick <[email protected]> * only run cpu tests Signed-off-by: Sarah Yurick <[email protected]> * comment instead of delete Signed-off-by: Sarah Yurick <[email protected]> --------- Signed-off-by: Sarah Yurick <[email protected]> * Ray Video Pipeline : Video Reader (#775) * Add video io reader * Add test * Add VideoReaderStage to video reading pipeline and update VideoDownloadStage to accept VideoTask. Enhance video reading capabilities with new tests for VideoReaderStage. Signed-off-by: Ao Tang <[email protected]> * Update VideoDownloadStage to support verbose logging and modify video_read_example to include verbose argument. Signed-off-by: Ao Tang <[email protected]> * Update outputs for VideoDownloadStage and VideoReaderStage to include additional metadata fields. Signed-off-by: Ao Tang <[email protected]> * Update CI workflow to include video dependencies for testing Signed-off-by: Ao Tang <[email protected]> * Add tests for video tasks module - Introduced a new test package for tasks with an initial test suite for the video tasks module, including tests for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes. - Implemented various test cases to validate initialization, property calculations, metadata extraction, and size calculations. This enhances the testing coverage for video-related functionalities in the ray-curator project. 
Signed-off-by: Ao Tang <[email protected]> * Enhance video tasks module with additional test cases - Expanded the test suite for the video tasks module by adding new test cases for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes. - Improved coverage for various functionalities including initialization, property calculations, and metadata extraction. This update strengthens the reliability of video-related features in the ray-curator project. Signed-off-by: Ao Tang <[email protected]> * Update pyproject.toml to include a trailing comma for pynvml dependency Signed-off-by: Ao Tang <[email protected]> * Refactor video processing stages to introduce a composite VideoReaderDownloadStage - Replaced separate VideoReaderStage and VideoDownloadStage with a new VideoReaderDownloadStage that combines both functionalities. - Updated the video_read_example to utilize the new composite stage. - Adjusted inputs and outputs for VideoDownloadStage to reflect changes in the pipeline. - Added tests for the new VideoReaderDownloadStage to ensure proper functionality and integration. This refactor simplifies the video reading and downloading process within the ray-curator framework. 
Signed-off-by: Ao Tang <[email protected]> --------- Signed-off-by: Ao Tang <[email protected]> * chore: Add new trustees and vetters to the copy-pr-bot configuration (#841) (#842) * chore: Add new trustees and vetters to the copy-pr-bot configuration * chore: Remove empty line in copy-pr-bot configuration * chore: Remove ryantwolf from additional trustees and vetters in copy-pr-bot configuration --------- Signed-off-by: Ao Tang <[email protected]> Signed-off-by: NeMo Bot <[email protected]> Co-authored-by: Ao Tang <[email protected]> * ci: Add community-bot (#846) (#849) Signed-off-by: oliver könig <[email protected]> Signed-off-by: NeMo Bot <[email protected]> Co-authored-by: oliver könig <[email protected]> * Ray Video Reader Enhancement (#848) * Refactor video reading stages: Rename VideoReaderStage to VideoListStage and update VideoReaderDownloadStage to use the new class. Adjust tests accordingly to reflect the changes in stage names and functionality. Signed-off-by: Ao Tang <[email protected]> * Rename test_video_reader to test_video_list Signed-off-by: Ao Tang <[email protected]> * Update VideoListStage name and corresponding tests to reflect new naming convention - Changed the internal name of VideoListStage from "video_reader" to "video_list". - Updated assertions in the test for VideoListStage to match the new name. - Adjusted configuration in the VideoReaderDownloadStage to use "video_list" instead of "video_reader". This ensures consistency across the codebase following the recent refactor. Signed-off-by: Ao Tang <[email protected]> * Update test assertions in VideoReaderDownloadStage to use "video_list" instead of "video_reader" Signed-off-by: Ao Tang <[email protected]> * Refactor video processing stages: Replace VideoDownloadStage with VideoReaderStage in VideoReaderDownloadStage. Update related tests to reflect the new structure and ensure consistency across the codebase. 
Signed-off-by: Ao Tang <[email protected]> * Enhance VideoListStage and VideoReaderStage documentation Signed-off-by: Ao Tang <[email protected]> * Refactor video reading pipeline: Introduce VideoLoadingStage as a composite stage that combines VideoListStage and VideoReaderStage. Signed-off-by: Ao Tang <[email protected]> * Remove SplitPipeTask from video module and update imports accordingly. Signed-off-by: Ao Tang <[email protected]> * Refactor video task imports: Update import statements in video_list, video_loading, video_reader, and related test files to use the new video module structure. Signed-off-by: Ao Tang <[email protected]> * ruff fix Signed-off-by: Ao Tang <[email protected]> * Implement FilePartitioningStage: Introduce a new stage for partitioning files into groups based on specified criteria, including a limit on the number of groups. Update VideoLoadingStage to utilize FilePartitioningStage instead of the deprecated VideoListStage. Refactor VideoReaderStage to accept FileGroupTask as input and adjust related tests to ensure functionality and correctness. Signed-off-by: Ao Tang <[email protected]> * Refactor video reading stages: Replace VideoLoadingStage with VideoReader as a composite stage that combines FilePartitioningStage and VideoReaderStage. Update related tests to ensure functionality and correctness. Remove deprecated VideoLoadingStage and its associated tests. Signed-off-by: Ao Tang <[email protected]> * Update video_limit type in VideoReader to support None: Changed the type of video_limit from int to int | None to allow for more flexible configuration. This enhances the usability of the VideoReader class. 
Signed-off-by: Ao Tang <[email protected]> * Refactor file partitioning limit check Signed-off-by: Ao Tang <[email protected]> * Remove redundant tests from TestVideoReader: Deleted tests for video limit values, verbose flag, file extensions, and files per partition configuration to streamline the test suite and focus on essential functionality. Signed-off-by: Ao Tang <[email protected]> --------- Signed-off-by: Ao Tang <[email protected]> * Enhance FilePartitioningStage to enforce task limit check earlier in the process. (#867) Signed-off-by: Ao Tang <[email protected]> * Initialize and shutdown ray session in each executor (#844) * Remove pynvml dependency from pyproject.toml (#872) * docs: refactor all the things (#826) (#859) * docs: refactor all the things * remove auto api docs * api docs to gitignore * updated readme * python linting fixes batch 1 * batch 2 * batch 3 * update --------- Signed-off-by: Lawrence Lane <[email protected]> Signed-off-by: NeMo Bot <[email protected]> Co-authored-by: L.B. 
<[email protected]> Co-authored-by: Sarah Yurick <[email protected]> * ci(fix): Use GITHUB_TOKEN for community bot (#853) (#854) * ci(fix): Use GITHUB_TOKEN for community bot * f --------- Signed-off-by: oliver könig <[email protected]> Signed-off-by: NeMo Bot <[email protected]> Co-authored-by: oliver könig <[email protected]> Co-authored-by: Ayush Dattagupta <[email protected]> * update LLM PII redaction file - fix issue 828 (#868) (#871) * update LLM PII redaction file - fix 828 * Fix ruff check LLM PII redaction file - fix 828 * update LLM PII redaction Enron-file - fix 828 * update LLM-PII redaction README - fix 828 * updated LLM PII redaction Enron-file - fix 828 * updated LLM PII redaction file - fix 828 * Update tutorials/curator-llm-pii/README.md * removed typo from README file - fix 828 * updated LLM redaction tutorial - fix 828 * updated LLM redaction-Enron file - fix 828 * updated LLM redaction-Enron file - fix 828 * Update tutorials/curator-llm-pii/PII-LLM-modification-Enron.ipynb * Update tutorials/curator-llm-pii/PII-LLM-modification-Enron.ipynb --------- Signed-off-by: Adeola Adesoba <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> Signed-off-by: Sarah Yurick <[email protected]> Signed-off-by: NeMo Bot <[email protected]> Co-authored-by: aadesoba-nv <[email protected]> Co-authored-by: Sarah Yurick <[email protected]> * [Tutorials] Lazy import GPU modules in the Llama Nemotron tutorial (#831) (#875) Signed-off-by: Mehran Maghoumi <[email protected]> Signed-off-by: NeMo Bot <[email protected]> Co-authored-by: Mehran Maghoumi <[email protected]> Co-authored-by: Sarah Yurick <[email protected]> * docs: changelog update (#860) (#887) * docs: changelog update * formatting * remove item --------- Signed-off-by: Lawrence Lane <[email protected]> Signed-off-by: NeMo Bot <[email protected]> Co-authored-by: L.B. 
<[email protected]> * linkfixes (#865) (#882) Signed-off-by: Lawrence Lane <[email protected]> Signed-off-by: NeMo Bot <[email protected]> Co-authored-by: L.B. <[email protected]> Co-authored-by: Ayush Dattagupta <[email protected]> * docs: Fixing version switcher issues (#885) (#886) Signed-off-by: Andrew Schilling <[email protected]> Signed-off-by: NeMo Bot <[email protected]> Co-authored-by: Andrew Schilling <[email protected]> * [Ray] Download and extract ArXiv (#805) * remove dask arxiv Signed-off-by: Sarah Yurick <[email protected]> * first pass for entire arxiv implementation Signed-off-by: Sarah Yurick <[email protected]> * ruff Signed-off-by: Sarah Yurick <[email protected]> * fix circular import Signed-off-by: Sarah Yurick <[email protected]> * working module Signed-off-by: Sarah Yurick <[email protected]> * add downloader tests Signed-off-by: Sarah Yurick <[email protected]> * remove unused noqa Signed-off-by: Sarah Yurick <[email protected]> * add test_iterator Signed-off-by: Sarah Yurick <[email protected]> * add extractor tests Signed-off-by: Sarah Yurick <[email protected]> * fix failing download tests Signed-off-by: Sarah Yurick <[email protected]> * add test_stage Signed-off-by: Sarah Yurick <[email protected]> * sort Signed-off-by: Sarah Yurick <[email protected]> * add url generator tests Signed-off-by: Sarah Yurick <[email protected]> * remove noqa Signed-off-by: Sarah Yurick <[email protected]> * remove nemo_curator/download, outdated scripts, outdated examples Signed-off-by: Sarah Yurick <[email protected]> --------- Signed-off-by: Sarah Yurick <[email protected]> Signed-off-by: Sarah Yurick <[email protected]> * [Ray] Classifiers (#753) * [Ray] Classifiers Signed-off-by: Sarah Yurick <[email protected]> * fix ruff Signed-off-by: Sarah Yurick <[email protected]> * add utils file Signed-off-by: Sarah Yurick <[email protected]> * commit quality classifier benchmark helpers Signed-off-by: Sarah Yurick <[email protected]> * use basictokenizer as 
cpu tokenizer, add crossfit config Signed-off-by: Sarah Yurick <[email protected]> * some ruff Signed-off-by: Sarah Yurick <[email protected]> * merge upstream Signed-off-by: Praateek <[email protected]> * use _name, remove gpu resources from labeler Signed-off-by: Sarah Yurick <[email protected]> * consolidate praateek's work with distributeddataclassifier for quality classifier Signed-off-by: Sarah Yurick <[email protected]> * ruff Signed-off-by: Sarah Yurick <[email protected]> * add content type, domain, multilingual domain, and filter_by support Signed-off-by: Sarah Yurick <[email protected]> * support for fineweb, fineweb mixtral, and fineweb nemotron classifiers Signed-off-by: Sarah Yurick <[email protected]> * ruff Signed-off-by: Sarah Yurick <[email protected]> * add prompt task complexity support Signed-off-by: Sarah Yurick <[email protected]> * remove noqa Signed-off-by: Sarah Yurick <[email protected]> * padding_size does not need to be exposed to user Signed-off-by: Sarah Yurick <[email protected]> * max_seq_length does not need to be exposed to the user, set default micro_batch_sizes Signed-off-by: Sarah Yurick <[email protected]> * add max_chars, edit docstring Signed-off-by: Sarah Yurick <[email protected]> * ruff Signed-off-by: Sarah Yurick <[email protected]> * aegis functionality, start working on instruction data guard Signed-off-by: Sarah Yurick <[email protected]> * nit fixes Signed-off-by: Sarah Yurick <[email protected]> * add working pytests for all classifiers Signed-off-by: Sarah Yurick <[email protected]> * remove existing pytest file Signed-off-by: Sarah Yurick <[email protected]> * add more comments to tests Signed-off-by: Sarah Yurick <[email protected]> * address review, add mem conversation, add README Signed-off-by: Sarah Yurick <[email protected]> * move redundant test code Signed-off-by: Sarah Yurick <[email protected]> * ruff Signed-off-by: Sarah Yurick <[email protected]> * model_inference_batch_size and format_name_with_suffix 
Signed-off-by: Sarah Yurick <[email protected]> * add missing hf_token usage, remove test file, restructure dirs and files Signed-off-by: Sarah Yurick <[email protected]> * delete old examples and scripts Signed-off-by: Sarah Yurick <[email protected]> --------- Signed-off-by: Sarah Yurick <[email protected]> Signed-off-by: Praateek <[email protected]> Co-authored-by: Praateek <[email protected]> * [RAY] Add ID Module (#876) * Add id inital working IMP Signed-off-by: Vibhu Jawa <[email protected]> * working add_id Signed-off-by: Vibhu Jawa <[email protected]> * Add ID Signed-off-by: Vibhu Jawa <[email protected]> * Update ray-curator/ray_curator/tasks/tasks.py Co-authored-by: Copilot <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> * Add prefix feature, overwrite, warnings Signed-off-by: Vibhu Jawa <[email protected]> * rename id_prefix to user_prefix Signed-off-by: Vibhu Jawa <[email protected]> * Add in test for tasks and fix task id Signed-off-by: VibhuJawa <[email protected]> --------- Signed-off-by: Vibhu Jawa <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> Signed-off-by: VibhuJawa <[email protected]> Co-authored-by: Copilot <[email protected]> * Add video splitting pipeline with fixed stride extraction and transcoding Stage (#783) * Add video splitting pipeline with fixed stride extraction and transcoding stages - Introduced `video_split_clip_example.py` to demonstrate video splitting functionality. - Added `ClipTranscodingStage` and `FixedStrideExtractorStage` for processing video clips. - Implemented command-line arguments for configuring video processing parameters. - Created utility functions for grouping iterables in `grouping.py`. - Added unit tests for the new stages in `test_clip_transcoding_stage.py` and `test_fixed_stride_extractor_stage.py`. 
Signed-off-by: Ao Tang <[email protected]> * Refactor video splitting pipeline to remove debug mode and enhance stage integration Signed-off-by: Ao Tang <[email protected]> * Add video limit argument to video split clip example Signed-off-by: Ao Tang <[email protected]> * Refactor video processing stages to enhance resource management and integrate new functionalities - Replaced separate VideoReaderStage and VideoDownloadStage with a composite VideoReaderDownloadStage, streamlining the video reading and downloading process. - Updated ClipTranscodingStage to improve GPU resource allocation and added detailed arguments for better configurability. - Adjusted tests to reflect changes in resource management, ensuring accurate assertions on GPU usage. These changes improve the clarity and efficiency of video processing within the ray-curator framework. Signed-off-by: Ao Tang <[email protected]> * Add mock GPU classes and enhance ClipTranscodingStage tests - Introduced MockGpuInfo and MockGpuResources classes to simulate GPU information and resources for testing. - Updated test_resources_gpu_encoder and test_resources_hwaccel_enabled methods to utilize mocks, ensuring accurate resource assertions without dependency on actual GPU hardware. - Enhanced test_different_encoder_configurations to validate resource requirements for various encoder configurations, including GPU settings. These changes improve the robustness of the ClipTranscodingStage tests by isolating them from hardware dependencies, facilitating easier testing and validation. 
Signed-off-by: [Your Name] <[email protected]> Signed-off-by: Ao Tang <[email protected]> * Remove deprecated GPU resource tests from ClipTranscodingStage Signed-off-by: Ao Tang <[email protected]> * Remove unused test for processing in debug mode from ClipTranscodingStage tests Signed-off-by: Ao Tang <[email protected]> * Add unit tests for grouping utilities in the ray_curator.utils module Signed-off-by: Ao Tang <[email protected]> * Enhance video processing stages with ray stage specifications - Added `ray_stage_spec` method to `ClipTranscodingStage`, `VideoDownloadStage`, and `VideoReaderStage` to define stage characteristics for Ray integration. - Updated input and output methods in `ClipTranscodingStage` to include additional input parameters. - Modified `SplitPipeTask` to return properties from `data` instead of `video`, ensuring consistency in task data handling. - Added unit tests to verify the correctness of the new `ray_stage_spec` implementations. These changes improve the integration of video processing stages with Ray's architecture and enhance test coverage for the new functionalities. Signed-off-by: Ao Tang <[email protected]> * Refactor video processing imports and update pipeline stages Signed-off-by: Ao Tang <[email protected]> * Remove unused `IS_ACTOR_STAGE` key from `ray_stage_spec` in `ClipTranscodingStage` and clean up commented-out code. This simplifies the stage specification and prepares for future enhancements. Signed-off-by: Ao Tang <[email protected]> * Remove redundant check for video source bytes in ClipTranscodingStage. This simplifies the process method by eliminating unnecessary error handling when source bytes are not available. Signed-off-by: Ao Tang <[email protected]> * Refactor ClipTranscodingStage to use a class variable for the stage name and implement post-initialization resource setup. Added error handling for None source bytes in the process method. 
Updated tests to remove redundant checks and ensure proper functionality. Signed-off-by: Ao Tang <[email protected]> * Remove unnecessary error handling for None source bytes in ClipTranscodingStage's process method, Signed-off-by: Ao Tang <[email protected]> * remove redudant test Signed-off-by: Ao Tang <[email protected]> * precommit fix Signed-off-by: Ao Tang <[email protected]> --------- Signed-off-by: Ao Tang <[email protected]> Signed-off-by: [Your Name] <[email protected]> * docs: ray curator api autodoc updates (#896) Signed-off-by: Lawrence Lane <[email protected]> * Move all text stages to `stages/text/` (#891) * first pass Signed-off-by: Sarah Yurick <[email protected]> * ruff Signed-off-by: Sarah Yurick <[email protected]> * fix tests Signed-off-by: Sarah Yurick <[email protected]> * fix after merge Signed-off-by: Sarah Yurick <[email protected]> --------- Signed-off-by: Sarah Yurick <[email protected]> * Add Ray Actor Pool Exceuctor (#893) * Initial Minhash implementation on Ray (#837) * Initial minhash logic without Stage API Signed-off-by: Ayush Dattagupta <[email protected]> * update args and support passing in pre-batched files Signed-off-by: Ayush Dattagupta <[email protected]> * Remove old minhash impl Signed-off-by: Ayush Dattagupta <[email protected]> * Add Class to do GPU IO for dedup Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]> * Add ID Generator class Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]> * Move MinHashActor to a GPUMinHash class and create a GPUMinHash Processing stage Signed-off-by: Ayush Dattagupta <[email protected]> * Remove minhash method in favor of minhashProcessingStage Signed-off-by: Ayush Dattagupta <[email protected]> * Add mkdir logic to the writer Signed-off-by: Ayush Dattagupta <[email protected]> * Add file partitioning stage to __init__.py Signed-off-by: Ayush Dattagupta <[email protected]> * Update 
cuda12x extra to deduplication. Bump pynvml to avoid conflicts Signed-off-by: Ayush Dattagupta <[email protected]> * Update stage name Signed-off-by: Ayush Dattagupta <[email protected]> * Add initial minhash tests Signed-off-by: Ayush Dattagupta <[email protected]> * Add rmm pool arg to MinhashStage, default to false in the parent actor Signed-off-by: Ayush Dattagupta <[email protected]> * Move IO and ID generator logic to the Stage rather than the parent GPUMinHash class Signed-off-by: Ayush Dattagupta <[email protected]> * Update GPUMinHash Tests Signed-off-by: Ayush Dattagupta <[email protected]> * Standardize Id generator actor name Signed-off-by: Ayush Dattagupta <[email protected]> * Add GPUMinHashStage tests Signed-off-by: Ayush Dattagupta <[email protected]> * Rename GPUMinHashStage to MinHashStage Signed-off-by: Ayush Dattagupta <[email protected]> * Add marker for GPU tests Signed-off-by: Ayush Dattagupta <[email protected]> * update cpu ci workflow to skip GPU tests Signed-off-by: Ayush Dattagupta <[email protected]> * Skip tests if imports fail Signed-off-by: Ayush Dattagupta <[email protected]> * move cudf import checks before stage imports Signed-off-by: Ayush Dattagupta <[email protected]> * Use storage options from read_kwargs directly Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]> Co-authored-by: Praateek Mahajan <[email protected]> * docs: curate text load data content updates for ray (#895) * docs: load text data article updates Signed-off-by: Lawrence Lane <[email protected]> * remove "ray-curator" for curator Signed-off-by: Lawrence Lane <[email protected]> * simplify naming Signed-off-by: Lawrence Lane <[email protected]> * imports Signed-off-by: Lawrence Lane <[email protected]> * imports Signed-off-by: Lawrence Lane <[email protected]> * imports Signed-off-by: Lawrence Lane <[email protected]> * linkfix Signed-off-by: Lawrence Lane <[email protected]> * read through 
Signed-off-by: Lawrence Lane <[email protected]> * simplification Signed-off-by: Lawrence Lane <[email protected]> * remove placeholder concept details Signed-off-by: Lawrence Lane <[email protected]> * pipeline verbiage Signed-off-by: Lawrence Lane <[email protected]> * initial feedback round Signed-off-by: Lawrence Lane <[email protected]> * reduce admonition noise Signed-off-by: Lawrence Lane <[email protected]> * minor updates Signed-off-by: Lawrence Lane <[email protected]> * minor updates Signed-off-by: Lawrence Lane <[email protected]> * feedback Signed-off-by: Lawrence Lane <[email protected]> --------- Signed-off-by: Lawrence Lane <[email protected]> * Adding function decorator for very simple functions to be converted into stages (#835) * Revert 'Add utility decorators for ProcessingStage creation' (empty cherry-pick) Signed-off-by: Abhinav Garg <[email protected]> * Add utility decorators for ProcessingStage creation This commit introduces a new module containing the `processing_stage` decorator, which allows users to easily convert plain Python functions into `ProcessingStage` instances. The decorator supports configuration options such as stage name, resource allocation, and batch size. Additionally, unit tests have been added to validate the functionality of the decorator and ensure proper handling of task processing. 
Signed-off-by: [Your Name] <[email protected]> Signed-off-by: Abhinav Garg <[email protected]> * test commit Signed-off-by: Sarah Yurick <[email protected]> * add test_stage_registry, other nits Signed-off-by: Sarah Yurick <[email protected]> * overwrite stage registry Signed-off-by: Sarah Yurick <[email protected]> * ruff Signed-off-by: Sarah Yurick <[email protected]> * propagate _metadata and _stage_perf Signed-off-by: Sarah Yurick <[email protected]> * accept resources dict Signed-off-by: Sarah Yurick <[email protected]> * reformat Signed-off-by: Sarah Yurick <[email protected]> * add process_batch tests Signed-off-by: Sarah Yurick <[email protected]> * ruff Signed-off-by: Sarah Yurick <[email protected]> * remove todo Signed-off-by: Sarah Yurick <[email protected]> * add pipeline example Signed-off-by: Sarah Yurick <[email protected]> --------- Signed-off-by: Abhinav Garg <[email protected]> Signed-off-by: [Your Name] <[email protected]> Signed-off-by: Sarah Yurick <[email protected]> Co-authored-by: Sarah Yurick <[email protected]> * Add Text Embedding Model (#899) * Add Ray curator dockerfile and enable testing (#879) * Add Ray curator dockerfile and enable testing Signed-off-by: Dong Hyuk Chang <[email protected]> * Fix indentation issues Signed-off-by: Dong Hyuk Chang <[email protected]> * Update dockerfile and add cuda12x Signed-off-by: Dong Hyuk Chang <[email protected]> * Update coverage pathes Signed-off-by: Dong Hyuk Chang <[email protected]> * Update gpu tests runner Signed-off-by: Dong Hyuk Chang <[email protected]> * Add gpu testing scripts and update Signed-off-by: Dong Hyuk Chang <[email protected]> * Cd into ray-curator for coverage Signed-off-by: Dong Hyuk Chang <[email protected]> * Create dev layer and install dev packages Signed-off-by: Dong Hyuk Chang <[email protected]> * Update coverage paths Signed-off-by: Dong Hyuk Chang <[email protected]> * Install opencv Signed-off-by: Dong Hyuk Chang <[email protected]> * Address syntax error 
Signed-off-by: Dong Hyuk Chang <[email protected]> * Update cv2 ubuntu dependencies Signed-off-by: Dong Hyuk Chang <[email protected]> * Fix typo Signed-off-by: Dong Hyuk Chang <[email protected]> * Add cudf placeholder test Signed-off-by: Dong Hyuk Chang <[email protected]> * Space after import Signed-off-by: Dong Hyuk Chang <[email protected]> * Add gpu_only_import Signed-off-by: Dong Hyuk Chang <[email protected]> * Remove import utils for now Signed-off-by: Dong Hyuk Chang <[email protected]> * Fix spacing Signed-off-by: Dong Hyuk Chang <[email protected]> * Skip gpu tests for cpu Signed-off-by: Dong Hyuk Chang <[email protected]> * Update unit test coverage path Signed-off-by: Dong Hyuk Chang <[email protected]> * Skip gpu coverage report for now Signed-off-by: Dong Hyuk Chang <[email protected]> * Use pixi Signed-off-by: Dong Hyuk Chang <[email protected]> * Fix dockerfile syntax Signed-off-by: Dong Hyuk Chang <[email protected]> * Try ffmpeg only Signed-off-by: Dong Hyuk Chang <[email protected]> * Add extra index url for pixi Signed-off-by: Dong Hyuk Chang <[email protected]> * Address typos Signed-off-by: Dong Hyuk Chang <[email protected]> * Install git Signed-off-by: Dong Hyuk Chang <[email protected]> * Update entrypoint Signed-off-by: Dong Hyuk Chang <[email protected]> * Fix typo Signed-off-by: Dong Hyuk Chang <[email protected]> * Use env var for dev install Signed-off-by: Dong Hyuk Chang <[email protected]> * Resolve syntax error Signed-off-by: Dong Hyuk Chang <[email protected]> * Fix env var and verbose install Signed-off-by: Dong Hyuk Chang <[email protected]> * Update pixi entrypoint and pyproject install Signed-off-by: Dong Hyuk Chang <[email protected]> * Trigger entrypoint before tests Signed-off-by: Dong Hyuk Chang <[email protected]> * Update test entrypoint Signed-off-by: Dong Hyuk Chang <[email protected]> * Source entrypoint Signed-off-by: Dong Hyuk Chang <[email protected]> * Update list of dev install pixi Signed-off-by: Dong Hyuk 
Chang <[email protected]> * Add back cuda12x and index-strategy Signed-off-by: Dong Hyuk Chang <[email protected]> * Turn off verbose install Signed-off-by: Dong Hyuk Chang <[email protected]> * Skip gpu coverage for now Signed-off-by: Dong Hyuk Chang <[email protected]> * Support arm Signed-off-by: Dong Hyuk Chang <[email protected]> * Set timeout for dockerbuild and update pyproject Signed-off-by: Dong Hyuk Chang <[email protected]> * Remove retry github config Signed-off-by: Dong Hyuk Chang <[email protected]> --------- Signed-off-by: Dong Hyuk Chang <[email protected]> * ci: Install ray-curator module (#905) * Add ray curator as pypi dependency Signed-off-by: Dong Hyuk Chang <[email protected]> * Add package info and test import Signed-off-by: Dong Hyuk Chang <[email protected]> * Update pyproject.toml Signed-off-by: Dong Hyuk Chang <[email protected]> * Copy src for pixi install Signed-off-by: Dong Hyuk Chang <[email protected]> * Update test import Signed-off-by: Dong Hyuk Chang <[email protected]> * Revert temp test Signed-off-by: Dong Hyuk Chang <[email protected]> --------- Signed-off-by: Dong Hyuk Chang <[email protected]> * [REVIEW] Add modifers to ray curator (#898) * Inital WIP modifier workflows Signed-off-by: VibhuJawa <[email protected]> * Moved tests and also moved modifiers to text sub module Signed-off-by: VibhuJawa <[email protected]> * Add tests for the meta class and modifier and improve docstring Signed-off-by: VibhuJawa <[email protected]> * Update ray-curator/ray_curator/stages/text/modifiers/slicer.py Co-authored-by: Copilot <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> * Update ray-curator/ray_curator/stages/text/modifiers/line_remover.py Co-authored-by: Copilot <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> * Delete files from dask dir and remove optional download fields Signed-off-by: Vibhu Jawa <[email protected]> * Add pytest as requested Signed-off-by: Vibhu Jawa <[email protected]> --------- 
Signed-off-by: VibhuJawa <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> Co-authored-by: Copilot <[email protected]> * Allow users to fuse multiple `DocumentFilter` objects into a single `ScoreFilter` stage (#850) * Allow users to fuse multiple `DocumentFilter` objects into a single `ScoreFilter` stage Signed-off-by: Sarah Yurick <[email protected]> * remove old example and scripts file Signed-off-by: Sarah Yurick <[email protected]> * add suggestions Signed-off-by: Sarah Yurick <[email protected]> * add init Signed-off-by: Sarah Yurick <[email protected]> * fix csv path Signed-off-by: Sarah Yurick <[email protected]> * clearer error messages Signed-off-by: Sarah Yurick <[email protected]> --------- Signed-off-by: Sarah Yurick <[email protected]> * Fix exception when blocksize is set (#892) (#904) If blocksize is set instead of files_per_partition, this line raised an exception. Signed-off-by: Yurii Paniv <[email protected]> Signed-off-by: NeMo Bot <[email protected]> Co-authored-by: Yurii Paniv <[email protected]> * docs: curate text - process data - language dir (#900) * docs: curate text - process data - language dir Signed-off-by: Lawrence Lane <[email protected]> * remove extra content Signed-off-by: Lawrence Lane <[email protected]> * another pass Signed-off-by: Lawrence Lane <[email protected]> * remove pool Signed-off-by: Lawrence Lane <[email protected]> * formatting Signed-off-by: Lawrence Lane <[email protected]> * feedback Signed-off-by: Lawrence Lane <[email protected]> * clarificaiton and alternative as pipeline stage. removed extra section Signed-off-by: Lawrence Lane <[email protected]> * Update docs/curate-text/process-data/language-management/language.md Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: L.B. <[email protected]> --------- Signed-off-by: Lawrence Lane <[email protected]> Signed-off-by: L.B. 
<[email protected]> Co-authored-by: Sarah Yurick <[email protected]> * docs: add README for experimental scripts directory (#910) Signed-off-by: Abhinav Garg <[email protected]> * Add IdGenerator to JsonlReader + IdGenerator tests / write_to_disk / from_disk (#907) * Initial buckets to edges stage (#909) * Initial buckets to edges stage Signed-off-by: Ayush Dattagupta <[email protected]> * re-add file utils from lsh pr Signed-off-by: Ayush Dattagupta <[email protected]> * Handle directory cleanup/creation logic in the stage Signed-off-by: Ayush Dattagupta <[email protected]> * Add tests for buckets to edglist Signed-off-by: Ayush Dattagupta <[email protected]> * Rename doc_id_column to doc_id_field, update storage_options to read/write_kwargs instead Signed-off-by: Ayush Dattagupta <[email protected]> * Fix indentation Signed-off-by: Ayush Dattagupta <[email protected]> * Fix kwargs args Signed-off-by: Ayush Dattagupta <[email protected]> * Add copyright headers Signed-off-by: Ayush Dattagupta <[email protected]> * remove previous curator impl Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]> * [SemDedup] Add KMeans (#912) * S3 Client (#903) * WIP Signed-off-by: Ao Tang <[email protected]> * WIP Signed-off-by: Ao Tang <[email protected]> * Refactor S3 client configuration and enhance video reading logging - Updated S3_PROFILE_PATH to use an environment variable for better flexibility in specifying the S3 credentials file location. - Improved logging in VideoReaderStage to provide more informative messages about video byte downloads, including the size of the downloaded video. Signed-off-by: Ao Tang <[email protected]> * Enhance VideoReader functionality with S3 support and improve validation checks - Updated VideoReader to conditionally use ClientPartitioningStage for S3 paths and FilePartitioningStage for local paths, improving flexibility in handling video sources. 
- Enhanced validation in VideoTask to check for the existence of input videos when provided as pathlib.Path, ensuring better error handling. - Removed unused methods from S3Client to streamline the codebase. Signed-off-by: Ao Tang <[email protected]> * Remove redundant exception raising in VideoReaderStage to improve error handling during video reading. This change prevents unnecessary propagation of exceptions while still logging errors effectively. Signed-off-by: Ao Tang <[email protected]> * Refactor ClientPartitioningStage and enhance S3 client configuration - Rearranged import statements for better organization and readability in `client_partitioning.py` and `video_reader.py`. - Updated `S3ClientConfig` and `BaseClientConfig` to use `@dataclass` for improved data handling. - Added comprehensive unit tests for `ClientPartitioningStage`, covering initialization, setup, and processing methods with various scenarios. - Improved error handling and validation in the `_read_list_json` function. This refactor enhances the maintainability and test coverage of the codebase, ensuring better functionality and reliability in handling client partitioning tasks. Signed-off-by: Ao Tang <[email protected]> * Remove SPDX license comments from S3 client, storage client, and storage utilities files to streamline code readability. This change simplifies the file headers while retaining essential module documentation. Signed-off-by: Ao Tang <[email protected]> * Use Fsspec instead of boto3 Signed-off-by: Ao Tang <[email protected]> * Refactor file handling and enhance video reading capabilities - Introduced a new `FSPath` class in `client_utils.py` for improved file operations with fsspec. - Updated `ClientPartitioningStage` and `VideoReaderStage` to utilize the new `FSPath` class for better handling of file paths. - Removed unused imports and streamlined code in `client_partitioning.py` and `video_reader.py`. 
- Enhanced error handling in `VideoReaderStage` to support various input types for video sources. This refactor improves the maintainability and flexibility of file handling in the video processing pipeline. Signed-off-by: Ao Tang <[email protected]> * move client_partitioning.py Signed-off-by: Ao Tang <[email protected]> * ruff check Signed-off-by: Ao Tang <[email protected]> * Fix broken tests Signed-off-by: Ao Tang <[email protected]> * Add `as_posix` method to `FSPath` class and implement comprehensive test suite - Introduced `as_posix` method in the `FSPath` class to convert filesystem paths to POSIX format, accommodating various protocols. - Created a new test suite for `FSPath` in `test_client_utils.py`, covering initialization, string representation, file operations, and edge cases. - Enhanced tests for `get_bytes_cat_ranges` to handle different file sizes and error scenarios. This update improves the functionality and test coverage of the `FSPath` class, ensuring robust file handling across different filesystems. Signed-off-by: Ao Tang <[email protected]> * Remove logging of downloaded video size in VideoReaderStage to streamline error handling and reduce unnecessary output. Signed-off-by: Ao Tang <[email protected]> * Refactor video reading and splitting pipeline examples for improved readability - Reformatted the `create_video_reading_pipeline` and `create_video_splitting_pipeline` functions to enhance code clarity by aligning parameters and removing unnecessary line breaks. - Updated the `VideoReader` and `ClipTranscodingStage` instantiation to follow a consistent style. - Made minor adjustments in the `ClientPartitioningStage` to ensure consistent formatting and improved readability. These changes contribute to a cleaner and more maintainable codebase for video processing pipelines. 
Signed-off-by: Ao Tang <[email protected]> --------- Signed-off-by: Ao Tang <[email protected]> * Add ClipWriterStage to video splitting pipeline Clean (#897) * WIP Signed-off-by: Ao Tang <[email protected]> * WIP Signed-off-by: Ao Tang <[email protected]> * Update ClipWriterStage to clarify local storage usage Signed-off-by: [Your Name] <[email protected]> Signed-off-by: Ao Tang <[email protected]> * Enhance video clip processing with new GenericClipWriterStage and required output path argument - Introduced a new GenericClipWriterStage for writing video clips and their metadata, consolidating the writing process and improving resource management. - Updated the video_split_clip_example to require an output clip path, ensuring that users specify where to save the generated clips. - The new stage supports parallel writing of clips and metadata, enhancing performance and flexibility in video processing workflows. Signed-off-by: Ao Tang <[email protected]> * Enhance ClipWriterStage with additional metadata handling - Improved `ClipWriterStage` to support writing additional metadata during video processing. - Updated related utility functions to accommodate new metadata fields. - Refined unit tests to cover the new functionality and ensure reliability. Signed-off-by: Ao Tang <[email protected]> * Add ClipWriterStage to video splitting pipeline - Introduced `ClipWriterStage` for writing clips and metadata during video processing. - Updated `video_split_clip_example.py` to include the new stage, allowing for clip writing functionality. - Enhanced command-line argument parsing for output clip path. - Added utility functions for managing storage paths and writing data in various formats. - Implemented unit tests for `ClipWriterStage` to ensure functionality and reliability. 
Signed-off-by: Ao Tang <[email protected]> * ruff fix Signed-off-by: Ao Tang <[email protected]> * ruff format Signed-off-by: Ao Tang <[email protected]> * Refactor S3 client configuration and enhance video reading logging - Updated S3_PROFILE_PATH to use an environment variable for better flexibility in specifying the S3 credentials file location. - Improved logging in VideoReaderStage to provide more informative messages about video byte downloads, including the size of the downloaded video. Signed-off-by: Ao Tang <[email protected]> * Enhance VideoReader functionality with S3 support and improve validation checks - Updated VideoReader to conditionally use ClientPartitioningStage for S3 paths and FilePartitioningStage for local paths, improving flexibility in handling video sources. - Enhanced validation in VideoTask to check for the existence of input videos when provided as pathlib.Path, ensuring better error handling. - Removed unused methods from S3Client to streamline the codebase. Signed-off-by: Ao Tang <[email protected]> * Remove redundant exception raising in VideoReaderStage to improve error handling during video reading. This change prevents unnecessary propagation of exceptions while still logging errors effectively. Signed-off-by: Ao Tang <[email protected]> * Refactor ClientPartitioningStage and enhance S3 client configuration - Rearranged import statements for better organization and readability in `client_partitioning.py` and `video_reader.py`. - Updated `S3ClientConfig` and `BaseClientConfig` to use `@dataclass` for improved data handling. - Added comprehensive unit tests for `ClientPartitioningStage`, covering initialization, setup, and processing methods with various scenarios. - Improved error handling and validation in the `_read_list_json` function. This refactor enhances the maintainability and test coverage of the codebase, ensuring better functionality and reliability in handling client partitioning tasks. 
Signed-off-by: Ao Tang <[email protected]> * Remove SPDX license comments from S3 client, storage client, and storage utilities files to streamline code readability. This change simplifies the file headers while retaining essential module documentation. Signed-off-by: Ao Tang <[email protected]> * Use Fsspec instead of boto3 Signed-off-by: Ao Tang <[email protected]> * Refactor file handling and enhance video reading capabilities - Introduced a new `FSPath` class in `client_utils.py` for improved file operations with fsspec. - Updated `ClientPartitioningStage` and `VideoReaderStage` to utilize the new `FSPath` class for better handling of file paths. - Removed unused imports and streamlined code in `client_partitioning.py` and `video_reader.py`. - Enhanced error handling in `VideoReaderStage` to support various input types for video sources. This refactor improves the maintainability and flexibility of file handling in the video processing pipeline. Signed-off-by: Ao Tang <[email protected]> * move client_partitioning.py Signed-off-by: Ao Tang <[email protected]> * ruff check Signed-off-by: Ao Tang <[email protected]> * Fix broken tests Signed-off-by: Ao Tang <[email protected]> * Remove unused `generic_clip_writer.py`, `storage_client.py`, and related utility files; refactor `writer_utils.py` to eliminate storage client dependencies and streamline file writing functions. Update tests to reflect these changes and ensure compatibility with the new structure. Signed-off-by: Ao Tang <[email protected]> * Remove test file `test_client_utils.py` for the `FSPath` class, cleaning up unused test cases and ensuring the test suite reflects the current codebase structure. Signed-off-by: Ao Tang <[email protected]> * Refactor ClipWriterStage to remove storage client dependencies and streamline file writing methods. Updated method signatures to eliminate storage client parameters, enhancing code clarity and maintainability. 
Signed-off-by: Ao Tang <[email protected]> * Remove unused import of ClipWriterStage in video_split_clip_example.py to streamline the code and improve clarity. Signed-off-by: Abhinav Garg <[email protected]> * Remove unused `input_s3_profile_name` attribute from `VideoReader` class to streamline the code and improve clarity. Signed-off-by: Ao Tang <[email protected]> * Remove unused `input_s3_profile_name` attribute from `VideoReader` class to streamline the code and improve clarity. Signed-off-by: Abhinav Garg <[email protected]> * Refactor video metadata writing in ClipWriterStage by removing an unnecessary blank line for improved code clarity. Update get_full_path function signature for consistency in type hinting. Enhance test case formatting in TestVideoReaderStage to improve readability and maintainability. Signed-off-by: Ao Tang <[email protected]> --------- Signed-off-by: Ao Tang <[email protected]> Signed-off-by: [Your Name] <[email protected]> Signed-off-by: Abhinav Garg <[email protected]> Co-authored-by: Abhinav Garg <[email protected]> * Add motion filtering stages to video splitting pipeline (#797) * Add video io reader * Add test * Add video splitting pipeline with fixed stride extraction and transcoding stages - Introduced `video_split_clip_example.py` to demonstrate video splitting functionality. - Added `ClipTranscodingStage` and `FixedStrideExtractorStage` for processing video clips. - Implemented command-line arguments for configuring…

* delete old config, examples, nemo_curator, and tests directories Signed-off-by: Sarah Yurick <[email protected]>
* move ray-curator contents to top level directory Signed-off-by: Sarah Yurick <[email protected]>
* edit all files in backends/ Signed-off-by: Sarah Yurick <[email protected]>
* all core, examples, and metrics files Signed-off-by: Sarah Yurick <[email protected]>
* all files under models, pipeline, tasks, utils directories Signed-off-by: Sarah Yurick <[email protected]>
* all stages except text/ and video/ Signed-off-by: Sarah Yurick <[email protected]>
* all text stages Signed-off-by: Sarah Yurick <[email protected]>
* all video stages Signed-off-by: Sarah Yurick <[email protected]>
* move examples dir to top level, rename ray_curator to nemo_curator Signed-off-by: Sarah Yurick <[email protected]>
* add tests for backends, core, models, and pipelines Signed-off-by: Sarah Yurick <[email protected]>
* remaining non stage tests Signed-off-by: Sarah Yurick <[email protected]>
* some stages tests Signed-off-by: Sarah Yurick <[email protected]>
* remaining tests Signed-off-by: Sarah Yurick <[email protected]>
* delete outdated tutorials Signed-off-by: Sarah Yurick <[email protected]>
* edit github and docker files Signed-off-by: Sarah Yurick <[email protected]>
* delete dask conftest Signed-off-by: Sarah Yurick <[email protected]>
* edit pyproject and fix some ruff Signed-off-by: Sarah Yurick <[email protected]>
* more ruff Signed-off-by: Sarah Yurick <[email protected]>
* fix more ruff Signed-off-by: Sarah Yurick <[email protected]>
* playing catch up Signed-off-by: Sarah Yurick <[email protected]>
* edit docs Signed-off-by: Sarah Yurick <[email protected]>
* comment out cicd-gpu-tests Signed-off-by: Sarah Yurick <[email protected]>
* remove resources as property usages Signed-off-by: Sarah Yurick <[email protected]>
* edit test Signed-off-by: Sarah Yurick <[email protected]>
* update fuzzy dedup from main Signed-off-by: Sarah Yurick <[email protected]>
* edit from image pr Signed-off-by: Sarah Yurick <[email protected]>

--------- Signed-off-by: Sarah Yurick <[email protected]>

praateekmahajan and others added 24 commits August 28, 2025 12:50

fc

981efcd

Signed-off-by: Praateek <[email protected]>

some more changes

fb1a628

Signed-off-by: Praateek <[email protected]>

pr review

39c4fe9

Signed-off-by: Praateek <[email protected]>

Merge branch 'main' into praateek/add-removal-workflow

663d74c

fc

13addb6

Signed-off-by: Praateek <[email protected]>

Merge branch 'main' into praateek/add-removal-workflow

006a023

typo missed

c899f2e

Signed-off-by: Praateek <[email protected]>

Merge branch 'main' into praateek/add-removal-workflow

a8cd862

Signed-off-by: Praateek Mahajan <[email protected]>

delete base

9482b21

Signed-off-by: Praateek <[email protected]>

..

1d16cd4

Signed-off-by: Praateek <[email protected]>

fc

0e7db8a

Signed-off-by: Praateek <[email protected]>

use nc

3976ddb

Signed-off-by: Praateek <[email protected]>

...

0e1b44c

Signed-off-by: Praateek <[email protected]>

no need for path in kill id gen

797dc41

Signed-off-by: Praateek <[email protected]>

move to finally

d4d9dc7

Signed-off-by: Praateek <[email protected]>

Merge branch 'praateek/add-removal-workflow' into praateek/nc-benchma…

18fd779

…rking-scripts Signed-off-by: Praateek <[email protected]>

..

7ac071f

Signed-off-by: Praateek <[email protected]>

Merge remote-tracking branch 'upstream/main' into praateek/nc-benchma…

14e39a6

…rking-scripts Signed-off-by: Praateek <[email protected]>

fc

1bfeff8

Signed-off-by: Praateek <[email protected]>

Merge branch 'main' into praateek/nc-benchmarking-scripts

1102553

common crawl

195b5f0

Signed-off-by: Praateek <[email protected]>

Merge branch 'praateek/nc-benchmarking-scripts' of github.com:praatee…

176119d

…kmahajan/NeMo-Curator into praateek/nc-benchmarking-scripts Signed-off-by: Praateek <[email protected]>

copy-pr-bot bot temporarily deployed to test September 4, 2025 21:02 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci September 4, 2025 21:02 Error

Merge remote-tracking branch 'upstream/main' into praateek/nc-benchma…

511245e

…rking-scripts Signed-off-by: Praateek <[email protected]>

copy-pr-bot bot temporarily deployed to test September 4, 2025 21:07 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci September 4, 2025 21:07 Failure

rlratzel mentioned this pull request Oct 23, 2025

[WIP] Adds initial benchmarking framework #1197

Draft

3 participants
