praateekmahajan commented Sep 2, 2025

Description

Nightly benchmarking runs a configurable matrix of benchmark scripts with per-entry Ray cluster isolation. Each run produces standardized result outputs and a capture of the environment, and can optionally log to MLflow or W&B. The goal is reproducible, comparable performance measurements across datasets and executors, with clean per-run artifacts and failures isolated to the entry that caused them.
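
To make the failure-isolation idea concrete, here is a minimal sketch in Python. It is not the code from this PR: `run_entry`, `run_matrix`, and the result fields are hypothetical names, and the Ray cluster setup and teardown are replaced by a plain function call. It only illustrates that a failing entry is recorded in the same result shape as a successful one and does not stop the remaining entries.

```python
import time
import traceback
from typing import Any, Callable


def run_entry(name: str, fn: Callable[[], dict[str, Any]]) -> dict[str, Any]:
    """Run one matrix entry and always return a result record (hypothetical schema)."""
    record: dict[str, Any] = {"name": name, "success": False, "metrics": {}, "error": None}
    start = time.perf_counter()
    try:
        # Assumption: a real runner would set up and tear down Ray state
        # for this entry around the call below.
        record["metrics"] = fn()
        record["success"] = True
    except Exception:
        # Keep the traceback so the failure can be inspected later.
        record["error"] = traceback.format_exc()
    finally:
        record["elapsed_s"] = time.perf_counter() - start
    return record


def run_matrix(entries: dict[str, Callable[[], dict[str, Any]]]) -> list[dict[str, Any]]:
    """Run every entry in order; one failure never aborts the others."""
    return [run_entry(name, fn) for name, fn in entries.items()]


if __name__ == "__main__":
    demo = {
        "fast_benchmark": lambda: {"rows": 10},
        "broken_benchmark": lambda: 1 / 0,  # simulated failure
    }
    for result in run_matrix(demo):
        print(result["name"], "ok" if result["success"] else "failed")
```

Returning a record in both the success and failure paths is what keeps the outputs comparable across runs; whether the actual runner uses this exact shape is an assumption.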

Usage

python -m nightly_benchmarking.run \
  --matrix nightly_benchmarking/matrix.yaml \
  --datasets nightly_benchmarking/dataset_paths.json
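
The description above mentions environment capture. A minimal sketch of what such a capture could look like, using only the Python standard library, is shown below. The function name `capture_environment`, the returned keys, and the default package list are assumptions for illustration and are not taken from this PR.

```python
import json
import platform
import sys
from importlib import metadata
from typing import Optional


def capture_environment(packages: tuple[str, ...] = ("ray",)) -> dict:
    """Collect interpreter, platform, and package versions (hypothetical schema)."""
    versions: dict[str, Optional[str]] = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing the run.
            versions[pkg] = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    print(json.dumps(capture_environment(), indent=2))
```

Storing a snapshot like this next to each run's results makes it easier to explain a performance change that comes from a dependency upgrade rather than from the code under test.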

Future TODOs (not ranked by priority)

  1. Upload results and artifacts to Google Drive
  2. Add the ability to post to Slack
  3. Create charts from logged results over time
  4. See if we can run each entry in the matrix on a separate cluster (i.e., increase parallelism)
  5. Figure out how to run in CI and, if we do, how to supply environment variables

praateekmahajan and others added 24 commits August 28, 2025 12:50
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
* New API Spec with Ray Backend (#726)

* Create package + reorganize  (#2)

* fc

Signed-off-by: Praateek <[email protected]>

* remove per file ignore

Signed-off-by: Praateek <[email protected]>

* sc

Signed-off-by: Praateek <[email protected]>

* ruff

Signed-off-by: Praateek <[email protected]>

* use curator_id_str

Signed-off-by: Praateek <[email protected]>

---------

Signed-off-by: Praateek <[email protected]>

* fc

Signed-off-by: Praateek <[email protected]>

* kmeans works

Signed-off-by: Praateek <[email protected]>

* Fuzzy dedup fixes (#11)

* high level method for each step

Signed-off-by: Ayush Dattagupta <[email protected]>

* Fixes/changes after testing

Signed-off-by: Ayush Dattagupta <[email protected]>

* Updates to existing fuzzy_dedup modules

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add high level fuzzy dedup api and e2e example

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add e2e example

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add config

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>

* fc

Signed-off-by: Praateek <[email protected]>

* fc

Signed-off-by: Praateek <[email protected]>

* removal works

Signed-off-by: Praateek <[email protected]>

* bug fix

Signed-off-by: Praateek <[email protected]>

* working streaming embedding with id generator

Signed-off-by: Praateek <[email protected]>

* Dump high level skeleton

Signed-off-by: Ayush Dattagupta <[email protected]>

* update xenna executor

Signed-off-by: Ayush Dattagupta <[email protected]>

* More changes

Signed-off-by: Ayush Dattagupta <[email protected]>

* working example

Signed-off-by: Praateek <[email protected]>

* Revert "working example"

This reverts commit 7b3e65173dd1df92b0de9431fcfebdbc0b93d6c9.

* [WIP] Add reader + utf modifier (#31)

* Dump high level skeleton

Signed-off-by: Ayush Dattagupta <[email protected]>

* update xenna executor

Signed-off-by: Ayush Dattagupta <[email protected]>

* More changes

Signed-off-by: Ayush Dattagupta <[email protected]>

* Updates for utfModifier+ high level updates

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove old examples and add new modifier and stages

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add modify stage

Signed-off-by: Ayush Dattagupta <[email protected]>

* More updates

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>

* Revert "[WIP] Add reader + utf modifier (#31)" (#32)

This reverts commit ef25e3eff6502cb9bfc4a57ba48f0939284fd49b.

* rebase

Signed-off-by: Praateek <[email protected]>

* rebase continue

Signed-off-by: Praateek <[email protected]>

* Remove older file versions

Signed-off-by: Sarah Yurick <[email protected]>

* Final changes as per the meeting

* refactor

Signed-off-by: Praateek <[email protected]>

* example works

Signed-off-by: Praateek <[email protected]>

* add base classes

Signed-off-by: Praateek <[email protected]>

* example works

Signed-off-by: Praateek <[email protected]>

* ..

Signed-off-by: Praateek <[email protected]>

* more google style

Signed-off-by: Praateek <[email protected]>

* add init for backends

Signed-off-by: Praateek <[email protected]>

* Update example script

* add impl

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

* add suggestions

Signed-off-by: Sarah Yurick <[email protected]>

* add another check

Signed-off-by: Sarah Yurick <[email protected]>

* Move changes one level deeper in ray-curator, add pyproject toml

Signed-off-by: Ayush Dattagupta <[email protected]>

* Update dependencies to include cosmos-xenna and pyarrow explicitly

Signed-off-by: Ayush Dattagupta <[email protected]>

* Update python upper bound

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add a simple contributing file with instructions

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove pyarrow check since it's an explicit dependency

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove unused file utils

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Praateek <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Co-authored-by: Praateek Mahajan <[email protected]>
Co-authored-by: Praateek <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Co-authored-by: Abhinav Garg <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>

* [Ray] Allow loguru to be serialized #729

* [Ray] Add Jsonl / Parquet Writer Stage (#730)

* Update CI testing workflow for ray branch (#739)

* Update ci workflow to build ray-curator package instead

Signed-off-by: Ayush Dattagupta <[email protected]>

* Split out CPU and GPU modules

Signed-off-by: Ayush Dattagupta <[email protected]>

* Update pytest command

Signed-off-by: Ayush Dattagupta <[email protected]>

* update crossfit dep to use pinned version (avoiding absl dep issues)

Signed-off-by: Ayush Dattagupta <[email protected]>

* Explicitly add absl-py dependency to avoid python 3.10 errors

Signed-off-by: Ayush Dattagupta <[email protected]>

* Update paths for codecov

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>

* Initial API design doc (#737)

* Initial API design doc

Signed-off-by: Abhinav Garg <[email protected]>

* Update ray-curator/api-design.md

Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>

* Update ray-curator/api-design.md

Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>

* Update ray-curator/api-design.md

Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>

* Update ray-curator/api-design.md

Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>

* Update ray-curator/api-design.md

Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>

* Update ray-curator/api-design.md

Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>

* Update ray-curator/api-design.md

Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>

* Update ray-curator/api-design.md

Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>

* Update ray-curator/api-design.md

Co-authored-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>

* Refine map-style execution description in API design document to clarify task transformation and mapping flexibility.

* Remove redundant sections on Tasks, Stages, and Pipelines from the API design document to streamline content and improve clarity.

* Add quickstart example and update API design documentation

- Introduced a new quickstart example in `ray_curator/examples/quickstart.py` demonstrating a sentiment analysis pipeline with three stages: TaskCreationStage, WordCountStage, and SentimentStage.
- Updated `api-design.md` to include a new section for examples, linking to the quickstart for user reference.
- Clarified resource requirements in `resources.py` documentation for GPU usage and constraints.

* Ruff related changes

Signed-off-by: Abhinav Garg <[email protected]>

* PR changes

Signed-off-by: Abhinav Garg <[email protected]>

* Update DocumentTask to DocumentBatch in API design for improved type flexibility

Signed-off-by: Abhinav Garg <[email protected]>

* Add fault tolerance requirements to API design documentation

- Introduced a new section outlining the necessity for fault tolerance and retry safety in all stages.
- Highlighted critical aspects such as task preemption and handling of partial operations to ensure robustness during execution.

Signed-off-by: Abhinav Garg <[email protected]>

---------

Signed-off-by: Abhinav Garg <[email protected]>
Co-authored-by: Praateek Mahajan <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>

* Refactor XennaExecutor by removing the cluster initialization function and deleting the associated ray_cluster_init.py file. This streamlines the execution process by eliminating unnecessary setup code. (#768)

Signed-off-by: Abhinav Garg <[email protected]>

* [Ray] Add Ray Data as an experimental backend (#740)

* [Ray] Add integration test to test backends for a specified pipeline (#770)

* Adding with_ for options in ProcessingStage and CompositeStage (#764)

* [Ray] `DocumentFilter` and `Filter`/`Score`/`ScoreFilter` (#746)

* add documentfilter implementation

Signed-off-by: Sarah Yurick <[email protected]>

* fix nits and ruff

Signed-off-by: Sarah Yurick <[email protected]>

* add additional logic for setup, setup_on_node, and process_batch

Signed-off-by: Sarah Yurick <[email protected]>

* add pytests

Signed-off-by: Sarah Yurick <[email protected]>

* add dep

Signed-off-by: Sarah Yurick <[email protected]>

* more dep edits

Signed-off-by: Sarah Yurick <[email protected]>

* another dep

Signed-off-by: Sarah Yurick <[email protected]>

* add fasttext dep

Signed-off-by: Sarah Yurick <[email protected]>

* add jieba and mecab

Signed-off-by: Sarah Yurick <[email protected]>

* add default None params for setup_on_node and setup functions

Signed-off-by: Sarah Yurick <[email protected]>

* add praateek's suggestions

Signed-off-by: Sarah Yurick <[email protected]>

* organize imports

Signed-off-by: Sarah Yurick <[email protected]>

* remove process_batch

Signed-off-by: Sarah Yurick <[email protected]>

* add _metadata to result

Signed-off-by: Sarah Yurick <[email protected]>

* add praateek's suggestions

Signed-off-by: Sarah Yurick <[email protected]>

* ruff and post init for _name

Signed-off-by: Sarah Yurick <[email protected]>

* modify test

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>

* [Ray] Add Download Extract Base Class +  Common Crawl Stage (#738)

* [Ray] Use Ray Actors where viable (#792)

* Extract and download for Wikipedia (#795)

* copy over

Signed-off-by: Praateek <[email protected]>

* copy over

Signed-off-by: Praateek <[email protected]>

* add init to download

Signed-off-by: Praateek <[email protected]>

* move justext

Signed-off-by: Praateek <[email protected]>

* move resiliparse

Signed-off-by: Praateek <[email protected]>

* move trafilatura

Signed-off-by: Praateek <[email protected]>

* move get_stop_list_dict

Signed-off-by: Praateek <[email protected]>

* move download_utils.py to utils/download_utils.py

Signed-off-by: Praateek <[email protected]>

* move out to download.py

Signed-off-by: Praateek <[email protected]>

* move WarcIterator to warc_reader.py

Signed-off-by: Praateek <[email protected]>

* move CommonCrawlWARCExtractor to html_extractor

Signed-off-by: Praateek <[email protected]>

* remove commoncrawl.py

Signed-off-by: Praateek <[email protected]>

* create url_generation.py from download_utils

Signed-off-by: Praateek <[email protected]>

* tests dir

Signed-off-by: Praateek <[email protected]>

* copy over test_download.py as test_common_crawl.py

Signed-off-by: Praateek <[email protected]>

* add html_extractors/__init__

Signed-off-by: Praateek <[email protected]>

* move html_extractor to ProcessingStage

Signed-off-by: Praateek <[email protected]>

* update WarcReader to use ProcessingStage

Signed-off-by: Praateek <[email protected]>

* move to classes for url generation

Signed-off-by: Praateek <[email protected]>

* typo in name

Signed-off-by: Praateek <[email protected]>

* bug fixes in justext; rename resiliparse func; utils modular

Signed-off-by: Praateek <[email protected]>

* init file in for download/text

Signed-off-by: Praateek <[email protected]>

* justtext minor change

Signed-off-by: Praateek <[email protected]>

* support str in htmlextractor

Signed-off-by: Praateek <[email protected]>

* add a working example

Signed-off-by: Praateek <[email protected]>

* set source_files so that write can be hashed

Signed-off-by: Praateek <[email protected]>

* use pprint in example

Signed-off-by: Praateek <[email protected]>

* update comment

Signed-off-by: Praateek <[email protected]>

* all tests migrated + work

Signed-off-by: Praateek <[email protected]>

* update defaults in example; comments in stage

Signed-off-by: Praateek <[email protected]>

* add tests for url generation + PR review

Signed-off-by: Praateek <[email protected]>

* update download for aws

Signed-off-by: Praateek <[email protected]>

* rename aws to use_aws_to_donwload

Signed-off-by: Praateek <[email protected]>

* update resources

Signed-off-by: Praateek <[email protected]>

* change url generation to have ray-stage-spec

Signed-off-by: Praateek <[email protected]>

* make download fault tolerant

Signed-off-by: Praateek <[email protected]>

* refactor as per pr reviews; with tests

Signed-off-by: Praateek <[email protected]>

* add readme

Signed-off-by: Praateek <[email protected]>

* bug fix; update tests

Signed-off-by: Praateek <[email protected]>

* update record limit to None

Signed-off-by: Praateek <[email protected]>

* bug fixes

Signed-off-by: Praateek <[email protected]>

* pr comments

Signed-off-by: Praateek <[email protected]>

* add back test html extractor implementations

Signed-off-by: Praateek <[email protected]>

* remove cc example

Signed-off-by: Praateek <[email protected]>

* add column utils

Signed-off-by: Praateek <[email protected]>

* add todos

Signed-off-by: Praateek <[email protected]>

* Add Wikipedia download and extract stage

This commit introduces a comprehensive pipeline for downloading and processing Wikipedia dump files within the ray-curator framework. Key components include:

- **WikipediaUrlGenerator**: Generates URLs for Wikipedia dump files.
- **WikipediaDownloader**: Downloads .bz2 dump files using wget.
- **WikipediaIterator**: Parses Wikipedia XML dumps and extracts article content.
- **WikipediaExtractor**: Cleans Wikipedia markup and extracts meaningful text.

Additionally, an example script demonstrating the usage of the new stage is included, along with tests for each component to ensure functionality and reliability.

Documentation for the new stage is also provided to guide users in implementation and usage.

Signed-off-by: Abhinav Garg <[email protected]>

* merge from main

Signed-off-by: Praateek <[email protected]>

* move deps to text

Signed-off-by: Praateek <[email protected]>

* update dev

Signed-off-by: Praateek <[email protected]>

* update pyproject and test.yml

Signed-off-by: Praateek <[email protected]>

* remove cugraph extra pyproject

Signed-off-by: Praateek <[email protected]>

* move text to optional deps

Signed-off-by: Praateek <[email protected]>

* Refactor pyproject.toml: Remove unused dependencies and clean up dev section

Signed-off-by: Abhinav Garg <[email protected]>

* Remove unused Wikipedia example and related README documentation from the download text stages.

Signed-off-by: Abhinav Garg <[email protected]>

* Add method to fetch JSON dump data for Wikipedia and refactor dump date retrieval logic

- Introduced `_get_data_for_dump` method to handle fetching and parsing JSON dump data.
- Refactored logic in `_get_wikipedia_urls` to iterate through available dumps and check their status.
- Improved error handling for cases where dump data cannot be loaded or is not finished.

Signed-off-by: Abhinav Garg <[email protected]>

* Add README for custom download pipelines and remove Wikipedia stage documentation

- Introduced a new README.md file detailing the structure and implementation of custom download pipelines.
- Removed the outdated README.md for the Wikipedia download and extract stage to streamline documentation.

Signed-off-by: Abhinav Garg <[email protected]>

* Add num_workers_per_node method to DocumentDownloader and WikipediaDownloader

- Implemented num_workers_per_node method in DocumentDownloader to define the number of workers per node for downloading tasks.
- Overridden num_workers_per_node in WikipediaDownloader to return a fixed value of 1.
- Updated xenna_stage_spec method in DocumentDownloadStage to include the number of workers per node.

Signed-off-by: Abhinav Garg <[email protected]>

* Update WikipediaDownloader to use 2 workers and change logging level in WikipediaIterator

- Modified num_workers_per_node in WikipediaDownloader to return 2, allowing for increased parallelism during downloads.
- Changed logging from info to debug level in WikipediaIterator for extracted articles to reduce log verbosity.

Signed-off-by: Abhinav Garg <[email protected]>
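
The two commits above describe a per-downloader hook that tells the stage how many workers to run on each node. The following is a simplified sketch of that pattern, not the ray-curator source: the classes are stand-ins that reuse the names from the commit messages, `build_stage_spec` is a hypothetical helper, and treating `None` as "let the executor decide" is an assumption.

```python
from typing import Optional


class DocumentDownloader:
    """Simplified stand-in for the base downloader."""

    def num_workers_per_node(self) -> Optional[int]:
        # Assumption: None means the executor picks the worker count.
        return None


class WikipediaDownloader(DocumentDownloader):
    """Simplified stand-in that caps concurrent downloads per node."""

    def num_workers_per_node(self) -> Optional[int]:
        # The first commit returned 1; the follow-up commit raised it to 2.
        return 2


def build_stage_spec(downloader: DocumentDownloader) -> dict:
    """Hypothetical helper: forward the downloader's hint into a stage spec."""
    spec: dict = {}
    workers = downloader.num_workers_per_node()
    if workers is not None:
        spec["num_workers_per_node"] = workers
    return spec


if __name__ == "__main__":
    print(build_stage_spec(DocumentDownloader()))   # no cap requested
    print(build_stage_spec(WikipediaDownloader()))  # cap of 2 workers per node
```

Putting the limit on the downloader rather than on the stage lets each data source state its own constraint, such as a remote server that tolerates only a few parallel connections.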

---------

Signed-off-by: Praateek <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
Co-authored-by: Praateek <[email protected]>

* Fixing tests (#827)

* Refactor Wikipedia extraction and URL generation logic

- Removed redundant return statement in `WikipediaExtractor` class.
- Simplified status check for dump data in `WikipediaUrlGenerator` by directly accessing the dictionary keys.
- Updated logging level in tests to ensure accurate assertions on log calls.
- Enhanced test cases for URL generation to cover various dump statuses.

These changes improve code clarity and maintainability while ensuring robust error handling in the Wikipedia download and extraction process.

Signed-off-by: Abhinav Garg <[email protected]>

* Add mwparserfromhell dependency to pyproject.toml

- Included `mwparserfromhell==0.6.5` in the text dependencies section of `pyproject.toml` to support parsing Wikipedia markup.

This addition enhances the functionality of the project by ensuring the necessary tools for processing Wikipedia data are available.

Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>

---------

Signed-off-by: Abhinav Garg <[email protected]>
Signed-off-by: [Your Name] <[email protected]>

* Update ray version to 2.48 #839

* Re-enable CI/CD for Ray API branch (#840)

* CI/CD for Ray API branch

Signed-off-by: Sarah Yurick <[email protected]>

* add text dependencies

Signed-off-by: Sarah Yurick <[email protected]>

* only run cpu tests

Signed-off-by: Sarah Yurick <[email protected]>

* comment instead of delete

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>

* Ray Video Pipeline : Video Reader (#775)

* Add video io reader

* Add test

* Add VideoReaderStage to video reading pipeline and update VideoDownloadStage to accept VideoTask. Enhance video reading capabilities with new tests for VideoReaderStage.

Signed-off-by: Ao Tang <[email protected]>

* Update VideoDownloadStage to support verbose logging and modify video_read_example to include verbose argument.

Signed-off-by: Ao Tang <[email protected]>

* Update outputs for VideoDownloadStage and VideoReaderStage to include additional metadata fields.

Signed-off-by: Ao Tang <[email protected]>

* Update CI workflow to include video dependencies for testing

Signed-off-by: Ao Tang <[email protected]>

* Add tests for video tasks module

- Introduced a new test package for tasks with an initial test suite for the video tasks module, including tests for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes.
- Implemented various test cases to validate initialization, property calculations, metadata extraction, and size calculations.

This enhances the testing coverage for video-related functionalities in the ray-curator project.

Signed-off-by: Ao Tang <[email protected]>

* Enhance video tasks module with additional test cases

- Expanded the test suite for the video tasks module by adding new test cases for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes.
- Improved coverage for various functionalities including initialization, property calculations, and metadata extraction.

This update strengthens the reliability of video-related features in the ray-curator project.

Signed-off-by: Ao Tang <[email protected]>

* Update pyproject.toml to include a trailing comma for pynvml dependency

Signed-off-by: Ao Tang <[email protected]>

* Refactor video processing stages to introduce a composite VideoReaderDownloadStage

- Replaced separate VideoReaderStage and VideoDownloadStage with a new VideoReaderDownloadStage that combines both functionalities.
- Updated the video_read_example to utilize the new composite stage.
- Adjusted inputs and outputs for VideoDownloadStage to reflect changes in the pipeline.
- Added tests for the new VideoReaderDownloadStage to ensure proper functionality and integration.

This refactor simplifies the video reading and downloading process within the ray-curator framework.

Signed-off-by: Ao Tang <[email protected]>

---------

Signed-off-by: Ao Tang <[email protected]>

* chore: Add new trustees and vetters to the copy-pr-bot configuration (#841) (#842)

* chore: Add new trustees and vetters to the copy-pr-bot configuration



* chore: Remove empty line in copy-pr-bot configuration



* chore: Remove ryantwolf from additional trustees and vetters in copy-pr-bot configuration



---------

Signed-off-by: Ao Tang <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: Ao Tang <[email protected]>

* ci: Add community-bot (#846) (#849)

Signed-off-by: oliver könig <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Ray Video Reader Enhancement (#848)

* Refactor video reading stages: Rename VideoReaderStage to VideoListStage and update VideoReaderDownloadStage to use the new class. Adjust tests accordingly to reflect the changes in stage names and functionality.

Signed-off-by: Ao Tang <[email protected]>

* Rename test_video_reader to test_video_list

Signed-off-by: Ao Tang <[email protected]>

* Update VideoListStage name and corresponding tests to reflect new naming convention

- Changed the internal name of VideoListStage from "video_reader" to "video_list".
- Updated assertions in the test for VideoListStage to match the new name.
- Adjusted configuration in the VideoReaderDownloadStage to use "video_list" instead of "video_reader".

This ensures consistency across the codebase following the recent refactor.

Signed-off-by: Ao Tang <[email protected]>

* Update test assertions in VideoReaderDownloadStage to use "video_list" instead of "video_reader"

Signed-off-by: Ao Tang <[email protected]>

* Refactor video processing stages: Replace VideoDownloadStage with VideoReaderStage in VideoReaderDownloadStage. Update related tests to reflect the new structure and ensure consistency across the codebase.

Signed-off-by: Ao Tang <[email protected]>

* Enhance VideoListStage and VideoReaderStage documentation

Signed-off-by: Ao Tang <[email protected]>

* Refactor video reading pipeline: Introduce VideoLoadingStage as a composite stage that combines VideoListStage and VideoReaderStage.

Signed-off-by: Ao Tang <[email protected]>

* Remove SplitPipeTask from video module and update imports accordingly.

Signed-off-by: Ao Tang <[email protected]>

* Refactor video task imports: Update import statements in video_list, video_loading, video_reader, and related test files to use the new video module structure.

Signed-off-by: Ao Tang <[email protected]>

* ruff fix

Signed-off-by: Ao Tang <[email protected]>

* Implement FilePartitioningStage: Introduce a new stage for partitioning files into groups based on specified criteria, including a limit on the number of groups. Update VideoLoadingStage to utilize FilePartitioningStage instead of the deprecated VideoListStage. Refactor VideoReaderStage to accept FileGroupTask as input and adjust related tests to ensure functionality and correctness.

Signed-off-by: Ao Tang <[email protected]>
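
As an illustration of the partitioning described in the commit above, here is a minimal sketch of splitting a file list into groups with an optional cap on the number of groups. The function `partition_files` and its parameters `files_per_partition` and `limit` are hypothetical; the actual FilePartitioningStage may use different names and criteria.

```python
from typing import Optional


def partition_files(
    files: list[str],
    files_per_partition: int,
    limit: Optional[int] = None,
) -> list[list[str]]:
    """Split files into consecutive groups, stopping once `limit` groups exist."""
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be at least 1")
    groups: list[list[str]] = []
    for start in range(0, len(files), files_per_partition):
        # Check the limit before building the next group so no extra work is done.
        if limit is not None and len(groups) >= limit:
            break
        groups.append(files[start:start + files_per_partition])
    return groups


if __name__ == "__main__":
    videos = ["a.mp4", "b.mp4", "c.mp4", "d.mp4", "e.mp4"]
    print(partition_files(videos, files_per_partition=2))
    print(partition_files(videos, files_per_partition=2, limit=2))
```

With `limit=None` every file is assigned to a group; a small limit is convenient for smoke tests that should touch only part of a dataset.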

* Refactor video reading stages: Replace VideoLoadingStage with VideoReader as a composite stage that combines FilePartitioningStage and VideoReaderStage. Update related tests to ensure functionality and correctness. Remove deprecated VideoLoadingStage and its associated tests.

Signed-off-by: Ao Tang <[email protected]>

* Update video_limit type in VideoReader to support None: Changed the type of video_limit from int to int | None to allow for more flexible configuration. This enhances the usability of the VideoReader class.

Signed-off-by: Ao Tang <[email protected]>

* Refactor file partitioning limit check

Signed-off-by: Ao Tang <[email protected]>

* Remove redundant tests from TestVideoReader: Deleted tests for video limit values, verbose flag, file extensions, and files per partition configuration to streamline the test suite and focus on essential functionality.

Signed-off-by: Ao Tang <[email protected]>

---------

Signed-off-by: Ao Tang <[email protected]>

* Enhance FilePartitioningStage to enforce task limit check earlier in the process. (#867)

Signed-off-by: Ao Tang <[email protected]>

* Initialize and shutdown ray session in each executor (#844)

* Remove pynvml dependency from pyproject.toml (#872)

* docs: refactor all the things (#826) (#859)

* docs: refactor all the things



* remove auto api docs



* api docs to gitignore



* updated readme



* python linting fixes batch 1



* batch 2



* batch 3



* update



---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: L.B. <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>

* ci(fix): Use GITHUB_TOKEN for community bot (#853) (#854)

* ci(fix): Use GITHUB_TOKEN for community bot



* f



---------

Signed-off-by: oliver könig <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>

* update LLM PII redaction file - fix issue 828 (#868) (#871)

* update LLM PII redaction file - fix 828



* Fix ruff check LLM PII redaction file - fix 828



* update LLM PII redaction Enron-file - fix 828



* update LLM-PII redaction README - fix 828



* updated LLM PII redaction Enron-file - fix 828



* updated LLM PII redaction file - fix 828



* Update tutorials/curator-llm-pii/README.md




* removed typo from README file - fix 828



* updated LLM redaction tutorial - fix 828



* updated LLM redaction-Enron file - fix 828



* updated LLM redaction-Enron file - fix 828



* Update tutorials/curator-llm-pii/PII-LLM-modification-Enron.ipynb



* Update tutorials/curator-llm-pii/PII-LLM-modification-Enron.ipynb



---------

Signed-off-by: Adeola Adesoba <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: aadesoba-nv <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>

* [Tutorials] Lazy import GPU modules in the Llama Nemotron tutorial (#831) (#875)

Signed-off-by: Mehran Maghoumi <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: Mehran Maghoumi <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>

* docs: changelog update (#860) (#887)

* docs: changelog update



* formatting



* remove item



---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: L.B. <[email protected]>

* linkfixes (#865) (#882)

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: L.B. <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>

* docs: Fixing version switcher issues (#885) (#886)

Signed-off-by: Andrew Schilling <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: Andrew Schilling <[email protected]>

* [Ray] Download and extract ArXiv (#805)

* remove dask arxiv

Signed-off-by: Sarah Yurick <[email protected]>

* first pass for entire arxiv implementation

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

* fix circular import

Signed-off-by: Sarah Yurick <[email protected]>

* working module

Signed-off-by: Sarah Yurick <[email protected]>

* add downloader tests

Signed-off-by: Sarah Yurick <[email protected]>

* remove unused noqa

Signed-off-by: Sarah Yurick <[email protected]>

* add test_iterator

Signed-off-by: Sarah Yurick <[email protected]>

* add extractor tests

Signed-off-by: Sarah Yurick <[email protected]>

* fix failing download tests

Signed-off-by: Sarah Yurick <[email protected]>

* add test_stage

Signed-off-by: Sarah Yurick <[email protected]>

* sort

Signed-off-by: Sarah Yurick <[email protected]>

* add url generator tests

Signed-off-by: Sarah Yurick <[email protected]>

* remove noqa

Signed-off-by: Sarah Yurick <[email protected]>

* remove nemo_curator/download, outdated scripts, outdated examples

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>

* [Ray] Classifiers (#753)

* [Ray] Classifiers

Signed-off-by: Sarah Yurick <[email protected]>

* fix ruff

Signed-off-by: Sarah Yurick <[email protected]>

* add utils file

Signed-off-by: Sarah Yurick <[email protected]>

* commit quality classifier benchmark helpers

Signed-off-by: Sarah Yurick <[email protected]>

* use basictokenizer as cpu tokenizer, add crossfit config

Signed-off-by: Sarah Yurick <[email protected]>

* some ruff

Signed-off-by: Sarah Yurick <[email protected]>

* merge upstream

Signed-off-by: Praateek <[email protected]>

* use _name, remove gpu resources from labeler

Signed-off-by: Sarah Yurick <[email protected]>

* consolidate praateek's work with distributeddataclassifier for quality classifier

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

* add content type, domain, multilingual domain, and filter_by support

Signed-off-by: Sarah Yurick <[email protected]>

* support for fineweb, fineweb mixtral, and fineweb nemotron classifiers

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

* add prompt task complexity support

Signed-off-by: Sarah Yurick <[email protected]>

* remove noqa

Signed-off-by: Sarah Yurick <[email protected]>

* padding_size does not need to be exposed to user

Signed-off-by: Sarah Yurick <[email protected]>

* max_seq_length does not need to be exposed to the user, set default micro_batch_sizes

Signed-off-by: Sarah Yurick <[email protected]>

* add max_chars, edit docstring

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

* aegis functionality, start working on instruction data guard

Signed-off-by: Sarah Yurick <[email protected]>

* nit fixes

Signed-off-by: Sarah Yurick <[email protected]>

* add working pytests for all classifiers

Signed-off-by: Sarah Yurick <[email protected]>

* remove existing pytest file

Signed-off-by: Sarah Yurick <[email protected]>

* add more comments to tests

Signed-off-by: Sarah Yurick <[email protected]>

* address review, add mem conversation, add README

Signed-off-by: Sarah Yurick <[email protected]>

* move redundant test code

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

* model_inference_batch_size and format_name_with_suffix

Signed-off-by: Sarah Yurick <[email protected]>

* add missing hf_token usage, remove test file, restructure dirs and files

Signed-off-by: Sarah Yurick <[email protected]>

* delete old examples and scripts

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek <[email protected]>
Co-authored-by: Praateek <[email protected]>

* [RAY] Add ID Module (#876)

* Add id initial working IMP

Signed-off-by: Vibhu Jawa <[email protected]>

* working add_id

Signed-off-by: Vibhu Jawa <[email protected]>

* Add ID

Signed-off-by: Vibhu Jawa <[email protected]>

* Update ray-curator/ray_curator/tasks/tasks.py

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>

* Add prefix feature, overwrite, warnings

Signed-off-by: Vibhu Jawa <[email protected]>

* rename id_prefix to user_prefix

Signed-off-by: Vibhu Jawa <[email protected]>

* Add in test for tasks and fix task id

Signed-off-by: VibhuJawa <[email protected]>

---------

Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: VibhuJawa <[email protected]>
Co-authored-by: Copilot <[email protected]>

* Add video splitting pipeline with fixed stride extraction and transcoding Stage (#783)

* Add video splitting pipeline with fixed stride extraction and transcoding stages

- Introduced `video_split_clip_example.py` to demonstrate video splitting functionality.
- Added `ClipTranscodingStage` and `FixedStrideExtractorStage` for processing video clips.
- Implemented command-line arguments for configuring video processing parameters.
- Created utility functions for grouping iterables in `grouping.py`.
- Added unit tests for the new stages in `test_clip_transcoding_stage.py` and `test_fixed_stride_extractor_stage.py`.

Signed-off-by: Ao Tang <[email protected]>
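
The grouping utilities mentioned in the commit above are not shown on this page. As a rough illustration only, a helper that splits an iterable into fixed-size groups might look like the sketch below. The function name `split_by_chunk_size` and its signature are assumptions made for this example and may not match what `grouping.py` actually provides.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def split_by_chunk_size(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `chunk_size` items.

    Hypothetical helper for illustration; not taken from the PR.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(items)
    while True:
        # islice consumes up to chunk_size items from the shared iterator.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


groups = list(split_by_chunk_size(range(5), 2))
print(groups)
```

The last group is allowed to be shorter than `chunk_size`, which suits cases such as batching clips where the total count is not a multiple of the batch size.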

* Refactor video splitting pipeline to remove debug mode and enhance stage integration

Signed-off-by: Ao Tang <[email protected]>

* Add video limit argument to video split clip example

Signed-off-by: Ao Tang <[email protected]>

* Refactor video processing stages to enhance resource management and integrate new functionalities

- Replaced separate VideoReaderStage and VideoDownloadStage with a composite VideoReaderDownloadStage, streamlining the video reading and downloading process.
- Updated ClipTranscodingStage to improve GPU resource allocation and added detailed arguments for better configurability.
- Adjusted tests to reflect changes in resource management, ensuring accurate assertions on GPU usage.

These changes improve the clarity and efficiency of video processing within the ray-curator framework.

Signed-off-by: Ao Tang <[email protected]>

* Add mock GPU classes and enhance ClipTranscodingStage tests

- Introduced MockGpuInfo and MockGpuResources classes to simulate GPU information and resources for testing.
- Updated test_resources_gpu_encoder and test_resources_hwaccel_enabled methods to utilize mocks, ensuring accurate resource assertions without dependency on actual GPU hardware.
- Enhanced test_different_encoder_configurations to validate resource requirements for various encoder configurations, including GPU settings.

These changes improve the robustness of the ClipTranscodingStage tests by isolating them from hardware dependencies, facilitating easier testing and validation.

Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Ao Tang <[email protected]>

* Remove deprecated GPU resource tests from ClipTranscodingStage

Signed-off-by: Ao Tang <[email protected]>

* Remove unused test for processing in debug mode from ClipTranscodingStage tests

Signed-off-by: Ao Tang <[email protected]>

* Add unit tests for grouping utilities in the ray_curator.utils module

Signed-off-by: Ao Tang <[email protected]>

* Enhance video processing stages with ray stage specifications

- Added `ray_stage_spec` method to `ClipTranscodingStage`, `VideoDownloadStage`, and `VideoReaderStage` to define stage characteristics for Ray integration.
- Updated input and output methods in `ClipTranscodingStage` to include additional input parameters.
- Modified `SplitPipeTask` to return properties from `data` instead of `video`, ensuring consistency in task data handling.
- Added unit tests to verify the correctness of the new `ray_stage_spec` implementations.

These changes improve the integration of video processing stages with Ray's architecture and enhance test coverage for the new functionalities.

Signed-off-by: Ao Tang <[email protected]>

* Refactor video processing imports and update pipeline stages

Signed-off-by: Ao Tang <[email protected]>

* Remove unused `IS_ACTOR_STAGE` key from `ray_stage_spec` in `ClipTranscodingStage` and clean up commented-out code. This simplifies the stage specification and prepares for future enhancements.

Signed-off-by: Ao Tang <[email protected]>

* Remove redundant check for video source bytes in ClipTranscodingStage. This simplifies the process method by eliminating unnecessary error handling when source bytes are not available.

Signed-off-by: Ao Tang <[email protected]>

* Refactor ClipTranscodingStage to use a class variable for the stage name and implement post-initialization resource setup. Added error handling for None source bytes in the process method. Updated tests to remove redundant checks and ensure proper functionality.

Signed-off-by: Ao Tang <[email protected]>

* Remove unnecessary error handling for None source bytes in ClipTranscodingStage's process method.

Signed-off-by: Ao Tang <[email protected]>

* remove redundant test

Signed-off-by: Ao Tang <[email protected]>

* precommit fix

Signed-off-by: Ao Tang <[email protected]>

---------

Signed-off-by: Ao Tang <[email protected]>
Signed-off-by: [Your Name] <[email protected]>

* docs: ray curator api autodoc updates (#896)

Signed-off-by: Lawrence Lane <[email protected]>

* Move all text stages to `stages/text/` (#891)

* first pass

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

* fix tests

Signed-off-by: Sarah Yurick <[email protected]>

* fix after merge

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>

* Add Ray Actor Pool Executor (#893)

* Initial Minhash implementation on Ray (#837)

* Initial minhash logic without Stage API

Signed-off-by: Ayush Dattagupta <[email protected]>

* update args and support passing in pre-batched files

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove old minhash impl

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add Class to do GPU IO for dedup

Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>

* Add ID Generator class

Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>

* Move MinHashActor to a GPUMinHash class and create a GPUMinHash Processing stage

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove minhash method in favor of minhashProcessingStage

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add mkdir logic to the writer

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add file partitioning stage to __init__.py

Signed-off-by: Ayush Dattagupta <[email protected]>

* Update cuda12x extra to deduplication. Bump pynvml to avoid conflicts

Signed-off-by: Ayush Dattagupta <[email protected]>

* Update stage name

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add initial minhash tests

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add rmm pool arg to MinhashStage, default to false in the parent actor

Signed-off-by: Ayush Dattagupta <[email protected]>

* Move IO and ID generator logic to the Stage rather than the parent GPUMinHash class

Signed-off-by: Ayush Dattagupta <[email protected]>

* Update GPUMinHash Tests

Signed-off-by: Ayush Dattagupta <[email protected]>

* Standardize Id generator actor name

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add GPUMinHashStage tests

Signed-off-by: Ayush Dattagupta <[email protected]>

* Rename GPUMinHashStage to MinHashStage

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add marker for GPU tests

Signed-off-by: Ayush Dattagupta <[email protected]>

* update cpu ci workflow to skip GPU tests

Signed-off-by: Ayush Dattagupta <[email protected]>

* Skip tests if imports fail

Signed-off-by: Ayush Dattagupta <[email protected]>

* move cudf import checks before stage imports

Signed-off-by: Ayush Dattagupta <[email protected]>

* Use storage options from read_kwargs directly

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Praateek Mahajan <[email protected]>

* docs: curate text load data content updates for ray (#895)

* docs: load text data article updates

Signed-off-by: Lawrence Lane <[email protected]>

* remove "ray-curator" for curator

Signed-off-by: Lawrence Lane <[email protected]>

* simplify naming

Signed-off-by: Lawrence Lane <[email protected]>

* imports

Signed-off-by: Lawrence Lane <[email protected]>

* imports

Signed-off-by: Lawrence Lane <[email protected]>

* imports

Signed-off-by: Lawrence Lane <[email protected]>

* linkfix

Signed-off-by: Lawrence Lane <[email protected]>

* read through

Signed-off-by: Lawrence Lane <[email protected]>

* simplification

Signed-off-by: Lawrence Lane <[email protected]>

* remove placeholder concept details

Signed-off-by: Lawrence Lane <[email protected]>

* pipeline verbiage

Signed-off-by: Lawrence Lane <[email protected]>

* initial feedback round

Signed-off-by: Lawrence Lane <[email protected]>

* reduce admonition noise

Signed-off-by: Lawrence Lane <[email protected]>

* minor updates

Signed-off-by: Lawrence Lane <[email protected]>

* minor updates

Signed-off-by: Lawrence Lane <[email protected]>

* feedback

Signed-off-by: Lawrence Lane <[email protected]>

---------

Signed-off-by: Lawrence Lane <[email protected]>

* Adding function decorator for very simple functions to be converted into stages (#835)

* Revert 'Add utility decorators for ProcessingStage creation' (empty cherry-pick)

Signed-off-by: Abhinav Garg <[email protected]>

* Add utility decorators for ProcessingStage creation

This commit introduces a new module containing the `processing_stage` decorator, which allows users to easily convert plain Python functions into `ProcessingStage` instances. The decorator supports configuration options such as stage name, resource allocation, and batch size. Additionally, unit tests have been added to validate the functionality of the decorator and ensure proper handling of task processing.

Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
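
To make the idea in the commit above concrete, here is a minimal, self-contained sketch of a decorator that wraps a plain function in a stage-like object with a name, a resources mapping, and a batch size. Everything here (`SketchStage`, `sketch_processing_stage`, the `cpus` key) is a hypothetical stand-in written for this example. It is not the library's `ProcessingStage` or the `processing_stage` decorator, whose real signatures are not shown on this page.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class SketchStage:
    """Hypothetical stand-in for a stage; not the library's ProcessingStage."""

    fn: Callable[[Any], Any]
    name: str
    resources: Dict[str, float] = field(default_factory=dict)
    batch_size: int = 1

    def process(self, task: Any) -> Any:
        # Delegate a single task to the wrapped function.
        return self.fn(task)

    def process_batch(self, tasks: List[Any]) -> List[Any]:
        # Apply process() to each task, preserving order.
        return [self.process(task) for task in tasks]


def sketch_processing_stage(
    *, name: str, resources: Optional[Dict[str, float]] = None, batch_size: int = 1
) -> Callable[[Callable[[Any], Any]], SketchStage]:
    """Return a decorator that turns a function into a SketchStage instance."""

    def decorator(fn: Callable[[Any], Any]) -> SketchStage:
        return SketchStage(fn=fn, name=name, resources=resources or {}, batch_size=batch_size)

    return decorator


@sketch_processing_stage(name="uppercase", resources={"cpus": 1.0})
def uppercase(task: str) -> str:
    return task.upper()


print(uppercase.name, uppercase.process_batch(["a", "b"]))
```

Note that after decoration the name `uppercase` refers to a stage instance rather than a function, which is the behavior the commit message describes.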

* test commit

Signed-off-by: Sarah Yurick <[email protected]>

* add test_stage_registry, other nits

Signed-off-by: Sarah Yurick <[email protected]>

* overwrite stage registry

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

* propagate _metadata and _stage_perf

Signed-off-by: Sarah Yurick <[email protected]>

* accept resources dict

Signed-off-by: Sarah Yurick <[email protected]>

* reformat

Signed-off-by: Sarah Yurick <[email protected]>

* add process_batch tests

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

* remove todo

Signed-off-by: Sarah Yurick <[email protected]>

* add pipeline example

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Abhinav Garg <[email protected]>
Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>

* Add Text Embedding Model (#899)

* Add Ray curator dockerfile and enable testing (#879)

* Add Ray curator dockerfile and enable testing

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Fix indentation issues

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Update dockerfile and add cuda12x

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Update coverage paths

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Update gpu tests runner

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Add gpu testing scripts and update

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Cd into ray-curator for coverage

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Create dev layer and install dev packages

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Update coverage paths

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Install opencv

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Address syntax error

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Update cv2 ubuntu dependencies

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Fix typo

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Add cudf placeholder test

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Space after import

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Add gpu_only_import

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Remove import utils for now

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Fix spacing

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Skip gpu tests for cpu

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Update unit test coverage path

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Skip gpu coverage report for now

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Use pixi

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Fix dockerfile syntax

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Try ffmpeg only

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Add extra index url for pixi

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Address typos

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Install git

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Update entrypoint

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Fix typo

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Use env var for dev install

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Resolve syntax error

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Fix env var and verbose install

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Update pixi entrypoint and pyproject install

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Trigger entrypoint before tests

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Update test entrypoint

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Source entrypoint

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Update list of dev install pixi

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Add back cuda12x and index-strategy

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Turn off verbose install

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Skip gpu coverage for now

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Support arm

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Set timeout for dockerbuild and update pyproject

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Remove retry github config

Signed-off-by: Dong Hyuk Chang <[email protected]>

---------

Signed-off-by: Dong Hyuk Chang <[email protected]>

* ci: Install ray-curator module (#905)

* Add ray curator as pypi dependency

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Add package info and test import

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Update pyproject.toml

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Copy src for pixi install

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Update test import

Signed-off-by: Dong Hyuk Chang <[email protected]>

* Revert temp test

Signed-off-by: Dong Hyuk Chang <[email protected]>

---------

Signed-off-by: Dong Hyuk Chang <[email protected]>

* [REVIEW] Add modifiers to ray curator (#898)

* Initial WIP modifier workflows

Signed-off-by: VibhuJawa <[email protected]>

* Moved tests and also moved modifiers to text sub module

Signed-off-by: VibhuJawa <[email protected]>

* Add tests for the meta class and modifier and improve docstring

Signed-off-by: VibhuJawa <[email protected]>

* Update ray-curator/ray_curator/stages/text/modifiers/slicer.py

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>

* Update ray-curator/ray_curator/stages/text/modifiers/line_remover.py

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>

* Delete files from dask dir and remove optional download fields

Signed-off-by: Vibhu Jawa <[email protected]>

* Add pytest as requested

Signed-off-by: Vibhu Jawa <[email protected]>

---------

Signed-off-by: VibhuJawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Co-authored-by: Copilot <[email protected]>

* Allow users to fuse multiple `DocumentFilter` objects into a single `ScoreFilter` stage (#850)

* Allow users to fuse multiple `DocumentFilter` objects into a single `ScoreFilter` stage

Signed-off-by: Sarah Yurick <[email protected]>
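
The general idea of fusing several filters into one pass can be sketched as follows. This is a simplified illustration under stated assumptions: each filter is modeled as a `(score_fn, keep_fn)` pair and `fuse_filters` is a hypothetical helper. The real `DocumentFilter` and `ScoreFilter` interfaces are not shown on this page and may differ.

```python
from typing import Callable, List, Sequence, Tuple

ScoreFn = Callable[[str], float]
KeepFn = Callable[[float], bool]


def fuse_filters(filters: Sequence[Tuple[ScoreFn, KeepFn]]) -> Callable[[str], bool]:
    """Combine (score, keep) pairs into one predicate applied in a single pass.

    Hypothetical helper for illustration; a document is kept only if every
    filter keeps it, and evaluation stops at the first rejection.
    """

    def keep(document: str) -> bool:
        return all(keep_fn(score_fn(document)) for score_fn, keep_fn in filters)

    return keep


# Two toy filters: at least two words, and at most 100 characters.
word_count = (lambda doc: float(len(doc.split())), lambda score: score >= 2)
max_length = (lambda doc: float(len(doc)), lambda score: score <= 100)

keep_document = fuse_filters([word_count, max_length])
documents: List[str] = ["a b c", "a", ""]
print([doc for doc in documents if keep_document(doc)])
```

Running all filters inside one predicate means each document is visited once, rather than once per filter stage.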

* remove old example and scripts file

Signed-off-by: Sarah Yurick <[email protected]>

* add suggestions

Signed-off-by: Sarah Yurick <[email protected]>

* add init

Signed-off-by: Sarah Yurick <[email protected]>

* fix csv path

Signed-off-by: Sarah Yurick <[email protected]>

* clearer error messages

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>

* Fix exception when blocksize is set (#892) (#904)

If blocksize is set instead of files_per_partition, this line raised an exception.

Signed-off-by: Yurii Paniv <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: Yurii Paniv <[email protected]>

* docs: curate text - process data - language dir (#900)

* docs: curate text - process data - language dir

Signed-off-by: Lawrence Lane <[email protected]>

* remove extra content

Signed-off-by: Lawrence Lane <[email protected]>

* another pass

Signed-off-by: Lawrence Lane <[email protected]>

* remove pool

Signed-off-by: Lawrence Lane <[email protected]>

* formatting

Signed-off-by: Lawrence Lane <[email protected]>

* feedback

Signed-off-by: Lawrence Lane <[email protected]>

* clarification and alternative as pipeline stage. removed extra section

Signed-off-by: Lawrence Lane <[email protected]>

* Update docs/curate-text/process-data/language-management/language.md

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: L.B. <[email protected]>

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>

* docs: add README for experimental scripts directory (#910)

Signed-off-by: Abhinav Garg <[email protected]>

* Add IdGenerator to JsonlReader + IdGenerator tests / write_to_disk / from_disk (#907)

* Initial buckets to edges stage (#909)

* Initial buckets to edges stage

Signed-off-by: Ayush Dattagupta <[email protected]>

* re-add file utils from lsh pr

Signed-off-by: Ayush Dattagupta <[email protected]>

* Handle directory cleanup/creation logic in the stage

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add tests for buckets to edgelist

Signed-off-by: Ayush Dattagupta <[email protected]>

* Rename doc_id_column to doc_id_field, update storage_options to read/write_kwargs instead

Signed-off-by: Ayush Dattagupta <[email protected]>

* Fix indentation

Signed-off-by: Ayush Dattagupta <[email protected]>

* Fix kwargs args

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add copyright headers

Signed-off-by: Ayush Dattagupta <[email protected]>

* remove previous curator impl

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>

* [SemDedup] Add KMeans (#912)

* S3 Client (#903)

* WIP

Signed-off-by: Ao Tang <[email protected]>

* WIP

Signed-off-by: Ao Tang <[email protected]>

* Refactor S3 client configuration and enhance video reading logging

- Updated S3_PROFILE_PATH to use an environment variable for better flexibility in specifying the S3 credentials file location.
- Improved logging in VideoReaderStage to provide more informative messages about video byte downloads, including the size of the downloaded video.

Signed-off-by: Ao Tang <[email protected]>
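
Reading a credentials path from the environment with a fallback, as the commit above describes for `S3_PROFILE_PATH`, is commonly done along these lines. This is a sketch only: the environment variable name, the default location, and the `resolve_s3_profile_path` helper are all assumptions made for this example, since the commit message does not state them.

```python
import os
from pathlib import Path

# Assumed default; the actual default used by the PR is not stated.
DEFAULT_S3_PROFILE_PATH = Path.home() / ".aws" / "credentials"


def resolve_s3_profile_path(env_var: str = "S3_PROFILE_PATH") -> Path:
    """Return the credentials path from `env_var`, or the default if unset or empty.

    Hypothetical helper; the variable name is an assumption.
    """
    raw = os.environ.get(env_var)
    return Path(raw).expanduser() if raw else DEFAULT_S3_PROFILE_PATH


print(resolve_s3_profile_path())
```

Resolving the path inside a function, instead of once at import time, lets tests override the variable without reloading the module.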

* Enhance VideoReader functionality with S3 support and improve validation checks

- Updated VideoReader to conditionally use ClientPartitioningStage for S3 paths and FilePartitioningStage for local paths, improving flexibility in handling video sources.
- Enhanced validation in VideoTask to check for the existence of input videos when provided as pathlib.Path, ensuring better error handling.
- Removed unused methods from S3Client to streamline the codebase.

Signed-off-by: Ao Tang <[email protected]>
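
The conditional choice between the two partitioning stages can be pictured with the small sketch below. It returns the stage name as a string so that it stays self-contained. The `pick_partitioning_stage` function and the scheme check are assumptions for illustration, not the logic in `VideoReader`.

```python
from urllib.parse import urlparse


def pick_partitioning_stage(input_path: str) -> str:
    """Return the name of the stage to use for `input_path`.

    Hypothetical helper: S3 URLs go to the client-based stage, and anything
    else is treated as a local path.
    """
    scheme = urlparse(input_path).scheme.lower()
    return "ClientPartitioningStage" if scheme == "s3" else "FilePartitioningStage"


print(pick_partitioning_stage("s3://bucket/videos/"))
print(pick_partitioning_stage("/data/videos/"))
```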

* Remove redundant exception raising in VideoReaderStage to improve error handling during video reading. This change prevents unnecessary propagation of exceptions while still logging errors effectively.

Signed-off-by: Ao Tang <[email protected]>

* Refactor ClientPartitioningStage and enhance S3 client configuration

- Rearranged import statements for better organization and readability in `client_partitioning.py` and `video_reader.py`.
- Updated `S3ClientConfig` and `BaseClientConfig` to use `@dataclass` for improved data handling.
- Added comprehensive unit tests for `ClientPartitioningStage`, covering initialization, setup, and processing methods with various scenarios.
- Improved error handling and validation in the `_read_list_json` function.

This refactor enhances the maintainability and test coverage of the codebase, ensuring better functionality and reliability in handling client partitioning tasks.

Signed-off-by: Ao Tang <[email protected]>

* Remove SPDX license comments from S3 client, storage client, and storage utilities files to streamline code readability. This change simplifies the file headers while retaining essential module documentation.

Signed-off-by: Ao Tang <[email protected]>

* Use Fsspec instead of boto3

Signed-off-by: Ao Tang <[email protected]>

* Refactor file handling and enhance video reading capabilities

- Introduced a new `FSPath` class in `client_utils.py` for improved file operations with fsspec.
- Updated `ClientPartitioningStage` and `VideoReaderStage` to utilize the new `FSPath` class for better handling of file paths.
- Removed unused imports and streamlined code in `client_partitioning.py` and `video_reader.py`.
- Enhanced error handling in `VideoReaderStage` to support various input types for video sources.

This refactor improves the maintainability and flexibility of file handling in the video processing pipeline.

Signed-off-by: Ao Tang <[email protected]>

* move client_partitioning.py

Signed-off-by: Ao Tang <[email protected]>

* ruff check

Signed-off-by: Ao Tang <[email protected]>

* Fix broken tests

Signed-off-by: Ao Tang <[email protected]>

* Add `as_posix` method to `FSPath` class and implement comprehensive test suite

- Introduced `as_posix` method in the `FSPath` class to convert filesystem paths to POSIX format, accommodating various protocols.
- Created a new test suite for `FSPath` in `test_client_utils.py`, covering initialization, string representation, file operations, and edge cases.
- Enhanced tests for `get_bytes_cat_ranges` to handle different file sizes and error scenarios.

This update improves the functionality and test coverage of the `FSPath` class, ensuring robust file handling across different filesystems.

Signed-off-by: Ao Tang <[email protected]>

* Remove logging of downloaded video size in VideoReaderStage to streamline error handling and reduce unnecessary output.

Signed-off-by: Ao Tang <[email protected]>

* Refactor video reading and splitting pipeline examples for improved readability

- Reformatted the `create_video_reading_pipeline` and `create_video_splitting_pipeline` functions to enhance code clarity by aligning parameters and removing unnecessary line breaks.
- Updated the `VideoReader` and `ClipTranscodingStage` instantiation to follow a consistent style.
- Made minor adjustments in the `ClientPartitioningStage` to ensure consistent formatting and improved readability.

These changes contribute to a cleaner and more maintainable codebase for video processing pipelines.

Signed-off-by: Ao Tang <[email protected]>

---------

Signed-off-by: Ao Tang <[email protected]>

* Add ClipWriterStage to video splitting pipeline Clean (#897)

* WIP

Signed-off-by: Ao Tang <[email protected]>

* WIP

Signed-off-by: Ao Tang <[email protected]>

* Update ClipWriterStage to clarify local storage usage

Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Ao Tang <[email protected]>

* Enhance video clip processing with new GenericClipWriterStage and required output path argument

- Introduced a new GenericClipWriterStage for writing video clips and their metadata, consolidating the writing process and improving resource management.
- Updated the video_split_clip_example to require an output clip path, ensuring that users specify where to save the generated clips.
- The new stage supports parallel writing of clips and metadata, enhancing performance and flexibility in video processing workflows.

Signed-off-by: Ao Tang <[email protected]>

* Enhance ClipWriterStage with additional metadata handling

- Improved `ClipWriterStage` to support writing additional metadata during video processing.
- Updated related utility functions to accommodate new metadata fields.
- Refined unit tests to cover the new functionality and ensure reliability.

Signed-off-by: Ao Tang <[email protected]>

* Add ClipWriterStage to video splitting pipeline

- Introduced `ClipWriterStage` for writing clips and metadata during video processing.
- Updated `video_split_clip_example.py` to include the new stage, allowing for clip writing functionality.
- Enhanced command-line argument parsing for output clip path.
- Added utility functions for managing storage paths and writing data in various formats.
- Implemented unit tests for `ClipWriterStage` to ensure functionality and reliability.

Signed-off-by: Ao Tang <[email protected]>

* ruff fix

Signed-off-by: Ao Tang <[email protected]>

* ruff format

Signed-off-by: Ao Tang <[email protected]>

* Refactor S3 client configuration and enhance video reading logging

- Updated S3_PROFILE_PATH to use an environment variable for better flexibility in specifying the S3 credentials file location.
- Improved logging in VideoReaderStage to provide more informative messages about video byte downloads, including the size of the downloaded video.

Signed-off-by: Ao Tang <[email protected]>

* Enhance VideoReader functionality with S3 support and improve validation checks

- Updated VideoReader to conditionally use ClientPartitioningStage for S3 paths and FilePartitioningStage for local paths, improving flexibility in handling video sources.
- Enhanced validation in VideoTask to check for the existence of input videos when provided as pathlib.Path, ensuring better error handling.
- Removed unused methods from S3Client to streamline the codebase.

Signed-off-by: Ao Tang <[email protected]>

* Remove redundant exception raising in VideoReaderStage to improve error handling during video reading. This change prevents unnecessary propagation of exceptions while still logging errors effectively.

Signed-off-by: Ao Tang <[email protected]>

* Refactor ClientPartitioningStage and enhance S3 client configuration

- Rearranged import statements for better organization and readability in `client_partitioning.py` and `video_reader.py`.
- Updated `S3ClientConfig` and `BaseClientConfig` to use `@dataclass` for improved data handling.
- Added comprehensive unit tests for `ClientPartitioningStage`, covering initialization, setup, and processing methods with various scenarios.
- Improved error handling and validation in the `_read_list_json` function.

This refactor enhances the maintainability and test coverage of the codebase, ensuring better functionality and reliability in handling client partitioning tasks.

Signed-off-by: Ao Tang <[email protected]>

* Remove SPDX license comments from S3 client, storage client, and storage utilities files to streamline code readability. This change simplifies the file headers while retaining essential module documentation.

Signed-off-by: Ao Tang <[email protected]>

* Use Fsspec instead of boto3

Signed-off-by: Ao Tang <[email protected]>

* Refactor file handling and enhance video reading capabilities

- Introduced a new `FSPath` class in `client_utils.py` for improved file operations with fsspec.
- Updated `ClientPartitioningStage` and `VideoReaderStage` to utilize the new `FSPath` class for better handling of file paths.
- Removed unused imports and streamlined code in `client_partitioning.py` and `video_reader.py`.
- Enhanced error handling in `VideoReaderStage` to support various input types for video sources.

This refactor improves the maintainability and flexibility of file handling in the video processing pipeline.

Signed-off-by: Ao Tang <[email protected]>

* move client_partitioning.py

Signed-off-by: Ao Tang <[email protected]>

* ruff check

Signed-off-by: Ao Tang <[email protected]>

* Fix broken tests

Signed-off-by: Ao Tang <[email protected]>

* Remove unused `generic_clip_writer.py`, `storage_client.py`, and related utility files; refactor `writer_utils.py` to eliminate storage client dependencies and streamline file writing functions. Update tests to reflect these changes and ensure compatibility with the new structure.

Signed-off-by: Ao Tang <[email protected]>

* Remove test file `test_client_utils.py` for the `FSPath` class, cleaning up unused test cases and ensuring the test suite reflects the current codebase structure.

Signed-off-by: Ao Tang <[email protected]>

* Refactor ClipWriterStage to remove storage client dependencies and streamline file writing methods. Updated method signatures to eliminate storage client parameters, enhancing code clarity and maintainability.

Signed-off-by: Ao Tang <[email protected]>

* Remove unused import of ClipWriterStage in video_split_clip_example.py to streamline the code and improve clarity.

Signed-off-by: Abhinav Garg <[email protected]>

* Remove unused `input_s3_profile_name` attribute from `VideoReader` class to streamline the code and improve clarity.

Signed-off-by: Ao Tang <[email protected]>

* Remove unused `input_s3_profile_name` attribute from `VideoReader` class to streamline the code and improve clarity.

Signed-off-by: Abhinav Garg <[email protected]>

* Refactor video metadata writing in ClipWriterStage by removing an unnecessary blank line for improved code clarity. Update get_full_path function signature for consistency in type hinting. Enhance test case formatting in TestVideoReaderStage to improve readability and maintainability.

Signed-off-by: Ao Tang <[email protected]>

---------

Signed-off-by: Ao Tang <[email protected]>
Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
Co-authored-by: Abhinav Garg <[email protected]>

* Add motion filtering stages to video splitting pipeline (#797)

* Add video io reader

* Add test

* Add video splitting pipeline with fixed stride extraction and transcoding stages

- Introduced `video_split_clip_example.py` to demonstrate video splitting functionality.
- Added `ClipTranscodingStage` and `FixedStrideExtractorStage` for processing video clips.
- Implemented command-line arguments for configuring…
* delete old config, examples, nemo_curator, and tests directories

Signed-off-by: Sarah Yurick <[email protected]>

* move ray-curator contents to top level directory

Signed-off-by: Sarah Yurick <[email protected]>

* edit all files in backends/

Signed-off-by: Sarah Yurick <[email protected]>

* all core, examples, and metrics files

Signed-off-by: Sarah Yurick <[email protected]>

* all files under models, pipeline, tasks, utils directories

Signed-off-by: Sarah Yurick <[email protected]>

* all stages except text/ and video/

Signed-off-by: Sarah Yurick <[email protected]>

* all text stages

Signed-off-by: Sarah Yurick <[email protected]>

* all video stages

Signed-off-by: Sarah Yurick <[email protected]>

* move examples dir to top level, rename ray_curator to nemo_curator

Signed-off-by: Sarah Yurick <[email protected]>

* add tests for backends, core, models, and pipelines

Signed-off-by: Sarah Yurick <[email protected]>

* remaining non-stage tests

Signed-off-by: Sarah Yurick <[email protected]>

* some stage tests

Signed-off-by: Sarah Yurick <[email protected]>

* remaining tests

Signed-off-by: Sarah Yurick <[email protected]>

* delete outdated tutorials

Signed-off-by: Sarah Yurick <[email protected]>

* edit github and docker files

Signed-off-by: Sarah Yurick <[email protected]>

* delete dask conftest

Signed-off-by: Sarah Yurick <[email protected]>

* edit pyproject and fix some ruff

Signed-off-by: Sarah Yurick <[email protected]>

* more ruff

Signed-off-by: Sarah Yurick <[email protected]>

* fix more ruff

Signed-off-by: Sarah Yurick <[email protected]>

* playing catch up

Signed-off-by: Sarah Yurick <[email protected]>

* edit docs

Signed-off-by: Sarah Yurick <[email protected]>

* comment out cicd-gpu-tests

Signed-off-by: Sarah Yurick <[email protected]>

* remove resources as property usages

Signed-off-by: Sarah Yurick <[email protected]>

* edit test

Signed-off-by: Sarah Yurick <[email protected]>

* update fuzzy dedup from main

Signed-off-by: Sarah Yurick <[email protected]>

* edit from image pr

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek <[email protected]>
…kmahajan/NeMo-Curator into praateek/nc-benchmarking-scripts

Signed-off-by: Praateek <[email protected]>

3 participants