Skip to content

Conversation

@ronjer30
Copy link
Contributor

Description

Adds a pipeline recipe ray-curator/ray_curator/examples/image/README.md to generate captions for images using VLM NIMs. It also includes 2 stages added to ray_curator/stages to support reading images from files and URLs and a VLM captioning stage.

Usage

See ray-curator/ray_curator/examples/image/README.md for examples of usage

## Checklist
- [X] I am familiar with the [Contributing Guide](https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md).
- [X] New or Existing tests cover these changes.
- [ ] The documentation is up to date with these changes.

aadesoba-nv and others added 30 commits June 13, 2025 12:13
…s/curator-llm-pii` folder (NVIDIA-NeMo#688)

* your message

* adding example notebook

Signed-off-by: Adeola <[email protected]>

* adding notebook with open source data

Signed-off-by: Adeola <[email protected]>

* Update tutorials/curator-llm-pii/README.md

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* updating example notebook

Signed-off-by: Adeola <[email protected]>

* updating readme

Signed-off-by: Adeola <[email protected]>

* updating readme

Signed-off-by: Adeola <[email protected]>

* Apply suggestions from code review

Signed-off-by: Sarah Yurick <[email protected]>

* uploading Enron notebook

Signed-off-by: Adeola Adesoba <[email protected]>

* updated enron notebook

Signed-off-by: Adeola Adesoba <[email protected]>

* updated PII-LLM notebook

Signed-off-by: Adeola Adesoba <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>

* updated PII-LLM-Enron notebook

Signed-off-by: Adeola Adesoba <[email protected]>

* updated PII-LLM-tutorial notebook

Signed-off-by: Adeola Adesoba <[email protected]>

* removed outdated PII-LLM-tutorial notebook

Signed-off-by: Adeola Adesoba <[email protected]>

* removed outdated PII-LLM-Enron notebook

Signed-off-by: Adeola Adesoba <[email protected]>

* Update README.md

Signed-off-by: Sarah Yurick <[email protected]>

* recommitting overwritten PII-LLM-Enron notebook

Signed-off-by: Adeola Adesoba <[email protected]>

* recommitting overwritten PII-LLM-tutorial notebook

Signed-off-by: Adeola Adesoba <[email protected]>

* Updated README.md

Signed-off-by: Adeola Adesoba <[email protected]>

* Apply suggestions from code review

Signed-off-by: Sarah Yurick <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Adeola <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Adeola Adesoba <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Minor change to test CI

Signed-off-by: Charlie Truong <[email protected]>

* Remove example test

Signed-off-by: Charlie Truong <[email protected]>

* Minor change

Signed-off-by: Charlie Truong <[email protected]>

* Run on NeMo GPU runners

Signed-off-by: Charlie Truong <[email protected]>

---------

Signed-off-by: Charlie Truong <[email protected]>
* [🤖]: Howdy folks, let's bump NeMo Curator to `0.10.0rc0.dev0` !

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Fix lint error in package_info

Signed-off-by: Charlie Truong <[email protected]>

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Charlie Truong <[email protected]>
Co-authored-by: chtruong814 <[email protected]>
* ci: Run gputests on main

Signed-off-by: oliver könig <[email protected]>

* f

Signed-off-by: oliver könig <[email protected]>

---------

Signed-off-by: oliver könig <[email protected]>
NVIDIA-NeMo#752)

* Fix failing Trafilatura tests for extracting Chinese and Japanese text

Signed-off-by: Sarah Yurick <[email protected]>

* force cpu tests to run

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
…eMo#749)

* Add placeholder codecov check for when tests do not run

Signed-off-by: Charlie Truong <[email protected]>

* Force tests not to run

Signed-off-by: Charlie Truong <[email protected]>

* Revert "Force tests not to run"

This reverts commit a3073ba.

Signed-off-by: Charlie Truong <[email protected]>

* Add gpu-ci-result step

Signed-off-by: Charlie Truong <[email protected]>

* Skip currently failing unit tests

Signed-off-by: Charlie Truong <[email protected]>

* Run cpu tests on all PRs

Signed-off-by: Charlie Truong <[email protected]>

* Revert "Skip currently failing unit tests"

This reverts commit ccc88ef.

Signed-off-by: Charlie Truong <[email protected]>

* Fix gpu-ci-results

Signed-off-by: Charlie Truong <[email protected]>

---------

Signed-off-by: Charlie Truong <[email protected]>
* update llama nemotron docs

Signed-off-by: Sarah Yurick <[email protected]>

* add cd cmd

Signed-off-by: Sarah Yurick <[email protected]>

* update blocksize and n workers

Signed-off-by: Sarah Yurick <[email protected]>

* revert n workers

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
…rsal (NVIDIA-NeMo#769)

* Implement safe extraction methods for tar files to prevent path traversal attacks in arxiv.py

Signed-off-by: Abhinav Garg <[email protected]>

* Adding tests

Signed-off-by: Abhinav Garg <[email protected]>

---------

Signed-off-by: Abhinav Garg <[email protected]>
…VIDIA-NeMo#774)

* Remove skip markers from classifier tests to enable execution on GPU

Signed-off-by: Abhinav Garg <[email protected]>

* Remove skip markers from additional classifier tests to enable GPU execution

Signed-off-by: Abhinav Garg <[email protected]>

* Skip classifier tests for GPU execution in test_classifiers.py

Signed-off-by: Abhinav Garg <[email protected]>

---------

Signed-off-by: Abhinav Garg <[email protected]>
* Update README.md

Fix: Update installation command in README.md

Signed-off-by: Arham Mehta <[email protected]>

* Update README.md

Signed-off-by: Arham Mehta <[email protected]>

* Update README.md

Signed-off-by: Arham Mehta <[email protected]>

---------

Signed-off-by: Arham Mehta <[email protected]>
…ation

- Introduced `grafana.ini` for Grafana security and paths configuration.
- Added `xenna_graphana_dashboard.json` for a comprehensive dashboard setup.
- Created provisioning files for dashboards and data sources to streamline Grafana setup.
- Implemented a script to launch Prometheus and Grafana with necessary configurations.

This setup enhances the monitoring capabilities for Ray metrics, providing a structured approach to visualize and manage performance data.

Signed-off-by: Abhinav Garg <[email protected]>
…tions

- Added support for starting Ray through the script with the `--start_ray` option.
- Implemented a help function to guide users on script usage and available options.
- Improved error handling for unknown command line arguments.

This update streamlines the process of launching monitoring tools for Ray metrics, allowing for a more flexible and user-friendly experience.

Signed-off-by: Abhinav Garg <[email protected]>
* Changes in semdedup scripts

Signed-off-by: abdr17 <[email protected]>

* Made required changes in semdedup tests

Signed-off-by: abdr17 <[email protected]>

---------

Signed-off-by: abdr17 <[email protected]>
* incorportaing nvingest and curator upgrades

Signed-off-by: Rucha Apte <[email protected]>

* Modified signature

Signed-off-by: Rucha Apte <[email protected]>

* Resolve comments on PR

Signed-off-by: Rucha Apte <[email protected]>

* Update tutorials/multimodal_dapt_curation/curator/main.py

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Rucha Apte <[email protected]>

* Update tutorials/multimodal_dapt_curation/curator/main.py

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Rucha Apte <[email protected]>

* Resolving PR comments

Signed-off-by: Rucha Apte <[email protected]>

* resolving PR comment by bot

Signed-off-by: Rucha Apte <[email protected]>

* Ruff fixes

Signed-off-by: Rucha Apte <[email protected]>

---------

Signed-off-by: Rucha Apte <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Vibhu Jawa <[email protected]>
…o#794)

- Implemented tests to ensure _safe_extract prevents path traversal, absolute path, device file, unsafe symlink, and absolute symlink attacks.
- Verified that normal files are extracted correctly without security risks.

These tests enhance the security of the extraction process for tar files in the arxiv module.

Signed-off-by: Abhinav Garg <[email protected]>
sarahyurick and others added 28 commits July 22, 2025 12:49
* CI/CD for Ray API branch

Signed-off-by: Sarah Yurick <[email protected]>

* add text dependencies

Signed-off-by: Sarah Yurick <[email protected]>

* only run cpu tests

Signed-off-by: Sarah Yurick <[email protected]>

* comment instead of delete

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
* Add video io reader

* Add test

* Add VideoReaderStage to video reading pipeline and update VideoDownloadStage to accept VideoTask. Enhance video reading capabilities with new tests for VideoReaderStage.

Signed-off-by: Ao Tang <[email protected]>

* Update VideoDownloadStage to support verbose logging and modify video_read_example to include verbose argument.

Signed-off-by: Ao Tang <[email protected]>

* Update outputs for VideoDownloadStage and VideoReaderStage to include additional metadata fields.

Signed-off-by: Ao Tang <[email protected]>

* Update CI workflow to include video dependencies for testing

Signed-off-by: Ao Tang <[email protected]>

* Add tests for video tasks module

- Introduced a new test package for tasks with an initial test suite for the video tasks module, including tests for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes.
- Implemented various test cases to validate initialization, property calculations, metadata extraction, and size calculations.

This enhances the testing coverage for video-related functionalities in the ray-curator project.

Signed-off-by: Ao Tang <[email protected]>

* Enhance video tasks module with additional test cases

- Expanded the test suite for the video tasks module by adding new test cases for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes.
- Improved coverage for various functionalities including initialization, property calculations, and metadata extraction.

This update strengthens the reliability of video-related features in the ray-curator project.

Signed-off-by: Ao Tang <[email protected]>

* Update pyproject.toml to include a trailing comma for pynvml dependency

Signed-off-by: Ao Tang <[email protected]>

* Refactor video processing stages to introduce a composite VideoReaderDownloadStage

- Replaced separate VideoReaderStage and VideoDownloadStage with a new VideoReaderDownloadStage that combines both functionalities.
- Updated the video_read_example to utilize the new composite stage.
- Adjusted inputs and outputs for VideoDownloadStage to reflect changes in the pipeline.
- Added tests for the new VideoReaderDownloadStage to ensure proper functionality and integration.

This refactor simplifies the video reading and downloading process within the ray-curator framework.

Signed-off-by: Ao Tang <[email protected]>

---------

Signed-off-by: Ao Tang <[email protected]>
…VIDIA-NeMo#841) (NVIDIA-NeMo#842)

* chore: Add new trustees and vetters to the copy-pr-bot configuration



* chore: Remove empty line in copy-pr-bot configuration



* chore: Remove ryantwolf from additional trustees and vetters in copy-pr-bot configuration



---------

Signed-off-by: Ao Tang <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: Ao Tang <[email protected]>
Signed-off-by: oliver könig <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: oliver könig <[email protected]>
- Resolved conflicts in ray-curator/pyproject.toml by merging ray dependency versions and removing duplicate entries
- Resolved conflicts in ray-curator/ray_curator/tasks/video.py by properly importing DecodedData and using correct type annotation
- Kept modified .github/workflows/test.yml from image_curation_2 branch
… and refactor process_sync method

Signed-off-by: RanjitR <[email protected]>
@copy-pr-bot
Copy link

copy-pr-bot bot commented Aug 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.