-
Notifications
You must be signed in to change notification settings - Fork 184
Add VlM captioning pipeline for images #914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
ronjer30
wants to merge
80
commits into
NVIDIA-NeMo:huvu/image_curation_2
Choose a base branch
from
ronjer30:vlm-captioning
base: huvu/image_curation_2
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Add VlM captioning pipeline for images #914
ronjer30
wants to merge
80
commits into
NVIDIA-NeMo:huvu/image_curation_2
from
ronjer30:vlm-captioning
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…s/curator-llm-pii` folder (NVIDIA-NeMo#688) * your message * adding example notebook Signed-off-by: Adeola <[email protected]> * adding notebook with open source data Signed-off-by: Adeola <[email protected]> * Update tutorials/curator-llm-pii/README.md Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * updating example notebook Signed-off-by: Adeola <[email protected]> * updating readme Signed-off-by: Adeola <[email protected]> * updating readme Signed-off-by: Adeola <[email protected]> * Apply suggestions from code review Signed-off-by: Sarah Yurick <[email protected]> * uploading Enron notebook Signed-off-by: Adeola Adesoba <[email protected]> * updated enron notebook Signed-off-by: Adeola Adesoba <[email protected]> * updated PII-LLM notebook Signed-off-by: Adeola Adesoba <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-LLM-tutorial.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * Update tutorials/curator-llm-pii/PII-modification-Enron.ipynb Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> * updated PII-LLM-Enron notebook Signed-off-by: Adeola Adesoba <[email protected]> * updated PII-LLM-tutorial notebook Signed-off-by: Adeola Adesoba <[email protected]> * removed outdated PII-LLM-tutorial notebook Signed-off-by: Adeola Adesoba <[email protected]> * removed outdated PII-LLM-Enron notebook Signed-off-by: Adeola Adesoba <[email protected]> * Update README.md Signed-off-by: Sarah Yurick <[email protected]> * recommitting overwritten PII-LLM-Enron notebook Signed-off-by: Adeola Adesoba <[email protected]> * recommitting overwritten PII-LLM-tutorial notebook Signed-off-by: Adeola Adesoba <[email protected]> * Updated README.md Signed-off-by: Adeola Adesoba <[email protected]> * Apply suggestions from code review Signed-off-by: Sarah Yurick <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Adeola <[email protected]> Signed-off-by: aadesoba-nv <[email protected]> Signed-off-by: Sarah Yurick <[email protected]> Signed-off-by: Adeola Adesoba <[email protected]> Co-authored-by: Sarah Yurick <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Minor change to test CI Signed-off-by: Charlie Truong <[email protected]> * Remove example test Signed-off-by: Charlie Truong <[email protected]> * Minor change Signed-off-by: Charlie Truong <[email protected]> * Run on NeMo GPU runners Signed-off-by: Charlie Truong <[email protected]> --------- Signed-off-by: Charlie Truong <[email protected]>
Signed-off-by: Charlie Truong <[email protected]>
* [🤖]: Howdy folks, let's bump NeMo Curator to `0.10.0rc0.dev0` ! Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Fix lint error in package_info Signed-off-by: Charlie Truong <[email protected]> --------- Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Charlie Truong <[email protected]> Co-authored-by: chtruong814 <[email protected]>
* ci: Run gputests on main Signed-off-by: oliver könig <[email protected]> * f Signed-off-by: oliver könig <[email protected]> --------- Signed-off-by: oliver könig <[email protected]>
Signed-off-by: oliver könig <[email protected]>
…ication notebook (NVIDIA-NeMo#748) Signed-off-by: Abhinav Garg <[email protected]>
NVIDIA-NeMo#752) * Fix failing Trafilatura tests for extracting Chinese and Japanese text Signed-off-by: Sarah Yurick <[email protected]> * force cpu tests to run Signed-off-by: Sarah Yurick <[email protected]> --------- Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Charlie Truong <[email protected]>
…eMo#749) * Add placeholder codecov check for when tests do not run Signed-off-by: Charlie Truong <[email protected]> * Force tests not to run Signed-off-by: Charlie Truong <[email protected]> * Revert "Force tests not to run" This reverts commit a3073ba. Signed-off-by: Charlie Truong <[email protected]> * Add gpu-ci-result step Signed-off-by: Charlie Truong <[email protected]> * Skip currently failing unit tests Signed-off-by: Charlie Truong <[email protected]> * Run cpu tests on all PRs Signed-off-by: Charlie Truong <[email protected]> * Revert "Skip currently failing unit tests" This reverts commit ccc88ef. Signed-off-by: Charlie Truong <[email protected]> * Fix gpu-ci-results Signed-off-by: Charlie Truong <[email protected]> --------- Signed-off-by: Charlie Truong <[email protected]>
Signed-off-by: Charlie Truong <[email protected]>
* update llama nemotron docs Signed-off-by: Sarah Yurick <[email protected]> * add cd cmd Signed-off-by: Sarah Yurick <[email protected]> * update blocksize and n workers Signed-off-by: Sarah Yurick <[email protected]> * revert n workers Signed-off-by: Sarah Yurick <[email protected]> --------- Signed-off-by: Sarah Yurick <[email protected]>
…rsal (NVIDIA-NeMo#769) * Implement safe extraction methods for tar files to prevent path traversal attacks in arxiv.py Signed-off-by: Abhinav Garg <[email protected]> * Adding tests Signed-off-by: Abhinav Garg <[email protected]> --------- Signed-off-by: Abhinav Garg <[email protected]>
…VIDIA-NeMo#774) * Remove skip markers from classifier tests to enable execution on GPU Signed-off-by: Abhinav Garg <[email protected]> * Remove skip markers from additional classifier tests to enable GPU execution Signed-off-by: Abhinav Garg <[email protected]> * Skip classifier tests for GPU execution in test_classifiers.py Signed-off-by: Abhinav Garg <[email protected]> --------- Signed-off-by: Abhinav Garg <[email protected]>
Signed-off-by: Arham Mehta <[email protected]>
* Update README.md Fix: Update installation command in README.md Signed-off-by: Arham Mehta <[email protected]> * Update README.md Signed-off-by: Arham Mehta <[email protected]> * Update README.md Signed-off-by: Arham Mehta <[email protected]> --------- Signed-off-by: Arham Mehta <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
…ersion control. Signed-off-by: Abhinav Garg <[email protected]>
…ation - Introduced `grafana.ini` for Grafana security and paths configuration. - Added `xenna_graphana_dashboard.json` for a comprehensive dashboard setup. - Created provisioning files for dashboards and data sources to streamline Grafana setup. - Implemented a script to launch Prometheus and Grafana with necessary configurations. This setup enhances the monitoring capabilities for Ray metrics, providing a structured approach to visualize and manage performance data. Signed-off-by: Abhinav Garg <[email protected]>
…tions - Added support for starting Ray through the script with the `--start_ray` option. - Implemented a help function to guide users on script usage and available options. - Improved error handling for unknown command line arguments. This update streamlines the process of launching monitoring tools for Ray metrics, allowing for a more flexible and user-friendly experience. Signed-off-by: Abhinav Garg <[email protected]>
* Changes in semdedup scripts Signed-off-by: abdr17 <[email protected]> * Made required changes in semdedup tests Signed-off-by: abdr17 <[email protected]> --------- Signed-off-by: abdr17 <[email protected]>
* incorportaing nvingest and curator upgrades Signed-off-by: Rucha Apte <[email protected]> * Modified signature Signed-off-by: Rucha Apte <[email protected]> * Resolve comments on PR Signed-off-by: Rucha Apte <[email protected]> * Update tutorials/multimodal_dapt_curation/curator/main.py Co-authored-by: Copilot <[email protected]> Signed-off-by: Rucha Apte <[email protected]> * Update tutorials/multimodal_dapt_curation/curator/main.py Co-authored-by: Copilot <[email protected]> Signed-off-by: Rucha Apte <[email protected]> * Resolving PR comments Signed-off-by: Rucha Apte <[email protected]> * resolving PR comment by bot Signed-off-by: Rucha Apte <[email protected]> * Ruff fixes Signed-off-by: Rucha Apte <[email protected]> --------- Signed-off-by: Rucha Apte <[email protected]> Co-authored-by: Copilot <[email protected]> Co-authored-by: Vibhu Jawa <[email protected]>
…o#794) - Implemented tests to ensure _safe_extract prevents path traversal, absolute path, device file, unsafe symlink, and absolute symlink attacks. - Verified that normal files are extracted correctly without security risks. These tests enhance the security of the extraction process for tar files in the arxiv module. Signed-off-by: Abhinav Garg <[email protected]>
* CI/CD for Ray API branch Signed-off-by: Sarah Yurick <[email protected]> * add text dependencies Signed-off-by: Sarah Yurick <[email protected]> * only run cpu tests Signed-off-by: Sarah Yurick <[email protected]> * comment instead of delete Signed-off-by: Sarah Yurick <[email protected]> --------- Signed-off-by: Sarah Yurick <[email protected]>
* Add video io reader * Add test * Add VideoReaderStage to video reading pipeline and update VideoDownloadStage to accept VideoTask. Enhance video reading capabilities with new tests for VideoReaderStage. Signed-off-by: Ao Tang <[email protected]> * Update VideoDownloadStage to support verbose logging and modify video_read_example to include verbose argument. Signed-off-by: Ao Tang <[email protected]> * Update outputs for VideoDownloadStage and VideoReaderStage to include additional metadata fields. Signed-off-by: Ao Tang <[email protected]> * Update CI workflow to include video dependencies for testing Signed-off-by: Ao Tang <[email protected]> * Add tests for video tasks module - Introduced a new test package for tasks with an initial test suite for the video tasks module, including tests for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes. - Implemented various test cases to validate initialization, property calculations, metadata extraction, and size calculations. This enhances the testing coverage for video-related functionalities in the ray-curator project. Signed-off-by: Ao Tang <[email protected]> * Enhance video tasks module with additional test cases - Expanded the test suite for the video tasks module by adding new test cases for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes. - Improved coverage for various functionalities including initialization, property calculations, and metadata extraction. This update strengthens the reliability of video-related features in the ray-curator project. Signed-off-by: Ao Tang <[email protected]> * Update pyproject.toml to include a trailing comma for pynvml dependency Signed-off-by: Ao Tang <[email protected]> * Refactor video processing stages to introduce a composite VideoReaderDownloadStage - Replaced separate VideoReaderStage and VideoDownloadStage with a new VideoReaderDownloadStage that combines both functionalities. - Updated the video_read_example to utilize the new composite stage. - Adjusted inputs and outputs for VideoDownloadStage to reflect changes in the pipeline. - Added tests for the new VideoReaderDownloadStage to ensure proper functionality and integration. This refactor simplifies the video reading and downloading process within the ray-curator framework. Signed-off-by: Ao Tang <[email protected]> --------- Signed-off-by: Ao Tang <[email protected]>
…VIDIA-NeMo#841) (NVIDIA-NeMo#842) * chore: Add new trustees and vetters to the copy-pr-bot configuration * chore: Remove empty line in copy-pr-bot configuration * chore: Remove ryantwolf from additional trustees and vetters in copy-pr-bot configuration --------- Signed-off-by: Ao Tang <[email protected]> Signed-off-by: NeMo Bot <[email protected]> Co-authored-by: Ao Tang <[email protected]>
Signed-off-by: oliver könig <[email protected]> Signed-off-by: NeMo Bot <[email protected]> Co-authored-by: oliver könig <[email protected]>
- Resolved conflicts in ray-curator/pyproject.toml by merging ray dependency versions and removing duplicate entries - Resolved conflicts in ray-curator/ray_curator/tasks/video.py by properly importing DecodedData and using correct type annotation - Kept modified .github/workflows/test.yml from image_curation_2 branch
Signed-off-by: RanjitR <[email protected]>
… and refactor process_sync method Signed-off-by: RanjitR <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
Adds a pipeline recipe
ray-curator/ray_curator/examples/image/README.mdto generate captions for images using VLM NIMs. It also includes 2 stages added toray_curator/stagesto support reading images from files and URLs and a VLM captioning stage.Usage
See
ray-curator/ray_curator/examples/image/README.mdfor examples of usage