
Commit ca102c3

Dcvlr (#1750)
1 parent ab88e73 commit ca102c3

File tree

7 files changed: +337 −49 lines changed


configs/projects/dcvlr/README.md

Lines changed: 94 additions & 35 deletions
# DCVLR - Getting Under the Hood

[![NeurIPS 2025](https://img.shields.io/badge/NeurIPS-2025-blue.svg)](https://neurips.cc/Conferences/2025)
[![Competition](https://img.shields.io/badge/Competition-Open-green.svg)](https://dcvlr.org)

[...]

---

## What is this directory?

This directory is intended to accompany the [2025 DCVLR (Data Curation for Vision-Language Reasoning) NeurIPS competition](https://dcvlr-neurips.github.io/). If you don't know what that is, you should go read the competition website and then come back here!

## DCVLR: Digging Deeper

The DCVLR competition was explicitly designed to have a *low barrier to entry*, allowing a diverse collection of teams to compete. However, we know that many teams may be interested in digging deeper into the data and the tasks in order to optimize the performance of their allowed submissions. If that's you, you've come to the right place. This directory gives you all the building blocks necessary to reproduce the training and evaluation pipeline used in the DCVLR competition on your own cluster.

## What You Will Need

To reproduce our experimental pipeline with the model architectures we consider for this competition (which range from 7B to 10B parameters), you will need access to a cluster with at least 8 A100 GPUs and 1 TB of disk space. If you don't have access, you can rent a cluster, e.g. on [Lambda](https://lambdalabs.com/service/gpu-cloud). All DCVLR participants are eligible for a Lambda credit, which they can use to run experiments for the competition.
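
Before launching anything, it is worth confirming that the node you are on actually matches these requirements. A minimal check using standard tools (nothing DCVLR-specific is assumed here):

```bash
nvidia-smi --list-gpus | wc -l   # expect 8
df -h .                          # check free disk space on the working volume
```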

We plan to add examples of how to experiment with smaller architectures (e.g. 1B parameters) to this directory at a later date, so stay tuned. You can also refer to the [Oumi documentation](https://oumi.ai/docs/en/latest/index.html) for more information on how to run experiments on smaller clusters.

### Data Sourcing

Where can you source data that might be suitable for training for this competition? If you want to draw on existing datasets, here are a few we recommend looking into (a short sketch after this list shows one way to pull a copy down locally for a first look):

- [Llava-O1](https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k)
- [Math-Llava](https://huggingface.co/datasets/Zhiqiang007/MathV360K)
- [Geo-170K](https://huggingface.co/datasets/Luckyjhg/Geo170K)
- [Open-R1](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified)
- [AIDC Ovis](https://huggingface.co/datasets/AIDC-AI/Ovis-dataset)
- [Llava 1V](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data)
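
For example, to mirror one of these datasets locally for inspection before committing to a curation strategy, you can use the Hugging Face CLI. The dataset choice and local path below are placeholders of our own; any of the repositories above can be substituted:

```bash
huggingface-cli download lmms-lab/multimodal-open-r1-8k-verified \
  --repo-type dataset \
  --local-dir ./data/open-r1-8k
```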
### Data Curation

We will add documentation on how to use Oumi for synthetic data curation and data transformation here soon. Stay tuned!

For now, you will have to BYOD (bring your own dataset) in an Oumi-supported dataset format. For this competition, we highly recommend the flexible "hf_vision" format, which allows you to load a wide range of VL datasets from the Hugging Face Hub. Here's an example we used for training on a filtered version of the Multimodal Open-R1 dataset:

```yaml
datasets:
  - dataset_name: "hf_vision"
    split: "train"
    shuffle: True
    seed: 42
    trust_remote_code: True
    transform_num_workers: "auto"
    dataset_kwargs:
      hf_dataset_path: "penfever/multimodal-open-r1-8192-filtered-tighter"
      image_column: "image"
      question_column: "problem"
      answer_column: "solution"
      return_tensors: True
```
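
Before kicking off a full run, it can save time to confirm that the underlying Hugging Face dataset loads and exposes the columns referenced above. A quick, optional check (assuming the `datasets` library is available in your environment):

```bash
python -c "
from datasets import load_dataset
ds = load_dataset('penfever/multimodal-open-r1-8192-filtered-tighter', split='train')
print(ds)                       # row count and column names
print(ds[0]['problem'][:200])   # peek at one question
"
```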
### Model Training

#### Setup and Environment

DCVLR experiments can be run using the main branch of the Oumi repository. We provide a [Dockerfile](https://github.com/oumi-ai/oumi/blob/main/Dockerfile) for building Oumi, or you can follow the instructions in the [Quickstart](https://oumi.ai/docs/en/latest/get_started/quickstart.html).
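
If you take the container route, building the image and dropping into an interactive shell looks roughly like the following. This is a sketch rather than an official workflow; the image tag and mount paths are placeholders of our own choosing:

```bash
git clone https://github.com/oumi-ai/oumi.git && cd oumi
docker build -t oumi-dcvlr .          # image tag is arbitrary
docker run --gpus all -it --rm \
  -v "$PWD:/workspace" -w /workspace \
  oumi-dcvlr bash
```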
#### Commands

Model training is extremely straightforward, requiring only a single command:

```bash
export MY_CONFIG=<PATH/TO/qwenvl-openr1.yaml>
torchrun --nproc-per-node 8 --standalone -m oumi train -c $MY_CONFIG
```

We provide configurations for three models: Molmo-D, Molmo-O, and Qwen2.5-VL. Other models, such as InternVL3, may also be used in the competition.

Model checkpoints are saved at the top level of the directory specified by `output_dir` under the `training:` section of the config file.

We then recommend syncing the trained model to the Hugging Face Hub using the `huggingface-cli` tool to enable version control and ease of future access. The repository need not exist in advance; it will be created automatically when you run this command:

```bash
huggingface-cli upload-large-folder <YOUR_HF_REPO> <YOUR_OUTPUT_DIRECTORY> --repo-type=model
```

### Model Evaluation

#### Setup and Environment

We use a modified version of [VLMEvalKit](https://github.com/oumi-ai/VLMEvalKit) as our evaluation harness. You can clone and install it following the directions in the repo, or use the provided [Dockerfile](https://github.com/oumi-ai/VLMEvalKit/blob/main/docker/Dockerfile.cuda12.9-oumi-molmo-qwen).
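
For the source install, the usual VLMEvalKit pattern is an editable pip install from a clone; treat this as a sketch and defer to the repository's own README if its instructions differ:

```bash
git clone https://github.com/oumi-ai/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
```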

#### Commands

Model evaluation can also be run with a single command. We give an example with four datasets; these are not guaranteed to be the ones used in the competition, but they are a good starting point for the types of tasks we are considering.

```bash
export MODEL_NAME=<YOUR/HF/MODEL/PATH>
export WORK_DIR=<YOUR/OUTPUT/DIRECTORY>
mkdir -p "$WORK_DIR"
export DATASETS="VMCBench_DEV WeMath MathVista_MINI LiveXivVQA"
python scripts/wandb_logger.py --run-and-log \
  --data $DATASETS \
  --work-dir $WORK_DIR \
  --use-vllm \
  --save-detailed-eval \
  --save-judge-responses \
  --max-output-tokens 4096 \
  --pass-custom-model $MODEL_NAME
```
## How to Cite DCVLR

If you wish to refer to DCVLR in your work, please cite the following:

```bib
@misc{dcvlr2025,
  author = {Feuer, Benjamin and Tripathi, Rohun and Elachqar, Oussama and Zhang, Yuhui and Hulkund, Neha and Nguyen, Thao and Shabtay, Nimrod and Udandarao, Vishaal and Wang, Xiaohan and Webb, Stefan and Koukoumidis, Emmanouil and Schmidt, Ludwig and Xie, Saining and Yeung-Levy, Serena and Liang, Paul and Beery, Sara and Gkioxari, Georgia},
  month = jun,
  title = {{DCVLR}: Data Curation for Vision-Language Reasoning},
  year = {2025}
}
```

configs/projects/dcvlr/starter_kit/README.md

Lines changed: 0 additions & 1 deletion
This file was deleted.

configs/projects/dcvlr/starter_kit/evaluate.sh

Whitespace-only changes.
Lines changed: 65 additions & 0 deletions
# Full fine-tune config for Molmo-7B-D.
#
# Note: the original model is not compatible with the latest version of transformers and oumi
# We use the oumi-ai version of the model instead until the original model is updated.
#
# Requirements:
#   - uv pip install einops tf-keras
#
# Usage:
#   oumi train -c configs/recipes/vision/molmo/sft/molmo_d_full/train.yaml
#
# See Also:
#   - Documentation: https://oumi.ai/docs/en/latest/user_guides/train/train.html
#   - Config class: oumi.core.configs.TrainingConfig
#   - Config source: https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/training_config.py
#   - Other training configs: configs/**/pretraining/, configs/**/sft/, configs/**/dpo/

model:
  # model_name: "allenai/Molmo-7B-O-0924"
  model_name: "oumi-ai/Molmo-7B-D-0924"
  torch_dtype_str: "float32"
  model_max_length: 8192
  trust_remote_code: True
  model_kwargs:
    max_position_embeddings: 8192

data:
  train:
    collator_name: "vision_language_sft"
    collator_kwargs:
      process_individually: True
    use_torchdata: True
    datasets:
      - dataset_name: "hf_vision"
        split: "train"
        shuffle: True
        seed: 42
        trust_remote_code: True
        transform_num_workers: "auto"
        dataset_kwargs:
          hf_dataset_path: "penfever/multimodal-open-r1-8192-filtered-tighter"
          image_column: "image"
          question_column: "problem"
          answer_column: "solution"
          return_tensors: True

training:
  output_dir: "output/molmo_d_openr1"
  trainer_type: "TRL_SFT"
  enable_gradient_checkpointing: False # Note: Molmo does not support gradient checkpointing
  per_device_train_batch_size: 1
  optimizer: "adamw_torch_fused"
  logging_steps: 100
  save_steps: 0
  include_performance_metrics: True
  log_model_summary: True
  dataloader_main_process_only: False

fsdp:
  enable_fsdp: True
  sharding_strategy: "HYBRID_SHARD"
  mixed_precision: "bf16"
  forward_prefetch: True
  auto_wrap_policy: "SIZE_BASED_WRAP" # TODO: use transformer wrapper instead
  min_num_params: 100000
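
A launch sketch for this config, following the same pattern as the training command in the README above; the config path is a placeholder to replace with wherever you save this file:

```bash
export MY_CONFIG=<PATH/TO/THE/MOLMO-D/CONFIG.yaml>   # placeholder; substitute the saved path
torchrun --nproc-per-node 8 --standalone -m oumi train -c $MY_CONFIG
```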

configs/projects/dcvlr/starter_kit/train.yaml renamed to configs/projects/dcvlr/starter_kit/molmo-o-train-openr1.yaml

Lines changed: 15 additions & 13 deletions
@@ -1,4 +1,4 @@
-# Full fine-tune config for Molmo-7B-D.
+# Full fine-tune config for Molmo-7B-O.
 #
 # Note: the original model is not compatible with the latest version of transformers and oumi
 # We use the oumi-ai version of the model instead until the original model is updated.
@@ -8,7 +8,6 @@
 #
 # Usage:
 #   oumi train -c configs/recipes/vision/molmo/sft/molmo_o_full/train.yaml
-#   torchrun --nproc-per-node 4 --standalone -m oumi train -c configs/recipes/vision/molmo/sft/molmo_o_full/train.yaml
 #
 # See Also:
 #   - Documentation: https://oumi.ai/docs/en/latest/user_guides/train/train.html
@@ -17,11 +16,12 @@
 #   - Other training configs: configs/**/pretraining/, configs/**/sft/, configs/**/dpo/

 model:
-  # model_name: "allenai/Molmo-7B-O-0924"
   model_name: "oumi-ai/Molmo-7B-O-0924"
   torch_dtype_str: "float32"
-  model_max_length: 2048
+  model_max_length: 8192
   trust_remote_code: True
+  model_kwargs:
+    max_position_embeddings: 8192

 data:
   train:
@@ -30,24 +30,26 @@ data:
       process_individually: True
     use_torchdata: True
     datasets:
-      - dataset_name: "merve/vqav2-small"
-        split: "validation"
+      - dataset_name: "hf_vision"
+        split: "train"
         shuffle: True
         seed: 42
+        trust_remote_code: True
         transform_num_workers: "auto"
         dataset_kwargs:
-          # processor_name: "allenai/Molmo-7B-O-0924"
-          processor_name: "oumi-ai/Molmo-7B-O-0924"
-          return_conversation: True
+          hf_dataset_path: "penfever/multimodal-open-r1-8192-filtered-tighter"
+          image_column: "image"
+          question_column: "problem"
+          answer_column: "solution"
+          return_tensors: True

 training:
-  output_dir: "output/molmo_sft"
+  output_dir: "output/molmo_o_openr1"
   trainer_type: "TRL_SFT"
   enable_gradient_checkpointing: False # Note: Molmo does not support gradient checkpointing
-  per_device_train_batch_size: 2
-  max_steps: 20
+  per_device_train_batch_size: 1
   optimizer: "adamw_torch_fused"
-  logging_steps: 5
+  logging_steps: 100
   save_steps: 0
   include_performance_metrics: True
   log_model_summary: True
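
After the rename, the README's training command applies directly to this config; a sketch assuming it is launched from the repository root:

```bash
torchrun --nproc-per-node 8 --standalone -m oumi train \
  -c configs/projects/dcvlr/starter_kit/molmo-o-train-openr1.yaml
```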
Lines changed: 82 additions & 0 deletions
# Qwen 2.5 VL 7B full fine-tune training config.
#
# Requirements:
#   - Log into WandB (`wandb login`) or disable `enable_wandb`
#   - (optional) If you want to use flash attention, run `pip install -U flash-attn --no-build-isolation`
#
# See Also:
#   - Documentation: https://oumi.ai/docs/en/latest/user_guides/train/train.html
#   - Config class: oumi.core.configs.TrainingConfig
#   - Config source: https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/training_config.py
#   - Other training configs: configs/**/pretraining/, configs/**/sft/, configs/**/dpo/

model:
  model_name: "Qwen/Qwen2.5-VL-7B-Instruct"
  torch_dtype_str: "bfloat16"
  model_max_length: 10000
  trust_remote_code: True
  attn_implementation: "sdpa" # You can also use `flash_attention_2` if you install it
  chat_template: "qwen2-vl-instruct" # 2.5 uses the same template as 2.0

data:
  train:
    collator_name: "vision_language_sft"
    collator_kwargs:
      process_individually: True
    use_torchdata: True
    datasets:
      - dataset_name: "hf_vision"
        split: "train"
        shuffle: True
        seed: 42
        trust_remote_code: True
        transform_num_workers: "auto"
        dataset_kwargs:
          hf_dataset_path: "penfever/multimodal-open-r1-8192-filtered-tighter"
          image_column: "image"
          question_column: "problem"
          answer_column: "solution"
          return_tensors: True
          processor_name: "Qwen/Qwen2.5-VL-7B-Instruct"

training:
  output_dir: "output/qwen2_5_vl_7b_openr1"
  trainer_type: "TRL_SFT"
  enable_gradient_checkpointing: True
  per_device_train_batch_size: 1 # Must be 1: the model generates variable-sized image features
  gradient_accumulation_steps: 1
  # max_steps: 20 # Uncomment if you want to limit the number of training steps.
  num_train_epochs: 1
  # If this is not passed, checkpoints may be saved which are suitable for resuming training but not for loading from HF
  save_final_model: True

  gradient_checkpointing_kwargs:
    # Reentrant docs: https://pytorch.org/docs/stable/checkpoint.html#torch.utils.checkpoint.checkpoint
    use_reentrant: False
  ddp_find_unused_parameters: False
  empty_device_cache_steps: 1
  compile: False

  optimizer: "adamw_torch_fused"
  learning_rate: 2e-5
  warmup_ratio: 0.03
  weight_decay: 0.01
  lr_scheduler_type: "cosine"

  logging_steps: 5
  save_steps: 0
  dataloader_main_process_only: False
  dataloader_num_workers: 2
  dataloader_prefetch_factor: 8
  include_performance_metrics: True
  log_model_summary: False
  enable_wandb: True

fsdp:
  enable_fsdp: True
  sharding_strategy: "HYBRID_SHARD"
  mixed_precision: "bf16"
  forward_prefetch: True
  auto_wrap_policy: "SIZE_BASED_WRAP" # TODO: use transformer wrapper instead
  min_num_params: 100000
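
This appears to be the config that the README's `<PATH/TO/qwenvl-openr1.yaml>` placeholder refers to. A launch sketch, including the WandB login the header comments call for; the config path is a placeholder:

```bash
wandb login                                   # or set enable_wandb: False in the config
export MY_CONFIG=<PATH/TO/THIS/CONFIG.yaml>   # placeholder; substitute the saved path
torchrun --nproc-per-node 8 --standalone -m oumi train -c $MY_CONFIG
```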
