[Feature] Dataset Version Control #1519

@whittenator

📄 Description

Currently, Intel Geti does not provide a native way to version control datasets. When users add new imagery, modify labels, or restructure classes, these changes are applied directly to the dataset without a mechanism to track revisions, create restore points, or compare different versions. This creates significant risk when iterating on datasets:

  • Performance regression isolation: When new data is added and model performance unexpectedly drops, it is difficult to determine whether the issue stems from data quality, label accuracy, or distribution shift. Users must manually export and manage dataset snapshots outside the platform, which is error-prone and lacks traceability.

  • Experimentation friction: Data scientists cannot safely experiment with different data compositions (e.g., adding hard negative samples, removing ambiguous classes, or balancing categories) without permanently altering the source dataset. This discourages iterative improvement and A/B testing of dataset strategies.

  • Collaboration & reproducibility: Teams lack a clear audit trail of what data changed, when, and why. Reproducing a model trained on a specific dataset state becomes nearly impossible if the dataset has been modified in the interim.

The proposed solution is to implement a dataset version control system within Geti, similar to how Git manages code versions. Users should be able to commit dataset states with descriptive messages, tag versions for releases, create branches for experiments, and seamlessly revert to previous versions when needed.

Key capabilities would include:

  • Version snapshots: Create named versions of a dataset at any point (e.g., "v1.0-baseline", "v1.1-added-night-vision-samples")

  • Diff & compare: Visualize differences between versions (added/removed images, class distribution changes, label modifications)

  • Revert/rollback: Restore a project to use any previous dataset version with one click

  • Branching: Maintain parallel dataset variants for different experiments without conflicts

  • Tagging & annotations: Add metadata to versions for easy identification and team communication
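To make the capabilities above more concrete, here is a rough sketch of how snapshots, diffs, branches, tags, and rollback could fit together. It is only an illustration: `DatasetRepo`, `Version`, and every method name are hypothetical and not part of any existing Geti API, and a dataset is simplified to an in-memory mapping of image ID to a single label.

```python
from __future__ import annotations

import hashlib
import json
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Version:
    """Immutable snapshot of a dataset state (hypothetical structure)."""
    version_id: str
    message: str
    parent: str | None
    items: tuple[tuple[str, str], ...]  # sorted (image_id, label) pairs


@dataclass
class DatasetRepo:
    """Hypothetical in-memory store; assumes one label per image."""
    versions: dict[str, Version] = field(default_factory=dict)
    branches: dict[str, str] = field(default_factory=dict)
    tags: dict[str, str] = field(default_factory=dict)

    def _resolve(self, ref: str) -> str:
        # A ref may be a tag name, a branch name, or a raw version id.
        return self.tags.get(ref) or self.branches.get(ref) or ref

    def commit(self, branch: str, items: dict[str, str], message: str) -> str:
        """Snapshot `items` on `branch`; the id is a hash of the content."""
        parent = self.branches.get(branch)
        frozen = tuple(sorted(items.items()))
        payload = json.dumps({"parent": parent, "items": frozen, "message": message})
        version_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.versions[version_id] = Version(version_id, message, parent, frozen)
        self.branches[branch] = version_id
        return version_id

    def tag(self, name: str, ref: str) -> None:
        self.tags[name] = self._resolve(ref)

    def create_branch(self, name: str, from_ref: str) -> None:
        self.branches[name] = self._resolve(from_ref)

    def checkout(self, ref: str) -> dict[str, str]:
        """Return the image -> label mapping stored for `ref`."""
        return dict(self.versions[self._resolve(ref)].items)

    def revert(self, branch: str, ref: str) -> None:
        """Roll `branch` back to an earlier version; nothing is deleted."""
        version_id = self._resolve(ref)
        if version_id not in self.versions:
            raise KeyError(ref)
        self.branches[branch] = version_id

    def diff(self, old_ref: str, new_ref: str) -> dict[str, object]:
        """Added/removed images, relabeled images, and per-class count changes."""
        old, new = self.checkout(old_ref), self.checkout(new_ref)
        old_counts, new_counts = Counter(old.values()), Counter(new.values())
        return {
            "added": sorted(new.keys() - old.keys()),
            "removed": sorted(old.keys() - new.keys()),
            "relabeled": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
            "class_delta": {
                c: new_counts[c] - old_counts[c]
                for c in sorted(set(old_counts) | set(new_counts))
                if new_counts[c] != old_counts[c]
            },
        }


# Example usage with made-up file names and labels.
repo = DatasetRepo()
v1 = repo.commit("main", {"img_001.jpg": "cat", "img_002.jpg": "dog"}, "baseline")
repo.tag("v1.0-baseline", v1)
v2 = repo.commit(
    "main",
    {"img_001.jpg": "cat", "img_002.jpg": "cat", "img_003.jpg": "dog"},
    "added night-vision samples",
)
changes = repo.diff("v1.0-baseline", "main")
repo.revert("main", "v1.0-baseline")  # main points at the baseline again
```

In this sketch a revert only moves the branch pointer, so the newer version stays available for later comparison. A real implementation would also need to cover media storage, multi-label annotations, and the link between a version and the models trained on it.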

🎯 Objective

Enable users to track, manage, and restore multiple versions of their dataset within Intel Geti, providing full auditability and safe experimentation workflows that directly tie dataset changes to model performance outcomes.
