@lingzhq (Collaborator) commented Jun 12, 2025

Introduces a new data selection op, inspired by the DaaR paper, that automatically selects the most diverse subset of data samples based on semantic diversity across domains.

  • Converts input samples into embeddings
  • Uses the embeddings to cluster samples into pseudo-domains
  • Selects samples based on various distances to maximize diversity
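
The three steps above can be sketched roughly as follows. This is only an illustrative sketch, not the op added in this PR: it assumes the embeddings are already computed (plain lists of floats), uses a small hand-written k-means to form pseudo-domains, and picks the samples farthest from their own cluster centroid as one possible distance criterion. The names `kmeans`, `select_diverse`, `k`, and `per_domain` are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(embeddings, k, iters=20):
    """Tiny k-means; returns (labels, centroids).

    Uses deterministic farthest-point initialisation so the sketch
    is reproducible without a random seed.
    """
    centroids = [list(embeddings[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from all chosen centroids.
        far = max(embeddings, key=lambda e: min(dist(e, c) for c in centroids))
        centroids.append(list(far))

    labels = [0] * len(embeddings)
    for _ in range(iters):
        # Assign each sample to its nearest centroid (its pseudo-domain).
        labels = [
            min(range(k), key=lambda j: dist(e, centroids[j]))
            for e in embeddings
        ]
        new_centroids = []
        for j in range(k):
            members = [e for e, lab in zip(embeddings, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean(members) if members else centroids[j])
        if new_centroids == centroids:
            break  # converged: labels match the current centroids
        centroids = new_centroids
    return labels, centroids


def select_diverse(embeddings, k, per_domain):
    """Pick up to `per_domain` sample indices from each pseudo-domain.

    Assumption: within a pseudo-domain, samples farthest from the
    centroid are treated as the most diverse. Other distance
    criteria could be substituted here.
    """
    labels, centroids = kmeans(embeddings, k)
    selected = []
    for j in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        idx.sort(key=lambda i: dist(embeddings[i], centroids[j]), reverse=True)
        selected.extend(idx[:per_domain])
    return sorted(selected)


# Toy usage: two well-separated groups of 2-D "embeddings".
toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
picked = select_diverse(toy, k=2, per_domain=1)
```

In practice the embeddings would come from a sentence-embedding model, and the selection criterion could just as well favour samples close to, or far from, other pseudo-domains' centroids.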

[WIP] Additional operators derived from DaaR are still under development.
