[NewOp] Add group_diversity_filter op #745
Conversation
Summary of Changes
Hello @lingzhq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant new text processing capability by adding a `group_diversity_filter` operator. This filter measures the semantic diversity within a batch of text samples, a crucial feature for advanced data curation and reinforcement learning applications. The implementation supports flexible integration with various embedding models and provides clear metrics for diversity assessment.
Highlights
- **New `group_diversity_filter` operator**: A new filter, `group_diversity_filter`, has been introduced to calculate the in-group semantic diversity of text samples. This operator is designed to support diversity reward shaping in systems like Trinity-RFT.
- **Embedding-based diversity scoring**: The filter works by converting input text samples into embedding vectors using either Hugging Face models or external API services. It then computes the cosine similarity of each sample's embedding against the average embedding of the entire group. These similarities are normalized to produce a `text_ebd_diversity_score` for each sample.
- **Configuration and integration**: The new `group_diversity_filter` is now configurable via `configs/config_all.yaml`, allowing users to specify the embedding model (API or Hugging Face), embedding dimensions, and score normalization parameters. New constants (`text_ebd_diversity` and `text_ebd_diversity_score`) have been added to `data_juicer/utils/constant.py` to store the calculated metrics.
- **Comprehensive testing**: Dedicated unit tests have been added to `tests/ops/filter/test_group_diversity_filter.py` to validate the diversity calculation logic. These tests cover both API-based and Hugging Face-based embedding models, ensuring that outlier samples are correctly identified with higher diversity scores.
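The scoring pipeline described above (embed, compare to the group mean, min-max normalize) can be sketched roughly as follows. This is a minimal standalone illustration of the described math, not the operator's actual code; the function name `group_diversity_scores` and the `norm_ratio` default are assumptions for the sketch.

```python
import numpy as np

def group_diversity_scores(embeddings: np.ndarray, norm_ratio: float = 1.0) -> np.ndarray:
    """Sketch of the described scoring: cosine similarity of each sample's
    embedding to the group-average embedding, min-max normalized so that the
    most dissimilar ("most diverse") sample scores highest."""
    mean_ebd = embeddings.mean(axis=0)
    # Cosine similarity of each row against the group mean
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(mean_ebd)
    cos_sims = (embeddings @ mean_ebd) / np.clip(norms, 1e-12, None)
    max_sim, min_sim = cos_sims.max(), cos_sims.min()
    range_sim = max_sim - min_sim
    if range_sim < 1e-8:  # all samples equally similar: no diversity signal
        return np.zeros_like(cos_sims)
    return norm_ratio * (max_sim - cos_sims) / range_sim

# Three near-duplicate vectors plus one outlier
ebds = np.array([[1.0, 0.0], [0.99, 0.01], [1.0, 0.02], [0.0, 1.0]], dtype=np.float32)
scores = group_diversity_scores(ebds)
print(scores)  # the outlier (last row) receives the highest score
```

This matches the testing claim above: an outlier sample, being least similar to the group average, ends up with the largest diversity score.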
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
The code changes introduce a new group diversity filter. The review focuses on improving performance by batching the text embedding process and vectorizing the score normalization logic. It also suggests using loguru for consistent logging.
```python
embeddings = []
model = get_model(self.model_key, rank, self.use_cuda())

for text in tqdm(texts, desc="Embedding texts", leave=False):
```
This loop processes texts one by one, which is inefficient for a batched operator. Most embedding models, including Hugging Face sentence-transformers, are optimized for batch processing. Given that this operator processes the entire dataset in a single batch (`num_proc=1`), this loop can become a significant performance bottleneck.
Consider refactoring this to process texts in batches. For Hugging Face models, you can pass the entire list of texts to `model.encode()` outside the loop. For API models, check if batching is supported by the underlying API wrapper.
```python
if self.is_hf_model:
    try:
        # Use batch encoding for efficiency with Hugging Face models
        embeddings = model.encode(texts, show_progress_bar=False)
        return np.array(embeddings, dtype=np.float32)
    except Exception as e:
        logger.error(f"Failed to embed texts in batch. Error: {e}. Using zero vectors for all.")
        dim = model.get_sentence_embedding_dimension()
        return np.zeros((len(texts), dim), dtype=np.float32)
```
```python
except Exception as e:
    dim = model.get_sentence_embedding_dimension() if self.is_hf_model else self.ebd_dim
    embeddings.append(np.zeros(dim, dtype=np.float32))
    print(f"Failed to embed text: '{text}'. Error: {e}. Using zero vector.", file=sys.stderr)
```
```python
normalized_scores = []
if range_sim < 1e-8:
    normalized_scores = [0.0] * len(cos_sims)
else:
    for sim in cos_sims:
        normalized_sim = self.norm_ratio * (max_sim - sim) / range_sim
        normalized_scores.append(normalized_sim)
```
This loop for calculating `normalized_scores` can be vectorized using `numpy` for better performance and readability. This avoids iterating through the similarities one by one in Python.
```python
if range_sim < 1e-8:
    normalized_scores = [0.0] * len(cos_sims)
else:
    cos_sims_np = np.array(cos_sims)
    normalized_scores_np = self.norm_ratio * (max_sim - cos_sims_np) / range_sim
    normalized_scores = normalized_scores_np.tolist()
```
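As a quick sanity check of the vectorized form, here is a standalone run with made-up similarity values, using `norm_ratio = 1.0` in place of `self.norm_ratio`:

```python
import numpy as np

# Illustrative inputs; in the operator these come from the group's cosine similarities
cos_sims = [0.99, 0.95, 0.60]  # the last sample is the outlier
norm_ratio = 1.0
max_sim, min_sim = max(cos_sims), min(cos_sims)
range_sim = max_sim - min_sim

if range_sim < 1e-8:
    normalized_scores = [0.0] * len(cos_sims)
else:
    cos_sims_np = np.array(cos_sims)
    normalized_scores = (norm_ratio * (max_sim - cos_sims_np) / range_sim).tolist()

print(normalized_scores)  # most similar sample -> 0.0, outlier -> highest score
```

The most similar sample maps to 0.0 and the least similar to `norm_ratio`, identical to the loop version but in a single vectorized expression.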
As the title says, this op calculates the in-group diversity for a batch of samples.
Here's the breakdown: the op embeds each text sample, compares it against the group-average embedding via cosine similarity, and records a normalized stat `text_ebd_diversity_score` for each sample. This op can support the diversity reward shaping in Trinity-RFT.
[Note] Since this op needs to see all samples to calculate a single group average, `num_proc` (`np`) must be set to 1.
[TODO] This op may need to handle the input data more dynamically, especially when dealing with batches of prompt-rollouts from a Trinity Buffer.