Conversation

@lingzhq (Collaborator) commented Jul 22, 2025

As the title says, this op calculates the in-group diversity for a batch of samples.

Here's the breakdown:

  • It first converts all input samples into embedding vectors.
  • Then, it calculates the cosine similarity of each sample against the average embedding of the whole group.
  • Finally, it normalizes these similarities to produce the stat text_ebd_diversity_score for each sample (a minimal sketch of this logic is included at the end of this description).

This op can support diversity reward shaping in Trinity-RFT.

[Note] Since this op needs to see all samples to calculate a single group average, the num_proc (np) must be set to 1.

[TODO] This op may need to handle the input data more dynamically, especially when dealing with batches of prompt-rollouts from a Trinity Buffer.
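
For concreteness, here is a minimal, self-contained sketch of the scoring logic described above, operating on pre-computed embeddings with NumPy. The function name, the norm_ratio default, and the epsilon guards are illustrative assumptions, not the actual implementation:

import numpy as np

def group_diversity_scores(embeddings: np.ndarray, norm_ratio: float = 1.0) -> list:
    """Score each sample by how far it lies from the group's average embedding."""
    # Average embedding of the whole group
    mean_emb = embeddings.mean(axis=0)

    # Cosine similarity of each sample against the group average
    denom = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(mean_emb)
    cos_sims = embeddings @ mean_emb / np.clip(denom, 1e-12, None)

    # Min-max normalization: samples farther from the group mean get higher scores
    max_sim, min_sim = cos_sims.max(), cos_sims.min()
    range_sim = max_sim - min_sim
    if range_sim < 1e-8:
        return [0.0] * len(cos_sims)
    return (norm_ratio * (max_sim - cos_sims) / range_sim).tolist()

Samples whose embeddings sit close to the group average score near 0, while the most dissimilar sample scores norm_ratio.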

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @lingzhq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant new text processing capability by adding a group_diversity_filter operator. This filter enables the measurement of semantic diversity within a batch of text samples, which is a crucial feature for advanced data curation and reinforcement learning applications. The implementation supports flexible integration with various embedding models and provides clear metrics for diversity assessment.

Highlights

  • New group_diversity_filter operator: A new filter, group_diversity_filter, has been introduced to calculate the in-group semantic diversity of text samples. This operator is designed to support diversity reward shaping in systems like Trinity-RFT.
  • Embedding-based diversity scoring: The filter works by converting input text samples into embedding vectors using either Hugging Face models or external API services. It then computes the cosine similarity of each sample's embedding against the average embedding of the entire group. These similarities are normalized to produce a text_ebd_diversity_score for each sample.
  • Configuration and Integration: The new group_diversity_filter is now configurable via configs/config_all.yaml, allowing users to specify the embedding model (API or Hugging Face), embedding dimensions, and score normalization parameters. New constants (text_ebd_diversity and text_ebd_diversity_score) have been added to data_juicer/utils/constant.py to store the calculated metrics.
  • Comprehensive Testing: Dedicated unit tests have been added to tests/ops/filter/test_group_diversity_filter.py to validate the diversity calculation logic. These tests cover both API-based and Hugging Face-based embedding models, ensuring that outlier samples are correctly identified with higher diversity scores (a toy illustration of this expected behavior follows below).
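
As a rough, self-contained illustration of that expected behavior (not taken from the actual tests), the snippet below substitutes hand-made 3-d vectors for real model embeddings and applies the same mean-embedding cosine-similarity scoring; the outlier ends up with the highest score:

import numpy as np

# Three near-duplicate samples plus one outlier, as toy 3-d "embeddings"
embs = np.array([
    [1.00, 0.00, 0.00],
    [0.98, 0.10, 0.00],
    [0.97, 0.05, 0.05],
    [0.00, 1.00, 0.00],  # outlier
])

mean_emb = embs.mean(axis=0)
cos_sims = embs @ mean_emb / (np.linalg.norm(embs, axis=1) * np.linalg.norm(mean_emb))
scores = (cos_sims.max() - cos_sims) / (cos_sims.max() - cos_sims.min())
print(scores.round(3))  # the outlier (last sample) scores highest, at 1.0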

@gemini-code-assist bot (Contributor) left a comment

Code Review

The code changes introduce a new group diversity filter. The review focuses on improving performance by batching the text embedding process and vectorizing the score normalization logic. It also suggests using loguru for consistent logging.

embeddings = []
model = get_model(self.model_key, rank, self.use_cuda())

for text in tqdm(texts, desc="Embedding texts", leave=False):

Severity: high

This loop processes texts one by one, which is inefficient for a batched operator. Most embedding models, including Hugging Face sentence-transformers, are optimized for batch processing. Given that this operator processes the entire dataset in a single batch (num_proc=1), this loop can become a significant performance bottleneck.

Consider refactoring this to process texts in batches. For Hugging Face models, you can pass the entire list of texts to model.encode() outside the loop. For API models, check if batching is supported by the underlying API wrapper.

if self.is_hf_model:
    try:
        # Use batch encoding for efficiency with Hugging Face models
        embeddings = model.encode(texts, show_progress_bar=False)
        return np.array(embeddings, dtype=np.float32)
    except Exception as e:
        logger.error(f"Failed to embed texts in batch. Error: {e}. Using zero vectors for all.")
        dim = model.get_sentence_embedding_dimension()
        return np.zeros((len(texts), dim), dtype=np.float32)

except Exception as e:
    dim = model.get_sentence_embedding_dimension() if self.is_hf_model else self.ebd_dim
    embeddings.append(np.zeros(dim, dtype=np.float32))
    print(f"Failed to embed text: '{text}'. Error: {e}. Using zero vector.", file=sys.stderr)

Severity: medium

The code uses print(..., file=sys.stderr) for logging errors. The tests for this OP use loguru.logger. For consistency with the rest of the project, it's better to use loguru.logger.error() here.

logger.error(f"Failed to embed text: '{text}'. Error: {e}. Using zero vector.")

Comment on lines +124 to +130
normalized_scores = []
if range_sim < 1e-8:
    normalized_scores = [0.0] * len(cos_sims)
else:
    for sim in cos_sims:
        normalized_sim = self.norm_ratio * (max_sim - sim) / range_sim
        normalized_scores.append(normalized_sim)

Severity: medium

This loop for calculating normalized_scores can be vectorized using numpy for better performance and readability. This avoids iterating through the similarities one by one in Python.

if range_sim < 1e-8:
    normalized_scores = [0.0] * len(cos_sims)
else:
    cos_sims_np = np.array(cos_sims)
    normalized_scores_np = self.norm_ratio * (max_sim - cos_sims_np) / range_sim
    normalized_scores = normalized_scores_np.tolist()
