Conversation

@lingzhq (Collaborator) commented Jul 22, 2025

As the title says, this op calculates the in-group diversity for a batch of samples.

Here's the breakdown:

  • It first converts all input samples into embedding vectors.
  • Then, it calculates the cosine similarity of each sample against the average embedding of the whole group.
  • Finally, it normalizes these similarities to produce the stat text_ebd_diversity_score for each sample (a minimal sketch of this logic is included at the end of this description).

This op can support diversity reward shaping in Trinity-RFT.

[Note] Since this op needs to see all samples to calculate a single group average, the num_proc (np) must be set to 1.

[TODO] This op may need to handle the input data more dynamically, especially when dealing with batches of prompt-rollouts from a Trinity Buffer.
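
For concreteness, here is a minimal, self-contained sketch of the scoring logic described above, operating on pre-computed embeddings with NumPy. The function name, the norm_ratio default, and the epsilon guards are illustrative assumptions, not the actual implementation:

import numpy as np

def group_diversity_scores(embeddings: np.ndarray, norm_ratio: float = 1.0) -> list:
    """Score each sample by how far it lies from the group's average embedding."""
    # Average embedding of the whole group
    mean_emb = embeddings.mean(axis=0)

    # Cosine similarity of each sample against the group average
    denom = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(mean_emb)
    cos_sims = embeddings @ mean_emb / np.clip(denom, 1e-12, None)

    # Min-max normalization: samples farther from the group mean get higher scores
    max_sim, min_sim = cos_sims.max(), cos_sims.min()
    range_sim = max_sim - min_sim
    if range_sim < 1e-8:
        return [0.0] * len(cos_sims)
    return (norm_ratio * (max_sim - cos_sims) / range_sim).tolist()

Samples whose embeddings sit close to the group average score near 0, while the most dissimilar sample scores norm_ratio.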

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @lingzhq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant new text processing capability by adding a group_diversity_filter operator. This filter enables the measurement of semantic diversity within a batch of text samples, which is a crucial feature for advanced data curation and reinforcement learning applications. The implementation supports flexible integration with various embedding models and provides clear metrics for diversity assessment.

Highlights

  • New group_diversity_filter operator: A new filter, group_diversity_filter, has been introduced to calculate the in-group semantic diversity of text samples. This operator is designed to support diversity reward shaping in systems like Trinity-RFT.
  • Embedding-based diversity scoring: The filter works by converting input text samples into embedding vectors using either Hugging Face models or external API services. It then computes the cosine similarity of each sample's embedding against the average embedding of the entire group. These similarities are normalized to produce a text_ebd_diversity_score for each sample.
  • Configuration and Integration: The new group_diversity_filter is now configurable via configs/config_all.yaml, allowing users to specify the embedding model (API or Hugging Face), embedding dimensions, and score normalization parameters. New constants (text_ebd_diversity and text_ebd_diversity_score) have been added to data_juicer/utils/constant.py to store the calculated metrics.
  • Comprehensive Testing: Dedicated unit tests have been added to tests/ops/filter/test_group_diversity_filter.py to validate the diversity calculation logic. These tests cover both API-based and Hugging Face-based embedding models, ensuring that outlier samples are correctly identified with higher diversity scores (a toy illustration of this expected behavior follows below).
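
As a rough, self-contained illustration of that expected behavior (not taken from the actual tests), the snippet below substitutes hand-made 3-d vectors for real model embeddings and applies the same mean-embedding cosine-similarity scoring; the outlier ends up with the highest score:

import numpy as np

# Three near-duplicate samples plus one outlier, as toy 3-d "embeddings"
embs = np.array([
    [1.00, 0.00, 0.00],
    [0.98, 0.10, 0.00],
    [0.97, 0.05, 0.05],
    [0.00, 1.00, 0.00],  # outlier
])

mean_emb = embs.mean(axis=0)
cos_sims = embs @ mean_emb / (np.linalg.norm(embs, axis=1) * np.linalg.norm(mean_emb))
scores = (cos_sims.max() - cos_sims) / (cos_sims.max() - cos_sims.min())
print(scores.round(3))  # the outlier (last sample) scores highest, at 1.0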

@gemini-code-assist bot (Contributor) left a comment

Code Review

The code changes introduce a new group diversity filter. The review focuses on improving performance by batching the text embedding process and vectorizing the score normalization logic. It also suggests using loguru for consistent logging.

embeddings = []
model = get_model(self.model_key, rank, self.use_cuda())

for text in tqdm(texts, desc="Embedding texts", leave=False):

Severity: high

This loop processes texts one by one, which is inefficient for a batched operator. Most embedding models, including Hugging Face sentence-transformers, are optimized for batch processing. Given that this operator processes the entire dataset in a single batch (num_proc=1), this loop can become a significant performance bottleneck.

Consider refactoring this to process texts in batches. For Hugging Face models, you can pass the entire list of texts to model.encode() outside the loop. For API models, check if batching is supported by the underlying API wrapper.

if self.is_hf_model:
    try:
        # Use batch encoding for efficiency with Hugging Face models
        embeddings = model.encode(texts, show_progress_bar=False)
        return np.array(embeddings, dtype=np.float32)
    except Exception as e:
        logger.error(f"Failed to embed texts in batch. Error: {e}. Using zero vectors for all.")
        dim = model.get_sentence_embedding_dimension()
        return np.zeros((len(texts), dim), dtype=np.float32)

except Exception as e:
    dim = model.get_sentence_embedding_dimension() if self.is_hf_model else self.ebd_dim
    embeddings.append(np.zeros(dim, dtype=np.float32))
    print(f"Failed to embed text: '{text}'. Error: {e}. Using zero vector.", file=sys.stderr)

Severity: medium

The code uses print(..., file=sys.stderr) for logging errors. The tests for this OP use loguru.logger. For consistency with the rest of the project, it's better to use loguru.logger.error() here.

logger.error(f"Failed to embed text: '{text}'. Error: {e}. Using zero vector.")

Comment on lines +124 to +130
normalized_scores = []
if range_sim < 1e-8:
    normalized_scores = [0.0] * len(cos_sims)
else:
    for sim in cos_sims:
        normalized_sim = self.norm_ratio * (max_sim - sim) / range_sim
        normalized_scores.append(normalized_sim)

Severity: medium

This loop for calculating normalized_scores can be vectorized using numpy for better performance and readability. This avoids iterating through the similarities one by one in Python.

if range_sim < 1e-8:
    normalized_scores = [0.0] * len(cos_sims)
else:
    cos_sims_np = np.array(cos_sims)
    normalized_scores_np = self.norm_ratio * (max_sim - cos_sims_np) / range_sim
    normalized_scores = normalized_scores_np.tolist()
