34 changes: 17 additions & 17 deletions docs/en/notes/guide/domain_specific_operators/rare_operators.md
@@ -1,6 +1,6 @@
---
title: RARE Operators
createTime: 2025/06/24 11:43:42
createTime: 2025/09/26 11:47:42
permalink: /en/guide/RARE_operators/
---

@@ -20,24 +20,24 @@ The RARE operator workflow systematically generates synthetic data for reasoning

| Name | Application Type | Description | Official Repository or Paper |
| :--- | :--- | :--- | :--- |
| Doc2Query✨ | Question Generation | Generates complex reasoning questions and corresponding scenarios based on original documents. | ReasonIR: Training Retrievers for Reasoning Tasks |
| BM25HardNeg✨ | Hard Negative Mining | Mines hard negative samples that are textually similar but semantically irrelevant to the generated questions to construct challenging retrieval contexts. | ReasonIR: Training Retrievers for Reasoning Tasks |
| ReasonDistill🚀 | Reasoning Process Generation | Combines the question, positive, and negative documents to prompt a large language model to generate a detailed reasoning process, "distilling" its domain thinking patterns. | RARE: Retrieval-Augmented Reasoning Modeling |
| RAREDoc2QueryGenerator✨ | Question Generation | Generates complex reasoning questions and corresponding scenarios based on original documents. | ReasonIR: Training Retrievers for Reasoning Tasks |
| RAREBM25HardNegGenerator✨ | Hard Negative Mining | Mines hard negative samples that are textually similar but semantically irrelevant to the generated questions to construct challenging retrieval contexts. | ReasonIR: Training Retrievers for Reasoning Tasks |
| RAREReasonDistillGenerator🚀 | Reasoning Process Generation | Combines the question, positive, and negative documents to prompt a large language model to generate a detailed reasoning process, "distilling" its domain thinking patterns. | RARE: Retrieval-Augmented Reasoning Modeling |

## Operator Interface Usage Instructions

For operators that require specifying storage paths or calling models, we provide encapsulated **model interfaces** and **storage object interfaces**. You can predefine the model API parameters for an operator as follows:

```python
from dataflow.llmserving import APILLMServing_request
from dataflow.serving.api_llm_serving_request import APILLMServing_request

api_llm_serving = APILLMServing_request(
api_url="your_api_url",
key_name_of_api_key="YOUR_API_KEY",
model_name="model_name",
max_workers=5
)
```

You can predefine the storage parameters for an operator as follows:

```python
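# Minimal sketch of the storage interface; the file path and cache settings below
# are placeholders, adjust them to your own setup.
from dataflow.utils.storage import FileStorage

self.storage = FileStorage(
    first_entry_file_name="your_file_path",
    cache_path="./cache",
    file_name_prefix="dataflow_cache_step",
    cache_type="json",
)
```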
@@ -57,7 +57,7 @@ Regarding parameter passing, the constructor of an operator object primarily rec

## Detailed Operator Descriptions

### 1\. Doc2Query
### 1\. RAREDoc2QueryGenerator

**Functional Description**

@@ -77,9 +77,9 @@ This operator is the first step in the RARE data generation workflow. It utilize
**Usage Example**

```python
from dataflow.operators.generate.RARE import Doc2Query
from dataflow.operators.rare import RAREDoc2QueryGenerator

doc2query_step = Doc2Query(llm_serving=api_llm_serving)
doc2query_step = RAREDoc2QueryGenerator(llm_serving=api_llm_serving)
doc2query_step.run(
storage=self.storage.step(),
input_key="text",
@@ -88,15 +88,15 @@ doc2query_step.run(
)
```

### 2\. BM25HardNeg
### 2\. RAREBM25HardNegGenerator

**Functional Description**

This operator uses the classic BM25 algorithm to retrieve and select the most relevant hard negative samples from the entire document corpus for each "question-positive document" pair. These negative samples are lexically similar to the query but are semantically incorrect or irrelevant answers. The goal is to create a challenging retrieval environment that forces the model to perform finer-grained reasoning and discrimination in subsequent steps.
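
The selection can be pictured with a plain BM25 ranker: score every candidate document against the question and keep the top-scoring documents that are not the positive one. The sketch below only illustrates that idea with the `rank_bm25` package and a toy corpus; the operator itself builds its index with `pyserini`.

```python
from rank_bm25 import BM25Okapi

# Illustrative only: the RARE operator uses pyserini; rank_bm25 keeps this sketch short.
# Toy corpus; index 0 is the positive document for the question below.
corpus = [
    "Treatment of acute gout flares with colchicine and NSAIDs.",
    "Colchicine dosing in patients with renal impairment.",
    "Dietary factors associated with hyperuricemia and gout.",
    "Management of rheumatoid arthritis with methotrexate.",
]
question = "Which drug is preferred for an acute gout flare?"

bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(question.lower().split())

# Rank by BM25 score and keep the best-scoring documents that are NOT the positive one:
# lexically close to the query, yet unable to answer it.
ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
hard_negatives = [corpus[i] for i in ranked if i != 0][:2]
print(hard_negatives)
```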

**Dependency Installation**

The BM25HardNeg operator depends on pyserini, gensim, and JDK. The configuration method for Linux is as follows:
The RAREBM25HardNegGenerator operator depends on pyserini, gensim, and JDK. The configuration method for Linux is as follows:
```Bash
sudo apt install openjdk-21-jdk
pip install pyserini gensim
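# Optional: pyserini needs a JDK at runtime. If it is not picked up automatically,
# point JAVA_HOME at it (the path below assumes Ubuntu's openjdk-21-jdk package).
export JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64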
```
@@ -116,9 +116,9 @@ pip install pyserini gensim
**Usage Example**

```python
from dataflow.operators.generate.RARE import BM25HardNeg
from dataflow.operators.rare import RAREBM25HardNegGenerator

bm25hardneg_step = BM25HardNeg()
bm25hardneg_step = RAREBM25HardNegGenerator()
bm25hardneg_step.run(
storage=self.storage.step(),
input_question_key="question",
@@ -128,11 +128,11 @@ bm25hardneg_step.run(
)
```

### 3\. ReasonDistill
### 3\. RAREReasonDistillGenerator

**Functional Description**

This operator is the core implementation of the RARE paradigm. It integrates the question and scenario generated by `Doc2Query`, the original positive document, and the hard negatives mined by `BM25HardNeg` to construct a complex context. It then prompts a large language model (the teacher model) to generate a detailed, step-by-step reasoning process based on this context. This process aims to "distill" the teacher model's domain thinking patterns and generate data for training a student model, teaching it how to perform contextualized reasoning rather than relying on parameterized knowledge.
This operator is the core implementation of the RARE paradigm. It integrates the question and scenario generated by `RAREDoc2QueryGenerator`, the original positive document, and the hard negatives mined by `RAREBM25HardNegGenerator` to construct a complex context. It then prompts a large language model (the teacher model) to generate a detailed, step-by-step reasoning process based on this context. This process aims to "distill" the teacher model's domain thinking patterns and generate data for training a student model, teaching it how to perform contextualized reasoning rather than relying on parameterized knowledge.
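
As a rough illustration of how such a context can be assembled, the sketch below shuffles the positive document in among the hard negatives before prompting the teacher model. The function name, prompt wording, and shuffling policy are assumptions for illustration, not the operator's actual template.

```python
import random

def build_distill_prompt(question: str, scenario: str, positive_doc: str,
                         hard_negatives: list[str], seed: int = 0) -> str:
    """Assemble a reasoning prompt from mixed relevant and distracting documents."""
    docs = hard_negatives + [positive_doc]
    random.Random(seed).shuffle(docs)  # hide which document actually answers the question
    doc_block = "\n\n".join(f"[Doc {i + 1}]\n{d}" for i, d in enumerate(docs))
    return (
        f"Scenario: {scenario}\n\n"
        f"Documents:\n{doc_block}\n\n"
        f"Question: {question}\n\n"
        "Reason step by step using only the documents above, note which of them "
        "are irrelevant, and then state the final answer."
    )
```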

**Input Parameters**

@@ -149,9 +149,9 @@ This operator is the core implementation of the RARE paradigm. It integrates the
**Usage Example**

```python
from dataflow.operators.generate.RARE import ReasonDistill
from dataflow.operators.rare import RAREReasonDistillGenerator

reasondistill_step = ReasonDistill(llm_serving=api_llm_serving)
reasondistill_step = RAREReasonDistillGenerator(llm_serving=api_llm_serving)
reasondistill_step.run(
storage=self.storage.step(),
input_text_key="text",
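    input_question_key="question",
    input_scenario_key="scenario",
    input_hardneg_key="hard_negatives",
    output_key="reasoning",
)
```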
68 changes: 36 additions & 32 deletions docs/en/notes/guide/pipelines/RAREPipeline.md
@@ -1,7 +1,7 @@
---
title: RARE Data Synthesis Pipeline
icon: game-icons:great-pyramid
createTime: 2025/07/04 15:40:18
createTime: 2025/09/26 11:54:18
permalink: /en/guide/rare_pipeline/
---

@@ -17,7 +17,7 @@ The **RARE (Retrieval-Augmented Reasoning Modeling) Data Synthesis Pipeline** is
This pipeline can generate high-quality, knowledge- and reasoning-intensive training data from a given set of documents, enabling even lightweight models to achieve top-tier performance, potentially surpassing large models like GPT-4 and DeepSeek-R1.

### Dependency Installation
The `BM25HardNeg` operator in `RAREPipeline` depends on `pyserini`, `gensim`, and `JDK`. The configuration method for Linux is as follows:
The `RAREBM25HardNegGenerator` operator in `RAREPipeline` depends on `pyserini`, `gensim`, and `JDK`. The configuration method for Linux is as follows:
```bash
sudo apt install openjdk-21-jdk
pip install pyserini gensim
@@ -44,9 +44,9 @@ self.storage = FileStorage(
)
```

### 2\. Generate Knowledge and Reasoning-Intensive Questions (Doc2Query)
### 2\. Generate Knowledge and Reasoning-Intensive Questions (RAREDoc2QueryGenerator)

The first step in the pipeline is the **`Doc2Query`** operator. It uses an LLM to generate questions and scenarios based on the input documents that require complex reasoning to answer. These questions are designed to be independent of the original document, but the reasoning process required to answer them relies on the knowledge contained within the document.
The first step in the pipeline is the **`RAREDoc2QueryGenerator`** operator. It uses an LLM to generate questions and scenarios based on the input documents that require complex reasoning to answer. These questions are designed to be independent of the original document, but the reasoning process required to answer them relies on the knowledge contained within the document.

**Functionality:**

@@ -64,9 +64,9 @@ self.doc2query_step1.run(
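```python
self.doc2query_step1.run(
    storage=self.storage.step(),
    input_key="text",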
)
```

### 3\. Mine Hard Negative Samples (BM25HardNeg)
### 3\. Mine Hard Negative Samples (RAREBM25HardNegGenerator)

The second step uses the **`BM25HardNeg`** operator. After generating the questions, this step utilizes the BM25 algorithm to retrieve and filter "hard negative samples" for each question from the entire dataset. These negative samples are textually similar to the "correct" document (the positive sample) but cannot be logically used to answer the question, thus increasing the challenge for the model in the subsequent reasoning step.
The second step uses the **`RAREBM25HardNegGenerator`** operator. After generating the questions, this step utilizes the BM25 algorithm to retrieve and filter "hard negative samples" for each question from the entire dataset. These negative samples are textually similar to the "correct" document (the positive sample) but cannot be logically used to answer the question, thus increasing the challenge for the model in the subsequent reasoning step.

**Functionality:**

@@ -85,9 +85,9 @@ self.bm25hardneg_step2.run(
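```python
self.bm25hardneg_step2.run(
    storage=self.storage.step(),
    input_question_key="question",
    input_text_key="text",
    output_negatives_key="hard_negatives",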
)
```

### 4\. Distill the Reasoning Process (ReasonDistill)
### 4\. Distill the Reasoning Process (RAREReasonDistillGenerator)

The final step is the **`ReasonDistill`** operator. It combines the question, scenario, one positive sample, and multiple hard negative samples to construct a complex prompt. It then leverages a powerful "teacher" LLM (like GPT-4o) to generate a detailed, step-by-step reasoning process (Chain-of-Thought) that demonstrates how to use the provided (mixed true and false) information to arrive at the final answer.
The final step is the **`RAREReasonDistillGenerator`** operator. It combines the question, scenario, one positive sample, and multiple hard negative samples to construct a complex prompt. It then leverages a powerful "teacher" LLM (like GPT-4o) to generate a detailed, step-by-step reasoning process (Chain-of-Thought) that demonstrates how to use the provided (mixed true and false) information to arrive at the final answer.

**Functionality:**

@@ -115,56 +115,60 @@ self.reasondistill_step3.run(
Below is the sample code for running the complete `RAREPipeline`. It executes the three steps described above in sequence, progressively transforming the original documents into high-quality training data that includes a question, a scenario, hard negative samples, and a detailed reasoning process.

```python
from dataflow.operators.generate.RARE import (
Doc2Query,
BM25HardNeg,
ReasonDistill,
from dataflow.operators.rare import (
RAREDoc2QueryGenerator,
RAREBM25HardNegGenerator,
RAREReasonDistillGenerator,
)
from dataflow.utils.storage import FileStorage
from dataflow.llmserving import APILLMServing_request, LocalModelLLMServing
from dataflow.serving.api_llm_serving_request import APILLMServing_request
from dataflow.serving.local_model_llm_serving import LocalModelLLMServing_vllm

class RAREPipeline():
def __init__(self):

self.storage = FileStorage(
first_entry_file_name="../example_data/AgenticRAGPipeline/pipeline_small_chunk.json",
first_entry_file_name="./dataflow/example/RAREPipeline/pipeline_small_chunk.json",
cache_path="./cache_local",
file_name_prefix="dataflow_cache_step",
cache_type="json",
)

# Use an API server as the LLM service
# Use an API server as the LLM service; you can switch to `LocalModelLLMServing_vllm` to use a local model.
llm_serving = APILLMServing_request(
api_url="https://api.openai.com/v1/chat/completions",
key_name_of_api_key="OPENAI_API_KEY",
model_name="gpt-4o",
max_workers=1
)

self.doc2query_step1 = Doc2Query(llm_serving)
self.bm25hardneg_step2 = BM25HardNeg()
self.reasondistill_step3 = ReasonDistill(llm_serving)
self.doc2query_step1 = RAREDoc2QueryGenerator(llm_serving)
self.bm25hardneg_step2 = RAREBM25HardNegGenerator()
self.reasondistill_step3 = RAREReasonDistillGenerator(llm_serving)

def forward(self):

self.doc2query_step1.run(
storage=self.storage.step(),
input_key="text",
storage = self.storage.step(),
input_key = "text",
)

self.bm25hardneg_step2.run(
storage=self.storage.step(),
input_question_key="question",
input_text_key="text",
output_negatives_key="hard_negatives",
storage = self.storage.step(),
input_question_key = "question",
input_text_key = "text",
output_negatives_key = "hard_negatives",
)

self.reasondistill_step3.run(
storage=self.storage.step(),
input_text_key="text",
input_question_key="question",
input_scenario_key="scenario",
input_hardneg_key="hard_negatives",
output_key="reasoning",
storage= self.storage.step(),
input_text_key = "text",
input_question_key = "question",
input_scenario_key = "scenario",
input_hardneg_key = "hard_negatives",
output_key= "reasoning",
)

if __name__ == "__main__":
model = RAREPipeline()
model.forward()
```