diff --git a/docs/en/notes/guide/domain_specific_operators/rare_operators.md b/docs/en/notes/guide/domain_specific_operators/rare_operators.md
index 1f47740fc..0ac082a70 100644
--- a/docs/en/notes/guide/domain_specific_operators/rare_operators.md
+++ b/docs/en/notes/guide/domain_specific_operators/rare_operators.md
@@ -1,6 +1,6 @@
 ---
 title: RARE Operators
-createTime: 2025/06/24 11:43:42
+createTime: 2025/09/26 11:47:42
 permalink: /en/guide/RARE_operators/
 ---
 
@@ -20,24 +20,24 @@ The RARE operator workflow systematically generates synthetic data for reasoning
 
 | Name | Application Type | Description | Official Repository or Paper |
 | :--- | :--- | :--- | :--- |
-| Doc2Query✨ | Question Generation | Generates complex reasoning questions and corresponding scenarios based on original documents. | ReasonIR: Training Retrievers for Reasoning Tasks |
-| BM25HardNeg✨ | Hard Negative Mining | Mines hard negative samples that are textually similar but semantically irrelevant to the generated questions to construct challenging retrieval contexts. | ReasonIR: Training Retrievers for Reasoning Tasks |
-| ReasonDistill🚀 | Reasoning Process Generation | Combines the question, positive, and negative documents to prompt a large language model to generate a detailed reasoning process, "distilling" its domain thinking patterns. | RARE: Retrieval-Augmented Reasoning Modeling |
+| RAREDoc2QueryGenerator✨ | Question Generation | Generates complex reasoning questions and corresponding scenarios based on original documents. | ReasonIR: Training Retrievers for Reasoning Tasks |
+| RAREBM25HardNegGenerator✨ | Hard Negative Mining | Mines hard negative samples that are textually similar but semantically irrelevant to the generated questions to construct challenging retrieval contexts. | ReasonIR: Training Retrievers for Reasoning Tasks |
+| RAREReasonDistillGenerator🚀 | Reasoning Process Generation | Combines the question, positive, and negative documents to prompt a large language model to generate a detailed reasoning process, "distilling" its domain thinking patterns. | RARE: Retrieval-Augmented Reasoning Modeling |
 
 ## Operator Interface Usage Instructions
 
 For operators that require specifying storage paths or calling models, we provide encapsulated **model interfaces** and **storage object interfaces**. You can predefine the model API parameters for an operator as follows:
 
 ```python
-from dataflow.llmserving import APILLMServing_request
+from dataflow.serving.api_llm_serving_request import APILLMServing_request
 
 api_llm_serving = APILLMServing_request(
         api_url="your_api_url",
+        key_name_of_api_key="YOUR_API_KEY",
         model_name="model_name",
         max_workers=5
 )
 ```
-
 You can predefine the storage parameters for an operator as follows:
 
 ```python
@@ -57,7 +57,7 @@ Regarding parameter passing, the constructor of an operator object primarily rec
 
 ## Detailed Operator Descriptions
 
-### 1\. Doc2Query
+### 1\. RAREDoc2QueryGenerator
 
 **Functional Description**
 
@@ -77,9 +77,9 @@ This operator is the first step in the RARE data generation workflow. It utilize
 **Usage Example**
 
 ```python
-from dataflow.operators.generate.RARE import Doc2Query
+from dataflow.operators.rare import RAREDoc2QueryGenerator
 
-doc2query_step = Doc2Query(llm_serving=api_llm_serving)
+doc2query_step = RAREDoc2QueryGenerator(llm_serving=api_llm_serving)
 doc2query_step.run(
     storage=self.storage.step(),
     input_key="text",
@@ -88,7 +88,7 @@ doc2query_step.run(
 )
 ```
 
-### 2\. BM25HardNeg
+### 2\. RAREBM25HardNegGenerator
 
 **Functional Description**
 
@@ -96,7 +96,7 @@ This operator uses the classic BM25 algorithm to retrieve and select the most re
 
 **Dependency Installation**
 
-The BM25HardNeg operator depends on pyserini, gensim, and JDK. The configuration method for Linux is as follows:
+The RAREBM25HardNegGenerator operator depends on pyserini, gensim, and JDK. The configuration method for Linux is as follows:
 ```Bash
 sudo apt install openjdk-21-jdk
 pip install pyserini gensim
@@ -116,9 +116,9 @@ pip install pyserini gensim
 **Usage Example**
 
 ```python
-from dataflow.operators.generate.RARE import BM25HardNeg
+from dataflow.operators.rare import RAREBM25HardNegGenerator
 
-bm25hardneg_step = BM25HardNeg()
+bm25hardneg_step = RAREBM25HardNegGenerator()
 bm25hardneg_step.run(
     storage=self.storage.step(),
     input_question_key="question",
@@ -128,11 +128,11 @@ bm25hardneg_step.run(
 )
 ```
 
-### 3\. ReasonDistill
+### 3\. RAREReasonDistillGenerator
 
 **Functional Description**
 
-This operator is the core implementation of the RARE paradigm. It integrates the question and scenario generated by `Doc2Query`, the original positive document, and the hard negatives mined by `BM25HardNeg` to construct a complex context. It then prompts a large language model (the teacher model) to generate a detailed, step-by-step reasoning process based on this context. This process aims to "distill" the teacher model's domain thinking patterns and generate data for training a student model, teaching it how to perform contextualized reasoning rather than relying on parameterized knowledge.
+This operator is the core implementation of the RARE paradigm. It integrates the question and scenario generated by `RAREDoc2QueryGenerator`, the original positive document, and the hard negatives mined by `RAREBM25HardNegGenerator` to construct a complex context. It then prompts a large language model (the teacher model) to generate a detailed, step-by-step reasoning process based on this context. This process aims to "distill" the teacher model's domain thinking patterns and generate data for training a student model, teaching it how to perform contextualized reasoning rather than relying on parameterized knowledge.
 
 **Input Parameters**
 
@@ -149,9 +149,9 @@ This operator is the core implementation of the RARE paradigm. It integrates the
 **Usage Example**
 
 ```python
-from dataflow.operators.generate.RARE import ReasonDistill
+from dataflow.operators.rare import RAREReasonDistillGenerator
 
-reasondistill_step = ReasonDistill(llm_serving=api_llm_serving)
+reasondistill_step = RAREReasonDistillGenerator(llm_serving=api_llm_serving)
 reasondistill_step.run(
     storage=self.storage.step(),
     input_text_key="text",
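
To make the renamed hard-negative mining step easier to follow, the sketch below illustrates the idea `RAREBM25HardNegGenerator` implements: score every candidate document against each generated question with BM25 and keep the top-scoring documents that are not the question's own source text. It uses the lightweight `rank_bm25` package purely as a stand-in for the pyserini/Lucene backend the operator actually depends on; the function and variable names are illustrative, not the operator's internals.

```python
# Illustrative sketch only: approximates the behavior described for
# RAREBM25HardNegGenerator, with rank_bm25 standing in for pyserini/Lucene.
from rank_bm25 import BM25Okapi

def mine_hard_negatives(questions, texts, num_neg=3):
    """For each (question, source text) pair, return documents that score
    highly under BM25 but are not the true source: the 'hard negatives'."""
    tokenized_corpus = [text.lower().split() for text in texts]
    bm25 = BM25Okapi(tokenized_corpus)
    all_negatives = []
    for question, positive in zip(questions, texts):
        scores = bm25.get_scores(question.lower().split())
        # Rank every document by its BM25 relevance to the question.
        ranked = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
        # Skip the positive document itself and keep the next best matches.
        negatives = [texts[i] for i in ranked if texts[i] != positive][:num_neg]
        all_negatives.append(negatives)
    return all_negatives
```

In the operator itself, these fields are addressed through the `input_question_key`, `input_text_key`, and `output_negatives_key` arguments shown in the usage example above.
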
diff --git a/docs/en/notes/guide/pipelines/RAREPipeline.md b/docs/en/notes/guide/pipelines/RAREPipeline.md
index 05553cd50..81dba7371 100644
--- a/docs/en/notes/guide/pipelines/RAREPipeline.md
+++ b/docs/en/notes/guide/pipelines/RAREPipeline.md
@@ -1,7 +1,7 @@
 ---
 title: RARE Data Synthesis Pipeline
 icon: game-icons:great-pyramid
-createTime: 2025/07/04 15:40:18
+createTime: 2025/09/26 11:54:18
 permalink: /en/guide/rare_pipeline/
 ---
 
@@ -17,7 +17,7 @@ The **RARE (Retrieval-Augmented Reasoning Modeling) Data Synthesis Pipeline** is
 This pipeline can generate high-quality, knowledge- and reasoning-intensive training data from a given set of documents, enabling even lightweight models to achieve top-tier performance, potentially surpassing large models like GPT-4 and DeepSeek-R1.
 
 ### Dependency Installation
-The `BM25HardNeg` operator in `RAREPipeline` depends on `pyserini`, `gensim`, and `JDK`. The configuration method for Linux is as follows:
+The `RAREBM25HardNegGenerator` operator in `RAREPipeline` depends on `pyserini`, `gensim`, and `JDK`. The configuration method for Linux is as follows:
 ```bash
 sudo apt install openjdk-21-jdk
 pip install pyserini gensim
@@ -44,9 +44,9 @@ self.storage = FileStorage(
 )
 ```
 
-### 2\. Generate Knowledge and Reasoning-Intensive Questions (Doc2Query)
+### 2\. Generate Knowledge and Reasoning-Intensive Questions (RAREDoc2QueryGenerator)
 
-The first step in the pipeline is the **`Doc2Query`** operator. It uses an LLM to generate questions and scenarios based on the input documents that require complex reasoning to answer. These questions are designed to be independent of the original document, but the reasoning process required to answer them relies on the knowledge contained within the document.
+The first step in the pipeline is the **`RAREDoc2QueryGenerator`** operator. It uses an LLM to generate questions and scenarios based on the input documents that require complex reasoning to answer. These questions are designed to be independent of the original document, but the reasoning process required to answer them relies on the knowledge contained within the document.
 
 **Functionality:**
 
@@ -64,9 +64,9 @@ self.doc2query_step1.run(
 )
 ```
 
-### 3\. Mine Hard Negative Samples (BM25HardNeg)
+### 3\. Mine Hard Negative Samples (RAREBM25HardNegGenerator)
 
-The second step uses the **`BM25HardNeg`** operator. After generating the questions, this step utilizes the BM25 algorithm to retrieve and filter "hard negative samples" for each question from the entire dataset. These negative samples are textually similar to the "correct" document (the positive sample) but cannot be logically used to answer the question, thus increasing the challenge for the model in the subsequent reasoning step.
+The second step uses the **`RAREBM25HardNegGenerator`** operator. After generating the questions, this step utilizes the BM25 algorithm to retrieve and filter "hard negative samples" for each question from the entire dataset. These negative samples are textually similar to the "correct" document (the positive sample) but cannot be logically used to answer the question, thus increasing the challenge for the model in the subsequent reasoning step.
 
 **Functionality:**
 
@@ -85,9 +85,9 @@ self.bm25hardneg_step2.run(
 )
 ```
 
-### 4\. Distill the Reasoning Process (ReasonDistill)
+### 4\. Distill the Reasoning Process (RAREReasonDistillGenerator)
 
-The final step is the **`ReasonDistill`** operator. It combines the question, scenario, one positive sample, and multiple hard negative samples to construct a complex prompt. It then leverages a powerful "teacher" LLM (like GPT-4o) to generate a detailed, step-by-step reasoning process (Chain-of-Thought) that demonstrates how to use the provided (mixed true and false) information to arrive at the final answer.
+The final step is the **`RAREReasonDistillGenerator`** operator. It combines the question, scenario, one positive sample, and multiple hard negative samples to construct a complex prompt. It then leverages a powerful "teacher" LLM (like GPT-4o) to generate a detailed, step-by-step reasoning process (Chain-of-Thought) that demonstrates how to use the provided (mixed true and false) information to arrive at the final answer.
 
 **Functionality:**
 
@@ -115,56 +115,60 @@
 Below is the sample code for running the complete `RAREPipeline`. It executes the three steps described above in sequence, progressively transforming the original documents into high-quality training data that includes a question, a scenario, hard negative samples, and a detailed reasoning process.
 
 ```python
-from dataflow.operators.generate.RARE import (
-    Doc2Query,
-    BM25HardNeg,
-    ReasonDistill,
+from dataflow.operators.rare import (
+    RAREDoc2QueryGenerator,
+    RAREBM25HardNegGenerator,
+    RAREReasonDistillGenerator,
 )
 from dataflow.utils.storage import FileStorage
-from dataflow.llmserving import APILLMServing_request, LocalModelLLMServing
+from dataflow.serving.api_llm_serving_request import APILLMServing_request
+from dataflow.serving.local_model_llm_serving import LocalModelLLMServing_vllm
 
 class RAREPipeline():
     def __init__(self):
+
         self.storage = FileStorage(
-            first_entry_file_name="../example_data/AgenticRAGPipeline/pipeline_small_chunk.json",
+            first_entry_file_name="./dataflow/example/RAREPipeline/pipeline_small_chunk.json",
             cache_path="./cache_local",
             file_name_prefix="dataflow_cache_step",
             cache_type="json",
         )
 
-        # Use an API server as the LLM service
+        # Using an API server as the LLM service, you can switch to `LocalModelLLMServing_vllm` to use a local model.
         llm_serving = APILLMServing_request(
             api_url="https://api.openai.com/v1/chat/completions",
+            key_name_of_api_key="OPENAI_API_KEY",
             model_name="gpt-4o",
             max_workers=1
         )
 
-        self.doc2query_step1 = Doc2Query(llm_serving)
-        self.bm25hardneg_step2 = BM25HardNeg()
-        self.reasondistill_step3 = ReasonDistill(llm_serving)
-
+        self.doc2query_step1 = RAREDoc2QueryGenerator(llm_serving)
+        self.bm25hardneg_step2 = RAREBM25HardNegGenerator()
+        self.reasondistill_step3 = RAREReasonDistillGenerator(llm_serving)
+
     def forward(self):
+
         self.doc2query_step1.run(
-            storage=self.storage.step(),
-            input_key="text",
+            storage = self.storage.step(),
+            input_key = "text",
         )
         self.bm25hardneg_step2.run(
-            storage=self.storage.step(),
-            input_question_key="question",
-            input_text_key="text",
-            output_negatives_key="hard_negatives",
+            storage = self.storage.step(),
+            input_question_key = "question",
+            input_text_key = "text",
+            output_negatives_key = "hard_negatives",
         )
         self.reasondistill_step3.run(
-            storage=self.storage.step(),
-            input_text_key="text",
-            input_question_key="question",
-            input_scenario_key="scenario",
-            input_hardneg_key="hard_negatives",
-            output_key="reasoning",
+            storage= self.storage.step(),
+            input_text_key = "text",
+            input_question_key = "question",
+            input_scenario_key = "scenario",
+            input_hardneg_key = "hard_negatives",
+            output_key= "reasoning",
         )
-
+
 if __name__ == "__main__":
     model = RAREPipeline()
     model.forward()
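
As a quick orientation for the pipeline above, the sketch below shows how a single record accumulates fields across the three steps, using the same key names passed to the `run(...)` calls. The values are placeholders; in practice `FileStorage` reads and writes these records through the JSON cache, and nothing is assembled by hand.

```python
# Placeholder values only: the keys mirror the *_key arguments used in
# RAREPipeline above, while FileStorage manages the actual cached records.
record = {"text": "a source document chunk"}

# After RAREDoc2QueryGenerator: a reasoning-intensive question and its scenario.
record.update({"question": "generated question", "scenario": "generated scenario"})

# After RAREBM25HardNegGenerator: lexically similar but logically unusable documents.
record.update({"hard_negatives": ["negative doc 1", "negative doc 2", "negative doc 3"]})

# After RAREReasonDistillGenerator: the distilled step-by-step reasoning trace.
record.update({"reasoning": "teacher model chain-of-thought"})
```
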
diff --git a/docs/zh/notes/guide/domain_specific_operators/rare_operators.md b/docs/zh/notes/guide/domain_specific_operators/rare_operators.md
index e7eca0aab..994464292 100644
--- a/docs/zh/notes/guide/domain_specific_operators/rare_operators.md
+++ b/docs/zh/notes/guide/domain_specific_operators/rare_operators.md
@@ -1,6 +1,6 @@
 ---
 title: RARE算子
-createTime: 2025/06/24 11:43:42
+createTime: 2025/09/26 11:44:42
 permalink: /zh/guide/RARE_operators/
 ---
 
@@ -28,19 +28,19 @@ RARE 算子流程通过三个核心步骤，系统性地生成用于推理能力