Releases: OpenDCAI/DataFlow
Dataflow v1.0.6 Release Note
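🚀 DataFlow v1.0.6 Changelog
🔑 Major Feature Updates
- Prompt Registration System
  Introduced a unified Prompt Registry that lets each operator bind multiple prompt templates. This one-to-many structured registration mechanism makes prompts easier to reuse and extend across task scenarios. Thanks to @SunnyHaze.
- New Code Processing Pipeline
  Added a complete code processing pipeline and related operators that support analysis, filtering, and quality cleaning of code datasets, for code intelligence and data cleaning tasks. Thanks to @beccabai.
- Reasoning Pipeline Award
  DataFlow's Reasoning Pipeline won first place in the BAAI LIC Reasoning Competition, validating the system's robustness in logical reasoning and dataflow scheduling. Thanks to @miaode74 and the math reasoning team: @scuuy, @wongzhenhao, @HeRunming, and @haolpku.
- Automated PDF2Model
  Added a PDF-to-Model module that automatically converts input PDFs or datasets into structured QA data and trains a downstream model with LlamaFactory, covering the path from documents to model data end to end. Thanks to @YalinFeng01 and @ZhaoyangHan04.
- Automatic Benchmark Evaluation
  Added the DataFlow Eval module, which supports automatic evaluation of text benchmarks (e.g., string matching and semantic matching) inside a pipeline. Thanks to @YalinFeng01.
- Text2SQL Pipeline with Unified Database Management
  Reworked the Text2SQL pipeline and added a DB Manager with unified support for MySQL, SQLite, and other database types, along with better prompt template management and operator reusability. Thanks to @TechNomad-ds.
- JSON Schema Structured Output
  `LLMServing` and `LiteLLMServing` now support JSON Schema output and can generate structured responses directly, improving compatibility with multimodal tasks. Thanks to @wongzhenhao.
- Book Structured QA Extraction Pipeline
  Added a BookQA extraction pipeline and related operators that automatically extract structured question-answer data from books and long texts. Thanks to @HeRunming.
- Science Operators
  Added Science operators for processing scientific research and multimodal datasets. Thanks to @haolpku.
- Colored Logger
  Upgraded the logging system to colored output for easier debugging and monitoring. Thanks to @MOLYHECI.
- Official Tutorial Videos
  Released a new Bilibili tutorial series covering DataFlow's core concepts, workflows, and hands-on examples.
  🔗 Watch the tutorials. Thanks to @Qmeiyi.
🧩 Important Improvements
- Added prompt registration and automatic validation (@SunnyHaze)
- Added structured output support for VLLM Serving (@wongzhenhao)
- Enhanced pipeline compile-time checks (@SunnyHaze)
- Improved PDF2Model and automatic benchmark evaluation (@YalinFeng01)
- Released the official tutorial series (@Qmeiyi)
- Agent Refactor Preview
  The DataFlow Agent module is being fully refactored and will migrate to a LangGraph architecture for more efficient multi-agent management and task orchestration.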
🚀 DataFlow v1.0.6 Key Feature Updates
- Prompt Registration System
  Introduced a unified Prompt Registry that supports one-to-many prompt bindings per operator, allowing flexible task adaptation and consistent structure. Thanks to @SunnyHaze.
- New Code Processing Pipeline
  Added a comprehensive code pipeline and related operators for analyzing, filtering, and processing code datasets. Thanks to @beccabai.
- Reasoning Pipeline Achievements
  The Reasoning pipeline achieved 1st place in the BAAI LIC Reasoning Competition, validating DataFlow’s reasoning robustness and system scalability. Thanks to @miaode74, @scuuy, @wongzhenhao, @HeRunming, and @haolpku.
- Automatic PDF2Model Functionality
  Added an automated PDF-to-Model module that converts PDF documents or datasets into structured QA pairs, enabling downstream model training with LlamaFactory. Thanks to @YalinFeng01 and @ZhaoyangHan04.
- Automatic Benchmark Evaluation
  Introduced the DataFlow Eval module for automatic text benchmark evaluation (e.g., string match and semantic match). Thanks to @YalinFeng01.
- Text2SQL Pipeline with Unified DB Manager
  Refactored the Text2SQL pipeline with a new DB Manager supporting MySQL, SQLite, and more. Enhanced prompt modularity and operator reuse. Thanks to @TechNomad-ds.
- JSON Schema Structural Output
  `LLMServing` and `LiteLLMServing` now support JSON Schema structured outputs, allowing models to produce well-formed structured results. Thanks to @wongzhenhao.
- Structured QA Extraction from Books
  Added a BookQA Extraction Pipeline to automatically extract structured QA pairs from book-style documents. Thanks to @HeRunming.
- Science Operators Added
  Introduced Science operators for scientific and multimodal data processing. Thanks to @haolpku.
- Colorful and Informative Logger
  Enhanced logging with a colorful output format for better readability and debugging. Thanks to @MOLYHECI.
- New Tutorial Series
  Released a Bilibili tutorial series introducing key DataFlow concepts and practical demos.
  🎥 Watch here. Thanks to @Qmeiyi.
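To make the one-to-many binding in the Prompt Registration System entry more concrete, here is a minimal sketch of how such a registry could look. It is not DataFlow's actual implementation: `register_prompts`, `allowed_prompts`, `AnswerGenerator`, and both prompt classes are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical registry: operator class name -> prompt classes bound to it.
_PROMPT_REGISTRY: Dict[str, List[type]] = defaultdict(list)


def register_prompts(*prompt_classes: type) -> Callable[[type], type]:
    """Class decorator that binds one operator to several prompt templates."""
    def decorator(operator_cls: type) -> type:
        _PROMPT_REGISTRY[operator_cls.__name__].extend(prompt_classes)
        return operator_cls
    return decorator


def allowed_prompts(operator_cls: type) -> List[type]:
    """Return the prompt classes registered for an operator."""
    return list(_PROMPT_REGISTRY[operator_cls.__name__])


class MathAnswerPrompt:
    template = "Solve the problem step by step:\n{question}"


class GeneralAnswerPrompt:
    template = "Answer the question concisely:\n{question}"


@register_prompts(MathAnswerPrompt, GeneralAnswerPrompt)
class AnswerGenerator:
    """Illustrative operator that accepts only prompts registered for it."""

    def __init__(self, prompt_cls: type = MathAnswerPrompt) -> None:
        if prompt_cls not in allowed_prompts(type(self)):
            raise ValueError(
                f"{prompt_cls.__name__} is not registered for {type(self).__name__}"
            )
        self.prompt_cls = prompt_cls

    def build_prompt(self, question: str) -> str:
        return self.prompt_cls.template.format(question=question)
```

Rejecting unregistered prompts at construction time is one way to get the validation behaviour the notes mention; the real check may happen elsewhere.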
🧩 Notable Improvements
- Added prompt registration and validation – @SunnyHaze
- Added structured output support for VLLM Serving – @wongzhenhao
- Enhanced pipeline compilation checks – @SunnyHaze
- Improved PDF2Model and benchmark evaluation – @YalinFeng01
- Added official tutorial series – @Qmeiyi
- Agent Refactor Announcement
The DataFlow Agent is undergoing a major refactor and will soon migrate to a LangGraph-based architecture, supporting advanced multi-agent orchestration.
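For readers unfamiliar with JSON Schema structured output, the sketch below shows what a schema for a QA record might look like and how a raw model response could be checked against it. It deliberately does not call `LLMServing` or VLLM Serving, since their request parameters are not described in these notes. `QA_SCHEMA` and `check_against_schema` are hypothetical, and the checker covers only a small subset of JSON Schema (`required` and flat property types).

```python
import json
from typing import Any, Dict, List

# Hypothetical schema for one structured QA record.
QA_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "question": {"type": "string"},
        "answer": {"type": "string"},
        "score": {"type": "number"},
    },
    "required": ["question", "answer"],
}

_JSON_TYPES = {"string": str, "boolean": bool, "object": dict, "array": list}


def _matches(value: Any, type_name: str) -> bool:
    """Check a value against one JSON Schema primitive type name."""
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return isinstance(value, _JSON_TYPES[type_name])


def check_against_schema(raw: str, schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the response fits."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = [
        f"missing required field: {key}"
        for key in schema.get("required", [])
        if key not in data
    ]
    for key, spec in schema.get("properties", {}).items():
        if key in data and not _matches(data[key], spec["type"]):
            problems.append(f"field {key!r} is not of type {spec['type']}")
    return problems
```

When the serving backend enforces the schema during generation, a check like this becomes a safety net rather than the main line of defence.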
What's Changed
- [webui] debug for WebUI, revise func name 'type' to 'serving_type' by @SunnyHaze in #186
- [Debug] Fix API bug with adding a button to write env value DF_API_KEY by @HeRunming in #187
- [WebUI] Add pdf knowledge base clean WebUI by @HeRunming in #189
- unify _api_chat usage by @MOLYHECI in #190
- Add a retry mechanism for failed llm_serving requests && fix None appearing in cleaned results when an llm_serving call fails, causing TypeError: argument of t… by @xyxhchb in #188
- [debug] fix dir not exist by @HeRunming in #192
- [webui] debug import error by @SunnyHaze in #193
- add adp and update gradio by @Qmeiyi in #194
- Fix the execution classifier operator in the text2sql pipeline by @TechNomad-ds in #197
- Dataflow agent Console bug fix by @DeepThinkingZhouLiu in #199
- move non-key params to init function by @ZhaoyangHan04 in #200
- [Update] unified i/o keys with input/output_* format by @wongzhenhao in #201
- migrate from DataFlow421 by @yuwenkai2003 in #202
- [Pipeline] add Automatic Speech Recognition module and corresponding pipeline. by @gty1829 in #207
- [issue temp] update issue template; and add `sglang` & `mineru` to `dataflow env` by @SunnyHaze in #208
- Bug Fix by @DeepThinkingZhouLiu in #210
- [Compiled Pipeline] Add naive logic of Compiled pipeline for pre-check of key logic & Serving management. by @MOLYHECI in #191
- [compile] Added gradient color transition from step=0 to step=n in th… by @SunnyHaze in #213
- [Agent] significant reduce debug time when writing pipeline with `dataflow agent` by calling pipeline.complie() by @DeepThinkingZhouLiu in #214
- [compile] Report all KeyErrors in a single, consolidated compilation … by @SunnyHaze in #215
- [Agent] fix prompt_template issue for Dataflow agent when autorun by @DeepThinkingZhouLiu in #216
- add encoding check during write storage by @ZhaoyangHan04 in #217
- rewrite LALMServing by @gty1829 in #219
- add get_desc functions for new ops by @scuuy in #224
- fix bug in diy prompt by @scuuy in #225
- Add core_text and chemistry smiles extraction pipeline by @haolpku in #226
- [refactor] Rename operators and revise op structure at 2025-08-21 by @SunnyHaze in #227
- update text2sql pipeline, reconstruct prompt template by @TechNomad-ds in #230
- [debug] fix #231, Prompted Generator issue after #227 by @haolpku in #232
- Add material pipeline and pairwise prompted generator by @haolpku in #234
- fix serving name and fix import LocalModelLLMServing bug by @haolpku in #235
- add atomic operation by @Fengzhongzhihan in #236
- fix chunk logic when length of tokens greater than model max token size by @CheinTian in #239
- fix chemical pipeline output schema bug by @ZhaoyangHan04 in #240
- update response format for chemistry pipelines by @haolpku in #243
- [PDF2model/text2model] dataflow PDF2model/text2model function added to dataflow cli by @dataflow-fyl in #242
- Dataflow agent SH by @DeepThinkingZhouLiu in #241
- fix api serving by @haolpku in #245
- Update the Text2SQL pipeline, refactored the database manager to support better database extensibility; manage prompts through prompt template classes to improve operator reusability. by @TechNomad-ds in #244
- [README] Add a documentation link for the pipeline in the README file. fix #221 by @miaode74 in #250
- add bench eval pipeline (string match and semantic match) by @scuuy in #238
- rename kbc ops and prompt class by @ZhaoyangHan04 in #249
- [Refactor] AgenticRAG pipeline & Doc2QA pipeline & KCenterGreedy by @wongzhenhao in #246
- [refactor] moving example file to right path by @wongzhenhao in #253
- divide general_text operators into core_text, general_text, text_pt, text_sft by @moly...
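The compile-related entries in this list (#191, #215) describe pre-checking key logic and reporting all KeyErrors in one consolidated error. The sketch below illustrates that idea under simple assumptions: each step declares the keys it reads and writes, and the check walks the steps without running them. `Step` and `check_pipeline_keys` are hypothetical names, not DataFlow's API.

```python
from typing import List, Set, Tuple

# Hypothetical step description: (step name, keys it reads, keys it writes).
Step = Tuple[str, List[str], List[str]]


def check_pipeline_keys(initial_keys: Set[str], steps: List[Step]) -> None:
    """Dry-run the key flow and raise one KeyError listing every missing key."""
    available = set(initial_keys)
    errors: List[str] = []
    for name, inputs, outputs in steps:
        for key in inputs:
            if key not in available:
                errors.append(f"step '{name}' reads missing key '{key}'")
        # Outputs become available to later steps even if this step had errors,
        # so one mistake does not hide or multiply the ones after it.
        available.update(outputs)
    if errors:
        raise KeyError("; ".join(errors))
```

Collecting every problem before raising means a user can fix all key mismatches in one pass instead of rerunning the pipeline after each failure.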
Dataflow v1.0.5 Release Note
DataFlow v1.0.5 Key Feature Updates
- Add General Reasoning Pipeline: Add a new pipeline to support general reasoning data and DIY prompts, fix some bugs, and rework some reasoning ops by @scuuy in #137
- Add Batch Wrapper: Upload batch_wrapper for batching an operator in a pipeline by @SunnyHaze in #157
- Pandas Operator Release: Release GeneralFilter for pandas by @wongzhenhao in #170
- Add Multiturn Function Call Operators: Add example data for FuncCallPipeline and rename MultiTurnDialogueGenerator by @MOLYHECI in #136
- Add Math Problem Extractor: Add VQAServing and add mathbook_promblem_extractor to the KBC Pipeline by @HeRunming in #152
- Refine General Text Operators: Customizable prompt for SFT generators by @zzy1127 in #139
- Fix Local Serving Bug: Fix Local Model Serving, apply chat_template to sys & user prompt by @haolpku in #158
- Speed Up Text2SQL Pipeline: Reconstruct the database manager to improve the efficiency of the text2sql pipeline by @TechNomad-ds in #174
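The Batch Wrapper entry describes running an operator over its input in batches. A minimal sketch of that pattern follows, assuming an operator can be modelled as a function from a list of rows to a list of rows. This `batch_wrapper` is a hypothetical stand-in and does not reproduce the signature of DataFlow's own wrapper.

```python
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


def batch_wrapper(run: Callable[[Rows], Rows], batch_size: int) -> Callable[[Rows], Rows]:
    """Wrap a row-processing function so it sees at most batch_size rows per call."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    def run_in_batches(rows: Rows) -> Rows:
        results: Rows = []
        for start in range(0, len(rows), batch_size):
            # Slicing past the end is safe, so the last batch may be shorter.
            results.extend(run(rows[start:start + batch_size]))
        return results

    return run_in_batches
```

Batching like this bounds memory use and request size per call, which matters most when the wrapped operator sends each batch to an LLM backend.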
Notable Changes
- Add Dataflow WebUI: Add Gradio WebUI for all operators by @HeRunming in #169
- Add Dataflow-Agent WebUI: Add agent gradio UI by @DeepMindLiuZhou in #175
- Add MinerU for KBCPipeline: @Niujunbo2002 add MinerU2.0 in #132 and support for fetching arxiv pdf links by @ZhaoyangHan04 in #171
- Add Sglang Support: Add `tensor_parallel` and `data_parallel` to `LocalLLMServing_sglang` by @SunnyHaze in #147
What's Changed
- add get_desc for all general text operators by @zzy1127 in #133
- [Feature] GeneralFilter for GeneralText release! by @wongzhenhao in #135
- fix problem by @YqjMartin in #138
- add example data for FuncCallPipeline & rename MultiTurnDialogueGenerator by @MOLYHECI in #136
- add examples to get_desc by @ZhaoyangHan04 in #134
- SFT generator with customizable prompt by @zzy1127 in #139
- (new) add new pipeline to support general reasoning data and diy prompt, and fix some bugs, reform some reasoning ops by @scuuy in #137
- Support MinerU2 for KnowledgeCleaning by @Niujunbo2002 in #132
- [serving] set default `vllm_seed` param for `LocalModelLLMServing_vllm` to `None` to avoid warning by @SunnyHaze in #143
- Fix GPU reasoning pipeline bug by @scuuy in #145
- refine the get_desc func for each operator for text2sql pipeline by @TechNomad-ds in #142
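- [Serving] Add `tensor_parallel` and `data_parallel` to `LocalLLMServing_sglang` by @SunnyHaze in #147
- Add get_desc function to all text operators by @scuuy in #146
- Fix storage column parsing error, expand the data field into the dataframe, adjust the version, and fix a bug in the AnswerNgramFilter operator by @leaderwolfpipi in #115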
- add medical pipeline, generated by agent by @DeepMindLiuZhou in #148
- [serving] add sglang for all scripts for option by @SunnyHaze in #150
- implement kbc batch process operators and pipeline by @ZhaoyangHan04 in #151
- [Serving, KBC]Add VQAServing, Add mathbook_promblem_extractor to KBC Pipeline. by @HeRunming in #152
- fix bug for RemoveEmojiRefiner by @zzy1127 in #153
- fix bugs in batch_kbc by @ZhaoyangHan04 in #156
- [batch_wrapper] upload batch_wrapper for batching a operator in a pipeline. by @SunnyHaze in #157
- [Serving] Fix Local Model Serving, apply chat_template to sys & user prompt by @haolpku in #158
- add API-based languagefilter & customized MetaScorer by @MOLYHECI in #161
- fix quickstart bug by @haolpku in #162
- Unify the embedding attribute name; adjust the SQLVariationGenerator operator's fill logic to append results to the original data by @leaderwolfpipi in #160
- add publications by @Qmeiyi in #163
- Fix forward-compatibility issues in other operators of the reasoning pipeline by @leaderwolfpipi in #165
- add desc for func call & add statics for meta score by @MOLYHECI in #168
- [webui] Add Gradio WebUI for experience all operators. by @HeRunming in #169
- [Feature] PandasOperator release! [Update] GeneralFilter updated by @wongzhenhao in #170
- Support for fetching arxiv pdf links by @ZhaoyangHan04 in #171
- add new reasoning operator “answer_model_judge” , to check reference answer via llm by @scuuy in #172
- [WebUI] Add API Pipeline UI by @HeRunming in #173
- Add agent gradio UI by @DeepMindLiuZhou in #175
- recontruct the database manager to improve the efficiency for text2sql pipeline by @TechNomad-ds in #174
- fix bug by @TechNomad-ds in #176
- Update Gardio and Bug Fix by @DeepMindLiuZhou in #177
- add operator in readme by @Qmeiyi in #178
- change kbc script in playground & manage kbc pipelines by @ZhaoyangHan04 in #179
- Unify backend and fronted by @DeepMindLiuZhou in #180
- add mathbook extract to playground by @HeRunming in #181
- add gradio in readme by @Qmeiyi in #182
- add safety checks in fetching pdf by @ZhaoyangHan04 in #184
- Add a fix for multi-turn dialogues where some generated user turns lack an assistant reply by @Arunshmily in #185
New Contributors
- @Niujunbo2002 made their first contribution in #132
- @Arunshmily made their first contribution in #185
Full Changelog: v1.0.4...v1.0.5
Dataflow v1.0.4 Release Notes
DataFlow v1.0.4 Key Feature Updates
- Automatic Operator Code Generation: Introduced new features for automatic operator code generation by @DeepMindLiuZhou (PR #61).
- Myscale Storage Support: Added support for myscale storage by @leaderwolfpipi (PR #60).
- Dialogue Function Generation: Implemented a function to generate from conversations by @MOLYHECI (PR #59).
- QA Generator and Translator: Added a QA generator and translation feature by @haolpku (PR #65).
- Text2SQL Pipeline Update: Refactored the text2sql pipeline by @TechNomad-ds (PR #113).
- AgenticRAG Pipeline Enhancements: Enhanced the AgenticRAG pipeline to fully support embedding models by @wongzhenhao (PR #86).
- Lazy Load Framework Support: The entire framework now supports lazy loading, significantly improving loading speeds, by @MOLYHECI (PR #87).
- GeneralText Optimization: Optimized GeneralText-related information by @zzy1127 (PRs #102, #112, #125).
- Removal of Legacy Code: Removed outdated code logic from the repository by @HeRunming (PR #118).
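The lazy-loading entry above does not say how it is implemented. One common way to get this effect in Python is a module proxy that defers the real import until an attribute is first used. The sketch below shows that general pattern only; `LazyModule` is a hypothetical class and is not taken from the DataFlow codebase.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyModule(ModuleType):
    """Module proxy that imports the real module on first attribute access."""

    def __init__(self, name: str) -> None:
        super().__init__(name)
        self._real: Optional[ModuleType] = None

    def _load(self) -> ModuleType:
        if self._real is None:
            # The expensive import happens here, once, on first use.
            self._real = importlib.import_module(self.__name__)
        return self._real

    def __getattr__(self, attr: str) -> Any:
        # Called only when normal lookup on the proxy fails,
        # so _real and __name__ never reach this method.
        return getattr(self._load(), attr)
```

With a proxy like this, a package with many heavy operator modules can be imported quickly, and each dependency is paid for only when an operator that needs it is touched.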
Notable Changes
- Operator Naming Rules: Renamed all operators naming rules by @SunnyHaze (PR #81).
- FuncCall Pipeline: Introduced a new FuncCall Pipeline by @MOLYHECI (PR #88).
- Batch PDF Extractor: Added functionality for batch PDF extraction by @haolpku (PR #111).
- Bug Fixes and Improvements: Various contributors, including @YqjMartin and @ZhaoyangHan04, worked on code refactoring, dependency fixes, and bug resolutions.
What's Changed
- Dataflow agent new features for automatic operator code generation by @DeepMindLiuZhou in #61
- Support myscale storage by @leaderwolfpipi in #60
- Add function generate from conversations (dialogue) by @MOLYHECI in #59
- add QA generator and translator by @haolpku in #65
- change face and add acknowledgements by @Qmeiyi in #68
- change face by @Qmeiyi in #69
- delete the api aisuite (fix #32) by @scuuy in #70
- Rename all operators naming rules. by @SunnyHaze in #81
- adding missing numpy import by @JimmyAwoe in #76
- [Rename] unused file deleted by @wongzhenhao in #82
- rename RARE operators by @mi-iro in #83
- [Update] APILLMServing_request now support embedding model & AgenticRAG pipeline fully support API request by @wongzhenhao in #86
- Support litellm by @Sucran in #84
- Add Lazyloader feature for GeneralText by @MOLYHECI in #87
- Dataflow agent by @DeepMindLiuZhou in #91
- fix dependency conficts in kbc pipeline by @ZhaoyangHan04 in #89
- solve issue #92 and #85 by @zzy1127 in #94
- add TYPE_CHECKING if-else for VSCode static check by @MOLYHECI in #93
- [oper] rename `promptgenerator` to `promptedgenerator` by @SunnyHaze in #95
- [Update] AgenticRAG pipeline now support APILLMServing for embedding by @wongzhenhao in #96
- [Update] AgenticRAG pipeline now support APILLMServing for embedding models by @wongzhenhao in #97
- reduce logger content by @ZhaoyangHan04 in #98
- Add auto generate _import_structure function & fix import issues for dataflow/statics/ by @MOLYHECI in #99
- Add FuncCall Pipeline by @MOLYHECI in #88
- add prompts for consistentchat and fix some bugs by @zzy1127 in #102
- Add local QA generation and translation by @haolpku in #104
- Dataflow agent update, with demo for writing some operators by @SunnyHaze in #105
- fix translation bug and add data by @haolpku in #107
- fix agentic RAG problem and add eval operators by @YqjMartin in #106
- add abbreviation module by @haolpku in #108
- [storage] add error logging when don't call step before first run. by @SunnyHaze in #110
- add batch pdf extractor by @haolpku in #111
- modify code position by @YqjMartin in #109
- [register] update register which could return type of operators by `get_type_of_operator` by @SunnyHaze in #112
- update readme by @Qmeiyi in #114
- update readme about agent by @Qmeiyi in #117
- fix import bugs for sub-folder used operators by @MOLYHECI in #116
- remove out-of-time fuction in dataflow/utils/utils.py by @HeRunming in #118
- modift file path and redundant file by @YqjMartin in #121
- Delete Operator.json by @DeepMindLiuZhou in #120
- add sft syn pipeline by @zzy1127 in #122
- new rename generators by @zzy1127 in #125
- Move SFT synthesis into the playground by @zzy1127 in #126
- [Update] Improve AgenticRAG code readability by @wongzhenhao in #129
- update text2sql pipeline by @TechNomad-ds in #113
- fix the db not exist bug by @TechNomad-ds in #131
New Contributors
- @JimmyAwoe made their first contribution in #76
- @Sucran made their first contribution in #84
Full Changelog: v1.0.3...v1.0.4
Dataflow v1.0.3 Release Notes
What's changed
- Add more scorers (operators) to the `GeneralText pipeline` (#38 and #48). Thanks @zzy1127 @MOLYHECI
- Add more operators to the `AgenticRAG pipeline` (#50, #41). Thanks @wongzhenhao @YqjMartin
- Revise the API_KEY env variable passing logic in the `APIServing` class. The default variable is `DF_API_KEY` to avoid conflicts (#57). Thanks @SunnyHaze
- Rename `llmserving` to `serving` for future extension to other kinds of web services (#44). Thanks @SunnyHaze
- Update the Readme (#40, #52, #53). Thanks @Qmeiyi
- Fix some bugs and parameter issues in the `AgenticRAG` pipeline (#49). Thanks @TheRoadQaQ
- Fix some bugs and parameter issues in the `Knowledge base cleaning pipeline` (#47). Thanks @ZhaoyangHan04
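The `DF_API_KEY` change above means the API key is read from a dedicated environment variable rather than a generic one that other tools might also set. A minimal sketch of that lookup follows. `resolve_api_key` and its `key_name` parameter are hypothetical and only illustrate the pattern, not the `APIServing` class itself.

```python
import os


def resolve_api_key(key_name: str = "DF_API_KEY") -> str:
    """Read the API key from the named environment variable or fail loudly."""
    api_key = os.environ.get(key_name)
    if not api_key:
        raise ValueError(
            f"Environment variable {key_name} is not set; "
            f"export it before starting the API serving backend."
        )
    return api_key
```

Keeping the variable name configurable lets one process hold keys for several backends at once, for example by passing a different `key_name` per serving instance.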
Detailed list for all changed PRs
- update readme by @Qmeiyi in #40
- [New Operators] A lite implementation of OPPO TaskCraft by @wongzhenhao in #41
- add scorers by @zzy1127 in #38
- [update] rename `llmserving` to `serving` to fit future extension by @SunnyHaze in #44
- agentic rag para revise by @TheRoadQaQ in #49
- add remaining operators by @zzy1127 in #48
- normalize file path and params by @ZhaoyangHan04 in #47
- update readme by @Qmeiyi in #52
- update readme by @Qmeiyi in #53
- Add several methods to improve AgenticRAG generation by @YqjMartin in #50
- [serving] set default API serving key to `DF_API_KEY` and this key ca… by @SunnyHaze in #57
Full Changelog: v1.0.2...v1.0.3
Dataflow v1.0.2 Release Notes
New features
- Add implementation of Dataflow Agents #34 . Thanks @DeepMindLiuZhou
Debug
- Fix get-desc issue #35 , Thanks @leaderwolfpipi
- Fix including bug for `/example/KBC/test.doc` and `/example/KBC/test.pdf` in manifest.ini. Thanks @SunnyHaze
Dataflow v1.0.1 Release Notes
New features
- add RARE pipeline (#33) @mi-iro
- add API calling to `text pipeline`, i.e. `test_sft_filter.py` (#29) @zzy1127
Thanks for your contribution.
Debug
Fix the PyPI issue that made `pip install open-dataflow` fail. @SunnyHaze. Thanks to @leaderwolfpipi for reporting this bug.
Dataflow v1.0.0 Release Notes
🎉🎉🎉 We are thrilled to release our Data-centric AI system, DataFlow! 🎉🎉🎉
Version: v1.0.0
Modular and AI-assisted data preparation system for high-efficiency pipelines.
🚀 Introduction
DataFlow is a high-efficiency data preparation system composed of advanced operators and multi-stage data processing pipelines. It integrates rule-based methods, deep learning models, and large language models (LLMs) to provide a modular, scalable, and reconfigurable design.
It aims to improve the quality and efficiency of data cleaning, augmentation, and construction — supporting the development of next-generation large-scale models.
Designed for researchers and engineers working on data-centric AI, LLM training, and scalable data workflows.
🧠 Core Features
- 🔁 Modular Operator Design: Inspired by PyTorch, each operator is configurable and reusable.
- 🧩 Multi-stage Pipelines: Flexibly chain operators for end-to-end data processing.
- 🤖 Agent for DataFlow: LLM-powered automation for pipeline orchestration and operator generation.
- ⚙️ Hybrid Techniques: Seamlessly combines rule-based, neural, and LLM-based methods.
- 💾 Built-in Storage Layer: Manage intermediate data and caching.
- 🔌 LLM Backend Support: Easily plug into GPT-style backends with `LLMServing`.
🧱 Framework Overview
DataFlow consists of the following core modules:
| Module | Description |
|---|---|
| `operator` | Basic data processing units, reusable across pipelines. |
| `pipeline` | Manages multi-step workflows by chaining multiple operators. |
| `storage` | Manages data cache, storage, and I/O between steps. |
| `LLMServing` | Integrates large models for reasoning, filtering, and generation. |
| `Agent` | Automatically generates, orchestrates, and manages data workflows. |
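The relationship between `operator` and `pipeline` in the table can be sketched in a few lines. The classes below are hypothetical stand-ins written for illustration (`Operator`, `LengthFilter`, `UppercaseRefiner`, `Pipeline`) and do not reproduce DataFlow's real base classes or its `storage` layer. Rows are plain dictionaries passed directly between steps.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


class Operator:
    """Base class: configure in __init__, process rows in run()."""

    def run(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class LengthFilter(Operator):
    """Keep rows whose text under input_key has at least min_chars characters."""

    def __init__(self, input_key: str, min_chars: int) -> None:
        self.input_key = input_key
        self.min_chars = min_chars

    def run(self, rows: List[Row]) -> List[Row]:
        return [row for row in rows if len(row[self.input_key]) >= self.min_chars]


class UppercaseRefiner(Operator):
    """Write an upper-cased copy of input_key into output_key."""

    def __init__(self, input_key: str, output_key: str) -> None:
        self.input_key = input_key
        self.output_key = output_key

    def run(self, rows: List[Row]) -> List[Row]:
        return [{**row, self.output_key: row[self.input_key].upper()} for row in rows]


class Pipeline:
    """Chain operators so each step consumes the previous step's output."""

    def __init__(self, operators: Sequence[Operator]) -> None:
        self.operators = list(operators)

    def forward(self, rows: List[Row]) -> List[Row]:
        for operator in self.operators:
            rows = operator.run(rows)
        return rows
```

A short usage example: `Pipeline([LengthFilter("text", 5), UppercaseRefiner("text", "text_upper")]).forward(rows)` drops short rows first and then adds the derived column to the survivors.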
🛠️ Example Usage and Operators
To get started quickly with real examples, please refer to our documentation:
- 📘 Example Pipelines: Text Pipeline Tutorial
- 🧩 Available Operators: Operator Reference for Text Evaluation
These guides provide hands-on usage of core modules including Pipeline, Operator, and Agent, and demonstrate how to configure, extend, and run a complete data processing workflow using DataFlow.
🔍 Why DataFlow?
| Feature | Benefit |
|---|---|
| PyTorch-style API | Easy to learn and integrate |
| LLM + Rules + NN | Flexible and powerful hybrid workflows |
| Auto Agent Support | Reduces manual data prep burden |
| Storage Layer | Efficient checkpointing and result reuse |
| Fully Modular | Easy to extend, test, and compose |
📫 Contact
For issues, contributions, or questions, feel free to reach out:
GitHub: https://github.com/OpenDCAI/DataFlow
Email: [email protected]
Dataflow v0.0.3 Release Notes
First Release for Dataflow system
- Now the Dataflow codespace has been fully implemented with all features.
- You can easily experience our powerful data-centric system with the `pip install open-dataflow` and `dataflow init` commands.