Releases: OpenDCAI/DataFlow

Dataflow v1.0.6 Release Note

15 Oct 15:26

🚀 DataFlow v1.0.6 Key Feature Updates

  • Prompt Registration System
    Introduced a unified Prompt Registry that supports one-to-many prompt bindings per operator, allowing flexible task adaptation and consistent structure. Thanks to @SunnyHaze.

  • New Code Processing Pipeline
    Added a comprehensive code pipeline and related operators for analyzing, filtering, and processing code datasets. Thanks to @beccabai.

  • Reasoning Pipeline Achievements
    The Reasoning pipeline achieved 1st place in the BAAI LIC Reasoning Competition, validating DataFlow’s reasoning robustness and system scalability. Thanks to @miaode74, @scuuy, @wongzhenhao, @HeRunming, and @haolpku.

  • Automatic PDF2Model Functionality
    Added an automated PDF-to-Model module that converts PDF documents or datasets into structured QA pairs, enabling downstream model training with LlamaFactory. Thanks to @YalinFeng01 and @ZhaoyangHan04.

  • Automatic Benchmark Evaluation
    Introduced the DataFlow Eval module for automatic text benchmark evaluation (e.g., string match and semantic match). Thanks to @YalinFeng01.

  • Text2SQL Pipeline with Unified DB Manager
    Refactored the Text2SQL pipeline with a new DB Manager supporting MySQL, SQLite, and more. Enhanced prompt modularity and operator reuse. Thanks to @TechNomad-ds.

  • JSON Schema Structural Output
    LLMServing and LiteLLMServing now support JSON Schema structured outputs, allowing models to produce well-formed structured results. Thanks to @wongzhenhao.

  • Structured QA Extraction from Books
    Added a BookQA Extraction Pipeline to automatically extract structured QA pairs from book-style documents. Thanks to @HeRunming.

  • Science Operators Added
    Introduced Science operators for scientific and multimodal data processing. Thanks to @haolpku.

  • Colorful and Informative Logger
    Enhanced logging with a colorful output format for better readability and debugging. Thanks to @MOLYHECI.

  • New Tutorial Series
    Released a Bilibili tutorial series introducing key DataFlow concepts and practical demos.
    🎥 Watch here — Thanks to @Qmeiyi.
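
The one-to-many prompt binding described for the Prompt Registry can be illustrated with a small sketch. This is not DataFlow's actual API: the registry dictionary, the `register_prompt` decorator, `prompts_for`, `check_prompt`, and the two prompt classes are all hypothetical names chosen for illustration. It only shows the general idea of binding several prompt templates to one operator and validating a prompt before use.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical registry: operator name -> prompt classes bound to that operator.
_PROMPT_REGISTRY: Dict[str, List[type]] = defaultdict(list)


def register_prompt(operator_name: str) -> Callable[[type], type]:
    """Class decorator that binds a prompt template to an operator (illustrative)."""
    def decorator(prompt_cls: type) -> type:
        _PROMPT_REGISTRY[operator_name].append(prompt_cls)
        return prompt_cls
    return decorator


def prompts_for(operator_name: str) -> List[type]:
    """Return every prompt template registered for the operator."""
    return list(_PROMPT_REGISTRY.get(operator_name, []))


def check_prompt(operator_name: str, prompt: object) -> None:
    """Reject a prompt whose class was never registered for this operator."""
    if type(prompt) not in _PROMPT_REGISTRY.get(operator_name, []):
        raise TypeError(
            f"{type(prompt).__name__} is not registered for {operator_name}"
        )


# Two templates bound to the same (hypothetical) operator: the one-to-many case.
@register_prompt("QuestionGenerator")
class MathQuestionPrompt:
    def build(self, text: str) -> str:
        return f"Write a math question based on: {text}"


@register_prompt("QuestionGenerator")
class CodeQuestionPrompt:
    def build(self, text: str) -> str:
        return f"Write a coding question based on: {text}"
```

With this layout an operator can list its allowed prompts through `prompts_for("QuestionGenerator")` and call `check_prompt` in its constructor, which is one plausible way to combine registration with validation.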


🧩 Notable Improvements

  • Added prompt registration and validation – @SunnyHaze
  • Added structured output support for VLLM Serving – @wongzhenhao
  • Enhanced pipeline compilation checks – @SunnyHaze
  • Improved PDF2Model and benchmark evaluation – @YalinFeng01
  • Added official tutorial series – @Qmeiyi
  • Agent Refactor Announcement
    The DataFlow Agent is undergoing a major refactor and will soon migrate to a LangGraph-based architecture, supporting advanced multi-agent orchestration.
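
The string-match side of benchmark evaluation mentioned above can be sketched as a normalized exact-match score. The function names and the normalization rules (lowercasing, dropping punctuation, collapsing whitespace) are assumptions for illustration and are not taken from the DataFlow Eval module. Semantic matching would additionally need an embedding or judge model and is not covered here.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed rules)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

For example, `exact_match_accuracy(["Paris.", " 42 ", "blue"], ["paris", "42", "red"])` counts the first two pairs as matches and the third as a miss.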

What's Changed

  • [webui] debug for WebUI, revise func name 'type' to 'serving_type' by @SunnyHaze in #186
  • [Debug] Fix API bug with adding a button to write env value DF_API_KEY by @HeRunming in #187
  • [WebUI] Add pdf knowledge base clean WebUI by @HeRunming in #189
  • unify _api_chat usage by @MOLYHECI in #190
  • Add a retry mechanism for failed requests to llm_serving && fix None entries appearing in the cleaned results when an llm_serving call fails, which raised TypeError: argument of t… by @xyxhchb in #188
  • [debug] fix dir not exist by @HeRunming in #192
  • [webui] debug import error by @SunnyHaze in #193
  • add adp and update gradio by @Qmeiyi in #194
  • Fix the execution classifier operator in the text2sql pipeline by @TechNomad-ds in #197
  • Dataflow agent Console bug fix by @DeepThinkingZhouLiu in #199
  • move non-key params to init function by @ZhaoyangHan04 in #200
  • [Update] unified i/o keys with input/output_* format by @wongzhenhao in #201
  • migrate from DataFlow421 by @yuwenkai2003 in #202
  • [Pipeline] add Automatic Speech Recognition module and corresponding pipeline. by @gty1829 in #207
  • [issue temp] update issue template; and add sglang & mineru to dataflow env by @SunnyHaze in #208
  • Bug Fix by @DeepThinkingZhouLiu in #210
  • [Compiled Pipeline] Add naive logic of Compiled pipeline for pre-check of key logic & Serving management. by @MOLYHECI in #191
  • [compile] Added gradient color transition from step=0 to step=n in th… by @SunnyHaze in #213
  • [Agent] significantly reduce debug time when writing pipelines with the DataFlow agent by calling pipeline.compile() by @DeepThinkingZhouLiu in #214
  • [compile] Report all KeyErrors in a single, consolidated compilation … by @SunnyHaze in #215
  • [Agent] fix prompt_template issue for Dataflow agent when autorun by @DeepThinkingZhouLiu in #216
  • add encoding check during write storage by @ZhaoyangHan04 in #217
  • rewrite LALMServing by @gty1829 in #219
  • add get_desc functions for new ops by @scuuy in #224
  • fix bug in diy prompt by @scuuy in #225
  • Add core_text and chemistry smiles extraction pipeline by @haolpku in #226
  • [refactor] Rename operators and revise op structure at 2025-08-21 by @SunnyHaze in #227
  • update text2sql pipeline, reconstruct prompt template by @TechNomad-ds in #230
  • [debug] fix #231, Prompted Generator issue after #227 by @haolpku in #232
  • Add material pipeline and pairwise prompted generator by @haolpku in #234
  • fix serving name and fix import LocalModelLLMServing bug by @haolpku in #235
  • add atomic operation by @Fengzhongzhihan in #236
  • fix chunk logic when length of tokens greater than model max token size by @CheinTian in #239
  • fix chemical pipeline output schema bug by @ZhaoyangHan04 in #240
  • update response format for chemistry pipelines by @haolpku in #243
  • [PDF2model/text2model] dataflow PDF2model/text2model function added to dataflow cli by @dataflow-fyl in #242
  • Dataflow agent SH by @DeepThinkingZhouLiu in #241
  • fix api serving by @haolpku in #245
  • Update the Text2SQL pipeline, refactored the database manager to support better database extensibility; manage prompts through prompt template classes to improve operator reusability. by @TechNomad-ds in #244
  • [README] Add a documentation link for the pipeline in the README file. fix #221 by @miaode74 in #250
  • add bench eval pipeline (string match and semantic match) by @scuuy in #238
  • rename kbc ops and prompt class by @ZhaoyangHan04 in #249
  • [Refactor] AgenticRAG pipeline & Doc2QA pipeline & KCenterGreedy by @wongzhenhao in #246
  • [refactor] moving example file to right path by @wongzhenhao in #253
  • divide general_text operators into core_text, general_text, text_pt, text_sft by @moly...
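
Several entries in this list concern pipeline compilation: a pre-check of key logic and a consolidated report of all KeyErrors. A minimal sketch of that idea follows, assuming each step declares the keys it reads and writes. `Step`, `PipelineCompileError`, and `compile_check` are hypothetical names, not DataFlow's implementation.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Step:
    """Hypothetical description of one pipeline step and the keys it touches."""
    name: str
    input_keys: List[str]
    output_keys: List[str]


class PipelineCompileError(Exception):
    """Raised once, carrying every key problem found during the pre-check."""


def compile_check(steps: List[Step], initial_keys: Set[str]) -> None:
    """Walk the steps in order without running them and verify key availability."""
    available = set(initial_keys)
    problems: List[str] = []
    for index, step in enumerate(steps):
        for key in step.input_keys:
            if key not in available:
                problems.append(
                    f"step {index} ({step.name}): missing input key '{key}'"
                )
        # Outputs become available to all later steps.
        available.update(step.output_keys)
    if problems:
        # Report everything at once instead of failing on the first error.
        raise PipelineCompileError("\n".join(problems))
```

Collecting the problems before raising means a user sees every missing key in one pass, rather than fixing one error, rerunning, and hitting the next.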

Dataflow v1.0.5 Release Note

23 Jul 12:02

DataFlow v1.0.5 Key Feature Updates

  • Add General Reasoning Pipeline: Add a new pipeline that supports general reasoning data and DIY prompts, fix some bugs, and rework some reasoning operators, by @scuuy in #137
  • Add Batch Wrapper: Add batch_wrapper for running an operator in batches within a pipeline, by @SunnyHaze in #157
  • Pandas Operator Release: Release GeneralFilter for pandas, by @wongzhenhao in #170
  • Add Multiturn Function Call Operators: Add example data for FuncCallPipeline and rename MultiTurnDialogueGenerator, by @MOLYHECI in #136
  • Add Math Problem Extractor: Add VQAServing and add mathbook_promblem_extractor to the KBC Pipeline, by @HeRunming in #152
  • Refine General Text Operators: Customizable prompts for SFT generators, by @zzy1127 in #139
  • Fix Local Serving Bug: Fix Local Model Serving by applying chat_template to the system and user prompts, by @haolpku in #158
  • Speed Up Text2SQL Pipeline: Reconstruct the database manager to improve the efficiency of the Text2SQL pipeline, by @TechNomad-ds in #174
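
The batch wrapper idea from the list above, running one operator over a dataset in fixed-size batches, can be sketched as follows. The class and function names here are illustrative assumptions and do not reproduce the signature of DataFlow's batch_wrapper.

```python
from typing import Any, Callable, Iterator, List, Sequence


def iter_batches(rows: Sequence[Any], batch_size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


class BatchWrapper:
    """Hypothetical wrapper that feeds an operator one batch at a time."""

    def __init__(self, operator: Callable[[Sequence[Any]], List[Any]], batch_size: int):
        self.operator = operator
        self.batch_size = batch_size

    def run(self, rows: Sequence[Any]) -> List[Any]:
        results: List[Any] = []
        for batch in iter_batches(rows, self.batch_size):
            # Each call sees only one batch, which bounds memory and request size.
            results.extend(self.operator(batch))
        return results
```

Wrapping keeps the operator itself unchanged, so the same operator can run with or without batching depending on how the pipeline is assembled.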

Notable Changes

  • Add Dataflow WebUI: Add a Gradio WebUI for all operators, by @HeRunming in #169
  • Add Dataflow-Agent WebUI: Add an agent Gradio UI, by @DeepMindLiuZhou in #175
  • Add MinerU for KBCPipeline: Add MinerU2.0, by @Niujunbo2002 in #132, and support fetching arXiv PDF links, by @ZhaoyangHan04 in #171
  • Add Sglang Support: Add tensor_parallel and data_parallel to LocalLLMServing_sglang, by @SunnyHaze in #147

What's Changed

  • add get_desc for all general text operators by @zzy1127 in #133
  • [Feature] GeneralFilter for GeneralText release! by @wongzhenhao in #135
  • fix problem by @YqjMartin in #138
  • add example data for FuncCallPipeline & rename MultiTurnDialogueGenerator by @MOLYHECI in #136
  • add examples to get_desc by @ZhaoyangHan04 in #134
  • SFT generators with customizable prompts by @zzy1127 in #139
  • (new) add new pipeline to support general reasoning data and diy prompt, and fix some bugs, reform some reasoning ops by @scuuy in #137
  • Support MinerU2 for KnowledgeCleaning by @Niujunbo2002 in #132
  • [serving] set default vllm_seed param for LocalModelLLMServing_vllm to None to avoid warning by @SunnyHaze in #143
  • Fix GPU reasoning pipeline bug by @scuuy in #145
  • refine the get_desc func for each operator for text2sql pipeline by @TechNomad-ds in #142
  • [Serving] Add tensor_parallel and data_parallel to LocalLLMServing_sglang by @SunnyHaze in #147
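  • Add get_desc functions to all text operators by @scuuy in #146
  • Fix the storage column-parsing error by expanding the data field into the dataframe, adjust the version, and fix a bug in the AnswerNgramFilter operator by @leaderwolfpipi in #115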
  • add medical pipeline, generated by agent by @DeepMindLiuZhou in #148
  • [serving] add sglang for all scripts for option by @SunnyHaze in #150
  • implement kbc batch process operators and pipeline by @ZhaoyangHan04 in #151
  • [Serving, KBC]Add VQAServing, Add mathbook_promblem_extractor to KBC Pipeline. by @HeRunming in #152
  • fix bug for RemoveEmojiRefiner by @zzy1127 in #153
  • fix bugs in batch_kbc by @ZhaoyangHan04 in #156
  • [batch_wrapper] upload batch_wrapper for batching an operator in a pipeline. by @SunnyHaze in #157
  • [Serving] Fix Local Model Serving, apply chat_template to sys & user prompt by @haolpku in #158
  • add API-based languagefilter & customized MetaScorer by @MOLYHECI in #161
  • fix quickstart bug by @haolpku in #162
  • Unify the embedding attribute names and adjust the SQLVariationGenerator operator's fill logic so results are added to the original data by @leaderwolfpipi in #160
  • add publications by @Qmeiyi in #163
  • Fix forward-compatibility issues in other operators of the reasoning pipeline by @leaderwolfpipi in #165
  • add desc for func call & add statics for meta score by @MOLYHECI in #168
  • [webui] Add Gradio WebUI for experience all operators. by @HeRunming in #169
  • [Feature] PandasOperator release! [Update] GeneralFilter updated by @wongzhenhao in #170
  • Support for fetching arxiv pdf links by @ZhaoyangHan04 in #171
  • add new reasoning operator “answer_model_judge” , to check reference answer via llm by @scuuy in #172
  • [WebUI] Add API Pipeline UI by @HeRunming in #173
  • Add agent gradio UI by @DeepMindLiuZhou in #175
  • reconstruct the database manager to improve the efficiency of the text2sql pipeline by @TechNomad-ds in #174
  • fix bug by @TechNomad-ds in #176
  • Update Gradio and Bug Fix by @DeepMindLiuZhou in #177
  • add operator in readme by @Qmeiyi in #178
  • change kbc script in playground & manage kbc pipelines by @ZhaoyangHan04 in #179
  • Unify backend and frontend by @DeepMindLiuZhou in #180
  • add mathbook extract to playground by @HeRunming in #181
  • add gradio in readme by @Qmeiyi in #182
  • add safety checks in fetching pdf by @ZhaoyangHan04 in #184
  • Add a fix for multi-turn dialogues where some generated user turns lacked an assistant reply by @Arunshmily in #185

New Contributors

Full Changelog: v1.0.4...v1.0.5

Dataflow v1.0.4 Release Notes

15 Jul 14:49

DataFlow v1.0.4 Key Feature Updates

  • Automatic Operator Code Generation: Introduced new features for automatic operator code generation by @DeepMindLiuZhou (PR #61).
  • Myscale Storage Support: Added support for myscale storage by @leaderwolfpipi (PR #60).
  • Dialogue Function Generation: Implemented a function to generate from conversations by @MOLYHECI (PR #59).
  • QA Generator and Translator: Added a QA generator and translation feature by @haolpku (PR #65).
  • Text2SQL Pipeline Update: Refactored the text2sql pipeline by @TechNomad-ds (PR #113).
  • AgenticRAG Pipeline Enhancements: Enhanced the AgenticRAG pipeline to fully support embedding models by @wongzhenhao (PR #86).
  • Lazy Load Framework Support: The entire framework now supports lazy loading, significantly improving loading speeds, by @MOLYHECI (PR #87).
  • GeneralText Optimization: Optimized information related to GeneralText by @zzy1127 (PR #102, #112, #125).
  • Removal of Legacy Code: Removed outdated code logic from the repository by @HeRunming (PR #118).
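
Lazy loading, as mentioned in the list above, generally means deferring an import until the name is first used. The sketch below shows one common Python technique for this, a module-like object whose `__getattr__` imports on demand. `LazyNamespace` and the `ops` mapping are hypothetical and use standard-library targets for illustration; DataFlow's own mechanism may differ.

```python
import importlib
from types import ModuleType
from typing import Any, Dict


class LazyNamespace(ModuleType):
    """Module-like object that imports each attribute on first access."""

    def __init__(self, name: str, targets: Dict[str, str]):
        super().__init__(name)
        # Maps attribute name -> "module:attribute" to import later.
        self._targets = targets

    def __getattr__(self, attr: str) -> Any:
        # Python calls this only when normal lookup fails, i.e. before the
        # attribute has been loaded and cached.
        try:
            module_name, _, obj_name = self._targets[attr].partition(":")
        except KeyError:
            raise AttributeError(
                f"{self.__name__!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(module_name), obj_name)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Nothing is imported here; the targets are resolved on first use.
ops = LazyNamespace(
    "lazy_ops",
    {"Fraction": "fractions:Fraction", "dumps": "json:dumps"},
)
```

The benefit is that importing the top-level package stays cheap: heavy dependencies behind rarely used operators are paid for only by the code that touches them.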

Notable Changes

  • Operator Naming Rules: Revised the naming rules for all operators by @SunnyHaze (PR #81).
  • FuncCall Pipeline: Introduced a new FuncCall Pipeline by @MOLYHECI (PR #88).
  • Batch PDF Extractor: Added functionality for batch PDF extraction by @haolpku (PR #111).
  • Bug Fixes and Improvements: Various contributors, including @YqjMartin and @ZhaoyangHan04, worked on code refactoring, dependency fixes, and bug resolutions.

What's Changed

New Contributors

Full Changelog: v1.0.3...v1.0.4

Dataflow v1.0.3 Release Notes

10 Jul 05:42

What's changed

  • Add more scorers (operators) to the GeneralText pipeline (#38, #48). Thanks @zzy1127 @MOLYHECI
  • Add more operators to the AgenticRAG pipeline (#50, #41). Thanks @wongzhenhao @YqjMartin
  • Revise the API_KEY environment variable passing logic in the APIServing class. The default variable is now DF_API_KEY to avoid conflicts (#57). Thanks @SunnyHaze
  • Rename llmserving to serving to allow future extension to other kinds of web services (#44). Thanks @SunnyHaze
  • Update the README (#40, #52, #53). Thanks @Qmeiyi
  • Fix some bugs and parameter issues in the AgenticRAG pipeline (#49). Thanks @TheRoadQaQ
  • Fix some bugs and parameter issues in the knowledge base cleaning pipeline (#47). Thanks @ZhaoyangHan04
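
The DF_API_KEY convention noted above can be sketched as a small lookup helper. Only the variable name DF_API_KEY comes from the release note; the function name, its parameters, and the error message are assumptions for illustration and are not the APIServing class's real interface.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    key_name: str = "DF_API_KEY",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Read the API key from the environment, failing early if it is unset."""
    source = os.environ if env is None else env
    api_key = source.get(key_name)
    if not api_key:
        raise ValueError(
            f"Set the {key_name} environment variable before using the API serving."
        )
    return api_key
```

Using a project-specific default name avoids colliding with keys that other tools read from the same shell, while the `key_name` parameter still lets a caller point at a different variable.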

Detailed list for all changed PRs

Full Changelog: v1.0.2...v1.0.3

Dataflow v1.0.2 Release Notes

03 Jul 09:39

New features

  • Add implementation of Dataflow Agents (#34). Thanks @DeepMindLiuZhou

Debug

  • Fix the get-desc issue (#35). Thanks @leaderwolfpipi
  • Fix the file-inclusion bug for /example/KBC/test.doc and /example/KBC/test.pdf in manifest.ini. Thanks @SunnyHaze

Dataflow v1.0.1 Release Notes

03 Jul 07:38

New features

  • add RARE pipeline (#33) @mi-iro
  • add API calling to text pipeline, i.e. test_sft_filter.py (#29) @zzy1127

Thanks for your contribution.

Debug

Fix the PyPI issue that made pip install open-dataflow fail (@SunnyHaze). Thanks to @leaderwolfpipi for reporting this bug.

Dataflow v1.0.0 Release Notes

30 Jun 14:13

🎉🎉🎉 We are thrilled to release our Data-centric AI system, DataFlow! 🎉🎉🎉

Version: v1.0.0
Modular and AI-assisted data preparation system for high-efficiency pipelines.


🚀 Introduction

DataFlow is a high-efficiency data preparation system composed of advanced operators and multi-stage data processing pipelines. It integrates rule-based methods, deep learning models, and large language models (LLMs) to provide a modular, scalable, and reconfigurable design.

It aims to improve the quality and efficiency of data cleaning, augmentation, and construction — supporting the development of next-generation large-scale models.

Designed for researchers and engineers working on data-centric AI, LLM training, and scalable data workflows.


🧠 Core Features

  • 🔁 Modular Operator Design: Inspired by PyTorch, each operator is configurable and reusable.
  • 🧩 Multi-stage Pipelines: Flexibly chain operators for end-to-end data processing.
  • 🤖 Agent for DataFlow: LLM-powered automation for pipeline orchestration and operator generation.
  • ⚙️ Hybrid Techniques: Seamlessly combines rule-based, neural, and LLM-based methods.
  • 💾 Built-in Storage Layer: Manage intermediate data and caching.
  • 🔌 LLM Backend Support: Easily plug into GPT-style backends with LLMServing.

🧱 Framework Overview

DataFlow consists of the following core modules:

| Module | Description |
| --- | --- |
| operator | Basic data processing units, reusable across pipelines. |
| pipeline | Manages multi-step workflows by chaining multiple operators. |
| storage | Manages data cache, storage, and I/O between steps. |
| LLMServing | Integrates large models for reasoning, filtering, and generation. |
| Agent | Automatically generates, orchestrates, and manages data workflows. |
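
How the operator, pipeline, and storage modules fit together can be sketched in a few lines. Every class below is a hypothetical stand-in written for this illustration; the real DataFlow classes, method names, and signatures may differ. The sketch assumes operators read from and write to a shared storage object, and a pipeline chains them in a `forward` method in the PyTorch-inspired style described above.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


class InMemoryStorage:
    """Hypothetical storage layer that keeps the rows written by each step."""

    def __init__(self, rows: List[Row]):
        self._steps: List[List[Row]] = [list(rows)]

    def read(self) -> List[Row]:
        # Return copies so operators cannot mutate earlier checkpoints.
        return [dict(row) for row in self._steps[-1]]

    def write(self, rows: List[Row]) -> None:
        self._steps.append(list(rows))

    @property
    def step_count(self) -> int:
        return len(self._steps) - 1


class MinLengthFilter:
    """Rule-based operator: keep rows whose text is long enough."""

    def __init__(self, min_chars: int):
        self.min_chars = min_chars

    def run(self, storage: InMemoryStorage, input_key: str = "text") -> None:
        rows = [r for r in storage.read() if len(r[input_key]) >= self.min_chars]
        storage.write(rows)


class LengthScorer:
    """Operator that adds a derived column to every row."""

    def run(self, storage: InMemoryStorage, input_key: str = "text",
            output_key: str = "length") -> None:
        rows = storage.read()
        for row in rows:
            row[output_key] = len(row[input_key])
        storage.write(rows)


class Pipeline:
    """Chains operators; each step reads the previous step's output."""

    def __init__(self, storage: InMemoryStorage):
        self.storage = storage
        self.filter = MinLengthFilter(min_chars=5)
        self.scorer = LengthScorer()

    def forward(self) -> List[Row]:
        self.filter.run(self.storage, input_key="text")
        self.scorer.run(self.storage, input_key="text", output_key="length")
        return self.storage.read()
```

Keeping every step's output in storage is what makes checkpointing and result reuse possible: a later run could resume from any recorded step instead of recomputing the whole chain.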

🛠️ Example Usage and Operators

To get started quickly with real examples, please refer to our documentation:

These guides provide hands-on usage of core modules including Pipeline, Operator, and Agent, and demonstrate how to configure, extend, and run a complete data processing workflow using DataFlow.

🔍 Why DataFlow?

| Feature | Benefit |
| --- | --- |
| PyTorch-style API | Easy to learn and integrate |
| LLM + Rules + NN | Flexible and powerful hybrid workflows |
| Auto Agent Support | Reduces manual data prep burden |
| Storage Layer | Efficient checkpointing and result reuse |
| Fully Modular | Easy to extend, test, and compose |

📫 Contact

For issues, contributions, or questions, feel free to reach out:

GitHub: https://github.com/OpenDCAI/DataFlow
Email: [email protected]

Dataflow v0.0.3 Release Notes

29 Jun 18:16

First Release for Dataflow system

  • The Dataflow codebase is now fully implemented with all features.
  • You can try our data-centric system by running pip install open-dataflow and then the dataflow init command.