Commit cc6c6a5

add text generator docs (#19)
* refine text docs
* add generators
1 parent afa440a commit cc6c6a5

2 files changed, +66 -2 lines changed

docs/en/notes/guide/operators/text_process.md

Lines changed: 33 additions & 1 deletion
@@ -9,7 +9,7 @@ permalink: /en/guide/mq07gwz4/
 
 ## Overview
 
-DataFlow currently supports text data processing at the data point level, categorized into three types: refiners, deduplicators, and filters.
+DataFlow currently supports text data processing at the data point level, categorized into four types: refiners, deduplicators, generators, and filters.
 
 <table class="tg">
 <thead>
<thead>
@@ -30,6 +30,11 @@ DataFlow currently supports text data processing at the data point level, catego
 <td class="tg-0pky">6</td>
 <td class="tg-0pky">Removes duplicate data points using methods such as hashing.</td>
 </tr>
+<tr>
+<td class="tg-0pky">Generators</td>
+<td class="tg-0pky">2</td>
+<td class="tg-0pky">Generates data in a specific format from seed documents.</td>
+</tr>
 <tr>
 <td class="tg-0pky">Filters</td>
 <td class="tg-0pky">42</td>
@@ -194,6 +199,33 @@ DataFlow currently supports text data processing at the data point level, catego
 </tbody>
 </table>
 
+## Generators
+
+<table class="tg">
+<thead>
+<tr>
+<th class="tg-0pky">Name</th>
+<th class="tg-0pky">Applicable Type</th>
+<th class="tg-0pky">Description</th>
+<th class="tg-0pky">Repository or Paper</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td class="tg-0pky">PretrainGenerator</td>
+<td class="tg-0pky">Pretrain</td>
+<td class="tg-0pky">Synthesizes phi-4-style question-answer pairs from pretraining document data, retelling the document in QA format.</td>
+<td class="tg-0pky"><a href="https://arxiv.org/pdf/2401.16380">Paper</a></td>
+</tr>
+<tr>
+<td class="tg-0pky">SupervisedFinetuneGenerator</td>
+<td class="tg-0pky">SFT</td>
+<td class="tg-0pky">Synthesizes SFT-format QA pairs from seed documents and returns the original text.</td>
+<td class="tg-0pky">-</td>
+</tr>
+</tbody>
+</table>
+
 ## Filters
 
 <table class="tg">

docs/zh/notes/guide/operators/text_process.md

Lines changed: 33 additions & 1 deletion
@@ -7,7 +7,7 @@ permalink: /zh/guide/q07ou7d9/
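 
 # Text Data Processing
 ## Overview
-DataFlow's current text data processing operates mainly at the data point level and falls into three types: refiners, deduplicators, and filters.
+DataFlow's current text data processing operates mainly at the data point level and falls into four types: refiners, deduplicators, filters, and generators.
 <table class="tg">
 <thead>
 <tr>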
@@ -27,6 +27,11 @@ DataFlow's current text data processing operates mainly at the data point level and
 <td class="tg-0pky">6</td>
 <td class="tg-0pky">Deduplicates data points using methods such as hashing.</td>
 </tr>
+<tr>
+<td class="tg-0pky">Generators</td>
+<td class="tg-0pky">2</td>
+<td class="tg-0pky">Generates data in a specific format from seed documents.</td>
+</tr>
 <tr>
 <td class="tg-0pky">Filters</td>
 <td class="tg-0pky">42</td>
@@ -191,6 +196,33 @@ DataFlow's current text data processing operates mainly at the data point level and
 </tbody>
 </table>
 
+## Generators
+
+<table class="tg">
+<thead>
+<tr>
+<th class="tg-0pky">Name</th>
+<th class="tg-0pky">Applicable Type</th>
+<th class="tg-0pky">Description</th>
+<th class="tg-0pky">Official Repository or Paper</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td class="tg-0pky">PretrainGenerator</td>
+<td class="tg-0pky">Pretrain</td>
+<td class="tg-0pky">Synthesizes phi-4-style question-answer pairs from pretraining document data, retelling the document in QA format.</td>
+<td class="tg-0pky"><a href="https://arxiv.org/pdf/2401.16380">Paper</a></td>
+</tr>
+<tr>
+<td class="tg-0pky">SupervisedFinetuneGenerator</td>
+<td class="tg-0pky">SFT</td>
+<td class="tg-0pky">Synthesizes SFT-format QA pairs from seed documents and returns the original text.</td>
+<td class="tg-0pky">-</td>
+</tr>
+</tbody>
+</table>
+
 ## Filters
 
 <table class="tg">
