Commit afa440a

refine text docs (#18)
1 parent d722f57 commit afa440a

2 files changed: +70 -87 lines changed

docs/en/notes/guide/quickstart/TextPipeline.md

Lines changed: 35 additions & 42 deletions
@@ -96,7 +96,7 @@ bash text_pipeline/run_sft_synthetic_new.sh
  --step_type process
  ```
  4. **HtmlUrlRemoverRefiner**
- - Function: Remove HTML tags
+ - Function: Remove HTML tags, such as \<tag\>
  - Command:
  ```bash
  python pipeline_step.py \
@@ -114,7 +114,7 @@ bash text_pipeline/run_sft_synthetic_new.sh
  --step_type process
  ```
  6. **BlocklistFilter**
- - Function: Filter text containing too many blocked words
+ - Function: Filter text containing too many blocked words. The blocklist is based on [List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words)
  - Command:
  ```bash
  python pipeline_step.py \
@@ -123,7 +123,7 @@ bash text_pipeline/run_sft_synthetic_new.sh
  --step_type process
  ```
  7. **WordNumberFilter**
- - Function: Filter by word count
+ - Function: Filter by word count; keep text whose word count is in [20, 100000] (adjustable)
  - Command:
  ```bash
  python pipeline_step.py \
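The hunk above documents a word-count range but not how words are counted. A minimal sketch of such a check, assuming whitespace tokenization and inclusive bounds; the function name `keep_by_word_count` is hypothetical and is not taken from the repository:

```python
def keep_by_word_count(text: str, min_words: int = 20, max_words: int = 100_000) -> bool:
    """Return True when the word count lies in [min_words, max_words].

    Assumption: words are whitespace-delimited tokens; the real operator
    may tokenize differently.
    """
    n_words = len(text.split())
    return min_words <= n_words <= max_words
```

Both bounds are parameters because the documentation marks the range as adjustable.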
@@ -141,7 +141,7 @@ bash text_pipeline/run_sft_synthetic_new.sh
  --step_type process
  ```
  9. **SentenceNumberFilter**
- - Function: Filter by abnormal sentence count
+ - Function: Filter by abnormal sentence count; keep documents whose sentence count is in [3, 7500] (adjustable)
  - Command:
  ```bash
  python pipeline_step.py \
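The sentence-count range in the hunk above can be illustrated with a rough sketch. It assumes sentences are delimited by runs of `.`, `!`, or `?`, which is a simplification; the helper names are hypothetical and the repository's own splitter is not shown in this diff:

```python
import re


def count_sentences(text: str) -> int:
    """Count non-empty fragments between runs of sentence-ending punctuation."""
    fragments = re.split(r"[.!?]+", text)
    return sum(1 for fragment in fragments if fragment.strip())


def keep_by_sentence_count(text: str, min_sentences: int = 3, max_sentences: int = 7500) -> bool:
    """Return True when the sentence count lies in [min_sentences, max_sentences]."""
    return min_sentences <= count_sentences(text) <= max_sentences
```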
@@ -150,7 +150,7 @@ bash text_pipeline/run_sft_synthetic_new.sh
  --step_type process
  ```
  10. **LineEndWithEllipsisFilter**
- - Function: Proportionally filter text ending with ellipsis
+ - Function: Filter text in which the ratio of sentences ending with an ellipsis is greater than 0.3 (adjustable)
  - Command:
  ```bash
  python pipeline_step.py \
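A minimal sketch of the ellipsis-ratio check described in the hunk above. It assumes the ratio is computed over non-empty lines, as the operator name suggests, and that both `...` and `…` count as an ellipsis; the function names are hypothetical:

```python
def ellipsis_line_ratio(text: str) -> float:
    """Fraction of non-empty lines that end with an ellipsis."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    ending = sum(line.endswith(("...", "…")) for line in lines)
    return ending / len(lines)


def keep_by_ellipsis_ratio(text: str, threshold: float = 0.3) -> bool:
    """Keep text unless the ellipsis-ending ratio exceeds the threshold."""
    return ellipsis_line_ratio(text) <= threshold
```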
@@ -168,7 +168,7 @@ bash text_pipeline/run_sft_synthetic_new.sh
  --step_type process
  ```
  12. **MeanWordLengthFilter**
- - Function: Filter by average word length
+ - Function: Filter by average word length; keep text whose average word length is in [3, 10] (adjustable)
  - Command:
  ```bash
  python pipeline_step.py \
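The average-word-length range above can be sketched as follows, assuming whitespace tokenization and length measured in characters; the names are hypothetical and only illustrate the documented [3, 10] range:

```python
def mean_word_length(text: str) -> float:
    """Average number of characters per whitespace-delimited word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


def keep_by_mean_word_length(text: str, min_len: float = 3.0, max_len: float = 10.0) -> bool:
    """Return True when the mean word length lies in [min_len, max_len]."""
    return min_len <= mean_word_length(text) <= max_len
```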
@@ -177,7 +177,7 @@ bash text_pipeline/run_sft_synthetic_new.sh
  --step_type process
  ```
  13. **SymbolWordRatioFilter**
- - Function: Filter text with high symbol-to-word ratio
+ - Function: Filter text whose symbol-to-word ratio (symbols such as #) is greater than 0.4
  - Command:
  ```bash
  python pipeline_step.py \
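A rough sketch of a symbol-to-word ratio check like the one in the hunk above. The diff only names `#` as an example symbol, so the `SYMBOLS` tuple here is an assumption, as are the function names and the whitespace tokenization:

```python
# Assumed symbol set: only "#" is named in the documentation; the
# ellipsis forms are added here purely for illustration.
SYMBOLS = ("#", "...", "…")


def symbol_word_ratio(text: str) -> float:
    """Number of symbol occurrences divided by the number of words."""
    n_words = len(text.split())
    if n_words == 0:
        return 0.0
    n_symbols = sum(text.count(symbol) for symbol in SYMBOLS)
    return n_symbols / n_words


def keep_by_symbol_ratio(text: str, threshold: float = 0.4) -> bool:
    """Keep text unless the symbol-to-word ratio exceeds the threshold."""
    return symbol_word_ratio(text) <= threshold
```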
@@ -186,7 +186,7 @@ bash text_pipeline/run_sft_synthetic_new.sh
  --step_type process
  ```
  14. **HtmlEntityFilter**
- - Function: Filter text with excessive HTML entities
+ - Function: Filter text with excessive HTML entities, such as nbsp, lt, and gt
  - Command:
  ```bash
  python pipeline_step.py \
@@ -195,7 +195,7 @@ bash text_pipeline/run_sft_synthetic_new.sh
  --step_type process
  ```
  15. **IDCardFilter**
- - Function: Filter text containing ID card information
+ - Function: Privacy protection. Filter text containing ID card information, such as "身份证" (ID card) and "ID NO.".
  - Command:
  ```bash
  python pipeline_step.py \
@@ -213,7 +213,7 @@ bash text_pipeline/run_sft_synthetic_new.sh
  --step_type process
  ```
  17. **SpecialCharacterFilter**
- - Function: Filter text with many special characters
+ - Function: Filter text containing any special characters (such as r"u200e")
  - Command:
  ```bash
  python pipeline_step.py \
@@ -222,88 +222,81 @@ bash text_pipeline/run_sft_synthetic_new.sh
  --step_type process
  ```
  18. **WatermarkFilter**
- - Function: Filter text containing watermarks
+ - Function: Filter text containing watermarks, such as "Watermark" and "Copyright"
  - Command:
  ```bash
  python pipeline_step.py \
  --yaml_path text_pipeline/yaml/WatermarkFilter.yaml \
  --step_name WatermarkFilter \
  --step_type process
  ```
- 19. **StopWordFilter**
- - Function: Filter text with low stopword ratio
- - Command:
- ```bash
- python pipeline_step.py \
- --yaml_path text_pipeline/yaml/StopWordFilter.yaml \
- --step_name StopWordFilter \
- --step_type process
- ```
- 20. **CurlyBracketFilter**
- - Function: Filter text with high curly bracket ratio
+
+ 19. **CurlyBracketFilter**
+ - Function: Filter text with a curly bracket ratio greater than 0.025 (adjustable)
  - Command:
  ```bash
  python pipeline_step.py \
  --yaml_path text_pipeline/yaml/CurlyBracketFilter.yaml \
  --step_name CurlyBracketFilter \
  --step_type process
  ```
- 21. **CapitalWordsFilter**
- - Function: Filter text with high uppercase letter ratio
+ 20. **CapitalWordsFilter**
+ - Function: Filter text with an uppercase letter ratio greater than 0.2 (adjustable)
  - Command:
  ```bash
  python pipeline_step.py \
  --yaml_path text_pipeline/yaml/CapitalWordsFilter.yaml \
  --step_name CapitalWordsFilter \
  --step_type process
  ```
- 22. **LoremIpsumFilter**
- - Function: Filter text containing "lorem ipsum"
+ 21. **LoremIpsumFilter**
+ - Function: Filter text containing "lorem ipsum", a placeholder pseudotext commonly used in typesetting and design.
+
  - Command:
  ```bash
  python pipeline_step.py \
  --yaml_path text_pipeline/yaml/LoremIpsumFilter.yaml \
  --step_name LoremIpsumFilter \
  --step_type process
  ```
- 23. **UniqueWordsFilter**
- - Function: Filter text with few unique words
+ 22. **UniqueWordsFilter**
+ - Function: Filter text with a unique-word ratio below 0.1 (adjustable)
  - Command:
  ```bash
  python pipeline_step.py \
  --yaml_path text_pipeline/yaml/UniqueWordsFilter.yaml \
  --step_name UniqueWordsFilter \
  --step_type process
  ```
- 24. **CharNumberFilter**
- - Function: Filter text with few characters
+ 23. **CharNumberFilter**
+ - Function: Filter text with fewer than 100 characters (adjustable)
  - Command:
  ```bash
  python pipeline_step.py \
  --yaml_path text_pipeline/yaml/CharNumberFilter.yaml \
  --step_name CharNumberFilter \
  --step_type process
  ```
- 25. **LineStartWithBulletpointFilter**
- - Function: Filter text starting with bullet points
+ 24. **LineStartWithBulletpointFilter**
+ - Function: Filter text in which the ratio of lines starting with bullet points is greater than 0.9 (adjustable)
  - Command:
  ```bash
  python pipeline_step.py \
  --yaml_path text_pipeline/yaml/LineStartWithBulletpointFilter.yaml \
  --step_name LineStartWithBulletpointFilter \
  --step_type process
  ```
- 26. **LineWithJavascriptFilter**
- - Function: Filter text containing JavaScript
+ 25. **LineWithJavascriptFilter**
+ - Function: Filter text in which the number of lines containing JavaScript is greater than 3 (adjustable)
  - Command:
  ```bash
  python pipeline_step.py \
  --yaml_path text_pipeline/yaml/LineWithJavascriptFilter.yaml \
  --step_name LineWithJavascriptFilter \
  --step_type process
  ```
- 27. **PairQualFilter**
- - Function: Score text quality with a quality scorer
+ 26. **PairQualFilter**
+ - Function: Score text quality with a quality scorer based on the bge model that supports both Chinese and English. The scorer is trained on pairwise text comparisons scored by GPT.
  - Command:
  ```bash
  python pipeline_step.py \
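Several operators in the hunk above are ratio thresholds over characters or words. The sketch below shows one plausible reading of three of them (curly bracket ratio, uppercase letter ratio, unique-word ratio). The denominators are assumptions (all characters, alphabetic characters, and whitespace-delimited words respectively), and every function name is hypothetical rather than taken from the repository:

```python
def curly_bracket_ratio(text: str) -> float:
    """Share of characters that are '{' or '}' (assumed denominator: all characters)."""
    if not text:
        return 0.0
    return (text.count("{") + text.count("}")) / len(text)


def uppercase_ratio(text: str) -> float:
    """Share of alphabetic characters that are uppercase."""
    letters = [char for char in text if char.isalpha()]
    if not letters:
        return 0.0
    return sum(char.isupper() for char in letters) / len(letters)


def unique_word_ratio(text: str) -> float:
    """Distinct lowercased words divided by total words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def keep_by_ratios(
    text: str,
    max_curly: float = 0.025,
    max_upper: float = 0.2,
    min_unique: float = 0.1,
) -> bool:
    """Apply the three documented default thresholds; all are adjustable."""
    return (
        curly_bracket_ratio(text) <= max_curly
        and uppercase_ratio(text) <= max_upper
        and unique_word_ratio(text) >= min_unique
    )
```

In the pipeline each operator runs as its own step with its own YAML file; they are combined here only to keep the illustration short.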
@@ -317,7 +310,7 @@ bash text_pipeline/run_sft_synthetic_new.sh
  Based on **Pipeline 1**, add the following operators:
 
  1. **PretrainGenerator**
- - Function: Use Qwen2.5-7b to synthesize phi-4-style QA pair data from seed documents
+ - Function: Use an LLM to synthesize phi-4-style QA pair data from seed documents
  - Command:
  ```bash
  python pipeline_step.py \
@@ -326,7 +319,7 @@ Based on **Pipeline 1**, add the following operators:
  --step_type generator
  ```
  2. **QuratingFilter**
- - Function: Score and filter synthesized text across writing_style, required_expertise, facts_and_trivia, educational_value dimensions
+ - Function: Score and filter synthesized text across writing_style, required_expertise, facts_and_trivia, educational_value dimensions. [Model](https://github.com/princeton-nlp/QuRating)
  - Command:
  ```bash
  python pipeline_step.py \
@@ -338,7 +331,7 @@ Based on **Pipeline 1**, add the following operators:
  ### 4.3 SFT Data Filtering Pipeline
 
  1. **WordNumberFilter**
- - Function: Filter by output length, keep between 20–1000 words
+ - Function: Filter by output length, keep between 20–1000 words (adjustable)
  - Command:
  ```bash
  python pipeline_step.py \
@@ -347,7 +340,7 @@ Based on **Pipeline 1**, add the following operators:
  --step_type process
  ```
  2. **SuperfilteringFilter**
- - Function: Filter by instruction IFD score
+ - Function: Filter by instruction IFD score. [Model](https://github.com/tianyi-lab/Superfiltering)
  - Command:
  ```bash
  python pipeline_step.py \
@@ -356,7 +349,7 @@ Based on **Pipeline 1**, add the following operators:
  --step_type process
  ```
  3. **DeitaQualityFilter**
- - Function: Filter by instruction quality score
+ - Function: Filter by instruction quality score. [Model](https://huggingface.co/hkust-nlp/deita-quality-scorer)
  - Command:
  ```bash
  python pipeline_step.py \
@@ -365,7 +358,7 @@ Based on **Pipeline 1**, add the following operators:
  --step_type process
  ```
  4. **InstagFilter**
- - Function: Filter by number of instruction tags
+ - Function: Filter by number of instruction tags. [Model](https://github.com/OFA-Sys/InsTag)
  - Command:
  ```bash
  python pipeline_step.py \
