- Function: Filter text containing too many blocked words
+ - Function: Filter text containing too many blocked words. The blocklist is taken from [List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words)
- Function: Score text quality with a quality scorer
+ 26. **PairQualFilter**
+ - Function: Score text quality with a bge-based quality scorer that supports both Chinese and English. The scorer is trained on pairs of texts that GPT compared and scored.
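As a rough illustration of the blocked-word filter described above, the sketch below drops a text when the share of blocklisted tokens exceeds a threshold. The function names, the `max_ratio` parameter, and its default value are hypothetical and not taken from the repository; the sketch also assumes single-word blocklist entries, whereas the referenced list may contain multi-word phrases.

```python
import re


def blocked_word_ratio(text, blocklist):
    """Return the fraction of tokens in `text` that appear in `blocklist`."""
    # Lowercase and split into simple word tokens (assumption: whitespace/punctuation-delimited text).
    words = re.findall(r"[\w']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in blocklist)
    return hits / len(words)


def keep_text(text, blocklist, max_ratio=0.05):
    """Keep the text only if its blocked-word ratio stays at or below `max_ratio` (hypothetical threshold)."""
    return blocked_word_ratio(text, blocklist) <= max_ratio
```

A ratio is used here rather than an absolute count so that long documents are not penalized for a single hit; the actual operator may use a different criterion.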
Based on **Pipeline 1**, add the following operators:
1. **PretrainGenerator**
- - Function: Use Qwen2.5-7b to synthesize phi-4-style QA pair data from seed documents
+ - Function: Use an LLM to synthesize phi-4-style QA pair data from seed documents
- Command:
```bash
python pipeline_step.py \
@@ -326,7 +319,7 @@ Based on **Pipeline 1**, add the following operators:
--step_type generator
```
2. **QuratingFilter**
- - Function: Score and filter synthesized text across writing_style, required_expertise, facts_and_trivia, educational_value dimensions
+ - Function: Score and filter synthesized text across writing_style, required_expertise, facts_and_trivia, educational_value dimensions. [Model](https://github.com/princeton-nlp/QuRating)
- Command:
```bash
python pipeline_step.py \
@@ -338,7 +331,7 @@ Based on **Pipeline 1**, add the following operators:
### 4.3 SFT Data Filtering Pipeline
1. **WordNumberFilter**
- - Function: Filter by output length, keep between 20–1000 words
+ - Function: Filter by output length, keep between 20–1000 words (adjustable)
- Command:
```bash
python pipeline_step.py \
@@ -347,7 +340,7 @@ Based on **Pipeline 1**, add the following operators:
--step_type process
```
2. **SuperfilteringFilter**
- - Function: Filter by instruction IFD score
+ - Function: Filter by instruction IFD score. [Model](https://github.com/tianyi-lab/Superfiltering)
- Command:
```bash
python pipeline_step.py \
@@ -356,7 +349,7 @@ Based on **Pipeline 1**, add the following operators:
--step_type process
```
3. **DeitaQualityFilter**
- - Function: Filter by instruction quality score
+ - Function: Filter by instruction quality score. [Model](https://huggingface.co/hkust-nlp/deita-quality-scorer)
- Command:
```bash
python pipeline_step.py \
@@ -365,7 +358,7 @@ Based on **Pipeline 1**, add the following operators:
--step_type process
```
4. **InstagFilter**
- - Function: Filter by number of instruction tags
+ - Function: Filter by number of instruction tags. [Model](https://github.com/OFA-Sys/InsTag)
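
To make the WordNumberFilter step above more concrete, here is a minimal sketch of a word-count filter over SFT records. It assumes each record is a dict with an `output` field, counts words by whitespace splitting, and treats the 20 and 1000 bounds as inclusive; the function names and parameters are hypothetical and do not reflect the repository's actual implementation.

```python
def word_count(text):
    """Count whitespace-separated words (a simplification; no language-specific tokenization)."""
    return len(text.split())


def filter_by_word_count(records, key="output", min_words=20, max_words=1000):
    """Keep records whose `key` field has between `min_words` and `max_words` words, inclusive."""
    # Records missing the field are treated as empty and therefore dropped when min_words > 0.
    return [
        r for r in records
        if min_words <= word_count(r.get(key, "")) <= max_words
    ]
```

Exposing `min_words` and `max_words` as parameters mirrors the "(adjustable)" note in the operator description.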