lshpku commented Oct 23, 2025

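Migrate PaddleNLP's pretraining data conversion tool

Original script: PaddleNLP/llm/tools/preprocess/create_pretraining_data.py

Original usage documentation: https://paddlenlp.readthedocs.io/en/latest/llm/dataset.html

The only line actually changed is from ernie.tokenizer import Ernie4_5_Tokenizer. If PaddleNLP had not dropped support for EB models, the script in PaddleNLP could be used directly.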


paddle-bot bot commented Oct 23, 2025

Thanks for your contribution!
