
Conversation

Gaiejj
Member

@Gaiejj Gaiejj commented Mar 26, 2025

Description

🎉 We have added support for SFT training of Qwen2.5-Omni within 1 hour! Here are the training screenshots 👇

[image: Qwen2.5-Omni SFT training screenshot]

Test

Please test your changes by running the following command:

cd scripts
bash test/test_text_to_text.sh ./opt PATH_TO_OUTPUT_ROOT_DIR

Here, ./opt is the directory containing the test scripts for the opt model, and PATH_TO_OUTPUT_ROOT_DIR is the path to the output root directory. The test scripts will save ~1 GB of data to the output root directory and delete it after the test. Please ensure you have enough disk space.

Lint

Please run the following command in the root directory to check your code style:

pip install pre-commit
pre-commit run --all-files

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

@deyituo

deyituo commented Mar 27, 2025

Could t2s (text-to-speech) be supported?

@Gaiejj
Member Author

Gaiejj commented Mar 27, 2025

We are actively working on the Talker module; fine-tuning with text-audio input should be ready today or tomorrow~

@DQYZHWK

DQYZHWK commented Mar 27, 2025

Are there plans to support full-parameter fine-tuning across three modalities (text system prompt, images, and speech instructions)?

@Gaiejj
Member Author

Gaiejj commented Mar 27, 2025

@DQYZHWK We are very interested in doing this, but unfortunately we lack the corresponding data. Do you have any references?

@Alex-Songs

@Gaiejj Does this support training with audio and images together, i.e., a single batch containing both speech and images?

@DQYZHWK

DQYZHWK commented Mar 27, 2025

@DQYZHWK We are very interested in doing this, but unfortunately we lack the corresponding data. Do you have any references?

Sorry, I don't have a relevant dataset.
https://mp.weixin.qq.com/s/hJ5x8xUstBjwNZc1mmqE-g
But you can refer to this article: a VQA dataset could be converted into an SQA dataset via TTS (chattts, fishspeech). I hope this demo can be integrated in the future.
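For illustration only, a minimal sketch of this VQA-to-SQA conversion idea (not from the thread's codebase): text_to_speech is a hypothetical helper that could wrap ChatTTS or FishSpeech, and the JSON field names question/image/answer are assumptions about the VQA format.

import json
from pathlib import Path

def text_to_speech(text: str, out_path: Path) -> None:
    """Hypothetical helper: synthesize `text` into a wav file at `out_path` with a TTS engine."""
    raise NotImplementedError

def vqa_to_sqa(vqa_json: Path, audio_dir: Path, sqa_json: Path) -> None:
    # Turn each VQA record's text question into a spoken instruction, producing an
    # SQA-style record with image + audio inputs and the original answer as the target.
    audio_dir.mkdir(parents=True, exist_ok=True)
    sqa_records = []
    for i, record in enumerate(json.loads(vqa_json.read_text())):
        wav_path = audio_dir / f"question_{i}.wav"
        text_to_speech(record["question"], wav_path)
        sqa_records.append({
            "image": record["image"],    # keep the visual input
            "audio": str(wav_path),      # synthesized speech instruction
            "answer": record["answer"],  # supervision target unchanged
        })
    sqa_json.write_text(json.dumps(sqa_records, ensure_ascii=False, indent=2))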

@Gaiejj
Member Author

Gaiejj commented Mar 27, 2025

@DQYZHWK @Alex-Songs Thanks for the suggestion. We will try this kind of tri-modal fine-tuning in the near future!

@zuitbjc1096

May I ask, has the fine-tuning script for qwen2.5-omni been removed?

@Gaiejj
Member Author

Gaiejj commented Mar 28, 2025

@zuitbjc1096 Hello, the code is here: https://github.com/Gaiejj/align-anything/tree/dev-omni

@Alex-Songs

@Gaiejj It looks like the transformers library used by qwen2.5-omni added a tp_plan parameter that requires torch>=2.5. Does the current fine-tuning code also require torch>=2.5?

@Gaiejj Gaiejj closed this Mar 31, 2025
@Gaiejj Gaiejj deleted the dev-omni branch March 31, 2025 08:48
@Gaiejj Gaiejj restored the dev-omni branch March 31, 2025 08:49
@Gaiejj
Member Author

Gaiejj commented Mar 31, 2025

@Alex-Songs Yes, you need to follow Qwen-2.5-Omni's official dependencies~
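As a side note, a minimal sketch of a version guard reflecting the torch >= 2.5 requirement mentioned above; the floor comes from this discussion, not from an official spec.

from packaging import version

import torch

# Fail fast if the installed torch is older than what the Qwen2.5-Omni
# transformers integration discussed above is said to require.
if version.parse(torch.__version__) < version.parse("2.5"):
    raise RuntimeError(
        f"Found torch {torch.__version__}, but torch >= 2.5 is expected for Qwen2.5-Omni fine-tuning."
    )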

@Gaiejj Gaiejj reopened this Mar 31, 2025
@Alex-Songs

@Gaiejj One more question: the qwen2.5-omni-7b weights are named like thinker.visual.blocks.11.attn.proj.weight. If I load the thinker directly, do I need to rename them to visual.blocks.11.attn.proj.weight?
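For illustration only, a minimal sketch of the renaming described in this question (not the repository's loading code); the checkpoint path is hypothetical.

import torch

full_state_dict = torch.load("qwen2.5-omni-7b.pt", map_location="cpu")  # hypothetical local checkpoint file
thinker_state_dict = {
    name[len("thinker."):]: tensor  # e.g. "thinker.visual.blocks.11.attn.proj.weight" -> "visual.blocks.11.attn.proj.weight"
    for name, tensor in full_state_dict.items()
    if name.startswith("thinker.")
}
# thinker_state_dict can then be passed to load_state_dict on a standalone thinker module.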

@shanhaidexiamo

I'd like to ask: can this model be fine-tuned directly if TMRoPE has not been implemented? Thanks.

@liu6381810

Hello, I looked through the qwen2.5-omni code. If we want to train the talker, constructing the training data requires a speech tokenizer to first tokenize the audio into speech codec ids, but the speech tokenizer doesn't seem to be open-sourced. How do you handle this?

@Gaiejj
Member Author

Gaiejj commented Apr 2, 2025

I'm not really an expert orz. We also ran into these issues during our recent implementation, and I think @Alex-Songs is right. We will post updates here as soon as we make progress~

@sky1170447398

@Alex-Songs Yes, you need to follow Qwen-2.5-Omni's official dependencies~

The tp_plan parameter causes an error when loading the pretrained model: raise NotImplementedError("This model does not have a tensor parallel plan."). Have you encountered this?

@Gaiejj
Member Author

Gaiejj commented Apr 7, 2025

Could you share a reproduction guide? We can help look into it.

@jiahui-w

jiahui-w commented Apr 8, 2025

Is fine-tuning with video + audio (the audio inside the video) + prompt supported now? The official code only shows image + prompt. Thank you very much.

@Kingdroper

We are actively working on the Talker module; fine-tuning with text-audio input should be ready today or tomorrow~

Can the talker module support changing the voice timbre now, for example fine-tuning with some other voices?

@zzchust

zzchust commented Apr 10, 2025

mark

@SeungyounShin

Like @liu6381810 mentioned, I faced the same issue and posted about it [here](https://huggingface.co/Qwen/Qwen2.5-Omni-7B/discussions/40) for the author's attention. However, there might be a reason why they haven't disclosed their speech tokenizer. Given that, I'm not currently expecting them to release it. It seems we'll likely need to train the talker component from scratch using our own voice data.

@pjgao

pjgao commented Apr 28, 2025

Is there any progress on fine-tuning the talker part?

@dongkeun-livetoon

CosyVoice also uses a speech tokenizer architecture. Maybe we can refer to it.

@wwfcnu

wwfcnu commented May 20, 2025

Does this currently support fine-tuning with (system prompt + text instruction + speech --> text)?

@wwfcnu

wwfcnu commented May 20, 2025

We are actively working on the Talker module; fine-tuning with text-audio input should be ready today or tomorrow~

I only see fine-tuning with text-image input in the code.

@candle1220

We are actively working on the Talker module; fine-tuning with text-audio input should be ready today or tomorrow~

The code still does not include fine-tuning for audio.

@Gaiejj
Member Author

Gaiejj commented Jul 4, 2025

Hey all! We sincerely apologize for underestimating the timeline and for the delayed response! During this period, we attempted to fine-tune the text-to-audio-to-text and text-to-audio functionality. However, due to the highly advanced architecture of qwen2.5-omni, our academic team lacked the necessary engineering expertise, which resulted in unexpectedly poor performance of the trained models. This is the primary reason for our prolonged silence.

We are continuing our efforts and will promptly report any breakthroughs. We also welcome community contributions through implementation references, which we will integrate into align-anything.

Once again, our deepest apologies.
