Add Wan2.2-S2V #12258
base: main
Conversation
…date example imports. Add unit tests for WanSpeechToVideoPipeline, WanS2VTransformer3DModel, and GGUF loading.
The previous audio encoding logic was a placeholder. It is now replaced with a `Wav2Vec2ForCTC` model and processor, including the full implementation for processing audio inputs. This involves resampling and aligning audio features with video frames to ensure proper synchronization. Additionally, utility functions for loading audio from files or URLs are added, and the `audio_processor` module is refactored to correctly handle audio data types instead of image types.
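For illustration, a minimal sketch of the kind of audio encoding and frame alignment described above, assuming a Wav2Vec2 checkpoint, a 16 kHz target rate, and an `encode_audio` helper name that are not part of the actual pipeline code:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Assumed checkpoint; the pipeline may ship its own audio encoder weights.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")


def encode_audio(waveform: torch.Tensor, sample_rate: int, num_frames: int) -> torch.Tensor:
    """Resample audio to 16 kHz, encode it, and align features to video frames."""
    if waveform.dim() > 1:
        waveform = waveform.mean(dim=0)  # downmix to mono
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values, output_hidden_states=True).hidden_states[-1]
    # hidden: (1, audio_seq_len, dim) -> interpolate along time to one feature per video frame
    aligned = torch.nn.functional.interpolate(
        hidden.transpose(1, 2), size=num_frames, mode="linear", align_corners=False
    ).transpose(1, 2)
    return aligned  # (1, num_frames, dim)
```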
Introduces support for audio and pose conditioning, replacing the previous image conditioning mechanism. The model now accepts audio embeddings and pose latents as input. This change also adds two new, mutually exclusive motion processing modules:
- `MotionerTransformers`: a transformer-based module for encoding motion.
- `FramePackMotioner`: a module that packs frames from different temporal buckets for motion representation.

Additionally, an `AudioInjector` module is implemented to fuse audio features into specific transformer blocks using cross-attention (see the sketch below).
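As a rough illustration of cross-attention-based audio fusion, here is a minimal sketch; the class name, dimensions, and residual structure are assumptions, not the actual `AudioInjector` implementation:

```python
import torch
import torch.nn as nn


class AudioCrossAttentionFusion(nn.Module):
    """Fuses audio features into transformer hidden states via cross-attention."""

    def __init__(self, dim: int = 1024, audio_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden_states: torch.Tensor, audio_features: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, dim); audio_features: (batch, num_frames, audio_dim)
        audio = self.audio_proj(audio_features)
        fused, _ = self.attn(query=self.norm(hidden_states), key=audio, value=audio)
        return hidden_states + fused  # residual injection
```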
The `MotionerTransformers` module is removed and its functionality is replaced by a `FramePackMotioner` module and a simplified standard motion processing pipeline. The codebase is refactored to remove the `einops` dependency, replacing `rearrange` operations with standard PyTorch tensor manipulations for better code consistency. Additionally, `AdaLayerNorm` is introduced for improved conditioning, and helper functions for Rotary Positional Embeddings (RoPE) are added (probably temporarily) and refactored for clarity and flexibility. The audio injection mechanism is also updated to align with the new model structure.
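For example, an einops pattern of the kind removed here can be expressed with plain tensor ops; the shapes below are illustrative, not the exact calls changed in this PR:

```python
import torch

x = torch.randn(2, 16, 21, 30, 52)  # (batch, channels, frames, height, width)

# einops: rearrange(x, "b c f h w -> b (f h w) c")
y = x.flatten(2).transpose(1, 2)  # (batch, frames*height*width, channels)

# einops: rearrange(y, "b (f h w) c -> b c f h w", f=21, h=30, w=52)
z = y.transpose(1, 2).reshape(2, 16, 21, 30, 52)

assert torch.equal(x, z)
```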
Removes the calculation of several unused variables and an unnecessary `deepcopy` operation on the latents tensor. This change also removes the now-unused `deepcopy` import, simplifying the overall logic.
Refactors the `WanS2VTransformer3DModel` for clarity and better handling of various conditioning inputs like audio, pose, and motion. Key changes:
- Simplifies the `WanS2VTransformerBlock` by removing projection layers and streamlining the forward pass.
- Introduces `after_transformer_block` to cleanly inject audio information after each transformer block, improving code organization (see the sketch below).
- Enhances the main `forward` method to better process and combine multiple conditioning signals (image, audio, motion) before the transformer blocks.
- Adds support for a zero-value timestep to differentiate between image and video latents.
- Generalizes the temporal embedding logic to support multiple model variations.
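Schematically, the post-block injection pattern looks like the toy loop below; the block stand-ins, layer indices, and `inject_audio` helper are assumptions used only to show the control flow:

```python
import torch
import torch.nn as nn

# Toy stand-ins to illustrate the control flow only (not the real blocks).
blocks = nn.ModuleList([nn.Identity() for _ in range(8)])
audio_injection_layers = {1, 3, 5, 7}  # assumed indices, configured at model init


def inject_audio(hidden_states: torch.Tensor, audio_features: torch.Tensor) -> torch.Tensor:
    # Placeholder for the cross-attention fusion sketched earlier.
    return hidden_states + audio_features.mean(dim=1, keepdim=True)


hidden_states = torch.randn(1, 128, 64)
audio_features = torch.randn(1, 16, 64)

for i, block in enumerate(blocks):
    hidden_states = block(hidden_states)
    if i in audio_injection_layers:
        hidden_states = inject_audio(hidden_states, audio_features)
```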
Introduces the necessary configurations and state dictionary key mappings to enable the conversion of S2V model checkpoints to the Diffusers format. This includes:
- A new transformer configuration for the S2V model architecture, including parameters for audio and pose conditioning.
- A comprehensive rename dictionary mapping the original S2V layer names to their Diffusers equivalents.
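Conversion scripts in diffusers commonly express such mappings as a substring-rename dictionary applied over the original state dict; here is a generic sketch in which the key names are made up and do not reflect the actual S2V mapping:

```python
# Hypothetical entries only; the real mapping lives in the conversion script.
RENAME_DICT = {
    "audio_proj.": "condition_embedder.audio_embedder.",
    "pose_patch_embedding.": "condition_embedder.pose_embedder.",
}


def convert_state_dict(original: dict) -> dict:
    """Rename original checkpoint keys to their Diffusers equivalents."""
    converted = {}
    for key, value in original.items():
        new_key = key
        for old, new in RENAME_DICT.items():
            new_key = new_key.replace(old, new)
        converted[new_key] = value
    return converted
```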
…heads in transformer configuration
Co-authored-by: YiYi Xu <[email protected]>
Adds a utility function to merge video and audio files using ffmpeg. This simplifies the process of combining audio and video outputs, especially useful in pipelines like WanSpeechToVideoPipeline. The function handles temporary file creation, command execution, and error handling for a more robust merging process.
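A minimal sketch of such a merge helper follows; the function name, argument layout, and defaults are assumptions, not the actual utility added in this PR:

```python
import subprocess
import tempfile
from typing import Optional


def merge_video_audio(video_path: str, audio_path: str, output_path: Optional[str] = None) -> str:
    """Mux a video file with an audio track using ffmpeg, re-encoding audio to AAC."""
    if output_path is None:
        output_path = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False).name
    cmd = [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-c:v", "copy",   # keep the video stream as-is
        "-c:a", "aac",    # encode the audio stream to AAC
        "-shortest",      # stop at the shorter of the two streams
        output_path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"ffmpeg failed: {result.stderr}")
    return output_path
```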
Consolidates audio injection functionality by moving the `after_transformer_block` method into the `AudioInjector` class. This change improves code organization and encapsulation, making the injection process more modular and maintainable.
Co-authored-by: YiYi Xu <[email protected]>
Simplifies the audio injection process by directly passing injection layer indices to the `AudioInjector`. This removes the need for a depth-first search and dictionary creation within the injector, making the code more efficient and readable.
hi @tolgacangoz

    motion_latents = videos_last_latents.to(dtype=motion_latents.dtype, device=motion_latents.device)

    # Accumulate latents so as to decode them all at once at the end
    all_latents.append(segment_latents)
Hello, when testing on my side I saw duplicated frames between the end of a chunk and the start of the next chunk. It seems that one motion frame is still at the start of `segment_latents` in this line for all chunks except the first.

I think it's coming from the fact that `num_latent_frames` (21 with default inputs) is not the same as in `self.prepare_latents` (20); the extra 1 keeps the last motion frame. Replacing `all_latents.append(segment_latents)` with `all_latents.append(latents)` fixed it on my side; changing the `num_latent_frames` formula should do the same.

Super cool to do the chunking in latent space instead of video frames, I hope this helps a little :)
Hi, thanks a bunch for testing @gsprochette! I decided to follow the original repo, since this PR is supposed to be an integration PR rather than an extra-optimizations PR.
…ze and crop strategies
…near resampling and adjust frame chunk settings

Updates the speech-to-video pipeline to perform a decode-encode cycle within the generation loop for each video chunk. This change improves temporal consistency between chunks by using the pixels of the previously generated frames, rather than their latents, to condition the next chunk. Key changes include:
- Modifying the generation loop to decode latents into video frames, update the conditioning pixels, and then re-encode them for the next iteration's motion latents.
- Setting the default `num_frames_per_chunk` to 80 and adjusting the corresponding frame logic.
- Enabling `bilinear` resampling in the `VideoProcessor`.
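In schematic terms, the per-chunk decode-encode cycle looks roughly like the toy loop below; the VAE and denoiser stand-ins, shapes, and the motion-frame count are assumptions, not the pipeline's actual code:

```python
import torch


# Toy stand-ins for the denoiser and VAE, used only to illustrate the chunk loop.
def denoise_chunk(motion_latents: torch.Tensor) -> torch.Tensor:
    return torch.randn(1, 16, 20, 30, 52)  # denoised latents for this chunk


def vae_decode(latents: torch.Tensor) -> torch.Tensor:
    return torch.randn(1, 3, latents.shape[2] * 4, 240, 416)  # pixel frames


def vae_encode(frames: torch.Tensor) -> torch.Tensor:
    return torch.randn(1, 16, frames.shape[2] // 4, 30, 52)  # latents


num_chunks = 3
num_motion_frames = 73                            # pixel frames carried between chunks (assumed)
motion_latents = torch.zeros(1, 16, 19, 30, 52)   # initial motion conditioning (assumed shape)
all_frames = []

for chunk_idx in range(num_chunks):
    latents = denoise_chunk(motion_latents)
    frames = vae_decode(latents)                  # decode this chunk to pixels
    all_frames.append(frames)
    # Condition the next chunk on the *pixels* of the last generated frames,
    # re-encoded into motion latents, rather than reusing the raw latents.
    motion_latents = vae_encode(frames[:, :, -num_motion_frames:])

video = torch.cat(all_frames, dim=2)
```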
This PR fixes #12257.
This PR is ready for review except for these current TODOs:
`diffusers` assumes that `num_frames`, `height`, and `width` are the same within a batch, etc., as opposed to the original repo. There are many for loops in the original repo. This is my current priority now.

When I equalize several parameters to be able to produce the same/similar videos:
wan.mp4
diffusers.mp4
Try `WanSpeechToVideoPipeline`!

@yiyixuxu @sayakpaul @a-r-r-o-w @asomoza @DN6 @stevhliu
@WanX-Video-1 @Steven-SWZhang @kelseyee
@SHYuanBest @J4BEZ @okaris @xziayro-ai @teith @luke14free
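For anyone who wants to try it, a hypothetical usage sketch; the checkpoint id, the `audio` argument, and other parameter names are assumptions based on the descriptions above, not the final API of this PR:

```python
import torch
from diffusers import WanSpeechToVideoPipeline
from diffusers.utils import export_to_video

# Assumed checkpoint id and argument names; see the PR diff for the actual API.
pipe = WanSpeechToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-S2V-14B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

output = pipe(
    prompt="a person speaking enthusiastically to the camera",
    audio="speech.wav",           # local path or URL, loaded via the new audio utilities
    num_frames_per_chunk=80,      # default mentioned in the commits above
).frames[0]

export_to_video(output, "wan_s2v.mp4", fps=16)
```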