Changes from 115 commits
Commits
129 commits
a40d776
temp
tolgacangoz Aug 29, 2025
4be705f
template2
tolgacangoz Aug 29, 2025
cd18245
up
tolgacangoz Aug 29, 2025
bbe282f
fix-copies
tolgacangoz Aug 29, 2025
41fba83
upp
tolgacangoz Aug 29, 2025
1a0059f
Refactor WanSpeechToVideoPipeline: remove unused image encoder and up…
tolgacangoz Aug 29, 2025
44f4866
encoding image to audio
tolgacangoz Aug 29, 2025
933b618
Refactor Wan Speech-to-Video audio encoding
tolgacangoz Aug 30, 2025
6d55c93
up
tolgacangoz Aug 30, 2025
e6f6a22
up
tolgacangoz Aug 30, 2025
313fea5
up
tolgacangoz Aug 30, 2025
4ac9339
Improve Wan S2V pipeline
tolgacangoz Sep 1, 2025
66ec4ff
up
tolgacangoz Sep 1, 2025
d6ec465
Refactor latent preparation for S2V
tolgacangoz Sep 1, 2025
a463c09
up
tolgacangoz Sep 1, 2025
65191a9
feat: Add audio, pose, and advanced motion conditioning
tolgacangoz Sep 1, 2025
7925229
Refactor `WanS2V` transformer and introduce FramePack motioner
tolgacangoz Sep 1, 2025
323049d
Removes unused code from the speech-to-video pipeline
tolgacangoz Sep 2, 2025
6515b23
Refactor WanS2VTransformer and improve conditioning
tolgacangoz Sep 2, 2025
fe5a626
Add `AttentionMixin` to `WanS2VTransformer3DModel`
tolgacangoz Sep 2, 2025
bb5f10a
fix: Update parameter name for audio encoder to `num_attention_heads`
tolgacangoz Sep 2, 2025
bb5f4c9
feat: Improve support for S2V model conversion
tolgacangoz Sep 2, 2025
f6fb523
simplify
tolgacangoz Sep 3, 2025
dfec152
up
tolgacangoz Sep 3, 2025
21cd65f
refactor: Simplify AdaLayerNorm initialization and forward method
tolgacangoz Sep 3, 2025
89b9bcb
fix: Correct parameter value for pose_dim and name for num_attention_…
tolgacangoz Sep 3, 2025
167bd23
fix: Update audio injector to use WanTransformerBlock instead of WanA…
tolgacangoz Sep 3, 2025
9b6bf4b
upp
tolgacangoz Sep 3, 2025
4bed628
feat: Add audio injector attention mappings to transformer key renaming
tolgacangoz Sep 3, 2025
a112328
up docs
tolgacangoz Sep 3, 2025
0685646
Merge branch 'main' into integrations/wan2.2-s2v
tolgacangoz Sep 3, 2025
c798d93
Adapt the `WanS2VTransformerBlock` to handle the new `temb` format, w…
tolgacangoz Sep 3, 2025
d612c41
style
tolgacangoz Sep 3, 2025
f1ef8fa
Simplify
tolgacangoz Sep 3, 2025
30be7e8
Remove unused audio encoder import
tolgacangoz Sep 3, 2025
7ee98eb
Fix typo
tolgacangoz Sep 3, 2025
4674ead
up
tolgacangoz Sep 3, 2025
6ee3b85
simplify
tolgacangoz Sep 4, 2025
508cf8d
Adding rope for hidden states and image
tolgacangoz Sep 4, 2025
1fcfeba
style
tolgacangoz Sep 4, 2025
74d6381
Refactor ropes
tolgacangoz Sep 4, 2025
9cd08bc
refactor
tolgacangoz Sep 4, 2025
fd3af1d
up
tolgacangoz Sep 4, 2025
97991aa
Preserve the lost dimension explicitly
tolgacangoz Sep 5, 2025
2048861
Use complex rope temporarily
tolgacangoz Sep 5, 2025
17166e2
upp
tolgacangoz Sep 5, 2025
b9a7149
style
tolgacangoz Sep 5, 2025
244005a
fix
tolgacangoz Sep 5, 2025
fde574d
fix: correct key names in S2V transformer mapping for audio components
tolgacangoz Sep 5, 2025
83567bf
fixes
tolgacangoz Sep 5, 2025
551c74e
Fix errors encountering during inference
tolgacangoz Sep 5, 2025
8341218
up
tolgacangoz Sep 5, 2025
6663e58
tolgacangoz Sep 5, 2025
a9b08de
Fix bugs and improve stability in WanSpeechToVideo model
tolgacangoz Sep 6, 2025
86123d9
style
tolgacangoz Sep 6, 2025
8064c42
Enhance load_audio function to support audio loading from URLs using …
tolgacangoz Sep 6, 2025
ac16d5d
upp
tolgacangoz Sep 7, 2025
dbc0764
add _repeated_blocks
tolgacangoz Sep 7, 2025
80a2fbe
up
tolgacangoz Sep 7, 2025
acc8ecb
fix
tolgacangoz Sep 7, 2025
a729033
up
tolgacangoz Sep 7, 2025
a0d5217
fix previous latensts
tolgacangoz Sep 7, 2025
4fd1014
set deterministic for fa2
tolgacangoz Sep 7, 2025
3773de3
up
tolgacangoz Sep 7, 2025
d0e3e26
update example docstring
tolgacangoz Sep 7, 2025
fe41edd
upp
tolgacangoz Sep 8, 2025
f5439e1
up
tolgacangoz Sep 8, 2025
eef629d
Merge branch 'main' into integrations/wan2.2-s2v
tolgacangoz Sep 8, 2025
d72f549
style
tolgacangoz Sep 8, 2025
5eea0c7
Enhance load_video function with frame sampling options and reverse p…
tolgacangoz Sep 8, 2025
33e5b67
Refactor load_pose_condition method to simplify pose video handling a…
tolgacangoz Sep 8, 2025
9562c26
style
tolgacangoz Sep 8, 2025
a2c1952
Merge branch 'main' into integrations/wan2.2-s2v
tolgacangoz Sep 8, 2025
8c4c018
Update parameter descriptions and simplify tensor operations in WanSp…
tolgacangoz Sep 8, 2025
12facf8
Propose to vectorize to assume each element in a batch standard, same
tolgacangoz Sep 8, 2025
4ab5547
style
tolgacangoz Sep 8, 2025
4e5f357
Fix pose_video tensor initialization to use correct dtype and device
tolgacangoz Sep 8, 2025
b9224b9
up
tolgacangoz Sep 8, 2025
bcf71db
ıp
tolgacangoz Sep 8, 2025
c248b6d
fix
tolgacangoz Sep 8, 2025
206bbaa
up
tolgacangoz Sep 8, 2025
a126570
Fix mask_input tensor shape and dimension in WanS2VTransformer3DModel
tolgacangoz Sep 9, 2025
bd0b72e
Fix mask_input tensor indexing in WanS2VTransformer3DModel
tolgacangoz Sep 9, 2025
29bddb5
up docs
tolgacangoz Sep 10, 2025
37a44c2
Merge branch 'main' into integrations/wan2.2-s2v
tolgacangoz Sep 10, 2025
746514f
Adds video and audio merging functionality in docs
tolgacangoz Sep 10, 2025
b2e57b8
fix: initialize pose_video variable in WanSpeechToVideoPipeline
tolgacangoz Sep 10, 2025
1321330
Merge branch 'main' into integrations/wan2.2-s2v
tolgacangoz Sep 11, 2025
f7fbf36
Enables passing attention kwargs
tolgacangoz Sep 12, 2025
111085f
Propose flash attention with precomputed max_seqlen_k-only
tolgacangoz Sep 12, 2025
11d98e1
Merge branch 'main' into integrations/wan2.2-s2v
tolgacangoz Sep 15, 2025
8b58f63
style
tolgacangoz Sep 15, 2025
66e58b8
Propose to add `FP32RMSNorm`
tolgacangoz Sep 15, 2025
f503a26
Fix argument unpacking in audio injector call in WanS2VTransformer3DM…
tolgacangoz Sep 16, 2025
9fe3596
Remove `FP32RMSNorm`
tolgacangoz Sep 17, 2025
3542a46
Update src/diffusers/models/transformers/transformer_wan_s2v.py
tolgacangoz Sep 17, 2025
6761385
Update src/diffusers/models/transformers/transformer_wan_s2v.py
tolgacangoz Sep 17, 2025
e0b8ce9
Update src/diffusers/models/transformers/transformer_wan_s2v.py
tolgacangoz Sep 17, 2025
b8b6709
Update src/diffusers/models/transformers/transformer_wan_s2v.py
tolgacangoz Sep 17, 2025
d2840fc
Update module names
tolgacangoz Sep 17, 2025
a6c1b27
Adds `export_to_merged_video_audio` utility
tolgacangoz Sep 17, 2025
2f09d10
style
tolgacangoz Sep 17, 2025
454b442
Merge branch 'main' into integrations/wan2.2-s2v
tolgacangoz Sep 17, 2025
d837dfc
Refactor audio injection logic
tolgacangoz Sep 17, 2025
d9fd755
style
tolgacangoz Sep 17, 2025
e5ab1dd
Update src/diffusers/models/transformers/transformer_wan_s2v.py
tolgacangoz Sep 17, 2025
c6e8fa4
Update src/diffusers/models/transformers/transformer_wan_s2v.py
tolgacangoz Sep 17, 2025
62dc61e
Refactors audio injection logic
tolgacangoz Sep 17, 2025
6b98ebd
Refactor adain mode handling
tolgacangoz Sep 17, 2025
8665fd5
style
tolgacangoz Sep 17, 2025
dd15817
revert
tolgacangoz Sep 18, 2025
9f4edb4
Take `AdaLayerNorm` from `normalization`
tolgacangoz Sep 18, 2025
52ffc49
style
tolgacangoz Sep 18, 2025
6196332
Refactor audio encoder with weighted average layer
tolgacangoz Sep 18, 2025
5c50519
style
tolgacangoz Sep 18, 2025
d57448a
Merge branch 'main' into integrations/wan2.2-s2v
tolgacangoz Sep 24, 2025
ee1f6ff
Enhance image resizing functionality with additional options for resi…
tolgacangoz Sep 24, 2025
e15d3f6
Add resize_mode parameter to preprocess_video for flexible video resi…
tolgacangoz Sep 24, 2025
bc2165a
Refactor video processing in WanSpeechToVideoPipeline to support bili…
tolgacangoz Sep 24, 2025
0bf98b6
style
tolgacangoz Sep 24, 2025
226a451
Add `Motioner` class for _simple_ motion processing in `WanS2VTransfo…
tolgacangoz Sep 24, 2025
aef52d3
Merge branch 'main' into integrations/wan2.2-s2v
tolgacangoz Sep 24, 2025
7122b61
Add `WanS2VCausalConvLayer` for modularism
tolgacangoz Sep 24, 2025
9dab88f
style
tolgacangoz Sep 24, 2025
70ef9c3
Add CP configs
tolgacangoz Sep 24, 2025
2d6176b
Update attention dispather usage
tolgacangoz Sep 24, 2025
77da3e3
Refactor example docstring for aspect ratio resizing and update num_f…
tolgacangoz Sep 24, 2025
9f61f5c
up docs
tolgacangoz Sep 24, 2025
079dd7d
up docs
tolgacangoz Sep 25, 2025
219 changes: 203 additions & 16 deletions docs/source/en/api/pipelines/wan.md
@@ -40,6 +40,7 @@ The following Wan models are supported in Diffusers:
- [Wan 2.2 T2V 14B](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers)
- [Wan 2.2 I2V 14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers)
- [Wan 2.2 TI2V 5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B-Diffusers)
- [Wan 2.2 S2V 14B](https://huggingface.co/Wan-AI/Wan2.2-S2V-14B-Diffusers)

> [!TIP]
> Click on the Wan models in the right sidebar for more examples of video generation.
@@ -95,15 +96,15 @@ pipeline = WanPipeline.from_pretrained(
pipeline.to("cuda")

prompt = """
The camera rushes from far to near in a low-angle shot,
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground.
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic
The camera rushes from far to near in a low-angle shot,
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground.
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""
negative_prompt = """
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality,
low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured,
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality,
low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured,
misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
"""

@@ -150,15 +151,15 @@ pipeline.transformer = torch.compile(
)

prompt = """
The camera rushes from far to near in a low-angle shot,
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground.
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic
The camera rushes from far to near in a low-angle shot,
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground.
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""
negative_prompt = """
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality,
low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured,
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality,
low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured,
misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
"""

@@ -236,6 +237,186 @@ export_to_video(output, "output.mp4", fps=16)
</hfoption>
</hfoptions>


### Wan-S2V: Audio-Driven Cinematic Video Generation

[Wan-S2V](https://huggingface.co/papers/2508.18621) by the Wan Team.

*Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance for scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing solutions. Additionally, we explore the versatility of our method through its applications in long-form video generation and precise video lip-sync editing.*

The example below demonstrates how to use the speech-to-video pipeline to generate a video from a text description, a starting frame, an audio clip, and (optionally) a pose video.

<hfoptions id="S2V usage">
<hfoption id="usage">

```python
import math
import torch
from diffusers import AutoencoderKLWan, WanSpeechToVideoPipeline
from diffusers.utils import export_to_video, load_image, load_audio, load_video
from transformers import Wav2Vec2ForCTC
import requests
from PIL import Image
from io import BytesIO


model_id = "Wan-AI/Wan2.2-S2V-14B-Diffusers"
audio_encoder = Wav2Vec2ForCTC.from_pretrained(model_id, subfolder="audio_encoder", dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanSpeechToVideoPipeline.from_pretrained(
    model_id, vae=vae, audio_encoder=audio_encoder, torch_dtype=torch.bfloat16
)
pipe.to("cuda")

headers = {"User-Agent": "Mozilla/5.0"}
url = "https://upload.wikimedia.org/wikipedia/commons/4/46/Albert_Einstein_sticks_his_tongue.jpg"
resp = requests.get(url, headers=headers, timeout=30)
image = Image.open(BytesIO(resp.content))

audio, sampling_rate = load_audio("https://github.com/Wan-Video/Wan2.2/raw/refs/heads/main/examples/Five%20Hundred%20Miles.MP3")
#pose_video_path_or_url = "https://github.com/Wan-Video/Wan2.2/raw/refs/heads/main/examples/pose.mp4"

def get_size_less_than_area(height,
                            width,
                            target_area=1024 * 704,
                            divisor=64):
    if height * width <= target_area:
        # If the original image area is already less than or equal to the target,
        # no resizing is needed, only padding. Still need to ensure that the padded area doesn't exceed the target.
        max_upper_area = target_area
        min_scale = 0.1
        max_scale = 1.0
    else:
        # Resize to fit within the target area and then pad to multiples of `divisor`
        max_upper_area = target_area  # Maximum allowed total pixel count after padding
        d = divisor - 1
        b = d * (height + width)
        a = height * width
        c = d**2 - max_upper_area

        # Calculate scale boundaries using quadratic equation
        min_scale = (-b + math.sqrt(b**2 - 2 * a * c)) / (2 * a)  # Scale when maximum padding is applied
        max_scale = math.sqrt(max_upper_area / (height * width))  # Scale without any padding

    # We want to choose the largest possible scale such that the final padded area does not exceed max_upper_area
    # Scan from max_scale down towards min_scale in 100 even steps and stop at the first scale that fits
    find_it = False
    for i in range(100):
        scale = max_scale - (max_scale - min_scale) * i / 100
        new_height, new_width = int(height * scale), int(width * scale)

        # Pad to make dimensions divisible by `divisor`
        pad_height = (divisor - new_height % divisor) % divisor
        pad_width = (divisor - new_width % divisor) % divisor
        pad_top = pad_height // 2
        pad_bottom = pad_height - pad_top
        pad_left = pad_width // 2
        pad_right = pad_width - pad_left

        padded_height, padded_width = new_height + pad_height, new_width + pad_width

        if padded_height * padded_width <= max_upper_area:
            find_it = True
            break

    if find_it:
        return padded_height, padded_width
    else:
        # Fallback: calculate target dimensions based on aspect ratio and divisor alignment
        aspect_ratio = width / height
        target_width = int(
            (target_area * aspect_ratio)**0.5 // divisor * divisor)
        target_height = int(
            (target_area / aspect_ratio)**0.5 // divisor * divisor)

        # Ensure the result is not larger than the original resolution
        if target_width >= width or target_height >= height:
            target_width = int(width // divisor * divisor)
            target_height = int(height // divisor * divisor)

        return target_height, target_width

def aspect_ratio_resize(image, pipe, max_area):
    height, width = get_size_less_than_area(image.size[1], image.size[0], target_area=max_area)
    image = image.resize((width, height))
    return image, height, width

image, height, width = aspect_ratio_resize(image, pipe, 480*832)

prompt = "Einstein singing a song."

output = pipe(
    prompt=prompt, image=image, audio=audio, sampling_rate=sampling_rate,
    height=height, width=width, num_frames_per_chunk=81,
    #pose_video_path_or_url=pose_video_path_or_url,
).frames[0]
export_to_video(output, "output.mp4", fps=16)

# Lastly, we need to merge the video and audio into a new video, with the duration set to
# the shorter of the two and overwrite the original video file.

import os, logging, subprocess, shutil

def merge_video_audio(video_path: str, audio_path: str):
    logging.basicConfig(level=logging.INFO)

    if not os.path.exists(video_path):
        raise FileNotFoundError(f"video file {video_path} does not exist")
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"audio file {audio_path} does not exist")

    base, ext = os.path.splitext(video_path)
    temp_output = f"{base}_temp{ext}"

    try:
        # Create ffmpeg command
        command = [
            'ffmpeg',
            '-y',  # overwrite
            '-i',
            video_path,
            '-i',
            audio_path,
            '-c:v',
            'copy',  # copy video stream
            '-c:a',
            'aac',  # use AAC audio encoder
            '-b:a',
            '192k',  # set audio bitrate (optional)
            '-map',
            '0:v:0',  # select the first video stream
            '-map',
            '1:a:0',  # select the first audio stream
            '-shortest',  # choose the shortest duration
            temp_output
        ]

        # Execute the command
        logging.info("Start merging video and audio...")
        result = subprocess.run(
            command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

        # Check result
        if result.returncode != 0:
            error_msg = f"FFmpeg execute failed: {result.stderr}"
            logging.error(error_msg)
            raise RuntimeError(error_msg)

        shutil.move(temp_output, video_path)
        logging.info(f"Merge completed, saved to {video_path}")

    except Exception as e:
        if os.path.exists(temp_output):
            os.remove(temp_output)
        logging.error(f"merge_video_audio failed with error: {e}")

# "audio.mp3" must be a local copy of the audio file that was passed to the pipeline above.
merge_video_audio("output.mp4", "audio.mp3")
```
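
The sizing helper in the example pads each dimension up to a multiple of 64 and then checks the padded area against the target. As a minimal sketch of just that step (the names `pad_to_multiple` and `fits_target_area` are hypothetical and not part of Diffusers), the arithmetic looks like this:

```python
def pad_to_multiple(value: int, divisor: int = 64) -> int:
    """Round `value` up to the nearest multiple of `divisor` (hypothetical helper)."""
    return value + (divisor - value % divisor) % divisor


def fits_target_area(height: int, width: int, target_area: int, divisor: int = 64) -> bool:
    """Return True if the padded height x width stays within `target_area` pixels."""
    padded_height = pad_to_multiple(height, divisor)
    padded_width = pad_to_multiple(width, divisor)
    return padded_height * padded_width <= target_area


print(pad_to_multiple(480))                   # 512
print(pad_to_multiple(832))                   # 832 (already a multiple of 64)
print(fits_target_area(480, 832, 480 * 832))  # False: 512 * 832 exceeds 480 * 832
```

This is why the example searches over scales rather than resizing straight to the target area: padding can push an otherwise valid size over the limit.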

</hfoption>
</hfoptions>
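
Because the merge step uses `-shortest`, the final clip is as long as the shorter of the video and audio streams. The sketch below is rough arithmetic only, using the `num_frames_per_chunk=81` and `fps=16` values from the example. It assumes each chunk contributes exactly `num_frames_per_chunk` new frames, which may not match how the pipeline actually schedules chunks, and both function names are hypothetical.

```python
import math


def chunk_duration_seconds(num_frames_per_chunk: int = 81, fps: int = 16) -> float:
    """Playback length of one chunk at the given frame rate."""
    return num_frames_per_chunk / fps


def chunks_needed(audio_seconds: float, num_frames_per_chunk: int = 81, fps: int = 16) -> int:
    """Chunks required to cover the audio, assuming non-overlapping chunks (an assumption)."""
    return math.ceil(audio_seconds * fps / num_frames_per_chunk)


print(chunk_duration_seconds())  # 5.0625
print(chunks_needed(30.0))       # 6
```

Under these assumptions, a 30-second audio clip would need six chunks of about five seconds each to be fully covered.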


### Any-to-Video Controllable Generation

Wan VACE supports various generation techniques which achieve controllable video generation. Some of the capabilities include:
@@ -281,10 +462,10 @@ The general rule of thumb to keep in mind when preparing inputs for the VACE pip

# use "steamboat willie style" to trigger the LoRA
prompt = """
steamboat willie style, golden era animation, The camera rushes from far to near in a low-angle shot,
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground.
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic
steamboat willie style, golden era animation, The camera rushes from far to near in a low-angle shot,
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground.
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""

@@ -353,6 +534,12 @@ The general rule of thumb to keep in mind when preparing inputs for the VACE pip
- all
- __call__

## WanSpeechToVideoPipeline

[[autodoc]] WanSpeechToVideoPipeline
- all
- __call__

## WanVideoToVideoPipeline

[[autodoc]] WanVideoToVideoPipeline