Describe the bug

Hello, I encountered an issue with generation when attempting the I2V task using diffusers. Is there any difference between the diffusers implementation and the LTX-Video inference script for the I2V task?

- Prompt: "a person"
- Result from inference.py:
img_to_vid_0_a-person_42_512x512x161_0.mp4
- Results generated with diffusers (with and without the negative prompt):
diffusers_512x512_a_person.mp4
diffusers_without_negative_prompt_512x512_a_person.mp4
Besides, the text prompt seems to have a significant impact on I2V generation with diffusers. Could I be missing any important arguments?
https://huggingface.co/docs/diffusers/api/pipelines/ltx_video

- Results for progressively truncated prompts:
demo-A-young-girl-stands-calmly-in-the-foreground.mp4
demo-A-young-girl-stands-calmly.mp4
demo-A-young-girl-stands.mp4
demo-A-young-girl.mp4
Reproduction
- For LTX-Video generation, using the official inference.py:
https://github.com/Lightricks/LTX-Video/blob/main/inference.py
python inference.py \
--ckpt_path ./pretrained_models/LTX-Video \
--output_path './samples' \
--prompt "A person." \
--input_image_path ./samples/test_cases.png \
--height 512 \
--width 512 \
--num_frames 49 \
--seed 42
- For diffusers generation: it seems that the negative prompt is causing the issues; however, even when I remove it, the results are still not satisfactory. (A sketch of the no-negative-prompt variant follows the script below.)
import argparse
import random

import numpy as np
import torch
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image


def seed_everything(seed: int):
    # Seed all relevant RNGs for reproducibility.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)


def generate_video(args):
    pipe = LTXImageToVideoPipeline.from_pretrained(args.ltx_model_path, torch_dtype=torch.bfloat16)
    pipe.to("cuda")

    image = load_image(args.validation_image)
    prompt = "A person."
    negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

    generator = torch.Generator(
        device="cuda" if torch.cuda.is_available() else "cpu"
    ).manual_seed(42)

    video = pipe(
        image=image,
        prompt=prompt,
        negative_prompt=negative_prompt,
        guidance_scale=3,
        # stg_scale=1,
        generator=generator,
        callback_on_step_end=None,
        width=512,
        height=512,
        num_frames=49,
        num_inference_steps=50,
        decode_timestep=0.05,
        decode_noise_scale=0.025,
    ).frames[0]
    export_to_video(video, args.output_file, fps=24)


if __name__ == "__main__":
    # Defaults mirror the paths used in the inference.py command above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--ltx_model_path", default="./pretrained_models/LTX-Video")
    parser.add_argument("--validation_image", default="./samples/test_cases.png")
    parser.add_argument("--output_file", default="./samples/diffusers_512x512_a_person.mp4")
    seed_everything(42)
    generate_video(parser.parse_args())
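For completeness, the "without negative prompt" result above came from the same call with the negative prompt dropped. A minimal sketch of that variant, reusing pipe and image from the script above (the exact call I ran may have differed slightly):

# Variant behind diffusers_without_negative_prompt_512x512_a_person.mp4 (reconstructed):
# the same call as above, but with negative_prompt omitted, so the pipeline
# falls back to an empty negative prompt for classifier-free guidance.
video = pipe(
    image=image,
    prompt="A person.",
    guidance_scale=3,
    generator=torch.Generator(device="cuda").manual_seed(42),
    width=512,
    height=512,
    num_frames=49,
    num_inference_steps=50,
    decode_timestep=0.05,
    decode_noise_scale=0.025,
).frames[0]
export_to_video(video, "samples/diffusers_without_negative_prompt_512x512_a_person.mp4", fps=24)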
- For the demo videos with different text prompts, following the diffusers docs example (a sketch of the prompt sweep follows the script below):
https://huggingface.co/docs/diffusers/api/pipelines/ltx_video
import torch
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = LTXImageToVideoPipeline.from_pretrained("./pretrained_models/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = load_image("samples/image.png")
prompt = "A young girl stands."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=704,
    height=480,
    num_frames=161,
    num_inference_steps=50,
).frames[0]
modified_prompt = "-".join(prompt.split()[:14])
export_to_video(video, f"samples/test_out/demo-{modified_prompt}.mp4", fps=24)
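The four demo-*.mp4 results above came from running this script with progressively truncated prompts. A minimal sketch of that sweep, with the prompt list reconstructed from the output filenames and pipe, image, and negative_prompt reused from the script above:

# Sweep over the truncated prompts behind the demo-*.mp4 results above
# (prompt list reconstructed from the output filenames; otherwise same settings).
for prompt in [
    "A young girl stands calmly in the foreground.",
    "A young girl stands calmly.",
    "A young girl stands.",
    "A young girl.",
]:
    video = pipe(
        image=image,
        prompt=prompt,
        negative_prompt=negative_prompt,
        width=704,
        height=480,
        num_frames=161,
        num_inference_steps=50,
    ).frames[0]
    name = "-".join(prompt.split()[:14])
    export_to_video(video, f"samples/test_out/demo-{name}.mp4", fps=24)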
System Info

- torch 2.4.1
- torchao 0.7.0
- torchvision 0.19.1
- diffusers 0.32.1
- python 3.10
Who can help?
No response