
Conversation

@Net-Mist (Contributor) commented Jul 7, 2025

What does this PR do?

According to the Flux Kontext paper, the model should be able to take several images as input.

I've made a first attempt at implementing this. However, there are a few points I'm unsure about regarding how to best align with the codebase's design philosophy:

  • I added a new multiple_images parameter to the __call__ method of the FluxKontextPipeline. A check validates that image and multiple_images can't be set at the same time.
  • The multiple_images parameter has type `PipelineSeveralImagesInput`, defined as follows:
from typing import List, Tuple, Union

import numpy as np
import PIL.Image
import torch

PipelineSeveralImagesInput = Union[
    Tuple[PIL.Image.Image, ...],
    Tuple[np.ndarray, ...],
    Tuple[torch.Tensor, ...],
    List[Tuple[PIL.Image.Image, ...]],
    List[Tuple[np.ndarray, ...]],
    List[Tuple[torch.Tensor, ...]],
]

(This mirrors the PipelineImageInput type.)

  • The image preprocessing logic was split into two methods (preprocess_image and preprocess_images) with a lot of duplicated code. There are a few ways to avoid the duplication, but I'm not sure which approach best fits the design principles of this repository. Possible options include:
    • Make the resize and preprocess methods of VaeImageProcessor more generic, so they can accept PipelineSeveralImagesInput. (In this case, should we use a TypeVar to condition the output type of the function on the input type?)
    • Convert the image variable to a List[Tuple[{image_format}]] upfront to unify the processing logic (a sketch of this option follows this list).
    • Duplicate the code, as done here.
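
A minimal sketch of the second option, purely illustrative (the normalize_to_batched_tuples helper and its single-image fallback are hypothetical, not part of this PR):

from typing import List, Tuple

def normalize_to_batched_tuples(image=None, multiple_images=None) -> List[Tuple]:
    # Hypothetical helper: funnel both parameters into one List[Tuple[image, ...]]
    # shape so a single preprocessing path can handle them.
    if image is not None:
        images = image if isinstance(image, list) else [image]
        return [(img,) for img in images]
    if isinstance(multiple_images, tuple):
        return [multiple_images]      # one tuple of images -> a batch of one sample
    return list(multiple_images)      # already a list of tuples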

What do you think?

Fixes #11824

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@vuongminh1907 (Contributor):

@Net-Mist, can you add some examples using 2-3 reference images? I tested but Kontext doesn't seem to work well with more than 2.

@Net-Mist (Contributor, Author) commented Jul 7, 2025

@vuongminh1907

The main difficulty I face when using Kontext with multiple images is that the model seems to have no notion of "first image", "second image", and so on (it wasn't trained for it), and it struggles with "left" and "right".

with this code:

import torch
from PIL import Image

from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained("black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Load the three reference images and resize them to a common 1024x1024 resolution
image1 = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/yarn-art-pikachu.png"
).convert("RGB").resize((1024, 1024), resample=Image.LANCZOS)
image2 = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
).convert("RGB").resize((1024, 1024), resample=Image.LANCZOS)
image3 = load_image(
    "https://www.pokemon.com/static-assets/content-assets/cms2/img/pokedex/full/151.png"
).convert("RGB").resize((1024, 1024), resample=Image.LANCZOS)


# First run: two reference images passed as a single tuple (one sample in the batch)
prompts = [
    "Put both animal in a cozy room",
]
images = pipe(
    multiple_images=[(image1, image2)],
    prompt=prompts,
    guidance_scale=2.5,
    generator=torch.Generator().manual_seed(42),
).images
images[0].save("output_0.png")


# Second run: three reference images passed as a single tuple
prompts = [
    "Put all animals in a cozy room",
]
images = pipe(
    multiple_images=[(image1, image2, image3)],
    prompt=prompts,
    guidance_scale=2.5,
    generator=torch.Generator().manual_seed(42),
).images
images[0].save("output_1.png")

I got these 2 images:

output_0

output_1

@asomoza (Member) commented Jul 7, 2025

That's why I suggested in the issue that Kontext works a lot better if you composite the image yourself: you can arrange the multiple subjects in a single image and use the space better to try to preserve all the details. This also works when you do try-ons, etc.; each use case can be improved just by the composition of the image you use as the source.
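
For reference, a minimal sketch of that manual composition, reusing image1, image2, image3 and pipe from the snippet above (the composite_side_by_side helper and tile size are just illustrative, not an API from this PR):

from PIL import Image

def composite_side_by_side(images, tile_size=(512, 1024)):
    # Paste the subjects side by side on one canvas so Kontext receives a
    # single composited image through the existing `image` argument.
    width, height = tile_size
    canvas = Image.new("RGB", (width * len(images), height), "white")
    for i, img in enumerate(images):
        canvas.paste(img.convert("RGB").resize(tile_size, resample=Image.LANCZOS), (i * width, 0))
    return canvas

composited = composite_side_by_side([image1, image2, image3])
result = pipe(
    image=composited,
    prompt="remove the background and put all three creatures in a cozy room",
    guidance_scale=2.5,
).images[0]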

Also, your prompt is too simple; you need to be more specific about what you want.

For example:

remove the background and put all three creatures, the cat, the pikachu and the mew, in a cozy room while preserving their style and characteristics.

source image | result
three animals

remove the background and put all three creatures, the cat, the pikachu and the mew, in a cozy room, make them all realistic and holding hands

animals (1)

Automatic stitching or concatenation of the images makes the model worse, in my opinion.

You can also draw over the image to suggest what you want, like I did here.

@Net-Mist (Contributor, Author) commented Jul 8, 2025

@asomoza
Yes, I see your point. I did a few more experiments with multiple inputs, but it doesn't seem to work that well. At least for my use case, I'll go back to training some LoRAs ^^.
I'm not sure whether it's worth merging this PR. On one hand, it seems that others are interested in experimenting with this; on the other hand, it adds some complexity to the code.
I'll leave it up to the HF team to decide. I don't mind if you choose to close the PR, but I'm also happy to keep working on it once I get some feedback on the code.

@asomoza (Member) commented Jul 8, 2025

Thanks @Net-Mist. I'd rather not add more complexity to the pipeline if it doesn't bring a clear benefit to users. I can already see multiple issues being opened asking why Kontext doesn't produce good results with multiple images.

I don't really know why people are so interested in this. Everywhere I read, people say that this kind of technique produces bad results. There are some basic examples that work, but if you need something a little more complex than a subject and a background, it doesn't work as well, and even when it does, it always loses details because of the stitching or concatenation.

What we can do, since you didn't modify the VaeImageProcessor and chose the duplicated-code route, is add the PipelineSeveralImagesInput type to the same file and make (move) this into a community pipeline.

This way we can merge it a lot faster and gauge how much people actually use it.

@Net-Mist force-pushed the flux_multiple_input_image branch from 764fd69 to 061a7d8 on July 9, 2025, 10:24
@Net-Mist force-pushed the flux_multiple_input_image branch from 061a7d8 to f079e21 on July 9, 2025, 11:30
@Net-Mist (Contributor, Author) commented Jul 9, 2025

@asomoza
Yes, that seems like the best option.
I've updated the PR with your suggestion.
Thanks for your time :)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@asomoza (Member) left a comment

Thanks a lot! Let's see how people use it.

@asomoza merged commit db715e2 into huggingface:main on Jul 9, 2025
8 checks passed
@mamicro-li:

It isn't necessary to resize all the input images to the same size (in preprocess_images()); you could simply keep each input image's original aspect ratio.

@lavinal712 (Contributor) commented Jul 19, 2025

image_ids[..., 0] = 1

Why not image_ids[..., 0] = i + 1?

# set the image ids to the correct position in the latent grid
image_ids[..., 2] += i * (image_latent_height // 2)

What does this line of code do?

@lavinal712 (Contributor):

In my experiments, I tested three different configurations:

# It best aligns with the paper’s description, which is to add an offset along the temporal dimension.
image_ids[..., 0] = i + 1

# Add offsets along both the temporal and vertical dimensions.
image_ids[..., 0] = i + 1
image_ids[..., 2] += i * (image_latent_height // 2)

# The approach taken in this PR sets the temporal offset to 1 and adds an offset along the vertical dimension.
image_ids[..., 0] = 1
image_ids[..., 2] += i * (image_latent_height // 2)
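
For context, a rough sketch of where these assignments sit when the ids are prepared for each reference image; the loop, the per_image_latent_sizes list, and all variable names are illustrative, not the exact PR code:

import torch
from diffusers import FluxKontextPipeline

# Illustrative placeholders; in the pipeline these come from the call arguments.
batch_size, device, dtype = 1, "cpu", torch.float32
per_image_latent_sizes = [(128, 128), (128, 128)]  # hypothetical per-image latent heights/widths

all_image_ids = []
for i, (latent_h, latent_w) in enumerate(per_image_latent_sizes):
    # Same helper used for the target latents; it leaves dim 0 (the "temporal" index) at 0.
    ids = FluxKontextPipeline._prepare_latent_image_ids(batch_size, latent_h // 2, latent_w // 2, device, dtype)
    ids[..., 0] = 1                     # mark tokens as reference-image (conditioning) tokens
    ids[..., 2] += i * (latent_h // 2)  # shift each image into its own region of the position grid
    all_image_ids.append(ids)

image_ids = torch.cat(all_image_ids, dim=0)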

Input: [image] [image]

Prompt: The astronaut is holding the cat.

Output: [image]

I compared the three settings above and found no clear difference. Generating a satisfactory image still requires multiple attempts; sometimes the output doesn’t match the prompt.

It won't be necessary to resize all the input images to the same size (in preprocess_images()). You can simply keep the input images' original aspect ratio.

I agree with this approach. In my implementation, the two input images have different resolutions.
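
A minimal sketch of that idea, as a hypothetical helper (not the PR's preprocess_images): scale each image to roughly the same pixel area while keeping its own aspect ratio, rounding to the multiple-of-16 dimensions that Flux latents expect.

from PIL import Image

def resize_keep_aspect(img, target_area=1024 * 1024, multiple=16):
    # Hypothetical helper: keep each image's aspect ratio, scale it to roughly
    # target_area pixels, and round both sides to a multiple of 16.
    w, h = img.size
    scale = (target_area / (w * h)) ** 0.5
    new_w = max(multiple, round(w * scale / multiple) * multiple)
    new_h = max(multiple, round(h * scale / multiple) * multiple)
    return img.resize((new_w, new_h), resample=Image.LANCZOS)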

@Net-Mist (Contributor, Author):

@lavinal712 I tested image_ids[..., 0] = i + 1, which is what the paper describes. However, as the model hasn't been trained (yet?) with several input images, I ran into more issues with this approach than with using an offset on the other axis. For example, it often produced a pattern where patches from the different images were mixed together.

image

I’m wondering if we could fine-tune or train a LoRA so the model learns to differentiate between images along the temporal axis, and grasps the concept of "first", "second", "third" images... (if someone here has a bit of free time, and the compute power required ^^)

@lavinal712 (Contributor):

I'll have some free time starting in early August and would be happy to give it a try.

@ajinkyaT:

@lavinal712 @Net-Mist I can partner up on training a LoRA and trying out different configs.

@lavinal712 (Contributor) commented Aug 5, 2025

@ajinkyaT Glad to hear that; I happen to have some free time at the moment.

@ajinkyaT @Net-Mist I found a fantastic repo to do multi-image inference: https://github.com/Saquib764/omini-kontext


Development

Successfully merging this pull request may close these issues.

Run FluxKontextPipeline with multi-images input, But An error occurred

7 participants