
Conversation

@Net-Mist (Contributor) commented Jul 7, 2025

What does this PR do?

According to the Flux Kontext paper, the model should be able to take several images as input.

I've made a first attempt at implementing this. However, there are a few points I'm unsure about regarding how to best align with the codebase's design philosophy:

  • I added a new multiple_images parameter to the __call__ method of the FluxKontextPipeline. A check validates that image and multiple_images can't be set at the same time.
  • The multiple_images parameter has type `PipelineSeveralImagesInput`, defined as follows:
from typing import List, Tuple, Union

import numpy as np
import PIL.Image
import torch

PipelineSeveralImagesInput = Union[
    Tuple[PIL.Image.Image, ...],
    Tuple[np.ndarray, ...],
    Tuple[torch.Tensor, ...],
    List[Tuple[PIL.Image.Image, ...]],
    List[Tuple[np.ndarray, ...]],
    List[Tuple[torch.Tensor, ...]],
]

(This mirrors the PipelineImageInput type.)

  • The image preprocessing logic was split into two methods (preprocess_image and preprocess_images) with a lot of duplicated code. There are a few ways to avoid the duplication, but I'm not sure which approach best fits the design principles of this repository. Possible options include:
    • Make the resize and preprocess methods of VaeImageProcessor more generic, so they can accept PipelineSeveralImagesInput. (In this case, should we use a TypeVar to condition the output type of the function on the input type?)
    • Convert the image variable to a List[Tuple[{image_format}]] upfront to unify the processing logic (a sketch of this option follows this list).
    • Duplicate the code, as done here.
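
A minimal sketch of the second option, purely illustrative (the normalize_to_batched_tuples helper and its single-image fallback are hypothetical, not part of this PR):

from typing import List, Tuple

def normalize_to_batched_tuples(image=None, multiple_images=None) -> List[Tuple]:
    # Hypothetical helper: funnel both parameters into one List[Tuple[image, ...]]
    # shape so a single preprocessing path can handle them.
    if image is not None:
        images = image if isinstance(image, list) else [image]
        return [(img,) for img in images]
    if isinstance(multiple_images, tuple):
        return [multiple_images]      # one tuple of images -> a batch of one sample
    return list(multiple_images)      # already a list of tuples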

What do you think?

Fixes #11824

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@vuongminh1907 (Contributor):

@Net-Mist, can you add some examples using 2-3 reference images? I tested but Kontext doesn't seem to work well with more than 2.

@Net-Mist (Contributor, Author) commented Jul 7, 2025

@vuongminh1907

The main difficulty I face when using Kontext with multiple images is that the model seems to have no notion of "first image", "second image", and so on (it wasn't trained for it), and it struggles with "left" and "right".

with this code:

import torch
from PIL import Image

from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained("black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Load the three reference images and resize them to a common 1024x1024 resolution
image1 = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/yarn-art-pikachu.png"
).convert("RGB").resize((1024, 1024), resample=Image.LANCZOS)
image2 = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
).convert("RGB").resize((1024, 1024), resample=Image.LANCZOS)
image3 = load_image(
    "https://www.pokemon.com/static-assets/content-assets/cms2/img/pokedex/full/151.png"
).convert("RGB").resize((1024, 1024), resample=Image.LANCZOS)


# First run: two reference images passed as a single tuple (one sample in the batch)
prompts = [
    "Put both animal in a cozy room",
]
images = pipe(
    multiple_images=[(image1, image2)],
    prompt=prompts,
    guidance_scale=2.5,
    generator=torch.Generator().manual_seed(42),
).images
images[0].save("output_0.png")


# Second run: three reference images passed as a single tuple
prompts = [
    "Put all animals in a cozy room",
]
images = pipe(
    multiple_images=[(image1, image2, image3)],
    prompt=prompts,
    guidance_scale=2.5,
    generator=torch.Generator().manual_seed(42),
).images
images[0].save("output_1.png")

I got these 2 images:

output_0

output_1

@asomoza (Member) commented Jul 7, 2025

That's why I suggested in the issue that Kontext works a lot better if you composite the image yourself: you can arrange the multiple subjects in a single image and use the space better to try to preserve all the details. This also works when you do try-ons, etc.; each use case can be improved just by the composition of the image you use as the source.
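
For reference, a minimal sketch of that manual composition, reusing image1, image2, image3 and pipe from the snippet above (the composite_side_by_side helper and tile size are just illustrative, not an API from this PR):

from PIL import Image

def composite_side_by_side(images, tile_size=(512, 1024)):
    # Paste the subjects side by side on one canvas so Kontext receives a
    # single composited image through the existing `image` argument.
    width, height = tile_size
    canvas = Image.new("RGB", (width * len(images), height), "white")
    for i, img in enumerate(images):
        canvas.paste(img.convert("RGB").resize(tile_size, resample=Image.LANCZOS), (i * width, 0))
    return canvas

composited = composite_side_by_side([image1, image2, image3])
result = pipe(
    image=composited,
    prompt="remove the background and put all three creatures in a cozy room",
    guidance_scale=2.5,
).images[0]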

Also, your prompt is too simple; you need to be more specific about what you want.

For example:

remove the background and put all three creatures, the cat, the pikachu and the mew, in a cozy room while preserving their style and characteristics.

source image | result
three animals

remove the background and put all three creatures, the cat, the pikachu and the mew, in a cozy room, make them all realistic and holding hands

animals (1)

Automatic stitching or concatenation of the images makes the model worse, in my opinion.

You can also draw over the image to suggest what you want, like I did here.

@Net-Mist (Contributor, Author) commented Jul 8, 2025

@asomoza
Yes, I see your point. I did a few more experiments with multiple inputs, but it doesn't seem to work that well. At least for my use case, I'll go back to training some LoRAs ^^.
I'm not sure whether it's worth merging this PR. On one hand, it seems that others are interested in experimenting with this; on the other hand, it adds some complexity to the code.
I'll leave it up to the HF team to decide. I don't mind if you choose to close the PR, but I'm also happy to keep working on it once I get some feedback on the code.

@asomoza (Member) commented Jul 8, 2025

Thanks @Net-Mist. I'd rather not add more complexity to the pipeline if it doesn't bring a clear benefit to users. I can already see multiple issues being opened asking why Kontext doesn't produce good results with multiple images.

I don't really know why people are so interested in this. Everywhere I read, people say that this kind of technique produces bad results. There are some basic examples that work, but if you need something a little more complex than a subject and a background, it doesn't work as well, and even when it does, it always loses details because of the stitching or concatenation.

What we can do, since you didn't modify the VaeImageProcessor and chose the duplicated-code route, is add the PipelineSeveralImagesInput type to the same file and make (move) this into a community pipeline.

This way we can merge it a lot faster and gauge how much people actually use it.

@Net-Mist force-pushed the flux_multiple_input_image branch from 764fd69 to 061a7d8 on July 9, 2025, 10:24
@Net-Mist force-pushed the flux_multiple_input_image branch from 061a7d8 to f079e21 on July 9, 2025, 11:30
@Net-Mist (Contributor, Author) commented Jul 9, 2025

@asomoza
Yes, that seems like the best option.
I've updated the PR with your suggestion.
Thanks for your time :)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@asomoza (Member) left a comment

Thanks a lot! Let's see how people use it.

@asomoza merged commit db715e2 into huggingface:main on Jul 9, 2025
8 checks passed
@mamicro-li:

It isn't necessary to resize all the input images to the same size (in preprocess_images()); you could simply keep each input image's original aspect ratio.

@lavinal712 (Contributor) commented Jul 19, 2025

image_ids[..., 0] = 1

Why not image_ids[..., 0] = i + 1?

# set the image ids to the correct position in the latent grid
image_ids[..., 2] += i * (image_latent_height // 2)

What does this line of code do?

@lavinal712 (Contributor):

In my experiments, I tested three different configurations:

# It best aligns with the paper’s description, which is to add an offset along the temporal dimension.
image_ids[..., 0] = i + 1

# Add offsets along both the temporal and vertical dimensions.
image_ids[..., 0] = i + 1
image_ids[..., 2] += i * (image_latent_height // 2)

# The approach taken in this PR sets the temporal offset to 1 and adds an offset along the vertical dimension.
image_ids[..., 0] = 1
image_ids[..., 2] += i * (image_latent_height // 2)
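
For context, a rough sketch of where these assignments sit when the ids are prepared for each reference image; the loop, the per_image_latent_sizes list, and all variable names are illustrative, not the exact PR code:

import torch
from diffusers import FluxKontextPipeline

# Illustrative placeholders; in the pipeline these come from the call arguments.
batch_size, device, dtype = 1, "cpu", torch.float32
per_image_latent_sizes = [(128, 128), (128, 128)]  # hypothetical per-image latent heights/widths

all_image_ids = []
for i, (latent_h, latent_w) in enumerate(per_image_latent_sizes):
    # Same helper used for the target latents; it leaves dim 0 (the "temporal" index) at 0.
    ids = FluxKontextPipeline._prepare_latent_image_ids(batch_size, latent_h // 2, latent_w // 2, device, dtype)
    ids[..., 0] = 1                     # mark tokens as reference-image (conditioning) tokens
    ids[..., 2] += i * (latent_h // 2)  # shift each image into its own region of the position grid
    all_image_ids.append(ids)

image_ids = torch.cat(all_image_ids, dim=0)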

Input: [image] [image]

Prompt: The astronaut is holding the cat.

Output: [image]

I compared the three settings above and found no clear difference. Generating a satisfactory image still requires multiple attempts; sometimes the output doesn’t match the prompt.

It won't be necessary to resize all the input images to the same size (in preprocess_images()). You can simply keep the input images' original aspect ratio.

I agree with this approach. In my implementation, the two input images have different resolutions.
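
A minimal sketch of that idea, as a hypothetical helper (not the PR's preprocess_images): scale each image to roughly the same pixel area while keeping its own aspect ratio, rounding to the multiple-of-16 dimensions that Flux latents expect.

from PIL import Image

def resize_keep_aspect(img, target_area=1024 * 1024, multiple=16):
    # Hypothetical helper: keep each image's aspect ratio, scale it to roughly
    # target_area pixels, and round both sides to a multiple of 16.
    w, h = img.size
    scale = (target_area / (w * h)) ** 0.5
    new_w = max(multiple, round(w * scale / multiple) * multiple)
    new_h = max(multiple, round(h * scale / multiple) * multiple)
    return img.resize((new_w, new_h), resample=Image.LANCZOS)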

@Net-Mist (Contributor, Author):

@lavinal712 I tested image_ids[..., 0] = i + 1, which is what the paper describes. However, as the model hasn't been trained (yet?) with several input images, I ran into more issues with this approach than with using an offset on the other axis. For example, it often produced a pattern where patches from the different images were mixed together.

image

I’m wondering if we could fine-tune or train a LoRA so the model learns to differentiate between images along the temporal axis, and grasps the concept of "first", "second", "third" images... (if someone here has a bit of free time, and the compute power required ^^)

@lavinal712 (Contributor):

I'll have some free time starting in early August and would be happy to give it a try.

@ajinkyaT:

@lavinal712 @Net-Mist I can partner up on training a LoRA and trying out different configs.

@lavinal712 (Contributor) commented Aug 5, 2025

@ajinkyaT Glad to hear that; I happen to have some free time at the moment.

@ajinkyaT @Net-Mist I found a fantastic repo to do multi-image inference: https://github.com/Saquib764/omini-kontext


Development

Successfully merging this pull request may close these issues.

Run FluxKontextPipeline with multi-images input, But An error occurred

7 participants