feat: add multiple input image support in Flux Kontext #11880
Conversation
@Net-Mist, can you add some examples using 2-3 reference images? I tested, but Kontext doesn't seem to work well with more than 2.
The main difficulty I face when using Kontext with multiple images is that the model seems to have no notion of "first image", "second image", ... (it wasn't trained for it), and struggles with "left" and "right". With this code:

```python
import torch
from PIL import Image

from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained("black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image1 = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/yarn-art-pikachu.png"
).convert("RGB").resize((1024, 1024), resample=Image.LANCZOS)
image2 = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
).convert("RGB").resize((1024, 1024), resample=Image.LANCZOS)
image3 = load_image(
    "https://www.pokemon.com/static-assets/content-assets/cms2/img/pokedex/full/151.png"
).convert("RGB").resize((1024, 1024), resample=Image.LANCZOS)

prompts = [
    "Put both animal in a cozy room",
]
images = pipe(
    multiple_images=[(image1, image2)],
    prompt=prompts,
    guidance_scale=2.5,
    generator=torch.Generator().manual_seed(42),
).images
images[0].save("output_0.png")

prompts = [
    "Put all animals in a cozy room",
]
images = pipe(
    multiple_images=[(image1, image2, image3)],
    prompt=prompts,
    guidance_scale=2.5,
    generator=torch.Generator().manual_seed(42),
).images
images[0].save("output_1.png")
```

I got these 2 images:
That's why I suggested in the issue that Kontext works a lot better if you composite the image yourself: you can arrange the multiple subjects in a single image and use the space better to try to preserve all the details. This also works for try-ons, etc.; each use case can be made better just by the composition of the image you use as a source. Also, your prompt is too simple; you need to be more specific about what you want. For example:

Automatically stitching or concatenating the images makes the model worse, in my opinion. You can also draw over the image to suggest what you want, like I did here.
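As a concrete illustration of the manual compositing described above, here is a minimal sketch that pastes the reference images side by side on a single canvas and passes that one image to the standard single-image call. The layout, canvas color, and prompt are illustrative assumptions; `pipe`, `image1`, and `image2` come from the example earlier in the thread.

```python
from PIL import Image

def composite_side_by_side(images, height=1024):
    # Resize each image to a common height, preserving aspect ratio.
    resized = [
        img.resize((int(img.width * height / img.height), height), resample=Image.LANCZOS)
        for img in images
    ]
    # Paste them left to right on a single white canvas.
    canvas = Image.new("RGB", (sum(img.width for img in resized), height), "white")
    x = 0
    for img in resized:
        canvas.paste(img, (x, 0))
        x += img.width
    return canvas

composite = composite_side_by_side([image1, image2])
result = pipe(
    image=composite,  # one composited image instead of multiple_images
    prompt="Place the yarn Pikachu on the left and the cat on the right together in a cozy room",
    guidance_scale=2.5,
).images[0]
result.save("composited_output.png")
```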
@asomoza
Thanks @Net-Mist. I'd rather not add more complexity to the pipeline if it is not something that adds a clear benefit to users. I can already see multiple issues created with something similar to […].

I don't really know why people are so interested in this; everywhere I read, people say that this kind of technique produces bad results. There are some basic examples that work, but if you need something a little more complex than a subject and a background, it doesn't work as well, and even when it does work, it always loses details because of the stitching or concatenation.

What we can do, since you didn't modify the […]. This way we can merge it a lot faster and gauge how much people like to use it or not.
@asomoza
thanks a lot! let's see how people use it.
It won't be necessary to resize all the input images to the same size (in […]).

```python
image_ids[..., 0] = 1
```

Why not `image_ids[..., 0] = i + 1`?

```python
# set the image ids to the correct position in the latent grid
image_ids[..., 2] += i * (image_latent_height // 2)
```

What does this line of code do?
In my experiments, I tested three different configurations:

```python
# 1. Best aligns with the paper's description: add an offset along the temporal dimension.
image_ids[..., 0] = i + 1

# 2. Add offsets along both the temporal and vertical dimensions.
image_ids[..., 0] = i + 1
image_ids[..., 2] += i * (image_latent_height // 2)

# 3. The approach taken in this PR: set the temporal offset to 1 and add an offset along the vertical dimension.
image_ids[..., 0] = 1
image_ids[..., 2] += i * (image_latent_height // 2)
```

Prompt: "The astronaut is holding the cat."

I compared the three settings above and found no clear difference. Generating a satisfactory image still requires multiple attempts; sometimes the output doesn't match the prompt.
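To make the comparison concrete, here is a rough, self-contained sketch of where these offsets would be applied for the i-th reference image. The helper name, the `mode` switch, and the `base_ids` shape are illustrative assumptions, not the PR's actual code:

```python
import torch

def offset_image_ids(base_ids: torch.Tensor, i: int, image_latent_height: int, mode: str) -> torch.Tensor:
    """Apply per-image offsets to the position ids of the i-th reference image.

    base_ids: (num_tokens, 3) position ids for one image's packed latents,
    as produced by the pipeline's usual id preparation (elided here).
    """
    image_ids = base_ids.clone()
    if mode == "temporal":
        # Config 1: distinguish images only via the first ("temporal") id.
        image_ids[..., 0] = i + 1
    elif mode == "both":
        # Config 2: per-image temporal id plus a per-image spatial offset.
        image_ids[..., 0] = i + 1
        image_ids[..., 2] += i * (image_latent_height // 2)
    else:
        # Config 3 (this PR): constant temporal id; the spatial offset shifts
        # the i-th image's tokens into a disjoint coordinate range, so stacked
        # reference images never share a position in the latent grid.
        image_ids[..., 0] = 1
        image_ids[..., 2] += i * (image_latent_height // 2)
    return image_ids

# The per-image ids would then be concatenated before being fed to the transformer:
# all_ids = torch.cat([offset_image_ids(ids, i, h, "pr") for i, (ids, h) in enumerate(refs)])
```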
I agree with this approach. In my implementation, the two input images have different resolutions.
@lavinal712 I tested […]

I'm wondering if we could fine-tune or train a LoRA so the model learns to differentiate between images along the temporal axis and grasps the concept of "first", "second", "third" images... (if someone here has a bit of free time, and the compute power required ^^)
I'll have some free time starting in early August and would be happy to give it a try.
@lavinal712 @Net-Mist I can partner up with training a lora and trying out different configs. |
@ajinkyaT Glad to hear that; I happen to have some time recently.

@ajinkyaT @Net-Mist I found a fantastic repo for multi-image inference: https://github.com/Saquib764/omini-kontext
What does this PR do?
According to the Flux Kontext paper, the model should be able to take several images as input.
I've made a first attempt at implementing this. However, there are a few points I'm unsure about regarding how to best align with the codebase's design philosophy:

- Added a `multiple_images` parameter to the `__call__` method of the `FluxKontextPipeline`. A check validates that `image` and `multiple_images` can't be set at the same time.
- The `multiple_images` parameter has type `PipelineSeveralImagesInput`, which mirrors the `PipelineImageInput` type (a sketch of a plausible definition appears after this list).
- Made the `resize` and `preprocess` methods of `VaeImageProcessor` more generic, to allow them to take `PipelineSeveralImagesInput` input. (In this case, should we use a TypeVar to condition the output type of the function on the input?)
- Converted the `image` variable to a `List[Tuple[{image_format}]]` upfront to unify the processing logic.

What do you think?
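For reference, here is a rough guess at what a `PipelineSeveralImagesInput` alias mirroring `PipelineImageInput` could look like. This is a hypothetical reconstruction, consistent with the `multiple_images=[(image1, image2)]` usage in the example above, not the PR's actual definition:

```python
from typing import List, Tuple, Union

import numpy as np
import PIL.Image
import torch

# Hypothetical reconstruction: a tuple groups the several images of ONE example,
# and a list batches several examples, mirroring how PipelineImageInput accepts
# either a single image or a list of images.
PipelineSeveralImagesInput = Union[
    Tuple[PIL.Image.Image, ...],
    Tuple[np.ndarray, ...],
    Tuple[torch.Tensor, ...],
    List[Tuple[PIL.Image.Image, ...]],
    List[Tuple[np.ndarray, ...]],
    List[Tuple[torch.Tensor, ...]],
]
```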
Fixes #11824
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.