Add basic implementation of AuraFlowImg2ImgPipeline #11340
Conversation
Thanks for yet another contribution! Could you post a snippet and some results?
Apologies - I had to clean things up and make the tests actually work. The docstrings included in the CL should be a good snippet, i.e.:

```python
import torch
from diffusers import AuraFlowImg2ImgPipeline
import requests
from PIL import Image
from io import BytesIO

# download an initial image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))

pipe = AuraFlowImg2ImgPipeline.from_pretrained("fal/AuraFlow-v0.3", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A fantasy landscape, trending on artstation"
image = pipe(prompt=prompt, image=init_image, strength=0.75, num_inference_steps=50).images[0]
image.save("aura_flow_img2img.png")
```

Unfortunately it seems that my math may be wrong somehow? [Result images attached for strength 0.75, 0.85, and 0.95 - but with 0.9 the output differs.] @DN6, any ideas?
```python
# Compute latents
latents = mean + std * sample

# Scale latents
latents = latents * self.vae.config.scaling_factor
```
I think it should be:

```python
init_latents = (init_latents - mean) * self.vae.config.scaling_factor / latents_std
```
The VAE config has `('latents_mean', None)` and `('latents_std', None)`, so I believe that code would be a no-op even if implemented.
Oh okay, then

```python
init_latents *= self.vae.config.scaling_factor
```

should be just fine. We can safely remove the other things.
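For context, here is a minimal sketch of the latent-scaling pattern being discussed. The helper name and the config-attribute lookup are illustrative assumptions modeled on how other diffusers img2img pipelines handle this, not the final PR code; when `latents_mean`/`latents_std` are `None` (as in AuraFlow's VAE config), it reduces to the plain `scaling_factor` multiply above.

```python
import torch


def scale_init_latents(init_latents: torch.Tensor, vae_config) -> torch.Tensor:
    """Normalize and scale VAE-encoded latents (hypothetical helper).

    If the VAE config provides per-channel latents_mean/latents_std,
    normalize first; otherwise just multiply by scaling_factor.
    """
    latents_mean = getattr(vae_config, "latents_mean", None)
    latents_std = getattr(vae_config, "latents_std", None)
    if latents_mean is not None and latents_std is not None:
        mean = torch.tensor(latents_mean).view(1, -1, 1, 1).to(init_latents)
        std = torch.tensor(latents_std).view(1, -1, 1, 1).to(init_latents)
        return (init_latents - mean) * vae_config.scaling_factor / std
    # AuraFlow's VAE config has latents_mean/latents_std set to None,
    # so only this branch is taken there.
    return init_latents * vae_config.scaling_factor
```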
Looking good 👍🏽 I think some parts of the implementation have to be adjusted to look more similar to the existing Flux pipelines.
```python
return timesteps, num_inference_steps - t_start

def prepare_img2img_latents(
```
Can this be placed under `prepare_latents`, like the other Img2Img pipelines?

```python
def prepare_latents(
```
Not sure if this is what you intended, but done?
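As background for this exchange, a hedged sketch of what `prepare_latents` typically does in a flow-matching img2img pipeline (the function name and exact interpolation are assumptions modeled on the Flux img2img pipeline, not this PR's code): the encoded image latents are linearly blended with noise at the sigma of the strength-truncated starting timestep.

```python
import torch


def prepare_latents_sketch(init_latents, timestep, noise=None, num_train_timesteps=1000):
    """Blend image latents with noise for a flow-matching forward process (sketch).

    x_t = (1 - sigma) * x_0 + sigma * noise, where sigma = t / num_train_timesteps
    at the strength-truncated start step; strength=1.0 gives pure noise.
    """
    if noise is None:
        noise = torch.randn_like(init_latents)
    sigma = float(timestep) / num_train_timesteps
    return (1.0 - sigma) * init_latents + sigma * noise
```

With `timestep == num_train_timesteps` this returns pure noise (text-to-image behavior); with `timestep == 0` it returns the unmodified image latents.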
```python
prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)

# 5. Prepare timesteps
timesteps, num_inference_steps = self.get_timesteps(
```
Think timesteps need to be adjusted for strength and shift, no?

diffusers/src/diffusers/pipelines/flux/pipeline_flux_img2img.py, lines 876 to 892 in bbd0c16:

```python
sigmas = np.linspace(1.0, 1 / num_inference_steps, num_inference_steps) if sigmas is None else sigmas
image_seq_len = (int(height) // self.vae_scale_factor // 2) * (int(width) // self.vae_scale_factor // 2)
mu = calculate_shift(
    image_seq_len,
    self.scheduler.config.get("base_image_seq_len", 256),
    self.scheduler.config.get("max_image_seq_len", 4096),
    self.scheduler.config.get("base_shift", 0.5),
    self.scheduler.config.get("max_shift", 1.15),
)
timesteps, num_inference_steps = retrieve_timesteps(
    self.scheduler,
    num_inference_steps,
    device,
    sigmas=sigmas,
    mu=mu,
)
timesteps, num_inference_steps = self.get_timesteps(num_inference_steps, strength, device)
```
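For reference, the two adjustments quoted above can be sketched roughly as follows. These are simplified stand-ins (exact rounding, scheduler `order` handling, and tensor types differ in the real pipelines): `get_timesteps` drops the earliest, noisiest steps according to `strength`, and `calculate_shift` linearly interpolates the schedule shift `mu` in the image sequence length.

```python
def get_timesteps_sketch(timesteps, num_inference_steps, strength):
    # strength=1.0 keeps the full schedule (pure generation);
    # lower strength skips early steps so the init image is only
    # partially noised before denoising begins.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return timesteps[t_start:], num_inference_steps - t_start


def calculate_shift_sketch(image_seq_len, base_seq_len=256, max_seq_len=4096,
                           base_shift=0.5, max_shift=1.15):
    # Linear interpolation: mu goes from base_shift at base_seq_len
    # to max_shift at max_seq_len.
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b
```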
done, probably :)
Unfortunately, still seeing visual noise instead of the image at some values of strength, so something in my math must be wrong.

Do those values generally tend to be higher?

A bit hard to see, sorry.

Weird, I can click on the image to get the full-sized one (with an extra click).
I still have no idea what is going on, but I think the code is correct; AF may require some special work, though. Here are my observations:

The core issue no longer seems to be only the deterministic
Thanks for the detailed analysis. Do these vary from
Yes, I am seeing the same behavior in 0.2. I focused too much on the VAE in the last comment - I don't think it's the root cause (after all, SDXL works just fine), and perhaps the real issue is some kind of numerical instability we are facing? I've included 3 videos, generated by taking a snapshot of the model state every 5 frames - before, at the moment of the issue, and after. At least looking at them, I can't notice anything weird that could explain the problem:

- combined_steps_s0.85.mp4
- combined_steps_s0.90.mp4
- combined_steps_s0.95.mp4

I've also attempted to shift the problematic strength range by changing the number of steps or the guidance scale, but it had no effect. Interestingly, enabling
What does this PR do?
Adds a fairly mechanical conversion of other img2img pipelines (mostly SD3/Flux) to support AuraFlow. It seems to require a bit more strength (0.75+) compared to SDXL (my only point of reference that I've used a lot for I2I), but it works fine and does not complain about GGUF (still need to check compilation).
Fixes # (issue)
Who can review?
@cloneofsimo @sayakpaul
@yiyixuxu @asomoza