Add basic implementation of AuraFlowImg2ImgPipeline #11340


Status: Open · wants to merge 4 commits into base: main
Conversation

@AstraliteHeart (Contributor) commented Apr 16, 2025

What does this PR do?

Adds a very mechanical conversion of other img2img pipelines (mostly SD3/Flux) to support AuraFlow. It seems to require a bit more strength (0.75+) than SDXL (my main point of reference for I2I), but it works fine and does not complain about GGUF (I still need to check compilation).

Fixes # (issue)

Before submitting

Who can review?

@cloneofsimo @sayakpaul
@yiyixuxu @asomoza

@sayakpaul (Member)

Thanks for yet another contribution! Could you post a snippet and some results?

@AstraliteHeart (Contributor, Author)

Apologies, I had to clean things up and make the tests actually work.

The docstrings included in the CL should serve as a good snippet, i.e.

```python
import torch
from diffusers import AuraFlowImg2ImgPipeline
import requests
from PIL import Image
from io import BytesIO

# download an initial image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))

pipe = AuraFlowImg2ImgPipeline.from_pretrained("fal/AuraFlow-v0.3", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "A fantasy landscape, trending on artstation"
image = pipe(prompt=prompt, image=init_image, strength=0.75, num_inference_steps=50).images[0]
image.save("aura_flow_img2img.png")
```

Unfortunately, it seems my math may be wrong somewhere?

With strength 0.75: [image]

With strength 0.85: [image]

With strength 0.95: [image]

But with 0.9: [image]

@DN6, any ideas?

Comment on lines 422 to 426
# Compute latents
latents = mean + std * sample

# Scale latents
latents = latents * self.vae.config.scaling_factor
Member

Contributor (Author)

The VAE config has ('latents_mean', None), ('latents_std', None), so I believe the code would be a no-op, but I implemented it anyway.

Member

Oh okay then init_latents *= self.vae.config.scaling_factor should be just fine. We can safely remove the other things.
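For reference, the SD3-style normalization being simplified here can be sketched in plain Python; the helper name and scalar arguments below are illustrative, not the actual pipeline code:

```python
def normalize_latents(sample, scaling_factor, latents_mean=None, latents_std=None):
    """Sketch of SD3-style latent normalization (illustrative names).

    When the VAE config carries latents_mean/latents_std, the sample is
    shifted and rescaled; when both are None (as in the AuraFlow VAE
    config), it reduces to plain scaling by scaling_factor.
    """
    if latents_mean is not None and latents_std is not None:
        return (sample - latents_mean) * scaling_factor / latents_std
    return sample * scaling_factor
```

With both stats unset, this degenerates to the single `init_latents *= self.vae.config.scaling_factor` line, which is why the extra branches can be dropped.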

@DN6 (Collaborator) left a comment

Looking good 👍🏽 I think some parts of the implementation have to be adjusted to look more similar to the existing Flux pipelines.


return timesteps, num_inference_steps - t_start

def prepare_img2img_latents(
Collaborator

Can this be placed under prepare_latents like in the other Img2Img pipelines?

Contributor (Author)

Not sure if this is what you intended, but done?
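For context, in the flow-matching img2img pipelines the latent preparation covers both the text-to-image and image-to-image paths; a toy scalar sketch of the blending math (illustrative names; in spirit this mirrors the scheduler's scale_noise step):

```python
def prepare_latents(noise, image_latents=None, sigma=None):
    # Text-to-image path: start from pure noise.
    if image_latents is None:
        return noise
    # Image-to-image path: flow-matching interpolation between the
    # encoded input image and noise at the starting sigma.
    return sigma * noise + (1.0 - sigma) * image_latents
```

At sigma = 1 the input image is ignored entirely (pure noise), and at sigma = 0 the latents are exactly the encoded image, which is why the choice of starting sigma (i.e. strength) matters so much here.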

prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)

# 5. Prepare timesteps
timesteps, num_inference_steps = self.get_timesteps(
Collaborator

I think the timesteps need to be adjusted for strength and shift, no?

sigmas = np.linspace(1.0, 1 / num_inference_steps, num_inference_steps) if sigmas is None else sigmas
image_seq_len = (int(height) // self.vae_scale_factor // 2) * (int(width) // self.vae_scale_factor // 2)
mu = calculate_shift(
image_seq_len,
self.scheduler.config.get("base_image_seq_len", 256),
self.scheduler.config.get("max_image_seq_len", 4096),
self.scheduler.config.get("base_shift", 0.5),
self.scheduler.config.get("max_shift", 1.15),
)
timesteps, num_inference_steps = retrieve_timesteps(
self.scheduler,
num_inference_steps,
device,
sigmas=sigmas,
mu=mu,
)
timesteps, num_inference_steps = self.get_timesteps(num_inference_steps, strength, device)

Contributor (Author)

done, probably :)
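For reference, the strength-based truncation used by the other img2img pipelines boils down to skipping the first part of the schedule; a rough sketch (not the exact diffusers code):

```python
def get_timesteps(timesteps, num_inference_steps, strength):
    # Skip the first (1 - strength) fraction of the schedule so denoising
    # starts from a partially noised version of the input image rather
    # than from pure noise.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return timesteps[t_start:], num_inference_steps - t_start
```

With strength = 1.0 the full schedule is kept (equivalent to text-to-image); with strength = 0.75 roughly the first quarter of the steps is dropped.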

@AstraliteHeart (Contributor, Author) commented Apr 18, 2025

Unfortunately, I'm still seeing visual noise instead of the image at some values of strength, so something in my math must be wrong.

@sayakpaul (Member)

Do those values generally tend to be higher?

@AstraliteHeart (Contributor, Author)

[image]

@sayakpaul (Member)

A bit hard to see, sorry.

@AstraliteHeart (Contributor, Author)

Weird, I can click on the image to get the full-sized one (with an extra click).
It's the 0.88-0.94 strength range that produces visual noise, and I think images above 0.94 don't even use the initial image.

@AstraliteHeart (Contributor, Author)

I still have no idea what is going on, but I think the code is correct; AF may just require some special handling. Here are my observations:

  1. When using certain input images, the VAE encoder, responsible for creating the initial latent representation (x0) of the input image, produces a latent distribution (latent_dist) where the standard deviation (std) component consistently collapses to effectively zero (e.g., std=0.0000, corresponding to a highly negative logvar).

  2. This variance collapse is observed even when ensuring the VAE was loaded and operated entirely in float32 precision. This confirms the issue is not merely an fp16 underflow problem during VAE computation but rather suggests the SDXL VAE predicts near-zero variance for this type of input (I was aware of different issues with SDXL VAE but not this specific behavior).

  3. Because the predicted std is zero, the subsequent step of sampling the initial latent variable x0 from this distribution (mean + std * noise) becomes deterministic, effectively yielding only the mean component (x0 = mean).

  4. To address the lack of variance in x0, I attempted an experiment in which the logvar output by the VAE was manually "clamped" to a minimum value (tested min_logvar = -10.0, giving std ≈ 0.0067, and min_logvar = -4.0, giving std ≈ 0.135) before sampling x0. This successfully introduced non-zero variance into the initial latent state. At this point my assumption was simply that the issue lies in the VAE.

  5. Despite successfully injecting variance into x0 via clamping, the pipeline still produced noise/corrupted images when run at high strength values (e.g., strength=0.9).

  6. Crucially, the pipeline works reasonably well at lower strength values (e.g., strength=0.7), producing recognizable image outputs that incorporate the initial image structure.

The core issue no longer seems to be only the deterministic x0 caused by the initial VAE variance collapse (as fixing that didn't solve the high-strength problem). Instead, the failure at high strength (0.9) may stem from an instability in the denoising process itself when initiated from the very high noise levels corresponding to these high strengths. The process is stable when starting from the lower noise levels associated with moderate strength (0.7).
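The deterministic sampling and the clamping experiment in points 3-4 can be sketched as a toy scalar version (names illustrative, not the pipeline code):

```python
import math

def sample_initial_latent(mean, logvar, noise, min_logvar=None):
    # Reparameterized sampling of x0 from the VAE posterior:
    # x0 = mean + std * noise, with std = exp(0.5 * logvar).
    # Optionally clamp logvar from below to force a non-zero std,
    # as in the experiment described above.
    if min_logvar is not None:
        logvar = max(logvar, min_logvar)
    std = math.exp(0.5 * logvar)
    return mean + std * noise

# When the posterior collapses (logvar highly negative), std ~= 0 and
# sampling effectively returns the mean. Clamping logvar at -10 gives
# std ~= 0.0067; clamping at -4 gives std ~= 0.135.
```

As noted above, injecting variance this way did not fix the high-strength failure, pointing at the denoising trajectory rather than the initial sample.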

@sayakpaul (Member)

Thanks for the detailed analysis. Does this behavior vary between AuraFlow and AuraFlow 0.3?

@AstraliteHeart (Contributor, Author)

Yes, I am seeing the same behavior in 0.2.

I focused too much on the VAE in the last comment - I don't think it's the root cause (after all, SDXL works just fine), and perhaps the real issue is some kind of numerical instability we are facing?

I've included 3 videos, generated by taking a snapshot of the model state every 5 steps - before, at the moment of the issue, and after. Looking at them, I can't spot anything weird that would explain the problem.

combined_steps_s0.85.mp4
combined_steps_s0.90.mp4
combined_steps_s0.95.mp4

I've also attempted to affect the problematic strength range by changing the number of steps or the guidance scale, but it had no effect. Interestingly, enabling use_karras_sigmas=True on the scheduler seems to "fix" the issue, as I can no longer hit the noise output, but I still see a very sharp change from "low-strength i2i" to "just t2i" at around 0.98 strength. Feels like I am missing something super obvious here.
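For what it's worth, the Karras schedule interpolates in sigma^(1/rho) space, which concentrates steps near the low-sigma end and may change where the regime break lands. A plain-Python sketch of the standard formula (illustrative, not the diffusers implementation):

```python
def karras_sigmas(sigma_min, sigma_max, n, rho=7.0):
    # Karras et al. (2022) schedule: interpolate linearly between
    # sigma_max**(1/rho) and sigma_min**(1/rho), then raise back to
    # the rho-th power, giving a monotonically decreasing schedule
    # that samples the low-noise region more densely.
    ramp = [i / (n - 1) for i in range(n)]
    min_inv = sigma_min ** (1.0 / rho)
    max_inv = sigma_max ** (1.0 / rho)
    return [(max_inv + t * (min_inv - max_inv)) ** rho for t in ramp]
```

Because the high-sigma end is traversed in fewer, coarser steps, a strength-based truncation lands on a different effective starting noise level than with the default linear schedule, which could mask (rather than solve) the instability.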
