---
base_model: black-forest-labs/FLUX.1-dev
library_name: diffusers
license: other
inference: true
tags:
  - flux
  - flux-diffusers
  - text-to-image
  - diffusers
  - control
  - diffusers-training
---

Flux Edit

These are control weights fine-tuned from black-forest-labs/FLUX.1-dev on the TIGER-Lab/OmniEdit-Filtered-1.2M dataset for image editing. We use the Flux Control framework for fine-tuning.

License

Please adhere to the licensing terms of the underlying black-forest-labs/FLUX.1-dev model.

Intended uses & limitations

Inference

from diffusers import FluxControlPipeline, FluxTransformer2DModel
from diffusers.utils import load_image
import torch

# Load the fine-tuned edit transformer and plug it into the Flux Control pipeline.
path = "sayakpaul/FLUX.1-dev-edit-v0" # to change
edit_transformer = FluxTransformer2DModel.from_pretrained(path, torch_dtype=torch.bfloat16)
pipeline = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=edit_transformer, torch_dtype=torch.bfloat16
).to("cuda")

# The source image to edit acts as the control image.
image = load_image("./assets/mushroom.jpg") # resize as needed.
print(image.size)

prompt = "turn the color of mushroom to gray"
image = pipeline(
    control_image=image,
    prompt=prompt,
    guidance_scale=30., # change this as needed.
    num_inference_steps=50, # change this as needed.
    max_sequence_length=512,
    height=image.height,
    width=image.width,
    generator=torch.manual_seed(0)
).images[0]
image.save("edited_image.png")
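
Regarding the "resize as needed" comment above: the control image's dimensions should be divisible by 16, since Flux downsamples by 8x in the VAE and then packs latents 2x2. A minimal, hypothetical helper (not part of the original card) could look like:

from PIL import Image

# Hypothetical helper: snap an image to dimensions divisible by 16 so it
# survives Flux's 8x VAE downsampling plus 2x2 latent packing cleanly.
def resize_to_multiple_of_16(img: Image.Image) -> Image.Image:
    width, height = img.size
    return img.resize((width // 16 * 16, height // 16 * 16))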

Speeding inference with a turbo LoRA

We can speed up inference by using a turbo LoRA such as ByteDance/Hyper-SD, which lets us reduce num_inference_steps substantially while still producing a good image.

Make sure to install peft before running the code below: pip install -U peft.

Code
from diffusers import FluxControlPipeline, FluxTransformer2DModel
from diffusers.utils import load_image
from huggingface_hub import hf_hub_download
import torch

path = "sayakpaul/FLUX.1-dev-edit-v0" # to change
edit_transformer = FluxTransformer2DModel.from_pretrained(path, torch_dtype=torch.bfloat16)
pipeline = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=edit_transformer, torch_dtype=torch.bfloat16
).to("cuda")

# load the turbo LoRA
pipeline.load_lora_weights(
    hf_hub_download("ByteDance/Hyper-SD", "Hyper-FLUX.1-dev-8steps-lora.safetensors"), adapter_name="hyper-sd"
)
# 0.125 is the adapter strength commonly used with the Hyper-FLUX 8-step LoRA.
pipeline.set_adapters(["hyper-sd"], adapter_weights=[0.125])

image = load_image("./assets/mushroom.jpg") # resize as needed.
print(image.size)

prompt = "turn the color of mushroom to gray"
image = pipeline(
    control_image=image,
    prompt=prompt,
    guidance_scale=30., # change this as needed.
    num_inference_steps=8, # change this as needed.
    max_sequence_length=512,
    height=image.height,
    width=image.width,
    generator=torch.manual_seed(0)
).images[0]
image.save("edited_image.png")
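
Optionally, you can fuse the LoRA into the transformer weights to avoid the adapter overhead at each forward pass. This is a standard Diffusers call, shown here as a suggestion rather than part of the original recipe:

# Optional: fuse the LoRA into the base weights at the chosen scale.
pipeline.fuse_lora(lora_scale=0.125)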


Comparison
(Side-by-side comparison of 50-step and 8-step outputs across four example edits; images omitted.)

You can also apply quantization if the pipeline's memory requirements still exceed what your hardware can satisfy. Refer to the Diffusers documentation to learn more.
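
As a minimal sketch (assuming bitsandbytes is installed: pip install bitsandbytes), the edit transformer can be loaded in 4-bit NF4 via Diffusers' BitsAndBytesConfig; exact settings are a suggestion, not the card's prescribed configuration:

from diffusers import BitsAndBytesConfig, FluxControlPipeline, FluxTransformer2DModel
import torch

# Quantize only the transformer; the rest of the pipeline stays in bf16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
edit_transformer = FluxTransformer2DModel.from_pretrained(
    "sayakpaul/FLUX.1-dev-edit-v0", quantization_config=quant_config, torch_dtype=torch.bfloat16
)
pipeline = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=edit_transformer, torch_dtype=torch.bfloat16
)
# Offload submodules to CPU when idle to further reduce peak VRAM.
pipeline.enable_model_cpu_offload()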

guidance_scale also impacts the results:

Each source image was edited at guidance_scale values of 10, 20, 30, and 40 (images omitted). The prompts used:

  • Give this the look of a traditional Japanese woodblock print.
  • transform the setting to a winter scene
  • turn the color of mushroom to gray
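
To reproduce such a sweep with the pipeline and prompt from the inference example above (a hypothetical sketch; source stands in for the unedited control image):

# Hypothetical sweep: `source` is the unedited control image loaded earlier.
for gs in (10.0, 20.0, 30.0, 40.0):
    edited = pipeline(
        control_image=source,
        prompt=prompt,
        guidance_scale=gs,
        num_inference_steps=50,
        max_sequence_length=512,
        height=source.height,
        width=source.width,
        generator=torch.manual_seed(0),
    ).images[0]
    edited.save(f"edited_gs_{int(gs)}.png")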

Limitations and bias

Expect the model to underperform in some cases, since the exact training details of the original Flux Control models are not known.

Training details

The fine-tuning codebase is available here. Training hyperparameters:

  • Per GPU batch size: 4
  • Gradient accumulation steps: 4
  • Guidance scale: 30
  • BF16 mixed-precision
  • AdamW optimizer (8-bit, from bitsandbytes; see the sketch after this list)
  • Constant learning rate of 5e-5
  • Weight decay of 1e-6
  • 20000 training steps

Training was conducted on a single node of 8×H100 GPUs.
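
A minimal sketch of the optimizer setup implied by the hyperparameters above, assuming bitsandbytes' 8-bit AdamW (variable names are illustrative):

import bitsandbytes as bnb

# 8-bit AdamW with the constant learning rate and weight decay listed above.
optimizer = bnb.optim.AdamW8bit(
    transformer.parameters(),  # `transformer` is the Flux transformer being fine-tuned
    lr=5e-5,
    weight_decay=1e-6,
)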

We used a simplified flow mechanism to perform the linear interpolation. In pseudo-code, that looks like:

# Sample a per-example noise level uniformly in [0, 1).
sigmas = torch.rand(batch_size)
# Map each continuous noise level to a discrete scheduler timestep.
timesteps = (sigmas * noise_scheduler.config.num_train_timesteps).long()
...

# Linearly interpolate between the clean latents (sigma=0) and pure noise (sigma=1).
noisy_model_input = (1.0 - sigmas) * pixel_latents + sigmas * noise

where pixel_latents is computed from the source images and noise is drawn from a Gaussian distribution. For more details, check out the repository.
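
For illustration, a self-contained version of that interpolation with explicit broadcasting might look like the following (shapes and names are hypothetical, not the actual training code):

import torch

batch_size, channels, height, width = 4, 16, 64, 64
pixel_latents = torch.randn(batch_size, channels, height, width)  # stand-in for encoded source images
noise = torch.randn_like(pixel_latents)                           # Gaussian noise

sigmas = torch.rand(batch_size)        # one noise level per example
sigmas = sigmas.view(-1, 1, 1, 1)      # broadcast over (C, H, W)

# sigma=0 recovers the clean latents; sigma=1 gives pure noise.
noisy_model_input = (1.0 - sigmas) * pixel_latents + sigmas * noise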