misc-models

qwenimage-blob_emoji-4-s020-6.safetensors

Blob emoji LoRA.

The training captions follow the pattern "Yellow blob emoji with smiling face with smiling eyes. The background is gray.", so phrases such as "blob emoji" or "blob emoji with face ..." act as trigger words.

  • Blob emoji with face holds a sign says "Blob Emoji" in front of Japanese Shrine. --w 1024 --h 1024 --s 50 --d 1001 sample1

  • Blob emoji face drives a red sport car along a curved road on a cliff overlooking the sea. The sea is dotted with whitecaps. The sky is blue, and cumulonimbus clouds float on the horizon. --w 1664 --h 928 --s 50 --d 12345678 sample2

Dataset Creation Procedure

The dataset was created following these steps (a rough sketch of the conversion and captioning script is shown after the list):

  • The SVG files from C1710/blobmoji (licensed under ASL 2.0) were used. Specifically, 118 different yellow blob emojis were selected from the SVG files.
  • cairosvg was used to convert these SVGs into 512x512 pixel transparent PNGs.
  • A script was then used to pad the images to 640x640 pixels and generate four versions of each image with different background colors: white, light gray, gray, and black. This resulted in a total of 472 images.
  • The captions were generated based on the official Unicode names of the emojis: the prefix "Yellow blob emoji with " and the suffix ". The background is <color>." were added to each name.
    • For example: Yellow blob emoji with smiling face with smiling eyes. The background is gray.
    • Note: For some emojis (e.g., devil, zombie), the word Yellow was omitted from the prefix.
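
For illustration, a minimal sketch of the conversion and captioning steps above could look like the following. This is not the published script; cairosvg and Pillow are assumed to be installed, and the RGB values, output file layout, and the unicode_name argument are assumptions.

# Hedged sketch of the SVG -> padded PNG -> caption pipeline described above.
# The RGB values, output layout, and unicode_name argument are assumptions.
from pathlib import Path
import cairosvg
from PIL import Image

BACKGROUNDS = {
    "white": (255, 255, 255),
    "light gray": (211, 211, 211),
    "gray": (128, 128, 128),
    "black": (0, 0, 0),
}

def process_emoji(svg_path: Path, unicode_name: str, out_dir: Path) -> None:
    # 1. Render the SVG to a 512x512 transparent PNG with cairosvg.
    rgba_path = out_dir / f"{svg_path.stem}_rgba.png"
    cairosvg.svg2png(url=str(svg_path), write_to=str(rgba_path),
                     output_width=512, output_height=512)
    emoji = Image.open(rgba_path).convert("RGBA")

    # 2. Pad to 640x640 and composite onto each of the four background colors.
    for color_name, rgb in BACKGROUNDS.items():
        canvas = Image.new("RGBA", (640, 640), rgb + (255,))
        canvas.paste(emoji, (64, 64), emoji)  # 64 px border on every side
        stem = f"{svg_path.stem}_{color_name.replace(' ', '_')}"
        canvas.convert("RGB").save(out_dir / f"{stem}.png")

        # 3. Write the caption: prefix + Unicode name + background suffix.
        caption = (f"Yellow blob emoji with {unicode_name}. "
                   f"The background is {color_name}.")
        (out_dir / f"{stem}.txt").write_text(caption, encoding="utf-8")

Applied to the 118 selected SVGs, this yields the 472 image/caption pairs described above (four background variants per emoji).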

Dataset Definition

# general configurations
[general]
resolution = [640, 640]
batch_size = 16
enable_bucket = true
bucket_no_upscale = false
caption_extension = ".txt"

[[datasets]]
image_directory = "path/to/images_and_captions_dir"
cache_directory = "path/to/cache_dir"

Training Command

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 --rdzv_backend=c10d \
src/musubi_tuner/qwen_image_train_network.py  \
--dit path/to/dit.safetensors --vae path/to/vae.safetensors \
--text_encoder path/to/vlm.safetensors \
--dataset_config path/to/blob_emoji_v1_640_bs16.toml \
--output_dir path/to/output_dir \
--learning_rate 2e-4 \
--timestep_sampling shift --weighting_scheme none --discrete_flow_shift 2.0 \
--max_train_epochs 16 --mixed_precision bf16 --seed 42 --gradient_checkpointing \
--network_module=networks.lora_qwen_image \
--network_dim=4 --network_args loraplus_lr_ratio=4 \
--save_every_n_epochs=1  --max_data_loader_n_workers 2 \
--persistent_data_loader_workers \
--logging_dir ./logs --log_prefix qwenimage-blob4-2e4- \
--output_name qwenimage-blob4-2e4 \
--optimizer_type adamw8bit --flash_attn --split_attn \
--log_with tensorboard \
--sample_every_n_epochs 1 --sample_prompts path/to/prompts_qwen_blob_emoji.txt \
--fp8_base --fp8_scaled
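
For reference, the file passed to --sample_prompts holds one prompt per line with the generation options appended, as in the sample prompts listed earlier (the trailing "sample1"/"sample2" labels refer to the sample images and are not part of the file). A hypothetical prompts_qwen_blob_emoji.txt could look like this:

Blob emoji with face holds a sign says "Blob Emoji" in front of Japanese Shrine. --w 1024 --h 1024 --s 50 --d 1001
Blob emoji face drives a red sport car along a curved road on a cliff overlooking the sea. The sea is dotted with whitecaps. The sky is blue, and cumulonimbus clouds float on the horizon. --w 1664 --h 928 --s 50 --d 12345678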

Training Details

  • Training was conducted on a Windows machine with a multi-GPU setup (2x RTX A6000).
  • If you are not using a Windows environment or not performing multi-GPU training, please remove the --rdzv_backend=c10d argument.
  • Please note that due to the 2-GPU setup, the effective batch size is 32 (16 per GPU x 2 GPUs). To reproduce this with limited VRAM, increase the gradient accumulation steps (e.g., batch_size 16 with 2 accumulation steps on a single GPU). That said, you should also be able to train successfully with a lower batch size by adjusting the learning rate.
  • The model was trained for 6 epochs (90 steps), which took approximately 1 hour with the Power Limit set to 60%.
  • Finally, the weights from all 6 epochs were merged using the LoRA Post-Hoc EMA script from Musubi Tuner with sigma_rel=0.2.
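
For intuition, the power-function Post-Hoc EMA weights later checkpoints more heavily, with an exponent gamma derived from sigma_rel. The sketch below is a rough approximation that directly averages the per-epoch checkpoints with that profile; it is not the Musubi Tuner script (which reconstructs the EMA more carefully), and the checkpoint paths and dtype handling are placeholders.

# Rough sketch of power-function Post-Hoc EMA weighting; not the actual
# Musubi Tuner script. Checkpoint paths and dtype handling are assumptions.
import numpy as np
from safetensors.torch import load_file, save_file

def gamma_from_sigma_rel(sigma_rel: float) -> float:
    # Solve sigma_rel^2 = (g + 1) / ((g + 2)^2 * (g + 3)) for g by bisection
    # (the left-hand side decreases monotonically as g grows).
    target = sigma_rel ** 2
    lo, hi = 0.0, 1e6
    for _ in range(200):
        mid = (lo + hi) / 2
        val = (mid + 1) / ((mid + 2) ** 2 * (mid + 3))
        lo, hi = (mid, hi) if val > target else (lo, mid)
    return (lo + hi) / 2

def merge_post_hoc_ema(paths, sigma_rel=0.2, out_path="merged_lora.safetensors"):
    # Weight checkpoint i (1-based, in training order) by (i / n) ** gamma.
    gamma = gamma_from_sigma_rel(sigma_rel)
    n = len(paths)
    weights = np.array([(i / n) ** gamma for i in range(1, n + 1)])
    weights /= weights.sum()
    merged, dtypes = {}, {}
    for w, path in zip(weights, paths):
        for key, tensor in load_file(path).items():
            dtypes.setdefault(key, tensor.dtype)
            merged[key] = merged.get(key, 0) + tensor.float() * float(w)
    save_file({k: v.to(dtypes[k]) for k, v in merged.items()}, out_path)

# Usage (epoch checkpoint file names are hypothetical):
# merge_post_hoc_ema([f"qwenimage-blob4-2e4-{e:06d}.safetensors" for e in range(1, 7)])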

fp-1f-kisekae-1024-v4-2-PfPHEMA.safetensors

Post-Hoc EMA version (power function, sigma_rel=0.2) of the following LoRA. Usage is the same.

fp-1f-kisekae-1024-v4-2.safetensors

Experimental LoRA for FramePack One Frame kisekaeichi. The target index is 5. The prompt is as follows:

The girl stays in the same pose, but her outfit changes into a <costume description>, then she changes into another girl wearing the same outfit.

The costume description is something like "school uniform". A more detailed description may improve the results, for example "T-shirt with writing on it" or "Girl with long hair".

This model was trained at 1024x1024 resolution. Please use it at roughly the same resolution.

fp-1f-chibi-1024.safetensors

Experimental LoRA for FramePack One Frame Inference. The target index is 9. The prompt is as follows:

An anime character transforms: her head grows larger, her body becomes shorter and smaller, eyes become bigger and cuter. She turns into a chibi (super-deformed) version, with cartoonishly cute proportions. The transformation is quick and playful.

This model was trained at 1024x1024 resolution. Please use it at roughly the same resolution. If the effect is too strong, lower the multiplier (strength) to 0.8 or less.

FramePack-dance-lora-d8.safetensors

Experimental LoRA for FramePack. This is for testing purposes and the effect is weak. Please set the prompt to something like "A woman is spinning on her tiptoes".

flux-hasui-lora-d4-sigmoid-raw-gs1.0.safetensors

Experimental LoRA for FLUX.1 dev.

Trained with the sd-scripts (Aug. 11) sd3 branch. NOTE: These settings require more than 26GB of VRAM. Add --fp8_base to enable fp8 training and reduce VRAM usage.

accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train_network.py \
--pretrained_model_name_or_path flux1/flux1-dev.sft --clip_l sd3/clip_l.safetensors \
--t5xxl sd3/t5xxl_fp16.safetensors --ae flux1/ae_dev.sft \
--cache_latents_to_disk --save_model_as safetensors --sdpa \
--persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 \
--gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
--network_module networks.lora_flux --network_dim 4 \
--optimizer_type adamw8bit --learning_rate 1e-3 \
--network_train_unet_only --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk \
--highvram --max_train_epochs 4 --save_every_n_epochs 1 \
--dataset_config hasui_1024_bs1.toml --output_dir flux/lora --output_name lora-name \
--timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0

The .toml file is below.

[general]
flip_aug = true
color_aug = false

[[datasets]]
enable_bucket = true
resolution = [1024,1024]
bucket_reso_steps = 64
max_bucket_reso = 2048
min_bucket_reso = 128
bucket_no_upscale = false
batch_size = 1
random_crop = false
shuffle_caption = false

  [[datasets.subsets]]
  image_dir = "path/to/train/images"
  num_repeats = 1
  caption_extension = ".txt"

sdxl-negprompt8-v1m.safetensors

Negative embedding for SDXL. Num vectors per token = 8.
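
A hedged example of loading it as a negative prompt with diffusers; the state-dict keys ("clip_l" / "clip_g") and the token string are assumptions about this file, not something confirmed in this repo.

# Hypothetical usage with diffusers. The "clip_l"/"clip_g" keys and the token
# name are assumptions; adjust to whatever the file actually contains.
import torch
from diffusers import StableDiffusionXLPipeline
from safetensors.torch import load_file

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

state_dict = load_file("sdxl-negprompt8-v1m.safetensors")
# Register the 8-vector embedding with both SDXL text encoders.
pipe.load_textual_inversion(state_dict["clip_l"], token="negprompt8",
                            text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)
pipe.load_textual_inversion(state_dict["clip_g"], token="negprompt8",
                            text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)

image = pipe(prompt="a painting of a mountain lake",
             negative_prompt="negprompt8").images[0]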

stable-cascade-c-lora-hasui-v02.safetensors

Sample of LoRA for Stable Cascade Stage C.

Feb 22, 2024 Update: Fixed a bug where LoRA was not applied to some modules (to_q/k/v and to_out) in Attention.

This is an experimental model, so the format of the weights may change in the future.

  • a painting of an anthropomorphic penguin sitting in a cafe reading a book and having a coffee --w 1024 --h 1024 --d 1 sample1

  • a painting of japanese shrine in winter with snowfall --w 832 --h 1152 --d 1234 sample2

This model was trained on 169 captioned images. U-Net only, dim=4, conv_dim=4, alpha=1, lr=1e-3, 4 epochs, mixed precision bf16, 8bit AdamW, batch size 8, resolution 1024x1024 with aspect ratio bucketing. VRAM usage is approximately 22 GB.
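
For reference, with dim=4 and alpha=1 the LoRA update added to each targeted projection (including the to_q/k/v and to_out modules mentioned in the bug fix above) is scaled by alpha/dim = 0.25. A generic sketch of that mechanism, not the sd-scripts implementation:

# Generic LoRA illustration (dim=4, alpha=1), not the Stable Cascade /
# sd-scripts code. It wraps a frozen linear projection such as to_q/k/v or to_out.
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, dim: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base                                   # frozen original projection
        self.down = nn.Linear(base.in_features, dim, bias=False)
        self.up = nn.Linear(dim, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                     # starts as a no-op
        self.scale = alpha / dim                           # 1 / 4 = 0.25 here

    def forward(self, x):
        # Original projection plus the low-rank update, scaled by alpha/dim.
        return self.base(x) + self.up(self.down(x)) * self.scale

conv_dim=4 applies an analogous low-rank update to convolution layers.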