WIP - STILL NEED TO TRAIN SHIFT 3, QINGLONG QWEN, AND MAYBE MORE SPARSE UNIFORM VARIATIONS
Qwen-Image Film
Comparing different timestep sampling methods: Shift 3, Qwen Shift, Qinglong Qwen, and two "new" methods I thought up, one based on KL-Optimal and another that is a "sparsified" uniform.
I trained using kohya-ss's musubi-tuner, with my additions added here. Training used 24GB VRAM, 32GB RAM, and ~7GB of swap.
A comparison between targeted timesteps can be found here.
KL-Optimal Details
KL-Optimal Multi targets 133 timesteps:
[
1, 16, 26, 32, 33, 42, 48, 51, 53, 64, 66, 76, 81, 83, 97, 99, 102, 105, 113, 125,
128, 129, 132, 146, 153, 159, 162, 165, 167, 178, 179, 195, 199, 206, 210, 212, 213, 228, 232, 233,
245, 253, 259, 262, 268, 280, 286, 297, 298, 304, 314, 315, 325, 332, 340, 342, 344, 350, 368, 371,
377, 384, 387, 390, 400, 405, 414, 424, 429, 439, 443, 445, 453, 460, 462, 482, 489, 491, 493, 502,
510, 522, 523, 535, 541, 542, 555, 563, 577, 585, 589, 596, 606, 622, 623, 628, 650, 651, 653, 659,
668, 674, 696, 698, 714, 717, 722, 727, 734, 746, 767, 772, 774, 778, 798, 810, 815, 821, 824, 847,
851, 859, 877, 879, 900, 903, 908, 921, 937, 938, 951, 968, 999
]
When using the lightx2v Lightning LoRAs, I get the best results with the KL-Optimal scheduler with no shift applied.
From that, I decided to take the sigmas KL-Optimal chooses at 4, 8, 16, 20, 25, 32, and 50 steps and target only those timesteps during training; 4, 8, and 16 steps work well with the Lightning LoRAs, while 20, 25, 32, and 50 are probably better for inference without Lightning.
Since the sigmas are decimals that may not convert cleanly to integer timesteps, for the KL-Optimal Multi method I converted each sigma to a timestep, rounded it to the nearest integer, and dropped any duplicates.
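A minimal sketch of that procedure, assuming the kl_optimal formula used in ComfyUI (linear interpolation in arctan-space between sigma_max and sigma_min) and the usual flow-matching mapping of a sigma in (0, 1] to an integer timestep via round(sigma * 1000). The sigma_min = 0.001 default is my guess, not a value taken from the training code:

```python
import math

def kl_optimal_sigmas(n_steps, sigma_min=0.001, sigma_max=1.0):
    # ComfyUI-style KL-Optimal: interpolate between atan(sigma_max) and
    # atan(sigma_min), then map back through tan(). The trailing 0.0
    # sigma of the inference schedule is omitted since it has no timestep.
    a_max, a_min = math.atan(sigma_max), math.atan(sigma_min)
    return [math.tan(a_max + (a_min - a_max) * i / (n_steps - 1))
            for i in range(n_steps)]

def kl_optimal_multi(step_counts=(4, 8, 16, 20, 25, 32, 50)):
    timesteps = set()
    for n in step_counts:
        for sigma in kl_optimal_sigmas(n):
            # Round to the nearest integer timestep, clamp into [1, 999];
            # the set drops duplicates shared across step counts.
            timesteps.add(min(max(round(sigma * 1000), 1), 999))
    return sorted(timesteps)

print(kl_optimal_multi())
```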
The hope is that this results in better inference quality when using KL-Optimal, possibly at the cost of becoming worse with dissimilar schedulers.
Uniform Sparse Details
Uniform Sparse 100 targets 101 timesteps:
[
1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200,
210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400,
410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600,
610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800,
810, 820, 830, 840, 850, 860, 870, 880, 890, 900, 910, 920, 930, 940, 950, 960, 970, 980, 990, 999
]
After seeing the timesteps selected by KL-Optimal Multi, I noticed the selection looked roughly uniform, just with far fewer total timesteps. From that, I figured I should try a true "sparse uniform": during training, each timestep is randomly chosen from a small, evenly spaced set spanning 1 to 999.
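A minimal sketch of how such a set could be built and sampled, assuming the list above comes from evenly spaced points clamped into the valid [1, 999] timestep range (the np.linspace construction is my reconstruction, not the actual training code):

```python
import numpy as np

def uniform_sparse_timesteps(n=100, t_min=1, t_max=999):
    # n + 1 evenly spaced points over [0, 1000]; the endpoints get
    # clamped into the valid range, and np.unique sorts + dedupes.
    ts = np.linspace(0, 1000, n + 1).round().astype(int)
    return np.unique(np.clip(ts, t_min, t_max))

targets = uniform_sparse_timesteps(100)  # reproduces the 101 values above
rng = np.random.default_rng(42)
t = int(rng.choice(targets))             # one random target per training sample
```

The uniform-sparse-40 and uniform-sparse-10 variants mentioned below would presumably be the same construction with n = 40 and n = 10.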
Using fewer timesteps seems to increase the strength of the LoRA, but past a point it seems to cause artifacts (compare the uniform-sparse-10 LoRA vs. the uniform-sparse-40 LoRA).
How sparse you can go may also depend on other settings such as learning rate and total steps; I trained all the LoRAs with the same 1e-4 LR, 500 steps, and seed 42, so I cannot really say.
Main Settings
- FP8 scaled
- BF16 mixed precision
- 8-bit CAME optimizer with Stochastic Rounding and Cautious Masking
- 0.01 weight decay
- 1 batch size
- 0.0001 learning rate
- Cosine learning rate scheduler
- 10 warmup steps (schedule sketched after this list)
- 42 seed
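For reference, a minimal sketch of what the cosine schedule with warmup looks like per step; whether musubi-tuner decays exactly to zero, and how it counts warmup steps, are assumptions here:

```python
import math

def lr_at(step, base_lr=1e-4, warmup=10, total=500):
    # Linear warmup over the first `warmup` steps, then cosine decay
    # from base_lr down toward 0 over the remaining steps (assumed floor).
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / (total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```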
LoRA Settings
- 16 rank
- 4 alpha (aiming for 25% scaling; see the sketch after this list)
- 5% neuron dropout
- 5% rank dropout
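The 25% figure comes from the standard LoRA convention of scaling the low-rank update by alpha / rank; a minimal sketch (the layer shapes are arbitrary, for illustration only):

```python
import torch

rank, alpha = 16, 4
scale = alpha / rank  # 4 / 16 = 0.25 -> the "25% scaling" above

# Standard LoRA forward pass: y = Wx + (alpha / rank) * B(A(x))
d = 64
W = torch.randn(d, d)            # frozen base weight
A = torch.randn(rank, d) * 0.01  # trainable down-projection
B = torch.zeros(d, rank)         # trainable up-projection, zero-init
x = torch.randn(d)
y = W @ x + scale * (B @ (A @ x))
```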
Dataset Settings
- 2,684 images from r/35mm, r/analog, r/film, r/filmphotography, r/filmphotos, and r/mediumformat
- Captioned with Gemini
- 512x512, 768x768, and 1024x1024 non-upscale buckets (2,684 x 3)
- 500 max steps
Comparisons
- Seed: 42
- Steps: 16
- Sampler: Euler
- Scheduler: KL-Optimal
- CFG: 1.0
- PJMixers-Images/lightx2v_Qwen-Image-Lightning-8steps-V1.0-V1.1 LoRA @ 1.0 strength
- Film LoRA @ 1.25 strength
No LoRA vs. Qwen Shift vs. KL-Optimal Multi vs. Uniform Sparse 40 vs. Uniform Sparse 10
A close-up portrait photo of an older man looking directly at the viewer. He has shaved facial hair, and is wearing a navy blue sweatshirt. The background is a city park, and the man is sitting on a bench. Both the man and the background are in focus.
A young woman with light brown hair and brown eyes is looking directly at the camera. She has freckles on her face and is wearing a gray tank top with thin straps. She is also wearing a silver necklace with a round pendant. The background is completely black, making the woman the focal point of the image.
A large body of water with snowy mountains in the background. The foreground is covered in rolling fog, and there are clouds in the otherwise blue sky. The sky and landscape are dramatic and extraordinary. There is a dirt path leading to a forest.
Close-up food photo of a hybrid snail composed entirely of glossy sticky cinnamon buns. The shell is made from a puffy perfectly swirled cinnamon bun covered in a thick glossy white glaze. Baked edges with a jagged cinnamon bun texture slightly caramelized, dark cinnamon filling inside, rich golden brown color. The glaze drips down in thick sweet drops, the snail tendrils are made of twisted cinnamon dough, glistening with icing sugar, the glaze reflects warm, natural light. The scene is shot in a soft, fuzzy kitchen setting, with a hint of freshly baked pastries in the background.
A woman with dark brown hair and dark lipstick is sitting on a subway train seat, looking directly at the viewer. She is wearing a shiny black halter dress with a zipper down the center, black gloves on her hands, and black boots. Her eyes appear to have dark eyeshadow makeup. The subway train seats are teal in color and have horizontal lines running across them. Through the train windows on the sides, blurry figures of people can be seen.