Anima and different things not being the same
I found this pseudo-model repo really funny, and I couldn't stop thinking about how weird and partly wrong it is.
The outdated training mechanism holds the model back.
No specification of what "outdated" means in this context, of course.
Additionally, Qwen introduced tokenizer issues that were not found in the T5 model.
SD* models used to generate background from all text tokens, not just those which are related to the background.
SD1.5 and SDXL were U-net based convolutional neural networks. They had no concept of attention. A U-net at its core is a segmentation model which can achieve only a very sparse field of concepts. There's a reason we switched to ViTs very quickly for this task and U-nets got forgotten about until Disco Diffusion.
But additionally, this probably wasn't even the bottleneck for those models. I mean, switching to a DiT from Stable Diffusion 3 onwards did help a lot in this regard, but those models also kept using multimodal encoders from the stone age.
Which I can't blame them for, because there really was nothing better until multimodal language models got good recently. SigLIP might have helped, but it would have been stuck with the same "token per class" level of coarseness in its understanding of images during training.
Flux.1 did really well for what it had, but text encoders were the bottleneck for a long time. Cosmos used T5 because for what it was doing, T5 was perfectly adequate. In retraining Cosmos to fit the conditioning space created by Qwen3 0.6B, its ability to associate concepts got better, but I wouldn't be shocked if at the time this was posted, that conversion simply wasn't finished yet. As of this writing, Anima v1.0 is just over a day old.
The same is true for Anima, the attention modules are not weighted during training, which leads to concept bleeding and background issues.
So... again, there would never have been an opportunity to DO that with SDXL because U-nets don't have attention. I'm not aware of any projects on DiT models that are doing this either.
Never in my life have I heard of weighting different modules differently during training. Not in any modality, and especially not when multiple modalities are being juggled at once like this. I've heard of freezing modules for training, but simply changing the magnitude of the vectors shouldn't make any meaningful difference beyond making rectified flow harder to achieve and making the model... fit worse.
I tried to look it up and I got Scale Your Instructions and UltraGen, two papers, neither of which is relevant to this conversation. If this is a scene-specific technique, then I'm unsure why ready-for-battle sees fit to bring it up for the kind of extremely deep retrain that Anima represents.
Am I crazy? Am I missing something? I haven't worked on image models in a while.
The same text flows through at each inference step. It's still the unfiltered garbage in, garbage out, but with higher-resolution garbage from the training data.
I have no idea what model of image generation is in this guy's brain, but I doubt it's in keeping with reality.
You get ONE conditioning tensor for the whole generation process.
You are ALWAYS processing the same "garbage in, garbage out", no matter what.
The only exception to this is HiDream-Image-O1, which came out recently, and it's got a bad case of Alpha-itis. The reason you don't get background problems with newer models is that their DiTs are big and strong enough to understand that when one part of the latent image matches a given subset of the conditioning tensor, the model should develop that region further instead of spreading the concept out.
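To make the "one conditioning tensor" point concrete, here's a toy sketch in plain Python. Everything in it is made up for illustration: `encode_prompt`, `denoise_step`, and `sample` are hypothetical stand-ins, not any real pipeline's API, and the "math" is a placeholder. The only thing it's meant to show is the shape of the loop: the prompt gets encoded once, and that same object is handed to every denoising step.

```python
# Toy sampler loop. All names here are hypothetical stand-ins, not a real API.

def encode_prompt(prompt):
    """Stand-in text encoder: one fake embedding (a float) per token."""
    return [float(len(tok)) for tok in prompt.split()]


def denoise_step(latent, cond, t):
    """Stand-in denoiser: nudges every latent value toward the mean of cond.

    A real model would run a U-net or DiT here; this only mimics the signature.
    """
    target = sum(cond) / len(cond)
    return [x + (target - x) * t for x in latent]


def sample(prompt, steps=4):
    cond = encode_prompt(prompt)      # encoded exactly once
    latent = [0.0, 0.0, 0.0, 0.0]     # placeholder for the noisy latent
    cond_ids = []
    for i in range(steps):
        t = 1.0 / (steps - i)         # placeholder schedule, ends at t = 1.0
        latent = denoise_step(latent, cond, t)
        cond_ids.append(id(cond))     # record which object each step saw
    return latent, cond_ids


latent, cond_ids = sample("1girl solo outdoors")
# True if every step received the same conditioning object.
print(len(set(cond_ids)) == 1)
```

Nothing inside the loop re-reads or re-filters the prompt. Whatever the text encoder produced up front is what every step gets.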
The smallest DiT I've seen to date is Flux.2-Klein 4B's, which is, no bonus points for guessing, 4B. It has some issues with concept bleed but is okay. For broad-strokes editing it's quite capable for its size although I prefer the 9B for making fine edits.
The next smallest that I'm aware of is Z-Image, which has a 6B DiT. It's better, and in fact quite good in some respects. But here's the problem. And this is the reason that SDXL has been king in this very narrow slice of heavily stylized, subcultural art generation since 2023.
6B at 16 bits per weight means you need 12GB of VRAM for just the DiT. If you don't want to be switching back and forth, or OOMing on ancient generation UIs that can't offload (like anything A1111-based, to my knowledge; please correct me, because I'm leaving the community tab on this OPEN), you also need to keep the 4B text encoder on hand, which is another 8GB you have to find space for. You can reasonably run the text encoder at 8 bits to bring that down, but you can't do that with diffusion models without losing a bunch of quality, especially at these sizes.
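The back-of-envelope numbers here are just parameter count times bytes per weight. A quick sketch, assuming decimal gigabytes and counting weights only (no activations, no VAE, no framework overhead); `weights_gb` is a helper I made up for this, not something from any library:

```python
def weights_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in decimal GB."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


dit_16 = weights_gb(6, 16)   # 6B DiT at 16 bits per weight
te_16 = weights_gb(4, 16)    # 4B text encoder at 16 bits per weight
te_8 = weights_gb(4, 8)      # same text encoder at 8 bits per weight

print(dit_16, te_16, te_8)              # 12.0 8.0 4.0
print(dit_16 + te_16, dit_16 + te_8)    # 20.0 16.0
```

So keeping both resident at 16 bits is about 20GB of weights, or about 16GB with the text encoder at 8 bits, before you've generated a single pixel.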
So why did Anima pick Cosmos, and switch the text encoder from T5 to Qwen?
Because Stable Diffusion XL's U-net was 2.6B parameters, and both of its text encoders together were around 800M.
With Anima's 2B DiT and 600M text encoder, it is LIGHTER in weight than Stable Diffusion XL. (T5 was a couple billion, I don't remember off the top of my head, Nvidia used a chonky boi for it though.)
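Same comparison as arithmetic, using the round figures I just quoted (in millions of parameters, so the sums stay exact). These are my ballpark numbers, not official counts:

```python
# Ballpark parameter counts in millions, as quoted in the text.
sdxl = {"unet": 2600, "text_encoders": 800}
anima = {"dit": 2000, "text_encoder": 600}

sdxl_total = sum(sdxl.values())
anima_total = sum(anima.values())

print(sdxl_total, anima_total)              # 3400 2600
print(round(anima_total / sdxl_total, 2))   # 0.76
```

By these rough numbers, Anima carries about three quarters of the parameters SDXL does.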
Anima was made with a singular purpose in mind: get the anime dorks an offramp away from Precambrian era image generation tech.
And by doing all this janky nonsense, they accomplished this, without having to pretrain a whole model from scratch. Nothing proprietary has aesthetics this good. Nothing open weights can just know what you're talking about when you throw Gelbooru tags at it.
Anima is a finicky son of a bitch sometimes, but it takes SDXL levels of time to generate an image that is better than SDXL, including my darling wai_SHUFFLENOOB, and nothing else gets to make this claim.
So none of this bickering about weighting modules even matters. Could it have twiddled a couple hyperparameters better? Sure, maybe.
But it's a successful project of greater magnitude than you or I can say for ourselves, ready-for-battle.
We're all too broke to play the choosy beggar here.
Also, this monorepo could have been a post.
Model tree for inflatebot/anima-is-fine-actually
Base model
nvidia/Cosmos-Predict2-2B-Text2Image