A trained adapter for using Gemma-3-1b as the text encoder for Rouwei 0.8

Example image:

white haired fox girl in intricate aqua dress riding a giant black cat that flies through the night sky, best quality, by kantoku


Update v0.1:

A new version of the gemma-3-1b adapter is available

You need to download and use both the adapter model and the LLM for it to work properly. The LLM can be downloaded as a single GGUF file or as a directory via HF Hub. To run it you need the updated custom nodes. An example workflow is provided; other examples can be found here.

The new version gives better prompt adherence, allows structured prompts that describe the individual features of each character, can create simple comics, and in general has more knowledge about anime-art-related things. It is still not perfect, but it already significantly outperforms the vanilla CLIP encoders, allowing long and detailed prompts without the typical tag bleeding.

A version for t5gemma-2b is available

As a parallel experiment with a slightly different approach, an adapter was trained that utilises the text encoder from t5gemma-2b-2b-ul2 and converts its outputs into conditions for the SDXL unet. It shows pretty good performance considering the short training and the untouched t5gemma encoder. It already outperforms the gemma-3-1b version in terms of knowledge about characters and artist styles, but it is less accurate with complex prompts. To launch it you need a different workflow than for the gemma LLM.

You need an updated set of custom nodes to make it work

Launch instructions, prompting tips and examples updated


What is it:

A drop-in replacement for the SDXL text encoders that utilises the power of an LLM for prompt understanding and condition creation.

Similar in spirit to ELLA, SDXL-T5, and likely others, but this one is focused on anime models and advanced knowledge without censorship.

Why is it important:

SDXL has proven to be a good and flexible model that can generate results with great aesthetics and variety at relatively low compute cost and high speed. But prompt adherence is significantly limited by the use of the CLIP encoders. Also, prompts longer than 75 tokens have to be split, which may distort the original meaning.

Replacing the CLIPs with something newer can potentially improve SDXL's understanding of complex prompts significantly and allow more control while maintaining its existing benefits. Extra inputs like images, coordinates, poses from openpose, individual prompts for each character, etc. can also be implemented and work in synergy with the main prompt.

How does it work:

The text prompt is processed by the LLM, then the hidden states from its last layer are processed by the adapter, which compensates for causal attention and reshapes them into conditions for the SDXL unet.
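
Roughly, in code (a conceptual sketch only: the module structure and layer sizes here are illustrative assumptions, not the actual adapter implementation; 1152 is gemma-3-1b's hidden size, 2048 and 1280 are SDXL's per-token and pooled condition sizes):

```python
import torch
import torch.nn as nn

class LLMToSDXLAdapter(nn.Module):
    """Illustrative adapter: maps last-layer LLM hidden states to SDXL conditions."""

    def __init__(self, llm_dim=1152, cond_dim=2048, pooled_dim=1280, n_layers=2):
        super().__init__()
        # A few bidirectional attention layers let every token "see" the whole
        # prompt, compensating for the LLM's causal (left-to-right) attention.
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.mixer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_cond = nn.Linear(llm_dim, cond_dim)      # per-token cross-attention conditions
        self.to_pooled = nn.Linear(llm_dim, pooled_dim)  # pooled vector (CLIP pooled replacement)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: (batch, tokens, llm_dim), attention_mask: (batch, tokens)
        x = self.mixer(hidden_states, src_key_padding_mask=~attention_mask.bool())
        mask = attention_mask.to(x.dtype).unsqueeze(-1)
        pooled = self.to_pooled((x * mask).sum(1) / mask.sum(1).clamp(min=1))
        return self.to_cond(x), pooled

# Usage sketch: feed the LLM's last hidden layer into the adapter.
# outputs = llm(**tokens, output_hidden_states=True)
# cond, pooled = adapter(outputs.hidden_states[-1], tokens["attention_mask"])
```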

Why Gemma-3:

Just because it is a decent, small model for experiments. It will likely be replaced with qwen-vl or some other model during further development. And do not worry: there is no censoring and none of the refusals that may appear in normal LLM inference. The scheme uses only the hidden states that represent the model's "understanding".

What it can do (now):

First of all: in its current state this is more a proof of concept than a finished product. Considering the training budget, it is a miracle that it works at all.

  • Processing of booru tags, just like you are used to prompting with
  • Processing of natural language prompts, including very short and very long ones of up to 512 tokens (gemma tokeniser); see the token-count sketch after this list
  • Structured prompts with markdown, xml, json, or any other formatting to specify what goes where
  • Any combination of these
  • No tag bleeding, as long as it understands what you're giving it
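
For reference, a quick way to check how many tokens a prompt uses (a minimal sketch, assuming you have access to the gemma-3-1b-it tokenizer or the non-gated mirror mentioned below):

```python
from transformers import AutoTokenizer

# Any gemma-3-1b tokenizer works for counting; the adapter targets prompts up to 512 tokens.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

prompt = "white haired fox girl in intricate aqua dress riding a giant black cat, best quality"
n_tokens = len(tokenizer(prompt)["input_ids"])
print(f"{n_tokens} / 512 tokens")
```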

So it can work as a standard text encoder, but it provides much deeper understanding of long expressions and can hold more conditions without them dissolving into each other.

What it can't do (yet):

  1. May struggle with very complex prompts
  2. Knowledge is inconsistent: it can recognize some very rare character yet mess up a more popular one
  3. Same for styles
  4. Some artist styles may negatively affect prompt understanding, leading to parts of the prompt being ignored
  5. Generate decent-quality text
  6. Emphasis (tag weights like :1.1) and the typical 'spells'

All of these should be solved with further training. Item 1 requires unet training; items 2-4 require training the LLM, because it simply does not know such words and reacts to them too weakly; item 5 just requires more training (and a corresponding dataset) and will be done soon; item 6 requires improvements to the custom nodes and will be added soon.

How to run:

LLM gemma-3-1b encoder

  1. Install the custom nodes for Comfy
  2. Make sure you have updated Transformers to a version that supports gemma-3 and installed the gguf python package in the Comfy venv (a quick sanity-check snippet follows these steps)
  3. Download the adapter and put it into /models/llm_adapters
  4. Download the trained LLM (GGUF or HF) and put it into /models/LLM/ (the whole directory in the case of HF; you need all files from the original model, not only the .safetensors. Create a folder with the model name if it doesn't exist).
  5. Download a Rouwei-0.8 checkpoint (vpred, epsilon, or base-epsilon) if you don't have one yet
  6. Use this workflow as a reference, and feel free to experiment
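
For step 2, this quick check inside the Comfy venv will tell you whether the environment is ready (a sketch; the exact minimum transformers version may differ, gemma-3 support landed around v4.50):

```python
# Verify the environment for the gemma-3-1b encoder.
import transformers
print("transformers", transformers.__version__)

from transformers import Gemma3TextConfig  # ImportError on builds without gemma-3 support
import gguf                                # ImportError if the gguf package is missing
print("gemma-3 config and gguf package are available")
```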

The previous version 0.0alpha utilises the original gemma-3-1b-it (non-gated mirror).

T5gemma-2b encoder

Same steps except:

You can download it from HF Hub with the command:

hf download Minthy/RouWei-Gemma --include "t5gemma-2b-2b-ul2_*" --local-dir "/path/to/comfy/models/LLM"

Unfortunately, GGUF does not support the t5gemma architecture at the moment; this can be updated as soon as support is added.
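
If you prefer Python over the CLI, a roughly equivalent call with huggingface_hub is:

```python
from huggingface_hub import snapshot_download

# Same effect as the `hf download` command above: fetch only the
# t5gemma-2b-2b-ul2_* files from the adapter repo into Comfy's models/LLM dir.
snapshot_download(
    repo_id="Minthy/RouWei-Gemma",
    allow_patterns=["t5gemma-2b-2b-ul2_*"],
    local_dir="/path/to/comfy/models/LLM",
)
```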

Prompting:

The new pipeline allows almost any prompt format and is very flexible (it even supports base64 or multilingual input, though with reduced performance). To get the best results you can stick to the following patterns:

  • Tags only: supported and works fine, but there is no real point in limiting yourself to tags alone
  • Long natural-text prompts: works fine too, unless the complexity is too high for the current state of development. Better to avoid excessive purple prose and meaningless filler
  • Structured prompts: this is where things really start to get interesting. You can use json (like in the ToriiGate examples), xml, or other formats, but the most convenient is Markdown: mainly # headings to separate prompt parts and point at something specific. This works with both tags and NL prompts (a small helper for assembling such prompts programmatically is sketched after this list). For example:
2girls, wakamo (blue archive), izuna (blue archive), richly decorated room, from above, masterpiece.
## Character 1
Wakamo (blue archive), a fox girl with black hair, yellow eyes, and a fox mask, standing on the left wearing a maid outfit. She holds a tray with unworn panties. Her expression is smug and confident; she is proudly presenting the tray.
## Character 2
Izuna (blue archive), a fox girl with brown hair, yellow eyes, and a hair flower, stands on the right. She also wears a maid uniform; she is lifting the hem of her skirt, showing that she wears no panties. blushing, ashamed
  • Any combinations of tags and natural text expressions
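
If you generate prompts from scripts, assembling such Markdown-structured prompts is straightforward; a small sketch (the helper name and section layout are just illustrative, mirroring the example above):

```python
def build_structured_prompt(scene: str, characters: list[str]) -> str:
    """Join a general scene line and per-character blocks into one Markdown prompt."""
    parts = [scene]
    for i, desc in enumerate(characters, start=1):
        parts.append(f"## Character {i}\n{desc}")
    return "\n".join(parts)

prompt = build_structured_prompt(
    "2girls, richly decorated room, from above, masterpiece.",
    [
        "A fox girl with black hair and yellow eyes standing on the left, wearing a maid outfit.",
        "A fox girl with brown hair and yellow eyes standing on the right, blushing.",
    ],
)
print(prompt)
```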

Possible issues:

  • Sometimes the order of tags or words matters, and there can be biases towards specific characters, concepts, and so on
  • Correct spelling is much more important than with the CLIP encoders
  • In some cases artist and style tags give a strong bias that makes it harder to prompt, though this is much better than in the previous version
  • Still in an early experimental state: despite showing outstanding results compared with the default SDXL encoders, it is weak in comparison with new large models like Flux

The current custom nodes do not support prompt weights or the standard 'spells'. Also, (brackets) should be left as-is; no need to add a \.

Other settings and recommendations are the same as for the original RouWei.

Quality tags:

Positive:

masterpiece or best quality. You can use both, but it is unlikely to give an improvement; you can also just omit them. Keep it clean and avoid extra 'magic combinations', since they will likely have a negative effect. They can be placed at the very end.

Negative:

worst quality or low quality. Same as for the positive prompt. Better to keep it clean, only adding things you specifically don't want to appear in this image, not things you dislike in general.

Knowledge:

It knows popular characters, can mimic artist styles, and understands concepts and other things. But all of this is limited by the LLM, which needs to be trained at later stages to capture everything. Some more general things are also limited by the current dataset, which consists of anime pictures, and by the unet's abilities.

Compatibility:

Designed to work with Rouwei; it should also work with its merges and tunes. It may have limited compatibility with Illustrious models, Noobai, and other SDXL checkpoints.

Near future plans:

  • More studies and comparisons to determine the most promising option to train together with the unet
  • Emphasis for custom nodes
  • Training code
  • ...

Training budget:

3 liters of beer, 0.5 liters of coffee, and a few days (now 2 weeks) on a 3x5090 rig.

I'm willing to help/cooperate:

Join the Discord server, where you can share your thoughts, give proposals, make requests, etc. Write to me directly here, on civitai, or DM me on Discord.

Donations:

BTC bc1qwv83ggq8rvv07uk6dv4njs0j3yygj3aax4wg6c

ETH/USDT(e) 0x04C8a749F49aE8a56CB84cF0C99CD9E92eDB17db

XMR 47F7JAyKP8tMBtzwxpoZsUVB8wzg2VrbtDKBice9FAS1FikbHEXXPof4PAb42CQ5ch8p8Hs4RvJuzPHDtaVSdQzD6ZbA5TZ

Thanks:

NeuroSenko (code), Rimuru (idea, discussions), Lord (testing), DraconicDragon (fixes, testing)

Also many thanks to those who supported me before:

A number of anonymous persons, Bakariso, dga, Fi., ello, K., LOL2024, NeuroSenko, OpenRoot-Compute, rred, Soviet Cat, Sv1., T., TekeshiX

License:

This repo contains original or finetuned versions of the models google/t5gemma-2b-2b-ul2 and google/gemma-3-1b-it. Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms.

MIT license for adapter models.
