multimodalart posted an update Mar 5, 2024
The Stable Diffusion 3 research paper broken down, including some overlooked details! 📝

Model
📏 2 base model variants mentioned: 2B and 8B sizes

📐 New architecture in all abstraction levels:
- 🔽 UNet out; ⬆️ Multimodal Diffusion Transformer (MM-DiT) in, bye cross-attention 👋
- 🆕 Rectified flows for the diffusion process (see the sketch after this list)
- 🧩 Still a Latent Diffusion Model
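
For the curious, here's a minimal PyTorch sketch of what a rectified-flow training step looks like. This is my own illustration of the idea, not SD3's code: `model` is assumed to be any velocity-predicting denoiser, and the logit-normal timestep draw is the "sample more in the middle" trick from the paper.

```python
import torch

def rectified_flow_loss(model, x0, cond):
    """Sketch of a rectified-flow (straight-line noising) training step.

    x0:   clean latents, shape (B, C, H, W)
    cond: text conditioning, passed straight through to the (assumed) model
    """
    b = x0.shape[0]
    noise = torch.randn_like(x0)

    # Timestep sampling: SD3 biases t toward the middle of (0, 1) with a
    # logit-normal distribution instead of sampling uniformly.
    t = torch.sigmoid(torch.randn(b, device=x0.device))
    t_ = t.view(b, 1, 1, 1)

    # Rectified flow: the forward process is a straight line from data to noise.
    x_t = (1.0 - t_) * x0 + t_ * noise

    # The network is trained to predict the constant velocity along that line.
    v_target = noise - x0
    v_pred = model(x_t, t, cond)

    return torch.nn.functional.mse_loss(v_pred, v_target)
```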

📄 3 text encoders: 2 CLIPs and one T5-XXL; plug-and-play: dropping the larger one at inference keeps the model competitive (a rough sketch of how they're combined follows)
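
Since the three-encoder setup is an often-asked detail, here's a hedged sketch of how the CLIP and T5 outputs appear to be combined before reaching the transformer. Dimensions follow the paper; the function and variable names are mine, not from any released code.

```python
import torch

def combine_text_embeddings(clip_l_seq, clip_g_seq, t5_seq,
                            clip_l_pooled, clip_g_pooled):
    """Sketch of fusing the three text encoders (illustrative names).

    clip_l_seq:    (B, 77, 768)   CLIP-L penultimate hidden states
    clip_g_seq:    (B, 77, 1280)  OpenCLIP-bigG penultimate hidden states
    t5_seq:        (B, 77, 4096)  T5-XXL token embeddings
    clip_*_pooled: (B, 768) / (B, 1280) pooled outputs
    """
    # Channel-concatenate the two CLIP sequences (768 + 1280 = 2048),
    # then zero-pad up to the T5 width (4096) so they share one stream.
    clip_seq = torch.cat([clip_l_seq, clip_g_seq], dim=-1)
    clip_seq = torch.nn.functional.pad(
        clip_seq, (0, t5_seq.shape[-1] - clip_seq.shape[-1]))

    # Concatenate along the sequence axis: this is the context the model attends over.
    context = torch.cat([clip_seq, t5_seq], dim=1)              # (B, 154, 4096)

    # Pooled CLIP vectors form a global conditioning vector (used with the timestep).
    pooled = torch.cat([clip_l_pooled, clip_g_pooled], dim=-1)  # (B, 2048)
    return context, pooled
```

Because the pooled CLIP vector stays intact either way, dropping the T5 stream at inference mostly costs typography and complex prompt following rather than overall quality, which is what makes the setup "plug-and-play".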

🗃️ The dataset was deduplicated with SSCD, which helped reduce memorization (no further details about the dataset tho)

Variants
🔁 A DPO fine-tuned model showed great improvement in prompt understanding and aesthetics
✏️ An Instruct-Edit 2B model was trained and learned how to do text replacement

Results
✅ State of the art in automated evals for composition and prompt understanding
✅ Best win rate in human preference evaluation for prompt understanding, aesthetics and typography (missing some details on the number of participants and the design of the experiment)

Paper: https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf

Thanks for the breakdown.

...Agreed, thank you!!!

Interesting, it seems like the novel things they added to SD 3 are really just:

  1. Changing the scheduling (≈ a linear / rectified flow)
  2. Sampling timesteps more from the middle of the time range
  3. (novel) MM-DiT, which just splits the post-attention activation into 2 MLPs (one for text, one for image, though they both self-attend jointly as they're concatenated together for attention); see the sketch below this list
  4. Combining the 3 text embeddings depending on the complexity of the prompt (could a decision-router model be used to determine how many embeddings to use?)
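
To make point 3 concrete, here's a stripped-down sketch of that block structure. It's my own simplification (single attention head, no AdaLN modulation or norms, made-up class name), not the actual implementation, but it shows the per-modality weights plus joint attention.

```python
import torch
import torch.nn as nn

class MMDiTBlockSketch(nn.Module):
    """Stripped-down sketch of a joint text/image MM-DiT block:
    separate weights per modality, one joint self-attention."""

    def __init__(self, dim: int):
        super().__init__()
        # Two parallel parameter sets, one per modality...
        self.txt_qkv = nn.Linear(dim, 3 * dim)
        self.img_qkv = nn.Linear(dim, 3 * dim)
        self.txt_out = nn.Linear(dim, dim)
        self.img_out = nn.Linear(dim, dim)
        self.txt_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.img_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, txt: torch.Tensor, img: torch.Tensor):
        n_txt = txt.shape[1]

        # ...but one joint self-attention over the concatenated token streams,
        # which is what replaces the UNet's cross-attention.
        qkv = torch.cat([self.txt_qkv(txt), self.img_qkv(img)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1) @ v

        # Split the activations back per modality and run the two separate MLPs.
        txt = txt + self.txt_out(attn[:, :n_txt])
        img = img + self.img_out(attn[:, n_txt:])
        return txt + self.txt_mlp(txt), img + self.img_mlp(img)
```

The real block of course adds multiple heads and conditioning on the timestep / pooled text vector, but the text/image weight split plus joint attention is the core idea.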

I honestly thought there'd be more, though the move to a DiT is a big one.