RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space
Abstract
RepFusion uses a multimodal large language model as an encoder of noisy visual representations, and its outputs condition a diffusion transformer for text-to-image generation. At similar inference budgets, it outperforms baselines that devote comparable capacity to newly initialized denoisers.
Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that, by conditioning on evolving noisy representations, test-time compute can be productively spent on repeated MLLM conditioning in modern T2I systems.
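The control flow described in the abstract, re-encoding the evolving noisy representation with the MLLM and feeding the result to a denoiser as conditioning, can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions: the abstract does not specify the sampler, the update rule, or how often the MLLM is re-invoked, so the Euler-style update, the `refresh_every` parameter, and the functions `toy_mllm_encode`, `toy_denoiser`, and `sample` are all hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def toy_mllm_encode(text_emb, noisy_rep, proj):
    """Hypothetical stand-in for the MLLM acting as a noisy representation encoder.

    Projects the noisy visual tokens (a rough analogue of the MLP projector)
    and mixes in a summary of the text tokens to form conditioning tokens.
    """
    visual_tokens = np.tanh(noisy_rep @ proj)      # (n_tokens, dim)
    return visual_tokens + text_emb.mean(axis=0)   # text summary broadcast over tokens

def toy_denoiser(noisy_rep, cond, t):
    """Hypothetical stand-in for the diffusion transformer.

    Returns an update direction that pulls the sample toward the conditioning
    tokens; t is accepted only to mirror a timestep-conditioned interface.
    """
    return cond - noisy_rep

def sample(text_emb, proj, n_tokens=4, dim=8, steps=8, refresh_every=2, seed=0):
    """Assumed Euler-style loop that refreshes MLLM conditioning every few steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_tokens, dim))       # start from pure noise
    dt = 1.0 / steps
    cond, mllm_calls = None, 0
    for i in range(steps):
        if i % refresh_every == 0:
            # Re-encode the current (still noisy) representation with the MLLM.
            cond = toy_mllm_encode(text_emb, x, proj)
            mllm_calls += 1
        x = x + dt * toy_denoiser(x, cond, i * dt)
    return x, mllm_calls

# Usage: a smaller refresh_every spends more test-time compute on MLLM calls.
rng = np.random.default_rng(1)
text_emb = rng.standard_normal((3, 8))             # 3 hypothetical text tokens
proj = 0.1 * rng.standard_normal((8, 8))           # hypothetical projector weights
rep, calls = sample(text_emb, proj, steps=8, refresh_every=2)
print(rep.shape, calls)
```

In this sketch, `refresh_every` is the knob that trades MLLM calls against denoiser-only steps, which is one plausible reading of "repeated MLLM conditioning" as a place to spend test-time compute.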
Community
RepFusion repurposes a frozen multimodal LLM as a noisy latent encoder for text-to-image generation, providing strong denoising priors in representation space and enabling test-time scaling via repeated MLLM conditioning.