| license: apache-2.0 | |
| library_name: diffusers | |
| pipeline_tag: text-to-image | |
| # Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis | |
| <div align="center"> | |
| <img src="https://github.com/tang-bd/fuse-dit/blob/main/assets/visual.jpg?raw=true" width="95%"/> | |
| </div> | |
| ## Resources | |
| - [arXiv: Paper](https://arxiv.org/pdf/2505.10046) | |
| - [GitHub: Code](https://github.com/tang-bd/fuse-dit) | |
| ## Quick Start | |
| Download the pre-trained model, then run inference with `FuseDiTPipeline` from our codebase (the GitHub repository linked above): | |
| ```python | |
| import torch | |
| from diffusion.pipelines import FuseDiTPipeline | |
| # Load the pipeline from the local directory holding the downloaded model. | |
| pipeline = FuseDiTPipeline.from_pretrained("/path/to/pipeline/").to("cuda") | |
| # Generate one 512x512 image; [0][0] selects the first image of the output. | |
| image = pipeline( | |
|     "your prompt", | |
|     width=512, | |
|     height=512, | |
|     num_inference_steps=25, | |
|     guidance_scale=6.0, | |
|     use_cache=True, | |
| )[0][0] | |
| image.save("test.png") | |
| ``` | |
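The download step is not spelled out above, so here is a minimal sketch of one way to do it, assuming the `huggingface_hub` package is installed. The helper name `resolve_pipeline_dir`, the `checkpoints` folder, and the `repo_id` argument are illustrative assumptions and not part of our codebase; substitute the Hub repository id of this model.

```python
from pathlib import Path


def resolve_pipeline_dir(repo_id: str, local_root: str = "checkpoints") -> Path:
    """Return a local directory holding the checkpoint, downloading it if absent.

    `repo_id` is a placeholder for the Hub repository id of this model;
    the helper itself is hypothetical and only illustrates the download step.
    """
    target = Path(local_root) / repo_id.replace("/", "--")
    if not target.exists():
        # Imported lazily so the helper also works offline once files are local.
        from huggingface_hub import snapshot_download

        snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

The returned path can then be passed as a string to `FuseDiTPipeline.from_pretrained`. The existence check is deliberately simple: if a download is interrupted, delete the folder to force a fresh one.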
| ## Citation | |
| ```bibtex | |
| @article{tang2025exploringdeepfusion, | |
| title={Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis}, | |
| author={Bingda Tang and Boyang Zheng and Xichen Pan and Sayak Paul and Saining Xie}, | |
| year={2025}, | |
| journal={arXiv preprint arXiv:2505.10046}, | |
| } | |
| ``` |