arxiv:2511.22699

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Published on Nov 27, 2025

· Submitted by

Zhen Li on Dec 1, 2025

#1 Paper of the day

Tongyi-MAI

Upvote

224

Authors:

Huanqia Cai ,

Sihan Cao ,

Ruoyi Du ,

Dengyang Jiang ,

Xin Jin ,

Zhen Li ,

Zhong-Yu Li ,

Junhan Shi ,

Qilong Wu ,

Chi Zhang ,

Shilin Zhou

Abstract

Z-Image, a 6B-parameter Scalable Single-Stream Diffusion Transformer (S3-DiT) model, achieves high-performance image generation with reduced computational cost, offering sub-second inference and compatibility with consumer hardware.

AI-generated summary

The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference, and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.

View arXiv page View PDF Project page GitHub 8.65k Add to collection

Community

Paper99

Paper author Paper submitter Dec 1, 2025

•

edited 22 days ago

GitHub: https://github.com/Tongyi-MAI/Z-Image
ModelScope: https://modelscope.ai/models/Tongyi-MAI/Z-Image-Turbo/summary
HuggingFace: https://huggingface.co/Tongyi-MAI/Z-Image-Turbo
Z-Image gallery : https://modelscope.cn/studios/Tongyi-MAI/Z-Image-Gallery
ComfyUI: https://huggingface.co/Comfy-Org/z_image_turbo

librarian-bot

Dec 2, 2025

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

PlayAI

Dec 6, 2025

•

edited Dec 6, 2025

Does Z-Image support embedding/textual-inversion–style adapters? If not, why?

Hi, and thanks for the work on Z-Image — I’ve been having a lot of fun with it.

I wanted to clarify something about adapter support:

Does Z-Image currently support embedding / textual-inversion–style adapters (i.e., token-based adapters), or does it only support LoRA-style adapters at the moment?

I'm trying to understand which adapter types the Z-Image architecture can make use of now, and which types might be possible in the future.

Example Scenario

If I train a new token embedding in Qwen3-4B for a novel concept — for example <glimmerwolf> — using text like:

A glimmerwolf is a luminous wolf-like creature with crystalline fur that glows softly in the dark, similar to bioluminescent jellyfish or frosted crystal. Glimmerwolves behave like normal wolves but leave shimmering mist trails as they move.

Does Z-Image only receive the final learned embedding vector for <glimmerwolf> at inference time, or does the diffusion model benefit in any way from the semantic components (e.g., wolf + glow + crystal + mist) that shaped that embedding during training?

In other words:

Is Z-Image conditioned solely on the resulting vector produced by Qwen3-4B, or does the model inherit any of the conceptual decomposition used during embedding training?

Thanks!

您好，非常感谢 Z-Image 项目的出色工作——我一直在使用它，体验非常愉快。

我想澄清关于适配器（adapter）支持方面的一些问题：

Z-Image 目前是否支持 embedding / textual-inversion（文本反演）类型的适配器（即基于 token 的适配器），还是目前仅支持 LoRA 类型的适配器？

我希望了解 Z-Image 架构目前能够使用哪些适配器类型，以及未来可能支持哪些类型。

示例场景

如果我在 Qwen3-4B 中为一个全新的概念训练一个新的 token embedding，例如 <glimmerwolf>，并使用如下文本进行训练：

Glimmerwolf（微光狼）是一种具有发光特性的狼状生物，它的晶体状皮毛会在黑暗中柔和地发光，类似于发光水母或磨砂晶体。微光狼的行为类似普通狼，但在移动时会留下闪烁的雾状轨迹。

那么在推理阶段：

Z-Image 是否只会接收到 <glimmerwolf> 最终训练得到的 embedding 向量？
还是说扩散模型也会从训练该 embedding 时所使用的语义成分中受益？
（例如：狼 + 发光特性 + 晶体材质 + 雾效）

换句话说：

Z-Image 是仅根据 Qwen3-4B 输出的最终向量进行条件化，还是会继承 embedding 训练过程中形成的语义分解信息？

谢谢！

James070707

25 days ago

I found this paper on the website interesting it explains a new way to build image generation models that work fast and could help make tools that run on normal computers

avahal

20 days ago

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/z-image-an-efficient-image-generation-foundation-model-with-single-stream-diffusion-transformer-9846-b5faf99f

Key Findings
Executive Summary
Detailed Breakdown
Practical Applications

mishig

about 14 hours ago

Based on the paper and the search results, I can show you the main figures from the Z-Image paper. Here are the key figures that showcase the model's capabilities:

Figure 1: Photorealistic Image Generation

This figure demonstrates Z-Image-Turbo's exceptional photorealistic capabilities across various scenarios:

Character portraits with detailed skin textures and authentic emotions
Sports photography with dynamic action and realistic details
Landscapes and scenes with natural lighting and composition
Cultural contexts showing accurate representation of traditional settings

Figure 2: Bilingual Text Rendering

This figure highlights Z-Image's outstanding text rendering abilities in both Chinese and English:

Complex Chinese text in traditional calligraphy and modern designs
English text with perfect typography in various contexts
Poster designs with accurate text integration and aesthetic layouts
Mixed bilingual content showing seamless handling of both languages

Figure 3: Image Editing Capabilities

This figure showcases Z-Image-Edit's instruction-following editing capabilities:

Multi-instruction edits combining multiple changes in single operations
Precise object manipulation with accurate transformations
Background changes while maintaining subject consistency
Text editing with location-based constraints

Figure 4: Model Comparison

This comparative figure shows Z-Image-Turbo's performance against leading models including:

Qwen-Image, Hunyuan-Image-3.0, FLUX.2 (open-source)
Nano Banana Pro, Seedream 4.0, Imagen 4 (closed-source)
Demonstrates superior photorealistic generation capacity across diverse prompts

Figure 5: Active Curation Engine

This architectural diagram illustrates the data infrastructure:

Cross-modal embedding and deduplication pipeline
Rule-based filtering for quality control
Z-Image model feedback loop for continuous improvement
Dynamic data sampling to address long-tail distribution challenges

Key Performance Metrics from Figures

The paper's evaluation results show:

8th overall rank in Artificial Analysis Image Arena
#1 among open-source models
87.4% "Good+Same" rate vs FLUX 2 dev in user studies
Best text rendering on CVTG-2K benchmark (0.8671 accuracy)
Top performance on GenEval for object generation (0.84 score)

These figures collectively demonstrate that Z-Image achieves state-of-the-art results with only 6B parameters and 8-step inference, making it significantly more efficient than competitors requiring 20-80B parameters and 100+ steps.