---
base_model:
  - Qwen/Qwen-Image
base_model_relation: quantized
tags:
- dfloat11
- df11
- lossless compression
- 70% size, 100% accuracy
---

# DFloat11 Compressed Model: `Qwen/Qwen-Image`

This is a **DFloat11 losslessly compressed** version of the original `Qwen/Qwen-Image` model. It reduces model size by **32%** compared to the original BFloat16 model, while maintaining **bit-identical outputs** and supporting **efficient GPU inference**.

🔥🔥🔥 Thanks to DFloat11 compression, Qwen-Image can now run on **a single 32GB GPU**, or on **a single 16GB GPU with CPU offloading**, while maintaining full model quality. 🔥🔥🔥

### 📊 Performance Comparison

| Model                                     | Model Size | Peak GPU Memory (1328x1328 image generation) | Generation Time (A100 GPU) |
|-------------------------------------------|------------|----------------------------------------------|----------------------------|
| Qwen-Image (BFloat16)                     | ~41 GB     | OOM                                          | -                          |
| Qwen-Image (DFloat11)                     | 28.42 GB   | 29.74 GB                                     | 100 seconds                |
| Qwen-Image (DFloat11 + GPU Offloading)    | 28.42 GB   | 16.68 GB                                     | 260 seconds                |

### 🔧 How to Use

1. Install or upgrade the DFloat11 pip package *(this installs the CUDA kernel automatically; it requires a CUDA-compatible GPU and an existing PyTorch installation)*:

    ```bash
    pip install -U "dfloat11[cuda12]"
    ```

2. Install or upgrade diffusers:

    ```bash
    pip install git+https://github.com/huggingface/diffusers
    ```

3. Save the following code as `qwen_image.py`:

    ```python
    from diffusers import DiffusionPipeline, QwenImageTransformer2DModel
    import torch
    from transformers.modeling_utils import no_init_weights
    from dfloat11 import DFloat11Model
    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description='Generate images using Qwen-Image model')
        parser.add_argument('--cpu_offload', action='store_true', help='Enable CPU offloading')
        parser.add_argument('--cpu_offload_blocks', type=int, default=None, help='Number of transformer blocks to offload to CPU')
        parser.add_argument('--no_pin_memory', action='store_true', help='Disable memory pinning')
        parser.add_argument('--prompt', type=str, default='A coffee shop entrance features a chalkboard sign reading "Qwen Coffee 😊 $2 per cup," with a neon light beside it displaying "通义千问". Next to it hangs a poster showing a beautiful Chinese woman, and beneath the poster is written "π≈3.1415926-53589793-23846264-33832795-02384197".',
                            help='Text prompt for image generation')
        parser.add_argument('--negative_prompt', type=str, default=' ',
                            help='Negative prompt for image generation')
        parser.add_argument('--aspect_ratio', type=str, default='16:9', choices=['1:1', '16:9', '9:16', '4:3', '3:4'],
                            help='Aspect ratio of generated image')
        parser.add_argument('--num_inference_steps', type=int, default=50,
                            help='Number of denoising steps')
        parser.add_argument('--true_cfg_scale', type=float, default=4.0,
                            help='Classifier free guidance scale')
        parser.add_argument('--seed', type=int, default=42,
                            help='Random seed for generation')
        parser.add_argument('--output', type=str, default='example.png',
                            help='Output image path')
        parser.add_argument('--language', type=str, default='en', choices=['en', 'zh'],
                            help='Language for positive magic prompt')
        return parser.parse_args()

    args = parse_args()

    model_name = "Qwen/Qwen-Image"

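    # Build the transformer architecture without materializing random weights;
    # the DFloat11-compressed weights are loaded into it below.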
    with no_init_weights():
        transformer = QwenImageTransformer2DModel.from_config(
            QwenImageTransformer2DModel.load_config(
                model_name, subfolder="transformer",
            ),
        ).to(torch.bfloat16)

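    # Inject the losslessly compressed weights into the bfloat16 transformer.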
    DFloat11Model.from_pretrained(
        "DFloat11/Qwen-Image-DF11",
        device="cpu",
        cpu_offload=args.cpu_offload,
        cpu_offload_blocks=args.cpu_offload_blocks,
        pin_memory=not args.no_pin_memory,
        bfloat16_model=transformer,
    )

    pipe = DiffusionPipeline.from_pretrained(
        model_name,
        transformer=transformer,
        torch_dtype=torch.bfloat16,
    )
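    # Keep pipeline components on the CPU and move each one to the GPU only while it runs.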
    pipe.enable_model_cpu_offload()

    positive_magic = {
        "en": "Ultra HD, 4K, cinematic composition.",  # appended to English prompts
        "zh": "超清,4K,电影级构图",  # appended to Chinese prompts
    }

    # Map each supported aspect ratio to its output resolution (width, height)
    aspect_ratios = {
        "1:1": (1328, 1328),
        "16:9": (1664, 928),
        "9:16": (928, 1664),
        "4:3": (1472, 1140),
        "3:4": (1140, 1472),
    }

    width, height = aspect_ratios[args.aspect_ratio]

    image = pipe(
        prompt=args.prompt + positive_magic[args.language],
        negative_prompt=args.negative_prompt,
        width=width,
        height=height,
        num_inference_steps=args.num_inference_steps,
        true_cfg_scale=args.true_cfg_scale,
        generator=torch.Generator(device="cuda").manual_seed(args.seed)
    ).images[0]

    image.save(args.output)

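    # Report the peak GPU memory allocated during generation.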
    max_memory = torch.cuda.max_memory_allocated()
    print(f"Max memory: {max_memory / (1000 ** 3):.2f} GB")
    ```

4. To run without CPU offloading (32 GB of VRAM required):
    ```bash
    python qwen_image.py
    ```

    To run with CPU offloading (16 GB of VRAM required):
    ```bash
    python qwen_image.py --cpu_offload
    ```

    If you are getting out-of-CPU-memory errors, try limiting the number of offloaded blocks or disabling memory pinning:
    ```bash
    # Offload only 16 blocks (offloading more blocks uses less GPU memory but more CPU memory; offloading fewer blocks is faster):
    python qwen_image.py --cpu_offload --cpu_offload_blocks 16

    # Disable memory pinning (the most CPU-memory-efficient option, but may be slower):
    python qwen_image.py --cpu_offload --no_pin_memory
    ```


### 🔍 How It Works

We apply **Huffman coding** to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.
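
The compressibility claim is easy to sanity-check. The sketch below is our own illustration, not part of the DFloat11 package (the helper `exponent_entropy_bits` is a name made up for this example): it estimates the Shannon entropy of the 8-bit exponent field of a bfloat16 tensor.

```python
import torch

def exponent_entropy_bits(weights: torch.Tensor) -> float:
    """Empirical Shannon entropy (in bits) of the bfloat16 exponent field."""
    # Reinterpret the bfloat16 bits as integers: 1 sign bit, 8 exponent bits, 7 mantissa bits.
    bits = weights.to(torch.bfloat16).contiguous().view(torch.int16).to(torch.int32) & 0xFFFF
    exponents = (bits >> 7) & 0xFF
    counts = torch.bincount(exponents.flatten().long(), minlength=256).float()
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * probs.log2()).sum())

# Stand-in for a trained weight matrix; actual model weights concentrate similarly
# (the paper reports ~2.6 bits of entropy for the exponents).
w = torch.randn(1024, 1024) * 0.02
print(f"{exponent_entropy_bits(w):.2f} bits of entropy in the 8-bit exponent field")
```

An entropy far below 8 bits is what makes Huffman coding effective here: frequent exponent values receive short codes, and the weights can be decompressed on the fly during inference.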

The result is a model that is **~32% smaller**, delivers **bit-identical outputs**, and achieves performance **comparable to the original** BFloat16 model.

Learn more in our [research paper](https://arxiv.org/abs/2504.11651).

### 📄 Learn More

* **Paper**: [70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float](https://arxiv.org/abs/2504.11651)
* **GitHub**: [https://github.com/LeanModels/DFloat11](https://github.com/LeanModels/DFloat11)
* **HuggingFace**: [https://huggingface.co/DFloat11](https://huggingface.co/DFloat11)