LeanQuant committed
Commit 729204d · verified · 1 Parent(s): d12e85f

Add files using upload-large-folder tool

Files changed (3)
  1. README.md +147 -0
  2. config.json +28 -0
  3. diffusion_pytorch_model.safetensors +3 -0
README.md ADDED
---
base_model:
- Qwen/Qwen-Image-Edit
base_model_relation: quantized
tags:
- dfloat11
- df11
- lossless compression
- 70% size, 100% accuracy
pipeline_tag: image-to-image
---

# DFloat11 Compressed Model: `Qwen/Qwen-Image-Edit`

This is a **DFloat11 losslessly compressed** version of the original `Qwen/Qwen-Image-Edit` model. It reduces model size by **32%** compared to the original BFloat16 model, while maintaining **bit-identical outputs** and supporting **efficient GPU inference**.

🔥🔥🔥 Thanks to DFloat11 compression, Qwen-Image-Edit can now run on **a single 32GB GPU**, or on **a single 24GB GPU with CPU offloading**, while maintaining full model quality. 🔥🔥🔥

### 📊 Performance Comparison

| Model | Model Size | Peak GPU Memory | Generation Time (A100 GPU) |
|---|---|---|---|
| Qwen-Image-Edit (BFloat16) | ~41 GB | OOM | - |
| Qwen-Image-Edit (DFloat11) | 28.43 GB | 30.11 GB | 280 seconds |
| Qwen-Image-Edit (DFloat11 + CPU Offloading) | 28.43 GB | 22.71 GB | 570 seconds |

### 🔧 How to Use

1. Install or upgrade the DFloat11 pip package *(installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed)*:

```bash
pip install -U dfloat11[cuda12]
```

2. Install or upgrade diffusers:

```bash
pip install git+https://github.com/huggingface/diffusers
```

3. Save the following code to a Python file `qwen_image_edit.py`:

```python
import argparse
import torch
from diffusers.utils import load_image
from diffusers import QwenImageTransformer2DModel, QwenImageEditPipeline
from transformers.modeling_utils import no_init_weights
from dfloat11 import DFloat11Model

def parse_args():
    parser = argparse.ArgumentParser(description='Edit images using Qwen-Image-Edit model')
    parser.add_argument('--cpu_offload', action='store_true', help='Enable CPU offloading')
    parser.add_argument('--cpu_offload_blocks', type=int, default=16, help='Number of transformer blocks to offload to CPU')
    parser.add_argument('--no_pin_memory', action='store_true', help='Disable memory pinning')
    parser.add_argument('--image', type=str, default="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png",
                        help='Path to input image or URL')
    parser.add_argument('--prompt', type=str, default='Add a hat to the cat.',
                        help='Text prompt for image editing')
    parser.add_argument('--negative_prompt', type=str, default=' ',
                        help='Negative prompt for image editing')
    parser.add_argument('--num_inference_steps', type=int, default=50,
                        help='Number of denoising steps')
    parser.add_argument('--true_cfg_scale', type=float, default=4.0,
                        help='Classifier free guidance scale')
    parser.add_argument('--seed', type=int, default=42,
                        help='Random seed for generation')
    parser.add_argument('--output', type=str, default='qwen_image_edit.png',
                        help='Output image path')
    return parser.parse_args()

args = parse_args()
model_id = "Qwen/Qwen-Image-Edit"

with no_init_weights():
    transformer = QwenImageTransformer2DModel.from_config(
        QwenImageTransformer2DModel.load_config(
            model_id, subfolder="transformer",
        ),
    ).to(torch.bfloat16)

DFloat11Model.from_pretrained(
    "DFloat11/Qwen-Image-Edit-DF11",
    device="cpu",
    cpu_offload=args.cpu_offload,
    cpu_offload_blocks=args.cpu_offload_blocks,
    pin_memory=not args.no_pin_memory,
    bfloat16_model=transformer,
)

pipeline = QwenImageEditPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16,
)
pipeline.enable_model_cpu_offload()
pipeline.set_progress_bar_config(disable=None)

image = load_image(args.image)
inputs = {
    "image": image,
    "prompt": args.prompt,
    "generator": torch.manual_seed(args.seed),
    "true_cfg_scale": args.true_cfg_scale,
    "negative_prompt": args.negative_prompt,
    "num_inference_steps": args.num_inference_steps,
}

with torch.inference_mode():
    output = pipeline(**inputs)
    output_image = output.images[0]
    output_image.save(args.output)

max_gpu_memory = torch.cuda.max_memory_allocated()
print(f"Max GPU memory allocated: {max_gpu_memory / 1000 ** 3:.2f} GB")
```

4. To run without CPU offloading (32GB VRAM required):
```bash
python qwen_image_edit.py
```

To run with CPU offloading (24GB VRAM required, 50GB CPU RAM required):
```bash
python qwen_image_edit.py --cpu_offload
```

If you run into out-of-memory errors (on the CPU or the GPU), adjust the number of offloaded blocks or disable memory pinning:
```bash
# Offload only 12 blocks (offloading more blocks uses less GPU memory but more CPU memory; offloading fewer blocks is faster):
python qwen_image_edit.py --cpu_offload --cpu_offload_blocks 12

# Disable memory pinning (the most memory-efficient option, but can be slower):
python qwen_image_edit.py --cpu_offload --cpu_offload_blocks 60 --no_pin_memory
```

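If you are unsure how much GPU memory is available before choosing one of the modes above, a quick check with standard NVIDIA tooling (not part of DFloat11) is:

```bash
# Print the name, total VRAM, and currently free VRAM of each GPU (values in MiB).
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
```
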
### 🔍 How It Works

We apply **Huffman coding** to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.

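As a rough illustration of why the exponents compress so well, the minimal PyTorch sketch below (a measurement only, not the DFloat11 kernel) estimates the empirical entropy of the exponent bits of a BFloat16 weight tensor; for typical weight matrices the result comes out far below the 8 bits they occupy:

```python
import torch

def exponent_entropy(weights: torch.Tensor) -> float:
    # Reinterpret the BFloat16 values as raw 16-bit integers:
    # 1 sign bit, 8 exponent bits, 7 mantissa bits.
    raw = weights.to(torch.bfloat16).view(torch.int16).to(torch.int32) & 0xFFFF
    exponents = (raw >> 7) & 0xFF  # keep only the 8 exponent bits
    counts = torch.bincount(exponents.flatten(), minlength=256).float()
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * probs.log2()).sum())  # Shannon entropy in bits per value

# Gaussian-initialized stand-in for a weight matrix (real checkpoints behave similarly).
w = torch.randn(4096, 4096) * 0.02
print(f"exponent entropy: {exponent_entropy(w):.2f} bits out of 8")
```
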
The result is a model that is **~32% smaller**, delivers **bit-identical outputs**, and achieves performance **comparable to the original** BFloat16 model.

Learn more in our [research paper](https://arxiv.org/abs/2504.11651).

### 📄 Learn More

* **Paper**: [70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float](https://arxiv.org/abs/2504.11651)
* **GitHub**: [https://github.com/LeanModels/DFloat11](https://github.com/LeanModels/DFloat11)
* **HuggingFace**: [https://huggingface.co/DFloat11](https://huggingface.co/DFloat11)

config.json ADDED
```json
{
  "dfloat11_config": {
    "bytes_per_thread": 8,
    "pattern_dict": {
      "transformer_blocks\\.\\d+": [
        "img_mod.1",
        "attn.to_q",
        "attn.to_k",
        "attn.to_v",
        "attn.add_k_proj",
        "attn.add_v_proj",
        "attn.add_q_proj",
        "attn.to_out.0",
        "attn.to_add_out",
        "img_mlp.net.0.proj",
        "img_mlp.net.2",
        "txt_mod.1",
        "txt_mlp.net.0.proj",
        "txt_mlp.net.2"
      ]
    },
    "threads_per_block": [
      512
    ],
    "version": "0.3.2"
  },
  "model_type": "qwen2_5_vl"
}
```

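For reference, the `pattern_dict` above pairs a regular expression over the repeated transformer blocks with the linear submodules inside each block whose weights are stored in DFloat11 form. The snippet below is a hedged sketch of that interpretation (the exact matching logic lives inside the `dfloat11` loader; this is an illustration, not its documented API):

```python
import re

# Subset of the pattern_dict from the config.json above.
pattern_dict = {
    r"transformer_blocks\.\d+": ["attn.to_q", "attn.to_k", "attn.to_v", "img_mlp.net.2"],
}

module_names = [
    "transformer_blocks.0.attn.to_q",              # matched -> compressed
    "transformer_blocks.11.img_mlp.net.2",         # matched -> compressed
    "time_text_embed.timestep_embedder.linear_1",  # hypothetical name, not matched
]

for name in module_names:
    compressed = any(
        re.fullmatch(rf"{prefix}\.{re.escape(suffix)}", name)
        for prefix, suffixes in pattern_dict.items()
        for suffix in suffixes
    )
    print(f"{name}: {'DFloat11-compressed' if compressed else 'plain BFloat16'}")
```
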
diffusion_pytorch_model.safetensors ADDED
```
version https://git-lfs.github.com/spec/v1
oid sha256:0d77ed9467b509c793a70a85be2186daece79b3c5ef86ec66016a880835f420a
size 28430817772
```
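
For anyone fetching the weights manually rather than through `from_pretrained`, the downloaded file can be checked against the Git LFS pointer above with standard shell tools (a quick sketch, assuming a Unix-like environment):

```bash
# The checksum should match the oid and the byte count should match the size above.
sha256sum diffusion_pytorch_model.safetensors
wc -c < diffusion_pytorch_model.safetensors
```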