ghunkins committed on
Commit
65a4106
·
verified ·
1 Parent(s): 582076a

Update README.md

  1. README.md +191 -103
README.md CHANGED
@@ -1,198 +1,286 @@
1
  ---
2
- library_name: diffusers
 
3
  ---
 
4
 
5
- # Model Card for Model ID
 
 
6
 
7
- <!-- Provide a quick summary of what the model is/does. -->
 
 
 
 
8
 
 
9
 
10
 
11
- ## Model Details
12
 
13
- ### Model Description
14
 
15
- <!-- Provide a longer summary of what this model is. -->
16
 
17
- This is the model card of a 🧨 diffusers pipeline that has been pushed on the Hub. This model card has been automatically generated.
18
 
19
- - **Developed by:** [More Information Needed]
20
- - **Funded by [optional]:** [More Information Needed]
21
- - **Shared by [optional]:** [More Information Needed]
22
- - **Model type:** [More Information Needed]
23
- - **Language(s) (NLP):** [More Information Needed]
24
- - **License:** [More Information Needed]
25
- - **Finetuned from model [optional]:** [More Information Needed]
26
 
27
- ### Model Sources [optional]
28
 
29
- <!-- Provide the basic links for the model. -->
30
 
31
- - **Repository:** [More Information Needed]
32
- - **Paper [optional]:** [More Information Needed]
33
- - **Demo [optional]:** [More Information Needed]
34
 
35
- ## Uses
36
 
37
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
 
 
 
 
38
 
39
- ### Direct Use
40
 
41
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
42
 
43
- [More Information Needed]
44
 
45
- ### Downstream Use [optional]
 
46
 
47
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
48
 
49
- [More Information Needed]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
50
 
51
- ### Out-of-Scope Use
52
 
53
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
 
 
 
 
 
54
 
55
- [More Information Needed]
 
 
 
 
56
 
57
- ## Bias, Risks, and Limitations
58
 
59
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
60
 
61
- [More Information Needed]
62
 
63
- ### Recommendations
64
 
65
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
 
 
 
 
66
 
67
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
68
 
69
- ## How to Get Started with the Model
 
70
 
71
- Use the code below to get started with the model.
72
 
73
- [More Information Needed]
 
 
 
 
74
 
75
- ## Training Details
 
 
 
 
76
 
77
- ### Training Data
78
 
79
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
80
 
81
- [More Information Needed]
82
 
83
- ### Training Procedure
84
 
85
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
86
 
87
- #### Preprocessing [optional]
88
 
89
- [More Information Needed]
 
 
90
 
 
91
 
92
- #### Training Hyperparameters
93
 
94
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
95
 
96
- #### Speeds, Sizes, Times [optional]
97
 
98
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
99
 
100
- [More Information Needed]
101
 
102
- ## Evaluation
 
 
103
 
104
- <!-- This section describes the evaluation protocols and provides the results. -->
105
 
106
- ### Testing Data, Factors & Metrics
107
 
108
- #### Testing Data
109
 
110
- <!-- This should link to a Dataset Card if possible. -->
 
 
 
 
 
 
 
111
 
112
- [More Information Needed]
113
 
114
- #### Factors
 
 
 
 
115
 
116
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
 
 
117
 
118
- [More Information Needed]
119
 
120
- #### Metrics
 
 
 
 
121
 
122
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
 
 
 
123
 
124
- [More Information Needed]
 
125
 
126
- ### Results
 
 
 
 
 
 
 
 
 
 
 
 
 
127
 
128
- [More Information Needed]
 
 
 
 
129
 
130
- #### Summary
131
 
 
132
 
133
 
134
- ## Model Examination [optional]
 
 
135
 
136
- <!-- Relevant interpretability work for the model goes here -->
 
 
 
 
 
137
 
138
- [More Information Needed]
139
 
140
- ## Environmental Impact
141
 
142
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
143
 
144
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
145
 
146
- - **Hardware Type:** [More Information Needed]
147
- - **Hours used:** [More Information Needed]
148
- - **Cloud Provider:** [More Information Needed]
149
- - **Compute Region:** [More Information Needed]
150
- - **Carbon Emitted:** [More Information Needed]
151
 
152
- ## Technical Specifications [optional]
153
 
154
- ### Model Architecture and Objective
 
 
155
 
156
- [More Information Needed]
157
 
158
- ### Compute Infrastructure
 
 
159
 
160
- [More Information Needed]
161
 
162
- #### Hardware
163
 
164
- [More Information Needed]
 
165
 
166
- #### Software
167
 
168
- [More Information Needed]
 
 
169
 
170
- ## Citation [optional]
171
 
172
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
173
 
174
- **BibTeX:**
 
175
 
176
- [More Information Needed]
177
 
178
- **APA:**
 
 
179
 
180
- [More Information Needed]
 
181
 
182
- ## Glossary [optional]
 
 
 
 
 
 
 
183
 
184
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
 
185
 
186
- [More Information Needed]
187
 
188
- ## More Information [optional]
189
 
190
- [More Information Needed]
191
 
192
- ## Model Card Authors [optional]
193
 
194
- [More Information Needed]
195
 
196
- ## Model Card Contact
197
-
198
- [More Information Needed]
 
1
  ---
2
+ license: apache-2.0
3
+ pipeline_tag: text-to-video
4
  ---
5
+ # Wan2.2 + Lightx2v
6
 
7
+ <p align="center">
8
+ 💜 <a href="https://wan.video"><b>Wan</b></a> &nbsp&nbsp | &nbsp&nbsp 🖥️ <a href="https://github.com/Wan-Video/Wan2.2">GitHub</a> &nbsp&nbsp | &nbsp&nbsp🤗 <a href="https://huggingface.co/Wan-AI/">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/Wan-AI">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://arxiv.org/abs/2503.20314">Technical Report</a> &nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://wan.video/welcome?spm=a2ty_o02.30011076.0.0.6c9ee41eCcluqg">Blog</a> &nbsp&nbsp | &nbsp&nbsp💬 <a href="https://gw.alicdn.com/imgextra/i2/O1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg">WeChat Group</a>&nbsp&nbsp | &nbsp&nbsp 📖 <a href="https://discord.gg/AKNgpMK4Yj">Discord</a>&nbsp&nbsp
9
+ <br>
10
 
11
+ <p align="center">
12
+ 🔗 <a href="https://huggingface.co/lightx2v/Wan2.2-Lightning"><b>Lightx2v</b></a> — Distilled & optimized Wan2.2 for fast, high-quality 480P / 720P image-to-video generation
13
+ </p>
14
+ <br>
15
+ -----
16
 
17
+ [**Wan: Open and Advanced Large-Scale Video Generative Models**](https://arxiv.org/abs/2503.20314) <br>
18
 
19
 
20
+ We are excited to introduce **Wan2.2**, a major upgrade to our foundational video models. With **Wan2.2**, we have focused on incorporating the following innovations:
21
 
22
+ - 👍 **Effective MoE Architecture**: Wan2.2 introduces a Mixture-of-Experts (MoE) architecture into video diffusion models. By splitting the denoising process across timesteps between specialized expert models, it enlarges the overall model capacity while keeping the computational cost the same.
23
 
24
+ - 👍 **Cinematic-level Aesthetics**: Wan2.2 incorporates meticulously curated aesthetic data, complete with detailed labels for lighting, composition, contrast, color tone, and more. This allows for more precise and controllable cinematic style generation, facilitating the creation of videos with customizable aesthetic preferences.
25
 
26
+ - 👍 **Complex Motion Generation**: Compared to Wan2.1, Wan2.2 is trained on significantly more data, with 65.6% more images and 83.2% more videos. This expansion notably enhances the model's generalization across multiple dimensions such as motion, semantics, and aesthetics, achieving top performance among open-source and closed-source models.
27
 
28
+ - 👍 **Efficient High-Definition Hybrid TI2V**: Wan2.2 open-sources a 5B model built with our advanced Wan2.2-VAE, which achieves a compression ratio of **16×16×4**. This model supports both text-to-video and image-to-video generation at 720P resolution and 24fps, and it can run on consumer-grade graphics cards such as the 4090. It is one of the fastest **720P@24fps** models currently available, serving both industrial and academic use.
 
 
 
 
 
 
29
 
 
30
 
31
+ This repository contains our T2V-A14B model, which supports generating 5s videos at both 480P and 720P resolutions. Built with a Mixture-of-Experts (MoE) architecture, it delivers outstanding video generation quality. On our new benchmark Wan-Bench 2.0, the model surpasses leading commercial models across most key evaluation dimensions.
32
 
 
 
 
33
 
34
+ ## Video Demos
35
 
36
+ <div align="center">
37
+ <video width="80%" controls>
38
+ <source src="https://cloud.video.taobao.com/vod/4szTT1B0LqXvJzmuEURfGRA-nllnqN_G2AT0ZWkQXoQ.mp4" type="video/mp4">
39
+ Your browser does not support the video tag.
40
+ </video>
41
+ </div>
42
 
 
43
 
44
+ ## 🔥 Latest News!!
45
 
46
+ * Jul 28, 2025: 👋 We've released the inference code and model weights of **Wan2.2**.
47
 
48
+ ## Community Works
49
+ If your research or project builds upon [**Wan2.1**](https://github.com/Wan-Video/Wan2.1) or Wan2.2, we welcome you to share it with us so we can highlight it for the broader community.
50
 
 
51
 
52
+ ## 📑 Todo List
53
+ - Wan2.2 Text-to-Video
54
+ - [x] Multi-GPU Inference code of the A14B and 14B models
55
+ - [x] Checkpoints of the A14B and 14B models
56
+ - [x] ComfyUI integration
57
+ - [x] Diffusers integration
58
+ - Wan2.2 Image-to-Video
59
+ - [x] Multi-GPU Inference code of the A14B model
60
+ - [x] Checkpoints of the A14B model
61
+ - [x] ComfyUI integration
62
+ - [x] Diffusers integration
63
+ - Wan2.2 Text-Image-to-Video
64
+ - [x] Multi-GPU Inference code of the 5B model
65
+ - [x] Checkpoints of the 5B model
66
+ - [x] ComfyUI integration
67
+ - [x] Diffusers integration
68
 
69
+ ## Run Wan2.2
70
 
71
+ #### Installation
72
+ Clone the repo:
73
+ ```sh
74
+ git clone https://github.com/Wan-Video/Wan2.2.git
75
+ cd Wan2.2
76
+ ```
77
 
78
+ Install dependencies:
79
+ ```sh
80
+ # Ensure torch >= 2.4.0
81
+ pip install -r requirements.txt
82
+ ```
83
 
 
84
 
85
+ #### Model Download
86
 
 
87
 
 
88
 
89
+ | Models | Download Links | Description |
90
+ |--------------------|---------------------------------------------------------------------------------------------------------------------------------------------|-------------|
91
+ | T2V-A14B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B) 🤖 [ModelScope](https://modelscope.cn/models/Wan-AI/Wan2.2-T2V-A14B) | Text-to-Video MoE model, supports 480P & 720P |
92
+ | I2V-A14B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B) 🤖 [ModelScope](https://modelscope.cn/models/Wan-AI/Wan2.2-I2V-A14B) | Image-to-Video MoE model, supports 480P & 720P |
93
+ | TI2V-5B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B) 🤖 [ModelScope](https://modelscope.cn/models/Wan-AI/Wan2.2-TI2V-5B) | High-compression VAE, T2V+I2V, supports 720P |
94
 
 
95
 
96
+ > 💡Note:
97
+ > The TI2V-5B model supports 720P video generation at **24 FPS**.
98
 
 
99
 
100
+ Download models using huggingface-cli:
101
+ ``` sh
102
+ pip install "huggingface_hub[cli]"
103
+ huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./Wan2.2-T2V-A14B
104
+ ```
105
 
106
+ Download models using modelscope-cli:
107
+ ``` sh
108
+ pip install modelscope
109
+ modelscope download Wan-AI/Wan2.2-T2V-A14B --local_dir ./Wan2.2-T2V-A14B
110
+ ```
111
 
112
+ #### Run Text-to-Video Generation
113
 
114
+ This repository supports the `Wan2.2-T2V-A14B` Text-to-Video model, which can generate videos at both 480P and 720P resolutions.
115
 
 
116
 
117
+ ##### (1) Without Prompt Extension
118
 
119
+ To keep things simple, we start with a basic version of the inference process that skips the [prompt extension](#2-using-prompt-extension) step.
120
 
121
+ - Single-GPU inference
122
 
123
+ ``` sh
124
+ python generate.py --task t2v-A14B --size 1280*720 --ckpt_dir ./Wan2.2-T2V-A14B --offload_model True --convert_model_dtype --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
125
+ ```
126
 
127
+ > 💡 This command can run on a GPU with at least 80GB VRAM.
128
 
129
+ > 💡If you encounter OOM (Out-of-Memory) issues, you can use the `--offload_model True`, `--convert_model_dtype` and `--t5_cpu` options to reduce GPU memory usage.
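The memory-saving options above can be toggled from a small launcher script. The following is a minimal sketch, not part of the Wan2.2 code base: the helper name `build_generate_command` and its `low_memory` switch are hypothetical, and it only assembles flags that already appear in the single-GPU command and the OOM note.

```python
import shlex


def build_generate_command(prompt, ckpt_dir="./Wan2.2-T2V-A14B",
                           size="1280*720", low_memory=True):
    """Assemble the argv list for generate.py (hypothetical helper).

    Only flags quoted in this README are used; nothing is executed here.
    """
    cmd = [
        "python", "generate.py",
        "--task", "t2v-A14B",
        "--size", size,
        "--ckpt_dir", ckpt_dir,
    ]
    if low_memory:
        # The three options the README lists for reducing GPU memory usage.
        cmd += ["--offload_model", "True", "--convert_model_dtype", "--t5_cpu"]
    cmd += ["--prompt", prompt]
    return cmd


command = build_generate_command(
    "Two anthropomorphic cats in comfy boxing gear and bright gloves "
    "fight intensely on a spotlighted stage."
)
# shlex.join quotes the prompt so the line can be pasted into a shell.
print(shlex.join(command))
```

Pass the returned list to `subprocess.run(command, check=True)` from the repository root to launch inference; building the list first makes it easy to log the exact command that was used.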
130
 
 
131
 
132
+ - Multi-GPU inference using FSDP + DeepSpeed Ulysses
133
 
134
+ We use [PyTorch FSDP](https://docs.pytorch.org/docs/stable/fsdp.html) and [DeepSpeed Ulysses](https://arxiv.org/abs/2309.14509) to accelerate inference.
135
 
 
136
 
137
+ ``` sh
138
+ torchrun --nproc_per_node=8 generate.py --task t2v-A14B --size 1280*720 --ckpt_dir ./Wan2.2-T2V-A14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
139
+ ```
140
 
 
141
 
142
+ ##### (2) Using Prompt Extension
143
 
144
+ Extending the prompts can effectively enrich the details in the generated videos, further enhancing the video quality. Therefore, we recommend enabling prompt extension. We provide the following two methods for prompt extension:
145
 
146
+ - Use the Dashscope API for extension.
147
+ - Apply for a `dashscope.api_key` in advance ([EN](https://www.alibabacloud.com/help/en/model-studio/getting-started/first-api-call-to-qwen) | [CN](https://help.aliyun.com/zh/model-studio/getting-started/first-api-call-to-qwen)).
148
+ - Configure the environment variable `DASH_API_KEY` to specify the Dashscope API key. For users of Alibaba Cloud's international site, you also need to set the environment variable `DASH_API_URL` to 'https://dashscope-intl.aliyuncs.com/api/v1'. For more detailed instructions, please refer to the [dashscope document](https://www.alibabacloud.com/help/en/model-studio/developer-reference/use-qwen-by-calling-api?spm=a2c63.p38356.0.i1).
149
+ - Use the `qwen-plus` model for text-to-video tasks and `qwen-vl-max` for image-to-video tasks.
150
+ - You can modify the model used for extension with the parameter `--prompt_extend_model`. For example:
151
+ ```sh
152
+ DASH_API_KEY=your_key torchrun --nproc_per_node=8 generate.py --task t2v-A14B --size 1280*720 --ckpt_dir ./Wan2.2-T2V-A14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage" --use_prompt_extend --prompt_extend_method 'dashscope' --prompt_extend_target_lang 'zh'
153
+ ```
154
 
155
+ - Using a local model for extension.
156
 
157
+ - By default, the Qwen model on HuggingFace is used for this extension. Users can choose Qwen models or other models based on the available GPU memory size.
158
+ - For text-to-video tasks, you can use models like `Qwen/Qwen2.5-14B-Instruct`, `Qwen/Qwen2.5-7B-Instruct` and `Qwen/Qwen2.5-3B-Instruct`.
159
+ - For image-to-video tasks, you can use models like `Qwen/Qwen2.5-VL-7B-Instruct` and `Qwen/Qwen2.5-VL-3B-Instruct`.
160
+ - Larger models generally provide better extension results but require more GPU memory.
161
+ - You can modify the model used for extension with the parameter `--prompt_extend_model`, which accepts either a local model path or a Hugging Face model. For example:
162
 
163
+ ``` sh
164
+ torchrun --nproc_per_node=8 generate.py --task t2v-A14B --size 1280*720 --ckpt_dir ./Wan2.2-T2V-A14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage" --use_prompt_extend --prompt_extend_method 'local_qwen' --prompt_extend_target_lang 'zh'
165
+ ```
166
 
167
+ - Running with Diffusers
168
 
169
+ ```py
170
+ import torch
171
+ import numpy as np
172
+ from diffusers import WanPipeline, AutoencoderKLWan
173
+ from diffusers.utils import export_to_video
174
 
175
+ dtype = torch.bfloat16
176
+ device = "cuda"  # default CUDA device; use e.g. "cuda:1" to pick a specific GPU
177
+ vae = AutoencoderKLWan.from_pretrained("Wan-AI/Wan2.2-T2V-A14B-Diffusers", subfolder="vae", torch_dtype=torch.float32)
178
+ pipe = WanPipeline.from_pretrained("Wan-AI/Wan2.2-T2V-A14B-Diffusers", vae=vae, torch_dtype=dtype)
179
+ pipe.to(device)
180
 
181
+ height = 720
182
+ width = 1280
183
 
184
+ prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
185
+ negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
186
+ output = pipe(
187
+ prompt=prompt,
188
+ negative_prompt=negative_prompt,
189
+ height=height,
190
+ width=width,
191
+ num_frames=81,
192
+ guidance_scale=4.0,
193
+ guidance_scale_2=3.0,
194
+ num_inference_steps=40,
195
+ ).frames[0]
196
+ export_to_video(output, "t2v_out.mp4", fps=16)
197
+ ```
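The example above pairs `num_frames=81` with `fps=16`. The sketch below shows how the two relate to clip length. The helper names are hypothetical, and the rule that frame counts take the form `4k + 1` is an assumption made for illustration, not something this README states.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Playback length of an exported clip."""
    return num_frames / fps


def frames_for_duration(seconds: float, fps: int, temporal_stride: int = 4) -> int:
    """Nearest frame count of the form stride * k + 1 for a target duration.

    Assumption: valid frame counts follow a 4k + 1 pattern.
    """
    k = round(seconds * fps / temporal_stride)
    return temporal_stride * k + 1


duration = clip_duration_seconds(81, 16)
frames_16 = frames_for_duration(5, 16)
frames_24 = frames_for_duration(5, 24)
print(duration, frames_16, frames_24)
```

Under these assumptions, a 5-second target gives 81 frames at 16 fps, matching the values used in the pipeline call.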
198
 
199
+ > 💡**Note**: This model requires features that are currently available only in the main branch of diffusers. The latest stable release on PyPI does not yet include these updates.
200
+ > To use this model, please install the library from source:
201
+ > ```
202
+ > pip install git+https://github.com/huggingface/diffusers
203
+ > ```
204
 
205
+ ## Computational Efficiency on Different GPUs
206
 
207
+ We test the computational efficiency of different **Wan2.2** models on different GPUs in the following table. The results are presented in the format: **Total time (s) / peak GPU memory (GB)**.
208
 
209
 
210
+ <div align="center">
211
+ <img src="assets/comp_effic.png" alt="" style="width: 80%;" />
212
+ </div>
213
 
214
+ > The parameter settings for the tests presented in this table are as follows:
215
+ > (1) Multi-GPU: 14B: `--ulysses_size 4/8 --dit_fsdp --t5_fsdp`, 5B: `--ulysses_size 4/8 --offload_model True --convert_model_dtype --t5_cpu`; Single-GPU: 14B: `--offload_model True --convert_model_dtype`, 5B: `--offload_model True --convert_model_dtype --t5_cpu`
216
+ > (`--convert_model_dtype` converts model parameter types to `config.param_dtype`);
217
+ > (2) The distributed testing utilizes the built-in FSDP and Ulysses implementations, with FlashAttention3 deployed on Hopper architecture GPUs;
218
+ > (3) Tests were run without the `--use_prompt_extend` flag;
219
+ > (4) Reported results are the average of multiple samples taken after the warm-up phase.
220
 
 
221
 
222
+ -------
223
 
224
+ ## Introduction of Wan2.2
225
 
226
+ **Wan2.2** builds on the foundation of Wan2.1 with notable improvements in generation quality and model capability. This upgrade is driven by a series of key technical innovations, mainly including the Mixture-of-Experts (MoE) architecture, upgraded training data, and high-compression video generation.
227
 
228
+ ##### (1) Mixture-of-Experts (MoE) Architecture
 
 
 
 
229
 
230
+ Wan2.2 introduces a Mixture-of-Experts (MoE) architecture into the video generation diffusion model. MoE has been widely validated in large language models as an efficient way to increase total model parameters while keeping inference cost nearly unchanged. In Wan2.2, the A14B model series adopts a two-expert design tailored to the denoising process of diffusion models: a high-noise expert for the early stages, which focuses on overall layout, and a low-noise expert for the later stages, which refines video details. Each expert has about 14B parameters, for a total of 27B parameters but only 14B active parameters per step, so inference computation and GPU memory stay nearly unchanged.
231
 
232
+ <div align="center">
233
+ <img src="assets/moe_arch.png" alt="" style="width: 90%;" />
234
+ </div>
235
 
236
+ The transition point between the two experts is determined by the signal-to-noise ratio (SNR), a metric that decreases monotonically as the denoising step $t$ increases. At the beginning of the denoising process, $t$ is large and the noise level is high, so the SNR is at its minimum, denoted as ${SNR}_{min}$. In this stage, the high-noise expert is activated. We define a threshold step ${t}_{moe}$ corresponding to half of the ${SNR}_{min}$, and switch to the low-noise expert when $t<{t}_{moe}$.
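The switching rule in the paragraph above can be sketched in a few lines. This is an illustration only: the function name `select_expert` and the threshold value `875` are hypothetical and are not taken from the Wan2.2 code. The one thing that follows the text is the comparison itself: the low-noise expert is used when $t<{t}_{moe}$, the high-noise expert otherwise.

```python
def select_expert(t: int, t_moe: int) -> str:
    """Pick the expert for denoising step t (hypothetical helper).

    Large t means high noise (early in denoising); small t means low noise.
    """
    return "low_noise" if t < t_moe else "high_noise"


T_MOE = 875  # illustrative threshold, not a value from the README

# Denoising runs from large t to small t, so the high-noise expert goes first.
timesteps = [999, 900, 875, 874, 500, 0]
schedule = [(t, select_expert(t, T_MOE)) for t in timesteps]
for t, expert in schedule:
    print(f"t={t:4d} -> {expert}")
```

Because exactly one expert is chosen per step, only one set of roughly 14B parameters is active at a time, which is why the per-step cost stays close to that of a single dense model.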
237
 
238
+ <div align="center">
239
+ <img src="assets/moe_2.png" alt="" style="width: 90%;" />
240
+ </div>
241
 
242
+ To validate the effectiveness of the MoE architecture, four settings are compared based on their validation loss curves. The baseline **Wan2.1** model does not employ the MoE architecture. Among the MoE-based variants, **Wan2.1 & High-Noise Expert** reuses the Wan2.1 model as the low-noise expert and pairs it with Wan2.2's high-noise expert, while **Wan2.1 & Low-Noise Expert** uses Wan2.1 as the high-noise expert and pairs it with Wan2.2's low-noise expert. **Wan2.2 (MoE)**, our final version, achieves the lowest validation loss, indicating that its generated video distribution is closest to the ground truth and that it converges best.
243
 
 
244
 
245
+ ##### (2) Efficient High-Definition Hybrid TI2V
246
+ To enable more efficient deployment, Wan2.2 also explores a high-compression design. In addition to the 27B MoE models, a 5B dense model, i.e., TI2V-5B, is released. It is supported by a high-compression Wan2.2-VAE, which achieves a $T\times H\times W$ compression ratio of $4\times16\times16$, increasing the overall compression rate to 64 while maintaining high-quality video reconstruction. With an additional patchification layer, the total compression ratio of TI2V-5B reaches $4\times32\times32$. Without specific optimization, TI2V-5B can generate a 5-second 720P video in under 9 minutes on a single consumer-grade GPU, ranking among the fastest 720P@24fps video generation models. This model also natively supports both text-to-video and image-to-video tasks within a single unified framework, covering both academic research and practical applications.
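The ratios quoted above can be checked with a little arithmetic. The sketch below makes three assumptions that this README does not state: the latent has 48 channels (one way to arrive at the overall rate of 64 from a $4\times16\times16$ volume ratio on 3-channel RGB input), the patchification layer is $2\times2$ in space (inferred from 16 becoming 32), and the frame height is 704 rather than 720 so that it divides evenly by 32. The helper name is hypothetical.

```python
def compression_summary(height, width, t_ratio=4, s_ratio=16, patch=2,
                        rgb_channels=3, latent_channels=48):
    """Work through the T x H x W compression figures (illustrative only).

    latent_channels=48 and patch=2 are assumptions, not values from the README.
    """
    volume_ratio = t_ratio * s_ratio * s_ratio  # 4 * 16 * 16
    # Fewer positions but more channels per position: net reduction in values.
    overall_rate = volume_ratio * rgb_channels / latent_channels
    total_spatial = s_ratio * patch  # 16 * 2 = 32 after patchification
    return {
        "volume_ratio": volume_ratio,
        "overall_rate": overall_rate,
        "latent_hw": (height // s_ratio, width // s_ratio),
        "token_hw": (height // total_spatial, width // total_spatial),
    }


summary = compression_summary(704, 1280)
print(summary)
```

Under these assumptions a 704×1280 frame maps to a 44×80 latent grid and a 22×40 token grid, so each latent frame contributes 880 tokens to the diffusion transformer.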
247
 
 
248
 
249
+ <div align="center">
250
+ <img src="assets/vae.png" alt="" style="width: 80%;" />
251
+ </div>
252
 
 
253
 
 
254
 
255
+ ##### Comparisons to SOTAs
256
+ We compared Wan2.2 with leading closed-source commercial models on our new Wan-Bench 2.0, evaluating performance across multiple crucial dimensions. The results demonstrate that Wan2.2 achieves superior performance compared to these leading models.
257
 
 
258
 
259
+ <div align="center">
260
+ <img src="assets/performance.png" alt="" style="width: 90%;" />
261
+ </div>
262
 
263
+ ## Citation
264
+ If you find our work helpful, please cite us.
265
 
266
+ ```
267
+ @article{wan2025,
268
+ title={Wan: Open and Advanced Large-Scale Video Generative Models},
269
+ author={Team Wan and Ang Wang and Baole Ai and Bin Wen and Chaojie Mao and Chen-Wei Xie and Di Chen and Feiwu Yu and Haiming Zhao and Jianxiao Yang and Jianyuan Zeng and Jiayu Wang and Jingfeng Zhang and Jingren Zhou and Jinkai Wang and Jixuan Chen and Kai Zhu and Kang Zhao and Keyu Yan and Lianghua Huang and Mengyang Feng and Ningyi Zhang and Pandeng Li and Pingyu Wu and Ruihang Chu and Ruili Feng and Shiwei Zhang and Siyang Sun and Tao Fang and Tianxing Wang and Tianyi Gui and Tingyu Weng and Tong Shen and Wei Lin and Wei Wang and Wei Wang and Wenmeng Zhou and Wente Wang and Wenting Shen and Wenyuan Yu and Xianzhong Shi and Xiaoming Huang and Xin Xu and Yan Kou and Yangyu Lv and Yifei Li and Yijing Liu and Yiming Wang and Yingya Zhang and Yitong Huang and Yong Li and You Wu and Yu Liu and Yulin Pan and Yun Zheng and Yuntao Hong and Yupeng Shi and Yutong Feng and Zeyinzi Jiang and Zhen Han and Zhi-Fan Wu and Ziyu Liu},
270
+ journal = {arXiv preprint arXiv:2503.20314},
271
+ year={2025}
272
+ }
273
+ ```
274
 
275
+ ## License Agreement
276
+ The models in this repository are licensed under the Apache 2.0 License. We claim no rights over your generated content, granting you the freedom to use it while ensuring that your usage complies with the provisions of this license. You are fully accountable for your use of the models, which must not involve sharing any content that violates applicable laws, causes harm to individuals or groups, disseminates personal information intended for harm, spreads misinformation, or targets vulnerable populations. For a complete list of restrictions and details regarding your rights, please refer to the full text of the [license](LICENSE.txt).
277
 
 
278
 
279
+ ## Acknowledgements
280
 
281
+ We would like to thank the contributors to the [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [Qwen](https://huggingface.co/Qwen), [umt5-xxl](https://huggingface.co/google/umt5-xxl), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories, for their open research.
282
 
 
283
 
 
284
 
285
+ ## Contact Us
286
+ If you would like to leave a message to our research or product teams, feel free to join our [Discord](https://discord.gg/AKNgpMK4Yj) or [WeChat groups](https://gw.alicdn.com/imgextra/i2/O1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg)!