---
license: apache-2.0
datasets:
- FastVideo/Wan-Syn_77x448x832_600k
base_model:
- Wan-AI/Wan2.1-T2V-1.3B-Diffusers
---

# FastVideo FastWan2.1-T2V-1.3B-Diffusers Model
<p align="center">
  <img src="https://raw.githubusercontent.com/hao-ai-lab/FastVideo/main/assets/logo.jpg" width="200"/>
</p>
<div>
  <div align="center">
    <a href="https://github.com/hao-ai-lab/FastVideo" target="_blank">FastVideo Team</a>&emsp;
  </div>

  <div align="center">
    <a href="https://arxiv.org/pdf/2505.13389">Paper</a> | 
    <a href="https://github.com/hao-ai-lab/FastVideo">Github</a>
  </div>
</div>



## Introduction

This model is jointly finetuned with [DMD](https://arxiv.org/pdf/2405.14867) and [VSA](https://arxiv.org/pdf/2505.13389), based on [Wan-AI/Wan2.1-T2V-1.3B-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers). It supports efficient 3-step inference and generates high-quality videos at **61×448×832** resolution. Training uses the [FastVideo 480P Synthetic Wan dataset](https://huggingface.co/datasets/FastVideo/Wan-Syn_77x448x832_600k), which consists of 600k synthetic latents.
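
Since the checkpoint is distributed in Diffusers format, a minimal loading sketch with the stock `WanPipeline` is shown below. Note that the repository lists `WanDMDPipeline` as its pipeline class, so whether the stock pipeline reproduces the intended few-step sampling depends on the bundled scheduler config; the repo id, step count, and guidance settings here are assumptions, not a verified recipe.

```python
# Sketch: loading the Diffusers-format checkpoint with the stock WanPipeline.
# The 3-step / no-CFG settings are assumptions based on the DMD distillation
# described above, not a confirmed recipe for this repo.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "FastVideo/FastWan2.1-T2V-1.3B-Diffusers"  # assumed repo id
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

frames = pipe(
    prompt="A kitten chasing a butterfly through tall grass, cinematic lighting",
    height=448,
    width=832,
    num_frames=61,
    num_inference_steps=3,  # DMD-distilled models target few-step sampling
    guidance_scale=1.0,     # distilled models typically run without CFG
).frames[0]
export_to_video(frames, "fastwan_t2v.mp4", fps=16)
```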

---

## Model Overview

- 3-step inference is supported and achieves up to **20 FPS** on a single **H100** GPU.
- Our model is trained at **61×448×832** resolution, but it can generate videos at other resolutions (quality may degrade).
- Finetuning and inference scripts are available in the [FastVideo](https://github.com/hao-ai-lab/FastVideo) repository:  
  - [Finetuning script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/distill/v1_distill_dmd_wan_VSA.sh)  
  - [Inference script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/inference/v1_inference_wan_dmd.sh)
- Try it out with **FastVideo**: we support a wide range of GPUs from **H100** to **4090**, and also support **Mac** users! A minimal usage sketch follows this list.
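
The sketch below runs this checkpoint through FastVideo's `VideoGenerator` API. The class and argument names follow the FastVideo repository README and may differ across versions, so treat this as an assumption rather than a verified recipe.

```python
# Sketch of FastVideo-based inference; API names (VideoGenerator.from_pretrained,
# generate_video) follow the FastVideo README and may differ across versions.
from fastvideo import VideoGenerator

def main():
    # Loads the DMD+VSA distilled checkpoint; num_gpus can be raised for parallel inference.
    generator = VideoGenerator.from_pretrained(
        "FastVideo/FastWan2.1-T2V-1.3B-Diffusers",
        num_gpus=1,
    )
    generator.generate_video(
        "A timelapse of a city skyline as day turns to night",
        output_path="outputs/",
        save_video=True,
    )

if __name__ == "__main__":
    main()
```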

### Training Infrastructure

Training was conducted on **4 nodes with 32 H200 GPUs** in total, using a global batch size of **64**.
We enable gradient checkpointing, set `gradient_accumulation_steps=2`, and use a learning rate of `1e-5`.
**VSA attention sparsity** is set to **0.8**, and training runs for **4000 steps (~12 hours)**.
The detailed **training example script** is available [here](https://github.com/hao-ai-lab/FastVideo/blob/main/examples/distill/Wan-Syn-480P/distill_dmd_VSA_t2v.slurm).
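
For reference, the stated global batch size is consistent with a per-GPU micro-batch of 1; that micro-batch size is an assumption, since it is not given above.

```python
# Sanity check of the effective batch size under the stated settings.
# per_gpu_batch_size = 1 is an assumption; the other values come from the text above.
num_gpus = 32                    # 4 nodes x 8 H200 GPUs
gradient_accumulation_steps = 2
per_gpu_batch_size = 1           # assumed micro-batch per GPU

global_batch_size = num_gpus * gradient_accumulation_steps * per_gpu_batch_size
print(global_batch_size)  # 64, matching the reported global batch size
```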



## Citation

If you use the FastWan2.1-T2V-1.3B-Diffusers model for your research, please cite our paper:
```
@article{zhang2025vsa,
  title={VSA: Faster Video Diffusion with Trainable Sparse Attention},
  author={Zhang, Peiyuan and Huang, Haofeng and Chen, Yongqi and Lin, Will and Liu, Zhengzhong and Stoica, Ion and Xing, Eric and Zhang, Hao},
  journal={arXiv preprint arXiv:2505.13389},
  year={2025}
}
@article{zhang2025fast,
  title={Fast video generation with sliding tile attention},
  author={Zhang, Peiyuan and Chen, Yongqi and Su, Runlong and Ding, Hangliang and Stoica, Ion and Liu, Zhengzhong and Zhang, Hao},
  journal={arXiv preprint arXiv:2502.04507},
  year={2025}
}
```