comfyui workflow
#1 opened by littlC

README.md CHANGED
````diff
@@ -7,7 +7,7 @@ base_model:
 ---
 # FastVideo FastWan2.1-T2V-14B-480P-Diffusers
 <p align="center">
-  <img src="https://raw.githubusercontent.com/hao-ai-lab/FastVideo/main/assets/logo.
+  <img src="https://raw.githubusercontent.com/hao-ai-lab/FastVideo/main/assets/logo.jpg" width="200"/>
 </p>
 <div>
 <div align="center">
````
````diff
@@ -24,59 +24,17 @@ base_model:
 
 ## Introduction
 
-
-
-FastWan2.1-T2V-14B-480P-Diffuserss is built upon Wan-AI/Wan2.1-T2V-14B-Diffusers. It supports efficient **3-step inference** and produces high-quality videos at 61×448×832 resolution. For training, we use the FastVideo 480P Synthetic Wan dataset, which contains 600k synthetic latents.
+This model is jointly finetuned with [DMD](https://arxiv.org/pdf/2405.14867) and [VSA](https://arxiv.org/pdf/2505.13389), based on [Wan-AI/Wan2.1-T2V-1.3B-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers). It supports efficient 3-step inference and generates high-quality videos at **61×448×832** resolution. We adopt the [FastVideo 480P Synthetic Wan dataset](https://huggingface.co/datasets/FastVideo/Wan-Syn_77x448x832_600k), consisting of 600k synthetic latents.
 
+---
 
 ## Model Overview
 
-- 3-step inference is supported and achieves up to **50x speed up**
-
+- 3-step inference is supported and achieves up to **50x speed up** on a single **H100** GPU.
+- Supports generating videos with resolution **61×448×832**.
 - Finetuning and inference scripts are available in the [FastVideo](https://github.com/hao-ai-lab/FastVideo) repository:
-  - [
-  - [
-- Inference script in FastVideo:
-```python
-#!/bin/bash
-
-# install FastVideo and VSA first
-git clone https://github.com/hao-ai-lab/FastVideo
-pip install -e .
-cd csrc/attn
-git submodule update --init --recursive
-python setup_vsa.py install
-
-num_gpus=1
-export FASTVIDEO_ATTENTION_BACKEND=VIDEO_SPARSE_ATTN
-export MODEL_BASE=FastVideo/FastWan2.1-T2V-14B-480P-Diffusers
-
-# 720P 14B
-# Torch compile is enabled. Expect generating the first video to be slow.
-# Speed on H200 after warmup 3/3 [00:13<00:00, 4.45s/it]:
-num_gpus=1
-export FASTVIDEO_ATTENTION_BACKEND=VIDEO_SPARSE_ATTN
-export MODEL_BASE=FastVideo/FastWan2.1-T2V-14B-480P-Diffusers
-# export MODEL_BASE=hunyuanvideo-community/HunyuanVideo
-# You can either use --prompt or --prompt-txt, but not both.
-fastvideo generate \
-    --model-path $MODEL_BASE \
-    --sp-size $num_gpus \
-    --tp-size 1 \
-    --num-gpus $num_gpus \
-    --height 720 \
-    --width 1280 \
-    --num-frames 81 \
-    --num-inference-steps 3 \
-    --fps 16 \
-    --prompt-txt assets/prompt.txt \
-    --negative-prompt "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards" \
-    --seed 1024 \
-    --output-path outputs_video_dmd_14B_720P/ \
-    --VSA-sparsity 0.9 \
-    --dmd-denoising-steps "1000,757,522" \
-    --enable_torch_compile
-```
+  - [Finetuning script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/distill/v1_distill_dmd_wan_VSA.sh)
+  - [Inference script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/inference/v1_inference_wan_dmd.sh)
 - Try it out on **FastVideo** — we support a wide range of GPUs from **H100** to **4090**, and also support **Mac** users!
 
 ### Training Infrastructure
````
````diff
@@ -84,6 +42,9 @@ fastvideo generate \
 Training was conducted on **8 nodes with 64 H200 GPUs** in total, using a `global batch size = 64`.
 We enable `gradient checkpointing`, set `HSDP_shard_dim = 8`, `sequence_parallel_size = 4`, and use `learning rate = 1e-5`.
 We set **VSA attention sparsity** to 0.9, and training runs for **3000 steps (~52 hours)**
+The detailed **training example script** is available [here](https://github.com/hao-ai-lab/FastVideo/blob/main/examples/distill/Wan-Syn-480P/distill_dmd_VSA_t2v_14B_480P.slurm).
+
+
 
 If you use FastWan2.1-T2V-14B-480P-Diffusers model for your research, please cite our paper:
 ```
````
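Note: the diff above drops the inline 720P inference script in favor of linking the repository's inference script. For readers who still want a copy-pasteable command, the sketch below reuses the exact `fastvideo generate` flags from the removed block and only swaps the resolution flags to match the **61×448×832** output this 480P card advertises. The prompt text and output directory are placeholders, and the `--dmd-denoising-steps "1000,757,522"` values are carried over unchanged as the 3-step schedule that pairs with `--num-inference-steps 3`. Treat it as an untested adaptation, not an official FastVideo example.

```bash
#!/bin/bash
# Untested 480P adaptation of the removed inline script (assumptions noted above).

num_gpus=1
export FASTVIDEO_ATTENTION_BACKEND=VIDEO_SPARSE_ATTN
export MODEL_BASE=FastVideo/FastWan2.1-T2V-14B-480P-Diffusers

# 61 frames at 448x832, denoised in 3 DMD steps with 90% VSA sparsity.
fastvideo generate \
    --model-path $MODEL_BASE \
    --sp-size $num_gpus \
    --tp-size 1 \
    --num-gpus $num_gpus \
    --height 448 \
    --width 832 \
    --num-frames 61 \
    --num-inference-steps 3 \
    --fps 16 \
    --prompt "A corgi runs along a beach at sunset, cinematic lighting." \
    --seed 1024 \
    --output-path outputs_video_dmd_14B_480P/ \
    --VSA-sparsity 0.9 \
    --dmd-denoising-steps "1000,757,522" \
    --enable_torch_compile
```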