LanguageBind committed
Commit 25212bb · verified · 1 Parent(s): 7af5c69

Upload 4 files

Files changed (4)
  1. README.md +424 -3
  2. README_cn.md +424 -0
  3. Report-V1.5.0.md +188 -0
  4. Report-V1.5.0_cn.md +188 -0
README.md CHANGED
@@ -1,3 +1,424 @@
1
- ---
2
- license: mit
3
- ---
1
+ Open-Sora Plan v1.5.0 is trained using the MindSpeed-MM toolkit.
2
+
3
+ ### Prerequisites
4
+
5
+ Open-Sora Plan v1.5.0 is trained using CANN version 8.0.1. Please refer to the official guide [CANN8_0_1](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software/264595017?idAbsPath=fixnode01|23710424|251366513|22892968|252309113|251168373) for installation instructions.
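+
+ After installing CANN, a quick sanity check such as the one below (paths are the common defaults; adjust them if CANN is installed elsewhere) can confirm that the toolkit and driver are visible:
+
+ ```bash
+ # Load the CANN environment variables (default install location).
+ source /usr/local/Ascend/ascend-toolkit/set_env.sh
+ # List the visible Ascend NPUs and their driver/firmware status.
+ npu-smi info
+ ```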
6
+
7
+ ### Runtime Environment
8
+
9
+ 1. To begin, install **Torch** and **MindSpeed** as required for the training environment.
10
+
11
+ ```bash
12
+ # python3.8
13
+ conda create -n osp python=3.8
14
+ conda activate osp
15
+
16
+ # Install torch and torch_npu, making sure to select the versions compatible with your Python version and system architecture (x86 or ARM), including the corresponding apex package.
17
+ pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
18
+ pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
19
+
20
+ # apex for Ascend, refer to https://gitee.com/ascend/apex
21
+ # It is recommended to build and install from the official source repository.
22
+
23
+ # Modify the environment variable paths in the shell script to the actual paths. Example:
24
+ source /usr/local/Ascend/ascend-toolkit/set_env.sh
25
+
26
+ # install mindspeed
27
+ git clone https://gitee.com/ascend/MindSpeed.git
28
+ cd MindSpeed
29
+ git checkout 59b4e983b7dc1f537f8c6b97a57e54f0316fafb0
30
+ pip install -r requirements.txt
31
+ pip3 install -e .
32
+ cd ..
33
+
34
+ # install the remaining dependencies of this repository (run from the repository root)
35
+ pip install -e .
36
+ ```
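+
+ As a quick check that the environment is usable (a minimal sketch; `torch_npu` exposes the NPU device through the standard PyTorch device API):
+
+ ```python
+ # Verify that torch and torch_npu import correctly and that an NPU is visible.
+ import torch
+ import torch_npu  # registers the "npu" device with PyTorch
+
+ print(torch.__version__)
+ print(torch_npu.npu.is_available())   # should print True on a working Ascend setup
+ print(torch_npu.npu.device_count())   # number of visible NPUs
+ ```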
37
+
38
+ 2. Install decord
39
+
40
+ ```bash
41
+ git clone --recursive https://github.com/dmlc/decord
42
+ cd decord && mkdir build && cd build
43
+ cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release -DFFMPEG_DIR=/usr/local/ffmpeg
44
+ make
45
+ cd ../python
46
+ pwd=$PWD
47
+ echo "PYTHONPATH=$PYTHONPATH:$pwd" >> ~/.bashrc
48
+ source ~/.bashrc
49
+ python3 setup.py install --user
50
+ ```
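+
+ To confirm the local decord build works, a small check like this can be run (the video path is a placeholder):
+
+ ```python
+ # Open a short test clip with the freshly built decord.
+ import decord
+
+ vr = decord.VideoReader("sample.mp4")   # replace with any local video file
+ print(len(vr), "frames at", vr.get_avg_fps(), "fps")
+ frame = vr[0].asnumpy()                 # first frame as an HWC uint8 array
+ print(frame.shape)
+ ```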
51
+
52
+ ### Download Weights
53
+
54
+ Modelers:
55
+
56
+ https://modelers.cn/models/PKU-YUAN-Group/Open-Sora-Plan-v1.5.0
57
+
58
+ Hugging Face:
59
+
60
+ https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.5.0
61
+
62
+ T5:
63
+
64
+ [google/t5-v1_1-xl · Hugging Face](https://huggingface.co/google/t5-v1_1-xl)
65
+
66
+ CLIP:
67
+
68
+ [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k · Hugging Face](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
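+
+ One convenient way to fetch these checkpoints is `huggingface-cli` (the target directories below are only examples; the paths referenced in the JSON configs should point to wherever you place the weights):
+
+ ```bash
+ pip install -U "huggingface_hub[cli]"
+ huggingface-cli download LanguageBind/Open-Sora-Plan-v1.5.0 --local-dir ./pretrained/Open-Sora-Plan-v1.5.0
+ huggingface-cli download google/t5-v1_1-xl --local-dir ./pretrained/t5/t5-v1_1-xl
+ huggingface-cli download laion/CLIP-ViT-bigG-14-laion2B-39B-b160k --local-dir ./pretrained/clip/CLIP-ViT-bigG-14-laion2B-39B-b160k
+ ```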
69
+
70
+ ### Train Text-to-Video
71
+
72
+ Make sure to properly configure `data.json` and `model_opensoraplan1_5.json`.
73
+
74
+ #### data.json
75
+
76
+ ```
77
+ {
78
+ "dataset_param": {
79
+ "dataset_type": "t2v",
80
+ "basic_parameters": {
81
+ "data_path": "./examples/opensoraplan1.5/data.txt",
82
+ "data_folder": "",
83
+ "data_storage_mode": "combine"
84
+ },
85
+ "preprocess_parameters": {
86
+ "video_reader_type": "decoder",
87
+ "image_reader_type": "Image",
88
+ "num_frames": 121,
89
+ "frame_interval": 1,
90
+ "max_height": 576, # Sample height when fixed resolution is enabled; this setting is ignored when multi-resolution is enabled.
91
+ "max_width": 1024, # Sample width when fixed resolution is enabled; this setting is ignored when multi-resolution is enabled.
92
+ "max_hxw": 589824, # Maximum number of tokens when multi-resolution is enabled.
93
+ "min_hxw": 589824, # Minimum number of tokens when multi-resolution is enabled. Additionally, when force_resolution is enabled, min_hxw should be set to max_height * max_width to filter out low-resolution samples, or to a custom value for stricter filtering criteria.
94
+ "force_resolution": true, # Enable fixed-resolution training.
95
+ "force_5_ratio": false, # Enable multi-resolution training with 5 aspect ratios.
96
+ "max_h_div_w_ratio": 1.0, # Maximum allowed aspect ratio for filtering.
97
+ "min_h_div_w_ratio": 0.42, # Minimum allowed aspect ratio for filtering.
98
+ "hw_stride": 16,
99
+ "ae_stride_t": 8,
100
+ "train_fps": 24, # Sampling FPS during training; all videos with varying frame rates will be resampled to train_fps.
101
+ "speed_factor": 1.0,
102
+ "drop_short_ratio": 1.0,
103
+ "min_num_frames": 29,
104
+ "cfg": 0.1,
105
+ "batch_size": 1,
106
+ "gradient_accumulation_size": 4,
107
+ "use_aesthetic": false,
108
+ "train_pipeline": {
109
+ "video": [{
110
+ "trans_type": "ToTensorVideo"
111
+ },
112
+ {
113
+ "trans_type": "CenterCropResizeVideo",
114
+ "param": {
115
+ "size": [576, 1024],
116
+ "interpolation_mode": "bicubic"
117
+ }
118
+ },
119
+ {
120
+ "trans_type": "ae_norm"
121
+ }
122
+ ],
123
+ "image": [{
124
+ "trans_type": "ToTensorVideo"
125
+ },
126
+ {
127
+ "trans_type": "CenterCropResizeVideo",
128
+ "param": {
129
+ "size": [576, 1024],
130
+ "interpolation_mode": "bicubic"
131
+ }
132
+ },
133
+ {
134
+ "trans_type": "ae_norm"
135
+ }
136
+ ]
137
+ }
138
+ },
139
+ "use_text_processer": true,
140
+ "enable_text_preprocess": true,
141
+ "model_max_length": 512,
142
+ "tokenizer_config": {
143
+ "hub_backend": "hf",
144
+ "autotokenizer_name": "AutoTokenizer",
145
+ "from_pretrained": "/work/share/checkpoint/pretrained/t5/t5-v1_1-xl"
146
+ },
147
+ "tokenizer_config_2": {
148
+ "hub_backend": "hf",
149
+ "autotokenizer_name": "AutoTokenizer",
150
+ "from_pretrained": "/work/share/checkpoint/pretrained/clip/models--laion--CLIP-ViT-bigG-14-laion2B-39B-b160k/snapshots/bc7788f151930d91b58474715fdce5524ad9a189"
151
+ },
152
+ "use_feature_data": false,
153
+ "use_img_from_vid": false
154
+ },
155
+ "dataloader_param": {
156
+ "dataloader_mode": "sampler",
157
+ "sampler_type": "LengthGroupedSampler", # Enable the Group Data strategy (enabled by default).
158
+ "batch_size": 1,
159
+ "num_workers": 4,
160
+ "shuffle": false,
161
+ "drop_last": true,
162
+ "pin_memory": false,
163
+ "group_data": true,
164
+ "initial_global_step_for_sampler": 0,
165
+ "gradient_accumulation_size": 4,
166
+ "collate_param": {
167
+ "model_name": "GroupLength", # Enable the Group Data-specific collate function (enabled by default).
168
+ "batch_size": 1,
169
+ "num_frames": 121,
170
+ "group_data": true,
171
+ "ae_stride": 8,
172
+ "ae_stride_t": 8,
173
+ "patch_size": 2,
174
+ "patch_size_t": 1
175
+ }
176
+ }
177
+ }
178
+
179
+ ```
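+
+ The inline `#` comments above are annotations only; the actual `data.json` must be plain JSON. A small sanity check like the following (the config path is an assumption, matching the example script directory) verifies the resolution constraint described above:
+
+ ```python
+ # Check that min_hxw covers max_height * max_width when fixed-resolution training is enabled.
+ import json
+
+ with open("examples/opensoraplan1.5/data.json") as f:
+     pp = json.load(f)["dataset_param"]["preprocess_parameters"]
+
+ if pp["force_resolution"]:
+     expected = pp["max_height"] * pp["max_width"]   # 576 * 1024 = 589824 in this example
+     assert pp["min_hxw"] >= expected, (pp["min_hxw"], expected)
+     print("force_resolution on:", pp["max_height"], "x", pp["max_width"], "->", expected)
+ ```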
180
+
181
+ #### model_opensoraplan1_5.json
182
+
183
+ ```
184
+ {
185
+ "frames": 121,
186
+ "allow_tf32": false,
187
+ "allow_internal_format": false,
188
+ "load_video_features": false,
189
+ "load_text_features": false,
190
+ "enable_encoder_dp": true, # MindSpeed optimization. It takes effect when TP (tensor parallelism) degree is greater than 1.
191
+ "weight_dtype": "bf16",
192
+ "ae": {
193
+ "model_id": "wfvae",
194
+ "base_channels": 160,
195
+ "connect_res_layer_num": 1,
196
+ "decoder_energy_flow_hidden_size": 128,
197
+ "decoder_num_resblocks": 2,
198
+ "dropout": 0.0,
199
+ "encoder_energy_flow_hidden_size": 128,
200
+ "encoder_num_resblocks": 2,
201
+ "l1_dowmsample_block": "Spatial2xTime2x3DDownsample",
202
+ "l1_downsample_wavelet": "HaarWaveletTransform3D",
203
+ "l1_upsample_block": "Spatial2xTime2x3DUpsample",
204
+ "l1_upsample_wavelet": "InverseHaarWaveletTransform3D",
205
+ "l2_dowmsample_block": "Spatial2xTime2x3DDownsample",
206
+ "l2_downsample_wavelet": "HaarWaveletTransform3D",
207
+ "l2_upsample_block": "Spatial2xTime2x3DUpsample",
208
+ "l2_upsample_wavelet": "InverseHaarWaveletTransform3D",
209
+ "latent_dim": 32,
210
+ "norm_type": "layernorm",
211
+ "scale": [0.7031, 0.7109, 1.5391, 1.2969, 0.7109, 1.4141, 1.3828, 2.1719, 1.7266,
212
+ 1.8281, 1.9141, 1.2031, 0.6875, 0.9609, 1.6484, 1.1875, 1.5312, 1.1328,
213
+ 0.8828, 0.6836, 0.8828, 0.9219, 1.6953, 1.4453, 1.5312, 0.6836, 0.7656,
214
+ 0.8242, 1.2344, 1.0312, 1.7266, 0.9492],
215
+ "shift": [-0.2129, 0.1226, 1.6328, 0.6211, -0.8750, 0.6172, -0.5703, 0.1348,
216
+ -0.2178, -0.9375, 0.3184, 0.3281, -0.0544, -0.1826, -0.2812, 0.4355,
217
+ 0.1621, -0.2578, 0.7148, -0.7422, -0.2295, -0.2324, -1.4922, 0.6328,
218
+ 1.1250, -0.2578, -2.1094, 1.0391, 1.1797, -1.2422, -0.2988, -0.9570],
219
+ "t_interpolation": "trilinear",
220
+ "use_attention": true,
221
+ "use_tiling": true, # Whether to enable the tiling strategy.
222
+ "from_pretrained": "/work/share/checkpoint/pretrained/vae/Middle888/merged.ckpt",
223
+ "dtype": "fp32"
224
+ },
225
+ "text_encoder": {
226
+ "hub_backend": "hf",
227
+ "model_id": "T5",
228
+ "from_pretrained": "/work/share/checkpoint/pretrained/t5/t5-v1_1-xl",
229
+ "low_cpu_mem_usage": false
230
+ },
231
+ "text_encoder_2":{
232
+ "hub_backend": "hf",
233
+ "model_id": "CLIPWithProjection",
234
+ "from_pretrained": "/work/share/checkpoint/pretrained/clip/models--laion--CLIP-ViT-bigG-14-laion2B-39B-b160k/snapshots/bc7788f151930d91b58474715fdce5524ad9a189",
235
+ "low_cpu_mem_usage": false
236
+ },
237
+ "predictor": {
238
+ "model_id": "SparseUMMDiT",
239
+ "num_layers": [2, 4, 6, 8, 6, 4, 2], # Number of layers per stage.
240
+ "sparse_n": [1, 2, 4, 8, 4, 2, 1], # Sparsity level for each stage.
241
+ "double_ff": true, # Whether to use a shared FFN for visual and textual inputs, or separate FFNs for each.
242
+ "sparse1d": true, # Whether to use the Skiparse strategy; setting this to false results in a dense DiT.
243
+ "num_heads": 24,
244
+ "head_dim": 128,
245
+ "in_channels": 32,
246
+ "out_channels": 32,
247
+ "timestep_embed_dim": 1024,
248
+ "caption_channels": 2048,
249
+ "pooled_projection_dim": 1280,
250
+ "skip_connection": true, # Whether to add skip connections.
251
+ "dropout": 0.0,
252
+ "attention_bias": true,
253
+ "patch_size": 2,
254
+ "patch_size_t": 1,
255
+ "activation_fn": "gelu-approximate",
256
+ "norm_elementwise_affine": false,
257
+ "norm_eps": 1e-06,
258
+ "from_pretrained": null # Path to the pretrained weights; merged weights must be used.
259
+ },
260
+ "diffusion": {
261
+ "model_id": "OpenSoraPlan",
262
+ "weighting_scheme": "logit_normal",
263
+ "use_dynamic_shifting": true
264
+ }
265
+ }
266
+
267
+ ```
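+
+ As a rough guide to how these settings relate, the sketch below estimates the DiT sequence length for the 121×576×1024 setting; the mapping of 1+8k frames to 1+k latent frames for the 8× temporal stride is an assumption made for illustration:
+
+ ```python
+ # Back-of-the-envelope DiT sequence length for 121x576x1024 with the config above.
+ frames, height, width = 121, 576, 1024
+ ae_stride_t, ae_stride_hw = 8, 8          # WFVAE 8x8x8 compression
+ patch_size_t, patch_size = 1, 2           # patchify settings from the predictor config
+
+ latent_t = (frames - 1) // ae_stride_t + 1    # 16  (assumed causal-style frame mapping)
+ latent_h = height // ae_stride_hw             # 72
+ latent_w = width // ae_stride_hw              # 128
+
+ seq_len = (latent_t // patch_size_t) * (latent_h // patch_size) * (latent_w // patch_size)
+ print(latent_t, latent_h, latent_w, seq_len)  # 16 72 128 36864
+ ```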
268
+
269
+ Enter the Open-Sora Plan directory and run:
270
+
271
+ ```
272
+ bash examples/opensoraplan1.5/pretrain_opensoraplan1_5.sh
273
+ ```
274
+
275
+ **Parameter Description:**
276
+
277
+ `--optimizer-selection fused_ema_adamw` Select the optimizer to use. In our case, `fused_ema_adamw` is required to obtain EMA-based weights.
278
+
279
+ `--model_custom_precision` Different components use different precisions, rather than adopting Megatron’s default of full-model bf16 precision. For example, the VAE is run in fp32, while the text encoder and DiT use bf16.
280
+
281
+ `--clip_grad_ema_decay 0.99` Set the EMA decay rate used in adaptive gradient clipping.
282
+
283
+ `--selective_recom` `--recom_ffn_layers 32` Enable selective recomputation and set the number of layers it applies to. When selective recomputation is active, only the FFN layers are recomputed while the Attention layers are skipped, enabling faster training. These flags are mutually exclusive with `--recompute-granularity full`, `--recompute-method block`, and `--recompute-num-layers 0`; when selective recomputation is enabled, full-layer recomputation is disabled by default.
284
+
285
+ ### Sample Text-to-Video
286
+
287
+ Because the model is trained with tensor parallelism (TP), the saved weights are partitioned and must be merged before running inference.
288
+
289
+ #### Merge Weights
290
+
291
+ ```
292
+ python examples/opensoraplan1.5/convert_mm_to_ckpt.py --load_dir $load_dir --save_dir $save_dir --ema
293
+ ```
294
+
295
+ **Parameter Description:**
296
+
297
+ `--load_dir`: Path to the weights saved during training, partitioned by Megatron.
298
+
299
+ `--save_dir`: Path to save the merged weights.
300
+
301
+ `--ema`: Whether to use EMA (Exponential Moving Average) weights.
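+
+ A concrete invocation might look like this (both paths are placeholders; afterwards, point the predictor's `from_pretrained` in `inference_t2v_model1_5.json` at the merged output):
+
+ ```bash
+ load_dir=/path/to/train/save_dir          # Megatron-partitioned checkpoints from training
+ save_dir=/path/to/merged_ckpt             # destination for the merged checkpoint
+
+ python examples/opensoraplan1.5/convert_mm_to_ckpt.py \
+     --load_dir $load_dir \
+     --save_dir $save_dir \
+     --ema
+ ```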
302
+
303
+ #### Inference
304
+
305
+ Make sure the `inference_t2v_model1_5.json` file is properly configured.
306
+
307
+ ```
308
+ {
309
+ "ae": {
310
+ "model_id": "wfvae",
311
+ "base_channels": 160,
312
+ "connect_res_layer_num": 1,
313
+ "decoder_energy_flow_hidden_size": 128,
314
+ "decoder_num_resblocks": 2,
315
+ "dropout": 0.0,
316
+ "encoder_energy_flow_hidden_size": 128,
317
+ "encoder_num_resblocks": 2,
318
+ "l1_dowmsample_block": "Spatial2xTime2x3DDownsample",
319
+ "l1_downsample_wavelet": "HaarWaveletTransform3D",
320
+ "l1_upsample_block": "Spatial2xTime2x3DUpsample",
321
+ "l1_upsample_wavelet": "InverseHaarWaveletTransform3D",
322
+ "l2_dowmsample_block": "Spatial2xTime2x3DDownsample",
323
+ "l2_downsample_wavelet": "HaarWaveletTransform3D",
324
+ "l2_upsample_block": "Spatial2xTime2x3DUpsample",
325
+ "l2_upsample_wavelet": "InverseHaarWaveletTransform3D",
326
+ "latent_dim": 32,
327
+ "vae_scale_factor": [8, 8, 8],
328
+ "norm_type": "layernorm",
329
+ "scale": [0.7031, 0.7109, 1.5391, 1.2969, 0.7109, 1.4141, 1.3828, 2.1719, 1.7266,
330
+ 1.8281, 1.9141, 1.2031, 0.6875, 0.9609, 1.6484, 1.1875, 1.5312, 1.1328,
331
+ 0.8828, 0.6836, 0.8828, 0.9219, 1.6953, 1.4453, 1.5312, 0.6836, 0.7656,
332
+ 0.8242, 1.2344, 1.0312, 1.7266, 0.9492],
333
+ "shift": [-0.2129, 0.1226, 1.6328, 0.6211, -0.8750, 0.6172, -0.5703, 0.1348,
334
+ -0.2178, -0.9375, 0.3184, 0.3281, -0.0544, -0.1826, -0.2812, 0.4355,
335
+ 0.1621, -0.2578, 0.7148, -0.7422, -0.2295, -0.2324, -1.4922, 0.6328,
336
+ 1.1250, -0.2578, -2.1094, 1.0391, 1.1797, -1.2422, -0.2988, -0.9570],
337
+ "t_interpolation": "trilinear",
338
+ "use_attention": true,
339
+ "use_tiling": true, # Whether to enable the tiling strategy; it is enabled by default during inference to reduce memory usage.
340
+ "from_pretrained": "/work/share/checkpoint/pretrained/vae/Middle888/merged.ckpt",
341
+ "dtype": "fp16"
342
+ },
343
+ "text_encoder": {
344
+ "hub_backend": "hf",
345
+ "model_id": "T5",
346
+ "from_pretrained": "/work/share/checkpoint/pretrained/t5/t5-v1_1-xl",
347
+ "low_cpu_mem_usage": false
348
+ },
349
+ "text_encoder_2":{
350
+ "hub_backend": "hf",
351
+ "model_id": "CLIPWithProjection",
352
+ "from_pretrained": "/work/share/checkpoint/pretrained/clip/models--laion--CLIP-ViT-bigG-14-laion2B-39B-b160k/snapshots/bc7788f151930d91b58474715fdce5524ad9a189",
353
+ "low_cpu_mem_usage": false
354
+ },
355
+ "tokenizer":{
356
+ "hub_backend": "hf",
357
+ "autotokenizer_name": "AutoTokenizer",
358
+ "from_pretrained": "/work/share/checkpoint/pretrained/t5/t5-v1_1-xl",
359
+ "low_cpu_mem_usage": false
360
+ },
361
+ "tokenizer_2":{
362
+ "hub_backend": "hf",
363
+ "autotokenizer_name": "AutoTokenizer",
364
+ "from_pretrained": "/work/share/checkpoint/pretrained/clip/models--laion--CLIP-ViT-bigG-14-laion2B-39B-b160k/snapshots/bc7788f151930d91b58474715fdce5524ad9a189",
365
+ "low_cpu_mem_usage": false
366
+ },
367
+ "predictor": {
368
+ "model_id": "SparseUMMDiT",
369
+ "num_layers": [2, 4, 6, 8, 6, 4, 2],
370
+ "sparse_n": [1, 2, 4, 8, 4, 2, 1],
371
+ "double_ff": true,
372
+ "sparse1d": true,
373
+ "num_heads": 24,
374
+ "head_dim": 128,
375
+ "in_channels": 32,
376
+ "out_channels": 32,
377
+ "timestep_embed_dim": 1024,
378
+ "caption_channels": 2048,
379
+ "pooled_projection_dim": 1280,
380
+ "skip_connection": true,
381
+ "skip_connection_zero_init": true,
382
+ "dropout": 0.0,
383
+ "attention_bias": true,
384
+ "patch_size": 2,
385
+ "patch_size_t": 1,
386
+ "activation_fn": "gelu-approximate",
387
+ "norm_elementwise_affine": true,
388
+ "norm_eps": 1e-06,
389
+ "from_pretrained": "/path/to/pretrained/model"
390
+ },
391
+ "diffusion": {
392
+ "model_id": "OpenSoraPlan",
393
+ "num_inference_steps": 50, # Inference steps
394
+ "guidance_scale": 8.0, # CFG strength. We recommend using a relatively high value; 8.0 is generally a good choice.
395
+ "guidance_rescale": 0.7, # Guidance rescale strength. If the sampled outputs appear overly saturated, we recommend increasing guidance_rescale instead of adjusting the CFG value.
396
+ "use_linear_quadratic_schedule": false, # Using a linear-to-quadratic sampling strategy.
397
+ "use_dynamic_shifting": false,
398
+ "shift": 7.0 # Using the shifting sampling strategy.
399
+ },
400
+ "pipeline_config": {
401
+ "use_attention_mask": true,
402
+ "input_size": [121, 576, 1024],
403
+ "version": "v1.5",
404
+ "model_type": "t2v"
405
+ },
406
+ "micro_batch_size": 1,
407
+ "frame_interval":1,
408
+ "model_max_length": 512,
409
+ "save_path":"./opensoraplan_samples/test_samples",
410
+ "fps":24,
411
+ "prompt":"./examples/opensoraplan1.5/sora.txt",
412
+ "device":"npu",
413
+ "weight_dtype": "fp16"
414
+ }
415
+
416
+ ```
417
+
418
+ Enter the Open-Sora Plan directory and run:
419
+
420
+ ```
421
+ bash examples/opensoraplan1.5/inference_t2v_1_5.sh
422
+ ```
423
+
424
+ In practice, inference at 121×576×1024 resolution can be run with TP=1 (i.e., without parallelism). To accelerate inference, you may manually increase the TP parallelism level.
README_cn.md ADDED
@@ -0,0 +1,424 @@
1
+ Open-Sora Plan v1.5.0采用mindspeed-mm套件训练。
2
+
3
+ ### 前置要求
4
+
5
+ Open-Sora Plan v1.5.0在CANN 8.0.1版本完成训练,请参照[CANN 系列 昇腾计算 8.0.1 软件补丁下载](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software/264595017?idAbsPath=fixnode01|23710424|251366513|22892968|252309113|251168373)安装。
6
+
7
+ ### 环境安装
8
+
9
+ 1、安装torch、Mindspeed
10
+
11
+ ```python
12
+ # python3.8
13
+ conda create -n osp python=3.8
14
+ conda activate osp
15
+
16
+ # 安装 torch 和 torch_npu,注意要选择对应python版本、x86或arm的torch、torch_npu及apex包
17
+ pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
18
+ pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
19
+
20
+ # apex for Ascend 参考 https://gitee.com/ascend/apex
21
+ # 建议从原仓编译安装
22
+
23
+ # 将shell脚本中的环境变量路径修改为真实路径,下面为参考路径
24
+ source /usr/local/Ascend/ascend-toolkit/set_env.sh
25
+
26
+ # 安装加速库
27
+ git clone https://gitee.com/ascend/MindSpeed.git
28
+ cd MindSpeed
29
+ git checkout 59b4e983b7dc1f537f8c6b97a57e54f0316fafb0
30
+ pip install -r requirements.txt
31
+ pip3 install -e .
32
+ cd ..
33
+
34
+ # 安装其余依赖库
35
+ pip install -e .
36
+ ```
37
+
38
+ 2、安装decord
39
+
40
+ ```bash
41
+ git clone --recursive https://github.com/dmlc/decord
42
+ mkdir build && cd build
43
+ cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release -DFFMPEG_DIR=/usr/local/ffmpeg
44
+ make
45
+ cd ../python
46
+ pwd=$PWD
47
+ echo "PYTHONPATH=$PYTHONPATH:$pwd" >> ~/.bashrc
48
+ source ~/.bashrc
49
+ python3 setup.py install --user
50
+ ```
51
+
52
+ ### 权重下载
53
+
54
+ 魔乐社区:
55
+
56
+ https://modelers.cn/models/PKU-YUAN-Group/Open-Sora-Plan-v1.5.0
57
+
58
+ huggingface:
59
+
60
+ https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.5.0
61
+
62
+ T5:
63
+
64
+ [google/t5-v1_1-xl · Hugging Face](https://huggingface.co/google/t5-v1_1-xl)
65
+
66
+ CLIP:
67
+
68
+ [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k · Hugging Face](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
69
+
70
+ ### Train Text-to-Video
71
+
72
+ 需要设置好data.json和model_opensoraplan1_5.json。
73
+
74
+ #### data.json:
75
+
76
+ ```
77
+ {
78
+ "dataset_param": {
79
+ "dataset_type": "t2v",
80
+ "basic_parameters": {
81
+ "data_path": "./examples/opensoraplan1.5/data.txt", # 数据路径
82
+ "data_folder": "",
83
+ "data_storage_mode": "combine"
84
+ },
85
+ "preprocess_parameters": {
86
+ "video_reader_type": "decoder",
87
+ "image_reader_type": "Image",
88
+ "num_frames": 121,
89
+ "frame_interval": 1,
90
+ "max_height": 576, # 开启固定分辨率时的样本高度,在开启多分辨率时无效
91
+ "max_width": 1024, # 开启固定分辨率时的样本宽度,在开启多分辨率时无效
92
+ "max_hxw": 589824, # 开启多分辨率时的最大token数
93
+ "min_hxw": 589824, # 开启多分辨率时的最小token数。此外,min_hxw需要在开启force_resolution时设置为max_height * max_width以过滤低分辨率样本,或自定义更严格的筛选标准
94
+ "force_resolution": true, # 开启固定分辨率训练
95
+ "force_5_ratio": false, # 开启5宽高比多分辨率策略训练
96
+ "max_h_div_w_ratio": 1.0, # 筛选最大高宽比
97
+ "min_h_div_w_ratio": 0.42, # 筛选最小高宽比
98
+ "hw_stride": 16,
99
+ "ae_stride_t": 8,
100
+ "train_fps": 24, # 训练时采样fps,会将不同fps的视频都重采样到train_fps
101
+ "speed_factor": 1.0,
102
+ "drop_short_ratio": 1.0,
103
+ "min_num_frames": 29,
104
+ "cfg": 0.1,
105
+ "batch_size": 1,
106
+ "gradient_accumulation_size": 4,
107
+ "use_aesthetic": false,
108
+ "train_pipeline": {
109
+ "video": [{
110
+ "trans_type": "ToTensorVideo"
111
+ },
112
+ {
113
+ "trans_type": "CenterCropResizeVideo",
114
+ "param": {
115
+ "size": [576, 1024],
116
+ "interpolation_mode": "bicubic"
117
+ }
118
+ },
119
+ {
120
+ "trans_type": "ae_norm"
121
+ }
122
+ ],
123
+ "image": [{
124
+ "trans_type": "ToTensorVideo"
125
+ },
126
+ {
127
+ "trans_type": "CenterCropResizeVideo",
128
+ "param": {
129
+ "size": [576, 1024],
130
+ "interpolation_mode": "bicubic"
131
+ }
132
+ },
133
+ {
134
+ "trans_type": "ae_norm"
135
+ }
136
+ ]
137
+ }
138
+ },
139
+ "use_text_processer": true,
140
+ "enable_text_preprocess": true,
141
+ "model_max_length": 512,
142
+ "tokenizer_config": {
143
+ "hub_backend": "hf",
144
+ "autotokenizer_name": "AutoTokenizer",
145
+ "from_pretrained": "/work/share/checkpoint/pretrained/t5/t5-v1_1-xl"
146
+ },
147
+ "tokenizer_config_2": {
148
+ "hub_backend": "hf",
149
+ "autotokenizer_name": "AutoTokenizer",
150
+ "from_pretrained": "/work/share/checkpoint/pretrained/clip/models--laion--CLIP-ViT-bigG-14-laion2B-39B-b160k/snapshots/bc7788f151930d91b58474715fdce5524ad9a189"
151
+ },
152
+ "use_feature_data": false,
153
+ "use_img_from_vid": false
154
+ },
155
+ "dataloader_param": {
156
+ "dataloader_mode": "sampler",
157
+ "sampler_type": "LengthGroupedSampler", # 开启Group Data策略,默认指定
158
+ "batch_size": 1,
159
+ "num_workers": 4,
160
+ "shuffle": false,
161
+ "drop_last": true,
162
+ "pin_memory": false,
163
+ "group_data": true,
164
+ "initial_global_step_for_sampler": 0,
165
+ "gradient_accumulation_size": 4,
166
+ "collate_param": {
167
+ "model_name": "GroupLength", # 开启Group Data对应的Collate,默认指定
168
+ "batch_size": 1,
169
+ "num_frames": 121,
170
+ "group_data": true,
171
+ "ae_stride": 8,
172
+ "ae_stride_t": 8,
173
+ "patch_size": 2,
174
+ "patch_size_t": 1
175
+ }
176
+ }
177
+ }
178
+
179
+ ```
180
+
181
+ #### model_opensoraplan1_5.json
182
+
183
+ ```
184
+ {
185
+ "frames": 121,
186
+ "allow_tf32": false,
187
+ "allow_internal_format": false,
188
+ "load_video_features": false,
189
+ "load_text_features": false,
190
+ "enable_encoder_dp": true, # mindspeed架构优化,在TP并行度大于1时起作用
191
+ "weight_dtype": "bf16",
192
+ "ae": {
193
+ "model_id": "wfvae",
194
+ "base_channels": 160,
195
+ "connect_res_layer_num": 1,
196
+ "decoder_energy_flow_hidden_size": 128,
197
+ "decoder_num_resblocks": 2,
198
+ "dropout": 0.0,
199
+ "encoder_energy_flow_hidden_size": 128,
200
+ "encoder_num_resblocks": 2,
201
+ "l1_dowmsample_block": "Spatial2xTime2x3DDownsample",
202
+ "l1_downsample_wavelet": "HaarWaveletTransform3D",
203
+ "l1_upsample_block": "Spatial2xTime2x3DUpsample",
204
+ "l1_upsample_wavelet": "InverseHaarWaveletTransform3D",
205
+ "l2_dowmsample_block": "Spatial2xTime2x3DDownsample",
206
+ "l2_downsample_wavelet": "HaarWaveletTransform3D",
207
+ "l2_upsample_block": "Spatial2xTime2x3DUpsample",
208
+ "l2_upsample_wavelet": "InverseHaarWaveletTransform3D",
209
+ "latent_dim": 32,
210
+ "norm_type": "layernorm",
211
+ "scale": [0.7031, 0.7109, 1.5391, 1.2969, 0.7109, 1.4141, 1.3828, 2.1719, 1.7266,
212
+ 1.8281, 1.9141, 1.2031, 0.6875, 0.9609, 1.6484, 1.1875, 1.5312, 1.1328,
213
+ 0.8828, 0.6836, 0.8828, 0.9219, 1.6953, 1.4453, 1.5312, 0.6836, 0.7656,
214
+ 0.8242, 1.2344, 1.0312, 1.7266, 0.9492],
215
+ "shift": [-0.2129, 0.1226, 1.6328, 0.6211, -0.8750, 0.6172, -0.5703, 0.1348,
216
+ -0.2178, -0.9375, 0.3184, 0.3281, -0.0544, -0.1826, -0.2812, 0.4355,
217
+ 0.1621, -0.2578, 0.7148, -0.7422, -0.2295, -0.2324, -1.4922, 0.6328,
218
+ 1.1250, -0.2578, -2.1094, 1.0391, 1.1797, -1.2422, -0.2988, -0.9570],
219
+ "t_interpolation": "trilinear",
220
+ "use_attention": true,
221
+ "use_tiling": true, # 是否开启tiling策略
222
+ "from_pretrained": "/work/share/checkpoint/pretrained/vae/Middle888/merged.ckpt",
223
+ "dtype": "fp32"
224
+ },
225
+ "text_encoder": {
226
+ "hub_backend": "hf",
227
+ "model_id": "T5",
228
+ "from_pretrained": "/work/share/checkpoint/pretrained/t5/t5-v1_1-xl",
229
+ "low_cpu_mem_usage": false
230
+ },
231
+ "text_encoder_2":{
232
+ "hub_backend": "hf",
233
+ "model_id": "CLIPWithProjection",
234
+ "from_pretrained": "/work/share/checkpoint/pretrained/clip/models--laion--CLIP-ViT-bigG-14-laion2B-39B-b160k/snapshots/bc7788f151930d91b58474715fdce5524ad9a189",
235
+ "low_cpu_mem_usage": false
236
+ },
237
+ "predictor": {
238
+ "model_id": "SparseUMMDiT",
239
+ "num_layers": [2, 4, 6, 8, 6, 4, 2], # 每个stage的层数
240
+ "sparse_n": [1, 2, 4, 8, 4, 2, 1], # 每个stage的稀疏度
241
+ "double_ff": true, # 采用visual和text共享FFN还是各自独立FFN
242
+ "sparse1d": true, # 是否采用Skiparse策略,设置为false则为dense dit
243
+ "num_heads": 24,
244
+ "head_dim": 128,
245
+ "in_channels": 32,
246
+ "out_channels": 32,
247
+ "timestep_embed_dim": 1024,
248
+ "caption_channels": 2048,
249
+ "pooled_projection_dim": 1280,
250
+ "skip_connection": true, # 是否添加skip connection
251
+ "dropout": 0.0,
252
+ "attention_bias": true,
253
+ "patch_size": 2,
254
+ "patch_size_t": 1,
255
+ "activation_fn": "gelu-approximate",
256
+ "norm_elementwise_affine": false,
257
+ "norm_eps": 1e-06,
258
+ "from_pretrained": null # 预训练权重路径,需采用合并后的权重
259
+ },
260
+ "diffusion": {
261
+ "model_id": "OpenSoraPlan",
262
+ "weighting_scheme": "logit_normal",
263
+ "use_dynamic_shifting": true
264
+ }
265
+ }
266
+
267
+ ```
268
+
269
+ 进入Open-Sora Plan目录下,运行
270
+
271
+ ```
272
+ bash examples/opensoraplan1.5/pretrain_opensoraplan1_5.sh
273
+ ```
274
+
275
+ 参数解析:
276
+
277
+ `--optimizer-selection fused_ema_adamw` 选择使用的优化器,我们这里需要选择fused_ema_adamw以获得EMA版本权重。
278
+
279
+ `--model_custom_precision` 不同组件使用不同的精度,而不是采用megatron默认的整网bf16精度。例如对VAE使用fp32精度,对text encoder、dit使用bf16精度。
280
+
281
+ `--clip_grad_ema_decay 0.99` 设置adaptive grad clipping中使用的EMA衰减率。
282
+
283
+ `--selective_recom` `--recom_ffn_layers 32` 是否开启选择性重计算及选择性重计算的层数。在开启选择性重计算时,我们只对FFN进行重计算而不对Attention进行重计算,以获得加速训练效果。该参数与`--recompute-granularity full` `--recompute-method block` `--recompute-num-layers 0` 互斥,当开启选择性重计算时,默认全重计算已关闭。
284
+
285
+ ### Sample Text-to-Video
286
+
287
+ 由于模型训练时进行了TP切分,所以我们需要先将切分后的权重进行合并,然后再进行推理。
288
+
289
+ #### 合并权重
290
+
291
+ ```
292
+ python examples/opensoraplan1.5/convert_mm_to_ckpt.py --load_dir $load_dir --save_dir $save_dir --ema
293
+ ```
294
+
295
+ 参数解析:
296
+
297
+ `--load_dir` 训练时经过megatron切分后保存的权重路径
298
+
299
+ `--save_dir` 合并后的权重路径
300
+
301
+ `--ema` 是否采用EMA权重
302
+
303
+ #### 推理
304
+
305
+ 需要配置好inference_t2v_model1_5.json。
306
+
307
+ ```
308
+ {
309
+ "ae": {
310
+ "model_id": "wfvae",
311
+ "base_channels": 160,
312
+ "connect_res_layer_num": 1,
313
+ "decoder_energy_flow_hidden_size": 128,
314
+ "decoder_num_resblocks": 2,
315
+ "dropout": 0.0,
316
+ "encoder_energy_flow_hidden_size": 128,
317
+ "encoder_num_resblocks": 2,
318
+ "l1_dowmsample_block": "Spatial2xTime2x3DDownsample",
319
+ "l1_downsample_wavelet": "HaarWaveletTransform3D",
320
+ "l1_upsample_block": "Spatial2xTime2x3DUpsample",
321
+ "l1_upsample_wavelet": "InverseHaarWaveletTransform3D",
322
+ "l2_dowmsample_block": "Spatial2xTime2x3DDownsample",
323
+ "l2_downsample_wavelet": "HaarWaveletTransform3D",
324
+ "l2_upsample_block": "Spatial2xTime2x3DUpsample",
325
+ "l2_upsample_wavelet": "InverseHaarWaveletTransform3D",
326
+ "latent_dim": 32,
327
+ "vae_scale_factor": [8, 8, 8],
328
+ "norm_type": "layernorm",
329
+ "scale": [0.7031, 0.7109, 1.5391, 1.2969, 0.7109, 1.4141, 1.3828, 2.1719, 1.7266,
330
+ 1.8281, 1.9141, 1.2031, 0.6875, 0.9609, 1.6484, 1.1875, 1.5312, 1.1328,
331
+ 0.8828, 0.6836, 0.8828, 0.9219, 1.6953, 1.4453, 1.5312, 0.6836, 0.7656,
332
+ 0.8242, 1.2344, 1.0312, 1.7266, 0.9492],
333
+ "shift": [-0.2129, 0.1226, 1.6328, 0.6211, -0.8750, 0.6172, -0.5703, 0.1348,
334
+ -0.2178, -0.9375, 0.3184, 0.3281, -0.0544, -0.1826, -0.2812, 0.4355,
335
+ 0.1621, -0.2578, 0.7148, -0.7422, -0.2295, -0.2324, -1.4922, 0.6328,
336
+ 1.1250, -0.2578, -2.1094, 1.0391, 1.1797, -1.2422, -0.2988, -0.9570],
337
+ "t_interpolation": "trilinear",
338
+ "use_attention": true,
339
+ "use_tiling": true, # 是否开启tiling策略,推理时默认开启节省显存
340
+ "from_pretrained": "/work/share/checkpoint/pretrained/vae/Middle888/merged.ckpt",
341
+ "dtype": "fp16"
342
+ },
343
+ "text_encoder": {
344
+ "hub_backend": "hf",
345
+ "model_id": "T5",
346
+ "from_pretrained": "/work/share/checkpoint/pretrained/t5/t5-v1_1-xl",
347
+ "low_cpu_mem_usage": false
348
+ },
349
+ "text_encoder_2":{
350
+ "hub_backend": "hf",
351
+ "model_id": "CLIPWithProjection",
352
+ "from_pretrained": "/work/share/checkpoint/pretrained/clip/models--laion--CLIP-ViT-bigG-14-laion2B-39B-b160k/snapshots/bc7788f151930d91b58474715fdce5524ad9a189",
353
+ "low_cpu_mem_usage": false
354
+ },
355
+ "tokenizer":{
356
+ "hub_backend": "hf",
357
+ "autotokenizer_name": "AutoTokenizer",
358
+ "from_pretrained": "/work/share/checkpoint/pretrained/t5/t5-v1_1-xl",
359
+ "low_cpu_mem_usage": false
360
+ },
361
+ "tokenizer_2":{
362
+ "hub_backend": "hf",
363
+ "autotokenizer_name": "AutoTokenizer",
364
+ "from_pretrained": "/work/share/checkpoint/pretrained/clip/models--laion--CLIP-ViT-bigG-14-laion2B-39B-b160k/snapshots/bc7788f151930d91b58474715fdce5524ad9a189",
365
+ "low_cpu_mem_usage": false
366
+ },
367
+ "predictor": {
368
+ "model_id": "SparseUMMDiT",
369
+ "num_layers": [2, 4, 6, 8, 6, 4, 2],
370
+ "sparse_n": [1, 2, 4, 8, 4, 2, 1],
371
+ "double_ff": true,
372
+ "sparse1d": true,
373
+ "num_heads": 24,
374
+ "head_dim": 128,
375
+ "in_channels": 32,
376
+ "out_channels": 32,
377
+ "timestep_embed_dim": 1024,
378
+ "caption_channels": 2048,
379
+ "pooled_projection_dim": 1280,
380
+ "skip_connection": true,
381
+ "skip_connection_zero_init": true,
382
+ "dropout": 0.0,
383
+ "attention_bias": true,
384
+ "patch_size": 2,
385
+ "patch_size_t": 1,
386
+ "activation_fn": "gelu-approximate",
387
+ "norm_elementwise_affine": true,
388
+ "norm_eps": 1e-06,
389
+ "from_pretrained": "/path/to/pretrained/model"
390
+ },
391
+ "diffusion": {
392
+ "model_id": "OpenSoraPlan",
393
+ "num_inference_steps": 50, # 推理步数
394
+ "guidance_scale": 8.0, # CFG强度,我们推荐较大的CFG,8.0是较好的值
395
+ "guidance_rescale": 0.7, # guidance rescale强度,如认为采样饱和度过高,我们推荐将gudance_rescale增大,而非调整CFG
396
+ "use_linear_quadratic_schedule": false, # 采用线性——平方采样策略
397
+ "use_dynamic_shifting": false,
398
+ "shift": 7.0 # 采用shifting采样策略
399
+ },
400
+ "pipeline_config": {
401
+ "use_attention_mask": true,
402
+ "input_size": [121, 576, 1024],
403
+ "version": "v1.5",
404
+ "model_type": "t2v"
405
+ },
406
+ "micro_batch_size": 1,
407
+ "frame_interval":1,
408
+ "model_max_length": 512,
409
+ "save_path":"./opensoraplan_samples/test_samples",
410
+ "fps":24,
411
+ "prompt":"./examples/opensoraplan1.5/sora.txt",
412
+ "device":"npu",
413
+ "weight_dtype": "fp16"
414
+ }
415
+
416
+ ```
417
+
418
+ 进入Open-Sora Plan目录下,运行
419
+
420
+ ```
421
+ bash examples/opensoraplan1.5/inference_t2v_1_5.sh
422
+ ```
423
+
424
+ 实测TP=1即不开启并行策略能够运行121x576x1024推理,如需加快推理速度请自行调节TP并行度。
Report-V1.5.0.md ADDED
@@ -0,0 +1,188 @@
1
+ ## Report v1.5.0
2
+
3
+ In October 2024, we released Open-Sora Plan v1.3.0, introducing the sparse attention structure, Skiparse Attention, to the field of video generation for the first time. Additionally, we adopted the efficient WFVAE, significantly reducing encoding time and memory usage during training.
4
+
5
+ In Open-Sora Plan v1.5.0, we introduce several key updates to enhance the framework:
6
+
7
+ 1. Improved sparse DiT (SUV). Building on Skiparse Attention, we extend the sparse DiT into a U-shaped sparse structure. This design preserves the speed advantage while enabling the sparse DiT to achieve performance comparable to a dense DiT.
8
+
9
+ 2. Higher-compression WFVAE. In Open-Sora Plan v1.5.0, we explore a WFVAE with an 8×8×8 downsampling rate. It outperforms the widely adopted 4×8×8 VAEs in the community while halving the latent shape and shortening the attention sequence length.
10
+
11
+ 3. Data and model scaling. In Open-Sora Plan v1.5.0, we collect 1.1 billion high-quality images and 40 million high-quality videos. The model is scaled up to 8.5 billion parameters, resulting in strong overall performance.
12
+
13
+ 4. Simplified adaptive gradient clipping. Compared with the more complex batch-dropping method in version 1.3.0, version 1.5.0 maintains a simple adaptive gradient-norm threshold for clipping, making it more compatible with various parallel training strategies.
14
+
15
+ Open-Sora Plan v1.5.0 is trained and runs inference entirely on Ascend 910-series accelerators, using the MindSpeed-MM framework to support parallel training strategies.
16
+
17
+ ### Open-Source Release
18
+
19
+ Open-Sora Plan v1.5.0 is open-sourced with the following components:
20
+
21
+ 1. All training and inference code. You can also find the implementation of Open-Sora Plan v1.5.0 in the official [MindSpeed-MM](https://gitee.com/ascend/MindSpeed-MM) repository.
22
+
23
+ 2. The WFVAE weights with 8×8×8 compression, along with the 8.5B SUV denoiser weights.
24
+
25
+ ## Detailed Technical Report
26
+
27
+ ### Data collection and processing
28
+
29
+ Our dataset includes 1.1B images from [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B), [COYO-700M](https://github.com/kakaobrain/coyo-dataset), and [LAION-Aesthetics](https://laion.ai/blog/laion-aesthetics/), with no filtering applied aside from resolution checks. The video data are drawn from [Panda-70M](https://github.com/snap-research/Panda-70M) and internal sources, and filtered using the same protocol as in Open-Sora Plan v1.3.0, yielding 40M high-quality videos.
30
+
31
+ ### Adaptive Grad Clipping
32
+
33
+ In Open-Sora Plan v1.3.0, we introduce an Adaptive Grad Clipping strategy based on discarding gradient-abnormal batches. While highly stable, this method involves overly complex execution logic. In Open-Sora Plan v1.5.0, we optimize the strategy by maintaining the gradient-norm threshold via an exponential moving average (EMA); gradients exceeding the threshold are clipped accordingly. This approach effectively extends the fixed threshold of 1.0, commonly used in large-scale models, into a dynamic, training-dependent threshold.
34
+
35
+ ```python
36
+ '''
37
+ moving_avg_max_grad_norm: the maximum gradient norm maintained via EMA
38
+ moving_avg_max_grad_norm_var: the variance of the maximum gradient norm maintained via EMA
39
+ clip_threshold: the gradient clipping threshold computed using the 3-sigma rule
40
+ ema_decay: the EMA decay coefficient, typically set to 0.99.
41
+ grad_norm: grad norm at the current step
42
+ '''
43
+ clip_threshold = moving_avg_max_grad_norm + 3.0 * (moving_avg_max_grad_norm_var ** 0.5)
44
+ if grad_norm <= clip_threshold:
45
+ # If the gradient norm is below the clipping threshold, the parameters are updated normally at this step, and both the moving_avg_max_grad_norm and moving_avg_max_grad_norm_var are updated accordingly.
46
+ moving_avg_max_grad_norm = ema_decay * moving_avg_max_grad_norm + (1 - ema_decay) * grad_norm
47
+ max_grad_norm_var = (moving_avg_max_grad_norm - grad_norm) ** 2
48
+ moving_avg_max_grad_norm_var = ema_decay * moving_avg_max_grad_norm_var + (1 - ema_decay) * max_grad_norm_var
49
+ # update weights...
50
+ else:
51
+ # If the gradient norm exceeds the clipping threshold, the gradients are first clipped to reduce the norm to the threshold value before updating the parameters.
52
+ clip_coef = grad_norm / clip_threshold
53
+ grads = clip(grads, clip_coef) # clipping grads
54
+ # update weights...
55
+ ```
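+
+ The snippet above is pseudocode from the training framework. The following self-contained PyTorch sketch implements the same EMA-threshold logic, interpreting `clip(grads, clip_coef)` as rescaling all gradients so that their global norm equals the threshold; the state initialization on the first step is our own assumption:
+
+ ```python
+ import torch
+
+ class AdaptiveGradClipper:
+     """EMA-based adaptive gradient-norm clipping (illustrative sketch, not the framework code)."""
+
+     def __init__(self, ema_decay: float = 0.99):
+         self.ema_decay = ema_decay
+         self.moving_avg_max_grad_norm = None     # EMA of the gradient norm
+         self.moving_avg_max_grad_norm_var = 0.0  # EMA of its variance
+
+     def __call__(self, parameters) -> float:
+         grads = [p.grad for p in parameters if p.grad is not None]
+         grad_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2).item()
+
+         if self.moving_avg_max_grad_norm is None:
+             self.moving_avg_max_grad_norm = grad_norm   # bootstrap the EMA (assumption)
+             return grad_norm
+
+         clip_threshold = self.moving_avg_max_grad_norm + 3.0 * self.moving_avg_max_grad_norm_var ** 0.5
+         if grad_norm <= clip_threshold:
+             # Normal step: update the EMA statistics and leave the gradients untouched.
+             d = self.ema_decay
+             self.moving_avg_max_grad_norm = d * self.moving_avg_max_grad_norm + (1 - d) * grad_norm
+             var = (self.moving_avg_max_grad_norm - grad_norm) ** 2
+             self.moving_avg_max_grad_norm_var = d * self.moving_avg_max_grad_norm_var + (1 - d) * var
+         else:
+             # Spike: rescale gradients so their global norm is reduced to the threshold.
+             scale = clip_threshold / grad_norm
+             for g in grads:
+                 g.mul_(scale)
+         return grad_norm
+ ```
+
+ In a training loop, such a clipper would be called between `loss.backward()` and `optimizer.step()`.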
56
+
57
+ Compared to the strategy in v1.3.0, this approach is simpler to implement and effectively addresses the issue of loss spikes that occur in the later stages of diffusion training when the gradient norm is significantly below 1.0.
58
+
59
+ ### WFVAE with 8x8x8 compression
60
+
61
+ In version 1.5.0, we increase the temporal compression rate of the VAE from 4× to 8×, reducing the latent shape to half that of the previous version. This enables the generation of videos with higher frame counts.
62
+
63
+ | Model | THW(C) | PSNR | LPIPS | rFVD |
64
+ | ----------------- | ------------- | ------------ | ------------- | ------------ |
65
+ | CogVideoX | 4x8x8 (16) | <u>36.38</u> | 0.0243 | <u>50.33</u> |
66
+ | StepVideo | 8x16x16 (16) | 33.61 | 0.0337 | 113.68 |
67
+ | LTXVideo | 8x32x32 (128) | 33.84 | 0.0380 | 150.87 |
68
+ | Wan2.1 | 4x8x8 (16) | 35.77 | **0.0197** | **46.05** |
69
+ | Ours (WF-VAE-M) | 8x8x8 (32) | **36.91** | <u>0.0205</u> | 52.53 |
70
+
71
+ **Test on an open-domain dataset with 1K samples.**
72
+
73
+ For more details on WFVAE, please refer to [WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model](https://arxiv.org/abs/2411.17459)
74
+
75
+ ### Training Text-to-Video Diffusion Model
76
+
77
+ #### Framework —— SUV: A Sparse U-shaped Diffusion Transformer For Fast Video Generation
78
+
79
+ In Open-Sora Plan v1.3.0, we discuss the strengths and weaknesses of Full 3D Attention and 2+1D Attention. Based on their characteristics, we propose Skiparse Attention, a novel global sparse attention mechanism. For details, please refer to [Report-v1.3.0](https://github.com/yunyangge/Open-Sora-Plan/blob/main/docs/Report-v1.3.0.md).
80
+
81
+ Under a predefined sparsity $k$, Skiparse Attention selects, in an alternating Single-Skip and Group-Skip pattern, a subsequence whose length is $\frac{1}{k}$ of the original sequence for attention interaction. This design approximates the effect of Full 3D Attention. As the sparsity increases, the selected positions become more widely spaced; as it decreases, they become more concentrated. Regardless of the sparsity, Skiparse Attention remains global.
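+
+ To make the subsequence selection concrete, the sketch below illustrates only the Single-Skip half of the pattern as an interleaved stride-$k$ partition; the alternating Group-Skip step and the exact attention wiring follow Report-v1.3.0 and are omitted here:
+
+ ```python
+ import torch
+
+ def single_skip_partition(x: torch.Tensor, k: int) -> torch.Tensor:
+     """Split a (B, L, C) sequence into k interleaved subsequences of length L/k.
+     Each attention call then sees only 1/k of the tokens, yet every subsequence
+     still spans the whole sequence, which is why the operation stays global."""
+     B, L, C = x.shape
+     assert L % k == 0
+     # (B, L, C) -> (B, L//k, k, C) -> (B*k, L//k, C); subsequence j holds tokens j, j+k, j+2k, ...
+     return x.view(B, L // k, k, C).permute(0, 2, 1, 3).reshape(B * k, L // k, C)
+
+ x = torch.randn(2, 16, 8)
+ print(single_skip_partition(x, k=4).shape)   # torch.Size([8, 4, 8])
+ ```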
82
+
83
+ In Open-Sora Plan v1.5.0, we interpret this sparse interaction pattern as a form of token-level information downsampling. Sparser Skiparse Attention performs more semantic-level interactions, while denser Skiparse Attention captures fine-grained information. Following the multi-scale design principle in neural networks, we introduce Skiparse Attention with U-shaped sparsity variation: low-sparsity Skiparse Attention is used in shallow layers, with Full 3D Attention applied at the shallowest layer, and high-sparsity Skiparse Attention in deeper layers. Inspired by the UNet architecture, we further incorporate long skip connections between stages with identical sparsity. This U-shaped DiT architecture based on Skiparse Attention is referred to as **SUV**.
84
+
85
+ ![SUV](https://img.picui.cn/free/2025/06/05/684108197cbb8.png)
86
+
87
+ In Open-Sora Plan v1.5.0, we adopt an SUV architecture based on MMDiT. Skiparse Attention is applied to the video latents, while the text embeddings are only repeated to align with the skiparse-processed latent shape, without any sparsification.
88
+
89
+ The SUV architecture offers the following advantages:
90
+
91
+ 1. SUV is the first sparsification method proven effective for video generation. Our ablation studies show that it achieves performance comparable to a dense DiT with a comparable number of training steps. Moreover, it can be applied during both pretraining and inference.
92
+
93
+ 2. Unlike UNet structures that explicitly downsample feature maps and cause information loss, the U-shaped structure of SUV operates on attention. The shape of the feature map remains unchanged, preserving information while altering only the granularity of token-level interactions.
94
+
95
+ 3. Skiparse Attention and SUV change only the attention computation during the forward pass, not the model weights. This allows the sparsity to be adjusted dynamically throughout training: lower sparsity for image or low-resolution video training, and higher sparsity for high-resolution video training. As a result, FLOPs grow approximately linearly with sequence length.
96
+
97
+ A more detailed analysis of the SUV architecture will be released in a future arXiv update.
98
+
99
+ #### Training Stage
100
+
101
+ Our training consists of two stages: Text-to-Image and Text-to-Video.
102
+
103
+ #### Text-to-Image
104
+
105
+ Previous studies have shown that image weights trained on synthetic data may negatively impact video training. Therefore, in the v1.5.0 update, we choose to train image weights using a much larger corpus of real-world data, totaling 1.1B images. Since image data come in various resolutions, whereas videos are primarily in a 9:16 aspect ratio, we adopt multi-resolution training for images using five common aspect ratios—(1,1), (3,4), (4,3), (9,16), and (16,9)—along with the Min-Max Token Strategy. In contrast, video training is conducted using a fixed 9:16 resolution.
106
+
107
+ The difference between Skiparse Attention and Full Attention lies in the token sequences involved in the forward computation; the required weights remain identical. Therefore, we can first train the model using Dense MMDiT with Full 3D Attention, and then fine-tune it to the Sparse MMDiT mode after sufficient training.
108
+
109
+ **Image-Stage-1:** Training is conducted using 512 Ascend 910B accelerators. We train a randomly initialized Dense MMDiT on 256²-pixel images with multi-resolution enabled. The learning rate is set to 1e-4, with a batch size of 8096. This stage runs for a total of 225k steps.
110
+
111
+ **Image-Stage-2:** Training is conducted using 384 Ascend 910B accelerators. We train on 384²-pixel images with multi-resolution still enabled. The learning rate remains 1e-4, the batch size is 6144, and training lasts for 150k steps.
112
+
113
+ **Image-Stage-3:** Training is conducted using 256 Ascend 910B accelerators. We train on 288×512 images at a fixed resolution. The learning rate is 1e-4, the batch size is 4096, and training lasts for 110k steps. This stage completes the Dense MMDiT training.
114
+
115
+ **Image-Stage-4:** Training is conducted using 256 Ascend 910B accelerators. We initialize the SUV model using the pretrained weights from Dense MMDiT, with skip connections zero-initialized to ensure that the model could produce non-noise outputs at the start. In practice, zero-shot inference reveals that the generated images contained meaningful low-frequency structures. Our experiments confirm that fine-tuning from Dense DiT to SUV converges quickly. This stage uses a fixed resolution of 288×512, a learning rate of 1e-4, a batch size of 4096, and is trained for approximately 160k steps.
116
+
117
+ #### Text-to-Video
118
+
119
+ For video training, we fix the aspect ratio at 9:16 and train solely on video data instead of jointly with image data. All training in this stage is performed using 512 Ascend 910B accelerators.
120
+
121
+ **Video-Stage-1:** Starting from the SUV weights pretrained during the Text-to-Image phase, we train on videos with a shape of 57×288×512 for about 40k steps. The setup includes a learning rate of 6e-5, TP/SP parallelism of 2, gradient accumulation set to 2, a micro batch size of 2, and a global batch size of 1024. Videos are trained at 24 fps, representing approximately 2.4 seconds (57/24 ≈ 2.4s) of content per sample. This stage marks the initial adaptation from image-based to video-based weights, for which shorter video clips are intentionally selected to ensure stable initialization.
122
+
123
+ **Video-Stage-2:** We further train on videos with a shape of 57×288×512 for 45k steps, keeping the learning rate, TP/SP parallelism, and gradient accumulation settings unchanged. However, the training frame rate is reduced to 12 fps, corresponding to ~4.8 seconds of video content per sample (57/12 ≈ 4.8s). This stage aims to enhance temporal learning without increasing sequence length, serving as preparation for later high-frame-count training.
124
+
125
+ **Video-Stage-3:** We train on videos with a shape of 121×288×512 for approximately 25k steps. The learning rate is adjusted to 4e-5, with TP/SP parallelism set to 4, gradient accumulation steps set to 2, a micro batch size of 4, and a global batch size of 1024. In this stage, we revert to a training frame rate of 24 fps.
126
+
127
+ **Video-Stage-4:** We conduct training on videos with a shape of 121×576×1024 for a total of 16k + 9k steps. The learning rates are set to 2e-5 and 1e-5 for the two phases, respectively. TP/SP parallelism is configured as 4, with gradient accumulation steps set to 4, a micro batch size of 1, and a global batch size of 512.
128
+
129
+ **Video-Stage-5:** We train on a high-quality subset of the dataset for 5k steps, using a learning rate of 1e-5. TP/SP parallelism is set to 4, with gradient accumulation steps of 4, a micro batch size of 1, and a global batch size of 512.
130
+
131
+ #### Performance on Vbench
132
+
133
+ | Model | Total Score | Quality Score | Semantic Score | **Aesthetic Quality** |
134
+ | -------------------------- | ------------- | ------------- | -------------- | --------------------- |
135
+ | Mochi-1 | 80.13% | 82.64% | 70.08% | 56.94% |
136
+ | CogvideoX-2B | 80.91% | 82.18% | 75.83% | 60.82% |
137
+ | CogvideoX-5B | 81.61% | 82.75% | 77.04% | 61.98% |
138
+ | Step-Video-T2V | 81.83% | <u>84.46%</u> | 71.28% | 61.23% |
139
+ | CogvideoX1.5-5B | 82.17% | 82.78% | **79.76%** | 62.79% |
140
+ | Gen-3 | 82.32% | 84.11% | 75.17% | <u>63.34%</u> |
141
+ | HunyuanVideo (Open-Source) | **83.24%** | **85.09%** | 75.82% | 60.36% |
142
+ | Open-Sora Plan v1.5.0 | <u>82.95%</u> | 84.15% | <u>78.17%</u> | **66.93%** |
143
+
144
+ ### Training Image-to-Video Diffusion Model
145
+
146
+ Coming Soon...
147
+
148
+ ### Future Work
149
+
150
+ Currently, open-source models such as Wan2.1 have achieved performance comparable to closed-source commercial counterparts. Given the gap in computing resources and data availability compared to industry-scale efforts, the future development of the Open-Sora Plan will focus on the following directions:
151
+
152
+ 1. Latents caching.
153
+
154
+ In the training process of Text-to-Video models, the data must be processed through two key modules—the Variational Autoencoder (VAE) and the Text Encoder—to extract features from both video/images and their corresponding prompts. These encoded features serve as inputs to the training model. However, in existing industry practices, feature encoding is redundantly performed on the multimodal training dataset during every training epoch. This leads to additional computational overhead and significantly prolongs the total training time.
155
+
156
+ Specifically, in conventional training pipelines, the VAE and Text Encoder modules are typically kept resident in GPU memory to perform feature encoding in real time during each epoch. While this ensures on-the-fly encoding, it also results in persistently high GPU memory usage, becoming a major bottleneck for training efficiency. This issue is exacerbated when handling large-scale datasets or complex models, where memory constraints further limit model capacity and training speed.
157
+
158
+ To address the above issue, we propose an optimization strategy that replaces repeated feature computation with feature lookup. The core idea is to decouple feature encoding from model training. Specifically, during pretraining or the first training epoch, we compute and store the most computationally expensive text prompt features in external high-performance storage. During subsequent training, the model directly loads these precomputed features from storage, avoiding redundant encoding operations. This design significantly reduces computational overhead and GPU memory usage, allowing more memory to be allocated to model training.
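+
+ A minimal sketch of this lookup-instead-of-recompute idea is shown below (not the MindSpeed-MM implementation; the encoder is passed in as a generic callable and a dummy stand-in is used for the demo):
+
+ ```python
+ import os
+ import torch
+ from torch.utils.data import Dataset
+
+ class CachedFeatureDataset(Dataset):
+     """Encode each sample once (VAE latents + text embeddings), persist the result,
+     and reload it from storage on every later epoch instead of re-encoding."""
+
+     def __init__(self, samples, encode_fn, cache_dir="./feature_cache"):
+         self.samples = samples        # raw (video, prompt) pairs, however they are loaded
+         self.encode_fn = encode_fn    # runs the VAE and text encoder once per sample
+         self.cache_dir = cache_dir
+         os.makedirs(cache_dir, exist_ok=True)
+
+     def __len__(self):
+         return len(self.samples)
+
+     def __getitem__(self, idx):
+         path = os.path.join(self.cache_dir, f"{idx}.pt")
+         if os.path.exists(path):
+             return torch.load(path)                   # cache hit: skip the VAE / text encoder
+         with torch.no_grad():
+             item = self.encode_fn(self.samples[idx])  # cache miss: encode once
+         torch.save(item, path)
+         return item
+
+ # Dummy encoder stand-in, just to show the flow.
+ dummy_encode = lambda s: {"latents": torch.randn(4, 4, 4), "text_emb": torch.randn(8)}
+ ds = CachedFeatureDataset([("video.mp4", "a prompt")], dummy_encode)
+ _ = ds[0]   # first access encodes and writes ./feature_cache/0.pt; later accesses just load it
+ ```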
159
+
160
+ Based on the following configuration environment, we compare the training time per epoch and per step before and after applying the feature caching strategy. Experimental results show that storing precomputed features reduces multi-epoch training time by approximately 30% and frees up around 20% of GPU memory resources.
161
+
162
+ | **Configuration** | **Details** |
163
+ | :---------------: | :-----------------------------------------: |
164
+ | Model | Open-Sora Plan v1.5.0 (2B-level parameters) |
165
+ | Dataset | 100K images and 10K videos |
166
+ | Accelerators | 8× Nvidia A800 GPUs |
167
+ | Feature Storage | Huawei OceanStor AI Storage |
168
+
169
+ Test cases:
170
+
171
+ | **Training Stage** | **Test Type** | **Batch Size** | **Time per Step** | **Time per Epoch** | **Memory Usage** |
172
+ | ------------------ | ---------------------- | -------------- | ----------------- | ------------------ | ---------------- |
173
+ | Low-Res Images | General Method | 64 | 6.53s | 21 min 12s | 56 GB |
174
+ | | Feature Caching Method | 64 | 4.10s | 13 min 19s | 40 GB |
175
+ | | General Method | 128 | 12.78s | 20 min 39s | 74 GB |
176
+ | | Feature Caching Method | 128 | 7.81s | 12 min 38s | 50 GB |
177
+ | Low-Res Videos | General Method | 8 | 8.90s | 26 min 23s | 68 GB |
178
+ | | Feature Caching Method | 8 | 7.78s | 23 min 05s | 51 GB |
179
+ | High-Res Videos | General Method | 4 | 17.00s | 101 min | 71 GB |
180
+ | | Feature Caching Method | 4 | 16.00s | 97 min | 57 GB |
181
+
182
+ 2. Improved DiT pretraining with sparse or linear attention. In v1.3.0, we introduce the first DiT pretrained with sparse attention in the community. This is extended in v1.5.0 into the SUV architecture, enabling sparse DiT to achieve performance comparable to its dense counterpart. While sparse and linear attention have demonstrated significant success in the LLM domain, their application in video generation remains underexplored. In future versions, we plan to further investigate the integration of sparse and linear attention into video generation models.
183
+
184
+ 3. MoE-based DiT. Since the release of [Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), the MoE (Mixture-of-Experts) paradigm has become a common approach for scaling LLMs to larger parameter sizes. Currently, open-source video generation models are capped at around 14B parameters, which is still relatively small compared to the 100B+ scales in the LLM field. Incorporating MoE into the DiT architecture, and exploring its combination with sparse and linear attention, is a future direction under consideration by the Open-Sora Plan team.
185
+
186
+ 4. Unified video models for both generation and understanding. The March release of GPT-4o demonstrates that unified architectures combining generation and understanding can offer fundamentally different capabilities compared to purely generative models. In the video domain, we should similarly anticipate the potential breakthroughs that such unified generative models might bring.
187
+
188
+ 5. Enhancing Image-to-Video generation models. Current approaches in this field still largely follow either the SVD paradigm or the inpainting-based paradigm adopted since Open-Sora Plan v1.2.0. Both approaches require extensive fine-tuning of pretrained Text-to-Video models. From a practical standpoint, Text-to-Video is more aligned with academic exploration, while Image-to-Video is more relevant to real-world production scenarios. As a result, developing a new paradigm for Image-to-Video will be a key focus for the Open-Sora Plan team moving forward.
Report-V1.5.0_cn.md ADDED
@@ -0,0 +1,188 @@
1
+ ## Report v1.5.0
2
+
3
+ 在2024年的10月,我们发布了Open-Sora Plan v1.3.0,第一次将一种稀疏化的attention结构——skiparse attention引入video generation领域。同时,我们采用了高效的WFVAE,使得训练时的编码时间和显存占用大大降低。
4
+
5
+ 在Open-Sora Plan v1.5.0中,Open-Sora Plan引入了几个关键的更新:
6
+
7
+ 1、更好的sparse dit——SUV。在skiparse attention的基础上,我们将sparse dit扩展至U形变化的稀疏结构,使得在保持速度优势的基础上sparse dit可以取得和dense dit相近的性能。
8
+
9
+ 2、更高压缩率的WFVAE。在Open-Sora Plan v1.5.0中,我们尝试了8x8x8下采样率的WFVAE,它在性能上媲美社区中广泛存在的4x8x8下采样率的VAE的同时latent shape减半,降低attention序列长度。
10
+
11
+ 3、data和model scaling。在Open-Sora Plan v1.5.0中,我们收集了1.1B的高质量图片数据和40m的高质量视频数据,并将模型大小scale到8.5B,使最终得到的模型呈现出不俗的性能。
12
+
13
+ 4、更简易的Adaptive Grad Clipping。相比于version 1.3.0中较复杂的丢弃污点batch的策略,在version 1.5.0中我们简单地维护一个adaptive的grad norm threshold并clipping,以此更适应各种并行策略的需要。
14
+
15
+ Open-Sora Plan v1.5.0全程在昇腾910系列加速卡上完成训练和推理,并采用mindspeed-mm训练框架适配并行策略。
16
+
17
+ ### Open-Source Release
18
+
19
+ Open-Sora Plan v1.5.0的开源包括:
20
+
21
+ 1、所有训练和推理代码。你也可以在[MindSpeed-MM](https://gitee.com/ascend/MindSpeed-MM)官方仓库找到open-sora plan v1.5.0版本的实现。
22
+
23
+ 2、8x8x8下采样的WFVAE权重以及8.5B的SUV去噪器权重。
24
+
25
+ ## Detailed Technical Report
26
+
27
+ ### Data collection and processing
28
+
29
+ 我们共收集了来自Recap-DataComp-1B、Coyo700M、Laion-aesthetic的共1.1B图片数据。对于图片数据,我们不进行除了分辨率之外的筛选。我们的视频数据来自于Panda70M以及其他自有数据。对于视频数据,我们采用与Open-Sora Plan v1.3.0相同的处理策略进行筛选,最终数据量为40m的高质量视频数据。
30
+
31
+ ### Adaptive Grad Clipping
32
+
33
+ 在Open-Sora Plan v1.3.0中,我们介绍了一种基于丢弃梯度异常batch的Adaptive Grad Clipping策略,这种策略具有很高的稳定性,但是执行逻辑过于复杂。因此,在Open-Sora Plan v1.5.0中,我们选择将该策略进行优化,采用EMA方式维护grad norm的threshold,并在grad norm超过该threshold时裁剪到threshold以下。该策略本质上是将大模型领域常用的1.0常数grad norm threshold扩展为一个随着训练进程动态变化的threshold。
34
+
35
+ ```python
36
+ '''
37
+ moving_avg_max_grad_norm: EMA方式维护的最大grad norm
38
+ moving_avg_max_grad_norm_var: EMA方式维护的最大grad norm的方差
39
+ clip_threshold: 根据3 sigma策略计算得到的梯度裁剪阈值
40
+ ema_decay: EMA衰减系数,一般为0.99
41
+ grad_norm: 当前step的grad norm
42
+ '''
43
+ clip_threshold = moving_avg_max_grad_norm + 3.0 * (moving_avg_max_grad_norm_var ** 0.5)
44
+ if grad_norm <= clip_threshold:
45
+ # grad norm小于裁剪阈值,则该step参数正常更新,同时更新维护的moving_avg_max_grad_norm 和 moving_avg_max_grad_norm_var
46
+ moving_avg_max_grad_norm = ema_decay * moving_avg_max_grad_norm + (1 - ema_decay) * grad_norm
47
+ max_grad_norm_var = (moving_avg_max_grad_norm - grad_norm) ** 2
48
+ moving_avg_max_grad_norm_var = ema_decay * moving_avg_max_grad_norm_var + (1 - ema_decay) * max_grad_norm_var
49
+ 参数更新...
50
+ else:
51
+ # grad norm大于裁剪阈值,则先裁剪grad使grad norm减少至clip_threshold,再进行参数更新。
52
+ clip_coef = grad_norm / clip_threshold
53
+ grads = clip(grads, clip_coef) # 裁剪grads
54
+ 参数更新...
55
+ ```
56
+
57
+ 该策略相较于v1.3.0中策略实现更简单,且能够很好应对diffusion训练后期grad norm远小于1.0时仍存在loss spike的问题。
58
+
59
+ ### WFVAE with 8x8x8 downsampling
60
+
61
+ 在V1.5.0版本中,我们将VAE的时间压缩率从4倍压缩提高至8倍压缩,使得对于同样原始尺寸的视频,latent shape减少为先前版本的一半,这使得我们可以实现更高帧数的视频生成。
62
+
+ | Model | THW(C) | PSNR | LPIPS | rFVD |
+ | ----------------- | ------------- | ------------ | ------------- | ------------ |
+ | CogVideoX | 4x8x8 (16) | <u>36.38</u> | 0.0243 | <u>50.33</u> |
+ | StepVideo | 8x16x16 (16) | 33.61 | 0.0337 | 113.68 |
+ | LTXVideo | 8x32x32 (128) | 33.84 | 0.0380 | 150.87 |
+ | Wan2.1 | 4x8x8 (16) | 35.77 | **0.0197** | **46.05** |
+ | Ours (WF-VAE-M) | 8x8x8 (32) | **36.91** | <u>0.0205</u> | 52.53 |
+
+ **Test on an open-domain dataset with 1K samples.**
+
+ For details on WFVAE, see [WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model](https://arxiv.org/abs/2411.17459).
+
+ ### Training Text-to-Video Diffusion Model
+
+ #### Framework —— SUV: A Sparse U-shaped Diffusion Transformer For Fast Video Generation
+
+ In Open-Sora Plan v1.3.0 we discussed the strengths and weaknesses of Full 3D Attention and 2+1D Attention and, combining their characteristics, proposed Skiparse Attention, a new type of global sparse attention; see [Report-v1.3.0](https://github.com/yunyangge/Open-Sora-Plan/blob/main/docs/Report-v1.3.0.md) for the details.
+
+ Given a preset sparse ratio $k$, Skiparse Attention alternates Single Skip and Group Skip operations to select subsequences of $\frac{1}{k}$ of the original length for attention, thereby approximating Full 3D Attention. The larger the sparse ratio, the more spread out the selected positions are within the original sequence; the smaller the ratio, the denser they are. Whatever the sparse ratio, Skiparse Attention always remains global. A simplified sketch of the selection follows.
+
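+ The sketch below shows the two alternating selection patterns on a flattened 1-D token sequence (the exact formulation, including how the 3-D latent layout is handled, is given in Report-v1.3.0):
+
+ ```python
+ import numpy as np
+
+ def single_skip_indices(seq_len, k):
+     # Bucket b gathers tokens [b, b+k, b+2k, ...]: stride-k sampling of single tokens.
+     return [np.arange(b, seq_len, k) for b in range(k)]
+
+ def group_skip_indices(seq_len, k):
+     # Tokens are first packed into contiguous groups of k; bucket b then takes every
+     # k-th group, so neighbouring tokens still get to interact in these steps.
+     groups = np.arange(seq_len).reshape(-1, k)
+     return [groups[b::k].reshape(-1) for b in range(k)]
+
+ # Toy example: 16 tokens, sparse ratio k = 2 (each bucket holds seq_len / k tokens).
+ # single skip -> [0 2 4 6 8 10 12 14] and [1 3 5 7 9 11 13 15]
+ # group skip  -> [0 1 4 5 8 9 12 13]  and [2 3 6 7 10 11 14 15]
+ # Attention is computed inside each bucket, so the pattern is sparse yet still global.
+ ```
+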
+ In Open-Sora Plan v1.5.0 we regard this sparse interaction as a form of information downsampling over tokens: sparser Skiparse Attention performs interaction closer to the semantic level, while denser Skiparse Attention performs more fine-grained interaction. Following the multi-scale design principle of neural networks, we introduce Skiparse Attention whose sparsity varies in a U shape across the network: shallow layers use low sparsity, with Full 3D Attention at the shallowest stage, while deep layers use high sparsity. In particular, by analogy with UNet, we add Long Skip Connections between stages of the same sparsity. We call this U-shaped, Skiparse-Attention-based DiT "SUV".
+
+ ![SUV](https://img.picui.cn/free/2025/06/05/684108197cbb8.png)
+
+ Open-Sora Plan v1.5.0 adopts an MMDiT-based SUV architecture. The video latents go through the skiparse attention operation, while the text embeddings are only repeated to align with the skiparsed latent shape and are not sparsified in any way.
+
+ The SUV architecture has the following advantages:
+
+ 1、SUV is the first sparsification method shown to be effective for video generation models. Our ablations show that, for the same number of training steps, it reaches performance close to a dense DiT, and it can be applied to both pretraining and inference.
+
+ 2、Whereas a UNet explicitly downsamples feature maps and therefore loses information, the U shape of SUV acts on attention: the feature map shape never changes, so no information is lost; only the granularity of token-to-token interaction changes.
+
+ 3、Skiparse Attention and SUV do not change the weights, only how attention is computed in the forward pass. This lets us adjust the sparsity dynamically over the course of training, using lower sparsity for image or low-resolution video training and higher sparsity for high-resolution video training, so that FLOPs grow roughly linearly with sequence length (see the schedule sketch after this list).
+
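+ As an illustration of such a schedule, the per-stage depths and sparse ratios below are hypothetical, not the released 8.5B configuration; they only show the U shape and how the ratios could be re-targeted between training stages:
+
+ ```python
+ # Hypothetical SUV layout: (number of blocks, sparse ratio) per stage.
+ # sparse_ratio = 1 means Full 3D Attention; larger values mean sparser Skiparse Attention.
+ suv_stages = [
+     (4, 1),   # shallowest stage: dense, fine-grained interaction
+     (4, 2),   # sparser
+     (12, 4),  # deepest stage: most sparse, most semantic-level interaction
+     (4, 2),   # mirrored stage, long-skip-connected to the earlier (4, 2) stage
+     (4, 1),   # dense again near the output, long-skip-connected to the first stage
+ ]
+
+ # Because sparsity only changes which tokens each attention call sees, the same weights
+ # can be trained with one schedule (e.g. all ratios 1 for images or low resolution)
+ # and fine-tuned or run with a sparser schedule for long high-resolution sequences.
+ ```
+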
+ A more detailed analysis of the SUV architecture will be posted to arXiv later.
+
+ #### Training Stage
+
+ Our training consists of a Text-to-Image phase and a Text-to-Video phase.
+
+ #### Text-to-Image
+
+ Prior work suggests that image weights trained on synthetic data can hurt subsequent video training, so in v1.5.0 we train the image weights on a larger domain of real data, collecting 1.1B images in total. Because the images come in many resolutions while our videos are mostly 9:16, image training uses multi-resolution buckets (five common aspect ratios: (1,1), (3,4), (4,3), (9,16), (16,9)) together with the Min-Max token strategy, whereas video training uses a fixed 9:16 aspect ratio at a fixed resolution.
+
+ Skiparse Attention differs from Full Attention only in which token sequences take part in the computation during the forward pass; the weights involved are exactly the same. We can therefore first train a Dense MMDiT with Full 3D Attention and, once it is sufficiently trained, fine-tune it into the Sparse MMDiT mode.
+
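+ A toy sketch of why the dense checkpoint can be reused directly; the block below is a stand-in rather than the real MMDiT block, but it shows that the sparse ratio is a forward-time attribute, not a parameter:
+
+ ```python
+ import torch
+ from torch import nn
+
+ class TinyDiTBlock(nn.Module):
+     """Toy block: the attention weights are identical whichever attention pattern is used."""
+     def __init__(self, dim=64, sparse_ratio=1):
+         super().__init__()
+         self.sparse_ratio = sparse_ratio     # only changes how attention is computed
+         self.qkv = nn.Linear(dim, 3 * dim)   # same parameters in dense and sparse modes
+         self.proj = nn.Linear(dim, dim)
+
+ dense = TinyDiTBlock(sparse_ratio=1)   # trained first with Full 3D Attention
+ sparse = TinyDiTBlock(sparse_ratio=4)  # fine-tuned afterwards in Skiparse mode
+ sparse.load_state_dict(dense.state_dict())  # loads cleanly: identical state dict keys
+ ```
+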
+ **Image-Stage-1:** Trained on 512 Ascend 910B devices. A randomly initialized Dense MMDiT is trained on images at the 256^2 px level with multi-resolution enabled. The learning rate is 1e-4 and the batch size is 8096. This stage runs for 225k steps in total.
+
+ **Image-Stage-2:** Trained on 384 Ascend 910B devices on images at the 384^2 px level, with multi-resolution enabled. The learning rate is 1e-4 and the batch size is 6144, for 150k steps.
+
+ **Image-Stage-3:** Trained on 256 Ascend 910B devices at a fixed 288x512 resolution. The learning rate is 1e-4 and the batch size is 4096, for 110k steps. This completes the Dense MMDiT training.
+
+ **Image-Stage-4:** Trained on 256 Ascend 910B devices. SUV is initialized from the Dense MMDiT weights, with the skip connections zero-initialized so that the initial SUV weights can already produce non-noise images (a minimal sketch of this initialization follows the stage list). In fact, zero-shot inference at this point already yields images with some low-frequency structure, and we verified that the fine-tune from Dense DiT to SUV converges quickly. This stage uses a fixed 288x512 resolution, a learning rate of 1e-4, and a batch size of 4096, for about 160k steps.
+
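+ A minimal sketch of the zero-initialized long-skip fusion assumed here (the concat-then-project form is illustrative, not necessarily the exact SUV implementation):
+
+ ```python
+ import torch
+ from torch import nn
+
+ class LongSkipFuse(nn.Module):
+     """Fuse a shallow-stage feature into a deep-stage feature via a zero-initialized projection."""
+     def __init__(self, dim):
+         super().__init__()
+         self.proj = nn.Linear(2 * dim, dim)
+         nn.init.zeros_(self.proj.weight)  # zero init: the new path is silent at the start
+         nn.init.zeros_(self.proj.bias)
+
+     def forward(self, deep, shallow):
+         # At initialization the projection outputs zeros, so the module returns `deep`
+         # unchanged and the freshly initialized SUV behaves like the pretrained dense model.
+         return deep + self.proj(torch.cat([deep, shallow], dim=-1))
+ ```
+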
+ #### Text-to-Video
+
+ For video training we fix the aspect ratio to 9:16 and train on video data only, without joint image-video training. All of the following stages run on 512 Ascend 910B devices.
+
+ **Video-Stage-1:** Starting from the SUV weights of the Text-to-Image phase, we train on 57x288x512 videos for about 40k steps with a learning rate of 6e-5, TP/SP parallel size 2, gradient accumulation 2, micro batch size 2, and global batch size 1024. The training fps is 24, i.e. about 57/24 ≈ 2.4 s of video content. As the first stage of transferring image weights to video weights, training on short videos provides a good initialization.
+
+ **Video-Stage-2:** We again train on 57x288x512 videos, for 45k steps. The learning rate, TP/SP parallel size, and gradient accumulation settings stay the same, but the training fps is changed to 12, corresponding to about 57/12 ≈ 4.8 s of the original video. This stage improves temporal learning without increasing the sequence length, preparing for the later high-frame-count stages.
+
+ **Video-Stage-3:** We train on 121x288x512 videos for about 25k steps with the learning rate lowered to 4e-5, TP/SP parallel size 4, gradient accumulation 2, micro batch size 4, and global batch size 1024. The training fps returns to 24.
+
+ **Video-Stage-4:** We train on 121x576x1024 videos for 16k + 9k steps with learning rates of 2e-5 and 1e-5 respectively, TP/SP parallel size 4, gradient accumulation 4, micro batch size 1, and global batch size 512.
+
+ **Video-Stage-5:** We train on a high-quality subset of the data for 5k steps with a learning rate of 1e-5, TP/SP parallel size 4, gradient accumulation 4, micro batch size 1, and global batch size 512.
+
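+ The global batch sizes above are consistent with the usual Megatron-style relation (assuming data parallelism over the devices left after the TP/SP split):
+
+ ```python
+ def global_batch(num_npus, tp_sp, micro_bs, grad_accum):
+     # data-parallel size = devices / (TP/SP group size)
+     return (num_npus // tp_sp) * micro_bs * grad_accum
+
+ print(global_batch(512, 2, 2, 2))  # Video-Stage-1/2 -> 1024
+ print(global_batch(512, 4, 4, 2))  # Video-Stage-3   -> 1024
+ print(global_batch(512, 4, 1, 4))  # Video-Stage-4/5 -> 512
+ ```
+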
+ #### Performance on VBench
+
+ | Model | Total Score | Quality Score | Semantic Score | Aesthetic Quality |
+ | -------------------------- | ------------- | ------------- | -------------- | --------------------- |
+ | Mochi-1 | 80.13% | 82.64% | 70.08% | 56.94% |
+ | CogVideoX-2B | 80.91% | 82.18% | 75.83% | 60.82% |
+ | CogVideoX-5B | 81.61% | 82.75% | 77.04% | 61.98% |
+ | Step-Video-T2V | 81.83% | <u>84.46%</u> | 71.28% | 61.23% |
+ | CogVideoX1.5-5B | 82.17% | 82.78% | **79.76%** | 62.79% |
+ | Gen-3 | 82.32% | 84.11% | 75.17% | <u>63.34%</u> |
+ | HunyuanVideo (Open-Source) | **83.24%** | **85.09%** | 75.82% | 60.36% |
+ | Open-Sora Plan v1.5.0 | <u>82.95%</u> | 84.15% | <u>78.17%</u> | **66.93%** |
+
+ ### Training Image-to-Video Diffusion Model
+
+ Coming soon...
+
+ ### Future Work
+
+ The open-source community now has models whose performance rivals closed-source commercial ones, such as Wan2.1. Since our compute and data are still limited compared with those of companies, the Open-Sora Plan team will focus on the following directions:
+
+ 1、Latents Cache.
+
+ During Text-to-Video training, the training data must pass through two key modules, the variational autoencoder (VAE) and the text encoder, to obtain feature encodings of the videos/images and their prompts. These encoded features are the model inputs for the rest of the training pipeline. In common training schemes, however, every epoch repeats the feature encoding of the multimodal training set, which adds extra compute and noticeably lengthens overall training time.
+
+ Concretely, in the conventional pipeline the VAE and text encoder stay resident in device memory so that they can encode features on the fly in every epoch. This guarantees real-time encoding but keeps memory occupancy high, which becomes one of the main bottlenecks for training efficiency; with large datasets or large models the memory pressure is even worse and limits both model size and training speed.
+
+ To address this, we adopt a lookup-instead-of-compute scheme for these features. The core idea is to decouple feature encoding from model training: before training, or during the first epoch, we compute the most expensive features, such as the prompt embeddings, and save them to external high-performance file storage. Subsequent training reads these precomputed features directly from storage, avoiding repeated encoding. This removes redundant computation and also frees a large amount of device memory for the model itself. A minimal sketch follows.
+
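+ A minimal sketch of this lookup-instead-of-compute idea (file layout, function names, and the dataset class are illustrative):
+
+ ```python
+ import os
+ import torch
+ from torch.utils.data import Dataset
+
+ @torch.no_grad()
+ def build_text_cache(samples, text_encoder, cache_dir):
+     # One-off pass: encode each prompt once and store the embedding on disk.
+     os.makedirs(cache_dir, exist_ok=True)
+     for sample_id, prompt in samples:
+         emb = text_encoder(prompt)  # the expensive step, now done only once
+         torch.save(emb.cpu(), os.path.join(cache_dir, f"{sample_id}_text.pt"))
+
+ class CachedFeatureDataset(Dataset):
+     # Training reads the cached features back instead of re-encoding every epoch,
+     # so the text encoder (and optionally the VAE) need not stay in device memory.
+     def __init__(self, samples, cache_dir):
+         self.samples, self.cache_dir = samples, cache_dir
+
+     def __len__(self):
+         return len(self.samples)
+
+     def __getitem__(self, idx):
+         sample_id, _ = self.samples[idx]
+         return torch.load(os.path.join(self.cache_dir, f"{sample_id}_text.pt"))
+ ```
+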
+ With the configuration below, we measured per-step and per-epoch training statistics with and without the feature store. The experiments show that feature caching **shortens multi-epoch training time by about 30% and frees about 20% of device memory.**
+
+ | Configuration | Details |
+ | :--------: | :-------------------------------: |
+ | Model | Open-Sora Plan v1.5.0, 2B scale |
+ | Dataset | 100K images and 10K videos |
+ | GPU server | 8x Nvidia A800 |
+ | Feature store | Huawei OceanStor AI storage |
+
+ Test results:
+
+ | Training stage | Scheme | Batch Size | Time per Step | Time per Epoch | Memory Usage |
+ | ------------ | ---------------- | ---------- | ---------- | ----------- | -------- |
+ | Low-resolution images | Standard pipeline | 64 | 6.53s | 21min12s | 56GB |
+ | | Feature caching | 64 | 4.10s | 13min19s | 40GB |
+ | | Standard pipeline | 128 | 12.78s | 20min39s | 74GB |
+ | | Feature caching | 128 | 7.81s | 12min38s | 50GB |
+ | Low-resolution videos | Standard pipeline | 8 | 8.90s | 26min23s | 68GB |
+ | | Feature caching | 8 | 7.78s | 23min05s | 51GB |
+ | High-resolution videos | Standard pipeline | 4 | 17s | 101min | 71GB |
+ | | Feature caching | 4 | 16s | 97min | 57GB |
+
+ 2、Better DiTs pretrained with sparse or linear attention. In v1.3.0 we released the community's first DiT pretrained with sparse attention, and in v1.5.0 we extended it into the SUV architecture, giving the sparse DiT performance on par with a dense DiT. Sparse attention and linear attention have been very successful for LLMs, but their use in video generation is still limited. In future versions we will further explore sparse and linear attention for video generation.
+
+ 3、MoE-based DiTs. Since the release of Mixtral 8x7B, the LLM field has routinely used MoE to scale models to larger parameter counts. The largest open-source video models today are only around 14B parameters, still small compared with the hundreds of billions of parameters in LLMs. Introducing MoE into the DiT architecture, and combining MoE with sparse and linear attention, is a direction the Open-Sora Plan team will consider next.
+
+ 4、Video generation models that unify generation and understanding. The GPT-4o update in March showed that generative models with a unified generation-understanding architecture can acquire abilities that pure generative models do not have. In the video domain we should likewise look forward to what a unified generative model can bring.
+
+ 5、Better Image-to-Video models. Current Image-to-Video work still largely follows the SVD paradigm or the inpainting paradigm adopted since Open-Sora Plan v1.2.0, both of which require lengthy fine-tuning on top of Text-to-Video weights. In practical terms, Text-to-Video is closer to academic exploration, while Image-to-Video is closer to real production scenarios. A new paradigm for Image-to-Video will therefore be a key direction for the Open-Sora Plan team going forward.