Add files

- .gitattributes +1 -0
- README.md +166 -0
- assets/images/Fig_1.png +3 -0
- configuration.json +24 -0
- non_ema_0035000.pth +3 -0
- open_clip_pytorch_model.bin +3 -0
- v2-1_512-ema-pruned.ckpt +3 -0
.gitattributes
CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
README.md
ADDED

@@ -0,0 +1,166 @@
---
backbone:
- diffusion
domain:
- multi-modal
frameworks:
- pytorch
license: cc-by-nc-nd-4.0
metrics:
- realism
- video-video similarity
studios:
- damo/Video-to-Video
tags:
- video2video generation
- diffusion model
- 视频到视频
- 视频超分辨率
- 视频生成视频
- 生成
tasks:
- video-to-video
widgets:
- examples:
  - inputs:
    - data: A panda eating bamboo on a rock.
      name: text
    - data: XXX/test.mpt
      name: video_path
    name: 2
    title: 示例1
  inferencespec:
    cpu: 4
    gpu: 1
    gpu_memory: 28000
    memory: 32000
  inputs:
  - name: text, video_path
    title: 输入英文prompt, 视频路径
    type: str, str
    validator:
      max_words: 75, /
  task: video-to-video
---

# Video-to-Video

本项目**MS-Vid2Vid**由达摩院研发和训练,主要用于提升文生视频、图生视频的分辨率和时空连续性,其训练数据包含了精选的海量高清视频、图像数据(最短边>720),可以将低分辨率(16:9)的视频提升到更高分辨率(1280 * 720),也可以用于任意低分辨率视频的超分,本页面我们将其称为**MS-Vid2Vid-XL**。

The **MS-Vid2Vid** project is developed and trained by DAMO Academy and is primarily used to enhance the resolution and spatiotemporal continuity of text-to-video and image-to-video generations. The training data consists of a large curated collection of high-definition videos and images (with a minimum short side of 720), allowing the model to upscale low-resolution (16:9) videos to a higher resolution (1280 * 720). It can also be applied to super-resolution of videos at arbitrary low resolutions. On this page, we refer to it as **MS-Vid2Vid-XL**.
<center>
<p align="center">
<img src="assets/images/Fig_1.png"/>
<br/>
Fig.1 Video-to-Video-XL
</p>
</center>

## 模型介绍 (Introduction)

**MS-Vid2Vid-XL**是基于Stable Diffusion设计而得,其设计细节延续我们自研的[VideoComposer](https://videocomposer.github.io),具体可以参考其技术报告。如下示例中,左边是低分辨率(448 * 256)的输入,细节会存在抖动,时序一致性较差;右边是高分辨率(1280 * 720)的输出,总体平滑很多,在很多case下具有较强的修正能力。

**MS-Vid2Vid-XL** is designed based on Stable Diffusion, with design details inherited from our in-house [VideoComposer](https://videocomposer.github.io); please refer to its technical report for specifics. In the examples below, the left side is the low-resolution input (448 * 256), where details jitter and temporal consistency is poor; the right side is the high-resolution output (1280 * 720), which is much smoother overall and in many cases shows a strong corrective ability.
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424496410559.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424814395007.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424166441720.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424151609672.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424162741042.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424162741043.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424160549937.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423819156083.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423826392315.mp4"></video>
</center>

### 代码范例 (Code example)

```python
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# VID_PATH: your input video path
# TEXT: your English text description
pipe = pipeline(task='video-to-video', model='damo/Video-to-Video')
p_input = {
    'video_path': VID_PATH,
    'text': TEXT
}

output_video_path = pipe(p_input, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]
```
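The `VID_PATH` and `TEXT` placeholders in the snippet above are ordinary Python strings. A minimal way to fill them before calling the pipeline, using the example prompt from this card's widget metadata (the video file name is hypothetical, not part of this repo):

```python
# Hypothetical input values for the pipeline snippet above.
# The file name is an assumption; the prompt comes from this card's widget example.
VID_PATH = 'input_video.mp4'                       # path to a low-resolution input clip
TEXT = 'A panda eating bamboo on a rock.'          # English prompt (max 75 words)

p_input = {'video_path': VID_PATH, 'text': TEXT}
print(p_input['text'])  # A panda eating bamboo on a rock.
```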

### 模型局限 (Limitation)

本**MS-Vid2Vid-XL**可能存在如下局限性:

- 目标距离较远时可能会存在一定的模糊,该问题可以通过输入文本来缓解;
- 计算耗时较大,因为需要生成720P的视频,隐空间的尺寸为(160 * 90),单个视频计算时长>2分钟;
- 因训练数据的原因,目前仅支持英文输入。

**MS-Vid2Vid-XL** may have the following limitations:

- Distant objects may appear somewhat blurry; this can be mitigated or resolved through the input text.
- Computation is expensive because 720P videos must be generated: the latent space size is (160 * 90), and a single video takes more than 2 minutes to compute.
- Only English input is currently supported, a limitation of the training data.
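The (160 * 90) latent size quoted above is consistent with the 8× spatial downsampling typical of Stable Diffusion VAEs; the factor of 8 is an assumption based on Stable Diffusion, not something this card states. A quick sanity check:

```python
# Stable Diffusion VAEs typically downsample 8x spatially (an assumption here),
# which maps the 1280 x 720 output resolution to the (160 * 90) latent size.
VAE_DOWNSAMPLE = 8

out_w, out_h = 1280, 720
latent_w = out_w // VAE_DOWNSAMPLE
latent_h = out_h // VAE_DOWNSAMPLE

print(latent_w, latent_h)  # 160 90
```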

## 相关论文以及引用信息 (Reference)

```
@article{videocomposer2023,
  title={VideoComposer: Compositional Video Synthesis with Motion Controllability},
  author={Wang, Xiang* and Yuan, Hangjie* and Zhang, Shiwei* and Chen, Dayou* and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
  journal={arXiv preprint arXiv:2306.02018},
  year={2023}
}

@inproceedings{videofusion2023,
  title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
  author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}
```

## 使用协议 (License Agreement)

我们的代码和模型权重仅可用于个人/学术研究,暂不支持商用。

Our code and model weights are available for personal/academic research use only; commercial use is not permitted at this time.
assets/images/Fig_1.png
ADDED

Git LFS Details
configuration.json
ADDED

@@ -0,0 +1,24 @@
{
    "framework": "pytorch",
    "task": "video-to-video",
    "model": {
        "type": "video-to-video-model",
        "model_args": {
            "ckpt_clip": "open_clip_pytorch_model.bin",
            "ckpt_unet": "non_ema_0035000.pth",
            "ckpt_autoencoder": "v2-1_512-ema-pruned.ckpt",
            "seed": 666,
            "solver_mode": "fast"
        },
        "model_cfg": {
            "batch_size": 1,
            "target_fps": 8,
            "max_frames": 32,
            "latent_hei": 90,
            "latent_wid": 160
        }
    },
    "pipeline": {
        "type": "video-to-video-pipeline"
    }
}
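The `model_cfg` block above fixes the clip length the pipeline generates. A small sketch that parses this configuration and derives the clip duration (the JSON is inlined verbatim from configuration.json so the snippet is self-contained):

```python
import json

# configuration.json from this repo, inlined for a self-contained check.
config = json.loads("""
{
    "framework": "pytorch",
    "task": "video-to-video",
    "model": {
        "type": "video-to-video-model",
        "model_cfg": {
            "batch_size": 1,
            "target_fps": 8,
            "max_frames": 32,
            "latent_hei": 90,
            "latent_wid": 160
        }
    },
    "pipeline": {
        "type": "video-to-video-pipeline"
    }
}
""")

cfg = config["model"]["model_cfg"]
# At target_fps=8 and max_frames=32, one generated clip spans 4 seconds.
clip_seconds = cfg["max_frames"] / cfg["target_fps"]
print(clip_seconds)  # 4.0
```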
non_ema_0035000.pth
ADDED

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d146dd22a8158896c882dd96e6b14d2962a63398a3f2ac37611dcadcdab3a15d
size 5645549113
open_clip_pytorch_model.bin
ADDED

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9a78ef8e8c73fd0df621682e7a8e8eb36c6916cb3c16b291a082ecd52ab79cc4
size 3944692325
v2-1_512-ema-pruned.ckpt
ADDED

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:88ecb782561455673c4b78d05093494b9c539fc6bfc08f3a9a4a0dd7b0b10f36
size 5214865159
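The three checkpoint files in this commit are stored as Git LFS pointers: plain-text files with "key value" lines, as shown above. A minimal parser (an illustrative sketch, not an official LFS tool) that recovers the oid and size from the v2-1_512-ema-pruned.ckpt pointer:

```python
# Parse a Git LFS pointer file into a dict of its "key value" lines.
def parse_lfs_pointer(text: str) -> dict:
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

# Pointer contents copied verbatim from v2-1_512-ema-pruned.ckpt above.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:88ecb782561455673c4b78d05093494b9c539fc6bfc08f3a9a4a0dd7b0b10f36
size 5214865159
"""

info = parse_lfs_pointer(pointer_text)
print(info["oid"][:7])  # sha256:
print(info["size"])     # 5214865159
```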