---
license: agpl-3.0
datasets:
- nkp37/OpenVid-1M
- TempoFunk/webvid-10M
---

⚡️ In this work, we present **AMD Hummingbird-I2V**, a compact and efficient **diffusion-based** image-to-video (I2V) model designed for high-quality video synthesis under limited computational budgets. Hummingbird-I2V adopts a lightweight **U-Net** architecture with **0.9B parameters** and a novel two-stage training strategy guided by **reward-based feedback**, resulting in substantial improvements in inference speed, model efficiency, and visual quality. To further improve output resolution with minimal overhead, we introduce a **super-resolution** module at the end of the pipeline. Additionally, we leverage **ReNeg**, an AMD-proposed reward-guided framework for learning negative embeddings via gradient descent, to further boost visual quality. As a result, Hummingbird-I2V can generate high-quality 4K video in just **11 seconds** with 16 inference steps on an AMD Radeon™ RX 7900 XTX GPU. Quantitative results on the VBench-I2V benchmark show that Hummingbird-I2V achieves state-of-the-art performance among U-Net-based diffusion models and competitive results compared to significantly larger DiT-based models. We provide a detailed analysis of the model architecture, training methodology, and benchmark performance.
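
To make the idea of learning a negative embedding under reward guidance more concrete, the toy sketch below optimizes a two-dimensional "negative embedding" by gradient steps so that a classifier-free-guidance output scores higher under a reward. It is not the ReNeg or Hummingbird-I2V implementation: the linear `predict` function, the matrix `W`, the squared-distance `reward`, the guidance scale, and the learning rate are all hypothetical stand-ins, and the sketch assumes the learned embedding takes the place of the unconditional (null-prompt) branch in classifier-free guidance.

```python
# Toy sketch only: NOT the ReNeg or Hummingbird-I2V implementation.
# Every name, shape, and constant below is a hypothetical stand-in.

GUIDANCE_SCALE = 7.5  # assumed classifier-free guidance (CFG) scale

# Hypothetical frozen "denoiser": a fixed linear map from embedding to prediction.
W = [[1.0, 0.5],
     [-0.5, 1.0]]


def predict(emb):
    """Apply the frozen linear model W to an embedding."""
    return [sum(w * e for w, e in zip(row, emb)) for row in W]


def guided(cond_emb, neg_emb, scale=GUIDANCE_SCALE):
    """CFG combination, with neg_emb used as the unconditional branch."""
    p_cond, p_neg = predict(cond_emb), predict(neg_emb)
    return [n + scale * (c - n) for c, n in zip(p_cond, p_neg)]


def reward(output, target):
    """Hypothetical differentiable reward: negative squared distance to a preferred output."""
    return -sum((o - t) ** 2 for o, t in zip(output, target))


def reward_grad_wrt_neg(cond_emb, neg_emb, target, scale=GUIDANCE_SCALE):
    """Analytic gradient of the reward with respect to the negative embedding."""
    out = guided(cond_emb, neg_emb, scale)
    d_out = [-2.0 * (o - t) for o, t in zip(out, target)]  # dR / d(output)
    # output = scale * W c + (1 - scale) * W n, so d(output_i)/d(n_j) = (1 - scale) * W[i][j]
    return [
        (1.0 - scale) * sum(W[i][j] * d_out[i] for i in range(len(W)))
        for j in range(len(neg_emb))
    ]


def learn_negative_embedding(cond_emb, target, steps=200, lr=0.005):
    """Gradient ascent on the reward (equivalently, descent on the negated reward)."""
    neg = [0.0, 0.0]  # start from a zero "null" embedding
    for _ in range(steps):
        grad = reward_grad_wrt_neg(cond_emb, neg, target)
        neg = [n + lr * g for n, g in zip(neg, grad)]
    return neg
```

In this toy setting only the negative embedding is updated while the model stays frozen, which mirrors the general idea that guidance quality can be improved without touching the denoiser's weights.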

<img src="src/key_takeway.png" alt="key_takeway" title="key_takeway" class="key_takeway">

<img src="src/i2v_training_pipeline.png" alt="i2v_training_pipeline" title="i2v_training_pipeline" class="i2v_training_pipeline">

<style>
table {
  width: auto;
  border-collapse: collapse;
}
th, td {
  border: 1px solid #ddd;
  text-align: center;
  padding: 0px;
  vertical-align: middle;
  width: 256px; /* fixed width for every column */
}
tr.text-row {
  height: 30px; /* height of text rows */
}
tr.image-row {
  height: 160px; /* height of image rows */
}
/* default size for images inside tables */
img {
  width: 256px;
  height: 160px;
  object-fit: cover;
}
/* applies only to vbench.png */
.vbench-img {
  width: 785px !important;
  height: 698px !important;
  object-fit: contain; /* show the whole image without cropping */
}
</style>

| Model | I2V Subject | I2V Background | Camera Motion | Subject Consistency | Background Consistency | Motion Smoothness | Dynamic Degree | Aesthetic Quality | Imaging Quality | Total Score |
|-------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| CogVideoSFT       | 97.67% | 98.76% | 84.93% | 95.47% | 98.30% | 98.35% | 36.51% | 59.76% | 67.64% | 87.98% |
| CogVideoX-I2V-5B  | 98.87% | 99.08% | 76.25% | 96.99% | 99.02% | 98.85% | 21.79% | 60.76% | 69.53% | 88.21% |
| Step-Video-TI2V   | 97.44% | 98.45% | 48.15% | 95.62% | 96.92% | 99.08% | 48.78% | 61.74% | 70.17% | 87.98% |
| HunYuan           | -      | -      | -      | -      | 93.85% | 99.39% | -      | -      | -      | -      |
| Wan-2.1-14B       | -      | -      | -      | -      | 98.46% | 96.07% | -      | -      | -      | -      |
| Animate-Anything  | 98.76% | 98.58% | 13.08% | 98.90% | 98.19% | 98.61% | 2.68%  | 67.12% | 72.09% | 86.48% |
| SEINE-512         | 97.15% | 96.94% | 20.97% | 95.28% | 97.12% | 97.12% | 27.07% | 64.55% | 71.39% | 85.52% |
| I2VGen-XL         | 96.48% | 96.83% | 18.46% | 95.45% | 96.42% | 98.03% | 24.08% | 64.82% | 69.14% | 85.28% |
| ConsistI2V        | 95.82% | 95.95% | 33.92% | 95.27% | 94.38% | 97.38% | 18.62% | 59.00% | 66.92% | 84.91% |
| DynamiCrafter-512 | 97.05% | 97.56% | 20.92% | 94.74% | 98.29% | 97.83% | 40.57% | 58.71% | 62.28% | 85.25% |
| Hummingbird-I2V   | 96.30% | 96.39% | 12.69% | 97.10% | 98.60% | 98.24% | 62.60% | 64.45% | 69.27% | 87.05% |
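
For readers who want to compare models on a single dimension, the sketch below parses a Markdown results table and ranks the rows by a chosen column. It is an illustrative helper, not part of the Hummingbird-I2V release; the function names are hypothetical, and `RESULTS_MD` copies only the dynamic-degree and total-score values for a subset of the models reported above.

```python
# Illustrative helper; not part of the Hummingbird-I2V release.
# RESULTS_MD copies a subset of rows and two columns from the results table.
RESULTS_MD = """
| Model | Dynamic Degree | Total Score |
|---|---|---|
| CogVideoSFT | 36.51% | 87.98% |
| HunYuan | - | - |
| Animate-Anything | 2.68% | 86.48% |
| SEINE-512 | 27.07% | 85.52% |
| I2VGen-XL | 24.08% | 85.28% |
| ConsistI2V | 18.62% | 84.91% |
| DynamiCrafter-512 | 40.57% | 85.25% |
| Hummingbird-I2V | 62.60% | 87.05% |
"""


def parse_table(md):
    """Turn a pipe-delimited Markdown table into a list of {header: cell} dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in md.strip().splitlines()
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator line
    return [dict(zip(header, row)) for row in body]


def to_float(cell):
    """Convert '62.60%' to 62.6; return None for missing entries such as '-'."""
    return float(cell.rstrip("%")) if cell.endswith("%") else None


def rank(records, column):
    """Sort models by a column, highest first, skipping rows with no reported value."""
    scored = [(r["Model"], to_float(r[column])) for r in records]
    scored = [(model, value) for model, value in scored if value is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    records = parse_table(RESULTS_MD)
    for model, value in rank(records, "Dynamic Degree"):
        print(f"{model:<20} {value:6.2f}%")
```

Rows with missing values (reported as `-`) are dropped from the ranking rather than treated as zero, so partially evaluated models do not distort the ordering.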