---
license: apache-2.0
---
<meta name="google-site-verification" content="-XQC-POJtlDPD3i2KSOxbFkSBde_Uq9obAIh_4mxTkM" />

<div align="center">

<h2><a href="https://www.arxiv.org/abs/2505.10238">MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation</a></h2>

> Official project page of **MTVCrafter**, a novel framework for general and high-quality human image animation using raw 3D motion sequences.

<!--
[Yanbo Ding](https://github.com/DINGYANB),
[Shaobin Zhuang](https://scholar.google.com/citations?user=PGaDirMAAAAJ&hl=zh-CN&oi=ao),
[Kunchang Li](https://scholar.google.com/citations?user=D4tLSbsAAAAJ),
[Zhengrong Yue](https://arxiv.org/search/?searchtype=author&query=Zhengrong%20Yue),
[Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ&hl),
[Yali Wang†](https://scholar.google.com/citations?user=hD948dkAAAAJ)
-->

[Paper](https://www.arxiv.org/abs/2505.10238)
[Code](https://github.com/DINGYANB/MTVCrafter)
[Models](https://huggingface.co/yanboding/)
[Project Page](https://dingyanb.github.io/MTVCtafter/)

</div>

## 🔍 Abstract

Human image animation has attracted increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information.
To tackle these problems, we propose **MTVCrafter (Motion Tokenization Video Crafter)**, the first framework that directly models raw 3D motion sequences for open-world human image animation, moving beyond intermediate 2D representations.

- We introduce **4DMoT (4D motion tokenizer)** to encode raw motion data into discrete motion tokens that preserve compact yet expressive 4D spatio-temporal information.
- We then propose **MV-DiT (Motion-aware Video DiT)**, which integrates a motion attention module and 4D positional encodings to effectively modulate vision tokens with motion tokens.
- The overall pipeline enables high-quality human video generation guided by 4D motion tokens.

MTVCrafter achieves **state-of-the-art results with an FID-VID of 6.98**, outperforming the second-best method by approximately **65%**. It generalizes well to diverse characters (single or multiple, full- or half-body) across various styles.
## 🎯 Motivation



Our motivation is that directly tokenizing 4D motion captures more faithful and expressive information than traditional 2D-rendered pose images derived from the driving video.

## 💡 Method



*(1) 4DMoT*:
Our 4D motion tokenizer consists of an encoder-decoder framework that learns spatio-temporal latent representations of SMPL motion sequences, and a vector quantizer that learns discrete tokens in a unified space.
All operations are performed in 2D space along the frame and joint axes.

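For intuition, here is a minimal, PyTorch-style sketch of what a vector-quantized motion tokenizer operating along the frame and joint axes can look like. It is **not** the released 4DMoT implementation (see `train_vqvae.py` for that): the class name, layer sizes, codebook size, and the omission of training details such as the straight-through estimator and commitment loss are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class Simple4DMotionTokenizerSketch(nn.Module):
    """Illustrative VQ tokenizer for SMPL motion of shape (B, T, J, 3).

    2D convolutions act jointly on the frame (T) and joint (J) axes,
    mirroring the description above. All sizes are placeholders.
    """

    def __init__(self, codebook_size=8192, dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(dim, 3, kernel_size=3, padding=1),
        )

    def forward(self, motion):                              # motion: (B, T, J, 3)
        x = motion.permute(0, 3, 1, 2)                      # -> (B, 3, T, J)
        z = self.encoder(x)                                 # -> (B, C, T, J)
        B, C, T, J = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)         # (B*T*J, C)
        dist = torch.cdist(flat, self.codebook.weight)      # distance to each code
        indices = dist.argmin(dim=1)                        # nearest-code ids
        quant = self.codebook(indices).view(B, T, J, C).permute(0, 3, 1, 2)
        recon = self.decoder(quant).permute(0, 2, 3, 1)     # back to (B, T, J, 3)
        return recon, indices.view(B, T, J)

# Example: tokenize a 49-frame sequence of 24 SMPL joints.
tokenizer = Simple4DMotionTokenizerSketch()
motion = torch.randn(1, 49, 24, 3)
recon, motion_tokens = tokenizer(motion)
print(motion_tokens.shape)  # torch.Size([1, 49, 24])
```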
*(2) MV-DiT*:
Built on a video DiT architecture, we design a 4D motion attention module to combine motion tokens with vision tokens.
Since tokenization and flattening disrupt positional information, we introduce 4D RoPE to recover the spatio-temporal relationships.
To further improve generation quality and generalization, we use learnable unconditional tokens for motion classifier-free guidance.
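The snippet below is a conceptual sketch of the last two ideas only: vision tokens attending to motion tokens, with learnable unconditional tokens standing in when the motion condition is dropped. It is not the MV-DiT code; the names (`MotionCrossAttentionSketch`, `uncond_tokens`, `motion_cfg`), dimensions, and the placement of the guidance step are assumptions, 4D RoPE is omitted, and in the real diffusion pipeline guidance would be applied to the denoiser's prediction rather than to a single attention block.

```python
import torch
import torch.nn as nn

class MotionCrossAttentionSketch(nn.Module):
    """Vision tokens attend to motion tokens; learnable tokens replace the
    motion condition when it is dropped (for classifier-free guidance)."""

    def __init__(self, dim=1024, num_motion_tokens=49 * 24, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.uncond_tokens = nn.Parameter(torch.zeros(1, num_motion_tokens, dim))

    def forward(self, vision_tokens, motion_tokens=None):
        # vision_tokens: (B, N_vis, dim); motion_tokens: (B, N_mot, dim) or None
        if motion_tokens is None:  # unconditional branch
            motion_tokens = self.uncond_tokens.expand(vision_tokens.size(0), -1, -1)
        out, _ = self.attn(vision_tokens, motion_tokens, motion_tokens)
        return vision_tokens + out  # residual update of the vision tokens

def motion_cfg(block, vision_tokens, motion_tokens, guidance_scale=3.0):
    """Conceptual classifier-free guidance on the motion condition."""
    cond = block(vision_tokens, motion_tokens)
    uncond = block(vision_tokens, None)
    return uncond + guidance_scale * (cond - uncond)
```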
---
## 🛠️ Installation

We recommend using a clean Python environment (Python 3.10+).

```bash
# Clone this repository
git clone https://github.com/DINGYANB/MTVCrafter.git && cd MTVCrafter

# Create virtual environment
conda create -n mtvcrafter python=3.11
conda activate mtvcrafter

# Install dependencies
pip install -r requirements.txt
```
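Before running the pipeline, it can help to verify that the GPU stack is usable. The check below assumes PyTorch is installed via `requirements.txt` (expected for a video DiT pipeline, though not listed explicitly here):

```python
# Quick environment sanity check (assumes PyTorch is pulled in by requirements.txt).
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```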
## 🚀 Usage

To animate a human image with a given 3D motion sequence,
you first need to obtain the SMPL motion sequences from the driving video:

```bash
python process_nlf.py "your_video_directory"
```
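If you want to check what the preprocessing step produced before running inference, you can inspect the resulting `.pkl` file. Its exact structure is defined by `process_nlf.py`, so the sketch below only prints whatever is stored rather than assuming specific keys:

```python
# Peek at a motion file produced by process_nlf.py (structure depends on that script).
import pickle

with open("data/sample_data.pkl", "rb") as f:
    motion_data = pickle.load(f)

print(type(motion_data))
if isinstance(motion_data, dict):
    for key, value in motion_data.items():
        shape = getattr(value, "shape", None)
        print(key, shape if shape is not None else type(value))
```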
Then, you can use the following command to animate the image guided by 4D motion tokens:

```bash
python infer.py --ref_image_path "ref_images/hunam.png" --motion_data_path "data/sample_data.pkl" --output_path "inference_output"
```

- `--ref_image_path`: Path to the reference character image.
- `--motion_data_path`: Path to the driving motion sequence (`.pkl` format).
- `--output_path`: Directory where the generated animation results are saved.

For our 4DMoT, you can run the following command to train the tokenizer on your own dataset:

```bash
accelerate launch train_vqvae.py
```
## 📄 Citation

If you find our work useful, please consider citing:

```bibtex
@article{ding2025mtvcrafter,
  title={MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation},
  author={Ding, Yanbo},
  journal={arXiv preprint arXiv:2505.10238},
  year={2025}
}
```
## 📬 Contact

For questions or collaboration, feel free to reach out via GitHub Issues
or email me at 📧 [email protected].