weijiangchuan committed
Commit 712663f · 0 Parent(s)

initial commit
.gitattributes ADDED
@@ -0,0 +1,48 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ transformer/diffusion_pytorch_model-00001-of-00002.safetensors filter=lfs diff=lfs merge=lfs -text
+ transformer/diffusion_pytorch_model-00002-of-00002.safetensors filter=lfs diff=lfs merge=lfs -text
+ asset/examples/3.gif filter=lfs diff=lfs merge=lfs -text
+ asset/examples/4.gif filter=lfs diff=lfs merge=lfs -text
+ asset/examples/2.gif filter=lfs diff=lfs merge=lfs -text
+ asset/examples/ filter=lfs diff=lfs merge=lfs -text
+ asset/examples/5.gif filter=lfs diff=lfs merge=lfs -text
+ asset/examples/6.gif filter=lfs diff=lfs merge=lfs -text
+ asset/examples/7.gif filter=lfs diff=lfs merge=lfs -text
+ asset/examples/8.gif filter=lfs diff=lfs merge=lfs -text
+ asset/examples/framework.jpg filter=lfs diff=lfs merge=lfs -text
+ asset/examples/IITF.jpg filter=lfs diff=lfs merge=lfs -text
+ asset/examples/1.gif filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1 @@
+ The model weights of EchoVideo are licensed under CC BY-NC 4.0.
README.md ADDED
@@ -0,0 +1,99 @@
+ ---
+ license: other
+ license_link: https://huggingface.co/bytedance-research/EchoVideo/blob/main/LICENSE
+ language:
+ - en
+ tags:
+ - EchoVideo
+ - video-generation
+ - id-preserving
+ ---
+
+ # EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion
+
+ This repo contains PyTorch model definitions, pre-trained weights, and inference code for our video generation model, EchoVideo.
+ > [**EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion**](https://arxiv.org/abs/2501.13452) <br>
+
+ # News
+
+ **[2025.02.27]** We release the inference code and model weights of EchoVideo.
+
+ # Introduction
+
+ EchoVideo generates a personalized video from a single photo and a text description. It excels at addressing the "semantic conflict" and "copy-paste" problems and demonstrates state-of-the-art performance.
+
+
+ # Gallery
+ ## 1. Text-to-Video Generation
+ | Face-ID Preserving | Full-Body Preserving |
+ | ---- | ---- |
+ | <img height="300" src="asset/examples/3.gif" > | <img height="300" src="asset/examples/4.gif" > |
+
+ ## 2. Comparisons
+ | EchoVideo | ConsisID | IDAnimator |
+ | ---- | ---- | ---- |
+ | <img height="240" src="asset/examples/2.gif" > | <img height="240" src="asset/examples/5.gif" > | <img height="240" src="asset/examples/6.gif" > |
+ | <img height="240" src="asset/examples/1.gif" > | <img height="240" src="asset/examples/7.gif" > | <img height="240" src="asset/examples/8.gif" > |
+
+
+ # Usage
+ **Requires Python 3.10–3.12 (inclusive). Both GPU and NPU are supported.**
+
+ ## Clone the Repository
+ ```shell
+ git clone https://github.com/bytedance/EchoVideo
+ cd EchoVideo
+ ```
+
+ ## Installation
+ ```shell
+ pip install -r requirements.txt
+ ```
+ ## Download Pretrained Weights
+ Details on downloading the pretrained models are given [here](https://github.com/bytedance/EchoVideo/ckpts/README.md).
+ ## Run Demo
+ ```shell
+ # multi-resolution video generation [(480, 640), (480, 848), (480, 480), (848, 480), (640, 480)]
+ python infer.py
+ ```
+
+ # Methods
+ ## **Overall Architecture**
+ <p align="center">
+ <img src="asset/examples/framework.jpg" height=350>
+ </p>
+
+ Overall architecture of EchoVideo. By employing the carefully designed IITF module and mitigating over-reliance on the input image, our model unifies the semantic information of the input facial image and the textual prompt. This enables the generation of consistent characters with multi-view facial coherence, so the synthesized outputs maintain both visual and semantic fidelity across diverse perspectives.
+
+ ## **Key Features**
+ <p align="center">
+ <img src="asset/examples/IITF.jpg" height=350>
+ </p>
+
+
+ Illustration of facial information injection methods. (a) IITF: facial and textual information are fused to provide consistent guidance throughout the generation process. IITF builds a semantic bridge between the facial and textual modalities and coordinates their influence on character features, ensuring that the generated characters remain consistent. It consists of two core components: facial feature alignment and conditional feature alignment. (b) Dual branch: facial and textual information are injected independently through cross-attention, providing separate guidance for the generation process.
+
+ ## Benchmark
+
+ | Model | Identity Average↑ | Identity Variation↓ | Inception Distance↓ | Dynamic Degree↑ |
+ | -- | -- | -- | -- | -- |
+ | IDAnimator | 0.349 | **0.032** | **159.11** | 0.280 |
+ | ConsisID | <u>0.414</u> | 0.094 | 200.40 | 0.871 |
+ | Pika | 0.329 | 0.091 | 268.35 | <u>0.954</u> |
+ | Ours | **0.516** | <u>0.075</u> | <u>176.53</u> | **0.955** |
+
+ # Acknowledgements
+ * [CogVideo](https://huggingface.co/THUDM/CogVideoX-5b): The DiT module we adapted and the VAE module we used. [MODEL LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE)
+ * [SigLip](https://huggingface.co/google/siglip-base-patch16-224): The vision encoder we used.
+
+
+ # BibTeX
+ If you find our work useful in your research, please consider citing the paper:
+ ```bibtex
+ @article{wei2025echovideo,
+   title={EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion},
+   author={Wei, Jiangchuan and Yan, Shiyue and Lin, Wenfeng and Liu, Boyuan and Chen, Renjie and Guo, Mingyu},
+   journal={arXiv preprint arXiv:2501.13452},
+   year={2025}
+ }
+ ```
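
The README's "Download Pretrained Weights" step defers to the GitHub instructions. As a minimal sketch (not the repository's official workflow), this Hugging Face checkpoint can also be fetched directly with `huggingface_hub`; the `ckpts/` target directory is an assumption taken from the `ckpts/README.md` link above and the `_name_or_path` in the transformer config.

```python
# Minimal sketch: download this checkpoint for use with the GitHub inference code.
# Assumption: infer.py expects the weights under ./ckpts (directory name from the README link).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="bytedance-research/EchoVideo",  # repo id from the license_link in the front matter
    local_dir="ckpts",                       # assumed target directory
)
print(f"Checkpoint downloaded to: {local_dir}")

# Then, from the cloned GitHub repository:
#   python infer.py
```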
asset/examples/1.gif ADDED

Git LFS Details

  • SHA256: 1a78f5b5ffaa63138d92fceeb9e36c36c2712eef898fcf9b0b40ceaa8bb79e93
  • Pointer size: 132 Bytes
  • Size of remote file: 9.85 MB
asset/examples/2.gif ADDED

Git LFS Details

  • SHA256: 37af34fd8ed3a1d525bd353f4258793c33d2719ea3f355a4b8d1fec809ba4ee3
  • Pointer size: 132 Bytes
  • Size of remote file: 9.31 MB
asset/examples/3.gif ADDED

Git LFS Details

  • SHA256: ac56c2724a7166a1892cfc0003f395b955e0bfd482653ad11da82b8998f75531
  • Pointer size: 132 Bytes
  • Size of remote file: 7.85 MB
asset/examples/4.gif ADDED

Git LFS Details

  • SHA256: f05b04158c3c6c01d7b6e902097c94d59d9c0fc46c776a2a5e7a77b5c2fca380
  • Pointer size: 132 Bytes
  • Size of remote file: 8.36 MB
asset/examples/5.gif ADDED

Git LFS Details

  • SHA256: 0f67756626ee85135d759047ff86ce35c692cefeecb429c21c755786c58f7cb0
  • Pointer size: 132 Bytes
  • Size of remote file: 7.27 MB
asset/examples/6.gif ADDED

Git LFS Details

  • SHA256: aec6ee4c502550bb35035f5dcedfbeab2bd0e0820f86e0c7b93faff102eba0d0
  • Pointer size: 132 Bytes
  • Size of remote file: 2.47 MB
asset/examples/7.gif ADDED

Git LFS Details

  • SHA256: ec2f4b578edaf3418989a740802ff5d805e2dc8ca9280f3fd2a16c2f6428b81a
  • Pointer size: 132 Bytes
  • Size of remote file: 8.04 MB
asset/examples/8.gif ADDED

Git LFS Details

  • SHA256: 53a2639f5b47c2297cff3913fbf6d60c389d383c68d23c66a6210c924fd9bcc9
  • Pointer size: 132 Bytes
  • Size of remote file: 2.6 MB
asset/examples/IITF.jpg ADDED

Git LFS Details

  • SHA256: 3af76806fb4c251ae4a8226211e3f0a77430acd2f0fae6edacb546ad0748edcb
  • Pointer size: 131 Bytes
  • Size of remote file: 933 kB
asset/examples/framework.jpg ADDED

Git LFS Details

  • SHA256: d2c916de69a21146c2a55cd98d788296d6d6239d2b6e4f226717bb4c9e9ed9b0
  • Pointer size: 131 Bytes
  • Size of remote file: 663 kB
configuration.json ADDED
@@ -0,0 +1 @@
+ {"framework":"Pytorch","task":"image-to-video"}
model_index.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "_class_name": "EchoVideoPipeline",
+   "_diffusers_version": "0.31.0.dev0",
+   "scheduler": [
+     "diffusers",
+     "CogVideoXDPMScheduler"
+   ],
+   "text_encoder": [
+     "transformers",
+     "T5EncoderModel"
+   ],
+   "tokenizer": [
+     "transformers",
+     "T5Tokenizer"
+   ],
+   "transformer": [
+     "models.echovideo_transformer_3d",
+     "EchoVideoLDM"
+   ],
+   "vae": [
+     "diffusers",
+     "AutoencoderKLCogVideoX"
+   ]
+ }
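
`model_index.json` maps each pipeline component to the library and class that implements it. The sketch below shows how the off-the-shelf components could be instantiated with their standard `from_pretrained(..., subfolder=...)` calls; it assumes the listed subfolders are present in the full checkpoint (this initial commit only ships `scheduler/` and `transformer/`), and it does not load the custom `EchoVideoLDM` transformer, which is defined in `models/echovideo_transformer_3d.py` of the GitHub repository and presumably handled by its own inference code.

```python
# Illustrative sketch of how the components named in model_index.json map to classes.
# Assumptions: tokenizer/, text_encoder/, and vae/ subfolders exist in the full checkpoint;
# only scheduler/ and transformer/ are present in this initial commit.
import torch
from transformers import T5Tokenizer, T5EncoderModel
from diffusers import AutoencoderKLCogVideoX, CogVideoXDPMScheduler

repo = "bytedance-research/EchoVideo"

tokenizer = T5Tokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = T5EncoderModel.from_pretrained(repo, subfolder="text_encoder", torch_dtype=torch.bfloat16)
vae = AutoencoderKLCogVideoX.from_pretrained(repo, subfolder="vae", torch_dtype=torch.bfloat16)
scheduler = CogVideoXDPMScheduler.from_pretrained(repo, subfolder="scheduler")
# transformer: EchoVideoLDM (models/echovideo_transformer_3d.py in the GitHub repo),
# loaded from the transformer/ subfolder by the repository's inference code.
```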
scheduler/scheduler_config.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "_class_name": "CogVideoXDPMScheduler",
+   "_diffusers_version": "0.31.0.dev0",
+   "beta_end": 0.012,
+   "beta_schedule": "scaled_linear",
+   "beta_start": 0.00085,
+   "clip_sample": false,
+   "clip_sample_range": 1.0,
+   "num_train_timesteps": 1000,
+   "prediction_type": "v_prediction",
+   "rescale_betas_zero_snr": true,
+   "sample_max_value": 1.0,
+   "set_alpha_to_one": true,
+   "snr_shift_scale": 1.0,
+   "steps_offset": 0,
+   "timestep_spacing": "trailing",
+   "trained_betas": null
+ }
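
The scheduler is a stock `diffusers` class, so the config above is enough to reconstruct it. A small sketch using `from_config` with the values shown, highlighting the settings that drive sampling (v-prediction, zero-terminal-SNR rescaling, trailing timestep spacing):

```python
# Small sketch: instantiate the scheduler directly from the config shown above.
# ConfigMixin.from_config accepts a plain dict; omitted keys fall back to class defaults.
from diffusers import CogVideoXDPMScheduler

config = {
    "num_train_timesteps": 1000,
    "beta_start": 0.00085,
    "beta_end": 0.012,
    "beta_schedule": "scaled_linear",
    "prediction_type": "v_prediction",
    "rescale_betas_zero_snr": True,
    "timestep_spacing": "trailing",
    "snr_shift_scale": 1.0,
    "set_alpha_to_one": True,
    "steps_offset": 0,
    "clip_sample": False,
    "sample_max_value": 1.0,
    "trained_betas": None,
}
scheduler = CogVideoXDPMScheduler.from_config(config)
print(scheduler.config.prediction_type)        # "v_prediction"
print(scheduler.config.rescale_betas_zero_snr) # True
```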
transformer/config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "_class_name": "EchoVideoLDM",
+   "_diffusers_version": "0.31.0",
+   "_name_or_path": "EchoVideo/ckpts",
+   "activation_fn": "gelu-approximate",
+   "attention_bias": true,
+   "attention_head_dim": 64,
+   "dropout": 0.0,
+   "face_embed_dim": 768,
+   "face_features_embed_dim": 512,
+   "face_features_seq_length": 144,
+   "face_seq_length": 196,
+   "flip_sin_to_cos": true,
+   "freq_shift": 0,
+   "in_channels": 32,
+   "max_text_seq_length": 226,
+   "norm_elementwise_affine": true,
+   "norm_eps": 1e-05,
+   "num_attention_heads": 48,
+   "num_layers": 42,
+   "out_channels": 16,
+   "patch_size": 2,
+   "sample_frames": 49,
+   "sample_height": 60,
+   "sample_width": 90,
+   "spatial_interpolation_scale": 1.875,
+   "temporal_compression_ratio": 4,
+   "temporal_interpolation_scale": 1.0,
+   "text_embed_dim": 4096,
+   "time_embed_dim": 512,
+   "timestep_activation_fn": "silu",
+   "use_rotary_positional_embeddings": true
+ }
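
The transformer config fixes the sequence lengths the DiT operates on. A back-of-the-envelope sketch of that arithmetic, assuming the CogVideoX convention that the pixel frame count maps to `(frames - 1) / temporal_compression_ratio + 1` latent frames carries over to EchoVideo:

```python
# Rough sequence-length arithmetic from transformer/config.json.
# Assumption: EchoVideo follows the CogVideoX latent-frame convention.
sample_frames = 49
temporal_compression_ratio = 4
sample_height, sample_width = 60, 90   # latent spatial size
patch_size = 2

latent_frames = (sample_frames - 1) // temporal_compression_ratio + 1             # 13
patches_per_frame = (sample_height // patch_size) * (sample_width // patch_size)  # 30 * 45 = 1350
video_tokens = latent_frames * patches_per_frame                                  # 17550

text_tokens = 226          # max_text_seq_length
face_tokens = 196          # face_seq_length
face_feature_tokens = 144  # face_features_seq_length
inner_dim = 48 * 64        # num_attention_heads * attention_head_dim = 3072

print(video_tokens, text_tokens + face_tokens + face_feature_tokens, inner_dim)
```

Under these assumptions the video branch alone contributes 17,550 tokens per sample, alongside 226 text, 196 face, and 144 face-feature tokens.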
transformer/diffusion_pytorch_model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a6d0c109ec3c1d8a11f096a79c1b9cc06442f510b6f2fd65ab097dbecc8c78bd
+ size 9925735424
transformer/diffusion_pytorch_model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9d2bb9a5d79cb17533bb26d0d686d6d38471c039ae2c298026e1abb4b7001c4c
+ size 1316905404
transformer/diffusion_pytorch_model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff