Add GitHub repository link to model card
This PR improves the model card by adding a direct link to the original GitHub repository, making it easier for users to find the source code.
README.md
CHANGED
@@ -1,20 +1,21 @@
 ---
+base_model:
+- Wan-AI/Wan2.1-T2V-14B
+library_name: diffusers
 license: mit
-task_categories:
-- image-to-video
-- text-to-video
+pipeline_tag: image-to-video
 tags:
 - video-generation
 - video diffusion transformer
 - audio-driven avatar animation
-base_model:
-- Wan-AI/Wan2.1-T2V-14B
-pipeline_tag: image-to-video
-library_name: diffusers
+task_categories:
+- image-to-video
+- text-to-video
 ---
+
 # StableAvatar

-<a href='https://francis-rings.github.io/StableAvatar'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2508.08248'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/FrancisRing/StableAvatar/tree/main'><img src='https://img.shields.io/badge/HuggingFace-Model-orange'></a> <a href='https://www.youtube.com/watch?v=6lhvmbzvv3Y'><img src='https://img.shields.io/badge/YouTube-Watch-red?style=flat-square&logo=youtube'></a> <a href='https://www.bilibili.com/video/BV1hUt9z4EoQ'><img src='https://img.shields.io/badge/Bilibili-Watch-blue?style=flat-square&logo=bilibili'></a>
+<a href='https://francis-rings.github.io/StableAvatar'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2508.08248'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://github.com/Francis-Rings/StableAvatar'><img src='https://img.shields.io/badge/GitHub-Code-blue?logo=github'></a> <a href='https://huggingface.co/FrancisRing/StableAvatar/tree/main'><img src='https://img.shields.io/badge/HuggingFace-Model-orange'></a> <a href='https://www.youtube.com/watch?v=6lhvmbzvv3Y'><img src='https://img.shields.io/badge/YouTube-Watch-red?style=flat-square&logo=youtube'></a> <a href='https://www.bilibili.com/video/BV1hUt9z4EoQ'><img src='https://img.shields.io/badge/Bilibili-Watch-blue?style=flat-square&logo=bilibili'></a>

 StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation
 <br/>
@@ -78,10 +79,13 @@ StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation
 </p>

 Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation.
-We observe that the main reason preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then directly injected into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across video clips, leading the latent distribution of subsequent segments to drift away from the optimal distribution gradually.
-To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism to further enhance the audio synchronization by leveraging the diffusion’s own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of the infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latent over time. Experiments on benchmarks show the effectiveness of StableAvatar both qualitatively and quantitatively.
+We observe that the main reason preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then directly injected into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across video clips, leading the latent distribution of subsequent segments to drift away from the optimal distribution gradually. To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism to further enhance the audio synchronization by leveraging the diffusion’s own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of the infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latent over time. Experiments on benchmarks show the effectiveness of StableAvatar both qualitatively and quantitatively.

 ## News
+* `[2025-8-16]`:🔥 We release the finetuning codes and lora training/finetuning codes! Other codes will be public as soon as possible. Stay tuned!
+* `[2025-8-15]`:🔥 StableAvatar can run on Gradio Interface. Thanks @[gluttony-10](https://space.bilibili.com/893892) for the contribution!
+* `[2025-8-15]`:🔥 StableAvatar can run on [ComfyUI](https://github.com/smthemex/ComfyUI_StableAvatar). Thanks @[smthemex](https://github.com/smthemex) for the contribution.
+* `[2025-8-13]`:🔥 Added changes to run StableAvatar on the new Blackwell series Nvidia chips, including the RTX 6000 Pro.
 * `[2025-8-11]`:🔥 The project page, code, technical report and [a basic model checkpoint](https://huggingface.co/FrancisRing/StableAvatar/tree/main) are released. Further lora training codes, the evaluation dataset and StableAvatar-pro will be released very soon. Stay tuned!

 ## 🛠️ To-Do List
@@ -90,9 +94,9 @@ To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter t
 - [x] Data Pre-Processing Code (Audio Extraction)
 - [x] Data Pre-Processing Code (Vocal Separation)
 - [x] Training Code
-- [ ] Full Finetuning Code
-- [ ] Lora Training Code
-- [ ] Lora Finetuning Code
+- [x] Full Finetuning Code
+- [x] Lora Training Code
+- [x] Lora Finetuning Code
 - [ ] Inference Code with Audio Native Guidance
 - [ ] StableAvatar-pro

@@ -109,6 +113,15 @@ pip install -r requirements.txt
 pip install flash_attn
 ```

+### 🧱 Environment setup for Blackwell series chips
+
+```
+pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
+pip install -r requirements.txt
+# Optional to install flash_attn to accelerate attention computation
+pip install flash_attn
+```
+
 ### 🧱 Download weights
 If you encounter connection issues with Hugging Face, you can utilize the mirror endpoint by setting the environment variable: `export HF_ENDPOINT=https://hf-mirror.com`.
 Please download weights manually as follows:
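The manual download commands referenced by "Please download weights manually as follows:" fall outside this hunk. For orientation only, here is a minimal sketch using the Hugging Face CLI; the target directory `checkpoints` is an assumption based on paths used elsewhere in the card, not a command taken from the repository:
```
# Optional mirror if huggingface.co is hard to reach
export HF_ENDPOINT=https://hf-mirror.com
# Fetch the released StableAvatar weights into ./checkpoints (illustrative location)
huggingface-cli download FrancisRing/StableAvatar --local-dir ./checkpoints
```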
@@ -152,7 +165,7 @@ python audio_extractor.py --video_path="path/test/video.mp4" --saved_audio_path=
 As noisy background music may negatively impact the performance of StableAvatar to some extent, you can further separate the vocal from the audio file for better lip synchronization.
 Given the path to an audio file (.wav), you can run the following command to extract the corresponding vocal signals:
 ```
-pip install audio-separator
+pip install audio-separator[gpu]
 python vocal_seperator.py --audio_separator_model_file="path/StableAvatar/checkpoints/Kim_Vocal_2.onnx" --audio_file_path="path/test/audio.wav" --saved_vocal_path="path/test/vocal.wav"
 ```

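A small practical note on the new install line: some shells (zsh in particular) treat the square brackets in `audio-separator[gpu]` as a glob pattern, so quoting the requirement is the safer habit:
```
# Quoting avoids shell glob expansion of the [gpu] extra
pip install "audio-separator[gpu]"
```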
@@ -169,6 +182,11 @@ Prompts are also very important. It is recommended to `[Description of first fra
 Notably, the recommended `--sample_steps` range is [30-50]; more steps bring higher quality. The recommended `--overlap_window_length` range is [5-15], as a longer overlap window results in higher quality but slower inference.
 `--sample_text_guide_scale` and `--sample_audio_guide_scale` are the Classifier-Free Guidance scales for the text prompt and the audio. The recommended range for both is `[3-6]`. You can increase the audio cfg to strengthen lip synchronization with the audio.

+Additionally, you can run the following command to launch a Gradio interface:
+```
+python app.py
+```
+
 We provide 6 cases in different resolution settings in `path/StableAvatar/examples` for validation. ❤️❤️Please feel free to try it out and enjoy the endless entertainment of infinite-length avatar video generation❤️❤️!

 #### 💡Tips
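To make the recommended ranges above concrete, the sketch below shows one plausible flag combination. Only the four flag names are taken from the card; the values are mid-range picks, and the idea that they are passed through `inference.sh` is an assumption:
```
# Illustrative values only; adjust the corresponding flags inside inference.sh
--sample_steps=40 \
--overlap_window_length=10 \
--sample_text_guide_scale=4.5 \
--sample_audio_guide_scale=5
```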
@@ -178,6 +196,11 @@ We provide 6 cases in different resolution settings in `path/StableAvatar/exampl
 Setting `--GPU_memory_mode` to `model_cpu_offload` can significantly cut GPU memory usage, reducing it by roughly half compared to `model_full_load` mode.

 - If you have multiple GPUs, you can run multi-GPU inference to speed things up by modifying `--ulysses_degree` and `--ring_degree` in `inference.sh`. For example, if you have 8 GPUs, you can set `--ulysses_degree=4` and `--ring_degree=2`. Notably, you have to ensure that ulysses_degree * ring_degree equals the total GPU number (world size). Moreover, you can also add `--fsdp_dit` in `inference.sh` to activate FSDP in the DiT to further reduce GPU memory consumption.
+You can run the following command:
+```
+bash multiple_gpu_inference.sh
+```
+In my setting, 4 GPUs are utilized for inference.

 The video synthesized by StableAvatar does not contain audio. If you want to obtain a high-quality MP4 file with audio, we recommend leveraging ffmpeg on the <b>output_path</b> as follows:
 ```
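The ffmpeg command itself is cut off by the hunk boundary. For orientation, a common way to mux the silent output with the driving audio is shown below; the file names are placeholders and this is not necessarily the exact command the authors recommend:
```
# Copy the video stream, encode the audio to AAC, and stop at the shorter input
ffmpeg -i path/test/output.mp4 -i path/test/audio.wav -c:v copy -c:a aac -shortest path/test/output_with_audio.mp4
```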
@@ -380,8 +403,34 @@ We utilize deepspeed stage-2 to train Wan2.1-14B-based StableAvatar. The GPU con
 The deepspeed optimization configuration and deepspeed scheduler configuration are in `path/StableAvatar/deepspeed_config/zero_stage2_config.json`.
 Notably, we observe that Wan2.1-1.3B-based StableAvatar is already capable of synthesizing infinite-length high-quality avatar videos. The Wan2.1-14B backbone significantly increases inference latency and GPU memory consumption during training, indicating limited efficiency in terms of performance-to-resource ratio.

+You can also run the following commands to perform lora training:
+```
+# Training StableAvatar-1.3B on a mixed resolution setting (480x832 and 832x480) on a single machine
+bash train_1B_rec_vec_lora.sh
+# Training StableAvatar-1.3B on a mixed resolution setting (480x832 and 832x480) on multiple machines
+bash train_1B_rec_vec_lora_64.sh
+# Lora-Training StableAvatar-14B on a mixed resolution setting (480x832, 832x480, and 512x512) on multiple machines
+bash train_14B_lora.sh
+```
+You can modify `--rank` and `--network_alpha` to control the quality of your lora training/finetuning.
+
 If you want to train 720P Wan2.1-1.3B-based or Wan2.1-14B-based StableAvatar, you can directly modify the height and width of the dataloader (480p-->720p) in `train_1B_square.py`/`train_1B_vec_rec.py`/`train_14B.py`.

+### 🧱 Model Finetuning
+Regarding fully finetuning StableAvatar, you can add `--transformer_path="path/StableAvatar/checkpoints/StableAvatar-1.3B/transformer3d-square.pt"` to `train_1B_rec_vec.sh` or `train_1B_rec_vec_64.sh`:
+```
+# Finetuning StableAvatar on a mixed resolution setting (480x832 and 832x480) on a single machine
+bash train_1B_rec_vec.sh
+# Finetuning StableAvatar on a mixed resolution setting (480x832 and 832x480) on multiple machines
+bash train_1B_rec_vec_64.sh
+```
+For lora finetuning of StableAvatar, you can add `--transformer_path="path/StableAvatar/checkpoints/StableAvatar-1.3B/transformer3d-square.pt"` to `train_1B_rec_vec_lora.sh`:
+```
+# Lora-Finetuning StableAvatar-1.3B on a mixed resolution setting (480x832 and 832x480) on a single machine
+bash train_1B_rec_vec_lora.sh
+```
+You can modify `--rank` and `--network_alpha` to control the quality of your lora training/finetuning.
+
 ### 🧱 VRAM requirement and Runtime

 For the 5s video (480x832, fps=25), the basic model (--GPU_memory_mode="model_full_load") requires approximately 18GB VRAM and finishes in 3 minutes on a 4090 GPU.
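As a concrete reading of the finetuning instructions above, the extra flag is simply appended to the launch command inside the corresponding script. In the sketch below, only `--transformer_path`, `--rank`, `--network_alpha`, and the quoted checkpoint path come from the diff; the line-continuation style and the example values are assumptions about how the script is organized:
```
# Hypothetical tail of train_1B_rec_vec_lora.sh: lora-finetune from the released 1.3B checkpoint
#   --transformer_path points at the pretrained transformer weights
#   --rank / --network_alpha control lora capacity and scaling
  --transformer_path="path/StableAvatar/checkpoints/StableAvatar-1.3B/transformer3d-square.pt" \
  --rank=64 \
  --network_alpha=32
```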