base_model:
- Wan-AI/Wan2.1-T2V-14B
pipeline_tag: text-to-video
license: apache-2.0
Stand-In is a lightweight, plug-and-play framework for identity-preserving video generation. By training only about 1% additional parameters on top of the base video generation model, we achieve state-of-the-art results in both face similarity and naturalness, outperforming various full-parameter training methods. Moreover, Stand-In can be seamlessly integrated into other tasks such as subject-driven video generation, pose-controlled video generation, video stylization, and face swapping.
🔥 News
- [2025.08.13] Special thanks to @kijai for integrating Stand-In into the custom ComfyUI node WanVideoWrapper. However, that implementation differs from the official version, which may affect Stand-In's performance. To partially mitigate this issue, we have urgently released the official Stand-In preprocessing ComfyUI node: 👉 https://github.com/WeChatCV/Stand-In_Preprocessor_ComfyUI. If you wish to experience Stand-In within ComfyUI, please use our official preprocessing node in place of the one implemented by kijai. For the best results, we recommend waiting for the release of our full official Stand-In ComfyUI integration.
- [2025.08.12] Released Stand-In v1.0 (153M parameters). The Wan2.1-T2V-14B-adapted weights and inference code are now open-source.
🌟 Showcase
Identity-Preserving Text-to-Video Generation
Non-Human Subject-Preserving Video Generation
Identity-Preserving Stylized Video Generation
Video Face Swapping
Pose-Guided Video Generation (With VACE)
For more results, please visit https://stand-in-video.github.io/
📖 Key Features
- Efficient Training: Only 1% of the base model parameters need to be trained.
- High Fidelity: Outstanding identity consistency without sacrificing video generation quality.
- Plug-and-Play: Easily integrates into existing T2V (Text-to-Video) models.
- Highly Extensible: Compatible with community models such as LoRA, and supports various downstream video tasks.
✅ Todo List
- Release IP2V inference script (compatible with community LoRA).
- Open-source model weights compatible with Wan2.1-T2V-14B: Stand-In_Wan2.1-T2V-14B_153M_v1.0
- Open-source model weights compatible with Wan2.2-T2V-A14B.
- Release training dataset, data preprocessing scripts, and training code.
🚀 Quick Start
1. Environment Setup
# Clone the project repository
git clone https://github.com/WeChatCV/Stand-In.git
cd Stand-In
# Create and activate Conda environment
conda create -n Stand-In python=3.11 -y
conda activate Stand-In
# Install dependencies
pip install -r requirements.txt
# (Optional) Install Flash Attention for faster inference
# Note: Make sure your GPU and CUDA version are compatible with Flash Attention
pip install flash-attn --no-build-isolation
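After installing the dependencies, a quick sanity check can save time before downloading the large model weights. The two one-liners below are an optional check, not part of the official setup; they only verify that PyTorch can see a CUDA device and that flash-attn (if you installed it) imports cleanly.
# (Optional) Verify that PyTorch detects the GPU and that flash-attn imports
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import flash_attn; print('flash-attn import OK')"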
2. Model Download
We provide an automatic download script that will fetch all required model weights into the checkpoints directory.
python download_models.py
This script will download the following models:
- wan2.1-T2V-14B (base text-to-video model)
- antelopev2 (face recognition model)
- Stand-In (our Stand-In model)
Note: If you already have the wan2.1-T2V-14B model locally, you can manually edit the download_models.py script to comment out the relevant download code and place the model in the checkpoints/wan2.1-T2V-14B directory.
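After the script finishes, the checkpoints directory should contain the three models listed above. Only the checkpoints/wan2.1-T2V-14B path is stated explicitly here, so treat the other folder names in this sketch as illustrative; the exact names are whatever download_models.py creates.
checkpoints/
  wan2.1-T2V-14B/   # base text-to-video model (path referenced above)
  antelopev2/       # face recognition model (folder name illustrative)
  Stand-In/         # Stand-In weights (folder name illustrative)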
🧪 Usage
Standard Inference
Use the infer.py script for standard identity-preserving text-to-video generation.
python infer.py \
--prompt "A man sits comfortably at a desk, facing the camera as if talking to a friend or family member on the screen. His gaze is focused and gentle, with a natural smile. The background is his carefully decorated personal space, with photos and a world map on the wall, conveying a sense of intimate and modern communication." \
--ip_image "test/input/lecun.jpg" \
--output "test/output/lecun.mp4"
Prompt Writing Tip: If you do not wish to alter the subject's facial features, simply use "a man" or "a woman" without adding extra descriptions of their appearance. Prompts support both Chinese and English input. The prompt is intended for generating frontal, medium-to-close-up videos.
Input Image Recommendation: For best results, use a high-resolution frontal face image. There are no restrictions on resolution or file extension, as our built-in preprocessing pipeline will handle them automatically.
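infer.py takes a single reference image per run, so generating videos for several identities is just a shell loop over the documented flags. The sketch below assumes the reference images are .jpg files under test/input/ and reuses one prompt for all of them.
# Sketch: run the same prompt for every reference image in test/input/
for img in test/input/*.jpg; do
    name=$(basename "$img" .jpg)
    python infer.py \
        --prompt "A man sits comfortably at a desk, facing the camera with a natural smile." \
        --ip_image "$img" \
        --output "test/output/${name}.mp4"
done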
Inference with Community LoRA
Use the infer_with_lora.py script to load one or more community LoRA models alongside Stand-In.
python infer_with_lora.py \
--prompt "A man sits comfortably at a desk, facing the camera as if talking to a friend or family member on the screen. His gaze is focused and gentle, with a natural smile. The background is his carefully decorated personal space, with photos and a world map on the wall, conveying a sense of intimate and modern communication." \
--ip_image "test/input/lecun.jpg" \
--output "test/output/lecun.mp4" \
--lora_path "path/to/your/lora.safetensors" \
--lora_scale 1.0
We recommend using this stylization LoRA: https://civitai.com/models/1404755/studio-ghibli-wan21-t2v-14b
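As a usage example, the command below pairs that Ghibli-style LoRA with Stand-In. The LoRA file name is a placeholder for wherever you saved the download, and the reduced --lora_scale of 0.8 is an illustrative value; lowering the scale is the usual knob if the style starts to override the reference identity.
python infer_with_lora.py \
    --prompt "A man sits at a desk facing the camera, rendered in a soft hand-drawn animation style." \
    --ip_image "test/input/lecun.jpg" \
    --output "test/output/lecun_ghibli.mp4" \
    --lora_path "path/to/studio_ghibli_wan21_t2v_14b.safetensors" \
    --lora_scale 0.8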
🤝 Acknowledgements
This project is built upon the following excellent open-source projects:
- DiffSynth-Studio (training/inference framework)
- Wan2.1 (base video generation model)
We sincerely thank the authors and contributors of these projects.
✏ Citation
If you find our work helpful for your research, please consider citing our paper:
@article{xue2025standin,
title={Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation},
author={Bowen Xue and Qixin Yan and Wenjing Wang and Hao Liu and Chen Li},
journal={arXiv preprint arXiv:2508.07901},
year={2025},
}
📬 Contact Us
If you have any questions or suggestions, feel free to reach out via GitHub Issues. We look forward to your feedback!