Self-Forcing2.1-T2V-1.3B-GGUF

📄 Self-Forcing    |    🧬 Wan2.1    |    🤖 GGUF


Developed by Nichonauta.

This repository contains the quantized versions in GGUF format of the Self-Forcing video generation model.

The Self-Forcing model is an evolution of Wan2.1-T2V-1.3B, optimized with the "self-forcing" technique, in which the model learns to compensate for its own generation errors. This enables real-time generation of more coherent, higher-quality videos.

These GGUF files allow the model to be run efficiently on GPU/CPU, drastically reducing VRAM consumption and making video generation accessible without the need for high-end GPUs.

✨ Key Features

  • ⚡️ GPU/CPU Inference: Thanks to the GGUF format, the model can run on a wide range of hardware with optimized performance.
  • 🧠 Self-Forcing Technique: The model learns from its own predictions during generation to improve temporal consistency and visual quality of the video.
  • 🖼️ Image-guided Generation: Ability to generate smooth video transitions between a start and an end image, guided by a text prompt.
  • 📉 Low Memory Consumption: Quantization significantly reduces the RAM/VRAM footprint compared to the original FP16/FP32 weights.
  • 🧬 Based on a Solid Architecture: It inherits the strong foundation of the Wan2.1-T2V-1.3B model, known for its efficiency and quality.

Usage

The model files can be used in ComfyUI with the ComfyUI-GGUF custom node.
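A typical workflow is to download one of the .gguf files from this repository, place it where ComfyUI-GGUF looks for diffusion model weights (usually ComfyUI/models/unet/), and load it with the GGUF UNet loader node. The snippet below is a minimal sketch using huggingface_hub; the filename shown is a hypothetical placeholder, so replace it with one of the actual files in this repository.

```python
# Download a GGUF file into a local ComfyUI installation
# (requires `pip install huggingface_hub`).
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Nichonauta/Self-Forcing2.1-T2V-1.3B-GGUF",
    filename="Self-Forcing2.1-T2V-1.3B-Q4_K_M.gguf",  # hypothetical name; pick a real file from the repo
    local_dir="ComfyUI/models/unet",                   # where ComfyUI-GGUF typically looks for UNet weights
)
print("Saved to:", path)
```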


🧐 What is GGUF?

GGUF is a binary file format designed to store large language models (and, increasingly, other architectures) for fast loading and efficient inference. Its key advantages are:

  • Fast Loading: Weights are memory-mapped directly from disk, with no complex deserialization step.
  • Quantization: Model weights can be stored at reduced precision (e.g., 4 or 8 bits instead of 16 or 32), which shrinks file size and RAM/VRAM usage (see the rough size estimate after this list).
  • GPU/CPU Execution: The format is supported by libraries such as llama.cpp, which run inference on general-purpose CPUs as well as GPUs.
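As a rough illustration of the savings, here is a back-of-the-envelope size estimate for a 1.3B-parameter model at different precisions. This ignores the per-block scales and metadata used by real GGUF quantization formats, so actual file sizes will be somewhat larger:

```python
# Approximate weight storage for a 1.3B-parameter model at different precisions.
# Real GGUF quant formats add per-block metadata, so files are slightly bigger.
params = 1.3e9

for name, bits in [("FP32", 32), ("FP16", 16), ("Q8_0 (~8-bit)", 8), ("Q4_K (~4-bit)", 4)]:
    gib = params * bits / 8 / 1024**3
    print(f"{name:>14}: ~{gib:.2f} GiB")

# Expected output (approximate):
#           FP32: ~4.84 GiB
#           FP16: ~2.42 GiB
#  Q8_0 (~8-bit): ~1.21 GiB
#  Q4_K (~4-bit): ~0.61 GiB
```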

Note: Running this video model in GGUF format requires compatible software that can interpret the video diffusion transformer architecture.
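To check that a downloaded file is what your loader expects (architecture, quantization types, tensor layout), you can inspect its metadata with the gguf Python package from the llama.cpp project. This is a minimal sketch and the filename is a hypothetical placeholder; the reader API may differ slightly between package versions:

```python
# Inspect GGUF metadata and tensors (requires `pip install gguf`).
from gguf import GGUFReader

reader = GGUFReader("Self-Forcing2.1-T2V-1.3B-Q4_K_M.gguf")  # hypothetical filename

# Key/value metadata stored in the header (e.g., general.architecture).
for name in list(reader.fields)[:10]:
    print("field:", name)

# Per-tensor name, quantization type, and shape.
for t in reader.tensors[:5]:
    print(f"tensor: {t.name}  type={t.tensor_type.name}  shape={list(t.shape)}")

total_bytes = sum(int(t.n_bytes) for t in reader.tensors)
print(f"{len(reader.tensors)} tensors, ~{total_bytes / 1024**3:.2f} GiB of weight data")
```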


📚 Model Details and Attribution

This work would not be possible without the open-source projects that precede it.

Base Model: Wan2.1

This model is based on Wan2.1-T2V-1.3B, a powerful 1.3 billion parameter text-to-video model. It uses a Diffusion Transformer (DiT) architecture and a 3D VAE (Wan-VAE) optimized to preserve temporal information, making it ideal for video generation.

  • Original Repository: Wan-AI/Wan2.1-T2V-1.3B
  • Architecture: Diffusion Transformer (DiT) with a T5 text encoder.
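For intuition, the pipeline described above has three stages: encode the prompt with the T5 text encoder, iteratively denoise a latent video with the DiT, and decode the latent into frames with the 3D Wan-VAE. The sketch below is purely illustrative pseudocode; the function and module names are hypothetical and do not correspond to any specific library API:

```python
# Illustrative pseudocode of a Wan2.1-style text-to-video pipeline (hypothetical names).
def generate_video(prompt, text_encoder, dit, vae, num_steps=50):
    text_emb = text_encoder(prompt)                  # T5 prompt embedding
    latent = sample_gaussian_noise(latent_shape())   # noisy spatio-temporal latent

    # The Diffusion Transformer iteratively removes noise, conditioned on the prompt.
    for t in reversed(range(num_steps)):
        noise_pred = dit(latent, timestep=t, context=text_emb)
        latent = scheduler_step(latent, noise_pred, t)

    # The 3D Wan-VAE decodes the denoised latent into RGB video frames.
    return vae.decode(latent)
```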

Optimization Technique: Self-Forcing

The Wan2.1 model was enhanced with the Self-Forcing method, which trains the model on its own generated outputs so that it learns to recognize and compensate for its own diffusion errors at inference time. This improves fidelity and temporal coherence without requiring training from scratch.
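Conceptually, the difference from standard teacher forcing is that, during training, each chunk of frames is generated while conditioning on the model's own previously generated frames rather than on ground-truth frames, so the training distribution matches what the model actually sees at inference time. The loop below is a heavily simplified, hypothetical sketch of that idea, not the authors' training code:

```python
# Hypothetical sketch of the self-forcing idea (not the authors' implementation).
def self_forcing_rollout(model, prompt_emb, num_chunks):
    generated = []  # latent chunks produced so far
    for _ in range(num_chunks):
        # Condition on the model's OWN previous outputs (self-forcing),
        # not on ground-truth frames (teacher forcing).
        chunk = model.denoise_chunk(prompt_emb, history=generated)
        generated.append(chunk)
    return generated

# A training step would then score the full self-generated rollout against real
# videos (e.g., with a distribution-matching loss) and backpropagate, so the
# model learns to compensate for its own accumulated errors.
```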


🙏 Acknowledgements

We thank the teams behind Wan2.1, Self-Forcing, Stable Diffusion, diffusers, and the entire Hugging Face community for their contributions to the open-source ecosystem.

✍️ Citation

If you find our work useful, please cite the original projects:

@article{wan2.1,
    title         = {Wan: Open and Advanced Large-Scale Video Generative Models},
    author        = {Wan Team},
    journal       = {},
    year          = {2025}
}

@misc{bar2024self,
    title         = {Self-Forcing for Real-Time Video Generation},
    author        = {Tal Bar and Roy Vovers and Yael Vinker and Eliahu Horwitz and Mark B. Zkharya and Yedid Hoshen},
    year          = {2024},
    eprint        = {2405.03358},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CV}
}