---
language:
- en
tags:
- stable-diffusion
- pytorch
- text-to-image
- image-to-image
- diffusion-models
- computer-vision
- generative-ai
- deep-learning
- neural-networks
license: mit
library_name: pytorch
pipeline_tag: text-to-image
base_model: stable-diffusion-v1-5
model-index:
- name: pytorch-stable-diffusion
  results:
  - task:
      type: text-to-image
      name: Text-to-Image Generation
    dataset:
      type: custom
      name: Stable Diffusion v1.5
    metrics:
    - type: inference_steps
      value: 50
    - type: cfg_scale
      value: 8
    - type: image_size
      value: 512x512
---
# PyTorch Stable Diffusion Implementation
A complete, from-scratch PyTorch implementation of Stable Diffusion v1.5, featuring both text-to-image and image-to-image generation. This project demonstrates the inner workings of diffusion models by implementing every model component from scratch rather than relying on pre-built diffusion libraries (only the CLIP tokenizer is taken from `transformers`).
## 🚀 Features
- **Text-to-Image Generation**: Create high-quality images from text descriptions
- **Image-to-Image Generation**: Transform existing images using text prompts
- **Complete Implementation**: All components built from scratch in PyTorch
- **Flexible Sampling**: Configurable inference steps and CFG scale
- **Model Compatibility**: Support for various fine-tuned Stable Diffusion models
- **Clean Architecture**: Modular design with separate components for each part of the pipeline
## 🏗️ Architecture
This implementation includes all the core components of Stable Diffusion:
- **CLIP Text Encoder**: Processes text prompts into embeddings
- **VAE Encoder/Decoder**: Handles image compression and reconstruction
- **U-Net Diffusion Model**: Core denoising network with attention mechanisms
- **DDPM Sampler**: Implements the denoising diffusion probabilistic model
- **Pipeline Orchestration**: Coordinates all components for generation
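As a rough illustration of how these components fit together, the sketch below walks through typical Stable Diffusion v1.5 tensor shapes at each stage. The shapes are standard for SD v1.5; the variable names are illustrative and not tied to this repo's exact API.

```python
import torch

# Illustrative SD v1.5 shapes; variable names are hypothetical, not this repo's API.
batch = 1
tokens = torch.randint(0, 49408, (batch, 77))   # tokenized prompt (CLIP vocab, 77 tokens)
context = torch.randn(batch, 77, 768)           # CLIP text encoder output: one 768-d vector per token

latents = torch.randn(batch, 4, 64, 64)         # VAE latent: a 512x512 image downsampled by a factor of 8
timestep = torch.tensor([999])                  # current diffusion timestep fed to the U-Net

# The U-Net predicts noise with the same shape as the latents, conditioned on
# (latents, context, timestep). The DDPM sampler repeatedly removes that
# predicted noise, and the VAE decoder maps the final (1, 4, 64, 64) latent
# back to a (1, 3, 512, 512) image.
```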
## 📁 Project Structure
```
├── main/
│   ├── attention.py         # Multi-head attention implementation
│   ├── clip.py              # CLIP text encoder
│   ├── ddpm.py              # DDPM sampling algorithm
│   ├── decoder.py           # VAE decoder for image reconstruction
│   ├── diffusion.py         # U-Net diffusion model
│   ├── encoder.py           # VAE encoder for image compression
│   ├── model_converter.py   # Converts checkpoint files to PyTorch format
│   ├── model_loader.py      # Loads and manages model weights
│   ├── pipeline.py          # Main generation pipeline
│   └── demo.py              # Example usage and demonstration
├── data/                    # Model weights and tokenizer files
└── images/                  # Input/output images
```
## 🛠️ Installation
### Prerequisites
- Python 3.8+
- PyTorch 1.12+
- Transformers library
- PIL (Pillow)
- NumPy
- tqdm
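If you prefer pinning dependencies in a file, a minimal `requirements.txt` covering the prerequisites above could look like the following (version bounds are suggestions, not tested pins):

```text
torch>=1.12
torchvision
torchaudio
transformers
pillow
numpy
tqdm
```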
### Setup
1. **Clone the repository:**
```bash
git clone https://github.com/ApoorvBrooklyn/Stable-Diffusion
cd Stable-Diffusion
```
2. **Create virtual environment:**
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. **Install dependencies:**
```bash
pip install torch torchvision torchaudio
pip install transformers pillow numpy tqdm
```
4. **Download required model files:**
- Download `vocab.json` and `merges.txt` from [Stable Diffusion v1.5 tokenizer](https://huggingface.co/ApoorvBrooklyn/stable-diffusion-implementation/tree/main/data)
- Download `v1-5-pruned-emaonly.ckpt` from [Stable Diffusion v1.5](https://huggingface.co/ApoorvBrooklyn/stable-diffusion-implementation/tree/main/data)
- Place all files in the `data/` folder
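Alternatively, the files can be fetched programmatically with the `huggingface_hub` package. The repo id and file paths below are inferred from the links above; adjust them if the repository layout differs.

```python
from huggingface_hub import hf_hub_download

repo_id = "ApoorvBrooklyn/stable-diffusion-implementation"  # inferred from the links above
files = ["data/vocab.json", "data/merges.txt", "data/v1-5-pruned-emaonly.ckpt"]

for filename in files:
    # local_dir="." preserves the repo's data/ prefix, so files land in ./data/
    path = hf_hub_download(repo_id=repo_id, filename=filename, local_dir=".")
    print(f"Downloaded {filename} -> {path}")
```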
## 🎯 Usage
### Basic Text-to-Image Generation
```python
import model_loader
import pipeline
from transformers import CLIPTokenizer

# Initialize tokenizer and load models
tokenizer = CLIPTokenizer("data/vocab.json", merges_file="data/merges.txt")
models = model_loader.preload_models_from_standard_weights("data/v1-5-pruned-emaonly.ckpt", "cpu")

# Generate image from text
output_image = pipeline.generate(
    prompt="A beautiful sunset over mountains, highly detailed, 8k resolution",
    uncond_prompt="",        # Negative prompt
    do_cfg=True,             # Enable classifier-free guidance
    cfg_scale=8,
    sampler_name="ddpm",
    n_inference_steps=50,
    seed=42,
    models=models,
    device="cpu",
    tokenizer=tokenizer,
)
```
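The pipeline most likely returns the finished image as a pixel array (adjust if your copy of `pipeline.generate` already returns a PIL image); if so, it can be saved with Pillow:

```python
from PIL import Image

# Assuming `output_image` is an HxWxC uint8 array.
Image.fromarray(output_image).save("images/output.png")
```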
### Image-to-Image Generation
```python
from PIL import Image

# Load input image
input_image = Image.open("images/input.jpg")

# Generate transformed image
output_image = pipeline.generate(
    prompt="Transform this into a watercolor painting",
    input_image=input_image,
    strength=0.8,  # Controls how much to change the input
    # ... other parameters as in the text-to-image example
)
```
### Advanced Configuration
- **CFG Scale**: Controls how closely the image follows the prompt (1-14)
- **Inference Steps**: More steps = higher quality but slower generation
- **Strength**: For image-to-image, controls transformation intensity (0-1); see the sketch after this list
- **Seed**: Set for reproducible results
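For context on `strength`: in most Stable Diffusion implementations (this repo's exact behavior may differ), the input image is encoded to a latent, noised to an intermediate timestep, and only the remaining denoising steps are run. A rough sketch of that mapping:

```python
# Hypothetical illustration of how strength usually shortens the denoising schedule.
n_inference_steps = 50
strength = 0.8

steps_to_run = int(n_inference_steps * strength)  # 40: higher strength = start from a noisier latent
start_step = n_inference_steps - steps_to_run     # 10: steps skipped before denoising begins
print(f"Running {steps_to_run}/{n_inference_steps} steps, starting at step {start_step}")
```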
## 🔧 Model Conversion
The `model_converter.py` script converts Stable Diffusion checkpoint files to PyTorch format:
```bash
python main/model_converter.py --checkpoint_path data/v1-5-pruned-emaonly.ckpt --output_dir converted_models/
```
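Under the hood, a `.ckpt` checkpoint is a pickled dictionary whose weights typically sit under a `state_dict` key with the original CompVis layer names; conversion is essentially loading that dictionary and remapping keys to this implementation's module names (the remapping itself lives in `model_converter.py`). A minimal sketch of the loading half:

```python
import torch

# Load the original checkpoint; weights usually live under "state_dict".
checkpoint = torch.load("data/v1-5-pruned-emaonly.ckpt", map_location="cpu")
state_dict = checkpoint.get("state_dict", checkpoint)

# Inspect a few original key names before they are remapped to this repo's layer names.
for key in list(state_dict)[:5]:
    print(key, tuple(state_dict[key].shape))
```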
## 🎨 Supported Models
This implementation is compatible with:
- **Stable Diffusion v1.5**: Base model
- **Fine-tuned Models**: Any SD v1.5 compatible checkpoint
- **Custom Models**: Models trained on specific datasets or styles
### Tested Fine-tuned Models:
- **InkPunk Diffusion**: Artistic ink-style images
- **Illustration Diffusion**: Hollie Mengert's illustration style
## 🚀 Performance Tips
- **Device Selection**: Use CUDA for GPU acceleration, MPS for Apple Silicon (see the sketch after this list)
- **Batch Processing**: Process multiple prompts simultaneously
- **Memory Management**: Use `idle_device="cpu"` to free GPU memory
- **Optimization**: Adjust inference steps based on quality vs. speed needs
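A small sketch of the device selection described above:

```python
import torch

# Pick the fastest available device, falling back to CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"Using device: {device}")

# Pass device to pipeline.generate(...); idle_device="cpu" parks unused
# sub-models on the CPU to free GPU memory between stages.
```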
## 🔬 Technical Details
### Diffusion Process
- Implements DDPM (Denoising Diffusion Probabilistic Models)
- Uses U-Net architecture with cross-attention for text conditioning
- The VAE compresses 512x512 images into 4x64x64 latents and reconstructs the image after denoising
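As a concrete illustration of the DDPM formulation described above (standard equations, not code taken from this repo), the forward process noises a clean latent to timestep t and the U-Net learns to predict the noise that was added:

```python
import torch

T = 1000
# Simple linear beta schedule for illustration (SD v1.x actually uses a
# slightly different "scaled linear" schedule).
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of alphas

x0 = torch.randn(1, 4, 64, 64)                  # a clean latent (random here, for illustration)
t = 500
eps = torch.randn_like(x0)

# Forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
xt = alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps
```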
### Attention Mechanisms
- Multi-head self-attention in U-Net
- Cross-attention between text embeddings and image features
- Efficient attention implementation for memory optimization
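A minimal single-head sketch of the cross-attention described above: queries come from the image features, keys and values from the CLIP text embeddings. The dimensions are typical SD v1.5 values; the real model uses multiple heads and learned weights inside the U-Net blocks.

```python
import torch

d_model, d_context = 320, 768                     # U-Net channels at one resolution, CLIP embedding size
image_feats = torch.randn(1, 64 * 64, d_model)    # flattened spatial positions act as the query sequence
text_ctx = torch.randn(1, 77, d_context)          # CLIP token embeddings

to_q = torch.nn.Linear(d_model, d_model, bias=False)
to_k = torch.nn.Linear(d_context, d_model, bias=False)
to_v = torch.nn.Linear(d_context, d_model, bias=False)

q, k, v = to_q(image_feats), to_k(text_ctx), to_v(text_ctx)
attn = torch.softmax(q @ k.transpose(-1, -2) / d_model**0.5, dim=-1)  # (1, 4096, 77)
out = attn @ v                                                        # text-conditioned image features
```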
### Sampling
- Configurable number of denoising steps
- Classifier-free guidance (CFG) for prompt adherence
- Deterministic generation with seed control
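A sketch of the standard classifier-free guidance combination and seed control (dummy tensors stand in for the two U-Net predictions at one timestep):

```python
import torch

# Deterministic sampling: the same seed produces the same initial latent noise.
generator = torch.Generator(device="cpu").manual_seed(42)
latents = torch.randn(1, 4, 64, 64, generator=generator)

cfg_scale = 8.0
noise_cond = torch.randn(1, 4, 64, 64)    # U-Net output conditioned on the prompt
noise_uncond = torch.randn(1, 4, 64, 64)  # U-Net output for the empty / negative prompt

# CFG pushes the prediction away from the unconditional output, toward the prompt.
noise_pred = noise_uncond + cfg_scale * (noise_cond - noise_uncond)
```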
## 🤝 Contributing
Contributions are welcome! Please feel free to submit pull requests or open issues for:
- Bug fixes
- Performance improvements
- New sampling algorithms
- Additional model support
- Documentation improvements
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- **Stability AI** for the original Stable Diffusion model
- **OpenAI** for the CLIP architecture
- **CompVis** for the VAE implementation
- **Hugging Face** for the transformers library
## 📚 References
- [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)
- [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)
- [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
## 📞 Support
If you encounter any issues or have questions:
- Open an issue on GitHub
- Check the existing documentation
- Review the demo code for examples
---
**Note**: This is a research and educational implementation. For production use, consider using the official Stable Diffusion implementations or cloud-based APIs. |