---
language:
  - en
tags:
  - stable-diffusion
  - pytorch
  - text-to-image
  - image-to-image
  - diffusion-models
  - computer-vision
  - generative-ai
  - deep-learning
  - neural-networks
license: mit
library_name: pytorch
pipeline_tag: text-to-image
base_model: stable-diffusion-v1-5
model-index:
  - name: pytorch-stable-diffusion
    results:
      - task:
          type: text-to-image
          name: Text-to-Image Generation
        dataset:
          type: custom
          name: Stable Diffusion v1.5
        metrics:
          - type: inference_steps
            value: 50
          - type: cfg_scale
            value: 8
          - type: image_size
            value: 512x512
---

# PyTorch Stable Diffusion Implementation

A complete, from-scratch PyTorch implementation of Stable Diffusion v1.5 that supports both text-to-image and image-to-image generation. The project demonstrates the inner workings of diffusion models by implementing all model components from scratch rather than using pre-built diffusion libraries.

## 🚀 Features

- **Text-to-Image Generation**: Create high-quality images from text descriptions
- **Image-to-Image Generation**: Transform existing images using text prompts
- **Complete Implementation**: All components built from scratch in PyTorch
- **Flexible Sampling**: Configurable inference steps and CFG scale
- **Model Compatibility**: Support for various fine-tuned Stable Diffusion models
- **Clean Architecture**: Modular design with separate components for each part of the pipeline

## 🏗️ Architecture

This implementation includes all the core components of Stable Diffusion:

- **CLIP Text Encoder**: Processes text prompts into embeddings
- **VAE Encoder/Decoder**: Handles image compression and reconstruction
- **U-Net Diffusion Model**: Core denoising network with attention mechanisms
- **DDPM Sampler**: Implements the denoising diffusion probabilistic model
- **Pipeline Orchestration**: Coordinates all components for generation

## 📁 Project Structure

```
├── main/
│   ├── attention.py      # Multi-head attention implementation
│   ├── clip.py           # CLIP text encoder
│   ├── ddpm.py           # DDPM sampling algorithm
│   ├── decoder.py        # VAE decoder for image reconstruction
│   ├── diffusion.py      # U-Net diffusion model
│   ├── encoder.py        # VAE encoder for image compression
│   ├── model_converter.py # Converts checkpoint files to PyTorch format
│   ├── model_loader.py   # Loads and manages model weights
│   ├── pipeline.py       # Main generation pipeline
│   └── demo.py           # Example usage and demonstration
├── data/                 # Model weights and tokenizer files
└── images/               # Input/output images
```

## 🛠️ Installation

### Prerequisites

- Python 3.8+
- PyTorch 1.12+
- Transformers library
- PIL (Pillow)
- NumPy
- tqdm

### Setup

1. **Clone the repository:**
   ```bash
   git clone https://github.com/ApoorvBrooklyn/Stable-Diffusion pytorch-stable-diffusion
   cd pytorch-stable-diffusion
   ```

2. **Create virtual environment:**
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Install dependencies:**
   ```bash
   pip install torch torchvision torchaudio
   pip install transformers pillow numpy tqdm
   ```

4. **Download required model files:**
   - Download `vocab.json` and `merges.txt` from [Stable Diffusion v1.5 tokenizer](https://huggingface.co/ApoorvBrooklyn/stable-diffusion-implementation/tree/main/data)
   - Download `v1-5-pruned-emaonly.ckpt` from [Stable Diffusion v1.5](https://huggingface.co/ApoorvBrooklyn/stable-diffusion-implementation/tree/main/data)
   - Place all files in the `data/` folder

## 🎯 Usage

### Basic Text-to-Image Generation

```python
import model_loader
import pipeline
from transformers import CLIPTokenizer

# Initialize tokenizer and load models
tokenizer = CLIPTokenizer("data/vocab.json", merges_file="data/merges.txt")
models = model_loader.preload_models_from_standard_weights("data/v1-5-pruned-emaonly.ckpt", "cpu")

# Generate image from text
output_image = pipeline.generate(
    prompt="A beautiful sunset over mountains, highly detailed, 8k resolution",
    uncond_prompt="",  # Negative prompt
    do_cfg=True,
    cfg_scale=8,
    sampler_name="ddpm",
    n_inference_steps=50,
    seed=42,
    models=models,
    device="cpu",
    tokenizer=tokenizer
)
```
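
To write the result to disk, a minimal sketch assuming `pipeline.generate` returns the image as an `(H, W, 3)` uint8 NumPy array (skip the conversion if it already returns a `PIL.Image`):

```python
import numpy as np
from PIL import Image

# Convert the pipeline output to a PIL image and save it.
Image.fromarray(np.asarray(output_image, dtype=np.uint8)).save("images/output.png")
```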

### Image-to-Image Generation

```python
from PIL import Image

# Load input image
input_image = Image.open("images/input.jpg")

# Generate transformed image
output_image = pipeline.generate(
    prompt="Transform this into a watercolor painting",
    input_image=input_image,
    strength=0.8,  # Controls how much to change the input
    # ... other parameters
)
```
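
Since the pipeline works on 512x512 images (see Technical Details below), it is safest to resize the source image before passing it in; whether `pipeline.generate` also resizes internally is not assumed here:

```python
from PIL import Image

# Bring the source image to the model's working resolution first.
input_image = Image.open("images/input.jpg").convert("RGB").resize((512, 512))
```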

### Advanced Configuration

- **CFG Scale**: Controls how closely the image follows the prompt (1-14)
- **Inference Steps**: More steps generally improve quality at the cost of slower generation
- **Strength**: For image-to-image, controls how strongly the input is transformed (0-1)
- **Seed**: Set a fixed seed for reproducible results (the sketch after this list combines these settings)
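
A small sketch that sweeps the CFG scale while keeping the other settings fixed, reusing the `models` and `tokenizer` from the setup above (parameter names follow the earlier example):

```python
for cfg_scale in (4, 8, 12):
    image = pipeline.generate(
        prompt="A beautiful sunset over mountains, highly detailed",
        uncond_prompt="",
        do_cfg=True,
        cfg_scale=cfg_scale,
        sampler_name="ddpm",
        n_inference_steps=50,  # lower values (e.g. 20-30) trade quality for speed
        seed=42,               # fixed seed so only cfg_scale changes between runs
        models=models,
        device="cpu",
        tokenizer=tokenizer,
    )
```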

## 🔧 Model Conversion

The `model_converter.py` script converts Stable Diffusion checkpoint files to PyTorch format:

```bash
python main/model_converter.py --checkpoint_path data/v1-5-pruned-emaonly.ckpt --output_dir converted_models/
```

## 🎨 Supported Models

This implementation is compatible with:
- **Stable Diffusion v1.5**: Base model
- **Fine-tuned Models**: Any SD v1.5 compatible checkpoint
- **Custom Models**: Models trained on specific datasets or styles

### Tested Fine-tuned Models
- **InkPunk Diffusion**: Artistic ink-style images
- **Illustration Diffusion**: Hollie Mengert's illustration style

## 🚀 Performance Tips

- **Device Selection**: Use CUDA for GPU acceleration or MPS for Apple Silicon (a selection sketch follows this list)
- **Batch Processing**: Process multiple prompts simultaneously
- **Memory Management**: Use `idle_device="cpu"` to free GPU memory
- **Optimization**: Adjust inference steps based on quality vs. speed needs
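
A minimal device-selection sketch, assuming PyTorch 1.12+ for the MPS check; pass the chosen value as `device=` (and, per the memory tip above, `idle_device="cpu"`) to `pipeline.generate`:

```python
import torch

# Pick the fastest available backend and fall back to CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
```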

## 🔬 Technical Details

### Diffusion Process
- Implements DDPM (Denoising Diffusion Probabilistic Models)
- Uses U-Net architecture with cross-attention for text conditioning
- The VAE compresses 512x512 images into 64x64 latents (see the shape sketch after this list)
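
A quick shape check of that compression, assuming the standard SD v1.5 latent layout of 4 channels at 1/8 of the spatial resolution:

```python
# 512x512 RGB input -> (4, 64, 64) latent: the tensor the U-Net actually denoises.
downscale = 8
latent_shape = (4, 512 // downscale, 512 // downscale)
print(latent_shape)  # (4, 64, 64)
```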

### Attention Mechanisms
- Multi-head self-attention in U-Net
- Cross-attention between text embeddings and image features (see the minimal sketch after this list)
- Efficient attention implementation for memory optimization
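
For intuition, here is a minimal single-head cross-attention sketch in PyTorch, using typical SD v1.5 sizes (320-channel latents, 768-dimensional CLIP embeddings over 77 tokens); the project's `attention.py` implements a multi-head variant of the same idea:

```python
import math
import torch
import torch.nn as nn

class CrossAttentionSketch(nn.Module):
    """Single-head cross-attention: image latents attend to text embeddings."""

    def __init__(self, d_latent: int, d_text: int, d_head: int = 64):
        super().__init__()
        self.q = nn.Linear(d_latent, d_head)   # queries come from image features
        self.k = nn.Linear(d_text, d_head)     # keys come from text embeddings
        self.v = nn.Linear(d_text, d_head)     # values come from text embeddings
        self.out = nn.Linear(d_head, d_latent)

    def forward(self, latents: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # latents: (batch, n_pixels, d_latent); text: (batch, n_tokens, d_text)
        q, k, v = self.q(latents), self.k(text), self.v(text)
        scores = q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1])
        weights = scores.softmax(dim=-1)       # each pixel position attends over tokens
        return self.out(weights @ v)

attn = CrossAttentionSketch(d_latent=320, d_text=768)
out = attn(torch.randn(1, 64 * 64, 320), torch.randn(1, 77, 768))
print(out.shape)  # torch.Size([1, 4096, 320])
```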

### Sampling
- Configurable number of denoising steps
- Classifier-free guidance (CFG) for prompt adherence (combined as in the sketch after this list)
- Deterministic generation with seed control
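
At each denoising step, CFG blends the U-Net's conditional and unconditional noise predictions. A sketch of the standard combination (variable names are illustrative; shapes match the 64x64 latents above):

```python
import torch

cfg_scale = 8.0
cond_pred = torch.randn(1, 4, 64, 64)    # U-Net output given the text prompt
uncond_pred = torch.randn(1, 4, 64, 64)  # U-Net output given the empty/negative prompt

# Push the prediction further in the direction implied by the prompt.
guided_pred = uncond_pred + cfg_scale * (cond_pred - uncond_pred)
```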

## 🤝 Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for:
- Bug fixes
- Performance improvements
- New sampling algorithms
- Additional model support
- Documentation improvements

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- **Stability AI** for the original Stable Diffusion model
- **OpenAI** for the CLIP architecture
- **CompVis** for the VAE implementation
- **Hugging Face** for the transformers library

## 📚 References

- [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)
- [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)
- [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)

## 📞 Support

If you encounter any issues or have questions:
- Open an issue on GitHub
- Check the existing documentation
- Review the demo code for examples

---

**Note**: This is a research and educational implementation. For production use, consider using the official Stable Diffusion implementations or cloud-based APIs.