ApoorvBrooklyn committed on
Commit 1aee4d0 · verified · 1 Parent(s): d60c1ee

Upload README.md with huggingface_hub

Files changed (1): README.md (+204 -14)

README.md CHANGED
@@ -9,6 +9,8 @@ tags:
  - diffusion-models
  - computer-vision
  - generative-ai
+ - deep-learning
+ - neural-networks
  license: mit
  library_name: pytorch
  pipeline_tag: text-to-image
@@ -31,23 +33,211 @@ model-index:
  value: 512x512
  ---

- # pytorch-stable-diffusion
- PyTorch implementation of Stable Diffusion from scratch

- ## Download weights and tokenizer files:

- 1. Download `vocab.json` and `merges.txt` from https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/tree/main/tokenizer and save them in the `data` folder
- 2. Download `v1-5-pruned-emaonly.ckpt` from https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/tree/main and save it in the `data` folder

- ## Tested fine-tuned models:

- Just download the `ckpt` file from any fine-tuned SD (up to v1.5).

- 1. InkPunk Diffusion: https://huggingface.co/Envvi/Inkpunk-Diffusion/tree/main
- 2. Illustration Diffusion (Hollie Mengert): https://huggingface.co/ogkalu/Illustration-Diffusion/tree/main

- ## Features:
- - Text-to-image generation
- - Image-to-image generation
- - Support for fine-tuned models
- - PyTorch implementation from scratch
+ # PyTorch Stable Diffusion Implementation
+
+ A complete, from-scratch PyTorch implementation of Stable Diffusion v1.5, featuring both text-to-image and image-to-image generation. This project demonstrates the inner workings of diffusion models by implementing all components without relying on pre-built libraries.
+
+ ## 🚀 Features
+
+ - **Text-to-Image Generation**: Create high-quality images from text descriptions
+ - **Image-to-Image Generation**: Transform existing images using text prompts
+ - **Complete Implementation**: All components built from scratch in PyTorch
+ - **Flexible Sampling**: Configurable inference steps and CFG scale
+ - **Model Compatibility**: Support for various fine-tuned Stable Diffusion models
+ - **Clean Architecture**: Modular design with separate components for each stage of the pipeline
+
+ ## 🏗️ Architecture
+
+ This implementation includes all the core components of Stable Diffusion; a toy sketch of how they fit together follows this list:
+
+ - **CLIP Text Encoder**: Processes text prompts into embeddings
+ - **VAE Encoder/Decoder**: Handles image compression and reconstruction
+ - **U-Net Diffusion Model**: Core denoising network with attention mechanisms
+ - **DDPM Sampler**: Implements the denoising diffusion probabilistic model
+ - **Pipeline Orchestration**: Coordinates all components for generation
+
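+ The sketch below walks one denoising loop with stand-in tensors and a stand-in noise predictor; it is illustrative only (not the repository's `pipeline.py`), though the tensor shapes match SD v1.5.
+
+ ```python
+ import torch
+
+ context = torch.randn(1, 77, 768)             # stand-in for CLIP text embeddings
+ latents = torch.randn(1, 4, 64, 64)           # initial noise for a 512x512 image
+ unet = lambda x, ctx, t: torch.randn_like(x)  # stand-in noise predictor
+
+ for t in reversed(range(0, 1000, 20)):        # 50 evenly spaced timesteps
+     pred_noise = unet(latents, context, t)    # U-Net predicts the noise at step t
+     latents = latents - 0.02 * pred_noise     # stand-in for the DDPM update rule
+
+ # A VAE decoder would now map latents (1, 4, 64, 64) -> image (1, 3, 512, 512)
+ ```
+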
+ ## 📁 Project Structure
+
+ ```
+ ├── main/
+ │   ├── attention.py        # Multi-head attention implementation
+ │   ├── clip.py             # CLIP text encoder
+ │   ├── ddpm.py             # DDPM sampling algorithm
+ │   ├── decoder.py          # VAE decoder for image reconstruction
+ │   ├── diffusion.py        # U-Net diffusion model
+ │   ├── encoder.py          # VAE encoder for image compression
+ │   ├── model_converter.py  # Converts checkpoint files to PyTorch format
+ │   ├── model_loader.py     # Loads and manages model weights
+ │   ├── pipeline.py         # Main generation pipeline
+ │   └── demo.py             # Example usage and demonstration
+ ├── data/                   # Model weights and tokenizer files
+ └── images/                 # Input/output images
+ ```
+
+ ## 🛠️ Installation
+
+ ### Prerequisites
+
+ - Python 3.8+
+ - PyTorch 1.12+
+ - Transformers library
+ - PIL (Pillow)
+ - NumPy
+ - tqdm
+
+ ### Setup
+
+ 1. **Clone the repository:**
+    ```bash
+    git clone https://github.com/yourusername/pytorch-stable-diffusion.git
+    cd pytorch-stable-diffusion
+    ```
+
+ 2. **Create a virtual environment:**
+    ```bash
+    python -m venv venv
+    source venv/bin/activate  # On Windows: venv\Scripts\activate
+    ```
+
+ 3. **Install dependencies:**
+    ```bash
+    pip install torch torchvision torchaudio
+    pip install transformers pillow numpy tqdm
+    ```
+
+ 4. **Download the required model files:**
+    - Download `vocab.json` and `merges.txt` from the [Stable Diffusion v1.5 tokenizer](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/tree/main/tokenizer)
+    - Download `v1-5-pruned-emaonly.ckpt` from [Stable Diffusion v1.5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/tree/main)
+    - Place all files in the `data/` folder
+
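+ Before running anything, it can help to verify the downloads landed where the code expects them. A minimal check, using the file names from step 4:
+
+ ```python
+ import os
+
+ # File names taken from the download instructions above
+ for name in ["vocab.json", "merges.txt", "v1-5-pruned-emaonly.ckpt"]:
+     path = os.path.join("data", name)
+     print(f"{path}: {'found' if os.path.exists(path) else 'MISSING'}")
+ ```
+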
+ ## 🎯 Usage
+
+ ### Basic Text-to-Image Generation
+
+ ```python
+ import model_loader
+ import pipeline
+ from transformers import CLIPTokenizer
+
+ # Initialize the tokenizer and load model weights
+ tokenizer = CLIPTokenizer("data/vocab.json", merges_file="data/merges.txt")
+ models = model_loader.preload_models_from_standard_weights("data/v1-5-pruned-emaonly.ckpt", "cpu")
+
+ # Generate an image from text
+ output_image = pipeline.generate(
+     prompt="A beautiful sunset over mountains, highly detailed, 8k resolution",
+     uncond_prompt="",  # Negative prompt
+     do_cfg=True,
+     cfg_scale=8,
+     sampler_name="ddpm",
+     n_inference_steps=50,
+     seed=42,
+     models=models,
+     device="cpu",
+     tokenizer=tokenizer,
+ )
+ ```
+
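+ To save the result, assuming `generate` returns a uint8 NumPy array in (height, width, channel) layout (an assumption; the return type is not documented above), Pillow's array round trip works:
+
+ ```python
+ from PIL import Image
+
+ # Assumes output_image from the example above is a uint8 NumPy array (H, W, C)
+ Image.fromarray(output_image).save("images/output.png")
+ ```
+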
+ ### Image-to-Image Generation
+
+ ```python
+ from PIL import Image
+
+ # Load the input image
+ input_image = Image.open("images/input.jpg")
+
+ # Generate a transformed image
+ output_image = pipeline.generate(
+     prompt="Transform this into a watercolor painting",
+     input_image=input_image,
+     strength=0.8,  # Controls how much the input is changed
+     # ... other parameters as in the text-to-image example
+ )
+ ```
+
158
+ ### Advanced Configuration
159
+
160
+ - **CFG Scale**: Controls how closely the image follows the prompt (1-14)
161
+ - **Inference Steps**: More steps = higher quality but slower generation
162
+ - **Strength**: For image-to-image, controls transformation intensity (0-1)
163
+ - **Seed**: Set for reproducible results
164
+
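+ Putting these knobs together, continuing from the text-to-image setup above (the parameter values here are illustrative, not recommendations):
+
+ ```python
+ output_image = pipeline.generate(
+     prompt="A watercolor landscape",
+     uncond_prompt="blurry, low quality",  # negative prompt steers away from these
+     do_cfg=True,
+     cfg_scale=10,            # stronger prompt adherence (range 1-14)
+     sampler_name="ddpm",
+     n_inference_steps=30,    # fewer steps: faster, somewhat lower quality
+     seed=1234,               # fixed seed for reproducible output
+     models=models,
+     device="cpu",
+     tokenizer=tokenizer,
+ )
+ ```
+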
+ ## 🔧 Model Conversion
+
+ The `model_converter.py` script converts Stable Diffusion checkpoint files to PyTorch format:
+
+ ```bash
+ python main/model_converter.py --checkpoint_path data/v1-5-pruned-emaonly.ckpt --output_dir converted_models/
+ ```
+
+ ## 🎨 Supported Models
+
+ This implementation is compatible with:
+ - **Stable Diffusion v1.5**: Base model
+ - **Fine-tuned Models**: Any SD v1.5-compatible checkpoint
+ - **Custom Models**: Models trained on specific datasets or styles
+
+ ### Tested Fine-tuned Models:
+ - **InkPunk Diffusion**: Artistic ink-style images
+ - **Illustration Diffusion**: Hollie Mengert's illustration style
+
+ ## 🚀 Performance Tips
+
+ - **Device Selection**: Use CUDA for GPU acceleration, MPS for Apple Silicon (see the sketch after this list)
+ - **Batch Processing**: Process multiple prompts simultaneously
+ - **Memory Management**: Use `idle_device="cpu"` to free GPU memory
+ - **Optimization**: Adjust inference steps based on quality vs. speed needs
+
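+ A minimal device-selection sketch using standard PyTorch availability checks; the resulting string is what the examples above pass as `device`:
+
+ ```python
+ import torch
+
+ # Prefer CUDA, then Apple Silicon's MPS, then fall back to CPU
+ if torch.cuda.is_available():
+     device = "cuda"
+ elif torch.backends.mps.is_available():
+     device = "mps"
+ else:
+     device = "cpu"
+ print(f"Using device: {device}")
+ ```
+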
191
+ ## 🔬 Technical Details
192
+
193
+ ### Diffusion Process
194
+ - Implements DDPM (Denoising Diffusion Probabilistic Models)
195
+ - Uses U-Net architecture with cross-attention for text conditioning
196
+ - VAE handles 512x512 image compression to 64x64 latents
197
+
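+ That 8x spatial downsampling is what keeps the U-Net affordable; a quick back-of-the-envelope check of the latent size:
+
+ ```python
+ # SD v1.5's VAE downsamples each spatial dimension by a factor of 8
+ width = height = 512
+ latent_shape = (4, height // 8, width // 8)              # (4, 64, 64)
+ compression = (3 * width * height) / (4 * 64 * 64)       # RGB values vs. latent values
+ print(latent_shape, f"{compression:.0f}x fewer values")  # (4, 64, 64) 48x fewer values
+ ```
+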
+ ### Attention Mechanisms
+ - Multi-head self-attention in the U-Net
+ - Cross-attention between text embeddings and image features
+ - Efficient attention implementation for memory optimization
+
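+ The cross-attention step, reduced to a minimal single-head sketch (dimensions follow SD v1.5's first U-Net stage: 77 text tokens, a 64x64 latent grid flattened to 4096 positions, 320 channels):
+
+ ```python
+ import torch
+
+ q = torch.randn(1, 4096, 320)  # queries from U-Net image features
+ k = torch.randn(1, 77, 320)    # keys projected from text embeddings
+ v = torch.randn(1, 77, 320)    # values projected from text embeddings
+
+ # Scaled dot-product attention: each image position mixes in text information
+ scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)  # (1, 4096, 77)
+ out = scores.softmax(dim=-1) @ v                         # (1, 4096, 320)
+ ```
+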
203
+ ### Sampling
204
+ - Configurable number of denoising steps
205
+ - Classifier-free guidance (CFG) for prompt adherence
206
+ - Deterministic generation with seed control
207
+
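+ Classifier-free guidance runs the U-Net twice per step, once with the prompt and once with the negative prompt, and scales their difference by `cfg_scale`. The standard combination, in miniature with toy tensors:
+
+ ```python
+ import torch
+
+ cfg_scale = 8
+ cond = torch.randn(1, 4, 64, 64)    # noise predicted with the prompt
+ uncond = torch.randn(1, 4, 64, 64)  # noise predicted with the negative prompt
+ guided = uncond + cfg_scale * (cond - uncond)  # standard CFG combination
+ ```
+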
+ ## 🤝 Contributing
+
+ Contributions are welcome! Please feel free to submit pull requests or open issues for:
+ - Bug fixes
+ - Performance improvements
+ - New sampling algorithms
+ - Additional model support
+ - Documentation improvements
+
+ ## 📄 License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+ ## 🙏 Acknowledgments
+
+ - **Stability AI** for the original Stable Diffusion model
+ - **OpenAI** for the CLIP architecture
+ - **CompVis** for the VAE implementation
+ - **Hugging Face** for the transformers library
+
+ ## 📚 References
+
+ - [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)
+ - [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)
+ - [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
+
+ ## 📞 Support
+
+ If you encounter any issues or have questions:
+ - Open an issue on GitHub
+ - Check the existing documentation
+ - Review the demo code for examples
+
+ ---
+
+ **Note**: This is a research and educational implementation. For production use, consider using the official Stable Diffusion implementations or cloud-based APIs.