|
--- |
|
language: |
|
- en |
|
tags: |
|
- stable-diffusion |
|
- pytorch |
|
- text-to-image |
|
- image-to-image |
|
- diffusion-models |
|
- computer-vision |
|
- generative-ai |
|
- deep-learning |
|
- neural-networks |
|
license: mit |
|
library_name: pytorch |
|
pipeline_tag: text-to-image |
|
base_model: runwayml/stable-diffusion-v1-5
|
model-index: |
|
- name: pytorch-stable-diffusion |
|
results: |
|
- task: |
|
type: text-to-image |
|
name: Text-to-Image Generation |
|
dataset: |
|
type: custom |
|
name: Stable Diffusion v1.5 |
|
metrics: |
|
- type: inference_steps |
|
value: 50 |
|
- type: cfg_scale |
|
value: 8 |
|
- type: image_size |
|
value: 512x512 |
|
--- |
|
|
|
# PyTorch Stable Diffusion Implementation |
|
|
|
A complete, from-scratch PyTorch implementation of Stable Diffusion v1.5, featuring both text-to-image and image-to-image generation. This project demonstrates the inner workings of diffusion models by implementing every component of the pipeline itself, relying on the `transformers` library only for the CLIP tokenizer.
|
|
|
## 🚀 Features |
|
|
|
- **Text-to-Image Generation**: Create high-quality images from text descriptions |
|
- **Image-to-Image Generation**: Transform existing images using text prompts |
|
- **Complete Implementation**: All components built from scratch in PyTorch |
|
- **Flexible Sampling**: Configurable inference steps and CFG scale |
|
- **Model Compatibility**: Support for various fine-tuned Stable Diffusion models |
|
- **Clean Architecture**: Modular design with separate components for each part of the pipeline |
|
|
|
## 🏗️ Architecture |
|
|
|
This implementation includes all the core components of Stable Diffusion: |
|
|
|
- **CLIP Text Encoder**: Processes text prompts into embeddings |
|
- **VAE Encoder/Decoder**: Handles image compression and reconstruction |
|
- **U-Net Diffusion Model**: Core denoising network with attention mechanisms |
|
- **DDPM Sampler**: Implements the denoising diffusion probabilistic model |
|
- **Pipeline Orchestration**: Coordinates all components for generation |
|
|
|
## 📁 Project Structure |
|
|
|
``` |
|
├── main/ |
|
│ ├── attention.py # Multi-head attention implementation |
|
│ ├── clip.py # CLIP text encoder |
|
│ ├── ddpm.py # DDPM sampling algorithm |
|
│ ├── decoder.py # VAE decoder for image reconstruction |
|
│ ├── diffusion.py # U-Net diffusion model |
|
│ ├── encoder.py # VAE encoder for image compression |
|
│ ├── model_converter.py # Converts checkpoint files to PyTorch format |
|
│ ├── model_loader.py # Loads and manages model weights |
|
│ ├── pipeline.py # Main generation pipeline |
|
│ └── demo.py # Example usage and demonstration |
|
├── data/ # Model weights and tokenizer files |
|
└── images/ # Input/output images |
|
``` |
|
|
|
## 🛠️ Installation |
|
|
|
### Prerequisites |
|
|
|
- Python 3.8+ |
|
- PyTorch 1.12+ |
|
- Transformers library |
|
- PIL (Pillow) |
|
- NumPy |
|
- tqdm |
|
|
|
### Setup |
|
|
|
1. **Clone the repository:** |
|
```bash |
|
git clone https://github.com/ApoorvBrooklyn/Stable-Diffusion

cd Stable-Diffusion
|
``` |
|
|
|
2. **Create virtual environment:** |
|
```bash |
|
python -m venv venv |
|
source venv/bin/activate # On Windows: venv\Scripts\activate |
|
``` |
|
|
|
3. **Install dependencies:** |
|
```bash |
|
pip install torch torchvision torchaudio |
|
pip install transformers pillow numpy tqdm |
|
``` |
|
|
|
4. **Download required model files:** |
|
- Download `vocab.json` and `merges.txt` from [Stable Diffusion v1.5 tokenizer](https://huggingface.co/ApoorvBrooklyn/stable-diffusion-implementation/tree/main/data) |
|
- Download `v1-5-pruned-emaonly.ckpt` from [Stable Diffusion v1.5](https://huggingface.co/ApoorvBrooklyn/stable-diffusion-implementation/tree/main/data) |
|
- Place all files in the `data/` folder |
|
|
|
## 🎯 Usage |
|
|
|
### Basic Text-to-Image Generation |
|
|
|
```python |
|
import model_loader |
|
import pipeline |
|
from transformers import CLIPTokenizer |
|
|
|
# Initialize tokenizer and load models |
|
tokenizer = CLIPTokenizer("data/vocab.json", merges_file="data/merges.txt") |
|
models = model_loader.preload_models_from_standard_weights("data/v1-5-pruned-emaonly.ckpt", "cpu") |
|
|
|
# Generate image from text |
|
output_image = pipeline.generate( |
|
prompt="A beautiful sunset over mountains, highly detailed, 8k resolution", |
|
uncond_prompt="", # Negative prompt |
|
do_cfg=True, |
|
cfg_scale=8, |
|
sampler_name="ddpm", |
|
n_inference_steps=50, |
|
seed=42, |
|
models=models, |
|
device="cpu", |
|
tokenizer=tokenizer |
|
) |
|
``` |
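
The exact return type of `pipeline.generate` depends on how the pipeline is implemented; the sketch below assumes it returns the image as a `uint8` NumPy array of shape (height, width, 3). If your pipeline already returns a PIL image, call `.save()` on it directly.

```python
from PIL import Image

# Assumes `output_image` is a uint8 NumPy array of shape (512, 512, 3);
# skip the conversion if the pipeline already returns a PIL image.
Image.fromarray(output_image).save("images/sunset.png")
```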
|
|
|
### Image-to-Image Generation |
|
|
|
```python |
|
from PIL import Image |
|
|
|
# Load input image |
|
input_image = Image.open("images/input.jpg") |
|
|
|
# Generate transformed image |
|
output_image = pipeline.generate( |
|
prompt="Transform this into a watercolor painting", |
|
input_image=input_image, |
|
strength=0.8, # Controls how much to change the input |
|
# ... other parameters |
|
) |
|
``` |
|
|
|
### Advanced Configuration |
|
|
|
- **CFG Scale**: Controls how closely the image follows the prompt (1-14) |
|
- **Inference Steps**: More steps = higher quality but slower generation |
|
- **Strength**: For image-to-image, controls transformation intensity (0-1) |
|
- **Seed**: Set for reproducible results |
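
Continuing the text-to-image example above, the same prompt and seed can be re-run with different settings to trade quality for speed. The values below are illustrative starting points, not tuned recommendations:

```python
# Illustrative presets for the parameters described above (example values, not tuned defaults).
fast_preview = dict(n_inference_steps=20, cfg_scale=7, seed=42)
high_quality = dict(n_inference_steps=50, cfg_scale=8, seed=42)

output_image = pipeline.generate(
    prompt="A beautiful sunset over mountains, highly detailed, 8k resolution",
    uncond_prompt="",
    do_cfg=True,
    sampler_name="ddpm",
    models=models,
    device="cpu",
    tokenizer=tokenizer,
    **high_quality,  # swap in fast_preview while iterating on the prompt
)
```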
|
|
|
## 🔧 Model Conversion |
|
|
|
The `model_converter.py` script converts Stable Diffusion checkpoint files to PyTorch format: |
|
|
|
```bash |
|
python main/model_converter.py --checkpoint_path data/v1-5-pruned-emaonly.ckpt --output_dir converted_models/ |
|
``` |
|
|
|
## 🎨 Supported Models |
|
|
|
This implementation is compatible with: |
|
- **Stable Diffusion v1.5**: Base model |
|
- **Fine-tuned Models**: Any SD v1.5 compatible checkpoint |
|
- **Custom Models**: Models trained on specific datasets or styles |
|
|
|
### Tested Fine-tuned Models
|
- **InkPunk Diffusion**: Artistic ink-style images |
|
- **Illustration Diffusion**: Hollie Mengert's illustration style |
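
Loading a fine-tuned checkpoint works the same way as loading the base model: point the loader at the other `.ckpt` file. The filename below is a placeholder for whichever compatible checkpoint you downloaded into `data/`:

```python
import model_loader

# "inkpunk-diffusion-v1.ckpt" is a placeholder filename, not a file shipped with this repo;
# substitute the SD v1.5-compatible checkpoint you actually downloaded.
models = model_loader.preload_models_from_standard_weights("data/inkpunk-diffusion-v1.ckpt", "cpu")
```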
|
|
|
## 🚀 Performance Tips |
|
|
|
- **Device Selection**: Use CUDA for GPU acceleration, MPS for Apple Silicon |
|
- **Batch Processing**: Process multiple prompts simultaneously |
|
- **Memory Management**: Use `idle_device="cpu"` to free GPU memory |
|
- **Optimization**: Adjust inference steps based on quality vs. speed needs |
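
A common device-selection pattern is sketched below; passing `idle_device="cpu"` to `pipeline.generate` is assumed to match this project's pipeline signature, as suggested by the memory-management tip above.

```python
import torch

# Pick the fastest available device: CUDA GPU, Apple Silicon (MPS), or CPU fallback.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

# idle_device="cpu" asks the pipeline to move models off the GPU while they are not in use.
output_image = pipeline.generate(
    prompt="A beautiful sunset over mountains",
    uncond_prompt="",
    do_cfg=True,
    cfg_scale=8,
    sampler_name="ddpm",
    n_inference_steps=50,
    seed=42,
    models=models,
    device=device,
    idle_device="cpu",
    tokenizer=tokenizer,
)
```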
|
|
|
## 🔬 Technical Details |
|
|
|
### Diffusion Process |
|
- Implements DDPM (Denoising Diffusion Probabilistic Models) |
|
- Uses U-Net architecture with cross-attention for text conditioning |
|
- VAE handles 512x512 image compression to 64x64 latents |
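
As a quick numeric sketch of the shapes involved and the forward (noising) process that DDPM learns to invert, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise (stand-in tensors, not the project's actual code):

```python
import torch

# Stand-in tensors to illustrate shapes: the VAE compresses a 512x512 RGB image
# by a factor of 8 into a 4-channel 64x64 latent.
image = torch.randn(1, 3, 512, 512)
latent = torch.randn(1, 4, 512 // 8, 512 // 8)  # (1, 4, 64, 64)

# DDPM forward (noising) step. Beta values are illustrative;
# the actual schedule is defined in ddpm.py.
T = 1000
betas = torch.linspace(0.00085, 0.0120, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

t = 500                                   # an arbitrary timestep
noise = torch.randn_like(latent)
x_t = alphas_cumprod[t].sqrt() * latent + (1.0 - alphas_cumprod[t]).sqrt() * noise
print(x_t.shape)                          # torch.Size([1, 4, 64, 64])
```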
|
|
|
### Attention Mechanisms |
|
- Multi-head self-attention in U-Net |
|
- Cross-attention between text embeddings and image features |
|
- Efficient attention implementation for memory optimization |
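
To illustrate the cross-attention described above (queries from image features, keys and values from text embeddings), here is a minimal single-head sketch in plain PyTorch; the actual multi-head implementation lives in `attention.py` and the dimensions shown are example values.

```python
import torch
import torch.nn.functional as F

batch, n_pixels, n_tokens = 1, 64 * 64, 77  # flattened 64x64 latent, 77 CLIP tokens
d_embed = 320                               # channel width of one U-Net stage (example value)
d_context = 768                             # CLIP text embedding width in SD v1.5

image_features = torch.randn(batch, n_pixels, d_embed)    # queries come from the image latent
text_embeddings = torch.randn(batch, n_tokens, d_context) # keys/values come from the CLIP output

w_q = torch.nn.Linear(d_embed, d_embed, bias=False)
w_k = torch.nn.Linear(d_context, d_embed, bias=False)
w_v = torch.nn.Linear(d_context, d_embed, bias=False)

q, k, v = w_q(image_features), w_k(text_embeddings), w_v(text_embeddings)
attn = F.softmax(q @ k.transpose(-1, -2) / d_embed**0.5, dim=-1)  # (batch, n_pixels, n_tokens)
out = attn @ v                                                    # text-conditioned image features
print(out.shape)  # torch.Size([1, 4096, 320])
```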
|
|
|
### Sampling |
|
- Configurable number of denoising steps |
|
- Classifier-free guidance (CFG) for prompt adherence |
|
- Deterministic generation with seed control |
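
Classifier-free guidance combines two U-Net predictions per step, one conditioned on the prompt and one on the empty (negative) prompt; the standard combination rule, which `cfg_scale` controls, is sketched below with stand-in tensors.

```python
import torch

# Two noise predictions from the U-Net at the same timestep (random stand-ins here):
noise_cond = torch.randn(1, 4, 64, 64)    # conditioned on the prompt
noise_uncond = torch.randn(1, 4, 64, 64)  # conditioned on the empty/negative prompt

cfg_scale = 8
# Standard classifier-free guidance: push the prediction away from the unconditional one.
noise_pred = noise_uncond + cfg_scale * (noise_cond - noise_uncond)

# Reproducibility: seed the generator that produces the initial latent and sampler noise.
generator = torch.Generator(device="cpu").manual_seed(42)
initial_latent = torch.randn(1, 4, 64, 64, generator=generator)
```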
|
|
|
## 🤝 Contributing |
|
|
|
Contributions are welcome! Please feel free to submit pull requests or open issues for: |
|
- Bug fixes |
|
- Performance improvements |
|
- New sampling algorithms |
|
- Additional model support |
|
- Documentation improvements |
|
|
|
## 📄 License |
|
|
|
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. |
|
|
|
## 🙏 Acknowledgments |
|
|
|
- **Stability AI** for the original Stable Diffusion model |
|
- **OpenAI** for the CLIP architecture |
|
- **CompVis** for the VAE implementation |
|
- **Hugging Face** for the transformers library |
|
|
|
## 📚 References |
|
|
|
- [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) |
|
- [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) |
|
- [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) |
|
|
|
## 📞 Support |
|
|
|
If you encounter any issues or have questions: |
|
- Open an issue on GitHub |
|
- Check the existing documentation |
|
- Review the demo code for examples |
|
|
|
--- |
|
|
|
**Note**: This is a research and educational implementation. For production use, consider using the official Stable Diffusion implementations or cloud-based APIs. |