---
license: mit
tags:
- image-quality-assessment
- computer-vision
- brisque
- aesthetic-predictor
- clip
- fusion
- pytorch
- image-classification
language:
- en
pipeline_tag: image-classification
library_name: pytorch
datasets:
- spaq
metrics:
- correlation
- r2
- mae
base_model:
- openai/clip-vit-base-patch32
---
# Image Quality Fusion Model
A multi-modal image quality assessment system that combines BRISQUE, Aesthetic Predictor, and CLIP features to predict human-like quality judgments on a 1-10 scale.
## 🎯 Model Description
This model fuses three complementary signals to produce a comprehensive image quality assessment:
- **🔧 BRISQUE (OpenCV)**: Technical quality assessment detecting blur, noise, compression artifacts, and distortions
- **🎨 Aesthetic Predictor (LAION)**: Visual appeal assessment using CLIP ViT-B-32 features trained on human aesthetic ratings
- **🧠 CLIP (OpenAI)**: Semantic understanding and high-level feature extraction for content awareness
The fusion network learns optimal weights to combine these diverse quality signals, producing human-like quality judgments that correlate strongly with subjective assessments.
## 🚀 Quick Start
### Installation
```bash
pip install torch torchvision huggingface_hub opencv-python pillow open-clip-torch
```
### Basic Usage
```python
# Define a minimal loader class that matches the uploaded head (512 -> 256 -> 1)
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin


class IQFModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, in_dim=512, hidden=256, **kwargs):
        # Accept either in_dim/hidden or clip_embed_dim/hidden_dim from config.json
        in_dim = kwargs.pop("clip_embed_dim", in_dim)
        hidden = kwargs.pop("hidden_dim", hidden)
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.mlp(x)


# Load weights from the Hub (defaults to model.safetensors)
model = IQFModel.from_pretrained("matthewyuan/image-quality-fusion", map_location="cpu")
model.eval()

# Smoke test on a dummy 512-d vector
with torch.no_grad():
    y = model(torch.randn(1, 512)).item()
print(f"score: {y}")
```
### Advanced Usage
```python
import torch
import torch.nn as nn
from PIL import Image
import open_clip
from huggingface_hub import PyTorchModelHubMixin


# Minimal loader class (same as above)
class IQFModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, in_dim=512, hidden=256, **kwargs):
        in_dim = kwargs.pop("clip_embed_dim", in_dim)
        hidden = kwargs.pop("hidden_dim", hidden)
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.mlp(x)


# 1) Load CLIP ViT-B/32 image encoder (512-d output)
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
clip_model.eval()

# 2) Load the fusion head from the Hub
fusion = IQFModel.from_pretrained("matthewyuan/image-quality-fusion", map_location="cpu")
fusion.eval()


def image_to_clip_embedding(img: Image.Image) -> torch.Tensor:
    x = clip_preprocess(img).unsqueeze(0)  # [1, 3, H, W]
    with torch.no_grad():
        feat = clip_model.encode_image(x)  # [1, 512]
        feat = feat / feat.norm(dim=-1, keepdim=True)
    return feat


def predict_quality(image_path: str) -> float:
    img = Image.open(image_path).convert("RGB")
    emb = image_to_clip_embedding(img)  # [1, 512]
    with torch.no_grad():
        score = fusion(emb).item()  # scalar
    return float(score)


print("score:", predict_quality("test.jpg"))
```
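The `predict_quality` helper above can also score a whole folder; a small sketch (the directory path and extension filter are illustrative):

```python
from pathlib import Path

# Score every JPEG/PNG in a directory and print from best to worst
image_dir = Path("photos")  # illustrative path
scores = {
    p.name: predict_quality(str(p))
    for p in sorted(image_dir.iterdir())
    if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:5.2f}  {name}")
```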
## 📊 Performance Metrics
Evaluated on the SPAQ dataset (11,125 smartphone images with human quality ratings):
| Metric | Value | Description |
|--------|-------|-------------|
| **Pearson Correlation** | 0.520 | Correlation with human judgments |
| **R² Score** | 0.250 | Coefficient of determination |
| **Mean Absolute Error** | 1.41 | Average prediction error (1-10 scale) |
| **Root Mean Square Error** | 1.69 | RMS prediction error |
### Comparison with Individual Components
| Method | Correlation | R² Score | MAE |
|--------|-------------|----------|-----|
| **Fusion Model** | **0.520** | **0.250** | **1.41** |
| BRISQUE Only | 0.31 | 0.12 | 2.1 |
| Aesthetic Only | 0.41 | 0.18 | 1.8 |
| CLIP Only | 0.28 | 0.09 | 2.3 |
*The fusion approach significantly outperforms individual components.*
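These metrics can be recomputed on your own labeled images (the same recipe covers the domain validation recommended under Limitations below); a sketch assuming the `predict_quality` helper from Advanced Usage and a hypothetical `my_ratings.csv` with `path` and `rating` columns:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical evaluation file: one image path and one human rating (1-10) per row
df = pd.read_csv("my_ratings.csv")
preds = np.array([predict_quality(p) for p in df["path"]])
labels = df["rating"].to_numpy(dtype=float)

pearson_r, _ = stats.pearsonr(preds, labels)
r2 = 1.0 - np.sum((labels - preds) ** 2) / np.sum((labels - labels.mean()) ** 2)
mae = np.abs(labels - preds).mean()
rmse = np.sqrt(((labels - preds) ** 2).mean())

print(f"Pearson r: {pearson_r:.3f}  R^2: {r2:.3f}  MAE: {mae:.2f}  RMSE: {rmse:.2f}")
```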
## 🏗️ Model Architecture
```
Input Image (RGB)
├── OpenCV BRISQUE → Technical Quality Score (0-100, normalized)
├── LAION Aesthetic → Aesthetic Score (0-10, normalized)
└── OpenAI CLIP-B32 → Semantic Features (512-dimensional)
Feature Fusion Network
┌─────────────────────────┐
│ BRISQUE: 1D → 64 → 128 │
│ Aesthetic: 1D → 64 → 128│
│ CLIP: 512D → 256 → 128 │
└─────────────────────────┘
↓ (concat)
Deep Fusion Layers (384D → 256D → 128D → 1D)
Dropout (0.3) + ReLU activations
Human-like Quality Score (1.0 - 10.0)
```
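Inference only needs the minimal 512 → 256 → 1 head shown in Quick Start. As a reading aid, the full three-branch network in the diagram would look roughly like the sketch below in PyTorch; the module names, and the sigmoid rescaling used to pin the output to the 1-10 range, are illustrative assumptions rather than the exact training code:

```python
import torch
import torch.nn as nn


class FusionQualityNet(nn.Module):
    """Sketch of the three-branch fusion network from the diagram above."""

    def __init__(self, clip_dim: int = 512, dropout: float = 0.3):
        super().__init__()
        # Scalar BRISQUE score (normalized) -> 128-d branch
        self.brisque_branch = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 128), nn.ReLU()
        )
        # Scalar aesthetic score (normalized) -> 128-d branch
        self.aesthetic_branch = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 128), nn.ReLU()
        )
        # 512-d CLIP embedding -> 128-d branch
        self.clip_branch = nn.Sequential(
            nn.Linear(clip_dim, 256), nn.ReLU(), nn.Linear(256, 128), nn.ReLU()
        )
        # Concatenated 384-d features -> single quality score
        self.fusion = nn.Sequential(
            nn.Linear(384, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, 1),
        )

    def forward(self, brisque, aesthetic, clip_feat):
        # brisque/aesthetic: [B, 1] normalized scores, clip_feat: [B, 512]
        fused = torch.cat(
            [
                self.brisque_branch(brisque),
                self.aesthetic_branch(aesthetic),
                self.clip_branch(clip_feat),
            ],
            dim=-1,
        )
        # Assumption: squash to the 1-10 human rating scale via a sigmoid
        return 1.0 + 9.0 * torch.sigmoid(self.fusion(fused))
```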
### Technical Details
- **Input Resolution**: Any size (resized to 224×224 for CLIP)
- **Architecture**: Feed-forward neural network with residual connections
- **Activation Functions**: ReLU for hidden layers, Linear for output
- **Regularization**: Dropout (0.3), Early stopping
- **Output Range**: 1.0 - 10.0 (human rating scale)
- **Parameters**: ~2.1M total parameters
## 🔬 Training Details
### Dataset
- **Name**: SPAQ (Smartphone Photography Attribute and Quality)
- **Size**: 11,125 high-resolution smartphone images
- **Annotations**: Human quality ratings (1-10 scale, 5+ annotators per image)
- **Split**: 80% train, 10% validation, 10% test
- **Domain**: Consumer smartphone photography
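The 80/10/10 split can be reproduced with a fixed seed; a minimal sketch assuming a `data/annotations.csv` file with one row per image (the column layout is illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative annotations file: one image path and mean human rating per row
df = pd.read_csv("data/annotations.csv")

rng = np.random.default_rng(42)  # fixed seed for reproducibility
idx = rng.permutation(len(df))

n_train = int(0.8 * len(df))
n_val = int(0.1 * len(df))
train_df = df.iloc[idx[:n_train]]
val_df = df.iloc[idx[n_train:n_train + n_val]]
test_df = df.iloc[idx[n_train + n_val:]]

print(len(train_df), len(val_df), len(test_df))
```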
### Training Configuration
- **Framework**: PyTorch 2.0+ with MPS acceleration (M1 optimized)
- **Optimizer**: AdamW (lr=1e-3, weight_decay=1e-4)
- **Batch Size**: 128 (optimized for 32GB unified memory)
- **Epochs**: 50 with early stopping (patience=10)
- **Loss Function**: Mean Squared Error (MSE)
- **Learning Rate Schedule**: ReduceLROnPlateau (factor=0.5, patience=5)
- **Hardware**: M1 MacBook Pro (32GB RAM)
- **Training Time**: ~1 hour (with feature caching)
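A condensed sketch of this configuration, assuming pre-computed features and ratings are already wrapped in `train_loader` / `val_loader` DataLoaders and reusing the `FusionQualityNet` sketch from the architecture section (the actual training entry point is `train_fusion.py` in the repository):

```python
import torch
import torch.nn as nn

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = FusionQualityNet().to(device)  # sketch class from the architecture section

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)
criterion = nn.MSELoss()

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(50):
    model.train()
    for brisque, aesthetic, clip_feat, rating in train_loader:  # assumed DataLoader
        optimizer.zero_grad()
        pred = model(brisque.to(device), aesthetic.to(device), clip_feat.to(device))
        loss = criterion(pred.squeeze(-1), rating.to(device))
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(
            criterion(
                model(b.to(device), a.to(device), c.to(device)).squeeze(-1),
                r.to(device),
            ).item()
            for b, a, c, r in val_loader  # assumed DataLoader
        ) / len(val_loader)
    scheduler.step(val_loss)

    # Early stopping (patience = 10)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```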
### Optimization Techniques
- **Mixed Precision Training**: MPS autocast for M1 acceleration
- **Feature Caching**: Pre-computed embeddings for a 20-30x speedup (see the sketch after this list)
- **Data Loading**: Optimized DataLoader (6-8 workers, memory pinning)
- **Memory Management**: Garbage collection every 10 batches
- **Preprocessing Pipeline**: Parallel BRISQUE computation
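The feature-caching step is easy to replicate: encode every image once, save the embeddings to disk, and train the fusion head on the cached tensors. A sketch reusing `image_to_clip_embedding` from Advanced Usage (the cache location and glob pattern are illustrative):

```python
from pathlib import Path

import torch
from PIL import Image

cache_path = Path("cache/clip_embeddings.pt")  # illustrative cache location
cache_path.parent.mkdir(parents=True, exist_ok=True)

embeddings = {}
for img_path in sorted(Path("data/images").glob("*.jpg")):
    img = Image.open(img_path).convert("RGB")
    embeddings[img_path.name] = image_to_clip_embedding(img).squeeze(0)  # [512]

torch.save(embeddings, cache_path)
# Later: embeddings = torch.load(cache_path) and train on the cached tensors directly
```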
## 📱 Use Cases
### Professional Applications
- **Content Management**: Automatic quality filtering for large image databases
- **Social Media**: Real-time quality assessment for user uploads
- **E-commerce**: Product image quality validation
- **Digital Asset Management**: Automated quality scoring for photo libraries
### Research Applications
- **Image Quality Research**: Benchmark for perceptual quality metrics
- **Dataset Curation**: Quality-based dataset filtering and ranking
- **Human Perception Studies**: Computational model of aesthetic judgment
- **Multi-modal Learning**: Example of successful feature fusion
### Creative Applications
- **Photography Tools**: Automated photo rating and selection
- **Mobile Apps**: Real-time quality feedback during capture
- **Photo Editing**: Quality-guided automatic enhancement
- **Portfolio Management**: Intelligent photo organization
## ⚠️ Limitations and Biases
### Model Limitations
- **Domain Specificity**: Trained primarily on smartphone photography
- **Resolution Dependency**: Performance may vary with very low/high resolution images
- **Cultural Bias**: Aesthetic preferences may reflect training data demographics
- **Temporal Bias**: Training data from specific time period may not reflect evolving preferences
### Technical Limitations
- **BRISQUE Scope**: May not capture all types of technical degradation
- **CLIP Bias**: Inherits biases from CLIP's training data
- **Aesthetic Subjectivity**: Individual preferences vary significantly
- **Computational Requirements**: Requires GPU for optimal inference speed
### Recommended Usage
- **Validation**: Always validate on your specific domain before production use
- **Human Oversight**: Use as a tool to assist, not replace, human judgment
- **Bias Mitigation**: Consider diverse evaluation datasets
- **Performance Monitoring**: Monitor performance on your specific use case
## 📚 Citation
If you use this model in your research, please cite:
```bibtex
@misc{image-quality-fusion-2024,
  title={Image Quality Fusion: Multi-Modal Assessment with BRISQUE, Aesthetic, and CLIP Features},
  author={Matthew Yuan},
  year={2024},
  howpublished={\url{https://huggingface.co/matthewyuan/image-quality-fusion}},
  note={Trained on SPAQ dataset, deployed via GitHub Actions CI/CD}
}
```
## 🔗 Related Work
### Datasets
- [SPAQ Dataset](https://github.com/h4nwei/SPAQ) - Smartphone Photography Attribute and Quality
- [AVA Dataset](https://github.com/mtobeiyf/ava_downloader) - Aesthetic Visual Analysis
- [LIVE IQA](https://live.ece.utexas.edu/research/Quality/) - Laboratory for Image & Video Engineering
### Models
- [LAION Aesthetic Predictor](https://github.com/LAION-AI/aesthetic-predictor) - Aesthetic scoring model
- [OpenCLIP](https://github.com/mlfoundations/open_clip) - Open source CLIP implementation
- [BRISQUE](https://learnopencv.com/image-quality-assessment-brisque/) - Blind/Referenceless Image Spatial Quality Evaluator
## 🛠️ Development
### Local Development
```bash
# Clone repository
git clone https://github.com/mattkyuan/image-quality-fusion.git
cd image-quality-fusion

# Install dependencies
pip install -r requirements.txt

# Run training
python src/image_quality_fusion/training/train_fusion.py \
    --image_dir data/images \
    --annotations data/annotations.csv \
    --prepare_data \
    --epochs 50
```
### CI/CD Pipeline
This model is automatically deployed via GitHub Actions:
- **Training Pipeline**: Automated model training on code changes
- **Deployment Pipeline**: Automatic HF Hub deployment on model updates
- **Testing Pipeline**: Comprehensive model validation and testing
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](https://github.com/mattkyuan/image-quality-fusion/blob/main/LICENSE) file for details.
## 🙏 Acknowledgments
- **SPAQ Dataset**: H4nwei et al. for the comprehensive smartphone photography dataset
- **LAION**: For the aesthetic predictor model and training methodology
- **OpenAI**: For CLIP model architecture and pre-trained weights
- **OpenCV**: For BRISQUE implementation and computer vision tools
- **Hugging Face**: For model hosting and deployment infrastructure
- **PyTorch Team**: For the deep learning framework and MPS acceleration
## 📞 Contact
- **Repository**: [github.com/mattkyuan/image-quality-fusion](https://github.com/mattkyuan/image-quality-fusion)
- **Issues**: [GitHub Issues](https://github.com/mattkyuan/image-quality-fusion/issues)
- **Hugging Face**: [matthewyuan/image-quality-fusion](https://huggingface.co/matthewyuan/image-quality-fusion)
---
*This model was trained and deployed using automated CI/CD pipelines for reproducible ML workflows.*