|
--- |
|
license: mit |
|
tags: |
|
- image-quality-assessment |
|
- computer-vision |
|
- brisque |
|
- aesthetic-predictor |
|
- clip |
|
- fusion |
|
- pytorch |
|
- image-classification |
|
language: |
|
- en |
|
pipeline_tag: image-classification |
|
library_name: pytorch |
|
datasets: |
|
- spaq |
|
metrics: |
|
- correlation |
|
- r2 |
|
- mae |
|
base_model: |
|
- openai/clip-vit-base-patch32 |
|
--- |
|
|
|
# Image Quality Fusion Model |
|
|
|
A multi-modal image quality assessment system that combines BRISQUE, Aesthetic Predictor, and CLIP features to predict human-like quality judgments on a 1-10 scale. |
|
|
|
## 🎯 Model Description |
|
|
|
This model fuses three complementary approaches for comprehensive image quality assessment:
|
|
|
- **🔧 BRISQUE (OpenCV)**: Technical quality assessment detecting blur, noise, compression artifacts, and distortions |
|
- **🎨 Aesthetic Predictor (LAION)**: Visual appeal assessment from a predictor trained on human aesthetic ratings over CLIP ViT-B-32 features
|
- **🧠 CLIP (OpenAI)**: Semantic understanding and high-level feature extraction for content awareness |
|
|
|
The fusion network learns how to weight and combine these diverse quality signals, producing human-like quality judgments that correlate with subjective assessments (Pearson r ≈ 0.52 on SPAQ).
|
|
|
## 🚀 Quick Start |
|
|
|
### Installation |
|
|
|
```bash |
|
pip install torch torchvision huggingface_hub opencv-python pillow open-clip-torch |
|
``` |
|
|
|
### Basic Usage |
|
|
|
```python |
|
# Define a minimal loader class that matches the uploaded head (512 -> 256 -> 1) |
|
import torch |
|
import torch.nn as nn |
|
from huggingface_hub import PyTorchModelHubMixin |
|
|
|
class IQFModel(nn.Module, PyTorchModelHubMixin): |
|
def __init__(self, in_dim=512, hidden=256, **kwargs): |
|
# Accept either in_dim/hidden or clip_embed_dim/hidden_dim from config.json |
|
in_dim = kwargs.pop("clip_embed_dim", in_dim) |
|
hidden = kwargs.pop("hidden_dim", hidden) |
|
super().__init__() |
|
self.mlp = nn.Sequential( |
|
nn.Linear(in_dim, hidden), |
|
nn.ReLU(), |
|
nn.Linear(hidden, 1), |
|
) |
|
def forward(self, x): |
|
return self.mlp(x) |
|
|
|
# Load weights from the Hub (defaults to model.safetensors) |
|
model = IQFModel.from_pretrained("matthewyuan/image-quality-fusion", map_location="cpu") |
|
model.eval() |
|
|
|
# Smoke test on a dummy 512-d vector |
|
with torch.no_grad(): |
|
y = model(torch.randn(1, 512)).item() |
|
print(f"score: {y}") |
|
``` |
|
|
|
### Advanced Usage |
|
|
|
```python |
|
import torch |
|
import torch.nn as nn |
|
from PIL import Image |
|
import open_clip |
|
from huggingface_hub import PyTorchModelHubMixin |
|
|
|
# Minimal loader class (same as above) |
|
class IQFModel(nn.Module, PyTorchModelHubMixin): |
|
def __init__(self, in_dim=512, hidden=256, **kwargs): |
|
in_dim = kwargs.pop("clip_embed_dim", in_dim) |
|
hidden = kwargs.pop("hidden_dim", hidden) |
|
super().__init__() |
|
self.mlp = nn.Sequential( |
|
nn.Linear(in_dim, hidden), |
|
nn.ReLU(), |
|
nn.Linear(hidden, 1), |
|
) |
|
def forward(self, x): |
|
return self.mlp(x) |
|
|
|
# 1) Load CLIP ViT-B/32 image encoder (512-d output) |
|
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms( |
|
"ViT-B-32", pretrained="openai" |
|
) |
|
clip_model.eval() |
|
|
|
# 2) Load the fusion head from the Hub |
|
fusion = IQFModel.from_pretrained("matthewyuan/image-quality-fusion", map_location="cpu") |
|
fusion.eval() |
|
|
|
def image_to_clip_embedding(img: Image.Image) -> torch.Tensor: |
|
x = clip_preprocess(img).unsqueeze(0) # [1, 3, H, W] |
|
with torch.no_grad(): |
|
feat = clip_model.encode_image(x) # [1, 512] |
|
feat = feat / feat.norm(dim=-1, keepdim=True) |
|
return feat |
|
|
|
def predict_quality(image_path: str) -> float: |
|
img = Image.open(image_path).convert("RGB") |
|
emb = image_to_clip_embedding(img) # [1, 512] |
|
with torch.no_grad(): |
|
score = fusion(emb).item() # scalar |
|
return float(score) |
|
|
|
print("score:", predict_quality("test.jpg")) |
|
``` |
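To score a batch of images, the `predict_quality` helper above can be applied over a directory. A small usage sketch (the `photos/` path is a placeholder):

```python
from pathlib import Path

# Score every JPEG in a folder and list the highest-rated files first.
paths = sorted(Path("photos").glob("*.jpg"))
scores = {str(p): predict_quality(str(p)) for p in paths}
for path, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:5.2f}  {path}")
```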
|
|
|
## 📊 Performance Metrics |
|
|
|
Evaluated on the SPAQ dataset (11,125 smartphone images with human quality ratings): |
|
|
|
| Metric | Value | Description | |
|
|--------|-------|-------------| |
|
| **Pearson Correlation** | 0.520 | Correlation with human judgments | |
|
| **R² Score** | 0.250 | Coefficient of determination | |
|
| **Mean Absolute Error** | 1.41 | Average prediction error (1-10 scale) | |
|
| **Root Mean Square Error** | 1.69 | RMS prediction error | |
|
|
|
### Comparison with Individual Components |
|
|
|
| Method | Correlation | R² Score | MAE | |
|
|--------|-------------|----------|-----| |
|
| **Fusion Model** | **0.520** | **0.250** | **1.41** | |
|
| BRISQUE Only | 0.31 | 0.12 | 2.1 | |
|
| Aesthetic Only | 0.41 | 0.18 | 1.8 | |
|
| CLIP Only | 0.28 | 0.09 | 2.3 | |
|
|
|
*The fusion approach outperforms each individual component on every metric.*
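For reference, these metrics can be computed with standard tooling. A minimal sketch using `scipy` and `scikit-learn`; the arrays below are synthetic stand-ins, not the actual SPAQ test split:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic stand-ins so the snippet runs; replace with real test-split data.
rng = np.random.default_rng(0)
y_true = rng.uniform(1, 10, size=1000)           # human ratings (1-10)
y_pred = y_true + rng.normal(0, 1.5, size=1000)  # model predictions

pearson_r, _ = pearsonr(y_true, y_pred)    # correlation with human judgments
r2 = r2_score(y_true, y_pred)              # coefficient of determination
mae = mean_absolute_error(y_true, y_pred)  # mean absolute error
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
print(f"r={pearson_r:.3f}  R2={r2:.3f}  MAE={mae:.2f}  RMSE={rmse:.2f}")
```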
|
|
|
## 🏗️ Model Architecture |
|
|
|
``` |
|
Input Image (RGB) |
|
├── OpenCV BRISQUE → Technical Quality Score (0-100, normalized) |
|
├── LAION Aesthetic → Aesthetic Score (0-10, normalized) |
|
└── OpenAI CLIP-B32 → Semantic Features (512-dimensional) |
|
↓ |
|
Feature Fusion Network |
|
┌─────────────────────────┐ |
|
│ BRISQUE: 1D → 64 → 128 │ |
|
│ Aesthetic: 1D → 64 → 128│ |
|
│ CLIP: 512D → 256 → 128 │ |
|
└─────────────────────────┘ |
|
↓ (concat) |
|
Deep Fusion Layers (384D → 256D → 128D → 1D) |
|
Dropout (0.3) + ReLU activations |
|
↓ |
|
Human-like Quality Score (1.0 - 10.0) |
|
``` |
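The two scalar branches consume normalized quality scores. A sketch of how the BRISQUE input might be produced, assuming `opencv-contrib-python` is installed and the LIVE model files (`brisque_model_live.yml`, `brisque_range_live.yml`) have been downloaded locally; the min-max normalization shown is illustrative, not necessarily the exact scheme used in training. The aesthetic input comes from the LAION predictor and would be normalized analogously.

```python
import cv2  # requires opencv-contrib-python for the cv2.quality module

def brisque_score(image_path: str) -> float:
    """Raw BRISQUE score (lower = better), roughly in [0, 100]."""
    img = cv2.imread(image_path)
    # QualityBRISQUE_compute returns a scalar tuple; the first entry is the score.
    return float(cv2.quality.QualityBRISQUE_compute(
        img, "brisque_model_live.yml", "brisque_range_live.yml"
    )[0])

def normalized_brisque(image_path: str) -> float:
    # Illustrative min-max normalization to [0, 1], flipped so higher = better.
    s = min(max(brisque_score(image_path), 0.0), 100.0)
    return 1.0 - s / 100.0
```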
|
|
|
### Technical Details |
|
|
|
- **Input Resolution**: Any size (resized to 224×224 for CLIP) |
|
- **Architecture**: Feed-forward neural network with residual connections |
|
- **Activation Functions**: ReLU for hidden layers, Linear for output |
|
- **Regularization**: Dropout (0.3), Early stopping |
|
- **Output Range**: 1.0 - 10.0 (human rating scale) |
|
- **Parameters**: ~2.1M total (see the illustrative sketch below)
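As a companion to the diagram, here is an illustrative PyTorch module with the stated branch and fusion sizes. Treat it purely as a reading aid: the minimal loader in Quick Start reconstructs only the published 512 → 256 → 1 head, this sketch omits the residual connections mentioned above, and its parameter count is well below the stated ~2.1M, so the production network likely differs in detail.

```python
import torch
import torch.nn as nn

class FusionNetworkSketch(nn.Module):
    """Illustrative implementation of the diagram above (not the published checkpoint)."""

    def __init__(self, clip_dim: int = 512, dropout: float = 0.3):
        super().__init__()
        self.brisque_branch = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 128), nn.ReLU()
        )
        self.aesthetic_branch = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 128), nn.ReLU()
        )
        self.clip_branch = nn.Sequential(
            nn.Linear(clip_dim, 256), nn.ReLU(), nn.Linear(256, 128), nn.ReLU()
        )
        self.fusion = nn.Sequential(
            nn.Linear(384, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, 1),
        )

    def forward(self, brisque, aesthetic, clip_feat):
        # Each branch maps its input to 128-d; concatenation gives 384-d.
        fused = torch.cat(
            [
                self.brisque_branch(brisque),
                self.aesthetic_branch(aesthetic),
                self.clip_branch(clip_feat),
            ],
            dim=-1,
        )
        return self.fusion(fused)

# Smoke test on dummy inputs: [batch, 1], [batch, 1], [batch, 512]
net = FusionNetworkSketch()
print(net(torch.randn(2, 1), torch.randn(2, 1), torch.randn(2, 512)).shape)  # [2, 1]
```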
|
|
|
## 🔬 Training Details |
|
|
|
### Dataset |
|
- **Name**: SPAQ (Smartphone Photography Attribute and Quality) |
|
- **Size**: 11,125 high-resolution smartphone images |
|
- **Annotations**: Human quality ratings (1-10 scale, 5+ annotators per image) |
|
- **Split**: 80% train, 10% validation, 10% test |
|
- **Domain**: Consumer smartphone photography |
|
|
|
### Training Configuration |
|
- **Framework**: PyTorch 2.0+ with MPS acceleration (M1 optimized) |
|
- **Optimizer**: AdamW (lr=1e-3, weight_decay=1e-4) |
|
- **Batch Size**: 128 (optimized for 32GB unified memory) |
|
- **Epochs**: 50 with early stopping (patience=10) |
|
- **Loss Function**: Mean Squared Error (MSE) |
|
- **Learning Rate Schedule**: ReduceLROnPlateau (factor=0.5, patience=5) |
|
- **Hardware**: M1 MacBook Pro (32GB RAM) |
|
- **Training Time**: ~1 hour (with feature caching; see the training-loop sketch below)
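The configuration above maps onto a standard PyTorch loop. A minimal sketch wiring up the stated optimizer, loss, scheduler, and early-stopping settings; the model and data here are dummy stand-ins so the snippet runs end to end:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins so the sketch runs; replace with real cached features/ratings.
X, y = torch.randn(512, 512), torch.rand(512) * 9 + 1
train_loader = DataLoader(TensorDataset(X[:448], y[:448]), batch_size=128, shuffle=True)
val_loader = DataLoader(TensorDataset(X[448:], y[448:]), batch_size=128)
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)
criterion = nn.MSELoss()

best_val, patience_left = float("inf"), 10
for epoch in range(50):
    model.train()
    for features, ratings in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(features).squeeze(-1), ratings)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(
            criterion(model(f).squeeze(-1), r).item() for f, r in val_loader
        ) / len(val_loader)
    scheduler.step(val_loss)

    if val_loss < best_val:                      # keep the best checkpoint
        best_val, patience_left = val_loss, 10
        torch.save(model.state_dict(), "best_fusion.pt")
    else:
        patience_left -= 1
        if patience_left == 0:                   # early stopping (patience=10)
            break
```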
|
|
|
### Optimization Techniques |
|
- **Mixed Precision Training**: MPS autocast for M1 acceleration |
|
- **Feature Caching**: Pre-computed embeddings for a 20-30x speedup (sketched below)
|
- **Data Loading**: Optimized DataLoader (6-8 workers, memory pinning) |
|
- **Memory Management**: Garbage collection every 10 batches |
|
- **Preprocessing Pipeline**: Parallel BRISQUE computation |
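Feature caching is the main speedup: each image is embedded once and the result is reused across epochs. A minimal sketch, assuming the `image_to_clip_embedding` helper from Advanced Usage; the cache filename is a placeholder:

```python
from pathlib import Path

import torch
from PIL import Image

CACHE = Path("clip_features.pt")  # placeholder cache location

def cached_embeddings(image_paths: list[str]) -> torch.Tensor:
    # Reuse the cache when present; otherwise embed every image once and save.
    if CACHE.exists():
        return torch.load(CACHE)
    feats = torch.cat(
        [image_to_clip_embedding(Image.open(p).convert("RGB")) for p in image_paths]
    )  # [N, 512]
    torch.save(feats, CACHE)
    return feats
```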
|
|
|
## 📱 Use Cases |
|
|
|
### Professional Applications |
|
- **Content Management**: Automatic quality filtering for large image databases |
|
- **Social Media**: Real-time quality assessment for user uploads |
|
- **E-commerce**: Product image quality validation |
|
- **Digital Asset Management**: Automated quality scoring for photo libraries |
|
|
|
### Research Applications |
|
- **Image Quality Research**: Benchmark for perceptual quality metrics |
|
- **Dataset Curation**: Quality-based dataset filtering and ranking |
|
- **Human Perception Studies**: Computational model of aesthetic judgment |
|
- **Multi-modal Learning**: Example of successful feature fusion |
|
|
|
### Creative Applications |
|
- **Photography Tools**: Automated photo rating and selection |
|
- **Mobile Apps**: Real-time quality feedback during capture |
|
- **Photo Editing**: Quality-guided automatic enhancement |
|
- **Portfolio Management**: Intelligent photo organization |
|
|
|
## ⚠️ Limitations and Biases |
|
|
|
### Model Limitations |
|
- **Domain Specificity**: Trained primarily on smartphone photography |
|
- **Resolution Dependency**: Performance may vary on very low- or very high-resolution images
|
- **Cultural Bias**: Aesthetic preferences may reflect training data demographics |
|
- **Temporal Bias**: Training data from specific time period may not reflect evolving preferences |
|
|
|
### Technical Limitations |
|
- **BRISQUE Scope**: May not capture all types of technical degradation |
|
- **CLIP Bias**: Inherits biases from CLIP's training data |
|
- **Aesthetic Subjectivity**: Individual preferences vary significantly |
|
- **Computational Requirements**: Requires GPU for optimal inference speed |
|
|
|
### Recommended Usage |
|
- **Validation**: Always validate on your specific domain before production use |
|
- **Human Oversight**: Use as a tool to assist, not replace, human judgment |
|
- **Bias Mitigation**: Consider diverse evaluation datasets |
|
- **Performance Monitoring**: Monitor performance on your specific use case |
|
|
|
## 📚 Citation |
|
|
|
If you use this model in your research, please cite: |
|
|
|
```bibtex |
|
@misc{image-quality-fusion-2024, |
|
title={Image Quality Fusion: Multi-Modal Assessment with BRISQUE, Aesthetic, and CLIP Features}, |
|
author={Matthew Yuan}, |
|
year={2024}, |
|
howpublished={\url{https://huggingface.co/matthewyuan/image-quality-fusion}}, |
|
note={Trained on SPAQ dataset, deployed via GitHub Actions CI/CD} |
|
} |
|
``` |
|
|
|
## 🔗 Related Work |
|
|
|
### Datasets |
|
- [SPAQ Dataset](https://github.com/h4nwei/SPAQ) - Smartphone Photography Attribute and Quality |
|
- [AVA Dataset](https://github.com/mtobeiyf/ava_downloader) - Aesthetic Visual Analysis |
|
- [LIVE IQA](https://live.ece.utexas.edu/research/Quality/) - Laboratory for Image & Video Engineering |
|
|
|
### Models |
|
- [LAION Aesthetic Predictor](https://github.com/LAION-AI/aesthetic-predictor) - Aesthetic scoring model |
|
- [OpenCLIP](https://github.com/mlfoundations/open_clip) - Open source CLIP implementation |
|
- [BRISQUE](https://learnopencv.com/image-quality-assessment-brisque/) - Blind/Referenceless Image Spatial Quality Evaluator |
|
|
|
## 🛠️ Development |
|
|
|
### Local Development |
|
```bash |
|
# Clone repository |
|
git clone https://github.com/mattkyuan/image-quality-fusion.git |
|
cd image-quality-fusion |
|
|
|
# Install dependencies |
|
pip install -r requirements.txt |
|
|
|
# Run training |
|
python src/image_quality_fusion/training/train_fusion.py \ |
|
--image_dir data/images \ |
|
--annotations data/annotations.csv \ |
|
--prepare_data \ |
|
--epochs 50 |
|
``` |
|
|
|
### CI/CD Pipeline |
|
This model is automatically deployed via GitHub Actions: |
|
- **Training Pipeline**: Automated model training on code changes |
|
- **Deployment Pipeline**: Automatic HF Hub deployment on model updates |
|
- **Testing Pipeline**: Comprehensive model validation and testing |
|
|
|
## 📄 License |
|
|
|
This project is licensed under the MIT License - see the [LICENSE](https://github.com/mattkyuan/image-quality-fusion/blob/main/LICENSE) file for details. |
|
|
|
## 🙏 Acknowledgments |
|
|
|
- **SPAQ Dataset**: Fang et al. for the comprehensive smartphone photography dataset
|
- **LAION**: For the aesthetic predictor model and training methodology |
|
- **OpenAI**: For CLIP model architecture and pre-trained weights |
|
- **OpenCV**: For BRISQUE implementation and computer vision tools |
|
- **Hugging Face**: For model hosting and deployment infrastructure |
|
- **PyTorch Team**: For the deep learning framework and MPS acceleration |
|
|
|
## 📞 Contact |
|
|
|
- **Repository**: [github.com/mattkyuan/image-quality-fusion](https://github.com/mattkyuan/image-quality-fusion) |
|
- **Issues**: [GitHub Issues](https://github.com/mattkyuan/image-quality-fusion/issues) |
|
- **Hugging Face**: [matthewyuan/image-quality-fusion](https://huggingface.co/matthewyuan/image-quality-fusion) |
|
|
|
--- |
|
|
|
*This model was trained and deployed using automated CI/CD pipelines for reproducible ML workflows.* |
|
|