casale-xyz/codestral-vit-mlx

vision-language

code-generation


Michaelq committed on Dec 13, 2024

Commit cf39aa0 · verified · 1 Parent(s): 84fd40b

Update README.md

Files changed (1)

README.md +115 -3

README.md CHANGED

@@ -1,3 +1,115 @@
----
-license: apache-2.0
----

+---
+language: en
+tags:
+- codestral
+- vision-language
+- code-generation
+- multimodal
+- mlx
+license: other
+library_name: mlx
+inference: false
+license_name: mnpl
+license_link: https://mistral.ai/licences/MNPL-0.1.md
+---
+# Codestral-ViT
+A multimodal code generation model that combines vision and language understanding. Built on MLX for Apple Silicon, it integrates CLIP's visual capabilities with Codestral's code generation abilities.
+## Overview
+Codestral-ViT extends the Codestral language model with visual understanding capabilities. It can:
+- Generate code from text descriptions
+- Understand and explain code from screenshots
+- Suggest improvements to code based on visual context
+- Process multiple images using overlapping crops and tiling
+## Technical Details
+- **Base Models:**
+  - Language: Codestral-22B (4-bit quantized)
+  - Vision: CLIP ViT-Large/14
+  - Framework: MLX (Apple Silicon)
+- **Architecture:**
+  - Vision encoder processes images into 512-dim embeddings
+  - Learned projection layer maps vision features to language space
+  - Dynamic RoPE scaling for 32K context window
+  - Support for overlapping image crops and tiling
+- **Input Processing:**
+  - Images: 224x224 pixels, CLIP normalization
+  - Text: Up to 32,768 tokens
+  - Special tokens for image-text fusion
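The README lists 224x224 inputs, CLIP normalization, and overlapping crops, but the diff does not show how tiles are chosen. The sketch below is one possible scheme, not the repository's implementation: `tile_boxes`, `clip_normalize`, and the `overlap` parameter are hypothetical names introduced for illustration. The mean and standard deviation are the per-channel constants published with OpenAI's CLIP preprocessing.

```python
from typing import List, Tuple

TILE = 224  # input resolution stated in the README

# Per-channel RGB statistics used by OpenAI CLIP preprocessing.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def _starts(length: int, tile: int, stride: int) -> List[int]:
    """Start offsets along one axis so that tiles cover [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    # Always add a final tile flush with the far edge.
    starts.append(length - tile)
    return starts


def tile_boxes(width: int, height: int, overlap: int = 56) -> List[Tuple[int, int, int, int]]:
    """Return PIL-style (left, upper, right, lower) crop boxes in row-major order.

    `overlap` (an assumed parameter) is the number of pixels shared by
    neighbouring tiles. Images smaller than a tile yield a single box
    covering the whole image, to be resized to TILE x TILE afterwards.
    """
    if not 0 <= overlap < TILE:
        raise ValueError("overlap must be in [0, TILE)")
    stride = TILE - overlap
    return [
        (x, y, min(x + TILE, width), min(y + TILE, height))
        for y in _starts(height, TILE, stride)
        for x in _starts(width, TILE, stride)
    ]


def clip_normalize(rgb: Tuple[int, int, int]) -> Tuple[float, ...]:
    """Scale one 8-bit RGB pixel to [0, 1], then standardize per channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))
```

Under these assumptions, each box could be passed to `Image.crop`, resized to 224x224 where needed, and normalized per pixel before reaching the vision encoder. The number of tiles grows with image area, which is one reason large screenshots are memory intensive.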
+## Example Usage
+```python
+from PIL import Image
+from src.model import MultimodalCodestral
+model = MultimodalCodestral()
+# Code generation from screenshot
+image = Image.open("code_screenshot.png")
+response = model.generate_with_images(
+    prompt="Explain this code and suggest improvements",
+    images=[image]
+)
+# Multiple image processing
+images = [Image.open(f) for f in ["img1.png", "img2.png"]]
+response = model.generate_with_images(
+    prompt="Compare these code implementations",
+    images=images
+)
+```
+## Capabilities
+- **Code Understanding:**
+  - Analyzes code structure from screenshots
+  - Identifies patterns and anti-patterns
+  - Suggests contextual improvements
+- **Image Processing:**
+  - Handles multiple image inputs
+  - Supports various image formats
+  - Overlapping crops, tiling, and resizing to 224x224
+- **Generation Features:**
+  - Context-aware code completion
+  - Documentation generation
+  - Code refactoring suggestions
+  - Bug identification and fixes
+## Requirements
+- Apple Silicon hardware (M1/M2/M3)
+- 32GB+ RAM recommended
+- MLX framework
+- Python 3.8+
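A quick way to check a machine against the requirements above is sketched here, using only the Python standard library. The helper names `meets_requirements` and `quantized_weight_gb` are hypothetical and do not come from this repository. The 22B parameter count is taken from the base model's name, and the estimate ignores quantization scales, activations, and the KV cache, so real memory use will be higher.

```python
import platform
import sys


def meets_requirements(system: str, machine: str, version: tuple) -> bool:
    """True on macOS/arm64 (Apple Silicon) with Python 3.8 or newer."""
    return system == "Darwin" and machine == "arm64" and tuple(version[:2]) >= (3, 8)


def quantized_weight_gb(n_params: float, bits: int) -> float:
    """Approximate weight storage in GB (10**9 bytes), ignoring quantization scales."""
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    # platform.machine() reports "x86_64" when Python runs under Rosetta,
    # so a False result on Apple hardware may mean a non-native interpreter.
    ok = meets_requirements(platform.system(), platform.machine(), sys.version_info)
    print("Apple Silicon with Python 3.8+:", ok)
    print("Approx. 4-bit weights for 22B parameters (GB):", quantized_weight_gb(22e9, 4))
```

The weight estimate comes to roughly 11 GB for the language model alone, which leaves headroom within the recommended 32 GB for the vision encoder and a long context.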
+## Limitations
+- Apple Silicon only (no CPU/CUDA support)
+- Memory intensive for large images/codebases
+- Visual understanding bounded by CLIP's capabilities
+- Generation quality depends on input clarity
+## License
+This model is released under the Mistral AI Non-Production License (MNPL). See [license details](https://mistral.ai/licences/MNPL-0.1.md).
+## Citation
+```bibtex
+@software{codestral-vit,
+  author = {Mike Casale},
+  title = {Codestral-ViT: A Vision-Language Model for Code Generation},
+  year = {2024},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/casale-xyz/codestral-vit-mlx}
+}
+```