Update README.md
README.md
A compact vision language model that you can pretrain and finetune on a single consumer GPU.
## 🔍 Performance & Training Highlights
- 📊 **VQAv2 Accuracy**:
  Achieves **56.91%** on VQAv2 dev/test, making MicroLLaVA one of the best-performing open-source vision language models under **700M parameters**.

- 🧠 **Parameter Budget**:
  - 🗣️ Language Model: **MicroLLaMA (300M)**
  - 👁️ Vision Encoder: **SigLIP2 (400M)**
    → **~700M total parameters** (see the parameter-count sketch after this list)

- 🏆 **Best in Class**:
  According to ChatGPT’s Deep Research Agent (Aug 2025):
  > *“No known open model below ~700M currently surpasses MicroLLaVA’s VQAv2 accuracy. Models that do perform better tend to have larger language components.”*

- 🧪 **Ongoing Experiments**:
  - 🔧 **Qwen3-0.6B + SigLIP2**
    → Training is **converging**, with promising loss curves. (Qwen3-0.6B is significantly larger than MicroLLaMA.)
  - ❌ **Gemma-3-270M-IT + SigLIP2**
    → Training **did not converge**, likely due to instability, bugs, or poor alignment under the current hyperparameters.
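
The ~700M figure is simply the sum of the two backbones (300M + 400M) plus a small multimodal projector. Below is a minimal sketch for sanity-checking that budget with 🤗 Transformers. The checkpoint ids and the `vision_model` attribute access are assumptions, not part of this repo: substitute the exact MicroLLaMA and SigLIP2 checkpoints you use, and note that SigLIP2 needs a recent `transformers` release.

```python
# Parameter-budget sanity check (illustrative sketch, not this repo's training code).
# The checkpoint ids below are assumptions -- swap in the exact backbones you use.
from transformers import AutoModel, AutoModelForCausalLM

def millions(module) -> float:
    """Total parameter count of a module, in millions."""
    return sum(p.numel() for p in module.parameters()) / 1e6

language_model = AutoModelForCausalLM.from_pretrained("keeeeenw/MicroLlama")   # assumed MicroLLaMA 300M checkpoint
siglip2 = AutoModel.from_pretrained("google/siglip2-so400m-patch14-384")        # assumed SigLIP2 checkpoint
vision_encoder = siglip2.vision_model  # count only the vision tower, not SigLIP2's text tower

lm_m, vit_m = millions(language_model), millions(vision_encoder)
print(f"language model : {lm_m:.0f}M")   # ~300M expected
print(f"vision encoder : {vit_m:.0f}M")  # ~400M expected
print(f"total          : {lm_m + vit_m:.0f}M (+ projector)")
```

The exact total shifts slightly with the projector size and the SigLIP2 variant chosen, but it stays in the ~700M range quoted above.
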
## 📰 News and Updates