Update README.md
README.md
A compact vision language model that you can pretrain and finetune on a single consumer GPU.
## 🔍 Performance & Training Highlights
- 📊 **VQAv2 Accuracy**:
  Achieves **56.91%** on VQAv2 dev/test, making MicroLLaVA one of the best-performing open-source vision language models under **700M parameters**.

- 🧠 **Parameter Budget**:
  - 🗣️ Language Model: **MicroLLaMA (300M)**
  - 👁️ Vision Encoder: **SigLIP2 (400M)**
    → **~700M total parameters** (see the parameter-count sketch after this list)

- 🏆 **Best in Class**:
  According to ChatGPT’s Deep Research Agent (Aug 2025):
  > *“No known open model below ~700M currently surpasses MicroLLaVA’s VQAv2 accuracy. Models that do perform better tend to have larger language components.”*

- 🧪 **Ongoing Experiments**:
  - 🔧 **Qwen3-0.6B + SigLIP2**
    → Training is **converging**, with promising loss curves. (Qwen3-0.6B is significantly larger than MicroLLaMA.)
  - ❌ **Gemma-3-270M-IT + SigLIP2**
    → Training **did not converge**, likely due to instability, bugs, or poor alignment under the current hyperparameters.
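
The ~700M figure is simply the sum of the two backbones (300M + 400M) plus a small multimodal projector. Below is a minimal sketch for sanity-checking that budget with 🤗 Transformers. The checkpoint ids and the `vision_model` attribute access are assumptions, not part of this repo: substitute the exact MicroLLaMA and SigLIP2 checkpoints you use, and note that SigLIP2 needs a recent `transformers` release.

```python
# Parameter-budget sanity check (illustrative sketch, not this repo's training code).
# The checkpoint ids below are assumptions -- swap in the exact backbones you use.
from transformers import AutoModel, AutoModelForCausalLM

def millions(module) -> float:
    """Total parameter count of a module, in millions."""
    return sum(p.numel() for p in module.parameters()) / 1e6

language_model = AutoModelForCausalLM.from_pretrained("keeeeenw/MicroLlama")   # assumed MicroLLaMA 300M checkpoint
siglip2 = AutoModel.from_pretrained("google/siglip2-so400m-patch14-384")        # assumed SigLIP2 checkpoint
vision_encoder = siglip2.vision_model  # count only the vision tower, not SigLIP2's text tower

lm_m, vit_m = millions(language_model), millions(vision_encoder)
print(f"language model : {lm_m:.0f}M")   # ~300M expected
print(f"vision encoder : {vit_m:.0f}M")  # ~400M expected
print(f"total          : {lm_m + vit_m:.0f}M (+ projector)")
```

The exact total shifts slightly with the projector size and the SigLIP2 variant chosen, but it stays in the ~700M range quoted above.
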
## 📰 News and Updates