keeeeenw committed on
Commit f4e9991 · verified · 1 Parent(s): df70f3b

Update README.md

Files changed (1): README.md +19 −0
README.md CHANGED
@@ -17,6 +17,25 @@ base_model:
 
  A compact vision language model that you can pretrain and finetune on a single consumer GPU.
 
+ ## 🔍 Performance & Training Highlights
+
+ - 📊 **VQAv2 Accuracy**:
+   Achieves **56.91%** on VQAv2 dev/test, making MicroLLaVA one of the best-performing open-source vision-language models under **700M parameters**.
+
+ - 🧠 **Parameter Budget**:
+   - 🗣️ Language model: **MicroLLaMA (300M)**
+   - 👁️ Vision encoder: **SigLIP2 (400M)**
+   → **~700M total parameters**
+
+ - 🏆 **Best in Class**:
+   According to ChatGPT's Deep Research Agent (Aug 2025):
+   > *"No known open model below ~700M currently surpasses MicroLLaVA's VQAv2 accuracy. Models that do perform better tend to have larger language components."*
+
+ - 🧪 **Ongoing Experiments**:
+   - 🔧 **Qwen3-0.6B + SigLIP2**
+     → Training is **converging**, showing promising loss curves. (Qwen3-0.6B is significantly larger than MicroLLaMA.)
+   - ❌ **Gemma-3-270M-IT + SigLIP2**
+     → Training **did not converge**, likely due to instability, bugs, or poor alignment under the current hyperparameters.
 
  ## 📰 News and Updates
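The parameter budget in the added highlights can be sanity-checked with a quick sum. This is an illustrative sketch using the approximate figures stated in the README, not project code:

```python
# Approximate parameter counts from the README highlights.
language_model_params = 300_000_000  # MicroLLaMA language model (~300M)
vision_encoder_params = 400_000_000  # SigLIP2 vision encoder (~400M)

total_params = language_model_params + vision_encoder_params
print(f"~{total_params / 1e6:.0f}M total parameters")  # ~700M total parameters
```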