Update README.md
README.md
CHANGED
@@ -14,42 +14,46 @@ base_model:
|
|
14 |
- google/siglip-so400m-patch14-384
|
15 |
---
|
16 |
|
17 |
-
# MicroLLaVA-siglip-so400m
|
18 |
|
19 |
-
A compact vision language model that you can pretrain and finetune on a single consumer GPU such as NVIDIA RTX 4090 with 24G of VRAM
|
20 |
|
21 |
-
|
|
|
|
|
22 |
|
23 |
-
| Item
|
24 |
-
|
25 |
-
| Framework
|
26 |
-
| Checkpoint type | `safetensors` |
|
27 |
-
| LLM
|
28 |
-
| Vision tower
|
29 |
-
| Hardware used
|
30 |
-
| Training stack
|
31 |
-
| Intended tasks
|
32 |
|
33 |
---
|
34 |
|
35 |
-
## Introduction
|
36 |
|
37 |
-
MicroLLaVA is a [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) based model that pairs a very small language model [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with an efficient SigLIP vision encoder.
|
38 |
-
The goal is to create a vision language model that almost anyone can train and iterate on with one consumer GPU.
|
39 |
|
|
|
|
|
|
|
40 |
- **Language model**: [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with ~300M parameters
|
41 |
- **Vision encoder**: [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384)
|
42 |
- **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) with additional changes in my fork: [Custom fork with training tweaks](https://github.com/keeeeenw/TinyLLaVA_Factory)
|
43 |
|
44 |
-
|
45 |
-
|
46 |
-
Pretraining on **LAION-CC-SBU-558K** took about **5 hours** on a single NVIDIA RTX 4090 without DeepSpeed.
|
47 |
|
48 |
-
|
|
|
49 |
|
50 |
---
|
51 |
|
52 |
-
## Quick
|
53 |
|
54 |
```python
|
55 |
from transformers import AutoTokenizer, AutoModelForCausalLM
|
@@ -58,115 +62,123 @@ hf_path = 'keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune'
|
|
58 |
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
|
59 |
# model.cuda()  # Optional: move the model to the GPU; it also runs fairly quickly on CPU.
|
60 |
config = model.config
|
61 |
-
tokenizer = AutoTokenizer.from_pretrained(
|
62 |
-
|
63 |
-
|
64 |
-
|
65 |
-
|
66 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
67 |
|
68 |
print('model output:', output_text)
|
69 |
print('running time:', genertaion_time)
|
70 |
```
|
71 |
|
72 |
-
|
|
|
|
|
73 |
|
74 |
-
**Input Image:**
|
75 |

|
76 |
|
77 |
-
**Prompt:**
|
|
|
78 |
|
79 |
-
**Model Output:**
|
80 |
```
|
81 |
When I visit the beach at the waterfront, I should be cautious about several things. First, I should be cautious about the water, as it is a popular spot for boating and fishing. The water is shallow and shallow, making it difficult for boats to navigate and navigate. Additionally, the water is not a suitable surface for boating, as it is too shallow for boating. Additionally, the water is not suitable for swimming or fishing, as it is too cold and wet. Lastly, I should be cautious about the presence of other boats, such as boats that are parked on the beach, or boats that are not visible from the water. These factors can lead to potential accidents or accidents, as they can cause damage to the boat and the other boats in the water.
|
82 |
```
|
83 |
|
84 |
-
### Implementation Notes
|
85 |
|
86 |
-
For inference, I created a custom class [modeling_tinyllava_llama.py](https://huggingface.co/keeeeenw/MicroLlava-siglip-so400m/blob/main/modeling_tinyllava_llama.py) which:
|
87 |
- Loads the same chat template as the TinyLlava model for TinyLlama
|
88 |
- Connects the LLM to the vision tower
|
89 |
- May require additional dependencies such as the PyTorch and Transformers libraries
|
90 |
|
91 |
---
|
92 |
|
93 |
-
## Evaluation
|
94 |
-
|
95 |
-
Evaluation results will be added in the coming days.
|
96 |
|
97 |
-
### VQAv2 Results
|
98 |
|
99 |
-
| Split | Yes/No | Number | Other | Overall |
|
100 |
-
|
101 |
-
| test-dev | 65.08 | 28.97 | 29.32 |
|
102 |
|
103 |
-
#### Evaluation Details
|
104 |
- **Dataset**: VQAv2 (Visual Question Answering v2.0)
|
105 |
- **Challenge**: [VQA Challenge 2017](https://eval.ai/web/challenges/challenge-page/830/)
|
106 |
- **Split**: test-dev
|
107 |
-
- **Overall Accuracy**: 44.01
|
108 |
|
109 |
-
#### Performance Breakdown
|
110 |
- **Yes/No Questions**: 65.08% - Performance on binary questions
|
111 |
- **Number Questions**: 28.97% - Performance on counting/numerical questions
|
112 |
- **Other Questions**: 29.32% - Performance on open-ended questions
|
113 |
- **Overall**: 44.01% - Weighted average across all question types
|
114 |
|
115 |
-
|
116 |
-
Planned tests include:
|
117 |
|
118 |
- VQAv2 test set (instead of test-dev)
|
119 |
-
-
|
120 |
|
121 |
-
Community contributions with benchmark results are welcome and encouraged
|
122 |
|
123 |
---
|
124 |
|
125 |
-
## Intended
|
126 |
|
127 |
-
**Intended
|
128 |
-
- Rapid experimentation for vision-language research on limited hardware
|
129 |
-
- Educational demonstrations for students and hobbyists
|
130 |
-
- Starting point for domain-specific finetuning
|
131 |
|
132 |
-
**Limitations**
|
133 |
-
- The small LLM size and compact vision encoder may limit reasoning depth and OCR performance
|
134 |
-
- Performance can vary significantly depending on the image domain and quality
|
135 |
-
- The model includes minimal safety filtering and refusal behavior — downstream applications should implement their own safeguards
|
136 |
|
137 |
-
> ⚠️ This model should not be used for applications that may cause harm or have significant safety, financial, legal, or medical implications without thorough human review.
|
138 |
|
139 |
---
|
140 |
|
141 |
-
## Reproducibility
|
142 |
|
143 |
For reproducibility, please visit my fork of [TinyLLaVA_Factory](https://github.com/keeeeenw/TinyLLaVA_Factory), which follows the same pre-training and fine-tuning steps as the original implementation except for the differences listed below.
|
144 |
|
145 |
-
### Key Differences
|
146 |
|
147 |
-
|
148 |
To support training on a single GPU, I modified several hyperparameters:
|
149 |
-
- `gradient_accumulation_steps`: 2 → 8
|
150 |
-
- `learning_rate`: 1e-3 → 2.5e-4
|
151 |
-
- `warmup_ratio`: 0.03 → 0.06
|
152 |
|
153 |
-
The original hyperparameters were too aggressive for pre-training, causing training loss to increase after some time. With the updated hyperparameters, pre-training loss remained stable, which is expected for LLaVA's first stage where we align the LLM output with ViT features
|
154 |
|
155 |
-
|
156 |
-
- All major hyperparameters remain the same as the original
|
157 |
-
- Used `bfloat16` precision instead of `float16` for improved numerical stability
|
158 |
-
- The current model version does not use `ocr_vqa
|
159 |
|
160 |
-
### Training Setup
|
161 |
- **Hardware**: Single GPU configuration
|
162 |
-
- **Precision**: bfloat16 (fine-tuning), modified from original float16
|
163 |
- **Stages**: Two-stage training following LLaVA methodology
|
164 |
-
1. Pre-training
|
165 |
-
2. Fine-tuning
|
166 |
|
167 |
---
|
168 |
|
169 |
-
## Citation
|
170 |
|
171 |
```bibtex
|
172 |
@misc{wang2024microllama,
|
@@ -177,24 +189,30 @@ The original hyperparameters were too aggressive for pre-training, causing train
|
|
177 |
}
|
178 |
```
|
179 |
|
180 |
-
|
|
|
|
|
181 |
|
182 |
-
This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
|
183 |
|
184 |
-
You are free to use, modify, and distribute this model and its derivatives, provided that you comply with the terms of the license.
|
185 |
-
If you use this model in your research or applications, please credit the original authors and clearly indicate any modifications you have made.
|
186 |
|
187 |
-
|
|
|
|
|
188 |
|
189 |
---
|
190 |
|
191 |
-
## Acknowledgements
|
192 |
|
193 |
This work builds upon the efforts of many in the open-source AI community:
|
194 |
|
195 |
- **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** maintainers and contributors for creating the training framework
|
196 |
-
- **[`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama)** I am also the creator of MicroLlama. Please help support my work!
|
197 |
- **SigLIP** authors for the efficient vision encoder architecture
|
198 |
- Contributors to **LAION-CC-SBU-558K** and other datasets used in pretraining and finetuning
|
199 |
-
- The Hugging Face ecosystem for hosting, tools, and community support
|
|
|
|
|
200 |
|
|
|
|
14 |
- google/siglip-so400m-patch14-384
|
15 |
---
|
16 |
|
17 |
+
# 🔥 MicroLLaVA-siglip-so400m
|
18 |
|
19 |
+
> **A compact vision language model that you can pretrain and finetune on a single consumer GPU such as an NVIDIA RTX 4090 with 24 GB of VRAM.**
|
20 |
|
21 |
+
---
|
22 |
+
|
23 |
+
## ⚡ TLDR
|
24 |
|
25 |
+
| 📋 **Item** | 🔧 **Detail** |
|
26 |
+
|-------------|---------------|
|
27 |
+
| **Framework** | Transformers + PyTorch |
|
28 |
+
| **Checkpoint type** | `safetensors` |
|
29 |
+
| **LLM** | [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) (about 300M parameters) |
|
30 |
+
| **Vision tower** | [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384) |
|
31 |
+
| **Hardware used** | Single NVIDIA RTX 4090 |
|
32 |
+
| **Training stack** | No DeepSpeed required |
|
33 |
+
| **Intended tasks** | Visual Question Answering, caption-style prompts |
|
34 |
|
35 |
---
|
36 |
|
37 |
+
## 🚀 Introduction
|
38 |
|
39 |
+
**MicroLLaVA** is a [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) based model that pairs a very small language model [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with an efficient SigLIP vision encoder.
|
|
|
40 |
|
41 |
+
🎯 **The goal**: Create a vision language model that almost anyone can train and iterate on with one consumer GPU.
|
42 |
+
|
43 |
+
### 🧠 **Model Components**
|
44 |
- **Language model**: [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with ~300M parameters
|
45 |
- **Vision encoder**: [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384)
|
46 |
- **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) with additional changes in my fork: [Custom fork with training tweaks](https://github.com/keeeeenw/TinyLLaVA_Factory)
|
47 |
|
48 |
+
### ⏱️ **Training Times**
|
49 |
+
Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 **without DeepSpeed**.
|
|
|
50 |
|
51 |
+
- **Pretraining** on LAION-CC-SBU-558K: **~5 hours** ⚡
|
52 |
+
- **Supervised finetuning** on all TinyLLaVA Factory datasets (except `ocr_vqa`): **~12 hours** 🔥
|
53 |
|
54 |
---
|
55 |
|
56 |
+
## 💻 Quick Start
|
57 |
|
58 |
```python
|
59 |
from transformers import AutoTokenizer, AutoModelForCausalLM
|
|
|
62 |
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
|
63 |
# model.cuda()  # Optional: move the model to the GPU; it also runs fairly quickly on CPU.
|
64 |
config = model.config
|
65 |
+
tokenizer = AutoTokenizer.from_pretrained(
|
66 |
+
hf_path,
|
67 |
+
use_fast=False,
|
68 |
+
model_max_length=config.tokenizer_model_max_length,
|
69 |
+
padding_side=config.tokenizer_padding_side
|
70 |
+
)
|
71 |
+
|
72 |
+
prompt = "What are the things I should be cautious about when I visit here?"
|
73 |
+
image_url = "https://llava-vl.github.io/static/images/view.jpg"
|
74 |
+
output_text, genertaion_time = model.chat(
|
75 |
+
prompt=prompt,
|
76 |
+
image=image_url,
|
77 |
+
tokenizer=tokenizer
|
78 |
+
)
|
79 |
|
80 |
print('model output:', output_text)
|
81 |
print('running time:', genertaion_time)
|
82 |
```
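
To ask several questions about the same image, the call above can be wrapped in a small helper. This is only a sketch: `ask_many` is a hypothetical name that is not part of the model's code, and it assumes `model.chat` keeps the keyword arguments and the `(text, seconds)` return pair shown in the Quick Start.

```python
def ask_many(model, tokenizer, image, prompts):
    """Run each prompt against one image and collect (prompt, answer, seconds)."""
    results = []
    for prompt in prompts:
        # Same call shape as the Quick Start: chat() returns (output_text, generation_time).
        output_text, seconds = model.chat(prompt=prompt, image=image, tokenizer=tokenizer)
        results.append((prompt, output_text, seconds))
    return results
```

With `model`, `tokenizer`, and `image_url` defined as in the Quick Start, `ask_many(model, tokenizer, image_url, ["What is in this image?", "What time of day is it?"])` would return one tuple per prompt.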
|
83 |
|
84 |
+
---
|
85 |
+
|
86 |
+
## 🖼️ Example Usage
|
87 |
|
88 |
+
### 📸 **Input Image:**
|
89 |

|
90 |
|
91 |
+
### 💬 **Prompt:**
|
92 |
+
*"What are the things I should be cautious about when I visit here?"*
|
93 |
|
94 |
+
### 🤖 **Model Output:**
|
95 |
```
|
96 |
When I visit the beach at the waterfront, I should be cautious about several things. First, I should be cautious about the water, as it is a popular spot for boating and fishing. The water is shallow and shallow, making it difficult for boats to navigate and navigate. Additionally, the water is not a suitable surface for boating, as it is too shallow for boating. Additionally, the water is not suitable for swimming or fishing, as it is too cold and wet. Lastly, I should be cautious about the presence of other boats, such as boats that are parked on the beach, or boats that are not visible from the water. These factors can lead to potential accidents or accidents, as they can cause damage to the boat and the other boats in the water.
|
97 |
```
|
98 |
|
99 |
+
### 🔧 **Implementation Notes**
|
100 |
|
101 |
+
For inference, I created a custom model class in [`modeling_tinyllava_llama.py`](https://huggingface.co/keeeeenw/MicroLlava-siglip-so400m/blob/main/modeling_tinyllava_llama.py) which:
|
102 |
- Loads the same chat template as the TinyLlava model for TinyLlama
|
103 |
- Connects the LLM to the vision tower
|
104 |
- May require additional dependencies such as the PyTorch and Transformers libraries
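
Before loading the model, it can help to confirm that these dependencies are importable. The check below is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and the list of package names is an assumption based on the two libraries named above.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    # find_spec() looks a top-level package up without importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["torch", "transformers"])
print("missing packages:", missing or "none")
```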
|
105 |
|
106 |
---
|
107 |
|
108 |
+
## 📊 Evaluation
|
|
|
|
|
109 |
|
110 |
+
### 🏆 VQAv2 Results
|
111 |
|
112 |
+
| **Split** | **Yes/No** | **Number** | **Other** | **Overall** |
|
113 |
+
|-----------|------------|------------|-----------|-------------|
|
114 |
+
| test-dev | **65.08** | **28.97** | **29.32** | **🎯 44.01** |
|
115 |
|
116 |
+
#### 📈 **Evaluation Details**
|
117 |
- **Dataset**: VQAv2 (Visual Question Answering v2.0)
|
118 |
- **Challenge**: [VQA Challenge 2017](https://eval.ai/web/challenges/challenge-page/830/)
|
119 |
- **Split**: test-dev
|
120 |
+
- **Overall Accuracy**: **44.01%**
|
121 |
|
122 |
+
#### 🎯 **Performance Breakdown**
|
123 |
- **Yes/No Questions**: 65.08% - Performance on binary questions
|
124 |
- **Number Questions**: 28.97% - Performance on counting/numerical questions
|
125 |
- **Other Questions**: 29.32% - Performance on open-ended questions
|
126 |
- **Overall**: 44.01% - Weighted average across all question types
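
The overall score is weighted by how many questions fall into each type, so it is not the plain mean of the three numbers (about 41.12). The sketch below shows the calculation. The per-type question counts are not given in this README, so `weighted_average` (a hypothetical helper) takes them as an input rather than assuming values.

```python
def weighted_average(scores, counts):
    """Average the per-type scores, weighting each by its number of questions."""
    total = sum(counts.values())
    return sum(scores[name] * counts[name] for name in scores) / total


# Per-type accuracies reported above for VQAv2 test-dev.
scores = {"yes/no": 65.08, "number": 28.97, "other": 29.32}

# The unweighted mean ignores how many questions each type contributes.
unweighted = sum(scores.values()) / len(scores)
print(f"unweighted mean: {unweighted:.2f}")
```

Calling `weighted_average(scores, counts)` with the real question counts for the split would give the weighted figure.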
|
127 |
|
128 |
+
### 🔜 **Planned Evaluations**
|
|
|
129 |
|
130 |
- VQAv2 test set (instead of test-dev)
|
131 |
+
- Datasets from [TinyLlava evaluation](https://tinyllava-factory.readthedocs.io/en/latest/Evaluation.html)
|
132 |
|
133 |
+
Community contributions with benchmark results are welcome and encouraged! 🤝
|
134 |
|
135 |
---
|
136 |
|
137 |
+
## 🎯 Intended Uses and Limitations
|
138 |
|
139 |
+
### ✅ **Intended Uses**
|
140 |
+
- **🔬 Rapid experimentation** for vision-language research on limited hardware
|
141 |
+
- **🎓 Educational demonstrations** for students and hobbyists
|
142 |
+
- **🚀 Starting point** for domain-specific finetuning
|
143 |
|
144 |
+
### ⚠️ **Limitations**
|
145 |
+
- The small LLM size and compact vision encoder may limit **reasoning depth** and **OCR performance**
|
146 |
+
- Performance can **vary significantly** depending on the image domain and quality
|
147 |
+
- The model includes **minimal safety filtering** and refusal behavior — downstream applications should implement their own safeguards
|
148 |
|
149 |
+
> ⚠️ **Important**: This model should not be used for applications that may cause harm or have significant safety, financial, legal, or medical implications without thorough human review.
|
150 |
|
151 |
---
|
152 |
|
153 |
+
## 🔬 Reproducibility
|
154 |
|
155 |
For reproducibility, please visit my fork of [TinyLLaVA_Factory](https://github.com/keeeeenw/TinyLLaVA_Factory), which follows the same pre-training and fine-tuning steps as the original implementation except for the differences listed below.
|
156 |
|
157 |
+
### 🔧 **Key Differences**
|
158 |
|
159 |
+
#### **🎯 Pre-training Modifications:**
|
160 |
To support training on a single GPU, I modified several hyperparameters:
|
161 |
+
- `gradient_accumulation_steps`: **2 → 8**
|
162 |
+
- `learning_rate`: **1e-3 → 2.5e-4**
|
163 |
+
- `warmup_ratio`: **0.03 → 0.06**
|
164 |
|
165 |
+
*The original hyperparameters were too aggressive for pre-training, causing training loss to increase after some time. With the updated hyperparameters, pre-training loss remained stable, which is expected for LLaVA's first stage where we align the LLM output with ViT features.*
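
To see how these settings interact, the sketch below estimates the effective batch size and the number of warmup steps for one pass over the pre-training data. It is an approximation under stated assumptions: the per-device batch size of 4 is a placeholder (it is not given in this README), the sample count of 558,000 is inferred from the dataset name, and `schedule_summary` is a hypothetical helper rather than part of TinyLLaVA Factory.

```python
import math


def schedule_summary(num_samples, per_device_batch, grad_accum, warmup_ratio,
                     num_gpus=1, epochs=1):
    """Estimate the effective batch size, optimizer steps, and warmup steps."""
    # Gradients are accumulated over `grad_accum` micro-batches before each update.
    effective_batch = per_device_batch * grad_accum * num_gpus
    total_steps = math.ceil(num_samples / effective_batch) * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Placeholder per-device batch size; substitute the value from your training script.
print(schedule_summary(num_samples=558_000, per_device_batch=4,
                       grad_accum=8, warmup_ratio=0.06))
```

Raising `gradient_accumulation_steps` increases the effective batch size without increasing peak memory per step, which is why it suits a single 24 GB GPU.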
|
166 |
|
167 |
+
#### **🎨 Fine-tuning Changes:**
|
168 |
+
- All major hyperparameters remain **the same** as the original
|
169 |
+
- Used `bfloat16` precision instead of `float16` for **improved numerical stability**
|
170 |
+
- The current model version does **not use `ocr_vqa`** due to difficulties downloading all required images for fine-tuning
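
One common reason `bfloat16` is more stable than `float16` is its much wider dynamic range. The sketch below computes the largest finite value of each format from its bit layout (`float16` has 5 exponent bits and 10 mantissa bits, while `bfloat16` has 8 and 7). It only illustrates the range difference and does not claim to show the specific issue encountered during this fine-tuning run. `max_finite` is a hypothetical helper.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary float with these field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The highest exponent code is reserved for inf/NaN, so use the one below it.
    max_exponent = (2 ** exponent_bits - 2) - bias
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent


print("float16 max: ", max_finite(5, 10))
print("bfloat16 max:", max_finite(8, 7))
```

Values above the `float16` maximum overflow to infinity, whereas `bfloat16` trades mantissa precision for a range close to that of `float32`.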
|
171 |
|
172 |
+
### 🛠️ **Training Setup**
|
173 |
- **Hardware**: Single GPU configuration
|
174 |
+
- **Precision**: `bfloat16` for fine-tuning (changed from the original `float16`). Pre-training used `float16`, the same configuration as the original TinyLlava model.
|
175 |
- **Stages**: Two-stage training following LLaVA methodology
|
176 |
+
1. **Pre-training**: Vision-language alignment with stable loss
|
177 |
+
2. **Fine-tuning**: Task-specific adaptation
|
178 |
|
179 |
---
|
180 |
|
181 |
+
## 📝 Citation
|
182 |
|
183 |
```bibtex
|
184 |
@misc{wang2024microllama,
|
|
|
189 |
}
|
190 |
```
|
191 |
|
192 |
+
---
|
193 |
+
|
194 |
+
## 📄 License
|
195 |
|
196 |
+
This model is released under the [**Apache License 2.0**](https://www.apache.org/licenses/LICENSE-2.0).
|
197 |
|
198 |
+
You are **free to use, modify, and distribute** this model and its derivatives, provided that you comply with the terms of the license.
|
|
|
199 |
|
200 |
+
If you use this model in your research or applications, please **credit the original authors** and clearly indicate any modifications you have made.
|
201 |
+
|
202 |
+
> **📌 Note**: Ensure that the datasets used for pretraining or finetuning also allow redistribution of derived model weights.
|
203 |
|
204 |
---
|
205 |
|
206 |
+
## 🙏 Acknowledgements
|
207 |
|
208 |
This work builds upon the efforts of many in the open-source AI community:
|
209 |
|
210 |
- **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** maintainers and contributors for creating the training framework
|
211 |
+
- **[`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama)**, the language model used here. I am also the creator of MicroLlama. Please help support my work! ⭐
|
212 |
- **SigLIP** authors for the efficient vision encoder architecture
|
213 |
- Contributors to **LAION-CC-SBU-558K** and other datasets used in pretraining and finetuning
|
214 |
+
- The **Hugging Face ecosystem** for hosting, tools, and community support 🤗
|
215 |
+
|
216 |
+
---
|
217 |
|
218 |
+
### 🌟 **Star this model if you find it useful!** 🌟
|