base_model:
  - google/siglip-so400m-patch14-384
---

# 🔥 MicroLLaVA-siglip-so400m

> **A compact vision-language model that you can pretrain and finetune on a single consumer GPU, such as an NVIDIA RTX 4090 with 24 GB of VRAM.**

---

## ⚡ TLDR

| 📋 **Item** | 🔧 **Detail** |
|-------------|---------------|
| **Framework** | Transformers + PyTorch |
| **Checkpoint type** | `safetensors` |
| **LLM** | [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) (about 300M parameters) |
| **Vision tower** | [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384) |
| **Hardware used** | Single NVIDIA RTX 4090 |
| **Training stack** | No DeepSpeed required |
| **Intended tasks** | Visual Question Answering, caption-style prompts |

---

## 🚀 Introduction

**MicroLLaVA** is a [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)-based model that pairs a very small language model, [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama), with an efficient SigLIP vision encoder.

🎯 **The goal**: Create a vision-language model that almost anyone can train and iterate on with one consumer GPU.

### 🧠 **Model Components**
- **Language model**: [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with ~300M parameters
- **Vision encoder**: [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384)
- **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory), with additional changes in my [custom fork with training tweaks](https://github.com/keeeeenw/TinyLLaVA_Factory)

### ⏱️ **Training Times**
Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 **without DeepSpeed**.

- **Pretraining** on LAION-CC-SBU-558K: **~5 hours**
- **Supervised finetuning** on all datasets from the TinyLLaVA Factory guide (except `ocr_vqa`): **~12 hours** 🔥

---

## 💻 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

hf_path = 'keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune'
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
# model.cuda()  # enable CUDA if you have a GPU; the model runs fairly quickly on CPU
config = model.config
tokenizer = AutoTokenizer.from_pretrained(
    hf_path,
    use_fast=False,
    model_max_length=config.tokenizer_model_max_length,
    padding_side=config.tokenizer_padding_side
)

prompt = "What are the things I should be cautious about when I visit here?"
image_url = "https://llava-vl.github.io/static/images/view.jpg"
output_text, generation_time = model.chat(
    prompt=prompt,
    image=image_url,
    tokenizer=tokenizer
)

print('model output:', output_text)
print('running time:', generation_time)
```
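If you have a GPU, you can move the model over and reuse the same `chat` call. The snippet below is a minimal sketch built on the quick-start code above (the follow-up prompt is just an illustration); it assumes `model.chat` behaves the same once the model is on CUDA.

```python
import torch

# Optional GPU inference: a minimal sketch, reusing model, tokenizer and image_url
# from the quick-start snippet above.
if torch.cuda.is_available():
    model.cuda()

followup_prompt = "Describe this image in one sentence."  # hypothetical prompt
output_text, generation_time = model.chat(
    prompt=followup_prompt,
    image=image_url,
    tokenizer=tokenizer
)
print('model output:', output_text)
print('running time:', generation_time)
```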
---

## 🖼️ Example Usage

### 📸 **Input Image**
![Llava Input Image Example](https://llava-vl.github.io/static/images/view.jpg "Llava Input Image Example")

### 💬 **Prompt**
*"What are the things I should be cautious about when I visit here?"*

### 🤖 **Model Output**
```
When I visit the beach at the waterfront, I should be cautious about several things. First, I should be cautious about the water, as it is a popular spot for boating and fishing. The water is shallow and shallow, making it difficult for boats to navigate and navigate. Additionally, the water is not a suitable surface for boating, as it is too shallow for boating. Additionally, the water is not suitable for swimming or fishing, as it is too cold and wet. Lastly, I should be cautious about the presence of other boats, such as boats that are parked on the beach, or boats that are not visible from the water. These factors can lead to potential accidents or accidents, as they can cause damage to the boat and the other boats in the water.
```

### 🔧 **Implementation Notes**

For inference, I created a custom model class in [`modeling_tinyllava_llama.py`](https://huggingface.co/keeeeenw/MicroLlava-siglip-so400m/blob/main/modeling_tinyllava_llama.py), which:
- Loads the same chat template that TinyLLaVA uses for TinyLlama
- Connects the LLM to the vision tower
- May require additional dependencies such as PyTorch and the Transformers library
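As a rough mental model, a LLaVA-style wrapper encodes the image with the vision tower, projects those features into the LLM's embedding space, and prepends them to the text embeddings before decoding. The sketch below is illustrative only; it is **not** the actual `modeling_tinyllava_llama.py` code, and the dimensions are placeholders.

```python
import torch
import torch.nn as nn

class ToyVisionLanguageConnector(nn.Module):
    """Illustrative sketch only; the real wiring lives in modeling_tinyllava_llama.py."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 1024):
        super().__init__()
        # A simple linear projector from vision-feature space into the LLM embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the vision tower
        # text_embeddings: (batch, seq_len, llm_dim) from the LLM's token embedding layer
        projected = self.projector(image_features)
        # Prepend the projected image tokens to the text tokens before the decoder runs.
        return torch.cat([projected, text_embeddings], dim=1)

# Placeholder shapes for illustration only.
connector = ToyVisionLanguageConnector()
fused = connector(torch.randn(1, 729, 1152), torch.randn(1, 16, 1024))
print(fused.shape)  # torch.Size([1, 745, 1024])
```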
---

## 📊 Evaluation

### 🏆 VQAv2 Results

| **Split** | **Yes/No** | **Number** | **Other** | **Overall** |
|-----------|------------|------------|-----------|-------------|
| test-dev | **65.08** | **28.97** | **29.32** | **🎯 44.01** |

#### 📈 **Evaluation Details**
- **Dataset**: VQAv2 (Visual Question Answering v2.0)
- **Challenge**: [VQA Challenge 2017](https://eval.ai/web/challenges/challenge-page/830/)
- **Split**: test-dev
- **Overall Accuracy**: **44.01%**

#### 🎯 **Performance Breakdown**
- **Yes/No Questions**: 65.08% - Performance on binary questions
- **Number Questions**: 28.97% - Performance on counting/numerical questions
- **Other Questions**: 29.32% - Performance on open-ended questions
- **Overall**: 44.01% - Weighted average across all question types
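For reference, these numbers use the standard VQA accuracy metric, which scores a predicted answer against the ten human-annotated answers for each question. The sketch below is my own simplified illustration, not the official evaluation script (which also normalizes answers and averages over annotator subsets):

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: full credit if at least 3 of the 10
    annotators gave the predicted answer, partial credit otherwise."""
    pred = prediction.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(matches / 3.0, 1.0)

# Hypothetical example: four annotators answered "2", so the score saturates at 1.0.
print(vqa_accuracy("2", ["2", "2", "two", "2", "3", "2", "4", "3", "two", "3"]))
```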
### 🔜 **Planned Evaluations**

- VQAv2 test set (instead of test-dev)
- Datasets from the [TinyLLaVA evaluation guide](https://tinyllava-factory.readthedocs.io/en/latest/Evaluation.html)

Community contributions with benchmark results are welcome and encouraged! 🤝

---

## 🎯 Intended Uses and Limitations

### ✅ **Intended Uses**
- **🔬 Rapid experimentation** for vision-language research on limited hardware
- **🎓 Educational demonstrations** for students and hobbyists
- **🚀 Starting point** for domain-specific finetuning

### ⚠️ **Limitations**
- The small LLM and compact vision encoder may limit **reasoning depth** and **OCR performance**
- Performance can **vary significantly** depending on the image domain and quality
- The model includes **minimal safety filtering** and refusal behavior; downstream applications should implement their own safeguards

> ⚠️ **Important**: This model should not be used for applications that may cause harm or have significant safety, financial, legal, or medical implications without thorough human review.

---

## 🔬 Reproducibility

For reproducibility, please visit my fork of [TinyLLaVA_Factory](https://github.com/keeeeenw/TinyLLaVA_Factory), which follows the exact same pre-training and fine-tuning steps as the original implementation.

### 🔧 **Key Differences**

#### **🎯 Pre-training Modifications**
To support training on a single GPU, I modified several hyperparameters:
- `gradient_accumulation_steps`: **2 → 8**
- `learning_rate`: **1e-3 → 2.5e-4**
- `warmup_ratio`: **0.03 → 0.06**

*The original hyperparameters were too aggressive for pre-training, causing the training loss to increase after some time. With the updated hyperparameters, the pre-training loss remained stable, which is expected for LLaVA's first stage, where we align the LLM output with ViT features.*
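As a rough illustration of how these values map onto Hugging Face-style training arguments (the actual runs go through the TinyLLaVA Factory launch scripts in my fork, so treat this as an approximation; the batch size and output path are placeholders):

```python
from transformers import TrainingArguments

# Approximate single-GPU pre-training settings described above; illustration only.
pretrain_args = TrainingArguments(
    output_dir="checkpoints/pretrain",   # placeholder path
    per_device_train_batch_size=32,      # placeholder; not specified above
    gradient_accumulation_steps=8,       # changed from 2
    learning_rate=2.5e-4,                # changed from 1e-3
    warmup_ratio=0.06,                   # changed from 0.03
    fp16=True,                           # pre-training keeps the original float16 setting
)
```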
#### **🎨 Fine-tuning Changes**
- All major hyperparameters remain **the same** as the original
- Used `bfloat16` precision instead of `float16` for **improved numerical stability**
- The current model version does **not use `ocr_vqa`** due to difficulties downloading all required images for fine-tuning
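In Trainer terms, the fine-tuning precision change amounts to flipping the mixed-precision flags (again an approximation of the underlying configuration, with a placeholder output path):

```python
from transformers import TrainingArguments

# Stage 2 (supervised fine-tuning): same major hyperparameters as the original,
# but bfloat16 instead of float16 for better numerical stability.
finetune_args = TrainingArguments(
    output_dir="checkpoints/finetune",   # placeholder path
    bf16=True,
    fp16=False,
)
```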
### 🛠️ **Training Setup**
- **Hardware**: Single-GPU configuration
- **Precision**: `bfloat16` for fine-tuning, modified from the original `float16`. For pre-training, I used `float16`, the same configuration as the original TinyLLaVA model.
- **Stages**: Two-stage training following the LLaVA methodology
  1. **Pre-training**: Vision-language alignment with stable loss
  2. **Fine-tuning**: Task-specific adaptation

---

## 📝 Citation

```bibtex
@misc{wang2024microllama,
}
```
 
---

## 📄 License

This model is released under the [**Apache License 2.0**](https://www.apache.org/licenses/LICENSE-2.0).

You are **free to use, modify, and distribute** this model and its derivatives, provided that you comply with the terms of the license.

If you use this model in your research or applications, please **credit the original authors** and clearly indicate any modifications you have made.

> **📌 Note**: Ensure that the datasets used for pretraining or finetuning also allow redistribution of derived model weights.

---

## 🙏 Acknowledgements

This work builds upon the efforts of many in the open-source AI community:

- **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** maintainers and contributors for creating the training framework
- **[`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama)**, which I also created. Please help support my work!
- **SigLIP** authors for the efficient vision encoder architecture
- Contributors to **LAION-CC-SBU-558K** and the other datasets used in pretraining and finetuning
- The **Hugging Face ecosystem** for hosting, tools, and community support 🤗

---

### 🌟 **Star this model if you find it useful!** 🌟