keeeeenw committed on
Commit b2e8b6a · verified · 1 Parent(s): 7458065

Update README.md

Files changed (1)
  1. README.md +74 -49
README.md CHANGED
@@ -14,9 +14,9 @@ base_model:
  - google/siglip-so400m-patch14-384
  ---

- # MicroLLaVA (TinyLLaVA Factory based)

- A compact vision language model that you can pretrain and finetune on a single consumer GPU.

  ## TLDR
 
@@ -41,24 +41,6 @@ The goal is to create a vision language model that almost anyone can train and i
  - **Vision encoder**: [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384)
  - **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) with additional changes in my fork: [Custom fork with training tweaks](https://github.com/keeeeenw/TinyLLaVA_Factory)

- ---
-
- ## Files included
-
- | File | Purpose |
- |----------------------------|---------|
- | `config.json` | Model configuration for Transformers |
- | `generation_config.json` | Generation defaults |
- | `model.safetensors` | Weights |
- | `tokenizer.model` | SentencePiece model |
- | `tokenizer_config.json` | Tokenizer configuration |
- | `special_tokens_map.json` | Special token mapping |
- | `trainer_state.json` | Trainer state |
- | `training_args.bin` | Training arguments |
- | `log.txt` | Training log |
-
- If your workflow uses a custom processor, also include `preprocessor_config.json` or `processor_config.json` so `AutoProcessor.from_pretrained` works.
-
  Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed.

  Pretraining on **LAION-CC-SBU-558K** took about **5 hours** on a single NVIDIA RTX 4090 without DeepSpeed.
@@ -70,37 +52,64 @@ Supervised finetuning on all datasets from the TinyLLaVA Factory guide (except `
  ## Quick start

  ```python
- from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
- import torch
-
- repo_id = "keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune"
-
- tokenizer = AutoTokenizer.from_pretrained(repo_id)
-
- # If processor config is available
- try:
-     processor = AutoProcessor.from_pretrained(repo_id)
- except Exception:
-     processor = None  # Optional if images are preprocessed manually
-
- model = AutoModelForCausalLM.from_pretrained(
-     repo_id,
-     torch_dtype=torch.float16,
-     device_map="auto",
-     trust_remote_code=True  # Set to True if repo includes custom code
- )
-
- inputs = tokenizer("Describe the image in one sentence.", return_tensors="pt").to(model.device)
- output_ids = model.generate(**inputs, max_new_tokens=64)
- print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
- ```

  ## Evaluation

- Evaluation results will be added in the coming days. Planned tests include:
-
- - VQAv2-style prompts for question answering
- - and more

  Community contributions with benchmark results are welcome and encouraged.
 
@@ -122,15 +131,31 @@ Community contributions with benchmark results are welcome and encouraged.
  ---

- ## Reproducibility checklist
-
- To reproduce results and training runs:
-
- 1. Fix all random seeds in training scripts
- 2. Record exact dataset versions and any filtering applied
- 3. Log optimizer type, learning rate schedule, precision settings, and gradient accumulation steps
- 4. Save the exact TinyLLaVA Factory commit or fork commit used for both pretraining and finetuning
- 5. Document hardware and software versions (CUDA, PyTorch, etc.)

  ---
 
 
  - google/siglip-so400m-patch14-384
  ---

+ # MicroLLaVA

+ A compact vision language model that you can pretrain and finetune on a single consumer GPU, such as an NVIDIA RTX 4090 with 24 GB of VRAM.

  ## TLDR
 
 
  - **Vision encoder**: [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384)
  - **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) with additional changes in my fork: [Custom fork with training tweaks](https://github.com/keeeeenw/TinyLLaVA_Factory)

  Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed.

  Pretraining on **LAION-CC-SBU-558K** took about **5 hours** on a single NVIDIA RTX 4090 without DeepSpeed.
 
  ## Quick start

  ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ hf_path = 'keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune'
+ model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
+ # model.cuda()  # enable CUDA if needed; the model runs fairly quickly on CPU.
+ config = model.config
+ tokenizer = AutoTokenizer.from_pretrained(
+     hf_path,
+     use_fast=False,
+     model_max_length=config.tokenizer_model_max_length,
+     padding_side=config.tokenizer_padding_side)
+ prompt = "What are the things I should be cautious about when I visit here?"
+ image_url = "https://llava-vl.github.io/static/images/view.jpg"
+ output_text, generation_time = model.chat(prompt=prompt,
+                                           image=image_url,
+                                           tokenizer=tokenizer)
+
+ print('model output:', output_text)
+ print('running time:', generation_time)
+ ```

+ Example image from LLaVA:

+ ![Llava Input Image Example](https://llava-vl.github.io/static/images/view.jpg "Llava Input Image Example")

+ Example output:

+ model output: When I visit the beach at the waterfront, I should be cautious about several things. First, I should be cautious about the water, as it is a popular spot for boating and fishing. The water is shallow and shallow, making it difficult for boats to navigate and navigate. Additionally, the water is not a suitable surface for boating, as it is too shallow for boating. Additionally, the water is not suitable for swimming or fishing, as it is too cold and wet. Lastly, I should be cautious about the presence of other boats, such as boats that are parked on the beach, or boats that are not visible from the water. These factors can lead to potential accidents or accidents, as they can cause damage to the boat and the other boats in the water.
+
+ Note: for inference, I created the custom class in `modeling_tinyllava_llama.py`, which loads the same chat template as the TinyLLaVA model for TinyLlama and connects the LLM to the vision tower.
+ This class may require additional dependencies such as PyTorch and the Transformers library.
+
+ ---
 
  ## Evaluation

+ More evaluation results will be added in the coming days.

+ ### VQAv2 Results
+
+ | Split | Yes/No | Number | Other | Overall |
+ |----------|--------|--------|-------|---------|
+ | test-dev | 65.08 | 28.97 | 29.32 | **44.01** |
+
+ #### Evaluation Details
+ - **Dataset**: VQAv2 (Visual Question Answering v2.0)
+ - **Challenge**: [VQA Challenge 2017](https://eval.ai/web/challenges/challenge-page/830/)
+ - **Split**: test-dev
+ - **Overall Accuracy**: 44.01%
+
+ #### Performance Breakdown
+ - **Yes/No Questions**: 65.08% accuracy on binary questions
+ - **Number Questions**: 28.97% accuracy on counting/numerical questions
+ - **Other Questions**: 29.32% accuracy on open-ended questions
+ - **Overall**: 44.01% accuracy, a weighted average across all question types (scoring rule sketched below)
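+
+ For reference, a minimal sketch of the per-question scoring rule behind these numbers. This is a simplified form of the official VQAv2 metric, which additionally normalizes answers and averages over annotator subsets:
+
+ ```python
+ # Simplified VQAv2 accuracy: a prediction earns credit for each annotator who
+ # gave the same answer, reaching full credit once 3 of the 10 annotators agree.
+ def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
+     matches = sum(1 for answer in human_answers if answer == predicted)
+     return min(matches / 3.0, 1.0)
+
+ # Example: 4 of 10 annotators answered "2", so predicting "2" scores 1.0.
+ print(vqa_accuracy("2", ["2", "2", "two", "2", "3", "2", "3", "two", "3", "3"]))
+ ```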
+
+ Planned tests include:
+
+ - VQAv2 test set (instead of test-dev)
+ - additional datasets from the [TinyLLaVA evaluation guide](https://tinyllava-factory.readthedocs.io/en/latest/Evaluation.html)

  Community contributions with benchmark results are welcome and encouraged.
 
 
  ---

+ ## Reproducibility
+
+ For reproducibility, please visit my fork of [TinyLLaVA_Factory](https://github.com/keeeeenw/TinyLLaVA_Factory), which follows the exact same pre-training and fine-tuning steps as the original implementation.
+
+ ### Key Differences
+
+ **Pre-training Modifications:**
+ To support training on a single GPU, I modified several hyperparameters (see the sketch below):
+ - `gradient_accumulation_steps`: 2 → 8
+ - `learning_rate`: 1e-3 → 2.5e-4
+ - `warmup_ratio`: 0.03 → 0.06
+
+ The original hyperparameters were too aggressive for pre-training, causing the training loss to increase after some time. With the updated hyperparameters, the pre-training loss remained stable, which is expected for LLaVA's first stage, where the LLM output is aligned with the ViT features.
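+
+ As a rough illustration only, these overrides map onto standard HuggingFace `TrainingArguments` fields; the actual launch scripts live in the fork, and everything here other than the three changed values is a placeholder:
+
+ ```python
+ from transformers import TrainingArguments
+
+ # Sketch of the single-GPU pre-training overrides described above.
+ # Only the three changed values come from this model card; output_dir and
+ # the precision flag are illustrative assumptions.
+ pretrain_args = TrainingArguments(
+     output_dir="checkpoints/pretrain",   # placeholder path
+     gradient_accumulation_steps=8,       # was 2
+     learning_rate=2.5e-4,                # was 1e-3
+     warmup_ratio=0.06,                   # was 0.03
+     fp16=True,                           # pre-training kept float16 (see Training Setup below)
+ )
+ ```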

+ **Fine-tuning Changes:**
+ - All major hyperparameters remain the same as in the original
+ - Used `bfloat16` precision instead of `float16` for improved numerical stability (see the note below)
+ - The current model version does not use `ocr_vqa` because of difficulties downloading all of the required images for fine-tuning
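+
+ For context on the `bfloat16` choice: it keeps float32's exponent range at the cost of mantissa precision, so large loss and activation values are much less likely to overflow than with `float16`. A quick PyTorch check:
+
+ ```python
+ import torch
+
+ # float16 overflows beyond ~65k, while bfloat16 shares float32's exponent range.
+ print(torch.finfo(torch.float16).max)    # 65504.0
+ print(torch.finfo(torch.bfloat16).max)   # ~3.39e+38, same order as float32
+ ```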

+ ### Training Setup
+ - **Hardware**: Single GPU configuration
+ - **Precision**: `bfloat16` for fine-tuning, modified from the original `float16`. For pre-training, I used `float16`, the same configuration as the original TinyLLaVA model.
+ - **Stages**: Two-stage training following the LLaVA methodology
+   1. Pre-training: vision-language alignment with stable loss
+   2. Fine-tuning: task-specific adaptation

  ---