Update README.md
README.md
CHANGED

@@ -14,9 +14,9 @@ base_model:
 - google/siglip-so400m-patch14-384
 ---
 
-# MicroLLaVA
 
-A compact vision language model that you can pretrain and finetune on a single consumer GPU.
 
 ## TLDR

@@ -41,24 +41,6 @@ The goal is to create a vision language model that almost anyone can train and i
 - **Vision encoder**: [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384)
 - **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) with additional changes in my fork: [Custom fork with training tweaks](https://github.com/keeeeenw/TinyLLaVA_Factory)
 
----
-
-## Files included
-
-| File | Purpose |
-|----------------------------|---------|
-| `config.json` | Model configuration for Transformers |
-| `generation_config.json` | Generation defaults |
-| `model.safetensors` | Weights |
-| `tokenizer.model` | SentencePiece model |
-| `tokenizer_config.json` | Tokenizer configuration |
-| `special_tokens_map.json` | Special token mapping |
-| `trainer_state.json` | Trainer state |
-| `training_args.bin` | Training arguments |
-| `log.txt` | Training log |
-
-If your workflow uses a custom processor, also include `preprocessor_config.json` or `processor_config.json` so `AutoProcessor.from_pretrained` works.
-
 Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed.
 
 Pretraining on **LAION-CC-SBU-558K** took about **5 hours** on a single NVIDIA RTX 4090 without DeepSpeed.

@@ -70,37 +52,64 @@ Supervised finetuning on all datasets from the TinyLLaVA Factory guide (except `
 ## Quick start
 
 ```python
-from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
-
-try:
-    processor = AutoProcessor.from_pretrained(repo_id)
-except Exception:
-    processor = None  # Optional if images are preprocessed manually
 
-model = AutoModelForCausalLM.from_pretrained(
-    repo_id,
-    torch_dtype=torch.float16,
-    device_map="auto",
-    trust_remote_code=True  # Set to True if repo includes custom code
-)
 ```
 
 ## Evaluation
 
-Evaluation results will be added in the coming days.
 
 Community contributions with benchmark results are welcome and encouraged.

@@ -122,15 +131,31 @@ Community contributions with benchmark results are welcome and encouraged.
 
 ---
 
-## Reproducibility
-
 
 ---

# MicroLLaVA

A compact vision language model that you can pretrain and finetune on a single consumer GPU, such as an NVIDIA RTX 4090 with 24 GB of VRAM.

## TLDR

- **Vision encoder**: [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384)
- **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) with additional changes in my fork: [Custom fork with training tweaks](https://github.com/keeeeenw/TinyLLaVA_Factory)
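
If you want to inspect the vision encoder on its own, it can be loaded directly from the Hub. This is a minimal sketch, assuming a recent `transformers` release with SigLIP support:

```python
# Load the SigLIP vision tower listed above, independently of MicroLLaVA,
# to inspect its configuration. Downloads the weights from the Hugging Face Hub.
from transformers import SiglipImageProcessor, SiglipVisionModel

vision_tower = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
image_processor = SiglipImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")

print(vision_tower.config.hidden_size)  # width of the features fed to the multimodal projector
print(image_processor.size)             # expected input resolution (384 x 384)
```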

Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed.

Pretraining on **LAION-CC-SBU-558K** took about **5 hours** on a single NVIDIA RTX 4090 without DeepSpeed.
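
Before kicking off training, a quick check that your GPU actually exposes roughly 24 GB of VRAM can save a failed run; this is a small sketch using only PyTorch:

```python
# Report the detected GPU and its total memory so you can confirm the setup
# matches the single RTX 4090 (24 GB) configuration described above.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA device detected; training this model on CPU is not practical.")
```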

## Quick start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

hf_path = 'keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune'
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
# model.cuda()  # enable CUDA if you have a GPU; the model runs fairly quickly on CPU
config = model.config
tokenizer = AutoTokenizer.from_pretrained(
    hf_path,
    use_fast=False,
    model_max_length=config.tokenizer_model_max_length,
    padding_side=config.tokenizer_padding_side)

prompt = "What are the things I should be cautious about when I visit here?"
image_url = "https://llava-vl.github.io/static/images/view.jpg"
output_text, generation_time = model.chat(prompt=prompt,
                                          image=image_url,
                                          tokenizer=tokenizer)

print('model output:', output_text)
print('running time:', generation_time)
```
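
The same `model.chat` call can be reused to ask several questions about one image; a short sketch, assuming the `model`, `tokenizer`, and `image_url` defined above:

```python
# Ask multiple questions about the same image by looping over model.chat.
questions = [
    "What are the things I should be cautious about when I visit here?",
    "Describe the weather in this photo.",
]
for question in questions:
    answer, elapsed = model.chat(prompt=question, image=image_url, tokenizer=tokenizer)
    print(f"Q: {question}\nA: {answer}\n(time: {elapsed})\n")
```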

Example image from LLaVA:


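
If you want to look at the example image locally before running the model, it can be fetched with a couple of standard libraries; a small sketch, assuming `requests` and `Pillow` are installed:

```python
# Download the example image used in the quick start and report its size.
from io import BytesIO

import requests
from PIL import Image

response = requests.get("https://llava-vl.github.io/static/images/view.jpg", timeout=30)
response.raise_for_status()
image = Image.open(BytesIO(response.content))
print(image.size)  # (width, height) of the example photo
```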

Example output:

> model output: When I visit the beach at the waterfront, I should be cautious about several things. First, I should be cautious about the water, as it is a popular spot for boating and fishing. The water is shallow and shallow, making it difficult for boats to navigate and navigate. Additionally, the water is not a suitable surface for boating, as it is too shallow for boating. Additionally, the water is not suitable for swimming or fishing, as it is too cold and wet. Lastly, I should be cautious about the presence of other boats, such as boats that are parked on the beach, or boats that are not visible from the water. These factors can lead to potential accidents or accidents, as they can cause damage to the boat and the other boats in the water.

Note: for inference, I created the custom class in `modeling_tinyllava_llama.py`, which loads the same chat template as the TinyLLaVA model for TinyLlama and connects the LLM to the vision tower.
This class may require additional dependencies such as PyTorch and the Transformers library.
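
A quick way to confirm those dependencies are available in your environment; nothing here is specific to MicroLLaVA:

```python
# Print the installed versions of the libraries the inference code depends on.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
```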

---

## Evaluation

More evaluation results will be added in the coming days.

### VQAv2 Results

| Split | Yes/No | Number | Other | Overall |
|----------|--------|--------|-------|---------|
| test-dev | 65.08 | 28.97 | 29.32 | **44.01** |

#### Evaluation Details
- **Dataset**: VQAv2 (Visual Question Answering v2.0)
- **Challenge**: [VQA Challenge 2017](https://eval.ai/web/challenges/challenge-page/830/)
- **Split**: test-dev
- **Overall Accuracy**: 44.01%

#### Performance Breakdown
- **Yes/No Questions**: 65.08% accuracy on binary questions
- **Number Questions**: 28.97% accuracy on counting and other numerical questions
- **Other Questions**: 29.32% accuracy on open-ended questions
- **Overall**: 44.01%, the weighted average across all question types
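
For context on how these numbers are produced, VQAv2 uses the VQA accuracy metric, which gives partial credit based on agreement with the ten human annotators. A simplified sketch of the per-question score (ignoring the official answer normalization and leave-one-out averaging):

```python
# Simplified VQA accuracy for a single question: full credit when at least
# 3 of the 10 human annotators gave the predicted answer, partial credit otherwise.
def vqa_accuracy(predicted: str, human_answers: list) -> float:
    predicted = predicted.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == predicted)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("yes", ["yes"] * 7 + ["no"] * 3))     # 1.0
print(vqa_accuracy("2", ["2", "two", "3"] + ["4"] * 7))  # 0.33...
```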

Planned tests include:

- the full VQAv2 test set (instead of test-dev)
- additional datasets from the [TinyLLaVA evaluation guide](https://tinyllava-factory.readthedocs.io/en/latest/Evaluation.html)

Community contributions with benchmark results are welcome and encouraged.

---

## Reproducibility

For reproducibility, please visit my fork of [TinyLLaVA_Factory](https://github.com/keeeeenw/TinyLLaVA_Factory), which follows the same pre-training and fine-tuning steps as the original implementation.

### Key Differences

**Pre-training Modifications:**

To support training on a single GPU, I modified several hyperparameters:
- `gradient_accumulation_steps`: 2 → 8
- `learning_rate`: 1e-3 → 2.5e-4
- `warmup_ratio`: 0.03 → 0.06

The original hyperparameters were too aggressive for pre-training, causing the training loss to increase after some time. With the updated values, the pre-training loss remained stable, which is the expected behavior for LLaVA's first stage, where the LLM output is aligned with the ViT features.
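
As a concrete reference, these are the modified pre-training values expressed as Hugging Face `TrainingArguments`. This is a sketch of the values only, not the actual TinyLLaVA Factory launch script, which sets them in its own training scripts:

```python
# The single-GPU pre-training hyperparameters listed above, in TrainingArguments form.
# output_dir is a hypothetical path; fp16 is left commented out because enabling it
# requires a GPU (pre-training itself used float16, as noted under Training Setup below).
from transformers import TrainingArguments

pretrain_args = TrainingArguments(
    output_dir="./checkpoints/microllava-pretrain",  # hypothetical
    gradient_accumulation_steps=8,  # was 2 in the original recipe
    learning_rate=2.5e-4,           # was 1e-3
    warmup_ratio=0.06,              # was 0.03
    # fp16=True,
)
print(pretrain_args.learning_rate, pretrain_args.warmup_ratio)
```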

**Fine-tuning Changes:**
- All major hyperparameters remain the same as in the original recipe
- Used `bfloat16` precision instead of `float16` for improved numerical stability
- The current model version does not use `ocr_vqa`, because of difficulties downloading all of the required images for fine-tuning

### Training Setup
- **Hardware**: single-GPU configuration (NVIDIA RTX 4090)
- **Precision**: bfloat16 for fine-tuning (changed from the original float16); float16 for pre-training, the same configuration as the original TinyLLaVA model
- **Stages**: two-stage training following the LLaVA methodology
  1. Pre-training: vision-language alignment with stable loss
  2. Fine-tuning: task-specific adaptation

---