language:
- en
library_name: transformers
tags:
- pytorch
- safetensors
- vision-language
- visual-question-answering
pipeline_tag: visual-question-answering
license: apache-2.0
base_model:
- keeeeenw/MicroLlama
- google/siglip-so400m-patch14-384
MicroLLaVA (TinyLLaVA Factory based)
A compact vision language model that you can pretrain and finetune on a single consumer GPU.
TLDR
Item | Detail |
---|---|
Framework | Transformers + PyTorch |
Checkpoint type | safetensors |
LLM | keeeeenw/MicroLlama (about 300M parameters) |
Vision tower | siglip-so400m-patch14-384 |
Hardware used | Single NVIDIA RTX 4090 |
Training stack | No DeepSpeed required |
Intended tasks | Visual Question Answering, caption-style prompts |
Introduction
MicroLLaVA is a TinyLLaVA Factory based model that pairs a very small language model, keeeeenw/MicroLlama, with an efficient SigLIP vision encoder.
The goal is to create a vision language model that almost anyone can train and iterate on with one consumer GPU.
- Language model: keeeeenw/MicroLlama with ~300M parameters
- Vision encoder: siglip-so400m-patch14-384
- Training codebase: TinyLLaVA Factory with additional changes in my fork (custom fork with training tweaks)
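The vision encoder name encodes its patch size (14) and input resolution (384), which determines how many image patches the encoder produces. The sketch below works through that arithmetic, assuming non-overlapping patches with no padding. The helper name `patch_grid` is illustrative, and how many of these patch features actually reach the language model depends on the TinyLLaVA Factory connector settings, which are not stated here.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square image.

    Assumes non-overlapping patches with no padding, so leftover border
    pixels (384 % 14 = 6 in this case) are dropped.
    """
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# Values read off the checkpoint name "siglip-so400m-patch14-384".
per_side, total = patch_grid(image_size=384, patch_size=14)
print(per_side, total)  # 27 729
```

Under these assumptions, each image becomes a 27 x 27 grid, or 729 patches.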
Files included
File | Purpose |
---|---|
config.json | Model configuration for Transformers |
generation_config.json | Generation defaults |
model.safetensors | Weights |
tokenizer.model | SentencePiece model |
tokenizer_config.json | Tokenizer configuration |
special_tokens_map.json | Special token mapping |
trainer_state.json | Trainer state |
training_args.bin | Training arguments |
log.txt | Training log |
If your workflow uses a custom processor, also include preprocessor_config.json or processor_config.json so AutoProcessor.from_pretrained works.
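Before loading a local copy of the checkpoint, it can help to confirm that the expected files are present. The following is a minimal sketch using only the standard library. The split into "required" and "processor" files is an assumption made for illustration, and `check_checkpoint_dir` and the example path are hypothetical names, not part of this repo.

```python
from pathlib import Path

# File names copied from the table above; the grouping is illustrative.
REQUIRED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "tokenizer.model",
    "tokenizer_config.json",
    "special_tokens_map.json",
]
PROCESSOR_FILES = ["preprocessor_config.json", "processor_config.json"]


def check_checkpoint_dir(path):
    """Return (missing_required, has_processor_config) for a local folder."""
    root = Path(path)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    has_processor = any((root / name).is_file() for name in PROCESSOR_FILES)
    return missing, has_processor


if __name__ == "__main__":
    # "./MicroLlava" is a hypothetical local download location.
    missing, has_processor = check_checkpoint_dir("./MicroLlava")
    print("Missing files:", missing)
    print("Processor config found:", has_processor)
```

If no processor config is found, AutoProcessor.from_pretrained is likely to fail for that folder and images would need to be preprocessed manually.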
Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed.
Pretraining on LAION-CC-SBU-558K took about 5 hours on that GPU.
Supervised finetuning on all datasets from the TinyLLaVA Factory guide (except ocr_vqa) took about 12 hours on the same GPU.
Quick start
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
import torch

repo_id = "keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune"

tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Load the processor if the repo ships a processor config
try:
    processor = AutoProcessor.from_pretrained(repo_id)
except Exception:
    processor = None  # Optional if images are preprocessed manually

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # Needed when the repo includes custom model code
)

# Text-only call: no image is passed to the model here
inputs = tokenizer("Describe the image in one sentence.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
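Loading in float16 halves the weight footprint compared with float32. As a rough back-of-the-envelope sketch, the snippet below estimates the memory needed for the weights alone. The ~300M figure for the LLM comes from this card. The ~400M figure for the vision tower is an assumption read off the "so400m" name. The connector, activations, and KV cache are ignored, and `weight_memory_gib` is an illustrative helper.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: float, dtype: str = "float16") -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# ~300M (LLM, from this card) + ~400M (vision tower, assumed from "so400m").
total_params = 300e6 + 400e6
print(f"{weight_memory_gib(total_params, 'float16'):.2f} GiB")  # 1.30 GiB
```

Actual usage during inference or training will be higher than this estimate.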
Evaluation
Evaluation results will be added in the coming days. Planned tests include:
- VQAv2-style prompts for question answering
- and more
Community contributions with benchmark results are welcome and encouraged.
Intended uses and limitations
Intended uses
- Rapid experimentation for vision-language research on limited hardware
- Educational demonstrations for students and hobbyists
- Starting point for domain-specific finetuning
Limitations
- The small LLM size and compact vision encoder may limit reasoning depth and OCR performance
- Performance can vary significantly depending on the image domain and quality
- The model includes minimal safety filtering and refusal behavior; downstream applications should implement their own safeguards
⚠️ This model should not be used for applications that may cause harm or have significant safety, financial, legal, or medical implications without thorough human review.
Reproducibility checklist
To reproduce results and training runs:
- Fix all random seeds in training scripts
- Record exact dataset versions and any filtering applied
- Log optimizer type, learning rate schedule, precision settings, and gradient accumulation steps
- Save the exact TinyLLaVA Factory commit or fork commit used for both pretraining and finetuning
- Document hardware and software versions (CUDA, PyTorch, etc.)
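The first and last items of the checklist can be sketched in a few lines. This is a minimal example and not the seeding code used by TinyLLaVA Factory: `set_seed` and `environment_report` are hypothetical helpers, the seed value is arbitrary, and NumPy and PyTorch are only touched when they are installed.

```python
import json
import platform
import random
import sys


def set_seed(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # Safe to call without a GPU
    except ImportError:
        pass


def environment_report() -> dict:
    """Collect version info worth saving next to a training run."""
    report = {"python": sys.version.split()[0], "platform": platform.platform()}
    try:
        import torch
        report["torch"] = torch.__version__
        report["cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        report["torch"] = None
    return report


set_seed(42)  # 42 is an arbitrary example value
print(json.dumps(environment_report(), indent=2))
```

Saving the printed report alongside the training log makes it easier to compare runs later. Note that fixed seeds alone do not guarantee bit-identical results on GPU.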
Citation
@misc{wang2024microllama,
  title  = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
  author = {Zixiao Ken Wang},
  year   = {2025},
  url    = {https://huggingface.co/keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune}
}
License
This model is released under the Apache License 2.0.
You are free to use, modify, and distribute this model and its derivatives, provided that you comply with the terms of the license.
If you use this model in your research or applications, please credit the original authors and clearly indicate any modifications you have made.
Note: Ensure that the datasets used for pretraining or finetuning also allow redistribution of derived model weights.
Acknowledgements
This work builds upon the efforts of many in the open-source AI community:
- TinyLLaVA Factory maintainers and contributors for creating the training framework
- keeeeenw/MicroLlama: I am also the creator of MicroLlama. Please help support my work!
- SigLIP authors for the efficient vision encoder architecture
- Contributors to LAION-CC-SBU-558K and other datasets used in pretraining and finetuning
- The Hugging Face ecosystem for hosting, tools, and community support