metadata

language:
  - en
library_name: transformers
tags:
  - pytorch
  - safetensors
  - vision-language
  - visual-question-answering
pipeline_tag: visual-question-answering
license: apache-2.0
base_model:
  - keeeeenw/MicroLlama
  - google/siglip-so400m-patch14-384

MicroLLaVA (TinyLLaVA Factory based)

A compact vision language model that you can pretrain and finetune on a single consumer GPU.

TLDR

Item	Detail
Framework	Transformers + PyTorch
Checkpoint type	`safetensors`
LLM	`keeeeenw/MicroLlama` (about 300M parameters)
Vision tower	`siglip-so400m-patch14-384`
Hardware used	Single NVIDIA RTX 4090
Training stack	No DeepSpeed required
Intended tasks	Visual Question Answering, caption-style prompts

Introduction

MicroLLaVA is a TinyLLaVA Factory based model that pairs a very small language model, keeeeenw/MicroLlama, with an efficient SigLIP vision encoder.
The goal is to create a vision language model that almost anyone can train and iterate on with one consumer GPU.

Language model: keeeeenw/MicroLlama with ~300M parameters
Vision encoder: siglip-so400m-patch14-384
Training codebase: TinyLLaVA Factory, with additional training tweaks in my custom fork
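
The encoder name carries its input geometry: `patch14` (14-pixel patches) and `384` (384-pixel square input). As a rough sketch, assuming non-overlapping patches and no padding, the number of patch embeddings per image can be estimated as below. `patch_grid` is a hypothetical helper written for illustration, and how many of these embeddings the TinyLLaVA Factory connector actually forwards to the language model is not stated here.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square ViT-style input.

    Assumes non-overlapping patches (stride == patch size) and no padding,
    so leftover border pixels are dropped.
    """
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# Values read off the encoder name siglip-so400m-patch14-384.
per_side, total = patch_grid(image_size=384, patch_size=14)
print(per_side, total)  # 27 729
```

Under these assumptions each image contributes up to 729 visual tokens, which is worth keeping in mind when choosing a context length for such a small language model.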

Files included

File	Purpose
`config.json`	Model configuration for Transformers
`generation_config.json`	Generation defaults
`model.safetensors`	Weights
`tokenizer.model`	SentencePiece model
`tokenizer_config.json`	Tokenizer configuration
`special_tokens_map.json`	Special token mapping
`trainer_state.json`	Trainer state
`training_args.bin`	Training arguments
`log.txt`	Training log

If your workflow uses a custom processor, also include preprocessor_config.json or processor_config.json so AutoProcessor.from_pretrained works.
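
Before loading a local copy of the checkpoint, it can help to confirm which files are present. The sketch below is illustrative only: `check_checkpoint` and the `./MicroLlava` path are hypothetical, and the split between files needed for loading and training artifacts (`trainer_state.json`, `training_args.bin`, `log.txt`) is an assumption based on the table above.

```python
from pathlib import Path

# Assumed to be needed for loading; names taken from the file table.
REQUIRED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "tokenizer.model",
    "tokenizer_config.json",
    "special_tokens_map.json",
]
# Either of these lets AutoProcessor.from_pretrained find a processor config.
PROCESSOR_FILES = ["preprocessor_config.json", "processor_config.json"]


def check_checkpoint(folder):
    """Return (missing_required_files, has_processor_config) for a local folder."""
    root = Path(folder)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    has_processor = any((root / name).is_file() for name in PROCESSOR_FILES)
    return missing, has_processor


# Hypothetical local path; replace with wherever the repo was downloaded.
missing, has_processor = check_checkpoint("./MicroLlava")
print("missing files:", missing)
print("processor config present:", has_processor)
```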

Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed.

Pretraining on LAION-CC-SBU-558K took about 5 hours on this setup.

Supervised finetuning on all datasets from the TinyLLaVA Factory guide (except ocr_vqa) took about 12 hours on the same GPU, for roughly 17 hours of training in total.

Quick start

from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
import torch

repo_id = "keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune"

tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Load the processor only if the repo ships a processor config
try:
    processor = AutoProcessor.from_pretrained(repo_id)
except Exception:
    processor = None  # Optional if images are preprocessed manually

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    device_map="auto",  # Requires the `accelerate` package
    trust_remote_code=True,  # Executes custom model code from the repo; needed only if the repo includes it
)

# Text-only smoke test: no image is passed to the model in this call
inputs = tokenizer("Describe the image in one sentence.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
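
The quick start loads the weights in `float16`. As a back-of-the-envelope check on why this fits comfortably on a consumer GPU, the sketch below estimates the memory taken by the language model weights alone. `weight_memory_gib` is a hypothetical helper, the 300M figure is the approximate count quoted above, and the estimate deliberately ignores the vision tower, connector, activations, and KV cache.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights only, in GiB."""
    return num_params * bytes_per_param / 2**30


llm_params = 300e6  # "about 300M parameters" for MicroLlama

print(f"fp16: {weight_memory_gib(llm_params, 2):.2f} GiB")  # fp16: 0.56 GiB
print(f"fp32: {weight_memory_gib(llm_params, 4):.2f} GiB")  # fp32: 1.12 GiB
```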

Evaluation

Evaluation results will be added in the coming days. Planned tests include VQAv2-style prompts for question answering, with more to follow.

Community contributions with benchmark results are welcome and encouraged.
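
For anyone contributing VQAv2-style numbers, the sketch below shows a simplified form of the commonly used VQA accuracy score, in which a prediction earns full credit once at least three human annotators gave the same answer. This is an illustration, not the evaluation used for this model: `normalize` and `vqa_accuracy` are hypothetical helpers, and the official evaluation applies more thorough answer normalization and averages over annotator subsets.

```python
def normalize(answer: str) -> str:
    """Minimal normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(number of matching human answers / 3, 1)."""
    pred = normalize(prediction)
    matches = sum(1 for answer in human_answers if normalize(answer) == pred)
    return min(matches / 3.0, 1.0)


# Made-up annotations for one question, for illustration only.
humans = ["yes"] * 2 + ["no"] * 8
print(vqa_accuracy("No", humans))     # 1.0
print(vqa_accuracy("maybe", humans))  # 0.0
```

A dataset-level score would then be the mean of this value over all questions.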

Intended uses and limitations

Intended uses

Rapid experimentation for vision-language research on limited hardware
Educational demonstrations for students and hobbyists
Starting point for domain-specific finetuning

Limitations

The small LLM size and compact vision encoder may limit reasoning depth and OCR performance
Performance can vary significantly depending on the image domain and quality
The model includes minimal safety filtering and refusal behavior; downstream applications should implement their own safeguards

⚠️ This model should not be used for applications that may cause harm or have significant safety, financial, legal, or medical implications without thorough human review.

Reproducibility checklist

To reproduce results and training runs:

Fix all random seeds in training scripts
Record exact dataset versions and any filtering applied
Log optimizer type, learning rate schedule, precision settings, and gradient accumulation steps
Save the exact TinyLLaVA Factory commit or fork commit used for both pretraining and finetuning
Document hardware and software versions (CUDA, PyTorch, etc.)
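
The checklist above can be partly automated. The following is a minimal sketch, assuming a plain Python training script: `set_seed` and `build_manifest` are hypothetical helpers (not part of TinyLLaVA Factory), NumPy and PyTorch are seeded only when they are installed, and every value in the usage example is a placeholder to be replaced with the real settings of a run.

```python
import json
import platform
import random
import sys


def set_seed(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # seeds the generators on all devices
    except ImportError:
        pass


def build_manifest(seed, code_commit, datasets, hyperparams):
    """Collect the checklist items into one JSON-serializable dict."""
    manifest = {
        "seed": seed,
        "code_commit": code_commit,
        "datasets": datasets,
        "hyperparams": hyperparams,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        import torch
        manifest["torch"] = torch.__version__
        manifest["cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        pass
    return manifest


# Placeholder values for illustration only.
set_seed(42)
manifest = build_manifest(
    seed=42,
    code_commit="<fork commit hash>",
    datasets={"pretrain": "<dataset version>", "finetune": "<dataset version>"},
    hyperparams={"optimizer": "<name>", "lr_schedule": "<name>", "precision": "<dtype>"},
)
print(json.dumps(manifest, indent=2))
```

Saving this manifest next to `log.txt` keeps the run settings alongside the training log.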

Citation

@misc{wang2025microllava,
  title        = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
  author       = {Zixiao Ken Wang},
  year         = {2025},
  url          = {https://huggingface.co/keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune}
}

License

This model is released under the Apache License 2.0.

You are free to use, modify, and distribute this model and its derivatives, provided that you comply with the terms of the license.
If you use this model in your research or applications, please credit the original authors and clearly indicate any modifications you have made.

Note: Ensure that the datasets used for pretraining or finetuning also allow redistribution of derived model weights.

Acknowledgements

This work builds upon the efforts of many in the open-source AI community:

TinyLLaVA Factory maintainers and contributors for creating the training framework
keeeeenw/MicroLlama, the language model backbone (I am also the creator of MicroLlama, so please help support my work!)
SigLIP authors for the efficient vision encoder architecture
Contributors to LAION-CC-SBU-558K and other datasets used in pretraining and finetuning
The Hugging Face ecosystem for hosting, tools, and community support