Dots MOCR – 4-bit Quantized (NF4)

🔍 Introduction

This repository provides a 4-bit quantized version of dots.mocr, optimized using BitsAndBytes (NF4 precision) for efficient, low-memory inference.

The original model is a powerful multimodal OCR system capable of:

  • Document parsing
  • Layout understanding
  • Multilingual OCR
  • Structured outputs (JSON / Markdown / SVG)

This version enables deployment on low-VRAM GPUs while maintaining strong performance.


⚙️ Key Features

  • 4-bit quantization (NF4)
  • Reduced VRAM usage (~70–80%)
  • Faster inference
  • Compatible with Hugging Face Transformers
  • Supports OCR and document parsing
  • Suitable for edge and local deployments
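The VRAM figure above can be sanity-checked with back-of-the-envelope arithmetic. A rough sketch (assuming ~3B parameters and ~0.5 extra bits per parameter for quantization constants; activations and framework overhead are excluded):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

n = 3e9                                  # ~3B parameters (assumption)
fp16 = weight_memory_gb(n, 16)           # half-precision weights
nf4 = weight_memory_gb(n, 4 + 0.5)       # 4-bit weights + ~0.5 bit/param for scales
print(f"FP16: {fp16:.1f} GiB, NF4: {nf4:.1f} GiB, saving: {1 - nf4 / fp16:.0%}")
# → FP16: 5.6 GiB, NF4: 1.6 GiB, saving: 72%
```

This lands at roughly a 72% reduction in weight memory, consistent with the ~70–80% figure above.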

🛠️ Installation (Base Setup)

⚠️ This model depends on the original dots.mocr repository.

```bash
conda create -n dots_mocr python=3.12
conda activate dots_mocr

git clone https://github.com/rednote-hilab/dots.mocr.git
cd dots.mocr

pip install -e .
pip install flash-attn==2.8.0.post2
```

🚀 Usage (Quantized Inference)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "rednote-hilab/dots.mocr"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Example: text-only generation (see the dots.mocr repo for image inputs)
inputs = tokenizer("Extract text from image", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

📊 Quantization Details

| Parameter     | Value        |
|---------------|--------------|
| Precision     | 4-bit        |
| Quant Type    | NF4          |
| Compute Dtype | float16      |
| Double Quant  | Enabled      |
| Library       | BitsAndBytes |
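For intuition about what NF4 does: each block of weights is normalized by its absolute maximum, and every weight is then snapped to the nearest of 16 fixed levels chosen as quantiles of a standard normal distribution. A minimal pure-Python sketch (the level table is the NF4 codebook published in the QLoRA paper; blockwise chunking and double quantization are omitted for brevity):

```python
# 16 NF4 levels: quantiles of N(0, 1), normalized to [-1, 1] (QLoRA codebook).
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def nf4_quantize(block):
    """Store one 4-bit index per weight plus a single absmax scale per block."""
    scale = max(abs(w) for w in block) or 1.0
    idx = [min(range(16), key=lambda i: abs(w / scale - NF4_LEVELS[i]))
           for w in block]
    return idx, scale

def nf4_dequantize(idx, scale):
    """Reconstruct approximate weights from indices and the block scale."""
    return [NF4_LEVELS[i] * scale for i in idx]

weights = [0.8, -0.3, 0.05, -0.8]
idx, scale = nf4_quantize(weights)
restored = nf4_dequantize(idx, scale)
```

Only the 4-bit indices and one scale per block are stored, which is where the memory saving comes from; dequantization back to `float16` happens on the fly during compute.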

📌 Use Cases

  • Document OCR
  • PDF parsing
  • Layout detection
  • Structured data extraction
  • AI-powered document understanding
  • Edge deployment of large OCR models

⚠️ Limitations

  • Slight accuracy drop compared to full precision
  • GPU recommended for optimal performance
  • Some layers remain in higher precision
  • Not fully optimized for CPU inference

🔮 Future Work

  • GGUF conversion for CPU inference
  • FlashAttention optimization improvements
  • Integration with full OCR pipelines
  • Web UI (Gradio / Streamlit demo)
  • Benchmark comparisons (VRAM vs accuracy)

🙌 Acknowledgement

  • Base Model: rednote-hilab/dots.mocr
  • Quantization: BitsAndBytes
  • Framework: Hugging Face Transformers

📄 License

MIT License
