Dots MOCR – 4-bit Quantized (NF4)

🔍 Introduction

This repository provides a 4-bit quantized version of dots.mocr, optimized using BitsAndBytes (NF4 precision) for efficient, low-memory inference.

The original model is a powerful multimodal OCR system capable of:

  • Document parsing
  • Layout understanding
  • Multilingual OCR
  • Structured outputs (JSON / Markdown / SVG)

This version enables deployment on low-VRAM GPUs while maintaining strong performance.


⚙️ Key Features

  • 4-bit quantization (NF4)
  • Reduced VRAM usage (~70–80%)
  • Faster inference
  • Compatible with Hugging Face Transformers
  • Supports OCR and document parsing
  • Suitable for edge and local deployments
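The VRAM figure above can be sanity-checked with back-of-the-envelope arithmetic. A rough sketch (assuming ~3B parameters and ~0.5 extra bits per parameter for quantization constants; activations and framework overhead are excluded):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

n = 3e9                                  # ~3B parameters (assumption)
fp16 = weight_memory_gb(n, 16)           # half-precision weights
nf4 = weight_memory_gb(n, 4 + 0.5)       # 4-bit weights + ~0.5 bit/param for scales
print(f"FP16: {fp16:.1f} GiB, NF4: {nf4:.1f} GiB, saving: {1 - nf4 / fp16:.0%}")
# → FP16: 5.6 GiB, NF4: 1.6 GiB, saving: 72%
```

This lands at roughly a 72% reduction in weight memory, consistent with the ~70–80% figure above.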

🛠️ Installation (Base Setup)

⚠️ This model depends on the original dots.mocr repository.

```bash
conda create -n dots_mocr python=3.12
conda activate dots_mocr

git clone https://github.com/rednote-hilab/dots.mocr.git
cd dots.mocr

pip install -e .
pip install flash-attn==2.8.0.post2
```

🚀 Usage (Quantized Inference)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "rednote-hilab/dots.mocr"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Example: text-only generation (see the dots.mocr repo for image inputs)
inputs = tokenizer("Extract text from image", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

📊 Quantization Details

| Parameter     | Value        |
|---------------|--------------|
| Precision     | 4-bit        |
| Quant Type    | NF4          |
| Compute Dtype | float16      |
| Double Quant  | Enabled      |
| Library       | BitsAndBytes |
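For intuition about what NF4 does: each block of weights is normalized by its absolute maximum, and every weight is then snapped to the nearest of 16 fixed levels chosen as quantiles of a standard normal distribution. A minimal pure-Python sketch (the level table is the NF4 codebook published in the QLoRA paper; blockwise chunking and double quantization are omitted for brevity):

```python
# 16 NF4 levels: quantiles of N(0, 1), normalized to [-1, 1] (QLoRA codebook).
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def nf4_quantize(block):
    """Store one 4-bit index per weight plus a single absmax scale per block."""
    scale = max(abs(w) for w in block) or 1.0
    idx = [min(range(16), key=lambda i: abs(w / scale - NF4_LEVELS[i]))
           for w in block]
    return idx, scale

def nf4_dequantize(idx, scale):
    """Reconstruct approximate weights from indices and the block scale."""
    return [NF4_LEVELS[i] * scale for i in idx]

weights = [0.8, -0.3, 0.05, -0.8]
idx, scale = nf4_quantize(weights)
restored = nf4_dequantize(idx, scale)
```

Only the 4-bit indices and one scale per block are stored, which is where the memory saving comes from; dequantization back to `float16` happens on the fly during compute.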

📌 Use Cases

  • Document OCR
  • PDF parsing
  • Layout detection
  • Structured data extraction
  • AI-powered document understanding
  • Edge deployment of large OCR models

⚠️ Limitations

  • Slight accuracy drop compared to full precision
  • GPU recommended for optimal performance
  • Some layers remain in higher precision
  • Not fully optimized for CPU inference

🔮 Future Work

  • GGUF conversion for CPU inference
  • FlashAttention optimization improvements
  • Integration with full OCR pipelines
  • Web UI (Gradio / Streamlit demo)
  • Benchmark comparisons (VRAM vs accuracy)

🙌 Acknowledgement

  • Base Model: rednote-hilab/dots.mocr
  • Quantization: BitsAndBytes
  • Framework: Hugging Face Transformers

📄 License

MIT License
