dots.ocr-4bit: A 4-bit Quantized Version

This repository contains a 4-bit quantized version of the powerful dots.ocr model by Rednote HiLab. The quantization was performed with bitsandbytes (NF4 precision), yielding significant memory savings with minimal performance loss and making this state-of-the-art model accessible on consumer-grade GPUs.
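For reference, a quantized load of this kind can be produced with a bitsandbytes NF4 configuration along the following lines. This is a minimal sketch, not the exact recipe used for this checkpoint; in particular, the double-quantization flag and the bfloat16 compute dtype are assumptions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization config; double quantization and the bfloat16
# compute dtype are assumed, not confirmed for this checkpoint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Quantize the original full-precision model on the fly while loading
model = AutoModelForCausalLM.from_pretrained(
    "rednote-hilab/dots.ocr",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)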

This work is entirely a derivative of the original model. All credit for the model architecture, training, and groundbreaking research goes to the original authors. A huge thank you to them for open-sourcing their work.

Model Description (from original authors)

dots.ocr is a powerful, multilingual document parser that unifies layout detection and content recognition within a single vision-language model while maintaining good reading order. Despite its compact 1.7B-parameter LLM foundation, it achieves state-of-the-art (SOTA) performance.

How to Use This 4-bit Version

First, ensure you have the necessary dependencies installed. Because this model uses custom code, you must clone the original repository and install it.

# It's recommended to clone the original repo to get all utility scripts
git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr

# Install the custom code and dependencies
pip install -e .
pip install torch transformers accelerate bitsandbytes peft sentencepiece
# qwen-vl-utils provides the process_vision_info helper used in the script below
pip install qwen-vl-utils

You can then use the 4-bit model with the following Python script. Note the generation parameters (repetition_penalty, do_sample, etc.), which are recommended to keep the quantized model from looping.

import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
from huggingface_hub import snapshot_download

# process_vision_info comes from the qwen-vl-utils package
from qwen_vl_utils import process_vision_info

MODEL_ID = "helizac/dots.ocr-4bit"

# Download the quantized checkpoint into the local Hugging Face cache
local_model_path = snapshot_download(repo_id=MODEL_ID)

# The 4-bit weights are loaded according to the quantization config stored in
# the checkpoint, so no extra bitsandbytes flags are needed here.
model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn; remove if unavailable
)
processor = AutoProcessor.from_pretrained(local_model_path, trust_remote_code=True, use_fast=True)

image_path = "test.jpg"
image = Image.open(image_path)  # sanity-check that the image can be opened

prompt_text = """\
Please output the layout information from the image, including each layout element's bbox, its category, and the corresponding text content within the bbox.
1. Bbox format: [x1, y1, x2, y2]
2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].
3. Text Extraction & Formatting Rules:
- Picture: For the 'Picture' category, the text field should be omitted.
- Formula: Format its text as LaTeX.
- Table: Format its text as HTML.
- All Others (Text, Title, etc.): Format their text as Markdown.
4. Constraints:
- The output text must be the original text from the image, with no translation.
- All layout elements must be sorted according to human reading order.
5. Final Output: The entire output must be a single JSON object.\
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt_text},
        ],
    }
]

# Build the chat-formatted prompt and preprocess the image inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to(model.device)

# Sampling with a repetition penalty helps keep the quantized model from looping;
# max_new_tokens=256 is demo-sized, so raise it for full-page layouts.
generated_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    repetition_penalty=1.15,
)

# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output_text)
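
Since the prompt asks for a single JSON object, the decoded string can usually be parsed directly. Below is a minimal post-processing sketch, assuming the model follows the prompt and emits a JSON array of layout elements with bbox, category, and (except for pictures) text fields:

import json

try:
    layout = json.loads(output_text)
except json.JSONDecodeError:
    layout = None  # generation was truncated or drifted from pure JSON

if isinstance(layout, list):
    for element in layout:
        # Per the prompt rules, 'text' is omitted for 'Picture' elements
        print(element["category"], element["bbox"], element.get("text", ""))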

License

This model is released under the MIT License, same as the original model.
