---
license: apache-2.0
language:
- en
datasets:
- mychen76/invoices-and-receipts_ocr_v1
- unsloth/LaTeX_OCR
- prithivMLmods/Latex-KIE
base_model:
- Qwen/Qwen2-VL-2B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- image-caption
- mini
- art explain
- visual report generation
- photo captions
- cutlines
- qwen2
- inscription subtitle
- representation
---
![2.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/yUKVKSX2E18k0h3YwCx1h.png)
# **Imgscope-OCR-2B-0527**
> The **Imgscope-OCR-2B-0527** model is a fine-tuned version of *Qwen2-VL-2B-Instruct*, specifically optimized for *messy handwriting recognition*, *document OCR*, *realistic handwritten OCR*, and *math problem solving with LaTeX formatting*. This model is trained on custom datasets for document and handwriting OCR tasks and integrates a conversational approach with strong visual and textual understanding for multi-modal applications.

> [!note]
> Colab Demo: https://huggingface.co/prithivMLmods/Imgscope-OCR-2B-0527/blob/main/Imgscope%20OCR%202B%200527%20Demo/Imgscope-OCR-2B-0527.ipynb

> [!note]
> Video Understanding Demo: https://huggingface.co/prithivMLmods/Imgscope-OCR-2B-0527/blob/main/Imgscope-OCR-2B-05270-Video-Understanding/Imgscope-OCR-2B-0527-Video-Understanding.ipynb
---
### Key Enhancements
* **SoTA Understanding of Images of Various Resolution & Ratio**
Imgscope-OCR-2B-0527 builds on Qwen2-VL, which reports state-of-the-art results on visual understanding benchmarks such as MathVista, DocVQA, RealWorldQA, and MTVQA.
* **Enhanced Handwriting OCR**
Specifically optimized for recognizing and interpreting **realistic and messy handwriting** with high accuracy. Ideal for digitizing handwritten documents and notes.
* **Document OCR Fine-Tuning**
Fine-tuned with curated and realistic **document OCR datasets**, enabling accurate extraction of text from various structured and unstructured layouts.
* **Understanding Videos of 20+ Minutes**
Capable of processing long videos for **video-based question answering**, **transcription**, and **content generation**.
* **Device Control Agent**
Supports decision-making and control capabilities for integration with **mobile devices**, **robots**, and **automation systems** using visual-textual commands.
* **Multilingual OCR Support**
In addition to English and Chinese, the model supports **OCR in multiple languages** including European languages, Japanese, Korean, Arabic, and Vietnamese.
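
The variable-resolution support listed above is usually tuned through a pixel budget on the processor. The sketch below is not taken from this model card: it assumes the Qwen2-VL convention that one visual token covers a 28x28 pixel area, and the 256 to 1280 token range is only an illustrative default. The helper name `pixel_budget` is hypothetical.

```python
PATCH = 28  # assumed pixels per side covered by one visual token (Qwen2-VL convention)


def pixel_budget(min_tokens=256, max_tokens=1280):
    """Translate a visual-token range into (min_pixels, max_pixels)."""
    return min_tokens * PATCH * PATCH, max_tokens * PATCH * PATCH


min_pixels, max_pixels = pixel_budget()
print(min_pixels, max_pixels)

# The budget can then be passed when loading the processor, for example:
# processor = AutoProcessor.from_pretrained(
#     "prithivMLmods/Imgscope-OCR-2B-0527",
#     min_pixels=min_pixels,
#     max_pixels=max_pixels,
# )
```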
---
### How to Use
```python
import torch  # only needed for the optional Flash Attention variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Imgscope-OCR-2B-0527",  # replace with updated model ID if available
    torch_dtype="auto",
    device_map="auto",
)

# Optional: Flash Attention for performance optimization
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "prithivMLmods/Imgscope-OCR-2B-0527",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Load processor
processor = AutoProcessor.from_pretrained("prithivMLmods/Imgscope-OCR-2B-0527")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Recognize the handwriting in this image."},
        ],
    }
]

# Prepare input
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to wherever device_map="auto" placed the model
inputs = inputs.to(model.device)

# Generate output, then drop the prompt tokens from each sequence
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
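
For video input, the same pipeline can be fed a message whose content entry has type `video` instead of `image`. The following is a minimal sketch that assumes the message schema accepted by `qwen_vl_utils.process_vision_info`; the helper name `build_video_messages`, the file path, and the `fps` and `max_pixels` values are placeholders, not settings recommended by this model card.

```python
def build_video_messages(video_path, prompt, fps=1.0, max_pixels=360 * 420):
    """Return a single-turn message list with one video and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                # fps controls frame sampling; max_pixels caps each frame's resolution
                {"type": "video", "video": video_path, "max_pixels": max_pixels, "fps": fps},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Hypothetical local file; pass the result to apply_chat_template and process_vision_info
messages = build_video_messages(
    "file:///path/to/video.mp4",
    "Transcribe any text shown in this video.",
)
```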
---
### Demo Inference
![Screenshot 2025-05-27 at 03-40-34 Gradio.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/9KiRkOGPB8cLl6VHwh2UD.png)
![Screenshot 2025-05-27 at 03-40-56 (anonymous) - output_e0fbfa20-686e-4bce-b2e8-25991be5a5a0.pdf.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/VOHQIrT7hCs5afGMRROvD.png)
### Video Inference
![Screenshot 2025-05-27 at 20-14-22 Video Understanding with Imgscope-OCR-2B-0527.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/fyAVI0hZICWpSXlcKaJF4.png)
---
### Buffering Output (Streaming)
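```python
def stream_text(streamer):
    """Yield the accumulated output each time `streamer` produces a new chunk."""
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        # Strip the end-of-turn marker so it never reaches the caller
        buffer = buffer.replace("<|im_end|>", "")
        yield buffer
```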
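
The buffering loop needs something producing chunks while it consumes them. One common arrangement, sketched here as an assumption rather than as the demo's actual code, runs generation in a background thread and reads from a `TextIteratorStreamer`. The helper name `stream_response` and its parameters are hypothetical; the commented usage assumes the `model`, `processor`, and `inputs` objects from the How to Use section.

```python
from threading import Thread


def stream_response(generate_fn, streamer, end_marker="<|im_end|>", **generate_kwargs):
    """Run generate_fn in a background thread and yield the growing text buffer."""
    thread = Thread(target=generate_fn, kwargs={**generate_kwargs, "streamer": streamer})
    thread.start()
    buffer = ""
    for new_text in streamer:  # blocks until the producer emits the next chunk
        buffer += new_text
        buffer = buffer.replace(end_marker, "")
        yield buffer
    thread.join()


# Possible wiring with transformers:
# from transformers import TextIteratorStreamer
# streamer = TextIteratorStreamer(
#     processor.tokenizer, skip_prompt=True, skip_special_tokens=True
# )
# for partial in stream_response(model.generate, streamer, **inputs, max_new_tokens=128):
#     print(partial)
```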
---
### Key Features
1. **Realistic Messy Handwriting OCR**
* Fine-tuned for **complex and hard-to-read handwritten inputs** using real-world handwriting datasets.
2. **Document OCR and Layout Understanding**
* Accurately extracts text from structured documents, including scanned pages, forms, and academic papers.
3. **Image and Text Multi-modal Reasoning**
* Combines **vision-language capabilities** for tasks like captioning, answering image-based queries, and understanding image+text prompts.
4. **Math Problem Solving and LaTeX Rendering**
* Converts mathematical expressions and problem-solving steps into **LaTeX** format.
5. **Multi-turn Conversations**
* Supports **dialogue-based reasoning**, retaining context for follow-up questions.
6. **Video + Image + Text-to-Text Generation**
* Accepts inputs from videos, images, or combined media with text, and generates relevant output accordingly.
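
Multi-turn use amounts to appending the model's previous answer and the next question to the message list before calling `apply_chat_template` again. This is a minimal sketch assuming the same message schema as the How to Use example; `add_turn` is a hypothetical helper, and the image path and reply text are placeholders.

```python
def add_turn(messages, assistant_reply, follow_up):
    """Return a new message list with the model's reply and a follow-up question appended."""
    return messages + [
        {"role": "assistant", "content": [{"type": "text", "text": assistant_reply}]},
        {"role": "user", "content": [{"type": "text", "text": follow_up}]},
    ]


history = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/notes.jpg"},  # placeholder path
            {"type": "text", "text": "Recognize the handwriting in this image."},
        ],
    }
]
# assistant_reply would normally be the decoded output of the previous generate() call
history = add_turn(history, "x^2 + 2x + 1 = 0", "Rewrite that equation in LaTeX.")
```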
---
## **Intended Use**
**Imgscope-OCR-2B-0527** is intended for:
* Handwritten and printed document digitization
* OCR pipelines for educational institutions and businesses
* Academic and scientific content parsing, especially math-heavy documents
* Assistive tools for visually impaired users
* Robotic and mobile automation agents interpreting screen or camera data
* Multilingual OCR processing for document translation or archiving