---
language:
- en
- zh
- es
- fr
- de
- ja
- ko
- ar
- hi
- ru
license: apache-2.0
tags:
- ocr
- vision-language
- qwen2-vl
- custom-model
- text-extraction
- document-ai
library_name: transformers
pipeline_tag: image-to-text
base_model: Qwen/Qwen2-VL-2B-Instruct
datasets:
- custom
metrics:
- accuracy
- bleu
widget:
- src: >-
    https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg
  example_title: Document OCR
---
# textract-ai

A custom OCR (Optical Character Recognition) model built on top of Qwen2-VL-2B-Instruct, specifically designed for high-accuracy text extraction from images and documents.
## Model Description

This model combines the vision-language capabilities of Qwen2-VL with custom OCR-specific heads to provide:
- High-accuracy text extraction from images and documents
- Multi-language support for 10+ languages
- Robust architecture with fallback mechanisms
- Production-ready inference capabilities
- Custom OCR heads trained for text recognition tasks
## Architecture

```
Custom OCR Model
├── Qwen2-VL-2B (Frozen Backbone)
│   ├── Vision Encoder (ViT-based)
│   └── Language Model (Qwen2-2B)
├── Custom OCR Heads
│   ├── Text Recognition Head
│   └── Confidence Estimation Head
└── Multi-API Processing Pipeline
```
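The custom heads ship as part of this repository's remote code and their exact design is not documented here. As a rough illustration only, below is a minimal PyTorch sketch of how a text-recognition head and a confidence-estimation head could sit on top of frozen backbone features; the `OCRHeads` class, all layer sizes, and the 1536 hidden size (Qwen2-VL-2B's language-model width) are assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

class OCRHeads(nn.Module):
    """Hypothetical OCR heads over frozen backbone hidden states.

    hidden_size=1536 matches Qwen2-VL-2B's language model; everything
    else here is an illustrative guess, not the repository's code.
    """

    def __init__(self, hidden_size: int = 1536, vocab_size: int = 151936):
        super().__init__()
        # Text recognition head: project hidden states to vocabulary logits
        self.text_head = nn.Linear(hidden_size, vocab_size)
        # Confidence estimation head: pool the sequence to a single [0, 1] score
        self.confidence_head = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_size) from the frozen backbone
        logits = self.text_head(hidden_states)
        # Mean-pool over the sequence before scoring confidence
        confidence = self.confidence_head(hidden_states.mean(dim=1))
        return logits, confidence.squeeze(-1)

# Smoke test with random features and a small placeholder vocabulary
heads = OCRHeads(vocab_size=1000)
feats = torch.randn(2, 16, 1536)
logits, conf = heads(feats)
print(logits.shape, conf.shape)  # torch.Size([2, 16, 1000]) torch.Size([2])
```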
## Model Details
- Base Model: Qwen/Qwen2-VL-2B-Instruct
- Model Size: ~2.5B parameters
- Architecture: Vision-Language Transformer with custom OCR heads
- Languages: English, Chinese, Spanish, French, German, Japanese, Korean, Arabic, Hindi, Russian
- Input: Images (JPEG, PNG, PDF, TIFF)
- Output: Extracted text with confidence scores
## Usage

### Quick Start
```python
from transformers import AutoModel, AutoProcessor
from PIL import Image

# Load model and processor
model = AutoModel.from_pretrained("BabaK07/textract-ai", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("BabaK07/textract-ai")

# Load image
image = Image.open("document.jpg")

# Extract text
result = model.generate_ocr_text(image, use_native=True)
print(f"Extracted text: {result['text']}")
print(f"Confidence: {result['confidence']:.3f}")
```
### Advanced Usage

```python
from PIL import Image
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained("BabaK07/textract-ai", trust_remote_code=True)

# Process image
image = Image.open("invoice.jpg")

# Extract text with custom parameters
result = model.generate_ocr_text(
    image=image,
    use_native=True  # Use Qwen's native OCR capabilities
)

# Access detailed results
print(f"Text: {result['text']}")
print(f"Confidence: {result['confidence']}")
print(f"Method: {result['method']}")
```
### Batch Processing

```python
from PIL import Image

# Assumes `model` is already loaded as in Quick Start
# Load multiple images
images = [Image.open(f"doc_{i}.jpg") for i in range(5)]

# Process images one at a time
results = []
for image in images:
    result = model.generate_ocr_text(image)
    results.append(result)

# Print results
for i, result in enumerate(results):
    print(f"Document {i+1}: {result['text'][:50]}...")
```
## Performance
- Accuracy: High accuracy on document OCR tasks
- Speed: ~1-3 seconds per image (depending on hardware)
- Memory: ~6GB GPU memory recommended
- Languages: Supports 10+ major languages
## Training

This model was built using:
- Base Model: Qwen2-VL-2B-Instruct (frozen)
- Custom Heads: Trained OCR-specific layers
- Architecture: Vision-language transformer with custom components
- Optimization: Multiple API fallbacks for robustness
## Limitations
- Performance depends on image quality and text clarity
- Best results with printed text; handwriting accuracy may vary
- Requires sufficient GPU memory for optimal performance
- Some complex layouts may need preprocessing (see the sketch below)
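For cluttered scans, light image cleanup with Pillow before calling the model often helps. The sketch below is a generic suggestion, not part of this model's pipeline; the function name, filename, and the 1536-pixel size cap are placeholders.

```python
from PIL import Image, ImageOps

def preprocess_for_ocr(path: str, max_side: int = 1536) -> Image.Image:
    """Generic cleanup before OCR: grayscale, contrast stretch, downscale."""
    image = Image.open(path).convert("L")  # grayscale
    image = ImageOps.autocontrast(image)   # stretch contrast across the range
    image.thumbnail((max_side, max_side))  # cap the longest side in place
    return image.convert("RGB")            # back to RGB for the vision encoder

# Assumes `model` is loaded as in Quick Start
image = preprocess_for_ocr("complex_layout.jpg")
result = model.generate_ocr_text(image)
```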
## Use Cases
- Document Digitization: Convert scanned documents to text
- Invoice Processing: Extract data from invoices and receipts
- Form Processing: Digitize forms and applications
- Multi-language Documents: Process documents in various languages
- Batch Processing: Handle large volumes of documents
## Technical Details

### Model Architecture
- Vision Encoder: Based on Vision Transformer (ViT)
- Language Decoder: Qwen2-2B language model
- Custom Heads: OCR-specific text recognition and confidence estimation
- Integration: Multiple API approaches for robustness
### Inference Pipeline
- Image preprocessing and normalization
- Vision feature extraction using Qwen's ViT encoder
- Text generation using the language model
- Confidence estimation and post-processing
- Multiple fallback methods for reliability
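The repository's remote code implements these steps end to end. As an illustration, here is a minimal sketch of the extraction and confidence steps using the base Qwen2-VL API directly; the prompt, filename, and the confidence formula (mean top-token probability over the generated sequence) are assumptions, not necessarily how this model's confidence head works.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the base model (not the custom heads) for illustration
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

image = Image.open("document.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract all text from this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate and keep per-step scores for confidence estimation
out = model.generate(
    **inputs, max_new_tokens=512,
    output_scores=True, return_dict_in_generate=True,
)
new_tokens = out.sequences[:, inputs.input_ids.shape[1]:]
text = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

# Confidence: mean probability of each chosen token (an assumed heuristic)
probs = [torch.softmax(s, dim=-1).max().item() for s in out.scores]
confidence = sum(probs) / len(probs)
print(f"Text: {text}\nConfidence: {confidence:.3f}")
```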
## Installation

```bash
pip install transformers torch pillow
```

For GPU support:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
## Citation

```bibtex
@software{custom_ocr_qwen,
  title={Custom OCR Model based on Qwen2-VL},
  author={BabaK07},
  year={2024},
  url={https://huggingface.co/BabaK07/textract-ai}
}
```
## License

This model is released under the Apache 2.0 license, following the license of the base Qwen2-VL model.
## Acknowledgments

- Built on top of Qwen2-VL-2B-Instruct
- Thanks to the Qwen team for the excellent base model
- Custom architecture and training by BabaK07
## Contact
For questions or issues, please open an issue on the model repository or contact the author.