---
language:
- en
- zh
- es
- fr
- de
- ja
- ko
- ar
- hi
- ru
license: apache-2.0
tags:
- ocr
- vision-language
- qwen2-vl
- custom-model
- text-extraction
- document-ai
library_name: transformers
pipeline_tag: image-to-text
base_model: Qwen/Qwen2-VL-2B-Instruct
datasets:
- custom
metrics:
- accuracy
- bleu
widget:
- src: >-
    https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg
  example_title: Document OCR
---
# textract-ai

A custom OCR (Optical Character Recognition) model built on top of Qwen2-VL-2B-Instruct, specifically designed for high-accuracy text extraction from images and documents.
## Model Description

This model combines the vision-language capabilities of Qwen2-VL with custom OCR-specific heads to provide:
- High-accuracy text extraction from images and documents
- Multi-language support for 10+ languages
- Robust architecture with fallback mechanisms
- Production-ready inference capabilities
- Custom OCR heads trained for text recognition tasks
## Architecture

```
Custom OCR Model
├── Qwen2-VL-2B (Frozen Backbone)
│   ├── Vision Encoder (ViT-based)
│   └── Language Model (Qwen2-2B)
├── Custom OCR Heads
│   ├── Text Recognition Head
│   └── Confidence Estimation Head
└── Multi-API Processing Pipeline
```
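The custom heads ship as part of this repository's remote code and their exact design is not documented here. As a rough illustration only, below is a minimal PyTorch sketch of how a text-recognition head and a confidence-estimation head could sit on top of frozen backbone features; the `OCRHeads` class, all layer sizes, and the 1536 hidden size (Qwen2-VL-2B's language-model width) are assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

class OCRHeads(nn.Module):
    """Hypothetical OCR heads over frozen backbone hidden states.

    hidden_size=1536 matches Qwen2-VL-2B's language model; everything
    else here is an illustrative guess, not the repository's code.
    """

    def __init__(self, hidden_size: int = 1536, vocab_size: int = 151936):
        super().__init__()
        # Text recognition head: project hidden states to vocabulary logits
        self.text_head = nn.Linear(hidden_size, vocab_size)
        # Confidence estimation head: pool the sequence to a single [0, 1] score
        self.confidence_head = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_size) from the frozen backbone
        logits = self.text_head(hidden_states)
        # Mean-pool over the sequence before scoring confidence
        confidence = self.confidence_head(hidden_states.mean(dim=1))
        return logits, confidence.squeeze(-1)

# Smoke test with random features and a small placeholder vocabulary
heads = OCRHeads(vocab_size=1000)
feats = torch.randn(2, 16, 1536)
logits, conf = heads(feats)
print(logits.shape, conf.shape)  # torch.Size([2, 16, 1000]) torch.Size([2])
```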
## Model Details
- Base Model: Qwen/Qwen2-VL-2B-Instruct
- Model Size: ~2.5B parameters
- Architecture: Vision-Language Transformer with custom OCR heads
- Languages: English, Chinese, Spanish, French, German, Japanese, Korean, Arabic, Hindi, Russian
- Input: Images (JPEG, PNG, PDF, TIFF)
- Output: Extracted text with confidence scores
## Usage

### Quick Start
```python
from transformers import AutoModel, AutoProcessor
from PIL import Image

# Load model and processor
model = AutoModel.from_pretrained("BabaK07/textract-ai", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("BabaK07/textract-ai")

# Load image
image = Image.open("document.jpg")

# Extract text
result = model.generate_ocr_text(image, use_native=True)
print(f"Extracted text: {result['text']}")
print(f"Confidence: {result['confidence']:.3f}")
```
### Advanced Usage

```python
from PIL import Image
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained("BabaK07/textract-ai", trust_remote_code=True)

# Process image
image = Image.open("invoice.jpg")

# Extract text with custom parameters
result = model.generate_ocr_text(
    image=image,
    use_native=True  # Use Qwen's native OCR capabilities
)

# Access detailed results
print(f"Text: {result['text']}")
print(f"Confidence: {result['confidence']}")
print(f"Method: {result['method']}")
```
### Batch Processing

```python
from PIL import Image

# Assumes `model` is already loaded as in Quick Start
# Load multiple images
images = [Image.open(f"doc_{i}.jpg") for i in range(5)]

# Process images one at a time
results = []
for image in images:
    result = model.generate_ocr_text(image)
    results.append(result)

# Print results
for i, result in enumerate(results):
    print(f"Document {i+1}: {result['text'][:50]}...")
```
## Performance
- Accuracy: High accuracy on document OCR tasks
- Speed: ~1-3 seconds per image (depending on hardware)
- Memory: ~6GB GPU memory recommended
- Languages: Supports 10+ major languages
## Training

This model was built using:
- Base Model: Qwen2-VL-2B-Instruct (frozen)
- Custom Heads: Trained OCR-specific layers
- Architecture: Vision-language transformer with custom components
- Optimization: Multiple API fallbacks for robustness
## Limitations
- Performance depends on image quality and text clarity
- Best results with printed text; handwriting accuracy may vary
- Requires sufficient GPU memory for optimal performance
- Some complex layouts may need preprocessing (see the sketch below)
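For cluttered scans, light image cleanup with Pillow before calling the model often helps. The sketch below is a generic suggestion, not part of this model's pipeline; the function name, filename, and the 1536-pixel size cap are placeholders.

```python
from PIL import Image, ImageOps

def preprocess_for_ocr(path: str, max_side: int = 1536) -> Image.Image:
    """Generic cleanup before OCR: grayscale, contrast stretch, downscale."""
    image = Image.open(path).convert("L")  # grayscale
    image = ImageOps.autocontrast(image)   # stretch contrast across the range
    image.thumbnail((max_side, max_side))  # cap the longest side in place
    return image.convert("RGB")            # back to RGB for the vision encoder

# Assumes `model` is loaded as in Quick Start
image = preprocess_for_ocr("complex_layout.jpg")
result = model.generate_ocr_text(image)
```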
## Use Cases
- Document Digitization: Convert scanned documents to text
- Invoice Processing: Extract data from invoices and receipts
- Form Processing: Digitize forms and applications
- Multi-language Documents: Process documents in various languages
- Batch Processing: Handle large volumes of documents
## Technical Details

### Model Architecture
- Vision Encoder: Based on Vision Transformer (ViT)
- Language Decoder: Qwen2-2B language model
- Custom Heads: OCR-specific text recognition and confidence estimation
- Integration: Multiple API approaches for robustness
### Inference Pipeline
- Image preprocessing and normalization
- Vision feature extraction using Qwen's ViT encoder
- Text generation using the language model
- Confidence estimation and post-processing
- Multiple fallback methods for reliability
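The repository's remote code implements these steps end to end. As an illustration, here is a minimal sketch of the extraction and confidence steps using the base Qwen2-VL API directly; the prompt, filename, and the confidence formula (mean top-token probability over the generated sequence) are assumptions, not necessarily how this model's confidence head works.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the base model (not the custom heads) for illustration
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

image = Image.open("document.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract all text from this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate and keep per-step scores for confidence estimation
out = model.generate(
    **inputs, max_new_tokens=512,
    output_scores=True, return_dict_in_generate=True,
)
new_tokens = out.sequences[:, inputs.input_ids.shape[1]:]
text = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

# Confidence: mean probability of each chosen token (an assumed heuristic)
probs = [torch.softmax(s, dim=-1).max().item() for s in out.scores]
confidence = sum(probs) / len(probs)
print(f"Text: {text}\nConfidence: {confidence:.3f}")
```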
## Installation

```bash
pip install transformers torch pillow
```

For GPU support:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
## Citation

```bibtex
@software{custom_ocr_qwen,
  title={Custom OCR Model based on Qwen2-VL},
  author={BabaK07},
  year={2024},
  url={https://huggingface.co/BabaK07/textract-ai}
}
```
## License

This model is released under the Apache 2.0 license, following the license of the base Qwen2-VL model.
## Acknowledgments

- Built on top of Qwen2-VL-2B-Instruct
- Thanks to the Qwen team for the excellent base model
- Custom architecture and training by BabaK07
## Contact
For questions or issues, please open an issue on the model repository or contact the author.