metadata
language:
  - en
  - zh
  - es
  - fr
  - de
  - ja
  - ko
  - ar
  - hi
  - ru
license: apache-2.0
tags:
  - ocr
  - vision-language
  - qwen2-vl
  - custom-model
  - text-extraction
  - document-ai
library_name: transformers
pipeline_tag: image-to-text
base_model: Qwen/Qwen2-VL-2B-Instruct
datasets:
  - custom
metrics:
  - accuracy
  - bleu
widget:
  - src: >-
      https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg
    example_title: Document OCR

textract-ai

A custom OCR (Optical Character Recognition) model built on top of Qwen2-VL-2B-Instruct, designed for accurate text extraction from images and documents.

Model Description

This model combines the vision-language capabilities of Qwen2-VL with custom OCR-specific heads to provide:

  • High-accuracy text extraction from images and documents
  • Multi-language support for 10+ languages
  • Robust architecture with fallback mechanisms
  • Production-ready inference capabilities
  • Custom OCR heads trained for text recognition tasks

Architecture

Custom OCR Model
├── Qwen2-VL-2B (Frozen Backbone)
│   ├── Vision Encoder (ViT-based)
│   └── Language Model (Qwen2-2B)
├── Custom OCR Heads
│   ├── Text Recognition Head
│   └── Confidence Estimation Head
└── Multi-API Processing Pipeline

Model Details

  • Base Model: Qwen/Qwen2-VL-2B-Instruct
  • Model Size: ~2.5B parameters
  • Architecture: Vision-Language Transformer with custom OCR heads
  • Languages: English, Chinese, Spanish, French, German, Japanese, Korean, Arabic, Hindi, Russian
  • Input: Images (JPEG, PNG, PDF, TIFF)
  • Output: Extracted text with confidence scores

Usage

Quick Start

from transformers import AutoModel, AutoProcessor
from PIL import Image

# Load model and processor
model = AutoModel.from_pretrained("BabaK07/textract-ai", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("BabaK07/textract-ai")

# Load image
image = Image.open("document.jpg")

# Extract text
result = model.generate_ocr_text(image, use_native=True)
print(f"Extracted text: {result['text']}")
print(f"Confidence: {result['confidence']:.3f}")

Advanced Usage

from transformers import AutoModel
from PIL import Image

# Load model
model = AutoModel.from_pretrained("BabaK07/textract-ai", trust_remote_code=True)

# Process image
image = Image.open("invoice.jpg")

# Extract text with custom parameters
result = model.generate_ocr_text(
    image=image,
    use_native=True  # Use Qwen's native OCR capabilities
)

# Access detailed results
print(f"Text: {result['text']}")
print(f"Confidence: {result['confidence']}")
print(f"Method: {result['method']}")
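The result dictionary above exposes `text`, `confidence`, and `method`. One way those fields might be used downstream is to route low-confidence results to manual review. The sketch below is illustrative only: `split_by_confidence` is a hypothetical helper, the 0.8 threshold is arbitrary, and it assumes `confidence` is a float between 0 and 1, which this model card does not state explicitly.

```python
from typing import Dict, List, Tuple


def split_by_confidence(
    results: List[Dict], threshold: float = 0.8
) -> Tuple[List[Dict], List[Dict]]:
    """Split OCR results into (accepted, needs_review) by confidence."""
    accepted, needs_review = [], []
    for result in results:
        # A missing confidence is treated as 0.0 so it always gets reviewed.
        if float(result.get("confidence", 0.0)) >= threshold:
            accepted.append(result)
        else:
            needs_review.append(result)
    return accepted, needs_review


# Hand-written placeholder results using the same keys as the output above;
# the "method" values are made up for illustration.
sample = [
    {"text": "Invoice #1", "confidence": 0.95, "method": "native"},
    {"text": "T0tal: ???", "confidence": 0.41, "method": "native"},
]
accepted, needs_review = split_by_confidence(sample)
print(len(accepted), len(needs_review))
```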

Batch Processing

from PIL import Image

# Assumes `model` has already been loaded as shown in Quick Start

# Load multiple images
images = [Image.open(f"doc_{i}.jpg") for i in range(5)]

# Process batch
results = []
for image in images:
    result = model.generate_ocr_text(image)
    results.append(result)

# Print results
for i, result in enumerate(results):
    print(f"Document {i+1}: {result['text'][:50]}...")
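The loop above stops at the first image that raises an exception. For larger batches it can be useful to record failures and keep going. A minimal sketch, assuming the OCR call is passed in as a plain callable; `ocr_many` and the extra `ok`, `error`, and `index` keys are hypothetical and not part of the model's API:

```python
from typing import Any, Callable, Dict, Iterable, List


def ocr_many(ocr_fn: Callable[[Any], Dict], items: Iterable[Any]) -> List[Dict]:
    """Run ocr_fn on every item, recording failures instead of raising."""
    results = []
    for index, item in enumerate(items):
        try:
            result = dict(ocr_fn(item))
            result["ok"] = True
        except Exception as exc:  # keep going if one document fails
            result = {"text": "", "confidence": 0.0, "ok": False, "error": str(exc)}
        result["index"] = index
        results.append(result)
    return results
```

With the model loaded, this could be called as `ocr_many(model.generate_ocr_text, images)`.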

Performance

  • Accuracy: High accuracy on document OCR tasks
  • Speed: ~1-3 seconds per image (depending on hardware)
  • Memory: ~6 GB GPU memory recommended
  • Languages: Supports 10+ major languages
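As a rough planning aid, the quoted 1-3 seconds per image can be turned into a wall-clock estimate for a batch. The sketch assumes strictly sequential processing on a single device and ignores model loading time; `batch_time_hours` is a hypothetical helper.

```python
def batch_time_hours(num_images: int, seconds_per_image: float) -> float:
    """Sequential wall-clock time in hours for num_images."""
    return num_images * seconds_per_image / 3600


low = batch_time_hours(10_000, 1.0)
high = batch_time_hours(10_000, 3.0)
print(f"10,000 images: {low:.1f} to {high:.1f} hours")
# prints "10,000 images: 2.8 to 8.3 hours"
```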

Training

This model was built using:

  • Base Model: Qwen2-VL-2B-Instruct (frozen)
  • Custom Heads: Trained OCR-specific layers
  • Architecture: Vision-language transformer with custom components
  • Optimization: Multiple API fallbacks for robustness

Limitations

  • Performance depends on image quality and text clarity
  • Best results with printed text; handwriting accuracy may vary
  • Requires sufficient GPU memory for optimal performance
  • Some complex layouts may need preprocessing
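One simple form of preprocessing is downscaling very large scans before inference, which can also help with the GPU memory constraint. The sketch below only computes the target size. `fit_within` and the 2048-pixel cap are assumptions for illustration, not values recommended by the model author.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 2048) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


# With Pillow this could be applied as:
#   image = image.convert("RGB").resize(fit_within(*image.size))
```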

Use Cases

  • Document Digitization: Convert scanned documents to text
  • Invoice Processing: Extract data from invoices and receipts
  • Form Processing: Digitize forms and applications
  • Multi-language Documents: Process documents in various languages
  • Batch Processing: Handle large volumes of documents

Technical Details

Model Architecture

  • Vision Encoder: Based on Vision Transformer (ViT)
  • Language Decoder: Qwen2-2B language model
  • Custom Heads: OCR-specific text recognition and confidence estimation
  • Integration: Multiple API approaches for robustness
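This card does not describe how the confidence estimation head computes its score. To show what a sequence-level confidence can look like, here is a common heuristic that is separate from the model's own head: the geometric mean of per-token probabilities, computed from token log-probabilities. `sequence_confidence` is a hypothetical name, and the inputs are assumed to be natural-log probabilities (values at or below 0).

```python
import math
from typing import Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities, in [0, 1]."""
    if not token_logprobs:
        return 0.0
    # Averaging log-probabilities and exponentiating gives the geometric mean.
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_logprob)
```

The geometric mean is used rather than the raw product so that longer outputs are not penalized simply for having more tokens.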

Inference Pipeline

  1. Image preprocessing and normalization
  2. Vision feature extraction using Qwen's ViT encoder
  3. Text generation using language model
  4. Confidence estimation and post-processing
  5. Multiple fallback methods for reliability
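The fallback step in the list above can be sketched as a chain of extraction functions tried in order. This is a generic pattern and not the model's actual implementation; `ocr_with_fallbacks`, the method names, and the `errors` key are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple


def ocr_with_fallbacks(
    image: Any, methods: List[Tuple[str, Callable[[Any], str]]]
) -> Dict:
    """Try each (name, function) pair in order; return the first non-empty text."""
    errors = {}
    for name, extract in methods:
        try:
            text = extract(image)
        except Exception as exc:
            errors[name] = str(exc)  # remember why this method failed
            continue
        if text and text.strip():
            return {"text": text, "method": name, "errors": errors}
        errors[name] = "empty output"
    # Every method failed or returned nothing.
    return {"text": "", "method": None, "errors": errors}
```

Returning the name of the method that succeeded mirrors the `method` field in the result dictionary shown under Advanced Usage.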

Installation

pip install transformers torch pillow

For GPU support:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
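After installing, a quick check that the three packages are importable can save a confusing error later. The sketch uses only the standard library; `check_dependencies` is a hypothetical helper. Note that the `pillow` package is imported as `PIL`.

```python
import importlib.util


def check_dependencies(modules=("transformers", "torch", "PIL")):
    """Return {module_name: True/False} for whether each module is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


for name, found in check_dependencies().items():
    print(f"{name}: {'ok' if found else 'missing'}")
```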

Citation

@software{custom_ocr_qwen,
  title={Custom OCR Model based on Qwen2.5-VL},
  author={BabaK07},
  year={2024},
  url={https://huggingface.co/BabaK07/textract-ai}
}

License

This model is released under the Apache 2.0 license, following the license of the base Qwen2-VL-2B-Instruct model.

Acknowledgments

  • Built on top of Qwen2-VL-2B-Instruct
  • Thanks to the Qwen team for the excellent base model
  • Custom architecture and training by BabaK07

Contact

For questions or issues, please open an issue on the model repository or contact the author.