🤖 Nayana VQA - Advanced Kannada Visual Question Answering Model

Developed by: CognitiveLab
License: Apache 2.0
Base Model: unsloth/gemma-3n-E4B-it
Architecture: Gemma 3n (4B parameters)

🌟 Model Overview

Nayana VQA is a vision-language model fine-tuned for Visual Question Answering (VQA) and Document Visual Question Answering (Document VQA). Built on the Gemma 3n architecture, it answers questions about images and documents, with a primary focus on Kannada.

🌍 Supported Languages

  • Kannada (kn) - Primary focus language

More languages coming soon! We are actively working on expanding support to 20 additional languages.

🎯 Key Features

  • Visual Question Answering: Accurate question answering from images in Kannada
  • Document Understanding: Advanced comprehension of document layouts and content
  • Multimodal Reasoning: Combines visual and textual understanding for complex queries
  • Fast Inference: Optimized for real-time applications
  • High Accuracy: Fine-tuned on diverse VQA datasets
  • Easy Integration: Compatible with Transformers and Modal deployment

📋 Model Specifications

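| Parameter | Value |
|---|---|
| Model Size | 4B effective parameters (E4B base; the Safetensors checkpoint stores 8.39B parameters) |
| Context Length | 32K tokens |
| Image Resolution | Flexible (optimized for documents and general images) |
| Precision | BFloat16 |
| Framework | Transformers + Unsloth |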
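
As a rough guide to hardware sizing, the weight footprint follows from the parameter count and the precision. The sketch below is only an estimate under stated assumptions: it takes about 8.39B stored parameters (the figure the Hub reports for this checkpoint) at 2 bytes each for BFloat16, and it ignores activations, the KV cache, and framework overhead. The helper name `weight_memory_gib` is hypothetical and not part of any library.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / (1024 ** 3)


# Assumption: ~8.39e9 stored parameters, BFloat16 (2 bytes per parameter).
estimate = weight_memory_gib(8.39e9)
print(f"Approximate weight memory: {estimate:.1f} GiB")
```

Actual usage during generation will be higher, and grows with image count, prompt length, and `max_new_tokens`.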

🚀 Quick Start

Installation

pip install transformers torch pillow timm unsloth

Basic Usage

from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import torch

# Load model and processor
model_id = "Nayana-cognitivelab/NayanaVQA"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)

# System prompt
system_prompt = "You are Nayana, an advanced AI assistant developed by CognitiveLab. You specialize in vision-based tasks, particularly Visual Question Answering (VQA) and Document Visual Question Answering (Document VQA). You are highly accurate, fast, and reliable when working with visual content. You can understand and respond to questions about images in Kannada with high precision."

# Load and process image
image = Image.open("your_image.jpg")
user_question = "ಈ ಚಿತ್ರದಲ್ಲಿ ಏನಿದೆ?"  # "What is in this image?" in Kannada

# Prepare messages
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": system_prompt}]
    },
    {
        "role": "user", 
        "content": [
            {"type": "text", "text": user_question},
            {"type": "image", "image": image}
        ]
    }
]

# Apply chat template and move the inputs to the model's device
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate response
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=1.0,
        top_p=0.95,
        top_k=64,
        do_sample=True
    )

# Decode response
response = processor.tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], 
    skip_special_tokens=True
)
print(response)
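
When asking several questions, it can help to wrap the message construction and the prompt-trimming step from the example above into small helpers. The following is a minimal sketch, not part of the model's API: `build_vqa_messages` and `strip_prompt_tokens` are hypothetical names, and the message layout simply mirrors the one used in Basic Usage.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_vqa_messages(
    question: str,
    image: Any,
    system_prompt: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Build a chat-template message list for one question about one image."""
    messages: List[Dict[str, Any]] = []
    if system_prompt:
        # The system turn is optional; it is omitted when no prompt is given.
        messages.append(
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
        )
    messages.append(
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image", "image": image},
            ],
        }
    )
    return messages


def strip_prompt_tokens(output_ids: Sequence[int], prompt_length: int) -> Sequence[int]:
    """Drop the echoed prompt tokens so that only newly generated tokens remain."""
    return output_ids[prompt_length:]
```

With these in place, the message list could be built as `build_vqa_messages(user_question, image, system_prompt)`, and the decode step could use `strip_prompt_tokens(outputs[0], inputs["input_ids"].shape[1])` before calling `processor.tokenizer.decode`.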

This model was trained 2x faster with Unsloth and Hugging Face's TRL library.

📜 Citation

@misc{nayana_vqa_2024,
  title={Nayana VQA: Advanced Kannada Visual Question Answering with Gemma 3n},
  author={CognitiveLab},
  year={2024},
  url={https://huggingface.co/Nayana-cognitivelab/NayanaVQA}
}