|
--- |
|
library_name: peft |
|
license: mit |
|
base_model: Qwen/Qwen2.5-VL-7B-Instruct |
|
datasets: |
|
- SinaLab/ImageEval2025Task2TrainDataset |
|
tags: |
|
- arabic |
|
- image-captioning |
|
- vision-language |
|
- lora |
|
- qwen2.5-vl |
|
- cultural-heritage |
|
language: |
|
- ar |
|
model-index: |
|
- name: arabic-image-captioning-qwen2.5vl |
|
results: [] |
|
--- |
|
|
|
# Arabic Image Captioning - Qwen2.5-VL Fine-tuned |
|
|
|
This model is a LoRA fine-tuned version of [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) for generating Arabic captions for images. |
|
|
|
## Model Description |
|
|
|
This model was developed as part of the [Arabic Image Captioning Shared Task 2025](https://sina.birzeit.edu/image_eval2025/index.html). It generates natural Arabic captions for images with focus on historical and cultural content related to Palestinian heritage. |
|
|
|
please refer to the [training dataset](https://huggingface.co/datasets/SinaLab/ImageEval2025Task2TrainDataset) for more details. |
|
|
|
## Usage |
|
|
|
```python |
|
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor |
|
from peft import PeftModel |
|
import torch |
|
from PIL import Image |
|
|
|
# Load base model and processor |
|
base_model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct") |
|
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct") |
|
|
|
# Load LoRA adapter |
|
model = PeftModel.from_pretrained(base_model, "your-username/arabic-image-captioning-qwen2.5vl") |
|
|
|
# Process image and generate caption |
|
image = Image.open("your_image.jpg") |
|
prompt = "اكتب وصفاً مختصراً لهذه الصورة باللغة العربية" |
|
|
|
inputs = processor(images=image, text=prompt, return_tensors="pt") |
|
with torch.no_grad(): |
|
outputs = model.generate(**inputs, max_new_tokens=128) |
|
|
|
caption = processor.decode(outputs[0], skip_special_tokens=True) |
|
print(caption) |
|
``` |
|
|
|
## Training Details |
|
|
|
### Dataset |
|
- **Training data**: Arabic image captions dataset from the shared task |
|
- **Languages**: Arabic (ar) |
|
- **Dataset size**: ~2,700 training images with Arabic captions |
|
|
|
### Training Procedure |
|
- **Fine-tuning method**: LoRA (Low-Rank Adaptation) |
|
- **Training epochs**: 15 |
|
- **Learning rate**: 2e-05 |
|
- **Batch size**: 1 with gradient accumulation (effective batch size: 16) |
|
- **Optimizer**: AdamW with cosine learning rate scheduling |
|
- **Hardware**: NVIDIA A100 GPU |
|
- **Training time**: ~6 hours |
|
|
|
### Framework Versions |
|
- PEFT 0.15.2 |
|
- Transformers 4.49.0 |
|
- PyTorch 2.4.1+cu121 |
|
|
|
|
|
|
|
## Contact |
|
|
|
For questions or support: |
|
- [email protected] |
|
- [email protected] |
|
- [email protected] |
|
|