---
library_name: peft
license: mit
base_model: Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- SinaLab/ImageEval2025Task2TrainDataset
tags:
- arabic
- image-captioning
- vision-language
- lora
- qwen2.5-vl
- cultural-heritage
language:
- ar
model-index:
- name: arabic-image-captioning-qwen2.5vl
  results: []
---

# Arabic Image Captioning - Qwen2.5-VL Fine-tuned

This model is a LoRA fine-tuned version of [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) for generating Arabic captions for images.

## Model Description

This model was developed as part of the [Arabic Image Captioning Shared Task 2025](https://sina.birzeit.edu/image_eval2025/index.html). It generates natural Arabic captions for images, with a focus on historical and cultural content related to Palestinian heritage.

Please refer to the [training dataset](https://huggingface.co/datasets/SinaLab/ImageEval2025Task2TrainDataset) for more details.
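For a quick look at the data, the dataset can be loaded with the `datasets` library. The snippet below is a minimal sketch; the column names (e.g. `image`, `caption`) are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Load the shared-task training split from the Hub
ds = load_dataset("SinaLab/ImageEval2025Task2TrainDataset", split="train")

print(ds)              # inspect the dataset features
example = ds[0]
print(example.keys())  # column names such as "image"/"caption" are assumptions
```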

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch
from PIL import Image

# Load base model and processor (Qwen2.5-VL uses its own model class,
# not Qwen2VLForConditionalGeneration)
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "your-username/arabic-image-captioning-qwen2.5vl")

# Build the chat-formatted prompt expected by the instruct model
image = Image.open("your_image.jpg")
prompt = "اكتب وصفاً مختصراً لهذه الصورة باللغة العربية"  # "Write a brief description of this image in Arabic"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, not the prompt
caption = processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(caption)
```
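For deployment, the adapter can optionally be folded into the base weights so inference no longer goes through the PEFT wrapper. This uses PEFT's standard `merge_and_unload`; it is a sketch of an optional step, not something this model requires (the output directory name is a placeholder).

```python
# Optionally merge the LoRA weights into the base model for faster inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("arabic-captioning-merged")
processor.save_pretrained("arabic-captioning-merged")
```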

## Training Details

### Dataset
- **Training data**: Arabic image-caption pairs from the ImageEval 2025 shared task
- **Languages**: Arabic (ar)
- **Dataset size**: ~2,700 training images with Arabic captions

### Training Procedure
- **Fine-tuning method**: LoRA (Low-Rank Adaptation); a configuration sketch follows this list
- **Training epochs**: 15
- **Learning rate**: 2e-05
- **Batch size**: 1 with gradient accumulation (effective batch size: 16)
- **Optimizer**: AdamW with cosine learning rate scheduling
- **Hardware**: NVIDIA A100 GPU
- **Training time**: ~6 hours
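As a rough illustration of how these hyperparameters map onto PEFT and the Hugging Face `Trainer`, the sketch below reproduces the settings listed above. The LoRA rank, alpha, dropout, and target modules are not reported in this card, so the values shown are placeholders, not the ones actually used.

```python
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

# LoRA hyperparameters: r/alpha/dropout/target_modules are illustrative placeholders
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)

# Settings reported in this card: lr 2e-5, 15 epochs, batch size 1 with
# 16-step gradient accumulation, AdamW with cosine scheduling
training_args = TrainingArguments(
    output_dir="arabic-image-captioning-qwen2.5vl",
    num_train_epochs=15,
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    bf16=True,
)
```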

### Framework Versions
- PEFT 0.15.2
- Transformers 4.49.0
- PyTorch 2.4.1+cu121



## Contact

For questions or support:
- [email protected]
- [email protected]  
- [email protected]