SinaLab
/

Qwen-2.5-VL-7B-Instruct-Image-Captioning

image-captioning

vision-language

cultural-heritage

Model card Files Files and versions

Qwen-2.5-VL-7B-Instruct-Image-Captioning / README.md

Alaa Aljabari

added training dataset link

0b7d988 3 months ago

|

history blame contribute delete

2.55 kB

	---
	library_name: peft
	license: mit
	base_model: Qwen/Qwen2.5-VL-7B-Instruct
	datasets:
	- SinaLab/ImageEval2025Task2TrainDataset
	tags:
	- arabic
	- image-captioning
	- vision-language
	- lora
	- qwen2.5-vl
	- cultural-heritage
	language:
	- ar
	model-index:
	- name: arabic-image-captioning-qwen2.5vl
	results: []
	---

	# Arabic Image Captioning - Qwen2.5-VL Fine-tuned

	This model is a LoRA fine-tuned version of [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) for generating Arabic captions for images.

	## Model Description

	This model was developed as part of the [Arabic Image Captioning Shared Task 2025](https://sina.birzeit.edu/image_eval2025/index.html). It generates natural Arabic captions for images with focus on historical and cultural content related to Palestinian heritage.

	please refer to the [training dataset](https://huggingface.co/datasets/SinaLab/ImageEval2025Task2TrainDataset) for more details.

	## Usage

	```python
	from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
	from peft import PeftModel
	import torch
	from PIL import Image

	# Load base model and processor
	base_model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
	processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

	# Load LoRA adapter
	model = PeftModel.from_pretrained(base_model, "your-username/arabic-image-captioning-qwen2.5vl")

	# Process image and generate caption
	image = Image.open("your_image.jpg")
	prompt = "اكتب وصفاً مختصراً لهذه الصورة باللغة العربية"

	inputs = processor(images=image, text=prompt, return_tensors="pt")
	with torch.no_grad():
	outputs = model.generate(**inputs, max_new_tokens=128)

	caption = processor.decode(outputs[0], skip_special_tokens=True)
	print(caption)
	```

	## Training Details

	### Dataset
	- Training data: Arabic image captions dataset from the shared task
	- Languages: Arabic (ar)
	- Dataset size: ~2,700 training images with Arabic captions

	### Training Procedure
	- Fine-tuning method: LoRA (Low-Rank Adaptation)
	- Training epochs: 15
	- Learning rate: 2e-05
	- Batch size: 1 with gradient accumulation (effective batch size: 16)
	- Optimizer: AdamW with cosine learning rate scheduling
	- Hardware: NVIDIA A100 GPU
	- Training time: ~6 hours

	### Framework Versions
	- PEFT 0.15.2
	- Transformers 4.49.0
	- PyTorch 2.4.1+cu121



	## Contact

	For questions or support:
	- [email protected]
	- [email protected]
	- [email protected]