BLIP Image Captioning - English (Flickr30k)

This model is a fine-tuned version of Salesforce/blip-image-captioning-large, adapted for English image captioning on the Flickr30K dataset. Given an input image, it generates a relevant English caption describing the image content.

How to Get Started with the Model

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch
import matplotlib.pyplot as plt

# Load model and processor
processor = BlipProcessor.from_pretrained("omarsabri8756/blip-merged-lora-flickr-30k")
model = BlipForConditionalGeneration.from_pretrained("omarsabri8756/blip-merged-lora-flickr-30k")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Load an image from local path
image_path = "path/to/your/image.jpg"
image = Image.open(image_path).convert("RGB")

# Show image
plt.imshow(image)
plt.axis('off')
plt.title("Input Image")
plt.show()

# Generate English caption
model.eval()
with torch.no_grad():
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
    generated_output = model.generate(
        pixel_values=pixel_values,
        max_length=75,
        min_length=5,
        num_beams=5,
        repetition_penalty=1.5,
        length_penalty=1.0,
        no_repeat_ngram_size=3,
        early_stopping=True
    )
    caption = processor.batch_decode(generated_output, skip_special_tokens=True)[0]
    print(caption)  # Prints English caption
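
For a quicker start, the same checkpoint should also work with the Transformers image-to-text pipeline, which wraps the processor and model in one call. A minimal sketch, using default generation settings rather than the beam-search parameters shown above:

from transformers import pipeline

# The pipeline handles image preprocessing, generation, and decoding
captioner = pipeline("image-to-text", model="omarsabri8756/blip-merged-lora-flickr-30k")
result = captioner("path/to/your/image.jpg")
print(result[0]["generated_text"])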

🏋️ Training Details

📂 Dataset

  • Name: Flickr30K
  • Description: 31,783 images, each paired with 5 English captions.
  • Preprocessing: Images resized to 384×384; text lowercased and tokenized (a preprocessing sketch follows this list).
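
A minimal sketch of that preprocessing as performed by the processor. The 384×384 resize and normalization come from the processor's image settings; the example caption text is hypothetical:

from transformers import BlipProcessor
from PIL import Image

processor = BlipProcessor.from_pretrained("omarsabri8756/blip-merged-lora-flickr-30k")
image = Image.open("path/to/your/image.jpg").convert("RGB")

# Hypothetical caption; BLIP's BERT tokenizer is uncased, so lowercasing matches it
caption = "two dogs run across a grassy field"
inputs = processor(images=image, text=caption, return_tensors="pt")
print(inputs.pixel_values.shape)  # torch.Size([1, 3, 384, 384])
print(inputs.input_ids)           # token ids of the lowercased caption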

⚙️ Hyperparameters

  • Optimizer: AdamW
  • Learning Rate: 5e-5
  • Batch Size: 16
  • Precision: FP16 mixed precision
  • Epochs: 5
  • LR Scheduler: Cosine with warmup
  • Weight Decay: 0.01
  • LoRA Rank: 32
  • LoRA Alpha: 64
  • LoRA Dropout: 0.01 (these LoRA settings are sketched with PEFT below)
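
The checkpoint name (blip-merged-lora-flickr-30k) and the hyperparameters above suggest LoRA fine-tuning with the adapters merged back into the base weights afterwards. A minimal PEFT sketch using the rank/alpha/dropout listed; the target_modules choice is an assumption, since the card does not state which projections were adapted:

from peft import LoraConfig, get_peft_model
from transformers import BlipForConditionalGeneration

base = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

lora_config = LoraConfig(
    r=32,                # LoRA rank
    lora_alpha=64,       # LoRA alpha
    lora_dropout=0.01,   # LoRA dropout
    target_modules=["query", "value"],  # assumption: attention projections to adapt
    bias="none",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# After training, the adapters can be merged into the base model,
# producing a standalone checkpoint like this one:
merged = model.merge_and_unload()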

📊 Evaluation Results

Metric   Score
BLEU-1   75
BLEU-2   55
BLEU-3   41
BLEU-4   30
ROUGE-1  57
ROUGE-2  34
METEOR   54
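
A minimal sketch of how such scores can be computed with the Hugging Face evaluate library. The prediction and reference captions here are hypothetical; Flickr30k provides 5 references per image:

import evaluate

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

# Hypothetical model output and reference captions for one image
predictions = ["a man in a red shirt climbs a rock wall"]
references = [[
    "a man climbs a rock wall",
    "a climber in a red shirt scales a cliff",
]]

print(bleu.compute(predictions=predictions, references=references, max_order=4))  # BLEU-4
print(meteor.compute(predictions=predictions, references=references))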

Evaluation

Testing Data

The model was evaluated on the Flickr30k test split, which contains 1,000 images with 5 reference captions each.

Results

The model performs well on everyday scenes and common activities, generating grammatically correct and contextually appropriate English captions.
Performance may be slightly lower for highly specific or rare visual concepts.
