An image captioning model fine-tuned from BLIP-base that describes images the way Yoda speaks, e.g.:

"Sitting in a car, a man is"

Try the web app here: https://yodacaptioner.up.railway.app/

Model Details

Model Description

An image-to-text model with 247M parameters, fine-tuned from BLIP-base using the transformers package.

  • Developed by: vkao8264
  • Model type: Image-to-text
  • Language(s) (NLP): English
  • License: bsd-3-clause
  • Finetuned from model: Salesforce/blip-image-captioning-base

Uses

from PIL import Image
import torch
from transformers import AutoProcessor, BlipForConditionalGeneration

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("vkao8264/blip-yoda-captioning")

# Run on GPU if available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

filepath = "path-to-your-image"
raw_image = Image.open(filepath).convert("RGB")

# Model and inputs must be on the same device
inputs = processor(raw_image, return_tensors="pt").to(device)
output_tokens = model.generate(**inputs)
caption = processor.decode(output_tokens[0], skip_special_tokens=True)
print(caption)
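
By default, generate uses greedy decoding. The standard transformers generation parameters can be passed through to trade caption quality against speed; the values below are illustrative, not settings the author specifies:

# Optional: beam search often yields more fluent captions than greedy decoding
output_tokens = model.generate(**inputs, num_beams=4, max_new_tokens=30)
caption = processor.decode(output_tokens[0], skip_special_tokens=True)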

Training Details

Training Data

The model was fine-tuned on 30,000 image-caption pairs from the COCO Captions dataset, specifically captions_train2014.

Before training, the original captions were rewritten into Yoda-style captions using Phi-3 with few-shot prompting, along the lines of the sketch below.
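
The card does not give the actual prompt, so the following is a minimal sketch of what this rewriting step could look like. The Phi-3 variant, the few-shot example pairs, and the prompt wording are all assumptions for illustration, not the author's script:

import torch
from transformers import pipeline

# Assumed model variant; the card only says "phi3"
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Made-up few-shot examples demonstrating the desired rewrite
few_shot = (
    "Rewrite each caption in Yoda's speaking style.\n"
    "Caption: A man is sitting in a car.\n"
    "Yoda: Sitting in a car, a man is.\n"
    "Caption: A dog runs across the beach.\n"
    "Yoda: Across the beach, a dog runs.\n"
)

def yodify(caption):
    prompt = few_shot + f"Caption: {caption}\nYoda:"
    out = generator(prompt, max_new_tokens=40, do_sample=False, return_full_text=False)
    # Keep only the first generated line, in case the model continues the pattern
    return out[0]["generated_text"].strip().splitlines()[0]

print(yodify("A woman is riding a bicycle down the street."))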

Training scripts can be found at https://github.com/vincent8264/yoda_captioning
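
For reference, fine-tuning BLIP for captioning with transformers boils down to passing an image and its Yoda-style caption through the model with labels set to the caption tokens. Below is a minimal single-step sketch; the learning rate, batching, and epoch count are assumptions, so see the repository for the actual training code:

import torch
from PIL import Image
from transformers import AutoProcessor, BlipForConditionalGeneration

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # hypothetical learning rate

# One training step on a single (image, yoda_caption) pair for illustration
image = Image.open("example.jpg").convert("RGB")
yoda_caption = "Sitting in a car, a man is"

inputs = processor(images=image, text=yoda_caption, return_tensors="pt")
# BLIP computes the captioning loss when labels are provided
outputs = model(**inputs, labels=inputs.input_ids)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()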
