Great improvement!

by seedmanc - opened 5 days ago

This is cool, I got an almost 50% increase in accuracy for SNS likes prediction using this model over vanilla CLIP-L/14@336px. I'm gonna try the gigantic version and see how it fares too. The base one couldn't beat the old CLIP though (but then I was comparing it to the Large CLIP version, so there's that).

rwightman

PyTorch Image Models org 5 days ago

edited 5 days ago

@seedmanc yeah PE encoders are strong... for other good, recent ViT image encoders, also look at https://huggingface.co/timm/vit_large_patch14_reg1_tipsv2.webli (and the related models I just uploaded; they need the main branch of timm), and the DINOv3 models (https://huggingface.co/collections/timm/timm-dinov3)

seedmanc

3 days ago

I tried the Gigantic version, it gave only a 10% improvement over Large.
@rwightman I tried DINO before, it wasn't a fit for my task, even v3 produced worse metrics than vanilla CLIP. Now I tried the reg1_tipsv2 you suggested, but it gives me a major drop in metrics compared to the others. I dunno if it's a special kind of model that requires a particular config, I just swapped my embeddings for the ones it produces, the same way I do with the others. Well, for now I'll stick to PE Core Gigantic.

I don't suppose timm supports Jina CLIP? Thought I'd try out one more model before settling.

rwightman

PyTorch Image Models org 3 days ago

@seedmanc Yeah, for different downstream tasks you can get quite a bit of variation depending on the upstream datasets and training methodology. There's also EUPE, which I just pushed up: https://huggingface.co/timm/vit_base_patch16_dinov3_qkvb.eupe_lvd1689m ... I assume you're aware, but all of the models mentioned have different input normalizations, so make sure you're either using the timm helpers that read that metadata or setting it appropriately per model family.
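
For anyone following along, the normalization point can be sketched roughly like this. It is a minimal sketch, not anyone's actual pipeline: `normalize_pixel` and `build_embedder` are hypothetical helper names made up for illustration, and the two presets are just the commonly used ImageNet and OpenAI CLIP statistics, not a claim about which model in this thread uses which. The timm calls (`create_model`, `resolve_model_data_config`, `create_transform`) are the usual helpers that read the preprocessing metadata from the model's pretrained config.

```python
from typing import Sequence, Tuple

# Two widely used normalization presets (RGB, pixel values scaled to [0, 1]).
# Illustrative only: check each model's own config rather than assuming one.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
OPENAI_CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
OPENAI_CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(
    rgb: Sequence[float], mean: Sequence[float], std: Sequence[float]
) -> Tuple[float, ...]:
    """Apply per-channel (x - mean) / std to a single RGB pixel."""
    return tuple((c - m) / s for c, m, s in zip(rgb, mean, std))


def build_embedder(model_name: str):
    """Create a timm model and the eval transform matching its pretrained config.

    Hypothetical helper: downloads pretrained weights when called.
    """
    import timm  # lazy import so normalize_pixel works without timm installed

    # num_classes=0 drops the classifier head so the model returns pooled features.
    model = timm.create_model(model_name, pretrained=True, num_classes=0).eval()
    # Mean, std, input size, interpolation, etc. come from the model's metadata.
    data_config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**data_config, is_training=False)
    return model, transform, data_config
```

The same white pixel lands on noticeably different values under the two presets, which is why feeding one model's preprocessing into another can quietly hurt downstream metrics. Building the transform per model with something like `build_embedder` and printing `data_config` for each one is a cheap way to confirm the mean and std really differ before comparing embeddings.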
