Instructions to use timm/vit_pe_core_large_patch14_336.fb with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- timm
How to use timm/vit_pe_core_large_patch14_336.fb with timm:
    import timm
    model = timm.create_model("hf_hub:timm/vit_pe_core_large_patch14_336.fb", pretrained=True)
- Transformers
How to use timm/vit_pe_core_large_patch14_336.fb with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline
    pipe = pipeline("image-feature-extraction", model="timm/vit_pe_core_large_patch14_336.fb")

    # Load model directly
    from transformers import AutoModel
    model = AutoModel.from_pretrained("timm/vit_pe_core_large_patch14_336.fb", dtype="auto")
- Notebooks
- Google Colab
- Kaggle
Great improvement!
This is cool: I got an almost 50% increase in accuracy for SNS likes prediction using this model over vanilla CLIP-L/14@336px. I'm gonna try the gigantic version and see how it fares too. The base one couldn't beat the old CLIP though (but then I was comparing it to the Large CLIP version, so there's that).
@seedmanc yeah, PE encoders are strong... For other good, recent ViT image encoders, also look at https://huggingface.co/timm/vit_large_patch14_reg1_tipsv2.webli (and the related models that I just uploaded; they need main-branch timm), and the DINOv3 models (https://huggingface.co/collections/timm/timm-dinov3)
I tried the Gigantic version; it gave only a 10% improvement over Large.
@rwightman I tried DINO before; it wasn't a fit for my task, and even v3 produced worse metrics than vanilla CLIP. Now I tried the reg1_tipsv2 you suggested, but it gives me a major metrics drop compared to the others. I dunno if it's a special kind of model that requires a particular config; I just swapped my embeddings for ones produced by it, the same way I do with the others. Well, for now I'll stick with PE Core Gigantic.
I don't suppose timm supports Jina CLIP? Thought I'd try out one more model before settling.
@seedmanc Yeah, for different downstream tasks you can get quite a bit of variation based on the upstream datasets and training methodology. There's also EUPE, which I just pushed up too: https://huggingface.co/timm/vit_base_patch16_dinov3_qkvb.eupe_lvd1689m ... I assume you're aware, but all of the models mentioned have different input normalizations, so make sure you're either using the timm helpers that read that metadata or setting it appropriately per model family.
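For anyone following along, here is a rough sketch of the normalization point above. `build_eval_transform` and `normalize` are hypothetical helper names made up for this post; the timm calls inside (`timm.create_model`, `timm.data.resolve_model_data_config`, `timm.data.create_transform`) are the usual way to get preprocessing from a model's own config. The mean/std constants at the bottom are illustrative only and are not claimed to belong to any particular model mentioned in this thread.

```python
def build_eval_transform(model_name: str, pretrained: bool = True):
    """Return (model, transform), with the transform built from the model's own config.

    Hypothetical helper: timm is imported lazily so the toy example below
    runs even where timm is not installed.
    """
    import timm

    model = timm.create_model(model_name, pretrained=pretrained).eval()
    # Reads mean, std, input size, interpolation, etc. from the model's pretrained config.
    data_config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**data_config, is_training=False)
    return model, transform


def normalize(pixel, mean, std):
    """Channel-wise (x - mean) / std for one RGB pixel already scaled to [0, 1]."""
    return tuple((p - m) / s for p, m, s in zip(pixel, mean, std))


# Illustrative constants only: ImageNet-style statistics vs. a plain 0.5/0.5 scheme.
IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
HALF_MEAN, HALF_STD = (0.5, 0.5, 0.5), (0.5, 0.5, 0.5)

# The same mid-gray pixel becomes a different model input under each scheme.
gray = (0.5, 0.5, 0.5)
as_imagenet = normalize(gray, IMAGENET_MEAN, IMAGENET_STD)
as_half = normalize(gray, HALF_MEAN, HALF_STD)

# Usage sketch (downloads weights, so left commented out):
# model, transform = build_eval_transform("hf_hub:timm/vit_pe_core_large_patch14_336.fb")
# embedding = model(transform(pil_image).unsqueeze(0))  # pil_image: a PIL.Image you supply
```

If the embeddings for one model family were produced with another family's mean/std, the model sees shifted inputs, which might explain a metrics drop after swapping encoders. Rebuilding the transform per model, as in the first helper, avoids that.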