Update README.md
README.md CHANGED

@@ -32,15 +32,23 @@ Truncate to 77 tokens
 tensor([[0.16484, 0.0749, 0.1618, 0.0774]], device='cuda:0') 📉
 ```
 # 👇
-# Option 2
+# Option 2, proper integration: 💖 RECOMMENDED 💖
 
-
-
+- ### Solution for implementation of 248 tokens / thanks [@kk3dmax](https://huggingface.co/zer0int/LongCLIP-GmP-ViT-L-14/discussions/3) 🤗
+- Obtain a full example script using this solution for Flux.1 inference on [my GitHub](https://github.com/zer0int/CLIP-txt2img-diffusers-scripts)
 
-
-
-
-
+```
+model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
+config = CLIPConfig.from_pretrained(model_id)
+config.text_config.max_position_embeddings = 248
+clip_model = CLIPModel.from_pretrained(model_id, torch_dtype=dtype, config=config)
+clip_processor = CLIPProcessor.from_pretrained(model_id, padding="max_length", max_length=248)
+
+pipe.tokenizer = clip_processor.tokenizer  # Replace with the CLIP tokenizer
+pipe.text_encoder = clip_model.text_model  # Replace with the CLIP text encoder
+pipe.tokenizer_max_length = 248
+pipe.text_encoder.dtype = torch.bfloat16
+```
 
 ```
 # Resulting Cosine Similarities for 248 tokens padded: