izhx committed (verified)
Commit 54e1aba · Parent(s): 51f2ba5

Update README.md

Files changed (1):
  1. README.md +61 -24

README.md CHANGED
@@ -3691,57 +3691,94 @@ The `GME` models support three types of input: **text**, **image**, and **image-
  |[`gme-Qwen2-VL-2B`](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct) | 2.21B | 32768 | 1536 | 65.27 | 68.41 | 64.45 |
  |[`gme-Qwen2-VL-7B`](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-7B-Instruct) | 8.29B | 32768 | 3584 | 67.48 | 71.36 | 67.44 |

  ## Usage
- **Use with custom code**

- ```python
- # You can find the script gme_inference.py in https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct/blob/main/gme_inference.py
- from gme_inference import GmeQwen2VL

- model = GmeQwen2VL('Alibaba-NLP/gme-Qwen2-VL-7B-Instruct')

  texts = [
-     "What kind of car is this?",
-     "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023."
  ]
  images = [
-     'https://en.wikipedia.org/wiki/File:Tesla_Cybertruck_damaged_window.jpg',
-     'https://en.wikipedia.org/wiki/File:2024_Tesla_Cybertruck_Foundation_Series,_front_left_(Greenwich).jpg',
  ]

  # Single-modal embedding
  e_text = gme.get_text_embeddings(texts=texts)
  e_image = gme.get_image_embeddings(images=images)
- print((e_text * e_image).sum(-1))
- ## tensor([0.1702, 0.5278], dtype=torch.float16)

  # How to set embedding instruction
- e_query = gme.get_text_embeddings(texts=texts, instruction='Find an image that matches the given text.')
  # If is_query=False, we always use the default instruction.
  e_corpus = gme.get_image_embeddings(images=images, is_query=False)
- print((e_query * e_corpus).sum(-1))
- ## tensor([0.2000, 0.5752], dtype=torch.float16)

  # Fused-modal embedding
  e_fused = gme.get_fused_embeddings(texts=texts, images=images)
- print((e_fused[0] * e_fused[1]).sum())
- ## tensor(0.6826, dtype=torch.float16)
-
  ```

- <!-- <details>
- <summary>With transformers</summary>

  ```python
- # Requires transformers>=4.46.2

- TODO

- # [[0.3016996383666992, 0.7503870129585266, 0.3203084468841553]]
  ```

- </details>
- -->

  ## Evaluation

  |[`gme-Qwen2-VL-2B`](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct) | 2.21B | 32768 | 1536 | 65.27 | 68.41 | 64.45 |
  |[`gme-Qwen2-VL-7B`](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-7B-Instruct) | 8.29B | 32768 | 3584 | 67.48 | 71.36 | 67.44 |

  ## Usage

+ **Transformers**

+ ```python
+ from transformers import AutoModel

+ t2i_prompt = 'Find an image that matches the given text.'
  texts = [
+     "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
+     "Alibaba office.",
  ]
  images = [
+     'https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg',
+     'https://upload.wikimedia.org/wikipedia/commons/e/e0/TaobaoCity_Alibaba_Xixi_Park.jpg',
  ]

+ gme = AutoModel.from_pretrained(
+     "Alibaba-NLP/gme-Qwen2-VL-7B-Instruct",
+     torch_dtype="float16", device_map='cuda', trust_remote_code=True
+ )

  # Single-modal embedding
  e_text = gme.get_text_embeddings(texts=texts)
  e_image = gme.get_image_embeddings(images=images)
+ print('Single-modal', (e_text @ e_image.T).tolist())
+ ## Single-modal [[0.279296875, 0.0002658367156982422], [0.06427001953125, 0.304443359375]]

  # How to set embedding instruction
+ e_query = gme.get_text_embeddings(texts=texts, instruction=t2i_prompt)
  # If is_query=False, we always use the default instruction.
  e_corpus = gme.get_image_embeddings(images=images, is_query=False)
+ print('Single-modal with instruction', (e_query @ e_corpus.T).tolist())
+ ## Single-modal with instruction [[0.32861328125, 0.026336669921875], [0.09466552734375, 0.3134765625]]

  # Fused-modal embedding
  e_fused = gme.get_fused_embeddings(texts=texts, images=images)
+ print('Fused-modal', (e_fused @ e_fused.T).tolist())
+ ## Fused-modal [[1.0, 0.0308685302734375], [0.0308685302734375, 1.0]]
  ```
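
The dot-product scores above can be turned into simple retrieval results with plain `torch` operations. Below is a minimal sketch (not part of the model card's code) that assumes the `e_query`, `e_corpus`, `texts`, and `images` variables from the preceding block:

```python
import torch

# One row of scores per text query, one column per candidate image.
scores = e_query @ e_corpus.T        # shape: (num_texts, num_images)

# Pick the best-matching image for each text query.
best = torch.argmax(scores, dim=-1)
for text, idx in zip(texts, best.tolist()):
    print(f"{text[:40]!r} -> {images[idx]}")
```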

+ **sentence_transformers**

+ The `encode` function accepts `str` inputs, or `dict` inputs with key(s) in `{'text', 'image', 'prompt'}`.

+ **Do not pass `prompt` as an argument to `encode`**; instead, pass the input as a `dict` with a `prompt` key.

  ```python
+ from sentence_transformers import SentenceTransformer

+ t2i_prompt = 'Find an image that matches the given text.'
+ texts = [
+     "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
+     "Alibaba office.",
+ ]
+ images = [
+     'https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg',
+     'https://upload.wikimedia.org/wikipedia/commons/e/e0/TaobaoCity_Alibaba_Xixi_Park.jpg',
+ ]

+ gme_st = SentenceTransformer("Alibaba-NLP/gme-Qwen2-VL-7B-Instruct")

+ # Single-modal embedding
+ e_text = gme_st.encode(texts, convert_to_tensor=True)
+ e_image = gme_st.encode([dict(image=i) for i in images], convert_to_tensor=True)
+ print('Single-modal', (e_text @ e_image.T).tolist())
+ ## Single-modal [[0.27880859375, 0.0005745887756347656], [0.06500244140625, 0.306640625]]

+ # How to set embedding instruction
+ e_query = gme_st.encode([dict(text=t, prompt=t2i_prompt) for t in texts], convert_to_tensor=True)
+ # If no prompt is given, we always use the default instruction.
+ e_corpus = gme_st.encode([dict(image=i) for i in images], convert_to_tensor=True)
+ print('Single-modal with instruction', (e_query @ e_corpus.T).tolist())
+ ## Single-modal with instruction [[0.328369140625, 0.0269927978515625], [0.09521484375, 0.316162109375]]

+ # Fused-modal embedding
+ e_fused = gme_st.encode([dict(text=t, image=i) for t, i in zip(texts, images)], convert_to_tensor=True)
+ print('Fused-modal', (e_fused @ e_fused.T).tolist())
+ ## Fused-modal [[0.99951171875, 0.0311737060546875], [0.0311737060546875, 1.0009765625]]
  ```
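
The same scores can also rank a small image corpus for each text query. This is a minimal sketch (again, not part of the model card's code) that assumes the `e_query`, `e_corpus`, `texts`, and `images` variables from the block above:

```python
import torch

# Similarity of every text query against every candidate image.
scores = e_query @ e_corpus.T        # shape: (num_texts, num_images)

# Rank the candidate images for each query, best match first.
topk = torch.topk(scores, k=scores.size(-1), dim=-1)
for text, idxs, vals in zip(texts, topk.indices.tolist(), topk.values.tolist()):
    print(text)
    for i, v in zip(idxs, vals):
        print(f"  {v:.3f}  {images[i]}")
```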

  ## Evaluation