---
license: mit
datasets:
- Tevatron/bge-ir
- Tevatron/wiki-ss-nq-new
- Tevatron/pixmo-docs
- Tevatron/colpali
- Tevatron/msrvtt
- Tevatron/audiocaps
- Tevatron/multivent
base_model:
- Tevatron/OmniEmbed-v0.1
pipeline_tag: visual-document-retrieval
library_name: peft
---

# Tevatron/OmniEmbed-v0.1

**OmniEmbed** is a powerful multi-modal embedding model built on [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) using our [Tevatron](https://github.com/texttron/tevatron/) toolkit, a unified toolkit across scale, language, and modality for document retrieval.
OmniEmbed generates unified embeddings across multilingual text, images, audio, and video, enabling effective cross-modal retrieval for diverse applications. [Paper](https://arxiv.org/pdf/2505.02466v1).

**OmniEmbed-multivent** is further finetuned from OmniEmbed for video retrieval, enabling joint input of video, audio, and text and enhancing performance on such combined inputs.

OmniEmbed-multivent achieves state-of-the-art performance on the MAGMaR 2025 shared task over MultiVENT 2.0, a large-scale, multilingual, event-centric video retrieval benchmark featuring more than 218,000 news videos.

📝 Text 🖼️ Image 🎧 Audio 🎥 Video 🌐 Multilingual

## Evaluation Results

|     | Modality                                  | Model                  | nDCG@10   | AP        | nDCG      | RR        | R@10      |
|-----|-------------------------------------------|------------------------|-----------|-----------|-----------|-----------|-----------|
|     | **Official Baselines**                    |                        |           |           |           |           |           |
|     | All                                       | VAST                   | 0.116     | 0.080     | 0.115     | 0.198     | 0.118     |
|     | OCR                                       | ICDAR OCR → CLIP       | 0.217     | 0.166     | 0.288     | 0.363     | 0.227     |
|     | ASR                                       | Whisper ASR            | 0.267     | 0.212     | 0.336     | 0.417     | 0.290     |
|     | Vision (key frame)                        | CLIP                   | 0.304     | 0.261     | 0.435     | 0.429     | 0.333     |
|     | All                                       | LanguageBind           | 0.324     | 0.283     | 0.452     | 0.443     | 0.355     |
|     | **Zero-Shot**                             |                        |           |           |           |           |           |
| (a) | text, ASR                                 | DRAMA                  | 0.629     | 0.576     | 0.693     | 0.749     | 0.649     |
| (b) | text, ASR                                 | OmniEmbed              | 0.377     | 0.329     | 0.453     | 0.493     | 0.403     |
| (c) | text, ASR, Vision (video), Audio          | OmniEmbed              | 0.595     | 0.537     | 0.673     | 0.732     | 0.616     |
|     | **Trained on MultiVENT 2.0 Training Set** |                        |           |           |           |           |           |
| (d) | text, ASR                                 | OmniEmbedMultivent     | 0.710     | 0.673     | 0.772     | 0.808     | 0.734     |
| (f) | Vision (video), Audio                     | OmniEmbedMultivent     | 0.709     | 0.665     | 0.776     | 0.822     | 0.724     |
| (h) | text, ASR, Vision (video), Audio          | **OmniEmbedMultivent** | **0.753** | **0.769** | **0.807** | **0.848** | **0.715** |

---

### Usage
```python
# Import Library, Load Model and Processor
import torch
from transformers import AutoProcessor, Qwen2_5OmniThinkerForConditionalGeneration
from qwen_omni_utils import process_mm_info

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    'Tevatron/OmniEmbed-v0.1',
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
).to(device).eval()

processor.tokenizer.padding_side = "left"
model.padding_side = "left"

# Function to Encode Message
def encode_message(message):
    texts = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True)[0] + "<|endoftext|>"
    audio_inputs, image_inputs, video_inputs = process_mm_info(message, use_audio_in_video=True)

    inputs = processor(
        text=texts,
        audio=audio_inputs,
        images=image_inputs,
        videos=video_inputs,
        return_tensors="pt",
        padding="longest",
    )
    for k in inputs:
        inputs[k] = inputs[k].to(device)

    cache_position = torch.arange(0, inputs['input_ids'].shape[1], device=device)
    inputs = model.prepare_inputs_for_generation(**inputs, use_cache=True, cache_position=cache_position)
    model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)

    # The embedding is the final hidden state of the last token, L2-normalized
    last_hidden_state = model_outputs.hidden_states[-1]
    reps = last_hidden_state[:, -1]
    reps = torch.nn.functional.normalize(reps, p=2, dim=-1)
    return reps
```
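
Because `encode_message` L2-normalizes its output, cosine similarity reduces to a dot product, so a whole candidate pool can be scored with one matrix multiply. A minimal sketch; the `rank` helper and `doc_messages` names are illustrative, not part of the model card API:

```python
def rank(query_message, doc_messages):
    # Encode the query and each candidate with the encode_message function above
    q = encode_message(query_message)                         # (1, hidden)
    d = torch.cat([encode_message(m) for m in doc_messages])  # (n, hidden)
    # Embeddings are already L2-normalized, so dot product == cosine similarity
    scores = (q @ d.T).squeeze(0)                             # (n,)
    return scores.argsort(descending=True), scores
```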

### 🎬 Video Retrieval
```python
example_query = 'Query: How to cook Mapo Tofu?'
example_video_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/mapo_tofu.mp4"
example_video_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/zhajiang_noodle.mp4"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
video_1 = [{'role': 'user', 'content': [{'type': 'video', 'video': example_video_1}]}]
video_2 = [{'role': 'user', 'content': [{'type': 'video', 'video': example_video_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(video_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(video_2))

print("Similarities:", sim1.item(), sim2.item())
```

### 🎵 Audio Retrieval
```python
example_query = 'Query: A light piano piece'
example_audio_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/joe_hisaishi_summer.mp3"
example_audio_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/jay_chou_superman_cant_fly.mp3"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
audio_1 = [{'role': 'user', 'content': [{'type': 'audio', 'audio': example_audio_1}]}]
audio_2 = [{'role': 'user', 'content': [{'type': 'audio', 'audio': example_audio_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(audio_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(audio_2))

print("Similarities:", sim1.item(), sim2.item())
```

### 📈 Image Document Retrieval (Image, Chart, PDF)
```python
example_query = 'Query: How many input modalities does Qwen2.5-Omni support?'
example_image_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/qwen2.5omni_hgf.png"
example_image_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/llama4_hgf.png"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
image_1 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1}]}]
image_2 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(image_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(image_2))

print("Similarities:", sim1.item(), sim2.item())
```

### 🌍 Multilingual Text Retrieval
```python
example_query = 'Query: 氧气在空气中占比多少?'
example_text_1 = "空气是指大气层中由不同气体和各类飘浮在其中的固体与液体颗粒(大气颗粒与气溶胶)所组成的气态混合物。地球大气层的空气主要由78.1%的氮气、20.9%氧气、0.9%的氩气和1~4%的水蒸气组成,其成分并不是固定的,随着高度、气压、温度的改变和对流情况不同,局部空气的组成比例也会改变。空气在大气层(特别是对流层)中的流动形成了风和曳流、气旋、龙卷等自然现象,而空气中飘浮的颗粒则形成了云、雾、霾和沙尘暴等短期天气情况。空气在海洋和陆地之间跨区域流动所承载的湿度和热能传导也是水循环和气候变率与变化的关键一环。"
example_text_2 = "水(化学式:H2O)是一种无机化合物,在常温且无杂质中是无色无味不导电的透明液体,也会通过蒸发产生气态的水蒸气(这种蒸发可以发生在任何温度下,同时取决于与空气接触的表面积和湿度差)。在标准大气压下,水的凝固点是0 °C(32 °F;273 K),沸点是100 °C(212 °F;373 K)。"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
text_1 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_1}]}]
text_2 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(text_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(text_2))

print("Similarities:", sim1.item(), sim2.item())
```
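
### 🎬🎧📝 Joint Video, Audio, and Text Retrieval

OmniEmbed-multivent's strongest setting, row (h) in the table above, embeds a document's video, audio, and text jointly. A hedged sketch of packing several content types into one message (the video URL reuses the earlier example asset; the text description is illustrative). Note that `encode_message` already calls `process_mm_info` with `use_audio_in_video=True`, so the video's audio track is included automatically:

```python
# Sketch: embed a video (with its audio track) together with a text description
example_doc = [{'role': 'user', 'content': [
    {'type': 'video', 'video': "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/mapo_tofu.mp4"},
    {'type': 'text', 'text': 'A step-by-step Mapo Tofu recipe.'},  # illustrative description
]}]
query = [{'role': 'user', 'content': [{'type': 'text', 'text': 'Query: How to cook Mapo Tofu?'}]}]

sim = torch.cosine_similarity(encode_message(query), encode_message(example_doc))
print("Similarity:", sim.item())
```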

## Data & Training
We have fully open-sourced the training data and training code in [Tevatron](https://github.com/texttron/tevatron/tree/qwenomni).
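
Tevatron trains retrievers contrastively with in-batch negatives; the exact configuration for this model lives in the linked branch. A minimal InfoNCE sketch under that assumption (the `temperature` value here is a common choice, not taken from the repo):

```python
import torch
import torch.nn.functional as F

def infonce_loss(q_reps, d_reps, temperature=0.02):
    """q_reps: (B, h) query embeddings; d_reps: (B, h) positive document embeddings.
    Every other document in the batch serves as a negative for each query."""
    scores = q_reps @ d_reps.T / temperature                        # (B, B) similarity matrix
    labels = torch.arange(scores.size(0), device=scores.device)    # query i matches document i
    return F.cross_entropy(scores, labels)
```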

## Contact
This model was developed by:

Shengyao Zhuang, Xueguang Ma, Samantha Zhan, Crystina Zhang

Feel free to reach out to us with any questions or for further discussion.