---
license: mit
datasets:
- Tevatron/bge-ir
- Tevatron/wiki-ss-nq-new
- Tevatron/pixmo-docs
- Tevatron/colpali
- Tevatron/msrvtt
- Tevatron/audiocaps
- Tevatron/multivent
base_model:
- Tevatron/OmniEmbed-v0.1
pipeline_tag: visual-document-retrieval
library_name: peft
---

# Tevatron/OmniEmbed-v0.1

**OmniEmbed** is a powerful multi-modal embedding model built on [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) using our [Tevatron](https://github.com/texttron/tevatron/) toolkit, a unified toolkit across scale, language, and modality for document retrieval.
OmniEmbed generates unified embeddings across multilingual text, images, audio, and video, enabling effective cross-modal retrieval for diverse applications. [Paper](https://arxiv.org/pdf/2505.02466v1).

**OmniEmbed-multivent** is further finetuned from OmniEmbed for video retrieval, enabling joint input of video, audio, and text and enhancing performance on such combined inputs.

OmniEmbed-multivent achieves state-of-the-art performance on the MAGMaR 2025 shared task over MultiVENT 2.0, a large-scale, multilingual, event-centric video retrieval benchmark featuring more than 218,000 news videos.

📝 Text 🖼️ Image 🎧 Audio 🎥 Video 🌐 Multilingual

## Evaluation Results

|     | Modality                                  | Model                  | nDCG@10   | AP        | nDCG      | RR        | R@10      |
|-----|-------------------------------------------|------------------------|-----------|-----------|-----------|-----------|-----------|
|     | **Official Baselines**                    |                        |           |           |           |           |           |
|     | All                                       | VAST                   | 0.116     | 0.080     | 0.115     | 0.198     | 0.118     |
|     | OCR                                       | ICDAR OCR → CLIP       | 0.217     | 0.166     | 0.288     | 0.363     | 0.227     |
|     | ASR                                       | Whisper ASR            | 0.267     | 0.212     | 0.336     | 0.417     | 0.290     |
|     | Vision (key frame)                        | CLIP                   | 0.304     | 0.261     | 0.435     | 0.429     | 0.333     |
|     | All                                       | LanguageBind           | 0.324     | 0.283     | 0.452     | 0.443     | 0.355     |
|     | **Zero-Shot**                             |                        |           |           |           |           |           |
| (a) | text, ASR                                 | DRAMA                  | 0.629     | 0.576     | 0.693     | 0.749     | 0.649     |
| (b) | text, ASR                                 | OmniEmbed              | 0.377     | 0.329     | 0.453     | 0.493     | 0.403     |
| (c) | text, ASR, Vision (video), Audio          | OmniEmbed              | 0.595     | 0.537     | 0.673     | 0.732     | 0.616     |
|     | **Trained on MultiVENT 2.0 Training Set** |                        |           |           |           |           |           |
| (d) | text, ASR                                 | OmniEmbedMultivent     | 0.710     | 0.673     | 0.772     | 0.808     | 0.734     |
| (f) | Vision (video), Audio                     | OmniEmbedMultivent     | 0.709     | 0.665     | 0.776     | 0.822     | 0.724     |
| (h) | text, ASR, Vision (video), Audio          | **OmniEmbedMultivent** | **0.753** | **0.769** | **0.807** | **0.848** | **0.715** |

---

### Usage
```python
# Import Library, Load Model and Processor
import torch
from transformers import AutoProcessor, Qwen2_5OmniThinkerForConditionalGeneration
from qwen_omni_utils import process_mm_info

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    'Tevatron/OmniEmbed-v0.1',
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
).to(device).eval()

processor.tokenizer.padding_side = "left"
model.padding_side = "left"

# Function to Encode Message
def encode_message(message):
    texts = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True)[0] + "<|endoftext|>"
    audio_inputs, image_inputs, video_inputs = process_mm_info(message, use_audio_in_video=True)

    inputs = processor(
        text=texts,
        audio=audio_inputs,
        images=image_inputs,
        videos=video_inputs,
        return_tensors="pt",
        padding="longest",
    )
    for k in inputs:
        inputs[k] = inputs[k].to(device)

    cache_position = torch.arange(0, inputs['input_ids'].shape[1], device=device)
    inputs = model.prepare_inputs_for_generation(**inputs, use_cache=True, cache_position=cache_position)
    model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)

    # The embedding is the final hidden state of the last token, L2-normalized
    last_hidden_state = model_outputs.hidden_states[-1]
    reps = last_hidden_state[:, -1]
    reps = torch.nn.functional.normalize(reps, p=2, dim=-1)
    return reps
```
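
Because `encode_message` L2-normalizes its output, cosine similarity reduces to a dot product, so a whole candidate pool can be scored with one matrix multiply. A minimal sketch; the `rank` helper and `doc_messages` names are illustrative, not part of the model card API:

```python
def rank(query_message, doc_messages):
    # Encode the query and each candidate with the encode_message function above
    q = encode_message(query_message)                         # (1, hidden)
    d = torch.cat([encode_message(m) for m in doc_messages])  # (n, hidden)
    # Embeddings are already L2-normalized, so dot product == cosine similarity
    scores = (q @ d.T).squeeze(0)                             # (n,)
    return scores.argsort(descending=True), scores
```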

### 🎬 Video Retrieval
```python
example_query = 'Query: How to cook Mapo Tofu?'
example_video_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/mapo_tofu.mp4"
example_video_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/zhajiang_noodle.mp4"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
video_1 = [{'role': 'user', 'content': [{'type': 'video', 'video': example_video_1}]}]
video_2 = [{'role': 'user', 'content': [{'type': 'video', 'video': example_video_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(video_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(video_2))

print("Similarities:", sim1.item(), sim2.item())
```

### 🎵 Audio Retrieval
```python
example_query = 'Query: A light piano piece'
example_audio_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/joe_hisaishi_summer.mp3"
example_audio_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/jay_chou_superman_cant_fly.mp3"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
audio_1 = [{'role': 'user', 'content': [{'type': 'audio', 'audio': example_audio_1}]}]
audio_2 = [{'role': 'user', 'content': [{'type': 'audio', 'audio': example_audio_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(audio_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(audio_2))

print("Similarities:", sim1.item(), sim2.item())
```

### 📈 Image Document Retrieval (Image, Chart, PDF)
```python
example_query = 'Query: How many input modalities does Qwen2.5-Omni support?'
example_image_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/qwen2.5omni_hgf.png"
example_image_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/llama4_hgf.png"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
image_1 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1}]}]
image_2 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(image_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(image_2))

print("Similarities:", sim1.item(), sim2.item())
```

### 🌍 Multilingual Text Retrieval
```python
example_query = 'Query: 氧气在空气中占比多少?'
example_text_1 = "空气是指大气层中由不同气体和各类飘浮在其中的固体与液体颗粒(大气颗粒与气溶胶)所组成的气态混合物。地球大气层的空气主要由78.1%的氮气、20.9%氧气、0.9%的氩气和1~4%的水蒸气组成,其成分并不是固定的,随着高度、气压、温度的改变和对流情况不同,局部空气的组成比例也会改变。空气在大气层(特别是对流层)中的流动形成了风和曳流、气旋、龙卷等自然现象,而空气中飘浮的颗粒则形成了云、雾、霾和沙尘暴等短期天气情况。空气在海洋和陆地之间跨区域流动所承载的湿度和热能传导也是水循环和气候变率与变化的关键一环。"
example_text_2 = "水(化学式:H2O)是一种无机化合物,在常温且无杂质中是无色无味不导电的透明液体,也会通过蒸发产生气态的水蒸气(这种蒸发可以发生在任何温度下,同时取决于与空气接触的表面积和湿度差)。在标准大气压下,水的凝固点是0 °C(32 °F;273 K),沸点是100 °C(212 °F;373 K)。"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
text_1 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_1}]}]
text_2 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(text_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(text_2))

print("Similarities:", sim1.item(), sim2.item())
```
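
### 🎬🎧📝 Joint Video, Audio, and Text Retrieval

OmniEmbed-multivent's strongest setting, row (h) in the table above, embeds a document's video, audio, and text jointly. A hedged sketch of packing several content types into one message (the video URL reuses the earlier example asset; the text description is illustrative). Note that `encode_message` already calls `process_mm_info` with `use_audio_in_video=True`, so the video's audio track is included automatically:

```python
# Sketch: embed a video (with its audio track) together with a text description
example_doc = [{'role': 'user', 'content': [
    {'type': 'video', 'video': "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/mapo_tofu.mp4"},
    {'type': 'text', 'text': 'A step-by-step Mapo Tofu recipe.'},  # illustrative description
]}]
query = [{'role': 'user', 'content': [{'type': 'text', 'text': 'Query: How to cook Mapo Tofu?'}]}]

sim = torch.cosine_similarity(encode_message(query), encode_message(example_doc))
print("Similarity:", sim.item())
```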

## Data & Training
We have fully open-sourced the training data and training code in [Tevatron](https://github.com/texttron/tevatron/tree/qwenomni).
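
Tevatron trains retrievers contrastively with in-batch negatives; the exact configuration for this model lives in the linked branch. A minimal InfoNCE sketch under that assumption (the `temperature` value here is a common choice, not taken from the repo):

```python
import torch
import torch.nn.functional as F

def infonce_loss(q_reps, d_reps, temperature=0.02):
    """q_reps: (B, h) query embeddings; d_reps: (B, h) positive document embeddings.
    Every other document in the batch serves as a negative for each query."""
    scores = q_reps @ d_reps.T / temperature                        # (B, B) similarity matrix
    labels = torch.arange(scores.size(0), device=scores.device)    # query i matches document i
    return F.cross_entropy(scores, labels)
```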

## Contact
This model was developed by:

Shengyao Zhuang, Xueguang Ma, Samantha Zhan, Crystina Zhang

Feel free to reach out to us with any questions or for further discussion.