---
license: mit
library_name: transformers
base_model: Qwen/Qwen2.5-Omni-3B
language:
- en
tags:
- clamr
- multimodal
- video-retrieval
- late-interaction
pipeline_tag: feature-extraction
---

# CLaMR: Multimodal Late-Interaction Retrieval
by David Wan, Han Wang, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

CLaMR (Contextualized Late-Interaction for Multimodal Content Retrieval) is a retrieval system for documents that span multiple modalities, such as video frames, transcribed speech (ASR), on-screen text (OCR), and video descriptions. It adapts the ColBERT late-interaction strategy to a powerful multimodal foundation model, enabling fine-grained relevance scoring between a textual query and a rich set of multimodal document evidence.

It was introduced in the paper [CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval](https://arxiv.org/abs/2506.06144).

<p align="center"><img src="https://raw.githubusercontent.com/meetdavidwan/clamr/main/assets/teaser.png" width="800"/></p>

## Model Description

This model is built upon a **Qwen2.5-Omni-3B** backbone. CLaMR encodes a textual query and various multimodal document sources (like ASR, OCR, and video frames) into multi-vector representations. The core innovation is the **contextualized late-interaction mechanism**, which computes relevance by efficiently matching each query token embedding against all token embeddings from the various document modalities.
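For intuition, here is a minimal sketch of ColBERT-style MaxSim scoring in PyTorch. The tensor names, and the assumptions that token embeddings are L2-normalized and already concatenated across modalities, are illustrative; this is not the released implementation.

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim relevance between one query and one multimodal document.

    query_emb: (num_query_tokens, dim) query token embeddings
    doc_emb:   (num_doc_tokens, dim) token embeddings concatenated across
               the document's modalities (frames, ASR, OCR, description)
    Both are assumed to be L2-normalized.
    """
    sim = query_emb @ doc_emb.T              # token-by-token similarity matrix
    return sim.max(dim=-1).values.sum()      # best document match per query token, summed
```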

Unlike traditional methods that aggregate multimodal information into a single fixed-size vector, CLaMR preserves modality-specific details. This allows for a much more granular and interpretable similarity assessment, significantly improving retrieval performance on complex, multimodal documents. The model is trained to distinguish between relevant and irrelevant documents using a contrastive loss function.
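The contrastive objective can likewise be sketched with in-batch negatives over MaxSim scores, reusing `late_interaction_score` from the sketch above. The temperature value and batching scheme here are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_embs, doc_embs, temperature: float = 0.05):
    """Cross-entropy over a (batch x batch) MaxSim score matrix.

    query_embs[i] and doc_embs[i] form the positive pair; every other
    document in the batch serves as a negative for query i.
    """
    scores = torch.stack([
        torch.stack([late_interaction_score(q, d) for d in doc_embs])
        for q in query_embs
    ])                                        # (batch, batch) score matrix
    labels = torch.arange(len(query_embs))    # positives sit on the diagonal
    return F.cross_entropy(scores / temperature, labels)
```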

## Model Training

### Dataset
The model was trained on MSRVTT.

### Training Parameters
The model was trained using the following configuration (a hedged PEFT/LoRA sketch follows the list):
- **Framework:** PEFT with LoRA
- **LoRA `r`:** 128
- **LoRA `alpha`:** 128
- **LoRA Target Modules:** `down_proj`, `gate_proj`, `up_proj`, `k_proj`, `q_proj`, `v_proj`, `o_proj`, and a `custom_text_proj` layer.
- **Optimizer:** `paged_adamw_8bit`
- **Learning Rate:** 1e-5 with a linear decay and 0.1 warmup ratio.
- **Precision:** 4-bit base-model quantization with `bfloat16` compute.
- **Hardware:** 8 x NVIDIA A100 80GB GPUs.
- **Batch Size:** 4 per device for training, 2 for evaluation.
- **Epochs:** 5
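
The sketch below shows how this configuration could map onto PEFT and bitsandbytes. Whether `AutoModel` resolves the Qwen2.5-Omni backbone directly, and whether `custom_text_proj` receives LoRA adapters or is trained fully, are assumptions; consult the CLaMR repository for the exact training setup.

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit base-model load with bfloat16 compute, per the precision settings above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Assumption: AutoModel resolves the Qwen2.5-Omni-3B backbone; the original
# training code may use a dedicated model class instead.
base_model = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=[
        "down_proj", "gate_proj", "up_proj",
        "k_proj", "q_proj", "v_proj", "o_proj",
        "custom_text_proj",  # the added projection layer listed above
    ],
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```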

## Citation

If you use CLaMR in your research, please cite the following paper:

```bibtex
@misc{wan2025clamrcontextualizedlateinteractionmultimodal,
      title={CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval},
      author={David Wan and Han Wang and Elias Stengel-Eskin and Jaemin Cho and Mohit Bansal},
      year={2025},
      eprint={2506.06144},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.06144},
}
```