---
license: apache-2.0
language:
- en
base_model:
- intfloat/e5-base-unsupervised
pipeline_tag: sentence-similarity
---


# cadet-embed-base-v1

**cadet-embed-base-v1** is a BERT-base embedding model fine-tuned **from `intfloat/e5-base-unsupervised`** using:

* **cross-encoder listwise distillation** (teachers: `RankT5-3B` and `BAAI/bge-reranker-v2.5-gemma2-lightweight`)
* **purely synthetic queries** (generated by Llama-3.1 8B: questions, claims, titles, keywords, and zero-shot and few-shot web queries) over 400k passages in total from the MSMARCO, DBPedia, and Wikipedia corpora.

The result: highly effective BERT-base retrieval.
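
For readers unfamiliar with listwise distillation, the idea can be sketched as follows: for each query, the teacher cross-encoder scores a list of candidate passages, and the student embedding model is trained so that its similarity distribution over the same list matches the teacher's. The snippet below is a minimal NumPy illustration, not the training code behind this model; the function names, the KL-divergence form of the loss, and the `temperature` parameter are assumptions made for the example.

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    return x - np.log(np.sum(np.exp(x), axis=axis, keepdims=True))

def listwise_kl_loss(teacher_scores, student_scores, temperature=1.0):
    """Mean KL(teacher || student) over candidate lists.

    teacher_scores: (n_queries, n_candidates) cross-encoder scores.
    student_scores: (n_queries, n_candidates) query-passage similarities.
    temperature:    hypothetical scaling applied to the student scores.
    """
    t_log = log_softmax(np.asarray(teacher_scores, dtype=np.float64))
    s_log = log_softmax(np.asarray(student_scores, dtype=np.float64) / temperature)
    t_prob = np.exp(t_log)
    # KL divergence per query, summed over the candidate list
    kl = np.sum(t_prob * (t_log - s_log), axis=-1)
    return float(np.mean(kl))

# Toy example: one query with three candidate passages
teacher = [[2.0, 1.0, 0.0]]
print(listwise_kl_loss(teacher, [[2.0, 1.0, 0.0]]))  # student agrees with teacher
print(listwise_kl_loss(teacher, [[0.0, 1.0, 2.0]]))  # student reverses the ranking
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two rankings diverge.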

---

## Quick start
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("manveertamber/cadet-embed-base-v1")

query = "query: capital of France"

passages = [
    "passage: Paris is the capital and largest city of France.",
    "passage: Berlin is known for its vibrant art scene.",
    "passage: The Eiffel Tower is located in Paris, France."
]

# Encode; normalize_embeddings=True returns L2-normalised vectors
q_emb   = model.encode(query,    normalize_embeddings=True)
p_embs  = model.encode(passages, normalize_embeddings=True)     # shape (n_passages, dim)

# Cosine similarity = dot product of normalised vectors
scores = np.dot(p_embs, q_emb)                                  # shape (n_passages,)

# Rank passages by score
for passage, score in sorted(zip(passages, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}\t{passage}")
```
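
For more than one query, the same dot-product scoring extends to a matrix product. The sketch below assumes the embeddings were produced as in the quick start (with `normalize_embeddings=True` and the `query: ` / `passage: ` prefixes); `add_prefix` and `top_k` are hypothetical helper names introduced for illustration, and toy unit vectors stand in for real embeddings so the example runs without downloading the model.

```python
import numpy as np

def add_prefix(texts, prefix):
    """Prepend the prefix to each text unless it is already present."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]

def top_k(query_embs, passage_embs, k=2):
    """Return (indices, scores) of the k best passages for each query.

    Assumes both inputs are L2-normalised, so the dot product
    equals cosine similarity.
    """
    q = np.atleast_2d(np.asarray(query_embs, dtype=np.float64))
    p = np.asarray(passage_embs, dtype=np.float64)
    scores = q @ p.T                            # shape (n_queries, n_passages)
    k = min(k, scores.shape[1])
    idx = np.argsort(-scores, axis=1)[:, :k]    # best-first indices per query
    return idx, np.take_along_axis(scores, idx, axis=1)

queries = add_prefix(["capital of France"], "query: ")

# Toy stand-ins for model.encode(...) output
q_embs = np.array([[1.0, 0.0]])
p_embs = np.array([[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]])

idx, scores = top_k(q_embs, p_embs, k=2)
for rank, (i, s) in enumerate(zip(idx[0], scores[0]), start=1):
    print(f"{rank}\tpassage {i}\t{s:.3f}")
```

For large corpora, a full `argsort` over every passage becomes wasteful; an approximate nearest-neighbour index is the usual replacement.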



If you use this model, please cite:

```bibtex
@article{tamber2025teaching,
  title={Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation},
  author={Tamber, Manveer Singh and Kazi, Suleman and Sourabh, Vivek and Lin, Jimmy},
  journal={arXiv preprint arXiv:2502.19712},
  year={2025}
}
```