Diver-Retriever-4B

Highlights

Diver-Retriever-4B is an embedding model built for reasoning-intensive retrieval, the challenge targeted by ReasonIR and RaDeR. Its training data combines the fields of mathematics, coding, and healthcare. We matched training samples by difficulty level and constructed negative samples specific to each field. As a result, the model performs well on the BRIGHT leaderboard as well as the MTEB-Medical benchmark.

Model Description

Model type: Text Embedding
Language(s) (NLP): Bilingual (Chinese & English)
Context Length: 40k
Number of Parameters: 4B

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our GitHub (https://github.com/AQ-MedAI/Diver).

Evaluation

Method	Avg.	Bio.	Earth.	Econ.	Psy.	Rob.	Stack.	Sus.	Leet.	Pony	AoPS	TheoQ.	TheoT.
Evaluate Retriever with Original Query
BM25	14.5	18.9	27.2	14.9	12.5	13.6	18.4	15.0	24.4	7.9	6.2	10.4	4.9
SBERT	14.9	15.1	20.4	16.6	22.7	8.2	11.0	15.3	26.4	7.0	5.3	20.0	10.8
gte-Qwen1.5-7B	22.5	30.6	36.4	17.8	24.6	13.2	22.2	14.8	25.5	9.9	14.4	27.8	32.9
Qwen3-4B	5.6	3.5	8.0	2.3	2.0	1.6	1.0	4.4	2.1	0.1	4.9	18.0	19.2
OpenAI	17.9	23.3	26.7	19.5	27.6	12.8	14.3	20.5	23.6	2.4	8.5	23.5	11.7
Google	20.0	22.7	34.8	19.6	27.8	15.7	20.1	17.1	29.6	3.6	9.3	23.8	15.9
ReasonIR-8B	24.4	26.2	31.4	23.3	30.0	18.0	23.9	20.5	35.0	10.5	14.7	31.9	27.2
RaDeR-7B	25.5	34.6	38.9	22.1	33.0	14.8	22.5	23.7	37.3	5.0	10.2	28.4	35.1
Seed1.5-Embedding	27.2	34.8	46.9	23.4	31.6	19.1	25.4	21.0	43.2	4.9	12.2	33.3	30.5
DIVER-Retriever	28.9	41.8	43.7	21.7	35.3	21.0	21.2	25.1	37.6	13.2	10.7	38.4	37.3
Evaluate Retriever with GPT-4 REASON-query
BM25	27.0	53.6	54.1	24.3	38.7	18.9	27.7	26.3	19.3	17.6	3.9	19.2	20.8
SBERT	17.8	18.5	26.3	17.5	27.2	8.8	11.8	17.5	24.3	10.3	5.0	22.3	23.5
gte-Qwen1.5-7B	24.8	35.5	43.1	24.3	34.3	15.4	22.9	23.9	25.4	5.2	4.6	28.7	34.6
Qwen3-4B	5.5	1.3	17.3	2.5	6.2	1.0	4.8	4.5	3.0	5.9	0.0	7.2	12.5
OpenAI	23.3	35.2	40.1	25.1	38.0	13.6	18.2	24.2	24.5	6.5	7.7	22.9	23.8
Google	26.2	36.4	45.6	25.6	38.2	18.7	29.5	17.9	31.1	3.7	10.0	27.8	30.4
ReasonIR-8B	29.9	43.6	42.9	32.7	38.8	20.9	25.8	27.5	31.5	19.6	7.4	33.1	35.7
RaDeR-7B	29.2	36.1	42.9	25.2	37.9	16.6	27.4	25.0	34.8	11.9	12.0	37.7	43.4
DIVER-Retriever	32.1	51.9	53.5	29.5	41.2	21.4	27.5	26.1	33.5	11.7	9.5	39.3	39.7
Evaluate Retriever with DIVER-QExpand Query
ReasonIR-8B	32.6	49.4	44.7	32.4	44.0	26.6	31.8	29.0	32.3	12.8	9.1	40.7	38.4
+BM25 (Hybrid)	35.7	56.8	53.5	33.0	48.5	29.4	34.2	32.0	35.2	16.8	12.9	39.3	36.8
DIVER-Retriever	33.9	54.5	52.7	28.8	44.9	25.1	27.4	29.5	34.5	10.0	14.5	40.7	44.7
+BM25 (Hybrid)	37.2	60.0	55.9	31.8	47.9	27.1	33.9	31.9	35.1	23.1	16.8	36.9	46.6
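The Avg. column appears to be the unweighted mean of the twelve per-dataset scores. A quick check in Python, using the DIVER-Retriever row from the original-query block (the variable names are illustrative):

```python
# Per-dataset scores for DIVER-Retriever (original query), copied from the table above
diver_scores = {
    "Bio.": 41.8, "Earth.": 43.7, "Econ.": 21.7, "Psy.": 35.3,
    "Rob.": 21.0, "Stack.": 21.2, "Sus.": 25.1, "Leet.": 37.6,
    "Pony": 13.2, "AoPS": 10.7, "TheoQ.": 38.4, "TheoT.": 37.3,
}

# Unweighted mean across the twelve datasets, rounded to one decimal
macro_avg = round(sum(diver_scores.values()) / len(diver_scores), 1)
print(macro_avg)  # 28.9, matching the Avg. column
```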

Usage

Inference

Sentence Transformers Usage

# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("AQ-MedAI/Diver-Retriever-4B")


# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

# Encode the queries and documents. Note that queries benefit from using a prompt
# Here we use the prompt called "query" stored under `model.prompts`, but you can
# also pass your own prompt via the `prompt` argument
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
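
To turn the similarity matrix into a ranking, sort each query's row in descending order. The sketch below uses plain Python lists and a hypothetical helper, `rank_documents`, which is not part of sentence-transformers. With the real model, `similarity.tolist()` would supply the input; the scores here are placeholders.

```python
def rank_documents(similarity_rows, top_k=None):
    """Return, per query, (doc_index, score) pairs sorted by descending score."""
    rankings = []
    for row in similarity_rows:
        ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
        rankings.append(ranked[:top_k] if top_k is not None else ranked)
    return rankings

# Placeholder scores: 2 queries x 2 documents (illustrative values only)
example_similarity = [
    [0.75, 0.11],
    [0.03, 0.63],
]

for query_idx, ranked in enumerate(rank_documents(example_similarity, top_k=1)):
    doc_idx, score = ranked[0]
    print(f"query {query_idx} -> document {doc_idx} (score {score:.2f})")
```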

Transformers Usage

# Requires transformers>=4.51.0
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('AQ-MedAI/Diver-Retriever-4B', padding_side='left')
model = AutoModel.from_pretrained('AQ-MedAI/Diver-Retriever-4B')


max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.to(model.device)
# Inference only: disable gradient tracking to save memory
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7534257769584656, 0.1146894246339798], [0.03198453038930893, 0.6258305311203003]]
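
Because the embeddings are L2-normalized before the matrix product, each score is a cosine similarity. A dependency-free sketch of that identity, using toy 3-dimensional vectors (the helper names `l2_normalize` and `dot` are illustrative, not from any library):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy vectors standing in for a query and a document embedding
query_vec = [3.0, 4.0, 0.0]
doc_vec = [4.0, 3.0, 0.0]

# Cosine similarity computed directly from the definition
cosine_direct = dot(query_vec, doc_vec) / (
    math.sqrt(dot(query_vec, query_vec)) * math.sqrt(dot(doc_vec, doc_vec))
)
# Same quantity as a dot product of unit-length vectors
cosine_via_normalize = dot(l2_normalize(query_vec), l2_normalize(doc_vec))

print(round(cosine_direct, 4), round(cosine_via_normalize, 4))
```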

Finetuning

We recommend fine-tuning DIVER-Retriever-4B with ms-swift, using the InfoNCE loss.

Before starting training, please ensure your environment is properly configured.

pip install ms-swift -U
# Or install from source
pip install git+https://github.com/modelscope/ms-swift.git

pip install transformers -U

# Optional packages
pip install deepspeed # multi-GPU training
pip install liger-kernel # save GPU memory resources
pip install flash-attn --no-build-isolation

Training Command

Using the InfoNCE loss as an example, the complete training command is as follows:

nproc_per_node=8
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model DIVER/DIVER-Retriever-4B \
    --task_type embedding \
    --model_type qwen3_emb \
    --train_type full \
    --dataset your_dataset \
    --split_dataset_ratio 0.05 \
    --eval_strategy steps \
    --output_dir output \
    --eval_steps 20 \
    --num_train_epochs 5 \
    --save_steps 20 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 6e-6 \
    --loss_type infonce \
    --label_names labels \
    --dataloader_drop_last true \
    --deepspeed zero3
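
For orientation, InfoNCE treats each query's positive document as the correct class among a set of candidates and applies a softmax cross-entropy over the similarity scores. The sketch below is a generic, dependency-free version for a single query. The function name and the temperature value are assumptions for illustration, and ms-swift's actual implementation (for example, how it handles in-batch negatives) may differ.

```python
import math

def info_nce_loss(pos_score, neg_scores, temperature=0.1):
    """InfoNCE for one query: -log softmax of the positive over all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

# Toy cosine similarities: a well-separated positive gives a small loss
easy = info_nce_loss(pos_score=0.9, neg_scores=[0.1, 0.2])
# A negative that outscores the positive gives a large loss
hard = info_nce_loss(pos_score=0.3, neg_scores=[0.8, 0.2])
print(easy < hard)  # True
```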

Citation

If you find our work helpful, please consider citing it:

@misc{long2025divermultistageapproachreasoningintensive,
      title={DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval},
      author={Meixiu Long and Duolin Sun and Dan Yang and Junjie Wang and Yue Shen and Jian Wang and Peng Wei and Jinjie Gu and Jiahai Wang},
      year={2025},
      eprint={2508.07995},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2508.07995},
}
