---
language:
- en
license: apache-2.0
tags:
- reranker
- cross-encoder
- sequence-classification
- vllm
base_model: Qwen/Qwen3-Reranker-4B
pipeline_tag: text-classification
---

# Qwen3-Reranker-4B-seq-cls-vllm-fixed

This is a fixed version of the Qwen3-Reranker-4B model, converted to the sequence-classification architecture and configured for use with vLLM.

## Model Description

This model is a pre-converted version of [Qwen/Qwen3-Reranker-4B](https://huggingface.co/Qwen/Qwen3-Reranker-4B) that:
- Has been converted from CausalLM to SequenceClassification architecture
- Includes proper configuration for vLLM compatibility
- Provides ~75,000x reduction in classification head size
- Offers ~150,000x fewer operations per token compared to using the full LM head

## Key Improvements

The original converted model ([tomaarsen/Qwen3-Reranker-4B-seq-cls](https://huggingface.co/tomaarsen/Qwen3-Reranker-4B-seq-cls)) was missing critical vLLM configuration attributes. This version adds:

```json
{
  "classifier_from_token": ["no", "yes"],
  "method": "from_2_way_softmax",
  "use_pad_token": false,
  "is_original_qwen3_reranker": false
}
```

These configurations are essential for vLLM to properly handle the pre-converted weights.

## Usage with vLLM

```bash
vllm serve danielchalef/Qwen3-Reranker-4B-seq-cls-vllm-fixed \
    --task score \
    --served-model-name qwen3-reranker-4b \
    --disable-log-requests
```
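
Once the server is up, scores can be requested over HTTP. A sketch of a request body for vLLM's score endpoint (endpoint path and field names follow recent vLLM versions and may vary; `text_2` may also be a single string):

```json
{
  "model": "qwen3-reranker-4b",
  "text_1": "What is the capital of France?",
  "text_2": ["Paris is the capital of France."]
}
```

POST this to the server's `/score` route; the response carries one `score` per entry in `text_2`.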

### Python Example

```python
from vllm import LLM

llm = LLM(
    model="danielchalef/Qwen3-Reranker-4B-seq-cls-vllm-fixed",
    task="score"
)

queries = ["What is the capital of France?"]
documents = ["Paris is the capital of France."]

outputs = llm.score(queries, documents)
scores = [output.outputs.score for output in outputs]
print(scores)
```

## Performance

This model performs identically to the original Qwen3-Reranker-4B when used with proper configuration, while providing significant efficiency improvements:

- **Memory**: ~600MB → ~8KB for classification head
- **Compute**: 151,936 logits → 1 logit per forward pass
- **Speed**: Faster inference due to reduced computation
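
The compute figure follows directly from the vocabulary size (a back-of-envelope check; 151,936 is the Qwen3 vocabulary size quoted above):

```python
# Full LM head: one logit per vocabulary token per forward pass.
# Converted classification head: a single relevance logit.
vocab_size = 151_936  # Qwen3 vocabulary size
reduction = vocab_size / 1  # logits before vs. after conversion
print(f"~{reduction:,.0f}x fewer logits per forward pass")
```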

## Technical Details

- **Architecture**: Qwen3ForSequenceClassification
- **Base Model**: Qwen/Qwen3-Reranker-4B
- **Conversion Method**: from_2_way_softmax (yes_logit - no_logit)
- **Model Size**: 4B parameters
- **Task**: Reranking/Scoring
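
The `from_2_way_softmax` method can be illustrated with toy tensors (a minimal sketch with made-up dimensions and token ids, not the actual conversion script). A two-way softmax over the "yes"/"no" logits depends only on their difference, so a single classifier row equal to `yes_row - no_row` reproduces the reranker's relevance probability:

```python
import torch

# Toy dimensions; the real model uses a ~151,936-token vocabulary.
vocab_size, hidden_size = 1_000, 64
yes_id, no_id = 9, 2  # made-up token ids for "yes" / "no"

lm_head = torch.randn(vocab_size, hidden_size)                # original LM head
classifier = (lm_head[yes_id] - lm_head[no_id]).unsqueeze(0)  # new 1 x hidden head

hidden = torch.randn(hidden_size)  # final hidden state for one sequence
logits = lm_head @ hidden          # full-vocabulary scores
# P("yes") under a 2-way softmax depends only on yes_logit - no_logit:
p_yes_full = torch.softmax(logits[[no_id, yes_id]], dim=0)[1]
p_yes_converted = torch.sigmoid(classifier @ hidden)[0]
assert torch.allclose(p_yes_full, p_yes_converted, atol=1e-5)
```

This identity is why the converted model can score identically to the original: `sigmoid(yes_logit - no_logit)` equals the softmax probability of "yes" over the two tokens.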

## Citation

If you use this model, please cite the original Qwen3-Reranker:

```bibtex
@misc{qwen3reranker2024,
  title={Qwen3-Reranker},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face}
}
```

## License

Apache 2.0 (inherited from the base model)