---
language:
- ar
- en
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Qwen/Qwen3-Embedding-0.6B
widget:
- source_sentence: >-
    أقترح أن تجد بنكًا في بلدك المحلي، وأن تفكر في فتح حساب مصرفي مقوم باليورو
    لديهم.
  sentences:
  - يمكنك مزج هذه الأمور، ولكن من تجربتي، سيكون الأمر صعبًا جدًا في البداية.
  - المرأة تضع ظلال العيون بقلم.
  - لست متأكدًا مما إذا كان بإمكانك فتح حساب مصرفي في فرنسا إذا لم تكن مقيمًا.
- source_sentence: صورة بالأبيض والأسود لموجة تتحطم في المحيط.
  sentences:
  - كلب صغير أسود في المحيط مع بعض الصخور في الخلفية.
  - امرأة تركب فيلًا.
  - طائر أصفر وبرتقالي متمسك بجانب قفص.
- source_sentence: >-
    إذا تمكنت من تجاوز "عامل الاشمئزاز"، فسيكون لديك مصدر سهل الاستخدام من
    السماد العضوي النيتروجيني.
  sentences:
  - أرقام NPK على السماد تمثل النسبة المئوية، بالوزن، للنيتروجين وP2O5 وK2O.
  - تجميع ويكيبيديا لقواعد السفر عبر الزمن هو مصدر جيد لفهم هذا الموضوع.
  - رجل يعزف على الناي.
- source_sentence: رجل يتحدث.
  sentences:
  - رجل يرقص.
  - أسد الجبل يطارد دبًا.
  - >-
    لأغراض الشمول، يحتوي برنامج Pages من Apple على العديد من قوالب الملصقات
    الجيدة.
- source_sentence: الجانب الأيسر من محرك قطار فضي.
  sentences:
  - قرد يركب حافلة.
  - >-
    إحدى الأفكار التي كانت تُطرح منذ الثمانينات هي أنه يمكنك التمييز بين
    "الحركات" و"الثبات".
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: apache-2.0
---
# Semantic-Ar-Qwen-Embed-0.6B
This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) on semantic textual similarity (STS) data to improve Arabic semantic understanding.
It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) <!-- at revision a579a21d7aff542145eebef8d60ed73ec281a0b4 -->
- **Maximum Sequence Length:** 32768 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** ar
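As a minimal sketch of the similarity function listed above, cosine similarity can be computed in plain Python as follows. The `cosine_similarity` helper and the 4-dimensional toy vectors are illustrative assumptions, not part of the model or the library; real embeddings from this model have 1024 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy 4-dimensional stand-ins for the model's 1024-dimensional embeddings
query = [1.0, 2.0, 0.0, 1.0]
same_direction = [2.0, 4.0, 0.0, 2.0]  # query scaled by 2
orthogonal = [0.0, 0.0, 3.0, 0.0]      # shares no non-zero dimension with query

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

Because the measure depends only on direction, scaling a vector leaves its score unchanged, which is why the scaled copy scores as a perfect match.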
### 📊 Performance Evaluation
This model was evaluated on Arabic semantic similarity benchmarks using the [MTEB](https://github.com/embeddings-benchmark/mteb) framework. The table below reports **Spearman correlation scores** on two tasks, **STS17** and **STS22.v2**, together with their average.
| **Model** | **STS17 (Spearman)** | **STS22.v2 (Spearman)** | **Average** |
|----------------------------------|----------------------|--------------------------|-------------|
| Qwen3 Embeddings 0.6B | 0.7505 | 0.6520 | 0.7013 |
| Qwen3 Embeddings 4B | 0.7912 | 0.6669 | 0.7291 |
| Qwen3 Embeddings 8B | 0.8220 | **0.6680** | 0.7450 |
| Semantic-Ar-Qwen-Embed-V0.1 | **0.8300** | 0.6130 | 0.7215 |
> ✅ **STS17**: Arabic sentence pairs from the SemEval-2017 semantic textual similarity task
> 🧪 **STS22.v2**: Diverse, multi-domain Arabic similarity pairs
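For intuition about the metric, Spearman correlation is the Pearson correlation of rank-transformed values. The sketch below is an illustrative pure-Python version, not the MTEB implementation, and the `gold` and `predicted` values are made up for the example. It also shows that the Average column is the arithmetic mean of the two task scores, using the fine-tuned model's row.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical gold scores and model similarities for five sentence pairs
gold = [0.0, 1.5, 3.0, 4.0, 5.0]
predicted = [0.10, 0.35, 0.30, 0.80, 0.95]
rho = spearman(gold, predicted)
print(round(rho, 4))

# Average column for the fine-tuned model, from the table's own values
average = (0.8300 + 0.6130) / 2
print(round(average, 4))
```

Only the ordering of the predictions matters here: swapping the ranks of the second and third pairs is the sole disagreement with the gold ordering.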
#### Comparison with other models (Spearman × 100):
| Model | Dim | # Params. | STS17 | STS22-v2 | Average |
|------------------------------------------|------|-----------|-------|----------|---------|
| Arabic-Triplet-Matryoshka-V2 | 768 | 135M | 85 | 64 | 75 |
| Arabert-all-nli-triplet-Matryoshka | 768 | 135M | 83 | 64 | 74 |
| GATE-AraBert-V1 | 768 | 135M | 83 | 63 | 73 |
| AraGemma-Embedding-300m | 768 | 303M | 84 | 62 | 73 |
| **Semantic-Ar-Qwen-Embed-0.6B** | 1024 | 596M | 83 | 61 | 72 |
| Marbert-all-nli-triplet-Matryoshka | 768 | 163M | 82 | 61 | 72 |
| Arabic-labse-Matryoshka | 768 | 471M | 82 | 61 | 72 |
| AraEuroBert-Small | 768 | 210M | 80 | 61 | 71 |
| E5-all-nli-triplet-Matryoshka | 384 | 278M | 80 | 60 | 70 |
| text-embedding-3-large | 3072 | - | 81 | 59 | 70 |
| Arabic-all-nli-triplet-Matryoshka | 768 | 135M | 82 | 54 | 68 |
| AraEuroBert-Mid | 1152 | 610M | 83 | 53 | 68 |
| paraphrase-multilingual-mpnet-base-v2 | 768 | 135M | 79 | 55 | 67 |
| AraEuroBert-Large | 2304 | 2.1B | 79 | 55 | 67 |
| text-embedding-ada-002 | 1536 | - | 71 | 62 | 66 |
| text-embedding-3-small | 1536 | - | 72 | 57 | 65 |
---
### 📌 Insights
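- **Semantic-Ar-Qwen-Embed-V0.1** has the highest **STS17** score (0.8300) but the lowest **STS22.v2** score (0.6130) in the Qwen comparison, which points to specialization on STS17-style data.
- **Qwen3 Embeddings 8B** achieves the **highest average** (0.7450) and the **top STS22.v2** score (0.6680), making it the best all-rounder in this comparison.
- Among the Qwen3 base models, scores rise with model size on both tasks, although the STS22.v2 gain from 4B to 8B is small (0.6669 to 0.6680).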
### Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 32768, 'do_lower_case': False}) with Transformer model: Qwen3Model
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
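The `Pooling` module above has `pooling_mode_mean_tokens` set to `True`, meaning the sentence embedding is the average of the token embeddings. The following is a minimal sketch of masked mean pooling, assuming padding tokens are excluded through an attention mask. The `mean_pool` helper and the toy values are hypothetical; the library's own implementation works on batched tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; padding is ignored."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                summed[d] += vector[d]
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [s / count for s in summed]

# Toy input: three real tokens plus one padding token, 2 dimensions each
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [3.0, 4.0]
```

The padding vector is deliberately large to show that masked positions do not affect the result.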
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Load model from Hugging Face Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/Semantic-Ar-Qwen-Embed-0.6B")
# Sentences for embedding (English + Arabic)
sentences = [
'Left side of a silver train engine.',
'A close-up of a black train engine.',
"One idea that's been going around at least since the 80s is that you can distinguish between Holds and Moves.",
"الجانب الأيسر من محرك قطار فضي.",
"صورة مقربة لمحرك قطار أسود.",
"إحدى الأفكار المتداولة منذ الثمانينات هي إمكانية التمييز بين الثبات والحركة.",
]
# Generate embeddings
embeddings = model.encode(sentences)
print("Embedding shape:", embeddings.shape)
# Prints: Embedding shape: (6, 1024)
# Compute similarity matrix
similarities = model.similarity(embeddings, embeddings)
print("Similarity shape:", similarities.shape)
# Prints: Similarity shape: torch.Size([6, 6])
# Optionally print similarity scores (requires pandas: pip install pandas)
import numpy as np
import pandas as pd
# model.similarity returns a torch.Tensor, so convert it to NumPy before rounding
scores = similarities.cpu().numpy()
df = pd.DataFrame(np.round(scores, 3), index=sentences, columns=sentences)
print("\nSimilarity matrix:\n")
print(df)
```
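The tags for this model list `MatryoshkaLoss`, which trains embeddings so that their leading dimensions remain useful on their own. This card does not state which truncated sizes were used in training, so the sketch below is an assumption-laden illustration: it keeps the first `dim` values of a vector and re-normalizes them. The `truncate_and_normalize` helper and the toy vector are hypothetical.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding length")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]

# Toy 6-dimensional vector standing in for a 1024-dimensional embedding
full = [3.0, 4.0, 0.5, -0.2, 0.1, 0.05]

short = truncate_and_normalize(full, 2)
print(short)  # [0.6, 0.8]
```

In practice, Sentence Transformers can do this at load time through the `truncate_dim` argument of `SentenceTransformer`. Any smaller size should be checked against your own evaluation data before use.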
## Citation
### BibTeX
#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
```
#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```