---
language:
- ar
- en
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Qwen/Qwen3-Embedding-0.6B
widget:
- source_sentence: >-
    أقترح أن تجد بنكًا في بلدك المحلي، وأن تفكر في فتح حساب مصرفي مقوم باليورو
    لديهم.
  sentences:
  - يمكنك مزج هذه الأمور، ولكن من تجربتي، سيكون الأمر صعبًا جدًا في البداية.
  - المرأة تضع ظلال العيون بقلم.
  - لست متأكدًا مما إذا كان بإمكانك فتح حساب مصرفي في فرنسا إذا لم تكن مقيمًا.
- source_sentence: صورة بالأبيض والأسود لموجة تتحطم في المحيط.
  sentences:
  - كلب صغير أسود في المحيط مع بعض الصخور في الخلفية.
  - امرأة تركب فيلًا.
  - طائر أصفر وبرتقالي متمسك بجانب قفص.
- source_sentence: >-
    إذا تمكنت من تجاوز "عامل الاشمئزاز"، فسيكون لديك مصدر سهل الاستخدام من
    السماد العضوي النيتروجيني.
  sentences:
  - أرقام NPK على السماد تمثل النسبة المئوية، بالوزن، للنيتروجين وP2O5 وK2O.
  - تجميع ويكيبيديا لقواعد السفر عبر الزمن هو مصدر جيد لفهم هذا الموضوع.
  - رجل يعزف على الناي.
- source_sentence: رجل يتحدث.
  sentences:
  - رجل يرقص.
  - أسد الجبل يطارد دبًا.
  - >-
    لأغراض الشمول، يحتوي برنامج Pages من Apple على العديد من قوالب الملصقات
    الجيدة.
- source_sentence: الجانب الأيسر من محرك قطار فضي.
  sentences:
  - قرد يركب حافلة.
  - >-
    إحدى الأفكار التي كانت تُطرح منذ الثمانينات هي أنه يمكنك التمييز بين
    "الحركات" و"الثبات".
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: apache-2.0
---

# Semantic-Ar-Qwen-Embed-0.6B

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) on semantic textual similarity (STS) tasks to improve Arabic semantic understanding.
It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) <!-- at revision a579a21d7aff542145eebef8d60ed73ec281a0b4 -->
- **Maximum Sequence Length:** 32768 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** ar
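
The model tags list `MatryoshkaLoss`, which trains embeddings so that their leading components stay useful when the vector is shortened. The sketch below shows how such truncation is commonly applied before cosine similarity. It is an illustration only: the target size of 256 is an assumption (this card does not state which sizes were trained), the helper name is hypothetical, and random vectors stand in for real model output.

```python
import numpy as np


def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and rescale rows to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)


# Random stand-ins for four 1024-dimensional sentence embeddings
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1024))

# Assumed target size; not confirmed by this card
small = truncate_and_normalize(full, dim=256)

# With unit-length rows, the dot product equals cosine similarity
cosine = small @ small.T
print(small.shape)
```

Recent versions of Sentence Transformers also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which handles the truncation step at encoding time.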

### 📊 Performance Evaluation

This model was evaluated on Arabic semantic similarity benchmarks using the [MTEB](https://github.com/embeddings-benchmark/mteb) framework. The table below reports **Spearman correlation scores** for two tasks, **STS17** and **STS22.v2**, together with their average.

| **Model**                        | **STS17 (Spearman)** | **STS22.v2 (Spearman)** | **Average** |
|----------------------------------|----------------------|--------------------------|-------------|
| Qwen3 Embeddings 0.6B            | 0.7505               | 0.6520                   | 0.7013      |
| Qwen3 Embeddings 4B              | 0.7912               | 0.6669                   | 0.7291      |
| Qwen3 Embeddings 8B              | 0.8220               | **0.6680**               | 0.7450      |
| Semantic-Ar-Qwen-Embed-V0.1      | **0.8300**           | 0.6130                   | 0.7215      |

> ✅ **STS17**: Sentence similarity from classical Arabic benchmarks
> 🧪 **STS22.v2**: Diverse, multi-domain Arabic similarity pairs
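
As a rough illustration of what these Spearman scores measure: each sentence pair receives a cosine similarity from the model, and the score is the rank correlation between those similarities and the human gold ratings. The sketch below uses made-up numbers and a simple rank helper that assumes no tied values; MTEB's own evaluation code may differ in its details.

```python
import numpy as np


def ranks(values: np.ndarray) -> np.ndarray:
    """Rank positions (0 = smallest); assumes all values are distinct."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(len(values))
    return result


def spearman(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation of the rank vectors (Spearman's rho without tie handling)."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])


# Hypothetical gold ratings and model cosine similarities for five sentence pairs
gold_scores = np.array([4.8, 0.5, 3.2, 1.9, 2.6])
model_cosines = np.array([0.91, 0.12, 0.55, 0.40, 0.61])

rho = spearman(gold_scores, model_cosines)
print(round(rho, 3))  # 0.9: only two neighbouring pairs are ranked in swapped order
```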

#### Comparison with other models

Scores in this table are Spearman correlations multiplied by 100 and rounded.

| Model                                    | Dim  | # Params. | STS17 | STS22-v2 | Average |
|------------------------------------------|------|-----------|-------|----------|---------|
| Arabic-Triplet-Matryoshka-V2             | 768  | 135M      | 85    | 64       | 75      |
| Arabert-all-nli-triplet-Matryoshka       | 768  | 135M      | 83    | 64       | 74      |
| GATE-AraBert-V1                          | 768  | 135M      | 83    | 63       | 73      |
| AraGemma-Embedding-300m                  | 768  | 303M      | 84    | 62       | 73      |
| **Semantic-Ar-Qwen-Embed-0.6B**          | 1024 | 596M      | 83    | 61       | 72      |
| Marbert-all-nli-triplet-Matryoshka       | 768  | 163M      | 82    | 61       | 72      |
| Arabic-labse-Matryoshka                  | 768  | 471M      | 82    | 61       | 72      |
| AraEuroBert-Small                        | 768  | 210M      | 80    | 61       | 71      |
| E5-all-nli-triplet-Matryoshka            | 384  | 278M      | 80    | 60       | 70      |
| text-embedding-3-large                   | 3072 | -         | 81    | 59       | 70      |
| Arabic-all-nli-triplet-Matryoshka        | 768  | 135M      | 82    | 54       | 68      |
| AraEuroBert-Mid                          | 1152 | 610M      | 83    | 53       | 68      |
| paraphrase-multilingual-mpnet-base-v2    | 768  | 135M      | 79    | 55       | 67      |
| AraEuroBert-Large                        | 2304 | 2.1B      | 79    | 55       | 67      |
| text-embedding-ada-002                   | 1536 | -         | 71    | 62       | 66      |
| text-embedding-3-small                   | 1536 | -         | 72    | 57       | 65      |

---

### 📌 Insights
- **Semantic-Ar-Qwen-Embed-V0.1** has the highest **STS17** score (0.8300) in the Qwen comparison but the lowest **STS22.v2** score (0.6130), which suggests specialization toward STS17-style data.
- **Qwen3 8B** achieves the **highest average** (0.7450) and the **top STS22.v2** score (0.6680), making it the best all-rounder in that comparison.
- Among the base Qwen3 embedding models, scores rise with model size on both tasks (0.6B < 4B < 8B).
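
The Average column in the Qwen comparison is the plain mean of the two task scores. A quick check using the values copied from that table (the variable names are illustrative):

```python
# (STS17, STS22.v2) Spearman scores copied from the Qwen comparison table
scores = {
    "Qwen3 Embeddings 0.6B": (0.7505, 0.6520),
    "Qwen3 Embeddings 4B": (0.7912, 0.6669),
    "Qwen3 Embeddings 8B": (0.8220, 0.6680),
    "Semantic-Ar-Qwen-Embed-V0.1": (0.8300, 0.6130),
}

# Unweighted mean of the two tasks for each model
averages = {name: (sts17 + sts22) / 2 for name, (sts17, sts22) in scores.items()}

# Model with the highest average, and the one with the highest STS17 score
best_average = max(averages, key=averages.get)
best_sts17 = max(scores, key=lambda name: scores[name][0])

print("Highest average:", best_average)
print("Highest STS17:", best_sts17)
```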

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 32768, 'do_lower_case': False}) with Transformer model: Qwen3Model
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
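
The `Pooling` module above has `pooling_mode_mean_tokens` enabled, so a sentence embedding is the average of its token embeddings with padding positions excluded. A minimal NumPy sketch of that operation on toy shapes follows; `mean_pool` is an illustrative name, not the library's own function, and the real module operates on the transformer's 1024-dimensional outputs.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, counting only positions where the mask is 1."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against empty sequences
    return summed / counts


# Toy batch: 2 sequences, 3 token slots each, 4-dimensional token vectors
tokens = np.array([
    [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [9.0, 9.0, 9.0, 9.0]],  # third slot is padding
    [[0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0], [4.0, 4.0, 4.0, 4.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(tokens, mask)
print(pooled)  # both rows average to 2.0 in every dimension; the padded slot is ignored
```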

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Load model from Hugging Face Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/Semantic-Ar-Qwen-Embed-0.6B")

# Sentences for embedding (English + Arabic)
sentences = [
    "Left side of a silver train engine.",
    "A close-up of a black train engine.",
    "One idea that's been going around at least since the 80s is that you can distinguish between Holds and Moves.",

    "الجانب الأيسر من محرك قطار فضي.",
    "صورة مقربة لمحرك قطار أسود.",
    "إحدى الأفكار المتداولة منذ الثمانينات هي إمكانية التمييز بين الثبات والحركة.",
]

# Generate embeddings (a NumPy array by default)
embeddings = model.encode(sentences)
print("Embedding shape:", embeddings.shape)
# Embedding shape: (6, 1024)

# Compute the pairwise similarity matrix (returned as a torch.Tensor)
similarities = model.similarity(embeddings, embeddings)
print("Similarity shape:", similarities.shape)
# Similarity shape: torch.Size([6, 6])

# Optionally print similarity scores as a labeled table (requires pandas)
import numpy as np
import pandas as pd

# Convert the tensor to NumPy before rounding
scores = np.round(similarities.cpu().numpy(), 3)
df = pd.DataFrame(scores, index=sentences, columns=sentences)
print("\nSimilarity matrix:\n")
print(df)
```
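
The same embeddings can drive semantic search, one of the uses listed at the top of this card. The sketch below ranks corpus vectors against a query by cosine similarity. It assumes the vectors were already produced with `model.encode`; small toy vectors are used here so the snippet runs without downloading the model, and `top_k_matches` is a hypothetical helper name.

```python
import numpy as np


def top_k_matches(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 2):
    """Return (index, cosine similarity) pairs for the k corpus rows most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q
    best = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in best]


# Toy 3-dimensional stand-ins for corpus and query embeddings
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([0.9, 0.1, 0.0])

matches = top_k_matches(query, corpus, k=2)
for index, score in matches:
    print(index, round(score, 3))
```

With real data, `corpus` would be `model.encode(documents)` and `query` would be `model.encode(question)`.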

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```