|
|
--- |
|
|
license: apache-2.0 |
|
|
library_name: transformers |
|
|
language: |
|
|
- ar |
|
|
base_model: |
|
|
- UBC-NLP/MARBERTv2 |
|
|
pipeline_tag: fill-mask |
|
|
tags: |
|
|
- Saudi |
|
|
- Arabic |
|
|
- Embedding |
|
|
--- |
|
|
|
|
|
# SA-BERT-V1: Saudi-Dialect Embeddings |
|
|
|
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/69k3eyIMmiSUV4vcEx1YM.png" alt="SA-BERT-V1 Logo" width="400"/>
|
|
</p> |
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
* **Fine-Tuned Model ID:** Omartificial-Intelligence-Space/SA-BERT-V1 |
|
|
* **License:** Apache 2.0 |
|
|
* **Designed For:** Saudi Dialect |
|
|
* **Model Type:** Sentence-Embedding (BERT encoder with mean-pooling) |
|
|
* **Architecture:** 12-layer Transformer, 768-dim hidden states |
|
|
* **Embedding Size:** 768 |
|
|
* **Pretrained On:** UBC-NLP/MARBERTv2 |
|
|
* **Fine-Tuned On:** Over 500K Saudi-dialect sentences covering diverse topics and regional variations (Hijazi, Najdi, and more) |
|
|
* **Supported Language:** Arabic (Saudi dialect) |
|
|
* **Intended Tasks:** Semantic similarity, clustering, retrieval, downstream classification |
|
|
|
|
|
--- |
|
|
|
|
|
### SA-BERT-V1 delivers strong Saudi-dialect understanding, achieving a +0.0022 in-vs-cross similarity gap and 0.98 mean cosine scores across 44 specialized categories, setting a new standard for Arabic dialect sentence embeddings.
|
|
|
|
|
<div align="center"> |
|
|
<div style="display: flex; justify-content: center;"> |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/8Mm88nfscUbl-PvGS3rkv.png" alt="Similarity Comparison" width="48%" style="margin-right: 2%"/> |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/alS2kB5djEtl3tVcB6TxC.png" alt="Gap Analysis" width="48%"/> |
|
|
</div> |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/rfBiHWV23FKWTgVGcKVM2.png" alt="Model Comparison" width="70%" style="margin-top: 20px"/> |
|
|
</div> |
|
|
|
|
|
▪️ **SA-BERT-V1** shows a positive in-vs-cross gap and high absolute similarity, demonstrating the effectiveness of targeted Saudi-dialect fine-tuning.
|
|
|
|
|
▪️ **In vs Cross:** Both average about 0.98, with a slight positive gap (+0.0023), meaning same-topic sentences embed slightly closer together.
|
|
|
|
|
▪️ **Performance:** Exceptional clustering for Saudi dialect; ideal for retrieval or grouping tasks.
|
|
|
|
|
▪️ The evaluations (both the similarity metrics and the in-vs-cross gap plots) were run on a held-out test set of **1,280 Saudi-dialect sentences covering 44 diverse categories** (e.g., Greetings, Weather, Law & Justice).
|
|
|
|
|
▪️ **Dataset:** created by Omartificial-Intelligence-Space and released for evaluating embedding models. Intra-category and cross-category pairs are sampled from that test set to compute the following (a sketch of the pairing procedure appears after this list):
|
|
|
|
|
◽️ Average in-category / cross-category cosine similarities
◽️ Top-5 most/least similar pairs
◽️ Per-category average similarities
|
|
|
|
|
▪️ **Access Test Samples:** [saudi-dialect-test-samples](https://huggingface.co/datasets/Omartificial-Intelligence-Space/saudi-dialect-test-samples) |
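
The exact evaluation script is not included in this card; the following is a minimal sketch of the pairing procedure described above, not the code used to produce the reported numbers. It assumes the test set has been loaded as `(sentence, category)` tuples and reuses the `embed_sentence` helper defined in the Implementation Example below; function and variable names here are illustrative.

```python
import random
import torch.nn.functional as F

def mean_pair_similarity(pairs, embed_fn):
    """Average cosine similarity over a list of (text_a, text_b) pairs."""
    scores = [
        F.cosine_similarity(embed_fn(a).unsqueeze(0), embed_fn(b).unsqueeze(0)).item()
        for a, b in pairs
    ]
    return sum(scores) / len(scores)

def in_vs_cross_gap(samples, embed_fn, n_pairs=500, seed=0):
    """Sample intra- and cross-category pairs from (sentence, category)
    tuples and return (in_sim, cross_sim, gap)."""
    rng = random.Random(seed)

    # Group sentences by category.
    by_cat = {}
    for text, cat in samples:
        by_cat.setdefault(cat, []).append(text)

    # Intra-category pairs: two different sentences from the same category.
    cats = [c for c, texts in by_cat.items() if len(texts) >= 2]
    in_pairs = [tuple(rng.sample(by_cat[rng.choice(cats)], 2)) for _ in range(n_pairs)]

    # Cross-category pairs: one sentence from each of two different categories.
    cross_pairs = []
    for _ in range(n_pairs):
        c1, c2 = rng.sample(list(by_cat), 2)
        cross_pairs.append((rng.choice(by_cat[c1]), rng.choice(by_cat[c2])))

    in_sim = mean_pair_similarity(in_pairs, embed_fn)
    cross_sim = mean_pair_similarity(cross_pairs, embed_fn)
    return in_sim, cross_sim, in_sim - cross_sim
```

For efficiency at scale you would embed each sentence once and cache the vectors rather than re-encoding per pair, but the sampling logic is the same.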
|
|
|
|
|
--- |
|
|
|
|
|
## Implementation Example |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoTokenizer, AutoModel |
|
|
|
|
|
# Configuration |
|
|
MODEL_ID = "Omartificial-Intelligence-Space/SA-BERT-V1" |
|
|
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
|
|
# Load tokenizer and model |
|
|
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token="PASS_READ_TOKEN_HERE")
model = AutoModel.from_pretrained(MODEL_ID, token="PASS_READ_TOKEN_HERE").to(DEVICE).eval()
|
|
|
|
|
def embed_sentence(text: str) -> torch.Tensor: |
|
|
""" |
|
|
Tokenizes `text`, feeds it through SA-BERT-V1, and returns |
|
|
a 768-dimensional mean-pooled sentence embedding. |
|
|
""" |
|
|
# Encode the text |
|
|
enc = tokenizer( |
|
|
text, |
|
|
truncation=True, |
|
|
padding="max_length", |
|
|
max_length=256, |
|
|
return_tensors="pt" |
|
|
).to(DEVICE) |
|
|
|
|
|
# Forward pass |
|
|
with torch.no_grad(): |
|
|
outputs = model(**enc).last_hidden_state # shape: (1, seq_len, 768) |
|
|
|
|
|
# Mean-pooling over valid tokens |
|
|
mask = enc["attention_mask"].unsqueeze(-1) # shape: (1, seq_len, 1) |
|
|
summed = (outputs * mask).sum(dim=1) # shape: (1, 768) |
|
|
counts = mask.sum(dim=1).clamp(min=1e-9) # shape: (1, 1) |
|
|
embedding = summed / counts # shape: (1, 768) |
|
|
|
|
|
return embedding.squeeze(0) # shape: (768,) |
|
|
|
|
|
# Example usage |
|
|
if __name__ == "__main__": |
|
|
sentences = [ |
|
|
"شتبي من البقالة؟", |
|
|
"كيف حالك؟", |
|
|
"وش رايك في الموضوع هذا؟" |
|
|
] |
|
|
for s in sentences: |
|
|
vec = embed_sentence(s) |
|
|
print(f"Sentence: {s}\nEmbedding shape: {vec.shape}\n") |
|
|
``` |
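
Since the embeddings are intended for semantic similarity and retrieval, a natural next step is to score sentence pairs with cosine similarity. The snippet below is a small illustration that reuses the `embed_sentence` helper defined above; the two example sentences are arbitrary and chosen only for demonstration.

```python
import torch.nn.functional as F

# Reuses `embed_sentence` from the snippet above.
a = embed_sentence("كيف حالك؟")        # "How are you?"
b = embed_sentence("شخبارك اليوم؟")    # "How are you doing today?"

# Cosine similarity between the two 768-dim embeddings.
score = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()
print(f"Cosine similarity: {score:.4f}")
```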
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use SA-BERT-V1 in your research or applications, please cite:
|
|
|
|
|
```bibtex |
|
|
@misc{nacar2025SABERTV1, |
|
|
title={SA-BERT-V1: Fine-Tuned Saudi-Dialect Embeddings}, |
|
|
author={Nacar, Omer and Sibaee, Serry},
|
|
year={2025}, |
|
|
publisher={Omartificial-Intelligence-Space}, |
|
|
howpublished={\url{https://huggingface.co/Omartificial-Intelligence-Space/SA-BERT-V1}}, |
|
|
} |
|
|
|
|
|
@inproceedings{abdul-mageed-etal-2021-arbert, |
|
|
title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic", |
|
|
author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah", |
|
|
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)", |
|
|
year = "2021", |
|
|
publisher = "Association for Computational Linguistics", |
|
|
pages = "7088--7105", |
|
|
} |
|
|
``` |