|
|
--- |
|
|
license: apache-2.0 |
|
|
library_name: transformers |
|
|
language: |
|
|
- ar |
|
|
base_model: |
|
|
- UBC-NLP/MARBERTv2 |
|
|
pipeline_tag: fill-mask |
|
|
tags: |
|
|
- Saudi |
|
|
- Arabic |
|
|
- Embedding |
|
|
--- |
|
|
|
|
|
# SA-BERT-V1: Saudi-Dialect Embeddings |
|
|
|
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/69k3eyIMmiSUV4vcEx1YM.png" alt="SA-BERT-V1 Logo" width="400"/>
|
|
</p> |
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
* **Fine-Tuned Model ID:** Omartificial-Intelligence-Space/SA-BERT-V1 |
|
|
* **License:** Apache 2.0 |
|
|
* **Designed For:** Saudi Dialect |
|
|
* **Model Type:** Sentence-Embedding (BERT encoder with mean-pooling) |
|
|
* **Architecture:** 12-layer Transformer, 768-dim hidden states |
|
|
* **Embedding Size:** 768 |
|
|
* **Pretrained On:** UBC-NLP/MARBERTv2 |
|
|
* **Fine-Tuned On:** Over 500K Saudi-dialect sentences covering diverse topics and regional variations (Hijazi, Najdi, and more) |
|
|
* **Supported Language:** Arabic (Saudi dialect) |
|
|
* **Intended Tasks:** Semantic similarity, clustering, retrieval, downstream classification |
|
|
|
|
|
--- |
|
|
|
|
|
### SA-BERT-V1 delivers strong Saudi-dialect understanding, achieving a +0.0022 in-vs-cross similarity gap and 0.98 mean cosine scores across 44 specialized categories, setting a new standard for Arabic dialect sentence embeddings.
|
|
|
|
|
<div align="center"> |
|
|
<div style="display: flex; justify-content: center;"> |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/8Mm88nfscUbl-PvGS3rkv.png" alt="Similarity Comparison" width="48%" style="margin-right: 2%"/> |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/alS2kB5djEtl3tVcB6TxC.png" alt="Gap Analysis" width="48%"/> |
|
|
</div> |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/rfBiHWV23FKWTgVGcKVM2.png" alt="Model Comparison" width="70%" style="margin-top: 20px"/> |
|
|
</div> |
|
|
|
|
|
▪️ **SA-BERT-V1** shows a positive in-vs-cross gap and high absolute similarity, demonstrating the effectiveness of targeted Saudi-dialect fine-tuning.
|
|
|
|
|
▪️ **In vs Cross:** Both average about 0.98, with a slight positive gap (+0.0023), meaning same-topic sentences embed slightly closer together.
|
|
|
|
|
▪️ **Performance:** Exceptional clustering for Saudi dialect; ideal for retrieval or grouping tasks.
|
|
|
|
|
▪️ The evaluations (both the similarity metrics and the in-vs-cross gap plots) were run on a held-out test set of **1,280 Saudi-dialect sentences covering 44 diverse categories** (e.g., Greetings, Weather, Law & Justice).
|
|
|
|
|
▪️ **Dataset:** created by Omartificial-Intelligence-Space and released for evaluating embedding models. Intra-category and cross-category pairs are sampled from that test set to compute the following (a sketch of the pairing procedure appears after this list):
|
|
|
|
|
◽️ Average in-category / cross-category cosine similarities
◽️ Top-5 most/least similar pairs
◽️ Per-category average similarities
|
|
|
|
|
▪️ **Access Test Samples:** [saudi-dialect-test-samples](https://huggingface.co/datasets/Omartificial-Intelligence-Space/saudi-dialect-test-samples) |
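
The exact evaluation script is not included in this card; the following is a minimal sketch of the pairing procedure described above, not the code used to produce the reported numbers. It assumes the test set has been loaded as `(sentence, category)` tuples and reuses the `embed_sentence` helper defined in the Implementation Example below; function and variable names here are illustrative.

```python
import random
import torch.nn.functional as F

def mean_pair_similarity(pairs, embed_fn):
    """Average cosine similarity over a list of (text_a, text_b) pairs."""
    scores = [
        F.cosine_similarity(embed_fn(a).unsqueeze(0), embed_fn(b).unsqueeze(0)).item()
        for a, b in pairs
    ]
    return sum(scores) / len(scores)

def in_vs_cross_gap(samples, embed_fn, n_pairs=500, seed=0):
    """Sample intra- and cross-category pairs from (sentence, category)
    tuples and return (in_sim, cross_sim, gap)."""
    rng = random.Random(seed)

    # Group sentences by category.
    by_cat = {}
    for text, cat in samples:
        by_cat.setdefault(cat, []).append(text)

    # Intra-category pairs: two different sentences from the same category.
    cats = [c for c, texts in by_cat.items() if len(texts) >= 2]
    in_pairs = [tuple(rng.sample(by_cat[rng.choice(cats)], 2)) for _ in range(n_pairs)]

    # Cross-category pairs: one sentence from each of two different categories.
    cross_pairs = []
    for _ in range(n_pairs):
        c1, c2 = rng.sample(list(by_cat), 2)
        cross_pairs.append((rng.choice(by_cat[c1]), rng.choice(by_cat[c2])))

    in_sim = mean_pair_similarity(in_pairs, embed_fn)
    cross_sim = mean_pair_similarity(cross_pairs, embed_fn)
    return in_sim, cross_sim, in_sim - cross_sim
```

For efficiency at scale you would embed each sentence once and cache the vectors rather than re-encoding per pair, but the sampling logic is the same.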
|
|
|
|
|
--- |
|
|
|
|
|
## Implementation Example |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoTokenizer, AutoModel |
|
|
|
|
|
# Configuration |
|
|
MODEL_ID = "Omartificial-Intelligence-Space/SA-BERT-V1" |
|
|
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
|
|
# Load tokenizer and model |
|
|
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token="PASS_READ_TOKEN_HERE")
model = AutoModel.from_pretrained(MODEL_ID, token="PASS_READ_TOKEN_HERE").to(DEVICE).eval()
|
|
|
|
|
def embed_sentence(text: str) -> torch.Tensor: |
|
|
""" |
|
|
Tokenizes `text`, feeds it through SA-BERT-V1, and returns |
|
|
a 768-dimensional mean-pooled sentence embedding. |
|
|
""" |
|
|
# Encode the text |
|
|
enc = tokenizer( |
|
|
text, |
|
|
truncation=True, |
|
|
padding="max_length", |
|
|
max_length=256, |
|
|
return_tensors="pt" |
|
|
).to(DEVICE) |
|
|
|
|
|
# Forward pass |
|
|
with torch.no_grad(): |
|
|
outputs = model(**enc).last_hidden_state # shape: (1, seq_len, 768) |
|
|
|
|
|
# Mean-pooling over valid tokens |
|
|
mask = enc["attention_mask"].unsqueeze(-1) # shape: (1, seq_len, 1) |
|
|
summed = (outputs * mask).sum(dim=1) # shape: (1, 768) |
|
|
counts = mask.sum(dim=1).clamp(min=1e-9) # shape: (1, 1) |
|
|
embedding = summed / counts # shape: (1, 768) |
|
|
|
|
|
return embedding.squeeze(0) # shape: (768,) |
|
|
|
|
|
# Example usage |
|
|
if __name__ == "__main__": |
|
|
sentences = [ |
|
|
"شتبي من البقالة؟", |
|
|
"كيف حالك؟", |
|
|
"وش رايك في الموضوع هذا؟" |
|
|
] |
|
|
for s in sentences: |
|
|
vec = embed_sentence(s) |
|
|
print(f"Sentence: {s}\nEmbedding shape: {vec.shape}\n") |
|
|
``` |
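
Since the embeddings are intended for semantic similarity and retrieval, a natural next step is to score sentence pairs with cosine similarity. The snippet below is a small illustration that reuses the `embed_sentence` helper defined above; the two example sentences are arbitrary and chosen only for demonstration.

```python
import torch.nn.functional as F

# Reuses `embed_sentence` from the snippet above.
a = embed_sentence("كيف حالك؟")        # "How are you?"
b = embed_sentence("شخبارك اليوم؟")    # "How are you doing today?"

# Cosine similarity between the two 768-dim embeddings.
score = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()
print(f"Cosine similarity: {score:.4f}")
```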
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use SA-BERT-V1 in your research or applications, please cite:
|
|
|
|
|
```bibtex |
|
|
@misc{nacar2025SABERTV1, |
|
|
title={SA-BERT-V1: Fine-Tuned Saudi-Dialect Embeddings}, |
|
|
author={Nacar, Omer and Sibaee, Serry},
|
|
year={2025}, |
|
|
publisher={Omartificial-Intelligence-Space}, |
|
|
howpublished={\url{https://huggingface.co/Omartificial-Intelligence-Space/SA-BERT-V1}}, |
|
|
} |
|
|
|
|
|
@inproceedings{abdul-mageed-etal-2021-arbert, |
|
|
title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic", |
|
|
author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah", |
|
|
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)", |
|
|
year = "2021", |
|
|
publisher = "Association for Computational Linguistics", |
|
|
pages = "7088--7105", |
|
|
} |
|
|
``` |