SA-Retrieval-Embeddings-0.2B

Saudi Arabic Retrieval-Optimized Sentence Embeddings

This model is a retrieval-optimized SentenceTransformer, fine-tuned from Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B, and specifically designed for:

  • Semantic retrieval
  • RAG (Retrieval-Augmented Generation)
  • Paragraph-level semantic search
  • Chunk-based document retrieval
  • Saudi Arabic dialect understanding

Unlike general semantic similarity models, this model is explicitly trained to rank the correct semantic chunk at the top, even among closely related alternatives.


🔍 What makes this model different?

Most Arabic embedding models are trained on pairwise similarity only.
This model goes further by incorporating:

  • Summary → Chunk retrieval supervision
  • Hard negatives from semantic chunk boundaries
  • Triplet-based discrimination
  • In-batch negatives via MNRL (MultipleNegativesRankingLoss)

As a result, it excels in real-world retrieval scenarios, not just sentence similarity.


🧠 Training Overview

  • Base Model: SA-STS-Embeddings-0.2B
  • Training Objective:
    • MultipleNegativesRankingLoss (primary)
    • TripletLoss with hard negatives (boundary-based)
  • Embedding Dimension: 768
  • Pooling Strategy: Mean pooling
  • Max Sequence Length: 512 tokens
  • Training Samples: 4,038+ supervised retrieval examples
  • Precision: FP16
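Conceptually, MultipleNegativesRankingLoss is a cross-entropy over the in-batch similarity matrix: each query's own positive is the correct "class", and every other positive in the batch acts as a negative. The NumPy sketch below illustrates that objective only; it is not the actual training code, and the scale of 20 on cosine similarities mirrors the sentence-transformers default.

```python
import numpy as np

def mnrl_loss(query_embs: np.ndarray, pos_embs: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over in-batch cosine similarities: row i's correct
    'class' is column i (its paired positive); other columns are negatives."""
    # Normalize so dot products equal cosine similarities
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pos_embs / np.linalg.norm(pos_embs, axis=1, keepdims=True)
    sims = scale * (q @ p.T)  # (batch, batch) scaled similarity matrix
    # Row-wise log-softmax; the diagonal holds the true (query, positive) pairs
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: each positive is a slightly perturbed copy of its query,
# so the diagonal dominates and the loss is close to zero
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
loss = mnrl_loss(q, q + 0.01 * rng.normal(size=(4, 8)))
print(loss)
```

The key property for retrieval is that a larger batch automatically supplies more negatives per query, with no extra labeling.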

Training Data

The model was trained using Saudi Semantic Chunking data, where:

  • Each document is split into 3–5 semantic chunks
  • Each chunk has a human-written summary
  • Retrieval task:
    summary → correct chunk among other chunks from the same document

Dataset: 👉 Omartificial-Intelligence-Space/Saudi-Semantic-Chunks
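This supervision scheme can be reproduced from any chunked corpus. The sketch below shows one hypothetical way to turn (summary, chunk) documents into retrieval examples with in-document hard negatives; the field names (`chunks`, `summary`, `text`) are illustrative, not the actual dataset schema.

```python
def build_retrieval_examples(documents):
    """Turn chunked documents into (query, positive, negatives) examples.

    Each document is a dict with 'chunks': a list of
    {'summary': ..., 'text': ...} entries (illustrative schema).
    The summary is the query, its own chunk is the positive, and the
    sibling chunks from the same document serve as hard negatives.
    """
    examples = []
    for doc in documents:
        texts = [c["text"] for c in doc["chunks"]]
        for i, chunk in enumerate(doc["chunks"]):
            negatives = texts[:i] + texts[i + 1:]  # same-document siblings
            examples.append({
                "query": chunk["summary"],
                "positive": chunk["text"],
                "negatives": negatives,
            })
    return examples

docs = [{"chunks": [
    {"summary": "winter travel in AlUla", "text": "AlUla is best visited in winter..."},
    {"summary": "Riyadh traffic", "text": "Riyadh traffic peaks at rush hour..."},
]}]
print(len(build_retrieval_examples(docs)))  # one example per chunk
```

Because the negatives come from the same document, they are topically close to the positive, which is exactly what forces the model to discriminate at chunk boundaries.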


📊 Evaluation Results

The model was evaluated on a hard retrieval benchmark consisting of
1,515 retrieval cases across 24 Saudi domains, using chunk-level negatives.

🏆 Leaderboard Comparison

*(Leaderboard comparison figure)*

Key Takeaways

  • Best Top-1 Accuracy → correct chunk ranked first ~88% of the time
  • Best MRR → correct chunk appears very early in ranking
  • Excellent Recall@5 (99.2%) → ideal for RAG pipelines
  • Highest FinalScore → best overall balance of retrieval + discourse awareness

📐 Metric Definitions

  • Top-1: Correct chunk ranked first
  • MRR: Mean Reciprocal Rank
  • Recall@k: Correct chunk appears in top-k
  • nDCG: Ranking quality with position discount
  • Contrast: (Intra-chunk similarity − Inter-chunk similarity)
  • FinalScore: 0.4 × Top-1 + 0.3 × MRR + 0.2 × Contrast + 0.1 × nDCG
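For reference, the rank-based metrics above can be computed from a ranked candidate list as follows. This is a plain-Python sketch: the FinalScore weights are exactly those stated above, and Contrast is passed in as a precomputed value since it is a similarity statistic rather than a rank metric.

```python
import math

def retrieval_metrics(ranked_ids, correct_id, k=5):
    """Top-1, MRR, Recall@k, and nDCG for a single query, given candidate
    chunk ids in ranked order and the id of the correct chunk."""
    rank = ranked_ids.index(correct_id) + 1  # 1-based rank of the correct chunk
    return {
        "top1": 1.0 if rank == 1 else 0.0,
        "mrr": 1.0 / rank,
        "recall@k": 1.0 if rank <= k else 0.0,
        # One relevant item => ideal DCG is 1, so nDCG = 1 / log2(rank + 1)
        "ndcg": 1.0 / math.log2(rank + 1),
    }

def final_score(top1, mrr, contrast, ndcg):
    # Weights as defined in the model card
    return 0.4 * top1 + 0.3 * mrr + 0.2 * contrast + 0.1 * ndcg

m = retrieval_metrics(["c2", "c1", "c3"], correct_id="c1")  # correct chunk at rank 2
print(m)
```

Averaging these per-query values over all 1,515 benchmark cases yields the aggregate numbers in the leaderboard.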

🧪 Usage

Install:

```bash
pip install -U sentence-transformers
```

Encode sentences:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Omartificial-Intelligence-Space/SA-Retrieval-Embeddings-0.2B"
)

sentences = [
    "أفضل وقت لزيارة العلا في الشتاء",        # "The best time to visit AlUla is in winter"
    "العلا تكون أجمل في الشتاء والجو معتدل",   # "AlUla is most beautiful in winter, and the weather is mild"
    "زحمة الرياض اليوم غير طبيعية",            # "Riyadh's traffic today is unusually heavy"
]

embeddings = model.encode(sentences, normalize_embeddings=True)
```

Rank chunks against a query:

```python
from sklearn.metrics.pairwise import cosine_similarity

query = "أفضل وقت لزيارة أبها"  # "The best time to visit Abha"
chunks = [
    "أبها تتميز بأجواء معتدلة في الصيف.",  # "Abha enjoys mild weather in summer."
    "الرياض مدينة مزدحمة.",                # "Riyadh is a crowded city."
    "مطاعم جدة متنوعة.",                   # "Jeddah's restaurants are varied."
]

q_emb = model.encode(query, normalize_embeddings=True)
c_embs = model.encode(chunks, normalize_embeddings=True)

# Score each chunk by cosine similarity and print them best-first
scores = cosine_similarity([q_emb], c_embs)[0]
for s, c in sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True):
    print(round(float(s), 3), c)
```

🎯 Intended Use

  • RAG systems
  • Semantic search engines
  • Knowledge base retrieval
  • Document chunk retrieval
  • Saudi dialect applications
  • Government & enterprise search

⚠️ Limitations

  • Optimized for Saudi Arabic (dialect + MSA)
  • Not trained for cross-lingual retrieval
  • Not intended for generative tasks
  • Best performance when text is chunked semantically

📖 Citation

```bibtex
@misc{sa_retrieval_embeddings_2025,
  title = {SA-Retrieval-Embeddings-0.2B: Retrieval-Optimized Saudi Arabic Sentence Embeddings},
  author = {Omer Nacar},
  year = {2025},
  publisher = {HuggingFace}
}
```