metadata

language: hi
license: mit
tags:
  - hindi
  - embeddings
  - sentence-embeddings
  - semantic-search
  - text-similarity
datasets:
  - custom
pipeline_tag: sentence-similarity
library_name: transformers

Hindi Sentence Embeddings Model

This is a custom sentence embedding model trained specifically for Hindi text. It uses a transformer architecture with several pooling strategies to produce semantic vector representations of Hindi sentences.

Features

Specialized for Hindi language text
Transformer architecture with pre-layer normalization and relative positional encoding
Multiple pooling strategies (weighted, mean, attention-based)
Produces L2-normalized vectors suitable for cosine similarity
Supports semantic search and text similarity applications

Usage

Installation

pip install torch sentencepiece scikit-learn matplotlib
git lfs install 
git clone https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model
cd hindi-embedding-foundational-model

Quick Start

from hindi_embeddings import HindiEmbedder

# Initialize the embedder
model = HindiEmbedder("path/to/hindi-embedding-foundational-model")

# Encode sentences to embeddings
sentences = [
    "मुझे हिंदी भाषा बहुत पसंद है।",
    "मैं हिंदी भाषा सीख रहा हूँ।"
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute similarity between sentences
similarity = model.compute_similarity(sentences[0], sentences[1])
print(f"Similarity: {similarity:.4f}")

# Perform semantic search
query = "भारत की राजधानी"
documents = [
    "दिल्ली भारत की राजधानी है।",
    "मुंबई भारत का सबसे बड़ा शहर है।",
    "हिमालय पर्वत भारत के उत्तर में स्थित है।"
]
results = model.search(query, documents)
for i, result in enumerate(results):
    print(f"{i+1}. Score: {result['score']:.4f}")
    print(f"   Document: {result['document']}")

# Visualize embeddings
example_sentences = [
    "मुझे हिंदी में पढ़ना बहुत पसंद है।",
    "आज मौसम बहुत अच्छा है।",
    "भारत एक विशाल देश है।"
]
model.visualize_embeddings(example_sentences)
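
The semantic search step above returns results with 'score' and 'document' keys. How HindiEmbedder.search computes them is not documented here, so the following is only a minimal sketch of one common approach: rank documents by cosine similarity to the query vector. The functions cosine_similarity and rank_documents, the top_k parameter, and the toy 3-dimensional vectors are all hypothetical and stand in for real model embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs, documents, top_k=3):
    """Hypothetical helper: score each document against the query, best first."""
    scored = [
        {"document": doc, "score": cosine_similarity(query_vec, vec)}
        for doc, vec in zip(documents, doc_vecs)
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for model.encode(...) output (assumption).
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
documents = ["doc A", "doc B", "doc C"]

results = rank_documents(query_vec, doc_vecs, documents)
for i, result in enumerate(results):
    print(f"{i+1}. Score: {result['score']:.4f}  Document: {result['document']}")
```

With real embeddings, query_vec and doc_vecs would come from model.encode, and the result list could be consumed exactly like the loop in the Quick Start.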

Model Details

The model uses a transformer-based architecture with the following design choices:

Pre-layer normalization for stable training
Specialized attention mechanism with relative positional encoding
Multiple pooling strategies (weighted, mean, attention-based)
L2-normalized vectors for cosine similarity
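
To make the pooling and normalization items above concrete, here is a small sketch of mean pooling followed by L2 normalization, written with plain Python lists so it runs without dependencies. It is not the model's actual pooling code: mean_pool, l2_normalize, the eps guard, and the toy token vectors are assumptions for illustration, and the weighted and attention-based variants are not shown.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        return total  # no real tokens: return the zero vector
    return [value / count for value in total]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

# Two real tokens and one padding token (toy 2-dimensional values).
tokens = [[3.0, 0.0], [1.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
sentence_vec = l2_normalize(pooled)

# For unit-length vectors, the dot product equals the cosine similarity.
other_vec = l2_normalize([1.0, 0.0])
cosine = sum(a * b for a, b in zip(sentence_vec, other_vec))
print(pooled, sentence_vec, cosine)
```

Normalizing once at encoding time means later similarity computations reduce to dot products, which is why L2-normalized outputs pair naturally with cosine similarity.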

Technical specifications:

Embedding dimension: 768
Hidden dimension: 768
Layers: 12
Attention heads: 12
Vocabulary size: 50,000
Context length: 128 tokens
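
A few quantities follow directly from these specifications, and the short sketch below works them out. The per-head dimension and embedding table size are plain arithmetic on the listed numbers. The float32 storage size and the truncate helper are assumptions: the card does not state the output dtype or how inputs longer than 128 tokens are handled.

```python
embedding_dim = 768
num_heads = 12
vocab_size = 50_000
max_tokens = 128

# Each attention head works on an equal slice of the embedding.
assert embedding_dim % num_heads == 0
head_dim = embedding_dim // num_heads

# Parameters in the token embedding table alone (vocabulary x dimension).
embedding_table_params = vocab_size * embedding_dim

# Storage for one sentence vector, assuming float32 output (assumption).
bytes_per_vector = embedding_dim * 4

def truncate(token_ids, limit=max_tokens):
    """Hypothetical helper: keep only the first `limit` token ids."""
    return token_ids[:limit]

print(head_dim, embedding_table_params, bytes_per_vector)
print(len(truncate(list(range(200)))))
```

In practice, text beyond the context length contributes nothing to the embedding if it is truncated this way, so long documents may need to be split into passages before encoding.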

Applications

Semantic search and information retrieval
Text clustering and categorization
Recommendation systems
Question answering
Document similarity comparison
Content-based filtering

License

This model is released under the MIT License.

Citation

If you use this model in your research or application, please cite us:

@misc{DeepMostInnovations2025hindi,
  author = {DeepMost Innovations},
  title = {Hindi Sentence Embeddings Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model}}
}