---
language: hi
license: mit
tags:
- hindi
- embeddings
- sentence-embeddings
- semantic-search
- text-similarity
datasets:
- custom
pipeline_tag: sentence-similarity
library_name: transformers
---

# Hindi Sentence Embeddings Model

This is a custom sentence embedding model trained specifically for Hindi text. It leverages a transformer architecture with specialized pooling strategies to create high-quality semantic representations of Hindi sentences.

## Features

- Specialized for Hindi-language text
- Transformer architecture with relative positional encoding in the attention mechanism
- Multiple pooling strategies (weighted, mean, and attention-based) for richer semantic representations
- L2-normalized embeddings for cosine-similarity comparisons
- Supports semantic search and text-similarity applications

## Usage

### Installation

```bash
pip install torch sentencepiece scikit-learn matplotlib
git lfs install
git clone https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model
cd hindi-embedding-foundational-model
```

### Enhanced RAG System

This model now includes an enhanced RAG (Retrieval-Augmented Generation) system that integrates Unsloth's optimized Llama-3.2-1B-Instruct model for question answering on top of Hindi document retrieval; a minimal programmatic sketch follows the setup steps below.

#### Setup and Installation

1. Install additional dependencies:
```bash
pip install unsloth transformers bitsandbytes accelerate langchain langchain-community faiss-cpu
```

2. Index your documents:
```bash
python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --data_dir ./data --output_dir ./output --index
```

3. Run in interactive QA mode with the LLM:
```bash
python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --output_dir ./output --interactive --qa
```
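
The same pieces can also be wired together in a few lines of Python. The sketch below is illustrative only and is not the implementation in `hindi-rag-system.py`: it reuses the `HindiEmbedder` API shown in the Basic Embedding Usage section together with a Hugging Face `text-generation` pipeline, and the model ID `unsloth/Llama-3.2-1B-Instruct`, the prompt format, and the in-memory document list are assumptions made for the example.

```python
from hindi_embeddings import HindiEmbedder
from transformers import pipeline

# Illustrative retrieval + generation loop (the real script also handles
# FAISS indexing, document chunking, and interactive mode).
embedder = HindiEmbedder("path/to/hindi-embedding-foundational-model")
generator = pipeline("text-generation", model="unsloth/Llama-3.2-1B-Instruct")  # assumed model ID

documents = [
    "दिल्ली भारत की राजधानी है।",
    "हिमालय पर्वत भारत के उत्तर में स्थित है।",
]

def answer(question: str, top_k: int = 2) -> str:
    # Retrieve the most relevant documents with the Hindi embedder.
    hits = embedder.search(question, documents)[:top_k]
    context = "\n".join(hit["document"] for hit in hits)
    # Ask the instruct model to answer from the retrieved context.
    prompt = f"संदर्भ:\n{context}\n\nप्रश्न: {question}\nउत्तर:"
    output = generator(prompt, max_new_tokens=128, return_full_text=False)
    return output[0]["generated_text"]

print(answer("भारत की राजधानी क्या है?"))
```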

### Basic Embedding Usage

```python
from hindi_embeddings import HindiEmbedder

# Initialize the embedder
model = HindiEmbedder("path/to/hindi-embedding-foundational-model")

# Encode sentences to embeddings
sentences = [
    "मुझे हिंदी भाषा बहुत पसंद है।",
    "मैं हिंदी भाषा सीख रहा हूँ।"
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute similarity between sentences
similarity = model.compute_similarity(sentences[0], sentences[1])
print(f"Similarity: {similarity:.4f}")

# Perform semantic search
query = "भारत की राजधानी"
documents = [
    "दिल्ली भारत की राजधानी है।",
    "मुंबई भारत का सबसे बड़ा शहर है।",
    "हिमालय पर्वत भारत के उत्तर में स्थित है।"
]
results = model.search(query, documents)
for i, result in enumerate(results):
    print(f"{i+1}. Score: {result['score']:.4f}")
    print(f"   Document: {result['document']}")

# Visualize embeddings
example_sentences = [
    "मुझे हिंदी में पढ़ना बहुत पसंद है।",
    "आज मौसम बहुत अच्छा है।",
    "भारत एक विशाल देश है।"
]
model.visualize_embeddings(example_sentences)
```

## Model Details

This model uses an advanced transformer-based architecture with the following enhancements:

- Pre-layer normalization for stable training
- Specialized attention mechanism with relative positional encoding
- Multiple pooling strategies (weighted, mean, attention-based)
- L2-normalized vectors for cosine similarity

Technical specifications:

- Embedding dimension: 768
- Hidden dimension: 768
- Layers: 12
- Attention heads: 12
- Vocabulary size: 50,000
- Context length: 128 tokens
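
To make the pooling and normalization steps above concrete, here is a small illustrative sketch of mean pooling over token embeddings followed by L2 normalization. It is a simplified stand-in, not the model's internal implementation (which also uses weighted and attention-based pooling), and the array shapes are assumptions chosen to match the specifications above.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool token vectors into one sentence vector and L2-normalize it.

    token_embeddings: (seq_len, 768) per-token hidden states
    attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    # Average only over non-padding tokens.
    pooled = (token_embeddings * mask).sum(axis=0) / mask.sum()
    # L2-normalize so cosine similarity between sentences reduces to a dot product.
    return pooled / np.linalg.norm(pooled)

# Toy example: 4 tokens with 768-dimensional states, last position is padding.
tokens = np.random.randn(4, 768)
mask = np.array([1, 1, 1, 0])
sentence_vec = mean_pool_and_normalize(tokens, mask)
print(sentence_vec.shape, round(float(np.linalg.norm(sentence_vec)), 4))  # (768,) 1.0
```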

## Applications

- Semantic search and information retrieval
- Text clustering and categorization
- Recommendation systems
- Question answering
- Document similarity comparison
- Content-based filtering
- RAG systems for Hindi-language content

## License

This model is released under the MIT License.

## Citation

If you use this model in your research or application, please cite us:

```bibtex
@misc{DeepMostInnovations2025hindi,
  author = {DeepMost Innovations},
  title = {Hindi Sentence Embeddings Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model}}
}
```