DeepakKumarMSL committed on
Commit 4ff2686 · verified · 1 Parent(s): 14695de

Create README.md

Files changed (1)
  1. README.md +63 -0
README.md ADDED
@@ -0,0 +1,63 @@
# 🧠 Text Similarity Model using Sentence-BERT

This project fine-tunes a Sentence-BERT model (`paraphrase-MiniLM-L6-v2`) on the English subset of the **STS Benchmark** dataset (`stsb_multi_mt`) to perform **semantic similarity scoring** between two text inputs.

---

## 🚀 Features

- 🔁 Fine-tunes `sentence-transformers/paraphrase-MiniLM-L6-v2`
- 🔧 Trained on the `stsb_multi_mt` dataset (English configuration, `en`)
- 🧪 Predicts a cosine-similarity score for each sentence pair, trained against targets scaled to 0–1
- ⚙️ Uses a custom PyTorch model and a manual training loop
- 💾 Saves the trained model as `similarity_model.pt`
- 🧠 Supports inference on custom sentence pairs

---

## 📦 Dependencies

Install the required libraries:

```bash
pip install -q --upgrade transformers datasets sentence-transformers evaluate
```

## 📊 Dataset

- Dataset: `stsb_multi_mt`
- Configuration: `"en"` (English)
- Purpose: Provides sentence pairs with similarity scores ranging from 0 to 5, which are normalized to 0–1 for training.

```python
from datasets import load_dataset

dataset = load_dataset("stsb_multi_mt", name="en", split="train")
# Shuffle, then keep at most 10,000 rows; capping at len(dataset) avoids an
# IndexError when the split holds fewer rows than requested.
subset_size = min(10_000, len(dataset))
dataset = dataset.shuffle(seed=42).select(range(subset_size))
```

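The 0–5 to 0–1 normalization mentioned above amounts to dividing each gold score by 5. A minimal sketch in plain Python; `normalize_score` is a hypothetical helper introduced here for illustration, not a function from the training script.

```python
def normalize_score(score, max_score=5.0):
    """Scale an STS gold score from the 0-5 range into the 0-1 range."""
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score {score} is outside [0, {max_score}]")
    return score / max_score


raw_scores = [0.0, 2.5, 5.0]
normalized = [normalize_score(s) for s in raw_scores]
print(normalized)  # [0.0, 0.5, 1.0]
```

With the `datasets` library, the same helper could be applied through `dataset.map(lambda row: {"label": normalize_score(row["similarity_score"])})`, assuming the score column is named `similarity_score`.
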
## 🏗️ Model Architecture

### ✅ Base Model

- `sentence-transformers/paraphrase-MiniLM-L6-v2` (from Hugging Face)

### ✅ Fine-Tuning

- Cosine similarity is computed between the CLS token embeddings of the two inputs
- Loss: Mean Squared Error (MSE) between the predicted similarity and the true score

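To make the objective concrete, here is the arithmetic behind those two bullets in plain Python. It is only an illustration: the 2-dimensional vectors and the target scores are made up, and the real model would compute the same quantities on transformer embeddings with PyTorch.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mse(predictions, targets):
    """Mean of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


# Toy "embeddings" for three sentence pairs (made up for illustration).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction  -> similarity 1.0
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal      -> similarity 0.0
    ([3.0, 4.0], [4.0, 3.0]),  # 24 / (5 * 5)    -> similarity 0.96
]
targets = [1.0, 0.0, 0.8]  # normalized gold scores (also made up)

predictions = [cosine_similarity(u, v) for u, v in pairs]
loss = mse(predictions, targets)
print(predictions, loss)
```

Only the third pair contributes to the loss here: (0.96 - 0.8)² / 3 ≈ 0.0085. Note that cosine similarity can in general range from -1 to 1, while the training targets lie in 0–1.
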
## 🧠 Training

- Epochs: 3
- Optimizer: Adam
- Loss: `MSELoss`
- Manual training loop using PyTorch

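The settings above map onto a loop of roughly the following shape. This is a sketch under stated assumptions, not the contents of `training_script.py`: `TinyPairModel` is a hypothetical stand-in that swaps the MiniLM encoder for a small `nn.EmbeddingBag` so the example runs offline, the batches are random tensors, and the learning rate is an assumed value that this README does not state.

```python
import torch
import torch.nn.functional as F
from torch import nn


class TinyPairModel(nn.Module):
    """Hypothetical stand-in encoder; the real project uses MiniLM instead."""

    def __init__(self, vocab_size=100, dim=16):
        super().__init__()
        # EmbeddingBag averages the token embeddings of each row (mode="mean").
        self.encoder = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, ids_a, ids_b):
        emb_a = self.encoder(ids_a)
        emb_b = self.encoder(ids_b)
        # One cosine-similarity score per sentence pair in the batch.
        return F.cosine_similarity(emb_a, emb_b, dim=1)


def train(model, batches, epochs=3, lr=1e-3):
    """Manual loop: Adam optimizer, MSE loss; lr is an assumed value."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    history = []
    model.train()
    for _ in range(epochs):
        total = 0.0
        for ids_a, ids_b, target in batches:
            optimizer.zero_grad()
            prediction = model(ids_a, ids_b)
            loss = loss_fn(prediction, target)
            loss.backward()
            optimizer.step()
            total += loss.item()
        history.append(total / len(batches))  # mean loss for this epoch
    return history


torch.manual_seed(0)
# Random token ids and random targets in [0, 1), purely for illustration.
batches = [
    (torch.randint(0, 100, (8, 5)), torch.randint(0, 100, (8, 5)), torch.rand(8))
    for _ in range(4)
]
model = TinyPairModel()
history = train(model, batches)
print(history)
```

With the real encoder, the weights could afterwards be written out with `torch.save(model.state_dict(), "similarity_model.pt")`; whether the project saves the state dict or the whole module is not stated here.
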
## Files and Structure

📦text-similarity-project
 ┣ 📜similarity_model.pt   # Trained PyTorch model
 ┣ 📜training_script.py    # Full training and inference script
 ┗ 📜README.md             # Documentation