foochun committed on
Commit 9ef22a0 · verified · 1 Parent(s): 180df1d

256 Dimension updated

Files changed (3):
  1. 2_Dense/model.safetensors +1 -1
  2. README.md +45 -45
  3. model.safetensors +1 -1
2_Dense/model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8eadfa9595c8f175d2a5113f17d40d956f408b29cd32aa5e6523dc473034ec2f
+ oid sha256:d0fae3f9a09ca238049f5af1b058df30df52a7d29669b1b46037926ecf90eb2c
  size 1049760
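The pointer swap above is the substance of this commit: the Dense head that follows the 1024-dimensional BGE encoder was retrained, and per the commit message it now projects to 256 dimensions. As a hedged sanity check (the tensor names below follow the usual sentence-transformers Dense-module layout and are an assumption, not read from this repo), you can inspect the shapes directly:

```python
# Sketch: inspect the updated Dense head after downloading the repo.
# Assumes the conventional sentence-transformers tensor names
# "linear.weight" / "linear.bias"; adjust if the file differs.
from safetensors import safe_open

with safe_open("2_Dense/model.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        print(name, tuple(f.get_tensor(name).shape))
```

A float32 1024 -> 256 projection with bias holds 1024 * 256 * 4 + 256 * 4 = 1,049,600 bytes of tensor data, which is consistent with the 1,049,760-byte file once the small safetensors header is included.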
README.md CHANGED
@@ -4,35 +4,35 @@ tags:
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
- - dataset_size:69216
+ - dataset_size:72454
  - loss:MultipleNegativesRankingLoss
  base_model: BAAI/bge-large-en-v1.5
  widget:
- - source_sentence: ajith s/o sockalingam
+ - source_sentence: mahathir bin mohamad
  sentences:
- - ajith a/l sockalingam
- - marcus ping yi ng
- - ajith a/p sockalingam
- - source_sentence: quinn kwan xin fang
+ - mahathir mohamad
+ - mahathir bin ismail
+ - nazhan hafiz rahmat
+ - source_sentence: siow yin heng
  sentences:
- - ambiga a/p jacob
- - quinn fang kwan xin
- - xin kwan fang
- - source_sentence: brandon teh min ling
+ - siu xin loh daniel
+ - yin heng siow
+ - siow heng yin
+ - source_sentence: fadzil bin othman
  sentences:
- - victor bing yong ng
- - min ling teh brandon
- - ling min teh brandon
- - source_sentence: carmen ho xin jun
+ - izzah auni binti zulkifli
+ - fadzil othman
+ - mariam binti hassan
+ - source_sentence: raja muhammad syamil bin raja ishak
  sentences:
- - xin ho jun carmen
- - pei ho yi grace
- - xin jun ho carmen
- - source_sentence: alicia lim siu ling
+ - ridzuan bin hashim
+ - meng leong fang
+ - raja muhd syamil bin raja ishak
+ - source_sentence: felix koh bing sheng
  sentences:
- - lim ling siu alicia
- - alicia siu ling lim
- - nadia soh meng jun
+ - olivia sinnathamby
+ - bing koh sheng
+ - koh bing sheng
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  ---
@@ -87,9 +87,9 @@ from sentence_transformers import SentenceTransformer
  model = SentenceTransformer("foochun/bge-large-finetuned")
  # Run inference
  sentences = [
- 'alicia lim siu ling',
- 'alicia siu ling lim',
- 'lim ling siu alicia',
+ 'felix koh bing sheng',
+ 'koh bing sheng',
+ 'bing koh sheng',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
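Not part of the card, but a natural follow-on to the snippet in this hunk: score the two reorderings against the query and check the embedding width. `model.similarity` is the sentence-transformers v3+ helper; the `(3, 256)` shape is what the commit message implies rather than something verified here.

```python
# Hypothetical follow-on to the README snippet above.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("foochun/bge-large-finetuned")
embeddings = model.encode([
    "felix koh bing sheng",  # query
    "koh bing sheng",        # reordering of the same name
    "bing koh sheng",        # another reordering
])
print(embeddings.shape)  # expect (3, 256) after this commit

# Query-vs-candidate similarity; higher score = more likely the same person.
print(model.similarity(embeddings[:1], embeddings[1:]))
```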
@@ -143,19 +143,19 @@ You can finetune this model on your own dataset.
 
  #### Unnamed Dataset
 
- * Size: 69,216 training samples
+ * Size: 72,454 training samples
  * Columns: <code>query</code>, <code>pos</code>, and <code>neg</code>
  * Approximate statistics based on the first 1000 samples:
  | | query | pos | neg |
  |:--------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|
  | type | string | string | string |
- | details | <ul><li>min: 4 tokens</li><li>mean: 8.96 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 8.22 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 8.47 tokens</li><li>max: 16 tokens</li></ul> |
+ | details | <ul><li>min: 4 tokens</li><li>mean: 8.06 tokens</li><li>max: 16 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.59 tokens</li><li>max: 16 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.76 tokens</li><li>max: 16 tokens</li></ul> |
  * Samples:
- | query | pos | neg |
- |:-----------------------------------|:-------------------------------|:------------------------------|
- | <code>abdul karim bin bakar</code> | <code>abdul karim bakar</code> | <code>johan bin hamid</code> |
- | <code>rupai anak jamit</code> | <code>rupai jamit</code> | <code>rupai anak karim</code> |
- | <code>sim kim ning</code> | <code>ning sim kim</code> | <code>kim sim ning</code> |
+ | query | pos | neg |
+ |:-------------------------------------|:-------------------------------------|:---------------------------------|
+ | <code>ahmad faisal bin zainal</code> | <code>ahmad faisal bin zainal</code> | <code>sakinah binti jamil</code> |
+ | <code>daniel lim ling ee</code> | <code>lim ling ee daniel</code> | <code>ee ling lim</code> |
+ | <code>lau sze sheng</code> | <code>sheng lau sze</code> | <code>lau sz sheng</code> |
  * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
@@ -168,19 +168,19 @@ You can finetune this model on your own dataset.
 
  #### Unnamed Dataset
 
- * Size: 9,887 evaluation samples
+ * Size: 10,350 evaluation samples
  * Columns: <code>query</code>, <code>pos</code>, and <code>neg</code>
  * Approximate statistics based on the first 1000 samples:
  | | query | pos | neg |
  |:--------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|
  | type | string | string | string |
- | details | <ul><li>min: 4 tokens</li><li>mean: 7.86 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.38 tokens</li><li>max: 16 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.65 tokens</li><li>max: 16 tokens</li></ul> |
+ | details | <ul><li>min: 4 tokens</li><li>mean: 8.06 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.59 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.81 tokens</li><li>max: 16 tokens</li></ul> |
  * Samples:
- | query | pos | neg |
- |:------------------------------------|:---------------------------------------|:------------------------------------|
- | <code>mohd ridzuan bin nasir</code> | <code>mohamad ridzuan bin nasir</code> | <code>mohd ridzuan bin naser</code> |
- | <code>isabel koh jun liang</code> | <code>isabel koh jun liang</code> | <code>liang jun koh isabel</code> |
- | <code>neo mei chuan</code> | <code>neo mei chuan</code> | <code>mak mei chuan</code> |
+ | query | pos | neg |
+ |:-------------------------------------------------|:-------------------------------------------------|:------------------------------------------|
+ | <code>xavier loh ling sheng</code> | <code>loh ling sheng xavier</code> | <code>loh sheng ling xavier</code> |
+ | <code>chan poh king</code> | <code>chan poh king</code> | <code>chan king poh</code> |
+ | <code>siti suzelita sazrikin binti hassan</code> | <code>siti suzelita sazrikin binti hassan</code> | <code>roslilawati binti hj mukhtar</code> |
  * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
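Both dataset hunks feed the MultipleNegativesRankingLoss linked above. As a hedged illustration only (the actual training script is not part of this diff; the three rows are lifted from the training-sample table, the trainer wiring is the standard sentence-transformers v3 pattern, and everything else is assumption), the triplet columns map onto the loss like this:

```python
# Sketch of how (query, pos, neg) triplets drive MultipleNegativesRankingLoss
# in sentence-transformers v3+. Not the author's actual training script.
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Three rows copied from the training-sample table above.
train_dataset = Dataset.from_dict({
    "query": ["ahmad faisal bin zainal", "daniel lim ling ee", "lau sze sheng"],
    "pos":   ["ahmad faisal bin zainal", "lim ling ee daniel", "sheng lau sze"],
    "neg":   ["sakinah binti jamil",     "ee ling lim",        "lau sz sheng"],
})

# For each query, its paired `pos` is the target; its own `neg` plus all other
# in-batch `pos`/`neg` entries act as negatives in a softmax over similarities.
loss = MultipleNegativesRankingLoss(model)
trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```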
@@ -325,12 +325,12 @@ You can finetune this model on your own dataset.
  ### Training Logs
  | Epoch | Step | Training Loss | Validation Loss |
  |:----------:|:--------:|:-------------:|:---------------:|
- | 0.4621 | 500 | 0.1357 | 0.0127 |
- | 0.9242 | 1000 | 0.0149 | 0.0065 |
- | 1.3863 | 1500 | 0.0079 | 0.0065 |
- | 1.8484 | 2000 | 0.0069 | 0.0043 |
- | 2.3105 | 2500 | 0.0059 | 0.0040 |
- | **2.7726** | **3000** | **0.0052** | **0.0039** |
+ | 0.4413 | 500 | 0.1568 | 0.0153 |
+ | 0.8826 | 1000 | 0.0155 | 0.0073 |
+ | 1.3239 | 1500 | 0.0086 | 0.0064 |
+ | 1.7652 | 2000 | 0.0067 | 0.0054 |
+ | 2.2065 | 2500 | 0.0059 | 0.0050 |
+ | **2.6478** | **3000** | **0.0052** | **0.0049** |
 
  * The bold row denotes the saved checkpoint.
 
@@ -340,7 +340,7 @@ You can finetune this model on your own dataset.
  - Transformers: 4.51.3
  - PyTorch: 2.6.0+cu124
  - Accelerate: 1.6.0
- - Datasets: 3.5.1
+ - Datasets: 3.6.0
  - Tokenizers: 0.21.1
 
  ## Citation
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:15b52f7abf658111d9430675ac14595f44e24a6d62b078f77ee10351c0ce222f
+ oid sha256:8ed937208fea5f17d65ecb178dd0a6fd0db166daaeae588de942ebe415b59216
  size 1340612432
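Both safetensors entries in this commit are Git LFS pointer files — `version`, `oid sha256:<hash>`, and `size`, per the LFS spec — and only the oid changes. To confirm a pulled file matches its pointer, a plain hash check suffices (the file path below is illustrative):

```python
# Sketch: verify a downloaded LFS object against the `oid sha256:` line in its
# pointer. After `git lfs pull`, this should print the new oid and the size
# shown in the diff above.
import hashlib
import os

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("model.safetensors"), os.path.getsize("model.safetensors"))
```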