foochun committed on
Commit 9ef22a0 · verified · 1 Parent(s): 180df1d

256 Dimension updated

Files changed (3):
  1. 2_Dense/model.safetensors +1 -1
  2. README.md +45 -45
  3. model.safetensors +1 -1
2_Dense/model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8eadfa9595c8f175d2a5113f17d40d956f408b29cd32aa5e6523dc473034ec2f
+ oid sha256:d0fae3f9a09ca238049f5af1b058df30df52a7d29669b1b46037926ecf90eb2c
  size 1049760
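The pointer swap above is the substance of this commit: the Dense head that follows the 1024-dimensional BGE encoder was retrained, and per the commit message it now projects to 256 dimensions. As a hedged sanity check (the tensor names below follow the usual sentence-transformers Dense-module layout and are an assumption, not read from this repo), you can inspect the shapes directly:

```python
# Sketch: inspect the updated Dense head after downloading the repo.
# Assumes the conventional sentence-transformers tensor names
# "linear.weight" / "linear.bias"; adjust if the file differs.
from safetensors import safe_open

with safe_open("2_Dense/model.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        print(name, tuple(f.get_tensor(name).shape))
```

A float32 1024 -> 256 projection with bias holds 1024 * 256 * 4 + 256 * 4 = 1,049,600 bytes of tensor data, which is consistent with the 1,049,760-byte file once the small safetensors header is included.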
README.md CHANGED
@@ -4,35 +4,35 @@ tags:
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
- - dataset_size:69216
+ - dataset_size:72454
  - loss:MultipleNegativesRankingLoss
  base_model: BAAI/bge-large-en-v1.5
  widget:
- - source_sentence: ajith s/o sockalingam
+ - source_sentence: mahathir bin mohamad
  sentences:
- - ajith a/l sockalingam
- - marcus ping yi ng
- - ajith a/p sockalingam
- - source_sentence: quinn kwan xin fang
+ - mahathir mohamad
+ - mahathir bin ismail
+ - nazhan hafiz rahmat
+ - source_sentence: siow yin heng
  sentences:
- - ambiga a/p jacob
- - quinn fang kwan xin
- - xin kwan fang
- - source_sentence: brandon teh min ling
+ - siu xin loh daniel
+ - yin heng siow
+ - siow heng yin
+ - source_sentence: fadzil bin othman
  sentences:
- - victor bing yong ng
- - min ling teh brandon
- - ling min teh brandon
- - source_sentence: carmen ho xin jun
+ - izzah auni binti zulkifli
+ - fadzil othman
+ - mariam binti hassan
+ - source_sentence: raja muhammad syamil bin raja ishak
  sentences:
- - xin ho jun carmen
- - pei ho yi grace
- - xin jun ho carmen
- - source_sentence: alicia lim siu ling
+ - ridzuan bin hashim
+ - meng leong fang
+ - raja muhd syamil bin raja ishak
+ - source_sentence: felix koh bing sheng
  sentences:
- - lim ling siu alicia
- - alicia siu ling lim
- - nadia soh meng jun
+ - olivia sinnathamby
+ - bing koh sheng
+ - koh bing sheng
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  ---
@@ -87,9 +87,9 @@ from sentence_transformers import SentenceTransformer
  model = SentenceTransformer("foochun/bge-large-finetuned")
  # Run inference
  sentences = [
- 'alicia lim siu ling',
- 'alicia siu ling lim',
- 'lim ling siu alicia',
+ 'felix koh bing sheng',
+ 'koh bing sheng',
+ 'bing koh sheng',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
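Not part of the card, but a natural follow-on to the snippet in this hunk: score the two reorderings against the query and check the embedding width. `model.similarity` is the sentence-transformers v3+ helper; the `(3, 256)` shape is what the commit message implies rather than something verified here.

```python
# Hypothetical follow-on to the README snippet above.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("foochun/bge-large-finetuned")
embeddings = model.encode([
    "felix koh bing sheng",  # query
    "koh bing sheng",        # reordering of the same name
    "bing koh sheng",        # another reordering
])
print(embeddings.shape)  # expect (3, 256) after this commit

# Query-vs-candidate similarity; higher score = more likely the same person.
print(model.similarity(embeddings[:1], embeddings[1:]))
```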
@@ -143,19 +143,19 @@ You can finetune this model on your own dataset.
 
  #### Unnamed Dataset
 
- * Size: 69,216 training samples
+ * Size: 72,454 training samples
  * Columns: <code>query</code>, <code>pos</code>, and <code>neg</code>
  * Approximate statistics based on the first 1000 samples:
  | | query | pos | neg |
  |:--------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|
  | type | string | string | string |
- | details | <ul><li>min: 4 tokens</li><li>mean: 8.96 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 8.22 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 8.47 tokens</li><li>max: 16 tokens</li></ul> |
+ | details | <ul><li>min: 4 tokens</li><li>mean: 8.06 tokens</li><li>max: 16 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.59 tokens</li><li>max: 16 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.76 tokens</li><li>max: 16 tokens</li></ul> |
  * Samples:
- | query | pos | neg |
- |:-----------------------------------|:-------------------------------|:------------------------------|
- | <code>abdul karim bin bakar</code> | <code>abdul karim bakar</code> | <code>johan bin hamid</code> |
- | <code>rupai anak jamit</code> | <code>rupai jamit</code> | <code>rupai anak karim</code> |
- | <code>sim kim ning</code> | <code>ning sim kim</code> | <code>kim sim ning</code> |
+ | query | pos | neg |
+ |:-------------------------------------|:-------------------------------------|:---------------------------------|
+ | <code>ahmad faisal bin zainal</code> | <code>ahmad faisal bin zainal</code> | <code>sakinah binti jamil</code> |
+ | <code>daniel lim ling ee</code> | <code>lim ling ee daniel</code> | <code>ee ling lim</code> |
+ | <code>lau sze sheng</code> | <code>sheng lau sze</code> | <code>lau sz sheng</code> |
  * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
@@ -168,19 +168,19 @@ You can finetune this model on your own dataset.
 
  #### Unnamed Dataset
 
- * Size: 9,887 evaluation samples
+ * Size: 10,350 evaluation samples
  * Columns: <code>query</code>, <code>pos</code>, and <code>neg</code>
  * Approximate statistics based on the first 1000 samples:
  | | query | pos | neg |
  |:--------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|
  | type | string | string | string |
- | details | <ul><li>min: 4 tokens</li><li>mean: 7.86 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.38 tokens</li><li>max: 16 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.65 tokens</li><li>max: 16 tokens</li></ul> |
+ | details | <ul><li>min: 4 tokens</li><li>mean: 8.06 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.59 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.81 tokens</li><li>max: 16 tokens</li></ul> |
  * Samples:
- | query | pos | neg |
- |:------------------------------------|:---------------------------------------|:------------------------------------|
- | <code>mohd ridzuan bin nasir</code> | <code>mohamad ridzuan bin nasir</code> | <code>mohd ridzuan bin naser</code> |
- | <code>isabel koh jun liang</code> | <code>isabel koh jun liang</code> | <code>liang jun koh isabel</code> |
- | <code>neo mei chuan</code> | <code>neo mei chuan</code> | <code>mak mei chuan</code> |
+ | query | pos | neg |
+ |:-------------------------------------------------|:-------------------------------------------------|:------------------------------------------|
+ | <code>xavier loh ling sheng</code> | <code>loh ling sheng xavier</code> | <code>loh sheng ling xavier</code> |
+ | <code>chan poh king</code> | <code>chan poh king</code> | <code>chan king poh</code> |
+ | <code>siti suzelita sazrikin binti hassan</code> | <code>siti suzelita sazrikin binti hassan</code> | <code>roslilawati binti hj mukhtar</code> |
  * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
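Both dataset hunks feed the MultipleNegativesRankingLoss linked above. As a hedged illustration only (the actual training script is not part of this diff; the three rows are lifted from the training-sample table, the trainer wiring is the standard sentence-transformers v3 pattern, and everything else is assumption), the triplet columns map onto the loss like this:

```python
# Sketch of how (query, pos, neg) triplets drive MultipleNegativesRankingLoss
# in sentence-transformers v3+. Not the author's actual training script.
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Three rows copied from the training-sample table above.
train_dataset = Dataset.from_dict({
    "query": ["ahmad faisal bin zainal", "daniel lim ling ee", "lau sze sheng"],
    "pos":   ["ahmad faisal bin zainal", "lim ling ee daniel", "sheng lau sze"],
    "neg":   ["sakinah binti jamil",     "ee ling lim",        "lau sz sheng"],
})

# For each query, its paired `pos` is the target; its own `neg` plus all other
# in-batch `pos`/`neg` entries act as negatives in a softmax over similarities.
loss = MultipleNegativesRankingLoss(model)
trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```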
@@ -325,12 +325,12 @@ You can finetune this model on your own dataset.
  ### Training Logs
  | Epoch | Step | Training Loss | Validation Loss |
  |:----------:|:--------:|:-------------:|:---------------:|
- | 0.4621 | 500 | 0.1357 | 0.0127 |
- | 0.9242 | 1000 | 0.0149 | 0.0065 |
- | 1.3863 | 1500 | 0.0079 | 0.0065 |
- | 1.8484 | 2000 | 0.0069 | 0.0043 |
- | 2.3105 | 2500 | 0.0059 | 0.0040 |
- | **2.7726** | **3000** | **0.0052** | **0.0039** |
+ | 0.4413 | 500 | 0.1568 | 0.0153 |
+ | 0.8826 | 1000 | 0.0155 | 0.0073 |
+ | 1.3239 | 1500 | 0.0086 | 0.0064 |
+ | 1.7652 | 2000 | 0.0067 | 0.0054 |
+ | 2.2065 | 2500 | 0.0059 | 0.0050 |
+ | **2.6478** | **3000** | **0.0052** | **0.0049** |
 
  * The bold row denotes the saved checkpoint.
 
@@ -340,7 +340,7 @@ You can finetune this model on your own dataset.
  - Transformers: 4.51.3
  - PyTorch: 2.6.0+cu124
  - Accelerate: 1.6.0
- - Datasets: 3.5.1
+ - Datasets: 3.6.0
  - Tokenizers: 0.21.1
 
  ## Citation
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:15b52f7abf658111d9430675ac14595f44e24a6d62b078f77ee10351c0ce222f
+ oid sha256:8ed937208fea5f17d65ecb178dd0a6fd0db166daaeae588de942ebe415b59216
  size 1340612432
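Both safetensors entries in this commit are Git LFS pointer files — `version`, `oid sha256:<hash>`, and `size`, per the LFS spec — and only the oid changes. To confirm a pulled file matches its pointer, a plain hash check suffices (the file path below is illustrative):

```python
# Sketch: verify a downloaded LFS object against the `oid sha256:` line in its
# pointer. After `git lfs pull`, this should print the new oid and the size
# shown in the diff above.
import hashlib
import os

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("model.safetensors"), os.path.getsize("model.safetensors"))
```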