foochun committed
Commit e1a1485 · verified · 1 Parent(s): 9ef22a0

256 Dimension updated
2_Dense/model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d0fae3f9a09ca238049f5af1b058df30df52a7d29669b1b46037926ecf90eb2c
+oid sha256:617ed83c3023a45f38fc054dcdd27c1923b06e10bddbf65bc1fe638d4eb0761f
 size 1049760
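The file size is unchanged at 1,049,760 bytes, which is consistent with the "256 Dimension" head in the commit message: a float32 Dense layer projecting 1024 → 256 (bge-large's 1024 hidden size is an assumption here) accounts for nearly all of it, with the small remainder being the safetensors JSON header. A quick sanity check of that arithmetic:

```python
# Hedged sanity check: does a 1024 -> 256 float32 Dense layer (weight
# matrix plus bias) account for the 1,049,760-byte safetensors file?
# The 1024 input width is an assumption based on bge-large's hidden size.
in_dim, out_dim, bytes_per_float32 = 1024, 256, 4

weight_bytes = in_dim * out_dim * bytes_per_float32  # 1,048,576
bias_bytes = out_dim * bytes_per_float32             # 1,024
param_bytes = weight_bytes + bias_bytes              # 1,049,600

file_size = 1_049_760
header_overhead = file_size - param_bytes  # small safetensors JSON header

print(param_bytes, header_overhead)  # → 1049600 160
```

The 160 leftover bytes are on the right order for a safetensors header describing two tensors, which supports the 1024 → 256 reading without downloading the file.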
README.md CHANGED
@@ -4,35 +4,35 @@ tags:
 - sentence-similarity
 - feature-extraction
 - generated_from_trainer
-- dataset_size:72454
+- dataset_size:72897
 - loss:MultipleNegativesRankingLoss
 base_model: BAAI/bge-large-en-v1.5
 widget:
-- source_sentence: mahathir bin mohamad
+- source_sentence: penny sim chee jun
   sentences:
-  - mahathir mohamad
-  - mahathir bin ismail
-  - nazhan hafiz rahmat
-- source_sentence: siow yin heng
+  - jun sim chee penny
+  - chee sim jun penny
+  - azmi bin raja sharif
+- source_sentence: yeoh lyn leong
   sentences:
-  - siu xin loh daniel
-  - yin heng siow
-  - siow heng yin
-- source_sentence: fadzil bin othman
+  - yeoh lyn leomng
+  - yeoh lyn leong
+  - pei lee shing
+- source_sentence: felix tan zhen bing
   sentences:
-  - izzah auni binti zulkifli
-  - fadzil othman
-  - mariam binti hassan
-- source_sentence: raja muhammad syamil bin raja ishak
+  - felix bing tan zhen
+  - jonathan ramanathan
+  - felix bing zhen tan
+- source_sentence: syed ahmad fadhil bin syed idris
   sentences:
-  - ridzuan bin hashim
-  - meng leong fang
-  - raja muhd syamil bin raja ishak
-- source_sentence: felix koh bing sheng
+  - nurul ain othman
+  - fadhil bin syed idris
+  - mohd ikram bin salleh
+- source_sentence: wan faris bin wan syafiq
   sentences:
-  - olivia sinnathamby
-  - bing koh sheng
-  - koh bing sheng
+  - wan amin bin wan syafiq
+  - ryan ping xin lau
+  - wan faris bin wan syafiq
 pipeline_tag: sentence-similarity
 library_name: sentence-transformers
 ---
@@ -87,9 +87,9 @@ from sentence_transformers import SentenceTransformer
 model = SentenceTransformer("foochun/bge-large-finetuned")
 # Run inference
 sentences = [
-    'felix koh bing sheng',
-    'koh bing sheng',
-    'bing koh sheng',
+    'wan faris bin wan syafiq',
+    'wan faris bin wan syafiq',
+    'wan amin bin wan syafiq',
 ]
 embeddings = model.encode(sentences)
 print(embeddings.shape)
@@ -143,19 +143,19 @@ You can finetune this model on your own dataset.

 #### Unnamed Dataset

-* Size: 72,454 training samples
+* Size: 72,897 training samples
 * Columns: <code>query</code>, <code>pos</code>, and <code>neg</code>
 * Approximate statistics based on the first 1000 samples:
   | | query | pos | neg |
   |:--------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|
   | type | string | string | string |
-  | details | <ul><li>min: 4 tokens</li><li>mean: 8.06 tokens</li><li>max: 16 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.59 tokens</li><li>max: 16 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.76 tokens</li><li>max: 16 tokens</li></ul> |
+  | details | <ul><li>min: 4 tokens</li><li>mean: 8.09 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.62 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.72 tokens</li><li>max: 17 tokens</li></ul> |
 * Samples:
-  | query | pos | neg |
-  |:-------------------------------------|:-------------------------------------|:---------------------------------|
-  | <code>ahmad faisal bin zainal</code> | <code>ahmad faisal bin zainal</code> | <code>sakinah binti jamil</code> |
-  | <code>daniel lim ling ee</code> | <code>lim ling ee daniel</code> | <code>ee ling lim</code> |
-  | <code>lau sze sheng</code> | <code>sheng lau sze</code> | <code>lau sz sheng</code> |
+  | query | pos | neg |
+  |:---------------------------------------------|:-----------------------------------------------|:---------------------------------------------|
+  | <code>mohd khairul anwar bin kassim</code> | <code>muhammad khairul anwar bin kassim</code> | <code>syed hassan bin bakar</code> |
+  | <code>jason ong ling wei</code> | <code>jason wei ong ling</code> | <code>ling ong wei jason</code> |
+  | <code>muhammad azri syah bin abdullah</code> | <code>azri syah abdullah</code> | <code>abdullah bin muhammad azri syah</code> |
 * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
   ```json
   {
@@ -168,19 +168,19 @@ You can finetune this model on your own dataset.

 #### Unnamed Dataset

-* Size: 10,350 evaluation samples
+* Size: 10,413 evaluation samples
 * Columns: <code>query</code>, <code>pos</code>, and <code>neg</code>
 * Approximate statistics based on the first 1000 samples:
   | | query | pos | neg |
   |:--------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|
   | type | string | string | string |
-  | details | <ul><li>min: 4 tokens</li><li>mean: 8.06 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.59 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.81 tokens</li><li>max: 16 tokens</li></ul> |
+  | details | <ul><li>min: 4 tokens</li><li>mean: 8.05 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.57 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.84 tokens</li><li>max: 16 tokens</li></ul> |
 * Samples:
-  | query | pos | neg |
-  |:-------------------------------------------------|:-------------------------------------------------|:------------------------------------------|
-  | <code>xavier loh ling sheng</code> | <code>loh ling sheng xavier</code> | <code>loh sheng ling xavier</code> |
-  | <code>chan poh king</code> | <code>chan poh king</code> | <code>chan king poh</code> |
-  | <code>siti suzelita sazrikin binti hassan</code> | <code>siti suzelita sazrikin binti hassan</code> | <code>roslilawati binti hj mukhtar</code> |
+  | query | pos | neg |
+  |:-------------------------------------------|:-------------------------------------------|:-----------------------------------------|
+  | <code>elaine soh yi ping</code> | <code>elaine soh yi ping</code> | <code>soh ping yi elaine</code> |
+  | <code>raja arshad bin raja tun uda</code> | <code>arshad bin raja tun uda</code> | <code>ismail bin sabri</code> |
+  | <code>syafiq kyle bin ahmad khariri</code> | <code>syafiq kyle bin ahmad khariri</code> | <code>adlin aman ramlie bin ramli</code> |
 * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
   ```json
   {
@@ -267,7 +267,6 @@ You can finetune this model on your own dataset.
 - `fsdp`: []
 - `fsdp_min_num_params`: 0
 - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
-- `tp_size`: 0
 - `fsdp_transformer_layer_cls_to_wrap`: None
 - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
 - `deepspeed`: None
@@ -325,21 +324,21 @@ You can finetune this model on your own dataset.
 ### Training Logs
 | Epoch | Step | Training Loss | Validation Loss |
 |:----------:|:--------:|:-------------:|:---------------:|
-| 0.4413 | 500 | 0.1568 | 0.0153 |
-| 0.8826 | 1000 | 0.0155 | 0.0073 |
-| 1.3239 | 1500 | 0.0086 | 0.0064 |
-| 1.7652 | 2000 | 0.0067 | 0.0054 |
-| 2.2065 | 2500 | 0.0059 | 0.0050 |
-| **2.6478** | **3000** | **0.0052** | **0.0049** |
+| 0.4386 | 500 | 0.1479 | 0.0113 |
+| 0.8772 | 1000 | 0.0163 | 0.0063 |
+| 1.3158 | 1500 | 0.0087 | 0.0058 |
+| 1.7544 | 2000 | 0.0067 | 0.0040 |
+| 2.1930 | 2500 | 0.0058 | 0.0037 |
+| **2.6316** | **3000** | **0.0053** | **0.0037** |

 * The bold row denotes the saved checkpoint.

 ### Framework Versions
 - Python: 3.11.9
 - Sentence Transformers: 4.1.0
-- Transformers: 4.51.3
+- Transformers: 4.52.4
 - PyTorch: 2.6.0+cu124
-- Accelerate: 1.6.0
+- Accelerate: 1.7.0
 - Datasets: 3.6.0
 - Tokenizers: 0.21.1

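The updated inference snippet in the README diff encodes three variants of the same name; ranking them comes down to cosine similarity between the resulting embeddings. A minimal sketch with plain NumPy and random stand-in vectors (the real values would require downloading the checkpoint, and the 256-dim shape is an assumption taken from the commit message):

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-ins for model.encode(sentences) output; shape (3, 256) mirrors the
# three sentences in the snippet and the assumed 256-dim Dense head.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 256))

# Score the query (row 0) against the two candidate rows.
scores = [cosine_sim(embeddings[0], embeddings[i]) for i in (1, 2)]
print(scores)
```

With the actual model, the higher-scoring candidate would be the better name match; here the scores are meaningless beyond demonstrating the computation.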
config.json CHANGED
@@ -24,7 +24,7 @@
   "pad_token_id": 0,
   "position_embedding_type": "absolute",
   "torch_dtype": "float32",
-  "transformers_version": "4.51.3",
+  "transformers_version": "4.52.4",
   "type_vocab_size": 2,
   "use_cache": true,
   "vocab_size": 30522
config_sentence_transformers.json CHANGED
@@ -1,7 +1,7 @@
 {
   "__version__": {
     "sentence_transformers": "4.1.0",
-    "transformers": "4.51.3",
+    "transformers": "4.52.4",
     "pytorch": "2.6.0+cu124"
   },
   "prompts": {},
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8ed937208fea5f17d65ecb178dd0a6fd0db166daaeae588de942ebe415b59216
+oid sha256:ebc106023445e63ab1248d4bdfb424cef2f2e46498f03bf3ac5b5e9d4dc8c3d8
 size 1340612432
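Both safetensors diffs above only swap a Git LFS pointer: a three-line text stub (version, oid, size) that stands in for the binary weights in the repository. A small parser sketch for that pointer format:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its key/value fields.

    Each line is "<key> <value>", e.g. "size 1340612432".
    """
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


# The pointer content from the model.safetensors diff above.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:ebc106023445e63ab1248d4bdfb424cef2f2e46498f03bf3ac5b5e9d4dc8c3d8
size 1340612432
"""

info = parse_lfs_pointer(pointer)
print(info["oid"], int(info["size"]))
```

This is why the diff shows only the oid changing while the size stays fixed: the weights were retrained (new content hash) but the tensor shapes, and hence the serialized byte count, are identical.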