foochun committed
Commit e1a1485 · verified · 1 Parent(s): 9ef22a0

256 Dimension updated
2_Dense/model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d0fae3f9a09ca238049f5af1b058df30df52a7d29669b1b46037926ecf90eb2c
+oid sha256:617ed83c3023a45f38fc054dcdd27c1923b06e10bddbf65bc1fe638d4eb0761f
 size 1049760
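The file size is unchanged at 1,049,760 bytes, which is consistent with the "256 Dimension" head in the commit message: a float32 Dense layer projecting 1024 → 256 (bge-large's 1024 hidden size is an assumption here) accounts for nearly all of it, with the small remainder being the safetensors JSON header. A quick sanity check of that arithmetic:

```python
# Hedged sanity check: does a 1024 -> 256 float32 Dense layer (weight
# matrix plus bias) account for the 1,049,760-byte safetensors file?
# The 1024 input width is an assumption based on bge-large's hidden size.
in_dim, out_dim, bytes_per_float32 = 1024, 256, 4

weight_bytes = in_dim * out_dim * bytes_per_float32  # 1,048,576
bias_bytes = out_dim * bytes_per_float32             # 1,024
param_bytes = weight_bytes + bias_bytes              # 1,049,600

file_size = 1_049_760
header_overhead = file_size - param_bytes  # small safetensors JSON header

print(param_bytes, header_overhead)  # → 1049600 160
```

The 160 leftover bytes are on the right order for a safetensors header describing two tensors, which supports the 1024 → 256 reading without downloading the file.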
README.md CHANGED
@@ -4,35 +4,35 @@ tags:
 - sentence-similarity
 - feature-extraction
 - generated_from_trainer
-- dataset_size:72454
+- dataset_size:72897
 - loss:MultipleNegativesRankingLoss
 base_model: BAAI/bge-large-en-v1.5
 widget:
-- source_sentence: mahathir bin mohamad
+- source_sentence: penny sim chee jun
   sentences:
-  - mahathir mohamad
-  - mahathir bin ismail
-  - nazhan hafiz rahmat
-- source_sentence: siow yin heng
+  - jun sim chee penny
+  - chee sim jun penny
+  - azmi bin raja sharif
+- source_sentence: yeoh lyn leong
   sentences:
-  - siu xin loh daniel
-  - yin heng siow
-  - siow heng yin
-- source_sentence: fadzil bin othman
+  - yeoh lyn leomng
+  - yeoh lyn leong
+  - pei lee shing
+- source_sentence: felix tan zhen bing
   sentences:
-  - izzah auni binti zulkifli
-  - fadzil othman
-  - mariam binti hassan
-- source_sentence: raja muhammad syamil bin raja ishak
+  - felix bing tan zhen
+  - jonathan ramanathan
+  - felix bing zhen tan
+- source_sentence: syed ahmad fadhil bin syed idris
   sentences:
-  - ridzuan bin hashim
-  - meng leong fang
-  - raja muhd syamil bin raja ishak
-- source_sentence: felix koh bing sheng
+  - nurul ain othman
+  - fadhil bin syed idris
+  - mohd ikram bin salleh
+- source_sentence: wan faris bin wan syafiq
   sentences:
-  - olivia sinnathamby
-  - bing koh sheng
-  - koh bing sheng
+  - wan amin bin wan syafiq
+  - ryan ping xin lau
+  - wan faris bin wan syafiq
 pipeline_tag: sentence-similarity
 library_name: sentence-transformers
 ---
@@ -87,9 +87,9 @@ from sentence_transformers import SentenceTransformer
 model = SentenceTransformer("foochun/bge-large-finetuned")
 # Run inference
 sentences = [
-    'felix koh bing sheng',
-    'koh bing sheng',
-    'bing koh sheng',
+    'wan faris bin wan syafiq',
+    'wan faris bin wan syafiq',
+    'wan amin bin wan syafiq',
 ]
 embeddings = model.encode(sentences)
 print(embeddings.shape)
@@ -143,19 +143,19 @@ You can finetune this model on your own dataset.

 #### Unnamed Dataset

-* Size: 72,454 training samples
+* Size: 72,897 training samples
 * Columns: <code>query</code>, <code>pos</code>, and <code>neg</code>
 * Approximate statistics based on the first 1000 samples:
   | | query | pos | neg |
   |:--------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|
   | type | string | string | string |
-  | details | <ul><li>min: 4 tokens</li><li>mean: 8.06 tokens</li><li>max: 16 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.59 tokens</li><li>max: 16 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.76 tokens</li><li>max: 16 tokens</li></ul> |
+  | details | <ul><li>min: 4 tokens</li><li>mean: 8.09 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.62 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.72 tokens</li><li>max: 17 tokens</li></ul> |
 * Samples:
-  | query | pos | neg |
-  |:-------------------------------------|:-------------------------------------|:---------------------------------|
-  | <code>ahmad faisal bin zainal</code> | <code>ahmad faisal bin zainal</code> | <code>sakinah binti jamil</code> |
-  | <code>daniel lim ling ee</code> | <code>lim ling ee daniel</code> | <code>ee ling lim</code> |
-  | <code>lau sze sheng</code> | <code>sheng lau sze</code> | <code>lau sz sheng</code> |
+  | query | pos | neg |
+  |:---------------------------------------------|:-----------------------------------------------|:---------------------------------------------|
+  | <code>mohd khairul anwar bin kassim</code> | <code>muhammad khairul anwar bin kassim</code> | <code>syed hassan bin bakar</code> |
+  | <code>jason ong ling wei</code> | <code>jason wei ong ling</code> | <code>ling ong wei jason</code> |
+  | <code>muhammad azri syah bin abdullah</code> | <code>azri syah abdullah</code> | <code>abdullah bin muhammad azri syah</code> |
 * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
   ```json
   {
@@ -168,19 +168,19 @@ You can finetune this model on your own dataset.

 #### Unnamed Dataset

-* Size: 10,350 evaluation samples
+* Size: 10,413 evaluation samples
 * Columns: <code>query</code>, <code>pos</code>, and <code>neg</code>
 * Approximate statistics based on the first 1000 samples:
   | | query | pos | neg |
   |:--------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|
   | type | string | string | string |
-  | details | <ul><li>min: 4 tokens</li><li>mean: 8.06 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.59 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.81 tokens</li><li>max: 16 tokens</li></ul> |
+  | details | <ul><li>min: 4 tokens</li><li>mean: 8.05 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.57 tokens</li><li>max: 17 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.84 tokens</li><li>max: 16 tokens</li></ul> |
 * Samples:
-  | query | pos | neg |
-  |:-------------------------------------------------|:-------------------------------------------------|:------------------------------------------|
-  | <code>xavier loh ling sheng</code> | <code>loh ling sheng xavier</code> | <code>loh sheng ling xavier</code> |
-  | <code>chan poh king</code> | <code>chan poh king</code> | <code>chan king poh</code> |
-  | <code>siti suzelita sazrikin binti hassan</code> | <code>siti suzelita sazrikin binti hassan</code> | <code>roslilawati binti hj mukhtar</code> |
+  | query | pos | neg |
+  |:-------------------------------------------|:-------------------------------------------|:-----------------------------------------|
+  | <code>elaine soh yi ping</code> | <code>elaine soh yi ping</code> | <code>soh ping yi elaine</code> |
+  | <code>raja arshad bin raja tun uda</code> | <code>arshad bin raja tun uda</code> | <code>ismail bin sabri</code> |
+  | <code>syafiq kyle bin ahmad khariri</code> | <code>syafiq kyle bin ahmad khariri</code> | <code>adlin aman ramlie bin ramli</code> |
 * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
   ```json
   {
@@ -267,7 +267,6 @@ You can finetune this model on your own dataset.
 - `fsdp`: []
 - `fsdp_min_num_params`: 0
 - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
-- `tp_size`: 0
 - `fsdp_transformer_layer_cls_to_wrap`: None
 - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
 - `deepspeed`: None
@@ -325,21 +324,21 @@ You can finetune this model on your own dataset.
 ### Training Logs
 | Epoch | Step | Training Loss | Validation Loss |
 |:----------:|:--------:|:-------------:|:---------------:|
-| 0.4413 | 500 | 0.1568 | 0.0153 |
-| 0.8826 | 1000 | 0.0155 | 0.0073 |
-| 1.3239 | 1500 | 0.0086 | 0.0064 |
-| 1.7652 | 2000 | 0.0067 | 0.0054 |
-| 2.2065 | 2500 | 0.0059 | 0.0050 |
-| **2.6478** | **3000** | **0.0052** | **0.0049** |
+| 0.4386 | 500 | 0.1479 | 0.0113 |
+| 0.8772 | 1000 | 0.0163 | 0.0063 |
+| 1.3158 | 1500 | 0.0087 | 0.0058 |
+| 1.7544 | 2000 | 0.0067 | 0.0040 |
+| 2.1930 | 2500 | 0.0058 | 0.0037 |
+| **2.6316** | **3000** | **0.0053** | **0.0037** |

 * The bold row denotes the saved checkpoint.

 ### Framework Versions
 - Python: 3.11.9
 - Sentence Transformers: 4.1.0
-- Transformers: 4.51.3
+- Transformers: 4.52.4
 - PyTorch: 2.6.0+cu124
-- Accelerate: 1.6.0
+- Accelerate: 1.7.0
 - Datasets: 3.6.0
 - Tokenizers: 0.21.1

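The updated inference snippet in the README diff encodes three variants of the same name; ranking them comes down to cosine similarity between the resulting embeddings. A minimal sketch with plain NumPy and random stand-in vectors (the real values would require downloading the checkpoint, and the 256-dim shape is an assumption taken from the commit message):

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-ins for model.encode(sentences) output; shape (3, 256) mirrors the
# three sentences in the snippet and the assumed 256-dim Dense head.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 256))

# Score the query (row 0) against the two candidate rows.
scores = [cosine_sim(embeddings[0], embeddings[i]) for i in (1, 2)]
print(scores)
```

With the actual model, the higher-scoring candidate would be the better name match; here the scores are meaningless beyond demonstrating the computation.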
config.json CHANGED
@@ -24,7 +24,7 @@
   "pad_token_id": 0,
   "position_embedding_type": "absolute",
   "torch_dtype": "float32",
-  "transformers_version": "4.51.3",
+  "transformers_version": "4.52.4",
   "type_vocab_size": 2,
   "use_cache": true,
   "vocab_size": 30522
config_sentence_transformers.json CHANGED
@@ -1,7 +1,7 @@
 {
   "__version__": {
     "sentence_transformers": "4.1.0",
-    "transformers": "4.51.3",
+    "transformers": "4.52.4",
     "pytorch": "2.6.0+cu124"
   },
   "prompts": {},
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8ed937208fea5f17d65ecb178dd0a6fd0db166daaeae588de942ebe415b59216
+oid sha256:ebc106023445e63ab1248d4bdfb424cef2f2e46498f03bf3ac5b5e9d4dc8c3d8
 size 1340612432
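Both safetensors diffs above only swap a Git LFS pointer: a three-line text stub (version, oid, size) that stands in for the binary weights in the repository. A small parser sketch for that pointer format:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its key/value fields.

    Each line is "<key> <value>", e.g. "size 1340612432".
    """
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


# The pointer content from the model.safetensors diff above.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:ebc106023445e63ab1248d4bdfb424cef2f2e46498f03bf3ac5b5e9d4dc8c3d8
size 1340612432
"""

info = parse_lfs_pointer(pointer)
print(info["oid"], int(info["size"]))
```

This is why the diff shows only the oid changing while the size stays fixed: the weights were retrained (new content hash) but the tensor shapes, and hence the serialized byte count, are identical.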