Commit bf14c2a (verified) by radoslavralev · 1 parent: 7fc57f9

Add new SentenceTransformer model
README.md CHANGED
@@ -12,96 +12,132 @@ tags:
  - retrieval
  - reranking
  - generated_from_trainer
- - dataset_size:483820
- - loss:CachedMultipleNegativesSymmetricRankingLoss
  base_model: Alibaba-NLP/gte-modernbert-base
  widget:
- - source_sentence: 'See Precambrian time scale # Proposed Geologic timeline for another
- set of periods 4600 -- 541 MYA .'
  sentences:
- - In 2014 election , Biju Janata Dal candidate Tathagat Satapathy Bharatiya Janata
- party candidate Rudra Narayan Pany defeated with a margin of 1.37,340 votes .
- - In Scotland , the Strathclyde Partnership for Transport , formerly known as Strathclyde
- Passenger Transport Executive , comprises the former Strathclyde region , which
- includes the urban area around Glasgow .
- - 'See Precambrian Time Scale # Proposed Geological Timeline for another set of
- periods of 4600 -- 541 MYA .'
- - source_sentence: It is also 5 kilometers northeast of Tamaqua , 27 miles south of
- Allentown and 9 miles northwest of Hazleton .
  sentences:
- - In 1948 he moved to Massachusetts , and eventually settled in Vermont .
- - Suddenly I remembered that I was a New Zealander , I caught the first plane home
- and came back .
- - It is also 5 miles northeast of Tamaqua , 27 miles south of Allentown , and 9
- miles northwest of Hazleton .
- - source_sentence: The party has a Member of Parliament , a member of the House of
- Lords , three members of the London Assembly and two Members of the European Parliament
- .
  sentences:
- - The party has one Member of Parliament , one member of the House of Lords , three
- Members of the London Assembly and two Members of the European Parliament .
- - Grapsid crabs dominate in Australia , Malaysia and Panama , while gastropods Cerithidea
- scalariformis and Melampus coeffeus are important seed predators in Florida mangroves
- .
- - Music Story is a music service website and international music data provider that
- curates , aggregates and analyses metadata for digital music services .
- - source_sentence: 'The play received two 1969 Tony Award nominations : Best Actress
- in a Play ( Michael Annals ) and Best Costume Design ( Charlotte Rae ) .'
  sentences:
- - Ravishanker is a fellow of the International Statistical Institute and an elected
- member of the American Statistical Association .
- - 'In 1969 , the play received two Tony - Award nominations : Best Actress in a
- Theatre Play ( Michael Annals ) and Best Costume Design ( Charlotte Rae ) .'
- - AMD and Nvidia both have proprietary methods of scaling , CrossFireX for AMD ,
- and SLI for Nvidia .
- - source_sentence: He was a close friend of Ángel Cabrera and is a cousin of golfer
- Tony Croatto .
  sentences:
- - He was a close friend of Ángel Cabrera , and is a cousin of golfer Tony Croatto
- .
- - Eugenijus Bartulis ( born December 7 , 1949 in Kaunas ) is a Lithuanian Roman
- Catholic priest , and Bishop of Šiauliai .
- - UWIRE also distributes its members content to professional media outlets , including
- Yahoo , CNN and CBS News .
  datasets:
  - redis/langcache-sentencepairs-v1
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  metrics:
- - cosine_accuracy@1
- - cosine_precision@1
- - cosine_recall@1
- - cosine_ndcg@10
- - cosine_mrr@1
- - cosine_map@100
  model-index:
  - name: Redis fine-tuned BiEncoder model for semantic caching on LangCache
  results:
  - task:
- type: information-retrieval
- name: Information Retrieval
  dataset:
- name: train
- type: train
  metrics:
- - type: cosine_accuracy@1
- value: 0.5576531716821823
- name: Cosine Accuracy@1
- - type: cosine_precision@1
- value: 0.5576531716821823
- name: Cosine Precision@1
- - type: cosine_recall@1
- value: 0.5355707469384673
- name: Cosine Recall@1
- - type: cosine_ndcg@10
- value: 0.752122458721726
- name: Cosine Ndcg@10
- - type: cosine_mrr@1
- value: 0.5576531716821823
- name: Cosine Mrr@1
- - type: cosine_map@100
- value: 0.6973125692592467
- name: Cosine Map@100
  ---

  # Redis fine-tuned BiEncoder model for semantic caching on LangCache
@@ -113,7 +149,7 @@ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [A
  ### Model Description
  - **Model Type:** Sentence Transformer
  - **Base model:** [Alibaba-NLP/gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) <!-- at revision e7f32e3c00f91d699e8c43b53106206bcc72bb22 -->
- - **Maximum Sequence Length:** 100 tokens
  - **Output Dimensionality:** 768 dimensions
  - **Similarity Function:** Cosine Similarity
  - **Training Dataset:**
@@ -131,7 +167,7 @@ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [A

  ```
  SentenceTransformer(
- (0): Transformer({'max_seq_length': 100, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  )
  ```
@@ -154,9 +190,9 @@ from sentence_transformers import SentenceTransformer
  model = SentenceTransformer("redis/langcache-embed-v3")
  # Run inference
  sentences = [
- 'He was a close friend of Ángel Cabrera and is a cousin of golfer Tony Croatto .',
- 'He was a close friend of Ángel Cabrera , and is a cousin of golfer Tony Croatto .',
- 'UWIRE also distributes its members content to professional media outlets , including Yahoo , CNN and CBS News .',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
@@ -165,9 +201,9 @@ print(embeddings.shape)
  # Get the similarity scores for the embeddings
  similarities = model.similarity(embeddings, embeddings)
  print(similarities)
- # tensor([[0.9922, 0.9922, 0.5352],
- # [0.9922, 0.9961, 0.5391],
- # [0.5352, 0.5391, 1.0000]], dtype=torch.bfloat16)
  ```

  <!--
@@ -198,19 +234,21 @@ You can finetune this model on your own dataset.

  ### Metrics

- #### Information Retrieval

- * Dataset: `train`
- * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)

- | Metric | Value |
- |:-------------------|:-----------|
- | cosine_accuracy@1 | 0.5577 |
- | cosine_precision@1 | 0.5577 |
- | cosine_recall@1 | 0.5356 |
- | **cosine_ndcg@10** | **0.7521** |
- | cosine_mrr@1 | 0.5577 |
- | cosine_map@100 | 0.6973 |

  <!--
  ## Bias, Risks and Limitations
@@ -231,26 +269,24 @@ You can finetune this model on your own dataset.
  #### LangCache Sentence Pairs (all)

  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
- * Size: 26,850 training samples
  * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
  * Approximate statistics based on the first 1000 samples:
- | | sentence1 | sentence2 | label |
- |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:-----------------------------|
- | type | string | string | int |
- | details | <ul><li>min: 8 tokens</li><li>mean: 27.35 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 52 tokens</li></ul> | <ul><li>1: 100.00%</li></ul> |
  * Samples:
- | sentence1 | sentence2 | label |
- |:----------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------|:---------------|
- | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>1</code> |
- | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
- | <code>The 12F was officially homologated on August 21 , 1929 and exhibited at the Paris Salon in 1930 .</code> | <code>The 12F was officially homologated on 21 August 1929 and displayed at the 1930 Paris Salon .</code> | <code>1</code> |
- * Loss: [<code>CachedMultipleNegativesSymmetricRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativessymmetricrankingloss) with these parameters:
  ```json
  {
  "scale": 20.0,
- "similarity_fct": "cos_sim",
- "mini_batch_size": 96,
- "gather_across_devices": false
  }
  ```

@@ -259,33 +295,31 @@ You can finetune this model on your own dataset.
  #### LangCache Sentence Pairs (all)

  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
- * Size: 26,850 evaluation samples
  * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
  * Approximate statistics based on the first 1000 samples:
- | | sentence1 | sentence2 | label |
- |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:-----------------------------|
- | type | string | string | int |
- | details | <ul><li>min: 8 tokens</li><li>mean: 27.35 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 52 tokens</li></ul> | <ul><li>1: 100.00%</li></ul> |
  * Samples:
- | sentence1 | sentence2 | label |
- |:----------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------|:---------------|
- | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>1</code> |
- | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
- | <code>The 12F was officially homologated on August 21 , 1929 and exhibited at the Paris Salon in 1930 .</code> | <code>The 12F was officially homologated on 21 August 1929 and displayed at the 1930 Paris Salon .</code> | <code>1</code> |
- * Loss: [<code>CachedMultipleNegativesSymmetricRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativessymmetricrankingloss) with these parameters:
  ```json
  {
  "scale": 20.0,
- "similarity_fct": "cos_sim",
- "mini_batch_size": 96,
- "gather_across_devices": false
  }
  ```

  ### Training Logs
- | Epoch | Step | train_cosine_ndcg@10 |
- |:-----:|:----:|:--------------------:|
- | -1 | -1 | 0.7521 |


  ### Framework Versions
@@ -314,6 +348,17 @@ You can finetune this model on your own dataset.
  }
  ```

  <!--
  ## Glossary

  - retrieval
  - reranking
  - generated_from_trainer
+ - dataset_size:1056095
+ - loss:CoSENTLoss
  base_model: Alibaba-NLP/gte-modernbert-base
  widget:
+ - source_sentence: In 2015 Adolf Hitler appeared in the kickstarter short movie ``
+ Kung Fury `` as Taccone ( A.K.A .
  sentences:
+ - In 2015 , Adolf Hitler appeared in the Kickstarter - short film `` Kung Fury ``
+ as Taccone ( A.K.A .
+ - In 1795 , the only white residents were Dr. John Laidley and two brothers with
+ the surname Ainslie .
+ - The 125th University Match was played in March 2014 at the Rye Golf Club , Oxford
+ , East Sussex won the game 8.5 - 6.5 .
+ - source_sentence: From 1973 to 1974 , Aubrey toured with the Cambridge Theatre Company
+ as Diggory in `` She Stoops to Conquer `` and again as Aguecheek .
  sentences:
+ - Oxide can be reduced to metallic samarium at higher temperatures by heating with
+ a reducing agent such as hydrogen or carbon monoxide .
+ - From 1973 to 1974 Aguecheek toured with the Cambridge Theatre Company as Diggory
+ in `` You Stoops to Conquer `` and again as Aubrey .
+ - The medals were presented by Barry Maister , IOC member , New Zealand and Sarah
+ Webb Gosling , Vice President of World Sailing .
+ - source_sentence: There is no official wall on the border , although there are sections
+ of fence near populated areas and continuous border crossings .
  sentences:
+ - The 2014 -- 15 Boston Bruins season was the 91st season for the National Hockey
+ League franchise that was established on November 1 , 1924 .
+ - He was trained by the Inghams and owned by John Hawkes .
+ - There is no continuous wall on the border , although there are fence sections
+ near populated areas and official border crossings .
+ - source_sentence: Capital . `` The French established similar hill stations in Indochina
+ , such as Dalat built in 1921 .
  sentences:
+ - Lubuk China is a small town in Alor Gajah District , Melaka , Malaysia . It is
+ situated near the border with Negeri Sembilan .
+ - The French established similar hill stations in Indochina , such as Dalat , built
+ in 1921 .
+ - John Potts ( or Pott ) was a doctor and colonial governor of Virginia in the Jamestown
+ settlement at Virginia Colony in the early 17th century .
+ - source_sentence: The band pursued `` signals `` in January 2012 in three weeks ,
+ and drums were recorded in a day and a half .
  sentences:
+ - It was repaired at the beginning of the 20th century and is listed as closed in
+ our records .
+ - The band tracked `` Signals `` in three weeks in January 2012 . Drums were recorded
+ in a day and a half .
+ - Contributors include actor Anton LaVey , Satanist Christopher Lee , serial killer
+ expert Clive Barker , author Karen Greenlee , and necrophile Robert Ressler .
  datasets:
  - redis/langcache-sentencepairs-v1
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  metrics:
+ - cosine_accuracy
+ - cosine_accuracy_threshold
+ - cosine_f1
+ - cosine_f1_threshold
+ - cosine_precision
+ - cosine_recall
+ - cosine_ap
+ - cosine_mcc
  model-index:
  - name: Redis fine-tuned BiEncoder model for semantic caching on LangCache
  results:
  - task:
+ type: binary-classification
+ name: Binary Classification
  dataset:
+ name: val
+ type: val
  metrics:
+ - type: cosine_accuracy
+ value: 0.7629982153480072
+ name: Cosine Accuracy
+ - type: cosine_accuracy_threshold
+ value: 0.8639795780181885
+ name: Cosine Accuracy Threshold
+ - type: cosine_f1
+ value: 0.6907391673746814
+ name: Cosine F1
+ - type: cosine_f1_threshold
+ value: 0.8261561989784241
+ name: Cosine F1 Threshold
+ - type: cosine_precision
+ value: 0.6290946608202218
+ name: Cosine Precision
+ - type: cosine_recall
+ value: 0.7657770800627943
+ name: Cosine Recall
+ - type: cosine_ap
+ value: 0.7350867007914639
+ name: Cosine Ap
+ - type: cosine_mcc
+ value: 0.47714361581572273
+ name: Cosine Mcc
+ - task:
+ type: binary-classification
+ name: Binary Classification
+ dataset:
+ name: test
+ type: test
+ metrics:
+ - type: cosine_accuracy
+ value: 0.7034875284177939
+ name: Cosine Accuracy
+ - type: cosine_accuracy_threshold
+ value: 0.8523406982421875
+ name: Cosine Accuracy Threshold
+ - type: cosine_f1
+ value: 0.7118695167174169
+ name: Cosine F1
+ - type: cosine_f1_threshold
+ value: 0.8109798431396484
+ name: Cosine F1 Threshold
+ - type: cosine_precision
+ value: 0.597953808752026
+ name: Cosine Precision
+ - type: cosine_recall
+ value: 0.8794040968342645
+ name: Cosine Recall
+ - type: cosine_ap
+ value: 0.6473629818920917
+ name: Cosine Ap
+ - type: cosine_mcc
+ value: 0.4409362621742405
+ name: Cosine Mcc
  ---

  # Redis fine-tuned BiEncoder model for semantic caching on LangCache
 
  ### Model Description
  - **Model Type:** Sentence Transformer
  - **Base model:** [Alibaba-NLP/gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) <!-- at revision e7f32e3c00f91d699e8c43b53106206bcc72bb22 -->
+ - **Maximum Sequence Length:** 8192 tokens
  - **Output Dimensionality:** 768 dimensions
  - **Similarity Function:** Cosine Similarity
  - **Training Dataset:**

  ```
  SentenceTransformer(
+ (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  )
  ```

  model = SentenceTransformer("redis/langcache-embed-v3")
  # Run inference
  sentences = [
+ 'The band pursued `` signals `` in January 2012 in three weeks , and drums were recorded in a day and a half .',
+ 'The band tracked `` Signals `` in three weeks in January 2012 . Drums were recorded in a day and a half .',
+ 'Contributors include actor Anton LaVey , Satanist Christopher Lee , serial killer expert Clive Barker , author Karen Greenlee , and necrophile Robert Ressler .',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)

  # Get the similarity scores for the embeddings
  similarities = model.similarity(embeddings, embeddings)
  print(similarities)
+ # tensor([[0.9999, 0.9598, 0.4944],
+ # [0.9598, 0.9999, 0.5096],
+ # [0.4944, 0.5096, 0.9999]])
  ```
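For readers without the model to hand, `model.similarity` in the snippet above is plain cosine similarity over the 768-dimensional embeddings. A dependency-free sketch of that computation (the toy 3-d vectors are illustrative stand-ins, not real embeddings):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||) -- the "Cosine Similarity"
    # function named in the model description.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Near-paraphrases map to nearby vectors; unrelated sentences do not.
paraphrase_a = [0.2, 0.9, 0.1]
paraphrase_b = [0.25, 0.85, 0.12]
unrelated = [0.9, -0.1, 0.4]

print(cosine_similarity(paraphrase_a, paraphrase_b))  # close to 1
print(cosine_similarity(paraphrase_a, unrelated))     # much lower
```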

  <!--

  ### Metrics

+ #### Binary Classification

+ * Datasets: `val` and `test`
+ * Evaluated with [<code>BinaryClassificationEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.BinaryClassificationEvaluator)

+ | Metric | val | test |
+ |:--------------------------|:-----------|:-----------|
+ | cosine_accuracy | 0.763 | 0.7035 |
+ | cosine_accuracy_threshold | 0.864 | 0.8523 |
+ | cosine_f1 | 0.6907 | 0.7119 |
+ | cosine_f1_threshold | 0.8262 | 0.811 |
+ | cosine_precision | 0.6291 | 0.598 |
+ | cosine_recall | 0.7658 | 0.8794 |
+ | **cosine_ap** | **0.7351** | **0.6474** |
+ | cosine_mcc | 0.4771 | 0.4409 |
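In a semantic-caching deployment these thresholds translate directly into a hit/miss rule. A minimal sketch (the function name is hypothetical; the threshold is the `test` value of `cosine_f1_threshold` reported above, rounded):

```python
# Treat a cached entry as a hit when the cosine similarity between its
# embedding and the new query meets the F1-optimal threshold (test split).
COSINE_F1_THRESHOLD_TEST = 0.8110

def is_cache_hit(similarity: float, threshold: float = COSINE_F1_THRESHOLD_TEST) -> bool:
    return similarity >= threshold

# Scores taken from the inference example earlier in this card:
print(is_cache_hit(0.9598))  # paraphrase pair -> True
print(is_cache_hit(0.5096))  # unrelated pair -> False
```

Whether to use the `val` or `test` threshold, or to tune one on your own traffic, depends on how costly false cache hits are in your application.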

  <!--
  ## Bias, Risks and Limitations

  #### LangCache Sentence Pairs (all)

  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
+ * Size: 62,021 training samples
  * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
  * Approximate statistics based on the first 1000 samples:
+ | | sentence1 | sentence2 | label |
+ |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:------------------------------------------------|
+ | type | string | string | int |
+ | details | <ul><li>min: 8 tokens</li><li>mean: 27.46 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.36 tokens</li><li>max: 52 tokens</li></ul> | <ul><li>0: ~50.30%</li><li>1: ~49.70%</li></ul> |
  * Samples:
+ | sentence1 | sentence2 | label |
+ |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:---------------|
+ | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>1</code> |
+ | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> | <code>0</code> |
+ | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
  ```json
  {
  "scale": 20.0,
+ "similarity_fct": "pairwise_cos_sim"
  }
  ```
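CoSENTLoss ranks pair similarities rather than scoring each pair independently: every pair labelled 1 should receive a higher cosine similarity than every pair labelled 0, and violations are penalized inside a log-sum-exp. A pure-Python sketch of that objective (an illustration of the CoSENT formulation, not the library implementation; `scale` matches the 20.0 above):

```python
import math

def cosent_loss(scores, labels, scale=20.0):
    # loss = log(1 + sum over (i, j) with labels[i] > labels[j]
    #                of exp(scale * (scores[j] - scores[i])))
    # i.e. a penalty accrues whenever a lower-labelled pair outscores
    # a higher-labelled one.
    total = 0.0
    for si, yi in zip(scores, labels):
        for sj, yj in zip(scores, labels):
            if yi > yj:
                total += math.exp(scale * (sj - si))
    return math.log1p(total)

# Well-ordered batch (positives above the negative) -> near-zero loss;
# mis-ordered batch -> large loss.
well_ordered = cosent_loss([0.95, 0.90, 0.30], [1, 1, 0])
mis_ordered = cosent_loss([0.30, 0.35, 0.95], [1, 1, 0])
print(well_ordered, mis_ordered)
```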

  #### LangCache Sentence Pairs (all)

  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
+ * Size: 62,021 evaluation samples
  * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
  * Approximate statistics based on the first 1000 samples:
+ | | sentence1 | sentence2 | label |
+ |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:------------------------------------------------|
+ | type | string | string | int |
+ | details | <ul><li>min: 8 tokens</li><li>mean: 27.46 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.36 tokens</li><li>max: 52 tokens</li></ul> | <ul><li>0: ~50.30%</li><li>1: ~49.70%</li></ul> |
  * Samples:
+ | sentence1 | sentence2 | label |
+ |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:---------------|
+ | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>1</code> |
+ | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> | <code>0</code> |
+ | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
  ```json
  {
  "scale": 20.0,
+ "similarity_fct": "pairwise_cos_sim"
  }
  ```
 
319
  ### Training Logs
320
+ | Epoch | Step | val_cosine_ap | test_cosine_ap |
321
+ |:-----:|:----:|:-------------:|:--------------:|
322
+ | -1 | -1 | 0.7351 | 0.6474 |
323
 
324
 
325
  ### Framework Versions
 
348
  }
349
  ```
350
 
351
+ #### CoSENTLoss
352
+ ```bibtex
353
+ @online{kexuefm-8847,
354
+ title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
355
+ author={Su Jianlin},
356
+ year={2022},
357
+ month={Jan},
358
+ url={https://kexue.fm/archives/8847},
359
+ }
360
+ ```
361
+
362
  <!--
363
  ## Glossary
364
 
config.json CHANGED
@@ -12,7 +12,7 @@
  "cls_token_id": 50281,
  "decoder_bias": true,
  "deterministic_flash_attn": false,
- "dtype": "bfloat16",
  "embedding_dropout": 0.0,
  "eos_token_id": 50282,
  "global_attn_every_n_layers": 3,

  "cls_token_id": 50281,
  "decoder_bias": true,
  "deterministic_flash_attn": false,
+ "dtype": "float32",
  "embedding_dropout": 0.0,
  "eos_token_id": 50282,
  "global_attn_every_n_layers": 3,
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c00d45c397ee18dccac8a87b21c268ce14fed5bc561700ec08a2a0fc056f251b
- size 298041696

  version https://git-lfs.github.com/spec/v1
+ oid sha256:0f9247027e7d57e8b36440b5b3d10a785ded92c7c9f4a313ff7f54a549967290
+ size 596070136
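The doubled checkpoint size is consistent with the `dtype` change in config.json: float32 stores 4 bytes per parameter where bfloat16 stored 2. A quick back-of-the-envelope check (it ignores the small safetensors header, so the two estimates differ by a few thousand parameters):

```python
# Byte sizes copied from the LFS pointers in this diff.
OLD_SIZE_BF16 = 298_041_696  # model.safetensors before (bfloat16)
NEW_SIZE_F32 = 596_070_136   # model.safetensors after (float32)

# Implied parameter counts: 2 bytes/param for bfloat16, 4 for float32.
params_bf16 = OLD_SIZE_BF16 // 2
params_f32 = NEW_SIZE_F32 // 4

print(params_bf16, params_f32)  # both roughly 149M parameters
```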
sentence_bert_config.json CHANGED
@@ -1,4 +1,4 @@
  {
- "max_seq_length": 100,
  "do_lower_case": false
  }

  {
+ "max_seq_length": 8192,
  "do_lower_case": false
  }
tokenizer.json CHANGED
@@ -2,7 +2,7 @@
  "version": "1.0",
  "truncation": {
  "direction": "Right",
- "max_length": 100,
  "strategy": "LongestFirst",
  "stride": 0
  },

  "version": "1.0",
  "truncation": {
  "direction": "Right",
+ "max_length": 8192,
  "strategy": "LongestFirst",
  "stride": 0
  },
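The same 100 → 8192 bump appears in sentence_bert_config.json, tokenizer.json, and the `Transformer` module dump in the README; these settings should stay in sync, since a smaller limit in any one place would truncate inputs earlier than expected. A sketch of the consistency check, with the post-commit values inlined from the diffs above rather than read from the repository files:

```python
# Sequence-length settings after this commit, copied from the diffs above.
sentence_bert_config = {"max_seq_length": 8192, "do_lower_case": False}
tokenizer_truncation = {"direction": "Right", "max_length": 8192,
                        "strategy": "LongestFirst", "stride": 0}

# Both files should agree on the maximum sequence length.
assert sentence_bert_config["max_seq_length"] == tokenizer_truncation["max_length"]
print("sequence-length settings agree:", tokenizer_truncation["max_length"])
```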
tokenizer_config.json CHANGED
@@ -933,7 +933,6 @@
  "cls_token": "[CLS]",
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
- "max_length": 100,
  "model_input_names": [
  "input_ids",
  "attention_mask"

  "cls_token": "[CLS]",
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_input_names": [
  "input_ids",
  "attention_mask"