radoslavralev committed on
Commit
5742358
·
verified ·
1 Parent(s): b1cf62c

Add new SentenceTransformer model

Files changed (2)
  1. README.md +67 -94
  2. model.safetensors +1 -1
README.md CHANGED
@@ -12,54 +12,50 @@ tags:
12
  - retrieval
13
  - reranking
14
  - generated_from_trainer
15
- - dataset_size:2200421
16
- - loss:CoSENTLoss
17
  base_model: Alibaba-NLP/gte-modernbert-base
18
  widget:
19
- - source_sentence: They are sometimes called Marg or also Path in Hindi .
20
  sentences:
21
- - Largs was born in Brisbane House in Noddsdale , near Brisbane in Ayrshire , Scotland
22
- , the son of Sir Thomas Brisbane and Dame Eleanora Brisbane .
23
- - Its smallest radius is 1.4 ( 131 thousand light years ) and largest 0.7 angle
24
- minutes ( 65 thousand light years ) .
25
- - They are also called Marg or sometimes the path in the Hindi .
26
- - source_sentence: The main mode of play in `` Crash Bash `` is the Adventure Mode
27
- , in which one or two players must win all 28 levels to complete .
28
  sentences:
29
- - Parkton is a city in Robeson County , North Carolina , in the Lumberton Metro
30
- area , in the United States .
31
- - The CANTAB tests were developed by Professor Barbara Sahakian and Professor Trevor
32
- Robbins .
33
- - The main mode in `` Crash Bash `` is the adventure mode in which one or two players
34
- must complete all 28 levels to win .
35
- - source_sentence: It was formed in December 2014 from elements of the disbanded 51st
36
- Mechanized Brigade and newly mobilized units .
37
  sentences:
38
- - It had branches in feature films , television , physical and digital publishing
39
- , merchandise , recorded music , digital and online media applications and mobile
40
- and social games .
41
- - Notts County and Arsenal were relegated to the Second Division ; Preston North
42
- End and Burnley were promoted to the First Division .
43
- - It was formed in December 2014 from elements of the dissolved 51st Mechanized
44
- Brigade and newly mobilized units .
45
- - source_sentence: The band pursued `` signals `` in January 2012 in three weeks ,
46
- and drums were recorded in a day and a half .
47
  sentences:
48
- - Kearsarge Lakes , Kearsarge Pass Trail , and Rae Lakes all have a maximum 2 nights
49
- stay , and Bullfrog Lake along the Charlotte Lake is closed to camping .
50
- - The band tracked `` Signals `` in three weeks in January 2012 . Drums were recorded
51
- in a day and a half .
52
- - From 1954 to 1961 , he was married to Stella Caralis and from 1978 until his death
53
- with Nina Bohlen .
54
- - source_sentence: A special case is of the Country B loyalist who controls agents
55
- or provides managerial supporting or other functions against Country A .
56
  sentences:
57
- - A special case is the loyalist of Country B , who controls agents or provides
58
- management support or other functions against Country A .
59
- - Music Story is a music service website and international music data provider that
60
- curates , aggregates and analyses metadata for digital music services .
61
- - These six cars were painted in the same lacquering as the buffet cars , silver
62
- with red lines and text .
63
  datasets:
64
  - redis/langcache-sentencepairs-v2
65
  pipeline_tag: sentence-similarity
@@ -151,9 +147,9 @@ from sentence_transformers import SentenceTransformer
151
  model = SentenceTransformer("redis/langcache-embed-v3")
152
  # Run inference
153
  sentences = [
154
- 'A special case is of the Country B loyalist who controls agents or provides managerial supporting or other functions against Country A .',
155
- 'A special case is the loyalist of Country B , who controls agents or provides management support or other functions against Country A .',
156
- 'Music Story is a music service website and international music data provider that curates , aggregates and analyses metadata for digital music services .',
157
  ]
158
  embeddings = model.encode(sentences)
159
  print(embeddings.shape)
@@ -162,9 +158,9 @@ print(embeddings.shape)
162
  # Get the similarity scores for the embeddings
163
  similarities = model.similarity(embeddings, embeddings)
164
  print(similarities)
165
- # tensor([[1.0000, 0.9844, 0.5195],
166
- # [0.9844, 0.9922, 0.5078],
167
- # [0.5195, 0.5078, 0.9922]], dtype=torch.bfloat16)
168
  ```
169
 
170
  <!--
@@ -228,52 +224,40 @@ You can finetune this model on your own dataset.
228
  #### LangCache Sentence Pairs (all)
229
 
230
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
231
- * Size: 72,021 training samples
232
- * Columns: <code>sentence_a</code>, <code>sentence_b</code>, and <code>label</code>
233
  * Approximate statistics based on the first 1000 samples:
234
- | | sentence_a | sentence_b | label |
235
- |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:------------------------------------------------|
236
- | type | string | string | int |
237
- | details | <ul><li>min: 8 tokens</li><li>mean: 27.46 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.36 tokens</li><li>max: 52 tokens</li></ul> | <ul><li>0: ~50.30%</li><li>1: ~49.70%</li></ul> |
238
  * Samples:
239
- | sentence_a | sentence_b | label |
240
- |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:---------------|
241
- | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>1</code> |
242
- | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> | <code>0</code> |
243
- | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
244
- * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
245
- ```json
246
- {
247
- "scale": 20.0,
248
- "similarity_fct": "pairwise_cos_sim"
249
- }
250
- ```
251
 
252
  ### Evaluation Dataset
253
 
254
  #### LangCache Sentence Pairs (all)
255
 
256
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
257
- * Size: 72,021 evaluation samples
258
- * Columns: <code>sentence_a</code>, <code>sentence_b</code>, and <code>label</code>
259
  * Approximate statistics based on the first 1000 samples:
260
- | | sentence_a | sentence_b | label |
261
- |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:------------------------------------------------|
262
- | type | string | string | int |
263
- | details | <ul><li>min: 8 tokens</li><li>mean: 27.46 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.36 tokens</li><li>max: 52 tokens</li></ul> | <ul><li>0: ~50.30%</li><li>1: ~49.70%</li></ul> |
264
  * Samples:
265
- | sentence_a | sentence_b | label |
266
- |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:---------------|
267
- | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>1</code> |
268
- | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> | <code>0</code> |
269
- | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
270
- * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
271
- ```json
272
- {
273
- "scale": 20.0,
274
- "similarity_fct": "pairwise_cos_sim"
275
- }
276
- ```
277
 
278
  ### Training Logs
279
  | Epoch | Step | test_cosine_ndcg@10 |
@@ -307,17 +291,6 @@ You can finetune this model on your own dataset.
307
  }
308
  ```
309
 
310
- #### CoSENTLoss
311
- ```bibtex
312
- @online{kexuefm-8847,
313
- title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
314
- author={Su Jianlin},
315
- year={2022},
316
- month={Jan},
317
- url={https://kexue.fm/archives/8847},
318
- }
319
- ```
320
-
321
  <!--
322
  ## Glossary
323
 
 
12
  - retrieval
13
  - reranking
14
  - generated_from_trainer
15
+ - dataset_size:3587
16
+ - loss:ArcTripletLoss
17
  base_model: Alibaba-NLP/gte-modernbert-base
18
  widget:
19
+ - source_sentence: Hunter College was originally Lehman College 's uptown campus .
20
  sentences:
21
+ - Acquired programming includes the Irish soap `` Fair City `` and Finnish drama
22
+ `` Black Widows `` .
23
+ - According to the United States Census Bureau , the town has a total area of ;
24
+ of the area is land and 0.66 % is water .
25
+ - Hunter College originally was Lehman College Uptown Campus .
26
+ - source_sentence: He hoped to defeat them and then marry Ravonna .
 
27
  sentences:
28
+ - Stillwater Creek received its official name in 1884 when William L. Couch established
29
+ his `` boomer colony `` on its banks .
30
+ - Note that the invertible of a matrix is always an exponential matrix .
31
+ - He hoped to defeat them and marry Ravonna .
32
+ - source_sentence: Born on February 2 , 1984 , Abrar Khan is a professional Pakistani
33
+ international Kabaddi player .
 
 
34
  sentences:
35
+ - Born on February 2 , 1984 , Abrar Khan is a professional Pakistani international
36
+ Kabaddi player .
37
+ - Together , the paired mylohyoid muscles form a muscular floor for the oral cavity
38
+ of the mouth .
39
+ - Abrar Khan born 2 February 1984 is a Pakistani professional international Kabaddi
40
+ player .
41
+ - source_sentence: Certainly , `` Lucy was nothing like flat `` in physical form ,
42
+ social condition , and personality .
 
43
  sentences:
44
+ - The real number is called the `` imaginary part `` of the real number ; the real
45
+ number is called the `` complex part `` of .
46
+ - From the Celebes lake , the captain Bullock observed the appearance of the corona
47
+ , while Gustav Fritsch accompanied an expedition to Aden .
48
+ - Certainly `` Lucy was , in physical form , social condition and personality ,
49
+ nothing like Shallow `` .
50
+ - source_sentence: The trio has performed besides Gesaffelstein , Justice , Bob Moses
51
+ and Lee Foss .
52
  sentences:
53
+ - The trio has performed besides Gesaffelstein , Justice , Bob Moses and Lee Foss
54
+ .
55
+ - The suttas generally contain educational content , while other early Buddhist
56
+ texts deal with monastic discipline or vinaya .
57
+ - The trio has performed alongside Bob Moses , Justice , Gesaffelstein and Lee Foss
58
+ .
59
  datasets:
60
  - redis/langcache-sentencepairs-v2
61
  pipeline_tag: sentence-similarity
 
147
  model = SentenceTransformer("redis/langcache-embed-v3")
148
  # Run inference
149
  sentences = [
150
+ 'The trio has performed besides Gesaffelstein , Justice , Bob Moses and Lee Foss .',
151
+ 'The trio has performed besides Gesaffelstein , Justice , Bob Moses and Lee Foss .',
152
+ 'The trio has performed alongside Bob Moses , Justice , Gesaffelstein and Lee Foss .',
153
  ]
154
  embeddings = model.encode(sentences)
155
  print(embeddings.shape)
 
158
  # Get the similarity scores for the embeddings
159
  similarities = model.similarity(embeddings, embeddings)
160
  print(similarities)
161
+ # tensor([[0.9961, 0.9961, 0.9844],
162
+ # [0.9961, 0.9961, 0.9844],
163
+ # [0.9844, 0.9844, 0.9961]], dtype=torch.bfloat16)
164
  ```
165
 
166
  <!--
 
224
  #### LangCache Sentence Pairs (all)
225
 
226
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
227
+ * Size: 1,922 training samples
228
+ * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
229
  * Approximate statistics based on the first 1000 samples:
230
+ | | anchor | positive | negative |
231
+ |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
232
+ | type | string | string | string |
233
+ | details | <ul><li>min: 8 tokens</li><li>mean: 27.26 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.24 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.09 tokens</li><li>max: 49 tokens</li></ul> |
234
  * Samples:
235
+ | anchor | positive | negative |
236
+ |:--------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------|
237
+ | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>At that time , on June 22 , 1754 , Edward Bentham married Bentham Elizabeth Bates ( d . 1790 ) from Hampshire in the nearby county of Alton .</code> |
238
+ | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>In 2012 , Cornell 5th and Lehigh 8th , Cornell was also 4th in 2013 and 7th in 2014 .</code> |
239
+ | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
240
+ * Loss: <code>losses.ArcTripletLoss</code>
241
 
242
  ### Evaluation Dataset
243
 
244
  #### LangCache Sentence Pairs (all)
245
 
246
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
247
+ * Size: 1,922 evaluation samples
248
+ * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
249
  * Approximate statistics based on the first 1000 samples:
250
+ | | anchor | positive | negative |
251
+ |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
252
+ | type | string | string | string |
253
+ | details | <ul><li>min: 8 tokens</li><li>mean: 27.26 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.24 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.09 tokens</li><li>max: 49 tokens</li></ul> |
254
  * Samples:
255
+ | anchor | positive | negative |
256
+ |:--------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------|
257
+ | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>At that time , on June 22 , 1754 , Edward Bentham married Bentham Elizabeth Bates ( d . 1790 ) from Hampshire in the nearby county of Alton .</code> |
258
+ | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>In 2012 , Cornell 5th and Lehigh 8th , Cornell was also 4th in 2013 and 7th in 2014 .</code> |
259
+ | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
260
+ * Loss: <code>losses.ArcTripletLoss</code>
261
 
262
  ### Training Logs
263
  | Epoch | Step | test_cosine_ndcg@10 |
 
291
  }
292
  ```
293
 
294
  <!--
295
  ## Glossary
296
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2e01abd77d4c0c664940beac5b2b312b66055c81265195500fecb712466ddfc9
3
  size 298041696
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:95d02211c4cca89113f9f3e93ed91f5176bf50170faa2cb835f7bfea15bb9dd2
3
  size 298041696
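
Note on the README change: the dataset columns move from `sentence_a` / `sentence_b` / `label` to `anchor` / `positive` / `negative`. The preprocessing that produced the triplets is not part of this commit. The sketch below is only one reading of the three sample rows: a label-1 pair appears in both directions with an unrelated negative, and a label-0 pair appears with the anchor repeated as its own positive. The function name `pairs_to_triplets`, the random negative sampling, and the shortened example sentences are assumptions made for illustration, not the author's pipeline.

```python
import random


def pairs_to_triplets(rows, seed=0):
    """Turn (sentence_a, sentence_b, label) rows into (anchor, positive, negative) tuples.

    Hypothetical helper: mirrors the pattern visible in the README samples,
    not a confirmed description of how the dataset was built.
    """
    rng = random.Random(seed)
    rows = list(rows)
    # Every sentence in the data is a candidate negative for unrelated pairs.
    pool = sorted({s for a, b, _ in rows for s in (a, b)})
    triplets = []
    for a, b, label in rows:
        if label == 1:
            # Paraphrase pair: emit both directions, each with a sentence
            # drawn from outside the pair as the negative.
            candidates = [s for s in pool if s not in (a, b)]
            if not candidates:
                continue
            for anchor, positive in ((a, b), (b, a)):
                triplets.append((anchor, positive, rng.choice(candidates)))
        else:
            # Non-paraphrase pair: the anchor is its own positive and the
            # paired sentence becomes the (hard) negative.
            triplets.append((a, a, b))
    return triplets


rows = [
    ("The newer Punts race in the same fleets .",
     "The newer punts run in the same fleets .", 1),
    ("Turner Valley was at the airport .",
     "The airport was located at Turner Valley .", 0),
]
triplets = pairs_to_triplets(rows)
for anchor, positive, negative in triplets:
    print(anchor, "|", positive, "|", negative)
```

If the real pipeline mines negatives differently (for example, by similarity rather than at random), only the `rng.choice` line would need to change.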