radoslavralev committed on
Commit 6e71c42 · verified · 1 Parent(s): 8f296ec

Add new SentenceTransformer model

Files changed (2)
  1. README.md +44 -86
  2. model.safetensors +1 -1
README.md CHANGED
@@ -12,54 +12,9 @@ tags:
12
  - retrieval
13
  - reranking
14
  - generated_from_trainer
15
- - dataset_size:1056095
16
- - loss:CoSENTLoss
17
  base_model: Alibaba-NLP/gte-modernbert-base
18
- widget:
19
- - source_sentence: In 2015 Adolf Hitler appeared in the kickstarter short movie ``
20
- Kung Fury `` as Taccone ( A.K.A .
21
- sentences:
22
- - In 2015 , Adolf Hitler appeared in the Kickstarter - short film `` Kung Fury ``
23
- as Taccone ( A.K.A .
24
- - In 1795 , the only white residents were Dr. John Laidley and two brothers with
25
- the surname Ainslie .
26
- - The 125th University Match was played in March 2014 at the Rye Golf Club , Oxford
27
- , East Sussex won the game 8.5 - 6.5 .
28
- - source_sentence: From 1973 to 1974 , Aubrey toured with the Cambridge Theatre Company
29
- as Diggory in `` She Stoops to Conquer `` and again as Aguecheek .
30
- sentences:
31
- - Oxide can be reduced to metallic samarium at higher temperatures by heating with
32
- a reducing agent such as hydrogen or carbon monoxide .
33
- - From 1973 to 1974 Aguecheek toured with the Cambridge Theatre Company as Diggory
34
- in `` You Stoops to Conquer `` and again as Aubrey .
35
- - The medals were presented by Barry Maister , IOC member , New Zealand and Sarah
36
- Webb Gosling , Vice President of World Sailing .
37
- - source_sentence: There is no official wall on the border , although there are sections
38
- of fence near populated areas and continuous border crossings .
39
- sentences:
40
- - The 2014 -- 15 Boston Bruins season was the 91st season for the National Hockey
41
- League franchise that was established on November 1 , 1924 .
42
- - He was trained by the Inghams and owned by John Hawkes .
43
- - There is no continuous wall on the border , although there are fence sections
44
- near populated areas and official border crossings .
45
- - source_sentence: Capital . `` The French established similar hill stations in Indochina
46
- , such as Dalat built in 1921 .
47
- sentences:
48
- - Lubuk China is a small town in Alor Gajah District , Melaka , Malaysia . It is
49
- situated near the border with Negeri Sembilan .
50
- - The French established similar hill stations in Indochina , such as Dalat , built
51
- in 1921 .
52
- - John Potts ( or Pott ) was a doctor and colonial governor of Virginia in the Jamestown
53
- settlement at Virginia Colony in the early 17th century .
54
- - source_sentence: The band pursued `` signals `` in January 2012 in three weeks ,
55
- and drums were recorded in a day and a half .
56
- sentences:
57
- - It was repaired at the beginning of the 20th century and is listed as closed in
58
- our records .
59
- - The band tracked `` Signals `` in three weeks in January 2012 . Drums were recorded
60
- in a day and a half .
61
- - Contributors include actor Anton LaVey , Satanist Christopher Lee , serial killer
62
- expert Clive Barker , author Karen Greenlee , and necrophile Robert Ressler .
63
  datasets:
64
  - redis/langcache-sentencepairs-v1
65
  pipeline_tag: sentence-similarity
@@ -151,9 +106,9 @@ from sentence_transformers import SentenceTransformer
151
  model = SentenceTransformer("redis/langcache-embed-v3")
152
  # Run inference
153
  sentences = [
154
- 'The band pursued `` signals `` in January 2012 in three weeks , and drums were recorded in a day and a half .',
155
- 'The band tracked `` Signals `` in three weeks in January 2012 . Drums were recorded in a day and a half .',
156
- 'Contributors include actor Anton LaVey , Satanist Christopher Lee , serial killer expert Clive Barker , author Karen Greenlee , and necrophile Robert Ressler .',
157
  ]
158
  embeddings = model.encode(sentences)
159
  print(embeddings.shape)
@@ -162,9 +117,9 @@ print(embeddings.shape)
162
  # Get the similarity scores for the embeddings
163
  similarities = model.similarity(embeddings, embeddings)
164
  print(similarities)
165
- # tensor([[0.9961, 0.9570, 0.4941],
166
- # [0.9570, 0.9961, 0.5078],
167
- # [0.4941, 0.5078, 1.0000]], dtype=torch.bfloat16)
168
  ```
169
 
170
  <!--
@@ -228,24 +183,25 @@ You can finetune this model on your own dataset.
228
  #### LangCache Sentence Pairs (all)
229
 
230
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
231
- * Size: 62,021 training samples
232
- * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
233
  * Approximate statistics based on the first 1000 samples:
234
- | | sentence1 | sentence2 | label |
235
- |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:------------------------------------------------|
236
- | type | string | string | int |
237
- | details | <ul><li>min: 8 tokens</li><li>mean: 27.46 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.36 tokens</li><li>max: 52 tokens</li></ul> | <ul><li>0: ~50.30%</li><li>1: ~49.70%</li></ul> |
238
  * Samples:
239
- | sentence1 | sentence2 | label |
240
- |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:---------------|
241
- | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>1</code> |
242
- | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> | <code>0</code> |
243
- | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
244
- * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
245
  ```json
246
  {
247
  "scale": 20.0,
248
- "similarity_fct": "pairwise_cos_sim"
 
249
  }
250
  ```
251
 
@@ -254,24 +210,25 @@ You can finetune this model on your own dataset.
254
  #### LangCache Sentence Pairs (all)
255
 
256
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
257
- * Size: 62,021 evaluation samples
258
- * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
259
  * Approximate statistics based on the first 1000 samples:
260
- | | sentence1 | sentence2 | label |
261
- |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:------------------------------------------------|
262
- | type | string | string | int |
263
- | details | <ul><li>min: 8 tokens</li><li>mean: 27.46 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.36 tokens</li><li>max: 52 tokens</li></ul> | <ul><li>0: ~50.30%</li><li>1: ~49.70%</li></ul> |
264
  * Samples:
265
- | sentence1 | sentence2 | label |
266
- |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:---------------|
267
- | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>1</code> |
268
- | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> | <code>0</code> |
269
- | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
270
- * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
271
  ```json
272
  {
273
  "scale": 20.0,
274
- "similarity_fct": "pairwise_cos_sim"
 
275
  }
276
  ```
277
 
@@ -307,14 +264,15 @@ You can finetune this model on your own dataset.
307
  }
308
  ```
309
 
310
- #### CoSENTLoss
311
  ```bibtex
312
- @online{kexuefm-8847,
313
- title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
314
- author={Su Jianlin},
315
- year={2022},
316
- month={Jan},
317
- url={https://kexue.fm/archives/8847},
 
318
  }
319
  ```
320
 
 
12
  - retrieval
13
  - reranking
14
  - generated_from_trainer
15
+ - dataset_size:1451941
16
+ - loss:MultipleNegativesRankingLoss
17
  base_model: Alibaba-NLP/gte-modernbert-base
18
  datasets:
19
  - redis/langcache-sentencepairs-v1
20
  pipeline_tag: sentence-similarity
 
106
  model = SentenceTransformer("redis/langcache-embed-v3")
107
  # Run inference
108
  sentences = [
109
+ 'The weather is lovely today.',
110
+ "It's so sunny outside!",
111
+ 'He drove to the stadium.',
112
  ]
113
  embeddings = model.encode(sentences)
114
  print(embeddings.shape)
 
117
  # Get the similarity scores for the embeddings
118
  similarities = model.similarity(embeddings, embeddings)
119
  print(similarities)
120
+ # tensor([[0.9922, 0.7891, 0.4629],
121
+ # [0.7891, 1.0000, 0.5117],
122
+ # [0.4629, 0.5117, 1.0000]], dtype=torch.bfloat16)
123
  ```
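For readers comparing the old and new example outputs, here is a minimal sketch of how a pairwise similarity matrix like the one printed above can be computed. It assumes cosine similarity is the scoring function; this diff does not show the model's similarity setting, and only the loss configuration lists `cos_sim`. The function name `cosine_similarity_matrix` and the toy vectors are hypothetical stand-ins, not part of the model card or the library.

```python
import math

def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity matrix for a list of vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [
            sum(a * b for a, b in zip(u, v)) / (nu * nv)
            for v, nv in zip(vectors, norms)
        ]
        for u, nu in zip(vectors, norms)
    ]

# Hypothetical 2-d "embeddings"; real ones would come from model.encode(sentences).
toy_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
toy_similarities = cosine_similarity_matrix(toy_embeddings)
```

In exact arithmetic the diagonal of such a matrix is 1.0. The 0.9922 entry in the printed tensor is a value representable in bfloat16 just below 1, so it is plausibly a reduced-precision effect rather than a property of the sentences.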
124
 
125
  <!--
 
183
  #### LangCache Sentence Pairs (all)
184
 
185
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
186
+ * Size: 109,885 training samples
187
+ * Columns: <code>texts</code>
188
  * Approximate statistics based on the first 1000 samples:
189
+ | | texts |
190
+ |:--------|:--------------------------------------------------------------------------------------|
191
+ | type | list |
192
+ | details | <ul><li>min: 3 elements</li><li>mean: 3.50 elements</li><li>max: 4 elements</li></ul> |
193
  * Samples:
194
+ | texts |
195
+ |:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
196
+ | <code>['The newer Punts are still very much in existence today and race in the same fleets as the older boats .', 'The newer punts are still very much in existence today and run in the same fleets as the older boats .', 'how can I get financial freedom as soon as possible?']</code> |
197
+ | <code>['The newer punts are still very much in existence today and run in the same fleets as the older boats .', 'The newer Punts are still very much in existence today and race in the same fleets as the older boats .', 'The older Punts are still very much in existence today and race in the same fleets as the newer boats .']</code> |
198
+ | <code>['Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .', 'Turner Valley , , was located at Turner Valley Bar N Ranch Airport , southwest of Turner Valley Bar N Ranch , Alberta , Canada .', 'Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .']</code> |
199
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
200
  ```json
201
  {
202
  "scale": 20.0,
203
+ "similarity_fct": "cos_sim",
204
+ "gather_across_devices": false
205
  }
206
  ```
207
 
 
210
  #### LangCache Sentence Pairs (all)
211
 
212
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
213
+ * Size: 109,885 evaluation samples
214
+ * Columns: <code>texts</code>
215
  * Approximate statistics based on the first 1000 samples:
216
+ | | texts |
217
+ |:--------|:--------------------------------------------------------------------------------------|
218
+ | type | list |
219
+ | details | <ul><li>min: 3 elements</li><li>mean: 3.50 elements</li><li>max: 4 elements</li></ul> |
220
  * Samples:
221
+ | texts |
222
+ |:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
223
+ | <code>['The newer Punts are still very much in existence today and race in the same fleets as the older boats .', 'The newer punts are still very much in existence today and run in the same fleets as the older boats .', 'how can I get financial freedom as soon as possible?']</code> |
224
+ | <code>['The newer punts are still very much in existence today and run in the same fleets as the older boats .', 'The newer Punts are still very much in existence today and race in the same fleets as the older boats .', 'The older Punts are still very much in existence today and race in the same fleets as the newer boats .']</code> |
225
+ | <code>['Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .', 'Turner Valley , , was located at Turner Valley Bar N Ranch Airport , southwest of Turner Valley Bar N Ranch , Alberta , Canada .', 'Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .']</code> |
226
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
227
  ```json
228
  {
229
  "scale": 20.0,
230
+ "similarity_fct": "cos_sim",
231
+ "gather_across_devices": false
232
  }
233
  ```
234
 
 
264
  }
265
  ```
266
 
267
+ #### MultipleNegativesRankingLoss
268
  ```bibtex
269
+ @misc{henderson2017efficient,
270
+ title={Efficient Natural Language Response Suggestion for Smart Reply},
271
+ author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
272
+ year={2017},
273
+ eprint={1705.00652},
274
+ archivePrefix={arXiv},
275
+ primaryClass={cs.CL}
276
  }
277
  ```
278
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f385a0d550f86ca8a89fefed1927e1f9254bbdf1eb157b251edbe7a7d712304f
3
  size 298041696
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:95d02211c4cca89113f9f3e93ed91f5176bf50170faa2cb835f7bfea15bb9dd2
3
  size 298041696
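The `model.safetensors` change above touches only a Git LFS pointer file, so a small sketch can check what actually changed. The parser below assumes the three-line `key value` layout visible in this diff; `parse_lfs_pointer` is a hypothetical helper, not part of Git LFS or the Hugging Face tooling.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }

# Pointer contents copied from the two sides of the diff above.
OLD_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:f385a0d550f86ca8a89fefed1927e1f9254bbdf1eb157b251edbe7a7d712304f
size 298041696
"""
NEW_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:95d02211c4cca89113f9f3e93ed91f5176bf50170faa2cb835f7bfea15bb9dd2
size 298041696
"""

old = parse_lfs_pointer(OLD_POINTER)
new = parse_lfs_pointer(NEW_POINTER)
weights_changed = old["digest"] != new["digest"]
same_size = old["size_bytes"] == new["size_bytes"]
```

A different digest with an identical byte size is consistent with retrained weights stored in the same architecture and dtype, though the diff itself does not confirm this.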