radoslavralev committed on
Commit ee88725 · verified · 1 Parent(s): 1f3bf1b

Add new SentenceTransformer model

Files changed (2)
  1. README.md +94 -68
  2. model.safetensors +1 -1
README.md CHANGED
@@ -12,51 +12,54 @@ tags:
12
  - retrieval
13
  - reranking
14
  - generated_from_trainer
15
- - dataset_size:3119809
16
- - loss:CustomBCELoss
17
  base_model: Alibaba-NLP/gte-modernbert-base
18
  widget:
19
- - source_sentence: Hayley Vaughan portrayed Ripa on the ABC daytime soap opera , ``
20
- All My Children `` , between 1990 and 2002 .
21
  sentences:
22
- - Traxxpad is a music application for Sony 's PlayStation Portable published by
23
- Definitive Studios and developed by Eidos Interactive .
24
- - Between 1990 and 2002 , Hayley Vaughan Ripa portrayed in the ABC soap opera ``
25
- All My Children `` .
26
- - Between 1990 and 2002 , Ripa Hayley portrayed Vaughan in the ABC soap opera ``
27
- All My Children `` .
28
- - source_sentence: Olivella monilifera is a species of dwarf sea snail , small gastropod
29
- mollusk in the family Olivellidae , the marine olives .
30
  sentences:
31
- - Olivella monilifera is a species of the dwarf - sea snail , small gastropod mollusk
32
- in the Olivellidae family , the marine olives .
33
- - He was cut by the Browns after being signed by the Bills in 2013 . He was later
34
- released .
35
- - Olivella monilifera is a kind of sea snail , marine gastropod mollusk in the Olivellidae
36
- family , the dwarf olives .
37
- - source_sentence: Hayashi said that Mackey `` is a sort of `` of the original model
38
- for Tenchi .
39
  sentences:
40
- - In the summer of 2009 , Ellick shot a documentary about Malala Yousafzai .
41
- - Hayashi said that Mackey is `` sort of `` the original model for Tenchi .
42
- - Mackey said that Hayashi is `` sort of `` the original model for Tenchi .
43
- - source_sentence: Much of the film was shot on location in Los Angeles and in nearby
44
- Burbank and Glendale .
45
  sentences:
46
- - Much of the film was shot on location in Los Angeles and in nearby Burbank and
47
- Glendale .
48
- - Much of the film was shot on site in Burbank and Glendale and in the nearby Los
49
- Angeles .
50
- - Traxxpad is a music application for the Sony PlayStation Portable developed by
51
- the Definitive Studios and published by Eidos Interactive .
52
- - source_sentence: According to him , the earth is the carrier of his artistic work
53
- , which is only integrated into the creative process by minimal changes .
54
  sentences:
55
- - National players are Bold players .
56
- - According to him , earth is the carrier of his artistic work being integrated
57
- into the creative process only by minimal changes .
58
- - According to him , earth is the carrier of his creative work being integrated
59
- into the artistic process only by minimal changes .
 
60
  datasets:
61
  - redis/langcache-sentencepairs-v2
62
  pipeline_tag: sentence-similarity
@@ -148,9 +151,9 @@ from sentence_transformers import SentenceTransformer
148
  model = SentenceTransformer("redis/langcache-embed-v3")
149
  # Run inference
150
  sentences = [
151
- 'According to him , the earth is the carrier of his artistic work , which is only integrated into the creative process by minimal changes .',
152
- 'According to him , earth is the carrier of his artistic work being integrated into the creative process only by minimal changes .',
153
- 'According to him , earth is the carrier of his creative work being integrated into the artistic process only by minimal changes .',
154
  ]
155
  embeddings = model.encode(sentences)
156
  print(embeddings.shape)
@@ -159,9 +162,9 @@ print(embeddings.shape)
159
  # Get the similarity scores for the embeddings
160
  similarities = model.similarity(embeddings, embeddings)
161
  print(similarities)
162
- # tensor([[1.0000, 0.9961, 0.9922],
163
- # [0.9961, 1.0000, 0.9961],
164
- # [0.9922, 0.9961, 0.9961]], dtype=torch.bfloat16)
165
  ```
166
 
167
  <!--
@@ -225,40 +228,52 @@ You can finetune this model on your own dataset.
225
  #### LangCache Sentence Pairs (all)
226
 
227
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
228
- * Size: 126,938 training samples
229
- * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
230
  * Approximate statistics based on the first 1000 samples:
231
- | | anchor | positive | negative |
232
- |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
233
- | type | string | string | string |
234
- | details | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 48 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 26.54 tokens</li><li>max: 61 tokens</li></ul> |
235
  * Samples:
236
- | anchor | positive | negative |
237
- |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|
238
- | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>how can I get financial freedom as soon as possible?</code> |
239
- | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The older Punts are still very much in existence today and race in the same fleets as the newer boats .</code> |
240
- | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , , was located at Turner Valley Bar N Ranch Airport , southwest of Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
241
- * Loss: <code>losses.CustomBCELoss</code>
242
 
243
  ### Evaluation Dataset
244
 
245
  #### LangCache Sentence Pairs (all)
246
 
247
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
248
- * Size: 126,938 evaluation samples
249
- * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
250
  * Approximate statistics based on the first 1000 samples:
251
- | | anchor | positive | negative |
252
- |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
253
- | type | string | string | string |
254
- | details | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 48 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 26.54 tokens</li><li>max: 61 tokens</li></ul> |
255
  * Samples:
256
- | anchor | positive | negative |
257
- |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|
258
- | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>how can I get financial freedom as soon as possible?</code> |
259
- | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The older Punts are still very much in existence today and race in the same fleets as the newer boats .</code> |
260
- | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , , was located at Turner Valley Bar N Ranch Airport , southwest of Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
261
- * Loss: <code>losses.CustomBCELoss</code>
262
 
263
  ### Training Logs
264
  | Epoch | Step | test_cosine_ndcg@10 |
@@ -292,6 +307,17 @@ You can finetune this model on your own dataset.
292
  }
293
  ```
294
 
295
  <!--
296
  ## Glossary
297
 
 
12
  - retrieval
13
  - reranking
14
  - generated_from_trainer
15
+ - dataset_size:2200421
16
+ - loss:CoSENTLoss
17
  base_model: Alibaba-NLP/gte-modernbert-base
18
  widget:
19
+ - source_sentence: They are sometimes called Marg or also Path in Hindi .
 
20
  sentences:
21
+ - Largs was born in Brisbane House in Noddsdale , near Brisbane in Ayrshire , Scotland
22
+ , the son of Sir Thomas Brisbane and Dame Eleanora Brisbane .
23
+ - Its smallest radius is 1.4 ( 131 thousand light years ) and largest 0.7 angle
24
+ minutes ( 65 thousand light years ) .
25
+ - They are also called Marg or sometimes the path in the Hindi .
26
+ - source_sentence: The main mode of play in `` Crash Bash `` is the Adventure Mode
27
+ , in which one or two players must win all 28 levels to complete .
 
28
  sentences:
29
+ - Parkton is a city in Robeson County , North Carolina , in the Lumberton Metro
30
+ area , in the United States .
31
+ - The CANTAB tests were developed by Professor Barbara Sahakian and Professor Trevor
32
+ Robbins .
33
+ - The main mode in `` Crash Bash `` is the adventure mode in which one or two players
34
+ must complete all 28 levels to win .
35
+ - source_sentence: It was formed in December 2014 from elements of the disbanded 51st
36
+ Mechanized Brigade and newly mobilized units .
37
  sentences:
38
+ - It had branches in feature films , television , physical and digital publishing
39
+ , merchandise , recorded music , digital and online media applications and mobile
40
+ and social games .
41
+ - Notts County and Arsenal were relegated to the Second Division ; Preston North
42
+ End and Burnley were promoted to the First Division .
43
+ - It was formed in December 2014 from elements of the dissolved 51st Mechanized
44
+ Brigade and newly mobilized units .
45
+ - source_sentence: The band pursued `` signals `` in January 2012 in three weeks ,
46
+ and drums were recorded in a day and a half .
47
  sentences:
48
+ - Kearsarge Lakes , Kearsarge Pass Trail , and Rae Lakes all have a maximum 2 nights
49
+ stay , and Bullfrog Lake along the Charlotte Lake is closed to camping .
50
+ - The band tracked `` Signals `` in three weeks in January 2012 . Drums were recorded
51
+ in a day and a half .
52
+ - From 1954 to 1961 , he was married to Stella Caralis and from 1978 until his death
53
+ with Nina Bohlen .
54
+ - source_sentence: A special case is of the Country B loyalist who controls agents
55
+ or provides managerial supporting or other functions against Country A .
56
  sentences:
57
+ - A special case is the loyalist of Country B , who controls agents or provides
58
+ management support or other functions against Country A .
59
+ - Music Story is a music service website and international music data provider that
60
+ curates , aggregates and analyses metadata for digital music services .
61
+ - These six cars were painted in the same lacquering as the buffet cars , silver
62
+ with red lines and text .
63
  datasets:
64
  - redis/langcache-sentencepairs-v2
65
  pipeline_tag: sentence-similarity
 
151
  model = SentenceTransformer("redis/langcache-embed-v3")
152
  # Run inference
153
  sentences = [
154
+ 'A special case is of the Country B loyalist who controls agents or provides managerial supporting or other functions against Country A .',
155
+ 'A special case is the loyalist of Country B , who controls agents or provides management support or other functions against Country A .',
156
+ 'Music Story is a music service website and international music data provider that curates , aggregates and analyses metadata for digital music services .',
157
  ]
158
  embeddings = model.encode(sentences)
159
  print(embeddings.shape)
 
162
  # Get the similarity scores for the embeddings
163
  similarities = model.similarity(embeddings, embeddings)
164
  print(similarities)
165
+ # tensor([[1.0000, 0.9844, 0.5195],
166
+ # [0.9844, 0.9922, 0.5078],
167
+ # [0.5195, 0.5078, 0.9922]], dtype=torch.bfloat16)
168
  ```
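
The similarity tensor printed in the snippet above is in `torch.bfloat16`, which likely explains why some diagonal entries read 0.9922 rather than 1.0000: bfloat16 keeps only 8 significant bits, so the two representable values just below 1 are 1 - 2^-8 = 0.99609375 and 1 - 2^-7 = 0.9921875. The sketch below illustrates this in plain Python. It is not the library's implementation; `cosine_matrix` and `to_bfloat16` are hypothetical helper names and the toy vectors are made up.

```python
import math
import struct


def cosine_matrix(vectors):
    """Pairwise cosine similarity between all rows, in full float precision."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [
            sum(a * b for a, b in zip(u, v)) / (nu * nv)
            for v, nv in zip(vectors, norms)
        ]
        for u, nu in zip(vectors, norms)
    ]


def to_bfloat16(x):
    """Round a finite float to the nearest bfloat16 value (ties to even).

    bfloat16 is the top 16 bits of a float32, so we round away the low 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Toy embeddings (illustrative only, not model output).
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full_precision = cosine_matrix(toy_embeddings)

# Values reported in a bfloat16 tensor snap to the coarse bfloat16 grid.
snapped = [to_bfloat16(v) for v in (1.0, 0.9961, 0.9922)]
```

In full precision the diagonal of a cosine similarity matrix is 1 up to floating-point error; values such as 0.9961 and 0.9922 are exactly one and two bfloat16 steps below 1.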
169
 
170
  <!--
 
228
  #### LangCache Sentence Pairs (all)
229
 
230
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
231
+ * Size: 72,021 training samples
232
+ * Columns: <code>sentence_a</code>, <code>sentence_b</code>, and <code>label</code>
233
  * Approximate statistics based on the first 1000 samples:
234
+ | | sentence_a | sentence_b | label |
235
+ |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:------------------------------------------------|
236
+ | type | string | string | int |
237
+ | details | <ul><li>min: 8 tokens</li><li>mean: 27.46 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.36 tokens</li><li>max: 52 tokens</li></ul> | <ul><li>0: ~50.30%</li><li>1: ~49.70%</li></ul> |
238
  * Samples:
239
+ | sentence_a | sentence_b | label |
240
+ |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:---------------|
241
+ | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>1</code> |
242
+ | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> | <code>0</code> |
243
+ | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
244
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
245
+ ```json
246
+ {
247
+ "scale": 20.0,
248
+ "similarity_fct": "pairwise_cos_sim"
249
+ }
250
+ ```
251
 
252
  ### Evaluation Dataset
253
 
254
  #### LangCache Sentence Pairs (all)
255
 
256
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
257
+ * Size: 72,021 evaluation samples
258
+ * Columns: <code>sentence_a</code>, <code>sentence_b</code>, and <code>label</code>
259
  * Approximate statistics based on the first 1000 samples:
260
+ | | sentence_a | sentence_b | label |
261
+ |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:------------------------------------------------|
262
+ | type | string | string | int |
263
+ | details | <ul><li>min: 8 tokens</li><li>mean: 27.46 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.36 tokens</li><li>max: 52 tokens</li></ul> | <ul><li>0: ~50.30%</li><li>1: ~49.70%</li></ul> |
264
  * Samples:
265
+ | sentence_a | sentence_b | label |
266
+ |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:---------------|
267
+ | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>1</code> |
268
+ | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> | <code>0</code> |
269
+ | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
270
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
271
+ ```json
272
+ {
273
+ "scale": 20.0,
274
+ "similarity_fct": "pairwise_cos_sim"
275
+ }
276
+ ```
277
 
278
  ### Training Logs
279
  | Epoch | Step | test_cosine_ndcg@10 |
 
307
  }
308
  ```
309
 
310
+ #### CoSENTLoss
311
+ ```bibtex
312
+ @online{kexuefm-8847,
313
+ title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
314
+ author={Su Jianlin},
315
+ year={2022},
316
+ month={Jan},
317
+ url={https://kexue.fm/archives/8847},
318
+ }
319
+ ```
320
+
321
  <!--
322
  ## Glossary
323
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:df591d1a7c1b7bcc9b50299969bf975583c36ab94d873953a511566669307515
3
  size 298041696
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:95d02211c4cca89113f9f3e93ed91f5176bf50170faa2cb835f7bfea15bb9dd2
3
  size 298041696