langcache-embed-v3 / README.md
metadata
language:
  - en
license: apache-2.0
tags:
  - biencoder
  - sentence-transformers
  - text-classification
  - sentence-pair-classification
  - semantic-similarity
  - semantic-search
  - retrieval
  - reranking
  - generated_from_trainer
  - dataset_size:2200421
  - loss:CoSENTLoss
base_model: Alibaba-NLP/gte-modernbert-base
widget:
  - source_sentence: They are sometimes called Marg or also Path in Hindi .
    sentences:
      - >-
        Largs was born in Brisbane House in Noddsdale , near Brisbane in
        Ayrshire , Scotland , the son of Sir Thomas Brisbane and Dame Eleanora
        Brisbane .
      - >-
        Its smallest radius is 1.4 ( 131 thousand light years ) and largest 0.7
        angle minutes ( 65 thousand light years ) .
      - They are also called Marg or sometimes the path in the Hindi .
  - source_sentence: >-
      The main mode of play in `` Crash Bash `` is the Adventure Mode , in which
      one or two players must win all 28 levels to complete .
    sentences:
      - >-
        Parkton is a city in Robeson County , North Carolina , in the Lumberton
        Metro area , in the United States .
      - >-
        The CANTAB tests were developed by Professor Barbara Sahakian and
        Professor Trevor Robbins .
      - >-
        The main mode in `` Crash Bash `` is the adventure mode in which one or
        two players must complete all 28 levels to win .
  - source_sentence: >-
      It was formed in December 2014 from elements of the disbanded 51st
      Mechanized Brigade and newly mobilized units .
    sentences:
      - >-
        It had branches in feature films , television , physical and digital
        publishing , merchandise , recorded music , digital and online media
        applications and mobile and social games .
      - >-
        Notts County and Arsenal were relegated to the Second Division ; Preston
        North End and Burnley were promoted to the First Division .
      - >-
        It was formed in December 2014 from elements of the dissolved 51st
        Mechanized Brigade and newly mobilized units .
  - source_sentence: >-
      The band pursued `` signals `` in January 2012 in three weeks , and drums
      were recorded in a day and a half .
    sentences:
      - >-
        Kearsarge Lakes , Kearsarge Pass Trail , and Rae Lakes all have a
        maximum 2 nights stay , and Bullfrog Lake along the Charlotte Lake is
        closed to camping .
      - >-
        The band tracked `` Signals `` in three weeks in January 2012 . Drums
        were recorded in a day and a half .
      - >-
        From 1954 to 1961 , he was married to Stella Caralis and from 1978 until
        his death with Nina Bohlen .
  - source_sentence: >-
      A special case is of the Country B loyalist who controls agents or
      provides managerial supporting or other functions against Country A .
    sentences:
      - >-
        A special case is the loyalist of Country B , who controls agents or
        provides management support or other functions against Country A .
      - >-
        Music Story is a music service website and international music data
        provider that curates , aggregates and analyses metadata for digital
        music services .
      - >-
        These six cars were painted in the same lacquering as the buffet cars ,
        silver with red lines and text .
datasets:
  - redis/langcache-sentencepairs-v2
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_precision@1
  - cosine_recall@1
  - cosine_ndcg@10
  - cosine_mrr@1
  - cosine_map@100
model-index:
  - name: Redis fine-tuned BiEncoder model for semantic caching on LangCache
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: test
          type: test
        metrics:
          - type: cosine_accuracy@1
            value: 0.5861241448475948
            name: Cosine Accuracy@1
          - type: cosine_precision@1
            value: 0.5861241448475948
            name: Cosine Precision@1
          - type: cosine_recall@1
            value: 0.5679885764966713
            name: Cosine Recall@1
          - type: cosine_ndcg@10
            value: 0.7729838064849864
            name: Cosine Ndcg@10
          - type: cosine_mrr@1
            value: 0.5861241448475948
            name: Cosine Mrr@1
          - type: cosine_map@100
            value: 0.7216697804426214
            name: Cosine Map@100

Redis fine-tuned BiEncoder model for semantic caching on LangCache

This is a sentence-transformers model fine-tuned from Alibaba-NLP/gte-modernbert-base on the LangCache Sentence Pairs (all) dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used to score sentence-pair similarity, for example to decide whether a new query matches a cached one in semantic caching.

Model Details

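Model Description

  • Model Type: Sentence Transformer
  • Base model: Alibaba-NLP/gte-modernbert-base
  • Maximum Sequence Length: 100 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: LangCache Sentence Pairs (all)
  • Language: en
  • License: apache-2.0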

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 100, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
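
The printout above has `pooling_mode_cls_token` set to `True`, so the sentence embedding is the hidden state of the first token rather than an average over all tokens. A minimal NumPy sketch of that pooling step is shown below. It uses random stand-in values instead of real ModernBERT outputs, and the function name `cls_pool` is hypothetical, not part of the library.

```python
import numpy as np

def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Return the first-token ([CLS]) vector of each sequence.

    token_embeddings has shape (batch, seq_len, hidden).
    """
    return token_embeddings[:, 0, :]

# Toy input (assumption): 2 sequences, 4 tokens each, hidden size 768 as in the card.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 768))
sentence_vectors = cls_pool(tokens)
print(sentence_vectors.shape)  # (2, 768)
```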

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("redis/langcache-embed-v3")
# Run inference
sentences = [
    'A special case is of the Country B loyalist who controls agents or provides managerial supporting or other functions against Country A .',
    'A special case is the loyalist of Country B , who controls agents or provides management support or other functions against Country A .',
    'Music Story is a music service website and international music data provider that curates , aggregates and analyses metadata for digital music services .',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.9844, 0.5195],
#         [0.9844, 0.9922, 0.5078],
#         [0.5195, 0.5078, 0.9922]], dtype=torch.bfloat16)
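
For semantic caching, the similarity scores are typically compared against a threshold to decide whether a new query can reuse a cached answer. The sketch below illustrates that lookup with plain NumPy vectors standing in for the output of `model.encode(...)`. The function name `best_cache_hit` and the threshold of 0.9 are assumptions for illustration, not values recommended by this card; a real threshold should be tuned on your own data.

```python
import numpy as np

def best_cache_hit(query_vec, cached_vecs, threshold=0.9):
    """Return (index, score) of the most similar cached entry.

    The index is None when the best cosine similarity is below the threshold.
    """
    query_vec = np.asarray(query_vec, dtype=np.float64)
    cached_vecs = np.asarray(cached_vecs, dtype=np.float64)
    # Normalise so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = cached_vecs / np.linalg.norm(cached_vecs, axis=1, keepdims=True)
    scores = c @ q
    idx = int(np.argmax(scores))
    score = float(scores[idx])
    return (idx if score >= threshold else None), score

# Toy 3-dimensional vectors (assumption); real embeddings have 768 dimensions.
cached = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
hit_index, hit_score = best_cache_hit([0.9, 0.1, 0.0], cached)
miss_index, miss_score = best_cache_hit([1.0, 1.0, 0.0], cached)
print(hit_index, miss_index)  # 0 None
```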

Evaluation

Metrics

Information Retrieval

| Metric             | Value  |
|--------------------|--------|
| cosine_accuracy@1  | 0.5861 |
| cosine_precision@1 | 0.5861 |
| cosine_recall@1    | 0.5680 |
| cosine_ndcg@10     | 0.7730 |
| cosine_mrr@1       | 0.5861 |
| cosine_map@100     | 0.7217 |
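
Accuracy@1, precision@1, and MRR@1 report the same value here because, with a single retrieved item per query, all three reduce to "was the top result relevant?". The sketch below shows this with the usual information-retrieval definitions on made-up relevance flags. It is an illustration under those assumptions, not the evaluator's actual code, and the helper names are hypothetical.

```python
def accuracy_at_k(ranked_relevance, k):
    """Fraction of queries with at least one relevant item in the top k."""
    return sum(any(r[:k]) for r in ranked_relevance) / len(ranked_relevance)

def mrr_at_k(ranked_relevance, k):
    """Mean reciprocal rank of the first relevant item in the top k (0 if none)."""
    total = 0.0
    for r in ranked_relevance:
        for rank, rel in enumerate(r[:k], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

# Hypothetical relevance flags for 4 queries, best-ranked result first.
ranked = [
    [True, False, False],
    [False, True, False],
    [False, False, False],
    [True, True, False],
]
print(accuracy_at_k(ranked, 1))  # 0.5
print(mrr_at_k(ranked, 1))       # 0.5
print(mrr_at_k(ranked, 3))       # 0.625
```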

Training Details

Training Dataset

LangCache Sentence Pairs (all)

  • Dataset: LangCache Sentence Pairs (all)
  • Size: 72,021 training samples
  • Columns: sentence_a, sentence_b, and label
  • Approximate statistics based on the first 1000 samples:
    |         | sentence_a | sentence_b | label |
    |---------|------------|------------|-------|
    | type    | string     | string     | int   |
    | details | min: 8 tokens, mean: 27.46 tokens, max: 53 tokens | min: 9 tokens, mean: 27.36 tokens, max: 52 tokens | 0: ~50.30%, 1: ~49.70% |
  • Samples:
    | sentence_a | sentence_b | label |
    |------------|------------|-------|
    | The newer Punts are still very much in existence today and race in the same fleets as the older boats . | The newer punts are still very much in existence today and run in the same fleets as the older boats . | 1 |
    | Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada . | Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada . | 0 |
    | After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall . | Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall . | 1 |
  • Loss: CoSENTLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "pairwise_cos_sim"
    }
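
To make the two parameters concrete: CoSENT, as described in the blog post cited at the end of this card, penalises every case where a pair with a lower label gets a higher cosine similarity than a pair with a higher label, roughly loss = log(1 + Σ exp(scale · (cos_low − cos_high))). The NumPy sketch below follows that formula with `scale=20.0` and pairwise cosine similarity. It is an illustrative re-implementation under those assumptions, not the library's code, and the function name `cosent_loss` is hypothetical.

```python
import numpy as np

def cosent_loss(emb_a, emb_b, labels, scale=20.0):
    """CoSENT-style ranking loss over a batch of sentence pairs."""
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)
    labels = np.asarray(labels)
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = scale * np.sum(a * b, axis=1)          # scaled pairwise cosine
    diffs = sims[:, None] - sims[None, :]         # diffs[i, j] = s_i - s_j
    mask = labels[:, None] < labels[None, :]      # pair i should score below pair j
    terms = np.concatenate(([0.0], diffs[mask]))  # leading 0 supplies the "1 +"
    m = terms.max()                               # stabilised log-sum-exp
    return float(m + np.log(np.sum(np.exp(terms - m))))

# Toy batch (assumption): pair 0 is identical vectors, pair 1 is orthogonal.
emb_a = [[1.0, 0.0], [1.0, 0.0]]
emb_b = [[1.0, 0.0], [0.0, 1.0]]
well_ranked = cosent_loss(emb_a, emb_b, labels=[1, 0])   # close to 0
badly_ranked = cosent_loss(emb_a, emb_b, labels=[0, 1])  # close to 20
print(well_ranked, badly_ranked)
```

With a scale of 20, a cosine gap of 1.0 in the wrong direction costs about 20, while the same gap in the right direction costs almost nothing, which is why the scale controls how sharply ranking violations are punished.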
    

Evaluation Dataset

LangCache Sentence Pairs (all)

  • Dataset: LangCache Sentence Pairs (all)
  • Size: 72,021 evaluation samples
  • Columns: sentence_a, sentence_b, and label
  • Approximate statistics based on the first 1000 samples:
    |         | sentence_a | sentence_b | label |
    |---------|------------|------------|-------|
    | type    | string     | string     | int   |
    | details | min: 8 tokens, mean: 27.46 tokens, max: 53 tokens | min: 9 tokens, mean: 27.36 tokens, max: 52 tokens | 0: ~50.30%, 1: ~49.70% |
  • Samples:
    | sentence_a | sentence_b | label |
    |------------|------------|-------|
    | The newer Punts are still very much in existence today and race in the same fleets as the older boats . | The newer punts are still very much in existence today and run in the same fleets as the older boats . | 1 |
    | Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada . | Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada . | 0 |
    | After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall . | Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall . | 1 |
  • Loss: CoSENTLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "pairwise_cos_sim"
    }
    

Training Logs

| Epoch | Step | test_cosine_ndcg@10 |
|-------|------|---------------------|
| -1    | -1   | 0.7730              |

Framework Versions

  • Python: 3.12.3
  • Sentence Transformers: 5.1.0
  • Transformers: 4.56.0
  • PyTorch: 2.8.0+cu128
  • Accelerate: 1.10.1
  • Datasets: 4.0.0
  • Tokenizers: 0.22.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CoSENTLoss

@online{kexuefm-8847,
    title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
    author={Su Jianlin},
    year={2022},
    month={Jan},
    url={https://kexue.fm/archives/8847},
}