---
license: apache-2.0
datasets:
- bigcode/the-stack-v2
- tiiuae/falcon-refinedweb
library_name: transformers
language:
- code
---

## SageLite-s

### Model description
SageLite is a new family of open embedding models with an encoder architecture that supports a wide range of tasks in both code and text. SageLite went through three stages of training: (1) standard MLM pretraining on mixed code and text data ([The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)), (2) contrastive pre-finetuning on a large amount of positive pairs mined from web data and GitHub, and (3) contrastive finetuning on a small amount of synthetic data.

### Code Retrieval Performance

##### 1. Code2Code Search
| Model Name | # Params | Embd Dim | Python | Java | JS | TS | C# | C | Ruby | PHP | GO | AVG |
|---------------------|----------|----------|--------|-------|-------|--------|--------|--------|--------|--------|--------|--------|
| OpenAI-Code-01 | NA | 3072 | 21.92 | 8.90 | 4.90 | 5.70 | 3.15 | 11.58 | 26.25 | 16.60 | 9.40 | 12.04 |
| OpenAI-Text-3-Small | NA | 1536 | 25.18 | 12.61 | 8.00 | 9.44 | 5.46 | 15.86 | 30.70 | 23.33 | 11.20 | 15.57 |
| OpenAI-Text-3-Large | NA | 3072 | 40.57 | 25.33 | 20.09 | 22.00 | 11.84 | 31.90 | 42.54 | 41.84 | 21.75 | 28.65 |
| CodeSage-v2-Small | 130M | 1024 | 45.60 | 33.65 | 39.96 | 47.78 | 19.19 | 30.55 | 40.12 | 55.39 | 30.96 | 38.13 |
| CodeSage-v2-Base | 356M | 1024 | 55.86 | 42.89 | 45.29 | 54.58 | 23.90 | 38.52 | 56.02 | 64.56 | 42.88 | 47.17 |
| CodeSage-v2-Large | 1.3B | 2048 | 61.11 | 47.09 | 51.18 | 60.67 | 28.04 | 43.40 | 60.74 | 67.87 | 43.86 | 51.55 |
| SageLite-s | 80M | 768 | 47.93 | 30.83 | 35.15 | 37.64 | 18.14 | 30.53 | 42.89 | 50.70 | 21.69 | 35.06 |
| SageLite-l | 850M | 1536 | 64.46 | 45.53 | 50.80 | 54.71 | 30.66 | 47.46 | 61.01 | 68.68 | 39.25 | 51.40 |

##### 2. NL2Code Search
| Model Name | # Params | CoSQA | AdvTest | Python | Java | JS | PHP | GO | Ruby | Avg |
|---------------------|----------|-------|---------|--------|-------|-------|--------|--------|--------|--------|
| OpenAI-Code-01 | NA | 52.20 | 36.03 | 63.13 | 67.85 | 62.30 | 57.47 | 85.22 | 69.28 | 61.69 |
| OpenAI-Text-3-Small | NA | 52.48 | 34.10 | 62.62 | 65.87 | 60.28 | 54.85 | 81.96 | 67.57 | 59.97 |
| OpenAI-Text-3-Large | NA | 55.21 | 46.83 | 70.81 | 72.89 | 68.12 | 59.58 | 87.60 | 75.22 | 67.03 |
| CodeSage-v2-Small | 130M | 52.39 | 47.28 | 68.79 | 68.13 | 65.77 | 60.20 | 80.26 | 72.46 | 64.41 |
| CodeSage-v2-Base | 356M | 50.74 | 52.00 | 70.46 | 70.89 | 69.61 | 62.81 | 82.37 | 73.71 | 66.57 |
| CodeSage-v2-Large | 1.3B | 53.18 | 56.31 | 74.18 | 72.33 | 72.49 | 65.26 | 84.67 | 76.61 | 69.38 |
| SageLite-s | 80M | 56.49 | 42.32 | 67.59 | 66.62 | 62.32 | 58.87 | 79.36 | 70.75 | 63.04 |
| SageLite-l | 850M | 59.76 | 55.55 | 74.25 | 71.76 | 69.35 | 61.62 | 84.09 | 77.14 | 69.19 |

### Text Retrieval Performance ([MTEB Retrieval](https://huggingface.co/spaces/mteb/leaderboard))

| Metric | SageLite-s | SageLite-l |
|-------------------------------|------------|------------|
| ArguAna | 57.75 | 60.706 |
| CQADupstackWordpressRetrieval | 32.42 | 38.625 |
| FiQA2018 | 34.85 | 46.729 |
| NFCorpus | 29.97 | 33.698 |
| QuoraRetrieval | 85.35 | 87.497 |
| SCIDOCS | 18.99 | 21.379 |
| SciFact | 68.43 | 69.050 |
| Touche2020 | 24.41 | 21.425 |
| TRECCOVID | 70.88 | 76.078 |
| FEVER | 71.72 | 73.644 |
| HotpotQA | 58.81 | 62.955 |
| NQ | 48.26 | 54.478 |
| DBPedia | 34.83 | 40.689 |
| ClimateFEVER | 25.69 | 26.198 |
| MSMARCO | 35.01 | 36.546 |
| Average | 46.49 | 49.980 |

### Training Data
This checkpoint is trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb).

Code data comes from the Stack (https://huggingface.co/datasets/bigcode/the-stack-dedup). Supported languages (15 in total) include: English (for text-only tasks), C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.

### Training procedure
This checkpoint is first trained on code data via masked language modeling (MLM), followed by two-stage contrastive learning: contrastive pre-finetuning on a large amount of positive pairs mined from the internet, and contrastive finetuning on a small amount of synthetic data.

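For intuition only, the sketch below shows a generic in-batch InfoNCE-style objective over positive pairs, which is the standard form of contrastive learning on mined pairs. The exact loss, temperature, and negative-sampling strategy used for SageLite are not documented here, and `info_nce_loss` and its arguments are illustrative names, not the actual training code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, positive_emb, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss over positive pairs.

    query_emb, positive_emb: (batch, dim) embeddings of the two sides of each
    positive pair; every other example in the batch serves as a negative.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = q @ p.T / temperature                       # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)    # i-th query matches i-th positive
    return F.cross_entropy(logits, labels)
```
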
### How to use
This checkpoint consists of an 80M-parameter encoder that extracts 768-dimensional code embeddings. It can be loaded with the `AutoModel` functionality and uses the [StarCoder tokenizer](https://arxiv.org/pdf/2305.06161.pdf).

```python
from transformers import AutoModel, AutoTokenizer

checkpoint = "SageLite/SageLite-s"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

# Note: SageLite requires adding an EOS token at the end of each tokenized sequence.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
embedding = model(inputs)[0]
```
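
As a usage sketch building on the snippet above, embeddings can be compared with cosine similarity for code-to-code retrieval. The mean pooling of token-level outputs is an assumption for illustration, not documented model behavior, and the `embed` helper is hypothetical.

```python
import torch.nn.functional as F

def embed(code_snippet):
    # Tokenize (EOS appended via add_eos_token=True) and run the encoder.
    ids = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
    hidden = model(ids)[0]
    # If the output is token-level (batch, seq_len, dim), mean-pool it into one
    # vector per snippet; this pooling choice is an assumption, not documented behavior.
    return hidden.mean(dim=1) if hidden.dim() == 3 else hidden

a = embed("def add(x, y):\n    return x + y")
b = embed("def sum_two(a, b):\n    return a + b")
print(F.cosine_similarity(a, b).item())  # higher score = more similar snippets
```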