SageLite is a new family of open embedding models with an encoder architecture that supports a wide range of code and text embedding tasks.

---

### **Training Data**
This checkpoint was trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). Supported languages are English, C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.
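As a side note not taken from the card: both corpora are hosted on the Hugging Face Hub and can be inspected without a full download by streaming. A minimal sketch, assuming the `datasets` library and that you have accepted each dataset's terms of use:

```python
from datasets import load_dataset

# Stream one record from Falcon-RefinedWeb; streaming avoids downloading
# the multi-terabyte corpus up front.
web = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
print(next(iter(web))["content"][:200])

# The-Stack-v2 is gated and stores file metadata (e.g. blob IDs) rather than
# raw file contents, so inspecting actual code requires an extra download step.
```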

---

### **Training Procedure**
This checkpoint was trained using the following three-stage procedure (the loss sketch after this list illustrates the contrastive objective used in stages 2 and 3):
1. **MLM Pretraining**: Masked language modeling on code data.
2. **Contrastive Pre-Finetuning**: Using large-scale positive pairs mined from web and GitHub data.
3. **Contrastive Fine-Tuning**: Using a small amount of synthetic data.
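The card does not spell out the contrastive objective. Below is a generic InfoNCE-style loss with in-batch negatives, a common choice for this kind of contrastive training of embedding models; it is a sketch of the technique, not SageLite's actual training code:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, pos_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: row i of pos_emb is the positive for
    row i of query_emb; every other row in the batch acts as a negative."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                     # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # matching index = positive pair
    return F.cross_entropy(logits, labels)

# Toy usage with random vectors of the model's embedding width (768).
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
```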

---

### **How to Use**
This checkpoint consists of an encoder (an 80M-parameter model) that extracts 768-dimensional code embeddings. It can be loaded with the Hugging Face Transformers library and uses the [Starcoder Tokenizer](https://arxiv.org/pdf/2305.06161.pdf).

```python
from transformers import AutoModel, AutoTokenizer

# Specify the checkpoint
checkpoint = "SageLite/SageLite-s"
device = "cuda"  # use "cpu" if a GPU is unavailable

# Load the tokenizer and model. trust_remote_code is required because the
# checkpoint ships its own modeling code; add_eos_token appends the
# end-of-sequence token the encoder expects.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

# Embed an example code snippet
code_snippet = "def print_hello_world():\tprint('Hello World!')"
inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
embedding = model(inputs)[0]  # extract the embedding
```
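For retrieval, embeddings are typically compared by cosine similarity. A minimal sketch continuing the snippet above; the mean-pooling fallback is an assumption for the case where the remote code returns per-token hidden states rather than a single pooled vector:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    ids = tokenizer.encode(text, return_tensors="pt").to(device)
    out = model(ids)[0]
    # Assumption: mean-pool over tokens if the output is 3-D (per-token states);
    # otherwise treat it as an already-pooled vector.
    return out.mean(dim=1) if out.dim() == 3 else out

a = embed("def add(x, y):\n    return x + y")
b = embed("def sum_two(a, b):\n    return a + b")
print(F.cosine_similarity(a, b).item())  # closer to 1.0 means more similar
```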

### **Code Retrieval Performance**

#### 1. Code2Code Search

| Dataset                       | SageLite-s | SageLite-l |
|-------------------------------|------------|------------|
| ArguAna                       | 57.75      | 60.71      |
| CQADupstackWordpressRetrieval | 32.42      | 38.63      |
| FiQA2018                      | 34.85      | 46.73      |
| NFCorpus                      | 29.97      | 33.70      |
| QuoraRetrieval                | 85.35      | 87.50      |
| SCIDOCS                       | 18.99      | 21.38      |
| SciFact                       | 68.43      | 69.05      |
| Touche2020                    | 24.41      | 21.43      |
| TRECCOVID                     | 70.88      | 76.08      |
| FEVER                         | 71.72      | 73.64      |
| HotpotQA                      | 58.81      | 62.96      |
| NQ                            | 48.26      | 54.48      |
| DBPedia                       | 34.83      | 40.69      |
| ClimateFEVER                  | 25.69      | 26.20      |
| MSMARCO                       | 35.01      | 36.55      |
| Average                       | 46.49      | 49.98      |
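These datasets are standard MTEB/BEIR text-retrieval benchmarks. As a hedged sketch of how such numbers are typically reproduced with the `mteb` harness (the `SageLiteEncoder` wrapper and its mean-pooling are assumptions for illustration, not part of the card):

```python
import torch
from mteb import MTEB

class SageLiteEncoder:
    """Minimal adapter exposing the encode() interface mteb expects."""

    def __init__(self, tokenizer, model, device):
        self.tokenizer, self.model, self.device = tokenizer, model, device

    @torch.no_grad()
    def encode(self, sentences, **kwargs):
        embeddings = []
        for text in sentences:
            ids = self.tokenizer.encode(text, return_tensors="pt", truncation=True).to(self.device)
            out = self.model(ids)[0]
            pooled = out.mean(dim=1) if out.dim() == 3 else out  # assumption: mean-pool token states
            embeddings.append(pooled.squeeze(0).cpu())
        return torch.stack(embeddings).numpy()

# Run a single retrieval task and write scores to disk.
MTEB(tasks=["ArguAna"]).run(SageLiteEncoder(tokenizer, model, device), output_folder="mteb_results")
```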

---