Transformers
PyTorch
code
English
custom_code
Dejiao Z committed
Commit fcf2699 · 1 Parent(s): f66a4c2

updated readme

Files changed (1)
  1. README.md +96 -3
README.md CHANGED
@@ -1,3 +1,96 @@
- ---
- license: apache-2.0
- ---
---
license: apache-2.0
datasets:
- bigcode/the-stack-v2
- tiiuae/falcon-refinedweb
library_name: transformers
language:
- code
---

## SageLite-s

### Model description
SageLite is a new family of open embedding models with an encoder architecture that supports a wide range of code and text tasks. SageLite was trained in three stages: (1) standard MLM pretraining on mixed code and text data ([The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)); (2) contrastive pre-finetuning on a large amount of positive pairs mined from web data and GitHub; and (3) contrastive finetuning on a small amount of synthetic data.
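
To make stage (1) concrete, below is a minimal, generic sketch of the standard MLM corruption scheme (mask roughly 15% of tokens; of those, 80% become the mask token, 10% a random token, 10% stay unchanged). These ratios are the common defaults, not values confirmed for SageLite.

```python
import torch

def mask_for_mlm(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Generic MLM corruption (illustrative; the ratios are assumptions, not SageLite's exact recipe)."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Sample ~mlm_prob of the positions as prediction targets; ignore the rest in the loss.
    target = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~target] = -100

    # 80% of the targets are replaced with the mask token.
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & target
    input_ids[masked] = mask_token_id

    # Half of the remaining targets (10% overall) get a random token; the rest stay unchanged.
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & target & ~masked
    input_ids[randomized] = torch.randint(vocab_size, labels.shape)[randomized]

    return input_ids, labels
```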

### Code Retrieval Performance

##### 1. Code2Code Search
| Model Name          | # Params | Embd Dim | Python | Java  | JS    | TS    | C#    | C     | Ruby  | PHP   | Go    | Avg   |
|---------------------|----------|----------|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| OpenAI-Code-01      | NA       | 3072     | 21.92  | 8.90  | 4.90  | 5.70  | 3.15  | 11.58 | 26.25 | 16.60 | 9.40  | 12.04 |
| OpenAI-Text-3-Small | NA       | 1536     | 25.18  | 12.61 | 8.00  | 9.44  | 5.46  | 15.86 | 30.70 | 23.33 | 11.20 | 15.57 |
| OpenAI-Text-3-Large | NA       | 3072     | 40.57  | 25.33 | 20.09 | 22.00 | 11.84 | 31.90 | 42.54 | 41.84 | 21.75 | 28.65 |
| CodeSage-v2-Small   | 130M     | 1024     | 45.60  | 33.65 | 39.96 | 47.78 | 19.19 | 30.55 | 40.12 | 55.39 | 30.96 | 38.13 |
| CodeSage-v2-Base    | 356M     | 1024     | 55.86  | 42.89 | 45.29 | 54.58 | 23.90 | 38.52 | 56.02 | 64.56 | 42.88 | 47.17 |
| CodeSage-v2-Large   | 1.3B     | 2048     | 61.11  | 47.09 | 51.18 | 60.67 | 28.04 | 43.40 | 60.74 | 67.87 | 43.86 | 51.55 |
| SageLite-s          | 80M      | 768      | 47.93  | 30.83 | 35.15 | 37.64 | 18.14 | 30.53 | 42.89 | 50.70 | 21.69 | 35.06 |
| SageLite-l          | 850M     | 1536     | 64.46  | 45.53 | 50.80 | 54.71 | 30.66 | 47.46 | 61.01 | 68.68 | 39.25 | 51.40 |

##### 2. NL2Code Search
| Model Name          | # Params | CoSQA | AdvTest | Python | Java  | JS    | PHP   | Go    | Ruby  | Avg   |
|---------------------|----------|-------|---------|--------|-------|-------|-------|-------|-------|-------|
| OpenAI-Code-01      | NA       | 52.20 | 36.03   | 63.13  | 67.85 | 62.30 | 57.47 | 85.22 | 69.28 | 61.69 |
| OpenAI-Text-3-Small | NA       | 52.48 | 34.10   | 62.62  | 65.87 | 60.28 | 54.85 | 81.96 | 67.57 | 59.97 |
| OpenAI-Text-3-Large | NA       | 55.21 | 46.83   | 70.81  | 72.89 | 68.12 | 59.58 | 87.60 | 75.22 | 67.03 |
| CodeSage-v2-Small   | 130M     | 52.39 | 47.28   | 68.79  | 68.13 | 65.77 | 60.20 | 80.26 | 72.46 | 64.41 |
| CodeSage-v2-Base    | 356M     | 50.74 | 52.00   | 70.46  | 70.89 | 69.61 | 62.81 | 82.37 | 73.71 | 66.57 |
| CodeSage-v2-Large   | 1.3B     | 53.18 | 56.31   | 74.18  | 72.33 | 72.49 | 65.26 | 84.67 | 76.61 | 69.38 |
| SageLite-s          | 80M      | 56.49 | 42.32   | 67.59  | 66.62 | 62.32 | 58.87 | 79.36 | 70.75 | 63.04 |
| SageLite-l          | 850M     | 59.76 | 55.55   | 74.25  | 71.76 | 69.35 | 61.62 | 84.09 | 77.14 | 69.19 |

### Text Retrieval Performance ([MTEB Retrieval](https://huggingface.co/spaces/mteb/leaderboard))

| Metric                        | SageLite-s | SageLite-l |
|-------------------------------|------------|------------|
| ArguAna                       | 57.75      | 60.706     |
| CQADupstackWordpressRetrieval | 32.42      | 38.625     |
| FiQA2018                      | 34.85      | 46.729     |
| NFCorpus                      | 29.97      | 33.698     |
| QuoraRetrieval                | 85.35      | 87.497     |
| SCIDOCS                       | 18.99      | 21.379     |
| SciFact                       | 68.43      | 69.050     |
| Touche2020                    | 24.41      | 21.425     |
| TRECCOVID                     | 70.88      | 76.078     |
| FEVER                         | 71.72      | 73.644     |
| HotpotQA                      | 58.81      | 62.955     |
| NQ                            | 48.26      | 54.478     |
| DBPedia                       | 34.83      | 40.689     |
| ClimateFEVER                  | 25.69      | 26.198     |
| MSMARCO                       | 35.01      | 36.546     |
| Average                       | 46.49      | 49.980     |

### Training Data
This checkpoint is trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb).

The Stack data (https://huggingface.co/datasets/bigcode/the-stack-dedup) supported languages (15 in total) are as follows: English (for the text-only task), C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, Ruby.

### Training procedure
This checkpoint is first trained on code data via masked language modeling (MLM), followed by two-stage contrastive learning: contrastive pre-finetuning on a large amount of positive pairs mined from the internet, and contrastive finetuning on a small amount of synthetic data.
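
As a rough illustration of the contrastive stages, a typical objective over mined positive pairs is an InfoNCE-style loss with in-batch negatives. The sketch below is illustrative only; the temperature value and other details are assumptions, not the exact SageLite recipe.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, positive_emb, temperature=0.05):
    """Contrastive loss over positive pairs with in-batch negatives:
    row i of positive_emb is the positive for query i; all other rows serve as negatives."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = q @ p.T / temperature                      # (B, B) cosine-similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```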

### How to use
This checkpoint consists of an 80M-parameter encoder that extracts 768-dimensional code embeddings. It can be easily loaded via the `AutoModel` functionality and uses the [StarCoder tokenizer](https://arxiv.org/pdf/2305.06161.pdf).

```python
from transformers import AutoModel, AutoTokenizer

checkpoint = "SageLite/SageLite-s"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

# Note: SageLite requires adding an EOS token at the end of each tokenized sequence.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
embedding = model(inputs)[0]
```
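
Building on the snippet above, the extracted embeddings can be compared directly, e.g. for code-to-code search. The following is a hypothetical usage sketch that reuses `tokenizer`, `model`, and `device` from the block above and assumes `model(inputs)[0]` yields one pooled embedding per input, as the snippet suggests; the example snippets and the helper name `embed` are illustrative.

```python
import torch.nn.functional as F

def embed(code: str):
    # Tokenize (EOS is appended because the tokenizer was loaded with add_eos_token=True)
    # and return the embedding produced by the model.
    ids = tokenizer.encode(code, return_tensors="pt").to(device)
    return model(ids)[0]

emb_a = embed("def add(a, b):\n    return a + b")
emb_b = embed("def sum_two(x, y):\n    return x + y")
print(f"cosine similarity: {F.cosine_similarity(emb_a, emb_b).item():.3f}")
```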