Update README.md
---
license: apache-2.0
---
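
## CodeShell

CodeShell is a large-scale pre-trained code language foundation model jointly developed by the [Knowledge Computing Lab at Peking University](http://se.pku.edu.cn/kcl/) and 蚌壳智能科技.

Key features of CodeShell:

* Strong performance: a 7B-parameter code foundation model that outperforms the strongest base models of comparable size, such as CodeLlama-7B
* Efficient training: built on an efficient data governance pipeline and trained from a cold start on 500B tokens of high-quality data
* Complete ecosystem: the full technology stack, from the model to the IDE plugins, is open source
* Lightweight and fast: supports local C++ deployment as a lightweight on-device solution
* Comprehensive evaluation: a multi-task code evaluation framework that supports full project context (to be open-sourced soon)

The models and tools in this release are:

- CodeShell Base
- CodeShell Chat
- CodeShell Chat 4bit
- C/C++ local deployment tool
- VS Code plugin
- JetBrains plugin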

## Model Use

### Code Generation

CodeShell is released in Hugging Face format, so developers can load and run it with the following code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True lets transformers run the custom code shipped with the model repository.
tokenizer = AutoTokenizer.from_pretrained("codeshell", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("codeshell", trust_remote_code=True).cuda()

# The tokenizer returns a BatchEncoding, which is moved with .to() rather than .cuda().
inputs = tokenizer('def print_hello_world():', return_tensors='pt').to(model.device)
# Unpack the encoding so generate() receives input_ids and attention_mask as keyword arguments.
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))
```

### Fill in the Middle

CodeShell supports Fill-in-the-Middle (FIM) mode, in which the model generates the code between a given prefix and suffix, to better support the software development process.

```python
# Reuses `tokenizer` and `model` from the Code Generation example.
input_text = "<fim_prefix>def print_hello_world():\n    <fim_suffix>\n    print('Hello world!')<fim_middle>"
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))
```
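
The FIM prompt is plain string concatenation around three sentinel tokens, so it can be assembled programmatically. The sketch below uses only the tokens shown in the example; the helper names `build_fim_prompt` and `extract_middle` are hypothetical and not part of CodeShell, and the assumption that the decoded output repeats the prompt before the generated middle is not confirmed by this README.

```python
# Sentinel tokens as they appear in the FIM example.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the code before and after the gap in the FIM sentinel tokens."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def extract_middle(decoded: str) -> str:
    """Return the text after the first <fim_middle> marker.

    Assumes the decoded output contains the prompt followed by the
    generated middle; returns an empty string if the marker is absent.
    """
    _, _, middle = decoded.partition(FIM_MIDDLE)
    return middle


# Hypothetical helper usage: a prompt of the same shape as the example.
prompt = build_fim_prompt(
    prefix="def print_hello_world():\n    ",
    suffix="\n    print('Hello world!')",
)
print(prompt)
```

The resulting `prompt` can be passed to the tokenizer in place of `input_text`. The decoded output may also contain special tokens such as an end-of-text marker, which would need to be stripped separately.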

## Model Quantization

CodeShell supports 4-bit and 8-bit quantization. After 4-bit quantization, the model occupies about 6 GB of GPU memory.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codeshell", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("codeshell", trust_remote_code=True)
# Quantize the weights to 4 bits, then move the model to the GPU.
model = model.quantize(4).cuda()

inputs = tokenizer('def print_hello_world():', return_tensors='pt').to(model.device)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))
```
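
As a rough sanity check on the memory figure, the space taken by the weights alone can be estimated from the parameter count and the bit width. This is a back-of-the-envelope sketch, not a measurement: it assumes "7B" means a round 7 billion parameters, and the function name `weight_memory_gib` is made up for illustration.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


# Assumption: "7B" taken as a round 7 billion parameters.
NUM_PARAMS = 7e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

The weights-only estimate for 4 bits comes out below the roughly 6 GB quoted above. The difference would plausibly come from activations, the KV cache, quantization metadata, any layers kept at higher precision, and framework overhead, although this README does not break the figure down.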

## CodeShell IDE Plugin

### Web API

CodeShell provides a Web API deployment tool that serves as the API backend for the IDE plugins.

```shell
git clone git@github.com:WisdomShell/codeshell.git
cd codeshell
python api.py
```

CodeShell also provides C/C++ inference support, so it runs efficiently on a personal PC without a GPU. Developers can compile it for their local environment; see the [C/C++ local deployment tool]() for details. After compiling, start the Web API service with the following command.

```shell
./server -m codeshell.gguf
```

Once the service is deployed, developers can run model inference through the Web API:

```shell
curl --location 'http://127.0.0.1:8080/completion' \
  --header 'Content-Type: application/json' \
  --data '{"messages": {"content": "Write hello world in Python"}, "temperature": 0.2, "stream": true}'
```
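
The same request can be issued from Python with the standard library. The sketch below mirrors the endpoint and JSON fields of the curl command; the function names are hypothetical, and because this README does not document the response format, `complete` simply returns the raw response body. `stream` defaults to `False` here as a simplification, which assumes the server accepts that value.

```python
import json
import urllib.request

# Endpoint taken from the curl example.
API_URL = "http://127.0.0.1:8080/completion"


def build_completion_request(prompt: str, temperature: float = 0.2,
                             stream: bool = False) -> urllib.request.Request:
    """Build a POST request with the same JSON fields as the curl example."""
    payload = {
        "messages": {"content": prompt},
        "temperature": temperature,
        "stream": stream,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request and return the raw response body.

    Requires the Web API service to be running locally.
    """
    with urllib.request.urlopen(build_completion_request(prompt)) as response:
        return response.read().decode("utf-8")


# Building the request needs no running server, so it can be inspected offline.
request = build_completion_request("Write hello world in Python")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `complete("Write hello world in Python")` sends the request once the server is up.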

### VS Code Plugin

CodeShell provides a [VS Code plugin]() that lets developers perform code completion, code Q&A, and similar tasks. The VS Code plugin is also open source; plugin-related questions are welcome in the [VS Code plugin repository]().

## Model Details

- Model architecture
  - Architecture: GPT-2
  - Attention: Grouped-Query Attention with Flash Attention 2
  - Position embedding: Rotary Position Embedding (RoFormer: Enhanced Transformer with Rotary Position Embedding)
  - Precision: bfloat16
- Hyperparameters
  - n_layer: 42
  - n_embd: 4096
  - n_inner: 16384
  - n_head: 32
  - num_query_groups: 8
  - seq-length: 8192
  - vocab_size: 70144

CodeShell uses GPT-2 as its base architecture and adds techniques such as Grouped-Query Attention and RoPE relative position encoding.
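
The hyperparameters above are enough for a rough parameter count, which shows where the "7B" label comes from. The sketch below is an estimate under stated assumptions: a GPT-2 style two-matrix MLP, one shared embedding matrix for input and output, and no biases or layer-norm weights. None of these details are confirmed by this README, and RoPE contributes no learned parameters.

```python
# Hyperparameters copied from the list above.
N_LAYER = 42
N_EMBD = 4096
N_INNER = 16384
N_HEAD = 32
NUM_QUERY_GROUPS = 8
VOCAB_SIZE = 70144


def estimate_parameters() -> int:
    """Approximate weight count, ignoring biases and layer norms."""
    head_dim = N_EMBD // N_HEAD            # width of one attention head
    kv_dim = NUM_QUERY_GROUPS * head_dim   # K/V width is reduced by grouped-query attention

    # Query and output projections are full width; key and value use kv_dim.
    attention = 2 * N_EMBD * N_EMBD + 2 * N_EMBD * kv_dim
    # Assumption: GPT-2 style MLP with one up and one down projection.
    mlp = 2 * N_EMBD * N_INNER
    # Assumption: input and output embeddings share a single matrix.
    embedding = VOCAB_SIZE * N_EMBD

    return N_LAYER * (attention + mlp) + embedding


total = estimate_parameters()
print(f"~{total / 1e9:.2f}B parameters")
```

Under these assumptions the estimate lands a little above 7 billion. With grouped-query attention the key and value projections are a quarter of the width of the query projection (8 groups against 32 heads), so most of each layer's weights sit in the MLP.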

## Evaluation

We evaluated the model on two of the most popular code evaluation benchmarks. Compared with CodeLlama and StarCoder, two state-of-the-art 7B code models, CodeShell achieved the best score on every task. The detailed results are listed below.

### Pass@1

| Task | codeshell-7B | codellama-7B | starcoderbase-7B |
| ------- | --------- | --------- | --------- |
| humaneval | **33.48** | 29.44 | 27.80 |
| mbpp | **39.08** | 37.60 | 34.16 |
| multiple-java | **29.56** | 29.24 | 24.30 |
| multiple-js | **33.60** | 31.30 | 27.02 |
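
This README does not state how the Pass@1 scores were computed or how many samples were drawn per problem. For reference, the sketch below implements the unbiased pass@k estimator commonly used with HumanEval, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c the number that pass the tests. The function names and the sample data are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up (n, c) pairs for three problems, purely for illustration.
results = [(10, 3), (10, 0), (10, 10)]
print(f"pass@1 = {100 * benchmark_pass_at_k(results):.2f}")
```

For k = 1 the estimator reduces to c / n per problem, so the benchmark score is the mean fraction of passing samples.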

## License

The models released in this repository are licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0) and are fully open for academic research. For commercial use, developers must apply by email and obtain written authorization first. Contact: [[email protected]](mailto:[email protected])