## Introduction

InternLM2.5 has open-sourced a 1.8-billion-parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:

- **Outstanding reasoning capability**: State-of-the-art performance on math reasoning, surpassing models like MiniCPM-2 and Qwen2-1.5B.

- **Stronger tool use**: InternLM2.5 supports gathering information from more than 100 web pages; the corresponding implementation has been released in [MindSearch](https://github.com/InternLM/MindSearch). InternLM2.5 also has stronger tool-related capabilities in instruction following, tool selection, and reflection. See [examples](https://github.com/InternLM/InternLM/blob/main/agent/lagent.md).
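For readers who want to try the chat model, the message format can be sketched as follows. This is an illustrative sketch, not part of this model card: it assumes the ChatML-style template (`<|im_start|>` / `<|im_end|>` tokens) used by the InternLM2 family. In practice, prefer `tokenizer.apply_chat_template` from Hugging Face `transformers`, which applies the template bundled with the model.

```python
# Sketch of a ChatML-style chat prompt, as assumed for InternLM2-family models.
# The token names are an assumption; the model's own chat template is authoritative.
def build_chat_prompt(messages):
    """Render a list of {role, content} dicts into a ChatML-style prompt string."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # cue the model to produce the reply
    return "\n".join(parts)

prompt = build_chat_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
])
```

The resulting string would then be tokenized and passed to the model's `generate` call; with `transformers`, `tokenizer.apply_chat_template(messages, add_generation_prompt=True)` performs the same role.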
## InternLM2.5-1.8B-Chat

### Performance Evaluation
We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/), covering five capability dimensions: disciplinary knowledge, language, knowledge, reasoning, and comprehension. Some of the results are shown below; visit the [OpenCompass leaderboard](https://rank.opencompass.org.cn) for more evaluation results.

| Benchmark          | InternLM2.5-1.8B-Chat | MiniCPM-2 | Qwen2-1.5B-Instruct |
| ------------------ | --------------------- | --------- | ------------------- |
| MMLU (5-shot)      | 50.7                  | 54.2      | 55.7                |
| CMMLU (5-shot)     | 62.2                  | 50.6      | 65.2                |
| BBH (3-shot CoT)   | **41.9**              | 41.5      | 36.5                |
| MATH (0-shot CoT)  | **40.2**              | 15.5      | 21.4                |
| HumanEval          | 43.3                  | 50.0      | 47.6                |
| GPQA (0-shot)      | **27.8**              | 23.7      | 27.3                |

- The results above were obtained with [OpenCompass](https://github.com/internLM/OpenCompass/); the evaluation configuration can be found in the configuration files it provides.
- Scores may vary across versions of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to its latest evaluation results.
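The shot counts in the table headers refer to how the benchmark prompts are assembled. The following sketch (not OpenCompass code; the helper names are hypothetical) illustrates the difference between a 5-shot prompt and a 0-shot chain-of-thought prompt:

```python
# Hypothetical illustration of benchmark prompting styles.
# "k-shot": k worked examples are prepended before the target question.
# "0-shot CoT": no examples; an instruction elicits step-by-step reasoning.
def few_shot_prompt(examples, question):
    """Concatenate k worked Q/A examples before the target question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")  # leave the final answer for the model
    return "\n\n".join(parts)

def zero_shot_cot_prompt(question):
    """No examples; a cue phrase asks the model to reason step by step."""
    return f"Q: {question}\nA: Let's think step by step."

demo = [("What is 2 + 2?", "4")] * 5
five_shot = few_shot_prompt(demo, "What is 3 + 5?")
cot = zero_shot_cot_prompt("What is 3 + 5?")
```

The exact prompt wording used for each benchmark is defined in the OpenCompass dataset configs, which is why scores can shift between versions.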
## Introduction

InternLM2.5, the 2.5th generation of the InternLM (书生·浦语) series of large models, has open-sourced a 1.8-billion-parameter base model and a chat model (InternLM2.5-1.8B-Chat) tailored for practical scenarios. The model has the following characteristics:

- Outstanding reasoning performance: the best math-reasoning accuracy among models of the same size, surpassing MiniCPM-2 and Qwen2-1.5B.

- Comprehensively upgraded tool use: InternLM2.5 can gather useful information from more than 100 web pages for analysis and reasoning; the corresponding implementation has been open-sourced in [MindSearch](https://github.com/InternLM/MindSearch). InternLM2.5 has stronger and more generalizable instruction understanding, tool selection, and result reflection, so the new model can more reliably support the construction of complex agents and make effective multi-turn tool calls to complete relatively complex tasks. See more [examples](https://github.com/InternLM/InternLM/blob/main/agent/lagent.md).
## InternLM2.5-1.8B-Chat

### Performance Evaluation
We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/), covering five capability dimensions: disciplinary knowledge, language, knowledge, reasoning, and comprehension. Some of the results are shown in the table below; visit the [OpenCompass leaderboard](https://rank.opencompass.org.cn) for more evaluation results.

| Benchmark          | InternLM2.5-1.8B-Chat | MiniCPM-2 | Qwen2-1.5B-Instruct |
| ------------------ | --------------------- | --------- | ------------------- |
| MMLU (5-shot)      | 50.7                  | 54.2      | 55.7                |
| CMMLU (5-shot)     | 62.2                  | 50.6      | 65.2                |
| BBH (3-shot CoT)   | **41.9**              | 41.5      | 36.5                |
| MATH (0-shot CoT)  | **40.2**              | 15.5      | 21.4                |
| HumanEval          | 43.3                  | 50.0      | 47.6                |
| GPQA (0-shot)      | **27.8**              | 23.7      | 27.3                |

- The results above were obtained with [OpenCompass](https://github.com/internLM/OpenCompass/); see the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/) for evaluation details.
- Scores may vary across versions of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to its latest evaluation results.
**Limitations:** Although we paid close attention to the safety of the model during training and tried to encourage it to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and its probabilistic generation paradigm. For example, responses may contain bias, discrimination, or other harmful content. Please do not spread such content. This project assumes no responsibility for any consequences caused by the dissemination of harmful information.