---
license: apache-2.0
---
# K2-Chat: a fully-reproducible large language model outperforming Llama 2 70B Chat using 35% less compute
K2 Chat is finetuned from [K2-65B](https://huggingface.co/LLM360/K2). The most recent model update was released on 10/31/24.

In this release, we introduce function calling features and target improvements across math, coding, and safety. 

We utilized the following datasets:

* [Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)
* [JiuZhang3.0-Corpus-SFT](https://huggingface.co/datasets/ToheartZhang/JiuZhang3.0-Corpus-SFT)
* [glaive-function-calling-v2-sharegpt](https://huggingface.co/datasets/hiyouga/glaive-function-calling-v2-sharegpt)

## Results

|                         | K2-Chat-060124 | K2-Chat |
|-------------------------|---------|----------|
| **Natural Language Benchmarks** |         |          |
| MMLU (0-shot)           | 63.5    | 69.14    |
| RACE (0-shot)           | 46.1    | 46.60    |
| HellaSwag (10-shot)     | 81.7    | 80.80    |
| PIQA (5-shot)           | 82.3    | 81.34    |
| ARC-easy (5-shot)       | 84.6    | 79.00    |
| ARC-challenge (25-shot) | 61.3    | 61.09    |
| OpenBookQA (5-shot)     | 48.0    | 47.00    |
| Winogrande (5-shot)     | 79.5    | 78.30    |
| TruthfulQA (0-shot)     | 44.7    | 57.32    |
| CrowS-Pairs (0-shot)    | 64.2    | 65.32    |
| GSM8K (5-shot)          | 60.7    | 77.10    |
| MathQA (5-shot)         | 44.8    | 43.12    |
| LogiQA2.0 (0-shot)      | 38.0    | 36.83    |
| BBH CoT (0-shot)        | 64.9    | 70.37    |
| **Code Benchmarks**     |         |          |
| HumanEval (pass@1)      | 47.9    | 71.20    |
| **Domain Specific (Medical)** |   |          |
| MedQA (0-shot)          | 53.6    | 52.87    |
| MedMCQA (5-shot)        | 51.3    | 50.71    |
| PubMedQA (0-shot)       | 75.0    | 71.20    |
| **Other**               |         |          |
| MT-Bench               | 6.87     | 7.55     |
| JSON-Mode-Eval          | 77.21   | 90.09    |
| **Overall Average Score**|         |          |
| Avg Score               | 58.88   | 61.30    |





## K2-Chat-060124 
K2 Chat is finetuned from [K2-65B](https://huggingface.co/LLM360/K2). K2 Chat outperforms Llama 2-70B-Chat on all evaluations conducted. The model also outperforms Llama 3-70B-Instruct on coding tasks.

<center><img src="k2_chat_eval_table.png" alt="k2 eval table" /></center>

## LLM360 Model Performance and Evaluation Collection
The LLM360 Performance and Evaluation Collection is a robust evaluation set consisting of general and domain-specific evaluations to assess model knowledge and function.

The evaluations cover standard best-practice benchmarks as well as medical, math, and coding knowledge. More about the evaluations can be found here.

<center><img src="k2_chat_table_of_tables.png" alt="k2 big eval table"/></center>

## Open LLM Leaderboard
| Evaluation      | Score      | Raw Score      |
| ----------- | ----------- | ----------- | 
| IFEval   | 51.52        | 52       |
| BBH   | 33.79        | 54       |
| Math Lvl 5   | 1.59        | 2       |
| GPQA   | 7.49        | 31       |
| MUSR   | 16.82        | 46       |
| MMLU-PRO   | 26.34        | 34       |
| Average   | 22.93        | 36.5       |
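
As a quick sanity check, the `Average` row above is consistent with a plain arithmetic mean of the six benchmark rows. The sketch below only re-derives that row from the numbers in the table; the variable names are illustrative and not part of any LLM360 or leaderboard tooling.

```python
# (score, raw score) pairs copied from the table above.
leaderboard = {
    "IFEval": (51.52, 52),
    "BBH": (33.79, 54),
    "Math Lvl 5": (1.59, 2),
    "GPQA": (7.49, 31),
    "MUSR": (16.82, 46),
    "MMLU-PRO": (26.34, 34),
}

# Unweighted mean of each column across the six benchmarks.
score_avg = sum(score for score, _ in leaderboard.values()) / len(leaderboard)
raw_avg = sum(raw for _, raw in leaderboard.values()) / len(leaderboard)

print(f"Score average:     {score_avg:.3f}")
print(f"Raw score average: {raw_avg:.1f}")
```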



## Datasets and Mix

| Subset      | #Tokens | Avg. #Q | Avg. Query Len | Avg. #R | Avg. Reply Len |
| ----------- | ----------- |----------- |----------- |----------- |----------- |
| [MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct) | 66,639,699 | 1.00 | 81.53 | 1.00 | 172.78 |
| [OpenHermes-2](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 404,820,694 | 1.01 | 152.38 | 1.01 | 249.12 |
| [FLAN_3M](https://arxiv.org/abs/2109.01652) | 2,346,961,387 | 1.00 | 727.49 | 1.00 | 54.83 |
| [Stanford Encyclopedia Philosophy](https://huggingface.co/datasets/AiresPucrs/stanford-encyclopedia-philosophy) | 786,928 | 1.00 | 219.09 | 1.00 | 166.28 |
| [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) | 1,448,898 | 1.00 | 260.82 | 1.00 | 207.47 |
| Safety & Alignment Data | 99,976,621 | 1.00 | 126.71 | 1.00 | 373.79 |
| Total | 2,920,634,227 | | | | |
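
To make the mix easier to read, the token counts above can be converted into each subset's share of the total. This is a minimal sketch that uses only the figures from the table; the dictionary and variable names are illustrative.

```python
# Token counts per subset, copied from the "#Tokens" column above.
token_counts = {
    "MathInstruct": 66_639_699,
    "OpenHermes-2": 404_820_694,
    "FLAN_3M": 2_346_961_387,
    "Stanford Encyclopedia Philosophy": 786_928,
    "TinyStories": 1_448_898,
    "Safety & Alignment Data": 99_976_621,
}

total_tokens = sum(token_counts.values())

# Fraction of the finetuning mix contributed by each subset.
shares = {name: count / total_tokens for name, count in token_counts.items()}

print(f"Total tokens: {total_tokens:,}")
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<35} {share:7.2%}")
```

The computed total matches the `Total` row, and FLAN_3M accounts for roughly 80% of the tokens in the mix.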

## Loading K2-Chat
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LLM360/K2-Chat")
model = AutoModelForCausalLM.from_pretrained("LLM360/K2-Chat")

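# Single-turn prompt written by hand with the model's special tokens.
prompt = '<|beginofuser|>what is the highest mountain on earth?<|beginofsystem|>'

# Tokenize the prompt and sample up to 128 new tokens.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_tokens = model.generate(input_ids, do_sample=True, max_new_tokens=128)

# The decoded text contains the prompt followed by the generated reply.
print("-" * 20 + "Output for model" + "-" * 20)
print(tokenizer.batch_decode(gen_tokens)[0])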
```
Alternatively, you can construct the prompt by applying the tokenizer's chat template to the input conversation:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LLM360/K2-Chat")
model = AutoModelForCausalLM.from_pretrained("LLM360/K2-Chat")

messages = [{"role": "user", "content": "what is the highest mountain on earth?"}]

input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
gen_tokens = model.generate(input_ids, do_sample=True, max_new_tokens=128)

print("-"*20 + "Output for model"  + 20 * '-')
print(tokenizer.batch_decode(gen_tokens)[0])
```
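
Both snippets above print the prompt together with the reply, because `generate` on a decoder-only model returns the prompt tokens followed by the new tokens. One way to keep only the reply is to slice off the prompt before decoding. The helper below is a hypothetical sketch, not part of `transformers` or of the K2-Chat release; it works on nested lists as well as on the tensors returned by `generate`.

```python
def new_tokens_only(generated, prompt_length):
    """Drop the first `prompt_length` tokens from every generated sequence."""
    return [sequence[prompt_length:] for sequence in generated]


# Toy example with plain lists: the first three ids stand in for the prompt.
print(new_tokens_only([[11, 12, 13, 21, 22]], prompt_length=3))

# With the objects from the snippets above, usage would look like:
#   reply_ids = new_tokens_only(gen_tokens, input_ids.shape[-1])
#   print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0])
```

Passing `skip_special_tokens=True` to `batch_decode` additionally removes special tokens from the decoded text.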
## LLM360 Developer Suite
We provide step-by-step finetuning tutorials for tech enthusiasts, AI practitioners and academic or industry researchers [here](https://www.llm360.ai/developer.html).

## About LLM360
LLM360 is an open research lab enabling community-owned AGI through open-source large model research and development.

LLM360 enables community-owned AGI by creating standards and tools to advance the bleeding edge of LLM capability and empower knowledge transfer, research, and development.

We believe in a future where artificial general intelligence (AGI) is created by the community, for the community. Through an open ecosystem of equitable computational resources, high quality data, and flowing technical knowledge, we can ensure ethical AGI development and universal access for all innovators.

[Visit us](https://www.llm360.ai/)

## Citation

**BibTeX:**

```bibtex
@article{
      title={LLM360 K2-65B: Scaling Up Fully Transparent Open-Source LLMs}, 
      author={The LLM360 Team},
      year={2024},
}
```