---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE
language:
- en
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- chat
- neuralmagic
- llmcompressor
- fp8
---

# Qwen2.5-7B-Instruct-FP8-dynamic

## Model Overview
- **Model Architecture:** Qwen2
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** FP8
  - **Weight quantization:** FP8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similarly to [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 11/27/2024
- **Version:** 1.0
- **License(s):** [apache-2.0](https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE)
- **Model Developers:** Neural Magic

### Model Optimizations

This model was obtained by quantizing the weights of [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.

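
As a rough illustration, the snippet below sketches how such a checkpoint can be produced with llm-compressor's data-free `FP8_DYNAMIC` scheme (static per-channel weight scales, dynamic per-token activation scales). This is a minimal sketch following the llm-compressor quickstart, not the exact recipe used to build this model; leaving `lm_head` unquantized is an assumption consistent with the scheme described above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8_DYNAMIC: static per-channel FP8 weights, dynamic per-token FP8 activations,
# applied to the linear layers; the lm_head stays in the original precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# One-shot, data-free quantization (this scheme needs no calibration data).
oneshot(model=model, recipe=recipe)

save_dir = "Qwen2.5-7B-Instruct-FP8-dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```
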
## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen2.5-7B-Instruct-FP8-dynamic"
number_gpus = 1
max_model_len = 8192

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."},
]

# Render the chat template and append the assistant header so the model generates a reply.
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

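For example, an OpenAI-compatible server can be launched with `vllm serve RedHatAI/Qwen2.5-7B-Instruct-FP8-dynamic` and queried with the standard OpenAI Python client. The endpoint URL and placeholder API key below are vLLM's local defaults and are illustrative only:

```python
from openai import OpenAI

# Assumes a local server started with:
#   vllm serve RedHatAI/Qwen2.5-7B-Instruct-FP8-dynamic
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Qwen2.5-7B-Instruct-FP8-dynamic",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```
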
## Evaluation

The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/387bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 387bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Qwen2.5-7B-Instruct-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=4096,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks openllm \
  --batch_size auto
```

### Accuracy

#### Open LLM Leaderboard evaluation scores

| Benchmark | Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct-FP8-dynamic<br>(this model) | Recovery |
| :-- | :--: | :--: | :--: |
| MMLU (5-shot) | 74.24 | 74.04 | 99.7% |
| ARC Challenge (25-shot) | 63.40 | 63.14 | 99.6% |
| GSM-8K (5-shot, strict-match) | 80.36 | 80.06 | 99.6% |
| Hellaswag (10-shot) | 81.52 | 81.11 | 99.5% |
| Winogrande (5-shot) | 74.66 | 74.43 | 99.7% |
| TruthfulQA (0-shot, mc2) | 64.76 | 64.87 | 100.2% |
| **Average** | **73.16** | **72.94** | **99.7%** |
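
Recovery is the quantized model's score divided by the unquantized baseline's score on the same benchmark. A minimal illustration for the MMLU row:

```python
# Recovery = quantized score / baseline score, shown here for the MMLU row.
recovery = 74.04 / 74.24
print(f"{recovery:.1%}")  # -> 99.7%
```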