textgeflecht committed
Commit cf26e9a · verified · 1 Parent(s): a735d39

Update README.md

Files changed (1): README.md (+196, -3)

---
license: apache-2.0
tags:
- qwen
- qwen2
- fp8
- quantization
- llm-compressor
- vllm
- code-generation
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-Coder-32B-Instruct
---

# Qwen2.5-Coder-32B-Instruct-FP8-dynamic

This is a version of [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) quantized to FP8 (weights and dynamic activations) using [llm-compressor](https://github.com/vllm-project/llm-compressor).

This model format is particularly useful for accelerated inference with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell or newer).
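
If you want to confirm that a local GPU meets that requirement, a short check along these lines can be used (an illustrative sketch, not part of the original card):

```python
import torch

# Print each visible GPU's compute capability and whether it reaches the
# >= 8.9 threshold needed for hardware FP8 support.
for idx in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(idx)
    name = torch.cuda.get_device_name(idx)
    fp8_capable = (major, minor) >= (8, 9)
    print(f"GPU {idx}: {name} (compute capability {major}.{minor}) -> FP8 capable: {fp8_capable}")
```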

## Model Description

Qwen2.5-Coder-32B-Instruct is a state-of-the-art large language model from Alibaba Cloud, specialized for coding tasks. This version has been quantized to FP8 precision for weights (static, per-channel) and activations (dynamic, per-token), with the `lm_head` layer kept in its original precision to maintain output quality.
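
To verify how the checkpoint was quantized without downloading the full weights, you can inspect the `quantization_config` section that llm-compressor writes into `config.json`. The snippet below is a small sketch; the exact fields inside `quantization_config` depend on the llm-compressor / compressed-tensors version:

```python
import json
from huggingface_hub import hf_hub_download

# Fetch only the config of the quantized repo and print its quantization
# section, which records the FP8 scheme and any layers left unquantized.
config_path = hf_hub_download(
    repo_id="textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    filename="config.json",
)
with open(config_path) as f:
    config = json.load(f)

print(json.dumps(config.get("quantization_config", {}), indent=2))
```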

## Quantization with llm-compressor

The model was quantized using the `oneshot` method from `llm-compressor` with the `FP8_DYNAMIC` scheme.
No calibration dataset was required for this quantization scheme.

The following script was used for conversion:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# --- 1. Set the Model ID ---
MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"

# --- 2. Load model and tokenizer using Auto classes ---
print(f"Loading model: {MODEL_ID}...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

# --- 3. Define the FP8 quantization recipe ---
print("Configuring FP8 quantization recipe...")
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# Apply quantization. This step can take some time.
print("Applying one-shot quantization...")
oneshot(model=model, recipe=recipe, tokenizer=tokenizer)
print("Quantization complete.")

# --- 4. Confirm generation with the Qwen chat template ---
print("\n========== SAMPLE GENERATION ==============")
prompt = "Write a Python function for a quicksort algorithm. Include comments to explain the logic."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
    {"role": "user", "content": prompt}
]

input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

output_ids = model.generate(
    **model_inputs,
    max_new_tokens=256,
)

input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print(f"Generated Response:\n{response}")
print("==========================================")

# --- 5. Save the quantized model and the tokenizer ---
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
print(f"\nSaving quantized model to {SAVE_DIR}...")
model.save_pretrained(SAVE_DIR)

print(f"Saving tokenizer to {SAVE_DIR}...")
tokenizer.save_pretrained(SAVE_DIR)

print(f"\nModel and tokenizer saved successfully to '{SAVE_DIR}'")
```
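
To publish the saved checkpoint to the Hugging Face Hub afterwards (as was done for this repository), an upload along these lines can be used. This is a sketch: the `repo_id` is a placeholder for your own namespace, and it assumes you are already authenticated (e.g., via `huggingface-cli login`):

```python
from huggingface_hub import HfApi

api = HfApi()

# Placeholder repo id -- replace with your own namespace.
repo_id = "your-username/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"

# Create the repo if needed, then upload the folder written by the script above.
api.create_repo(repo_id, exist_ok=True)
api.upload_folder(
    folder_path="Qwen2.5-Coder-32B-Instruct-FP8-Dynamic",
    repo_id=repo_id,
)
```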

## Inference Example

This model can be loaded and run with `transformers`, or, for optimized FP8 inference, with [vLLM](https://github.com/vllm-project/vllm/).

### Using `transformers` (for functional checking, not FP8 optimized)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_REPO_ID = "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"

# For Qwen models, it is recommended to use trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_REPO_ID,
    trust_remote_code=True
)

prompt = "Write a complete and efficient implementation of the merge sort algorithm in Rust."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing high-quality Rust code."},
    {"role": "user", "content": prompt}
]

# Apply the chat template to format the prompt correctly
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize the input and move to the device
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

# Generate output
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)

# Decode only the newly generated tokens
input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("--- Prompt ---")
print(prompt)
print("\n--- Qwen Response ---")
print(response)
```

### Using vLLM (for optimized FP8 inference)

This model, quantized to FP8 with llm-compressor, is designed for efficient inference with vLLM on newer NVIDIA GPUs.

Prerequisites:
- A recent version of vLLM that supports compressed-tensors.
- A compatible NVIDIA GPU (Ada Lovelace, Hopper, Blackwell, or newer).
- Docker and NVIDIA Container Toolkit installed.

Running with Docker (Recommended):
The following command starts a vLLM OpenAI-compatible server with this quantized model:

```bash
# 1. Set your Hugging Face Token (optional, but recommended)
# export HF_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"

# 2. Run the vLLM Docker container.
#    Replace 'vllm/vllm-openai:latest' with a recent official build.
sudo docker run --gpus all \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    -p 8000:8000 \
    -e HF_TOKEN="$HF_TOKEN" \
    vllm/vllm-openai:latest \
    --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic \
    --tokenizer-mode auto \
    --load-format auto \
    --trust-remote-code \
    --max-model-len 4096  # Optional: adjust based on your VRAM
```

Once running, the server exposes an OpenAI-compatible API at `http://localhost:8000/v1/`. You can use any OpenAI client library (e.g., `openai` for Python) or `curl` to send requests.
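
A minimal chat completion request with the official `openai` Python client (version 1.x) could look like the following sketch; the API key is a placeholder, since the local server does not check it unless it was started with `--api-key`:

```python
from openai import OpenAI

# Point the client at the local vLLM server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    messages=[
        {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
        {"role": "user", "content": "Write a Python function for a quicksort algorithm."},
    ],
    max_tokens=512,
    temperature=0.6,
)
print(response.choices[0].message.content)
```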
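
Alternatively, if no server is needed, the same checkpoint can be loaded through vLLM's offline Python API. The following is a minimal sketch assuming a local vLLM installation (`pip install vllm`); the prompt and sampling values are illustrative:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL_REPO_ID = "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"

# Load the FP8 checkpoint directly (no server); adjust max_model_len to your VRAM.
llm = LLM(model=MODEL_REPO_ID, max_model_len=4096, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO_ID, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
    {"role": "user", "content": "Write a Python function for a quicksort algorithm."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=512)
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```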

## Original Model Card (Qwen/Qwen2.5-Coder-32B-Instruct)

For more details on the base model, its capabilities, and licensing, please refer to the original model card: https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct