---
license: apache-2.0
tags:
- qwen
- qwen2
- fp8
- quantization
- llm-compressor
- vllm
- code-generation
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-Coder-32B-Instruct
---

# Qwen2.5-Coder-32B-Instruct-FP8-dynamic

This is a version of [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) quantized to FP8 (static per-channel weights and dynamic per-token activations) using [llm-compressor](https://github.com/vllm-project/llm-compressor).

This format is particularly useful for accelerated inference with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell, or newer).

## Model Description

Qwen2.5-Coder-32B-Instruct is a state-of-the-art large language model from Alibaba Cloud, specialized for coding tasks. This version has been quantized to FP8 precision for weights (static, per-channel) and activations (dynamic, per-token), with the `lm_head` layer kept in its original precision to preserve output quality.
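For intuition, here is a minimal sketch of what this scheme means numerically: one static scale per weight output channel, and one activation scale per token computed on the fly. This is an illustration only (it does not reflect llm-compressor or vLLM internals); the constant 448 is the largest normal value of the `float8_e4m3fn` format.

```python
import torch

FP8_MAX = 448.0  # largest normal value representable in float8_e4m3fn

def quantize_weight_per_channel(w: torch.Tensor):
    """Static per-output-channel weight quantization: one scale per row of W."""
    scale = (w.abs().amax(dim=1, keepdim=True) / FP8_MAX).clamp_min(1e-12)
    return (w / scale).to(torch.float8_e4m3fn), scale

def quantize_activation_per_token(x: torch.Tensor):
    """Dynamic per-token activation quantization: scales computed at runtime."""
    scale = (x.abs().amax(dim=-1, keepdim=True) / FP8_MAX).clamp_min(1e-12)
    return (x / scale).to(torch.float8_e4m3fn), scale

w = torch.randn(8, 16)      # [out_features, in_features]
x = torch.randn(2, 4, 16)   # [batch, seq, hidden]
w_fp8, w_scale = quantize_weight_per_channel(w)
x_fp8, x_scale = quantize_activation_per_token(x)

# The dequantized matmul closely approximates the original float computation.
y_approx = (x_fp8.float() * x_scale) @ (w_fp8.float() * w_scale).T
print((y_approx - x @ w.T).abs().max())
```

Because the activation scales are derived from each incoming token at runtime, no calibration data is needed, which is why the quantization script below runs without a dataset.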
## Quantization with llm-compressor

The model was quantized using the `oneshot` method from `llm-compressor` with the `FP8_DYNAMIC` scheme. Because activation scales are computed dynamically at inference time, no calibration dataset was required.

The following script was used for the conversion:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# --- 1. Set the model ID ---
MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"

# --- 2. Load model and tokenizer using the Auto classes ---
print(f"Loading model: {MODEL_ID}...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

# --- 3. Configure the FP8 quantization recipe ---
print("Configuring FP8 quantization recipe...")
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# Apply quantization. This step can take some time.
print("Applying one-shot quantization...")
oneshot(model=model, recipe=recipe, tokenizer=tokenizer)
print("Quantization complete.")

# --- 4. Confirm generation with the Qwen chat template ---
print("\n========== SAMPLE GENERATION ==============")
prompt = "Write a Python function for a quicksort algorithm. Include comments to explain the logic."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
    {"role": "user", "content": prompt},
]

input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

output_ids = model.generate(
    **model_inputs,
    max_new_tokens=256,
)

# Decode only the newly generated tokens.
input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print(f"Generated Response:\n{response}")
print("==========================================")

# --- 5. Save the quantized model and the tokenizer ---
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
print(f"\nSaving quantized model to {SAVE_DIR}...")
model.save_pretrained(SAVE_DIR)

print(f"Saving tokenizer to {SAVE_DIR}...")
tokenizer.save_pretrained(SAVE_DIR)

print(f"\nModel and tokenizer saved successfully to '{SAVE_DIR}'")
```
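After saving, it can be worth sanity-checking the export before uploading. The snippet below is a minimal check, assuming the usual compressed-tensors layout produced by llm-compressor (a `quantization_config` entry in `config.json` and FP8 weight tensors in the safetensors shards); adjust `SAVE_DIR` to match the script above.

```python
import json
from pathlib import Path

from safetensors import safe_open

SAVE_DIR = Path("Qwen2.5-Coder-32B-Instruct-FP8-Dynamic")

# The quantization metadata is embedded in config.json.
config = json.loads((SAVE_DIR / "config.json").read_text())
print(json.dumps(config.get("quantization_config", {}), indent=2))

# Spot-check that linear weights were actually stored as FP8 tensors.
shard_path = str(next(SAVE_DIR.glob("*.safetensors")))
with safe_open(shard_path, framework="pt") as f:
    for name in sorted(f.keys())[:8]:
        print(name, f.get_tensor(name).dtype)
```

If the `quantization_config` block is missing, or the linear weights are still a 16-bit dtype, the quantization or save step did not complete as expected.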
## Inference Example

This model can be loaded and run with `transformers`, or, for optimized FP8 inference, with [vLLM](https://github.com/vllm-project/vllm/).

### Using `transformers` (for a functional check, not FP8-optimized)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_REPO_ID = "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"

# For Qwen models, it is recommended to use trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_REPO_ID,
    trust_remote_code=True,
)

prompt = "Write a complete and efficient implementation of the merge sort algorithm in Rust."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing high-quality Rust code."},
    {"role": "user", "content": prompt},
]

# Apply the chat template to format the prompt correctly
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Tokenize the input and move to the device
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

# Generate output
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Decode only the newly generated tokens
input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("--- Prompt ---")
print(prompt)
print("\n--- Qwen Response ---")
print(response)
```
### Using vLLM (for optimized FP8 inference)

This model, quantized to FP8 with llm-compressor, is designed for efficient inference with vLLM on newer NVIDIA GPUs.

Prerequisites:

- A recent version of vLLM that supports the compressed-tensors format.
- A compatible NVIDIA GPU with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell, or newer); see the check below.
- Docker and the NVIDIA Container Toolkit installed.
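To confirm that your GPU meets the compute-capability requirement, a quick check from Python (assuming PyTorch with CUDA is installed):

```python
import torch

# FP8 inference as described here needs compute capability 8.9 or higher
# (Ada Lovelace, Hopper, Blackwell, or newer).
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("FP8-capable:", (major, minor) >= (8, 9))
```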
Running with Docker (Recommended):

The following command starts a vLLM OpenAI-compatible server with this quantized model:
```bash
# 1. Set your Hugging Face token (optional, but recommended)
# export HF_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"

# 2. Run the vLLM Docker container.
#    Replace 'vllm/vllm-openai:latest' with a recent official build.
sudo docker run --gpus all \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:latest \
  --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic \
  --tokenizer-mode auto \
  --load-format auto \
  --trust-remote-code \
  --max-model-len 4096  # Optional: adjust based on your VRAM
```
Once running, the server exposes an OpenAI-compatible API at `http://localhost:8000/v1`. You can use any OpenAI client library (e.g., the `openai` Python package) or `curl` to send requests, for example:
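A minimal request with the `openai` Python client (the API key is a placeholder, since the server started above does not enforce authentication):

```python
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    messages=[
        {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
        {"role": "user", "content": "Write a Python function for a quicksort algorithm."},
    ],
    max_tokens=512,
    temperature=0.6,
)
print(response.choices[0].message.content)
```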
## Original Model Card (Qwen/Qwen2.5-Coder-32B-Instruct)

For more details on the base model, its capabilities, and licensing, please refer to the original model card: https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct