+---
+license: apache-2.0
+tags:
+- qwen
+- qwen2
+- fp8
+- quantization
+- llm-compressor
+- vllm
+- code-generation
+pipeline_tag: text-generation
+base_model:
+- Qwen/Qwen2.5-Coder-32B-Instruct
+---
+# Qwen2.5-Coder-32B-Instruct-FP8-dynamic
+This is a version of [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) quantized to FP8 (weights and dynamic activations) using [llm-compressor](https://github.com/vllm-project/llm-compressor).
+This model format is particularly useful for accelerated inference with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell or newer).
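The compute-capability requirement above can be checked before deploying. The following is a minimal sketch, not part of the original card: the helper names `supports_fp8` and `local_gpu_supports_fp8` are hypothetical, and the check assumes PyTorch is installed (it returns `False` otherwise).

```python
def supports_fp8(major: int, minor: int) -> bool:
    """Return True if a CUDA compute capability meets the 8.9 threshold."""
    # Tuple comparison handles both 8.9 (Ada Lovelace) and 9.0+ (Hopper and newer).
    return (major, minor) >= (8, 9)


def local_gpu_supports_fp8() -> bool:
    """Check the first visible CUDA device; False if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability(0)
    return supports_fp8(major, minor)


if __name__ == "__main__":
    print("FP8-capable GPU detected:", local_gpu_supports_fp8())
```

An Ampere card such as an A100 reports capability 8.0, so it falls below the threshold stated above.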
+## Model Description
+Qwen2.5-Coder-32B-Instruct is a large language model from Alibaba Cloud specialized for coding tasks. In this version, weights are quantized to FP8 with static per-channel scales, and activations are quantized to FP8 with dynamic per-token scales. The `lm_head` layer is kept in its original precision to maintain output quality.
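To illustrate the difference between per-channel weight scales and per-token activation scales, here is a rough NumPy sketch. It only shows how the scales could be derived from absolute maxima, assuming the FP8 E4M3 format with a maximum finite value of 448. It does not perform real FP8 rounding, and the function names are hypothetical, so it should not be read as llm-compressor's actual implementation.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def per_channel_weight_scales(weight):
    """One scale per output channel (row) of an [out_features, in_features] weight."""
    max_abs = np.abs(weight).max(axis=1, keepdims=True)
    # Guard against all-zero rows to avoid division by zero later.
    return np.maximum(max_abs, 1e-12) / FP8_E4M3_MAX


def per_token_activation_scales(activations):
    """One scale per token (row) of a [num_tokens, hidden_size] activation matrix."""
    max_abs = np.abs(activations).max(axis=1, keepdims=True)
    return np.maximum(max_abs, 1e-12) / FP8_E4M3_MAX


rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 8)).astype(np.float32)       # computed once, offline
activations = rng.normal(size=(3, 8)).astype(np.float32)  # recomputed for every batch

w_scale = per_channel_weight_scales(weight)
a_scale = per_token_activation_scales(activations)

# Dividing by the scale maps each row into the representable FP8 range.
w_scaled = weight / w_scale
a_scaled = activations / a_scale
```

Weight scales depend only on the checkpoint, so they can be stored with the model ("static"). Activation scales depend on the input, so they are derived at inference time ("dynamic").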
+## Quantization with llm-compressor
+The model was quantized using the `oneshot` method from `llm-compressor` with the `FP8_DYNAMIC` scheme.
+No calibration dataset is required for this scheme: weight scales are computed directly from the weights, and activation scales are computed at runtime.
+The following script was used for conversion:
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization import QuantizationModifier
+# --- 1. Set the model ID ---
+MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"
+# --- 2. Load model and tokenizer using Auto classes ---
+print(f"Loading model: {MODEL_ID}...")
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_ID,
+    device_map="auto",
+    torch_dtype="auto",
+    trust_remote_code=True,
+)
+print("Loading tokenizer...")
+tokenizer = AutoTokenizer.from_pretrained(
+    MODEL_ID,
+    trust_remote_code=True,
+)
+# --- 3. Define the FP8 quantization recipe (all Linear layers except lm_head) ---
+print("Configuring FP8 quantization recipe...")
+recipe = QuantizationModifier(
+    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
+)
+# Apply quantization. This step can take some time.
+print("Applying one-shot quantization...")
+oneshot(model=model, recipe=recipe, tokenizer=tokenizer)
+print("Quantization complete.")
+# --- 4. Confirm generation with the Qwen chat template ---
+print("\n========== SAMPLE GENERATION ==============")
+prompt = "Write a Python function for a quicksort algorithm. Include comments to explain the logic."
+messages = [
+    {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
+    {"role": "user", "content": prompt}
+]
+input_text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)
+output_ids = model.generate(
+    **model_inputs,
+    max_new_tokens=256,
+)
+input_token_len = model_inputs.input_ids.shape[1]
+generated_tokens = output_ids[0, input_token_len:]
+response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
+print(f"Generated Response:\n{response}")
+print("==========================================")
+# --- 5. Save the quantized model and the tokenizer correctly ---
+SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
+print(f"\nSaving quantized model to {SAVE_DIR}...")
+model.save_pretrained(SAVE_DIR)
+print(f"Saving tokenizer to {SAVE_DIR}...")
+tokenizer.save_pretrained(SAVE_DIR)
+print(f"\nModel and tokenizer saved successfully to '{SAVE_DIR}'")
+```
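After saving, it can be useful to confirm that the checkpoint carries quantization metadata. The sketch below assumes that the metadata is written under a `quantization_config` key in `config.json`, which is how Hugging Face quantized checkpoints are commonly laid out. The helper name `read_quantization_config` is hypothetical, and the directory name is the one the script above derives for `SAVE_DIR`.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the `quantization_config` dict from config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("quantization_config")


if __name__ == "__main__":
    # Directory name produced by the conversion script's SAVE_DIR expression.
    save_dir = Path("Qwen2.5-Coder-32B-Instruct-FP8-Dynamic")
    if (save_dir / "config.json").exists():
        print(json.dumps(read_quantization_config(save_dir), indent=2))
    else:
        print(f"No config.json found in {save_dir}")
```

A return value of `None` would suggest the model was saved without quantization metadata and is worth investigating before upload.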
+## Inference Example
+This model can be loaded and run with `transformers`, or for optimized FP8 inference, with [vLLM](https://github.com/vllm-project/vllm/).
+### Using `transformers` (for functional checking, not FP8 optimized)
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+MODEL_REPO_ID = "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"
+# Qwen2.5 is supported natively by recent transformers releases, so
+# trust_remote_code=True is optional; it is kept here to match the conversion script.
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_REPO_ID,
+    device_map="auto",
+    torch_dtype="auto",
+    trust_remote_code=True
+)
+tokenizer = AutoTokenizer.from_pretrained(
+    MODEL_REPO_ID,
+    trust_remote_code=True
+)
+prompt = "Write a complete and efficient implementation of the merge sort algorithm in Rust."
+messages = [
+    {"role": "system", "content": "You are a helpful assistant specialized in writing high-quality Rust code."},
+    {"role": "user", "content": prompt}
+]
+# Apply the chat template to format the prompt correctly
+input_text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+# Tokenize the input and move to the device
+model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)
+# Generate output
+output_ids = model.generate(
+    **model_inputs,
+    max_new_tokens=1024,
+    do_sample=True,
+    temperature=0.6,
+    top_p=0.9
+)
+# Decode only the newly generated tokens
+input_token_len = model_inputs.input_ids.shape[1]
+generated_tokens = output_ids[0, input_token_len:]
+response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
+print("--- Prompt ---")
+print(prompt)
+print("\n--- Qwen Response ---")
+print(response)
+```
+### Using vLLM (for optimized FP8 inference)
+This model, quantized to FP8 with llm-compressor, is designed for efficient inference with vLLM on newer NVIDIA GPUs.
+Prerequisites:
+- A recent version of vLLM that supports compressed-tensors.
+- A compatible NVIDIA GPU (Ada Lovelace, Hopper, Blackwell, or newer).
+- Docker and NVIDIA Container Toolkit installed.
+Running with Docker (Recommended):
+The following command starts a vLLM OpenAI-compatible server with this quantized model:
+```bash
+# 1. Set your Hugging Face Token (optional, but recommended)
+# export HF_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"
+# 2. Run the vLLM Docker container.
+# Replace 'vllm/vllm-openai:latest' with a recent official build.
+sudo docker run --gpus all \
+    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
+    -p 8000:8000 \
+    -e HF_TOKEN="$HF_TOKEN" \
+    vllm/vllm-openai:latest \
+    --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic \
+    --tokenizer-mode auto \
+    --load-format auto \
+    --trust-remote-code \
+    --max-model-len 4096 # Optional: Adjust based on your VRAM
+```
+Once running, the server exposes an OpenAI-compatible API at `http://localhost:8000/v1/`. You can send requests with any OpenAI client library (e.g., `openai` for Python) or with `curl`.
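As an illustration of talking to the server without extra dependencies, the following sketch builds a chat completion request using only the Python standard library. It assumes the server from the Docker command above is listening on port 8000 and serves the model under its repository name. The helper names `build_chat_request` and `ask` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"


def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to the running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the container running, `print(ask("Write a Python function that reverses a string."))` should print the model's reply. Keep `max_tokens` plus the prompt length within the `--max-model-len` value passed to the server.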
+## Original Model Card (Qwen/Qwen2.5-Coder-32B-Instruct)
+For more details on the base model, its capabilities, and licensing, please refer to the original model card: https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct