---
license: apache-2.0
tags:
- qwen
- qwen2
- fp8
- quantization
- llm-compressor
- vllm
- code-generation
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-Coder-32B-Instruct
---

# Qwen2.5-Coder-32B-Instruct-FP8-dynamic

This is a version of [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) quantized to FP8 (static per-channel weights and dynamic per-token activations) using [llm-compressor](https://github.com/vllm-project/llm-compressor).

This format is particularly useful for accelerated inference with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell, or newer).

## Model Description

Qwen2.5-Coder-32B-Instruct is a state-of-the-art large language model from Alibaba Cloud, specialized for coding tasks. This version has been quantized to FP8 precision for weights (static, per-channel) and activations (dynamic, per-token), with the `lm_head` layer kept in its original precision to preserve output quality.
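For intuition, here is a minimal sketch of what this scheme means numerically: one static scale per weight output channel, and one activation scale per token computed on the fly. This is an illustration only (it does not reflect llm-compressor or vLLM internals); the constant 448 is the largest normal value of the `float8_e4m3fn` format.

```python
import torch

FP8_MAX = 448.0  # largest normal value representable in float8_e4m3fn

def quantize_weight_per_channel(w: torch.Tensor):
    """Static per-output-channel weight quantization: one scale per row of W."""
    scale = (w.abs().amax(dim=1, keepdim=True) / FP8_MAX).clamp_min(1e-12)
    return (w / scale).to(torch.float8_e4m3fn), scale

def quantize_activation_per_token(x: torch.Tensor):
    """Dynamic per-token activation quantization: scales computed at runtime."""
    scale = (x.abs().amax(dim=-1, keepdim=True) / FP8_MAX).clamp_min(1e-12)
    return (x / scale).to(torch.float8_e4m3fn), scale

w = torch.randn(8, 16)      # [out_features, in_features]
x = torch.randn(2, 4, 16)   # [batch, seq, hidden]
w_fp8, w_scale = quantize_weight_per_channel(w)
x_fp8, x_scale = quantize_activation_per_token(x)

# The dequantized matmul closely approximates the original float computation.
y_approx = (x_fp8.float() * x_scale) @ (w_fp8.float() * w_scale).T
print((y_approx - x @ w.T).abs().max())
```

Because the activation scales are derived from each incoming token at runtime, no calibration data is needed, which is why the quantization script below runs without a dataset.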
## Quantization with llm-compressor

The model was quantized using the `oneshot` method from `llm-compressor` with the `FP8_DYNAMIC` scheme. Because activation scales are computed dynamically at inference time, no calibration dataset was required.

The following script was used for the conversion:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# --- 1. Set the model ID ---
MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"

# --- 2. Load model and tokenizer using the Auto classes ---
print(f"Loading model: {MODEL_ID}...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

# --- 3. Configure the FP8 quantization recipe ---
print("Configuring FP8 quantization recipe...")
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# Apply quantization. This step can take some time.
print("Applying one-shot quantization...")
oneshot(model=model, recipe=recipe, tokenizer=tokenizer)
print("Quantization complete.")

# --- 4. Confirm generation with the Qwen chat template ---
print("\n========== SAMPLE GENERATION ==============")
prompt = "Write a Python function for a quicksort algorithm. Include comments to explain the logic."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
    {"role": "user", "content": prompt},
]

input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

output_ids = model.generate(
    **model_inputs,
    max_new_tokens=256,
)

# Decode only the newly generated tokens.
input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print(f"Generated Response:\n{response}")
print("==========================================")

# --- 5. Save the quantized model and the tokenizer ---
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
print(f"\nSaving quantized model to {SAVE_DIR}...")
model.save_pretrained(SAVE_DIR)

print(f"Saving tokenizer to {SAVE_DIR}...")
tokenizer.save_pretrained(SAVE_DIR)

print(f"\nModel and tokenizer saved successfully to '{SAVE_DIR}'")
```
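After saving, it can be worth sanity-checking the export before uploading. The snippet below is a minimal check, assuming the usual compressed-tensors layout produced by llm-compressor (a `quantization_config` entry in `config.json` and FP8 weight tensors in the safetensors shards); adjust `SAVE_DIR` to match the script above.

```python
import json
from pathlib import Path

from safetensors import safe_open

SAVE_DIR = Path("Qwen2.5-Coder-32B-Instruct-FP8-Dynamic")

# The quantization metadata is embedded in config.json.
config = json.loads((SAVE_DIR / "config.json").read_text())
print(json.dumps(config.get("quantization_config", {}), indent=2))

# Spot-check that linear weights were actually stored as FP8 tensors.
shard_path = str(next(SAVE_DIR.glob("*.safetensors")))
with safe_open(shard_path, framework="pt") as f:
    for name in sorted(f.keys())[:8]:
        print(name, f.get_tensor(name).dtype)
```

If the `quantization_config` block is missing, or the linear weights are still a 16-bit dtype, the quantization or save step did not complete as expected.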
## Inference Example

This model can be loaded and run with `transformers`, or, for optimized FP8 inference, with [vLLM](https://github.com/vllm-project/vllm/).

### Using `transformers` (for a functional check, not FP8-optimized)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_REPO_ID = "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"

# For Qwen models, it is recommended to use trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_REPO_ID,
    trust_remote_code=True,
)

prompt = "Write a complete and efficient implementation of the merge sort algorithm in Rust."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing high-quality Rust code."},
    {"role": "user", "content": prompt},
]

# Apply the chat template to format the prompt correctly
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Tokenize the input and move to the device
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

# Generate output
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Decode only the newly generated tokens
input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("--- Prompt ---")
print(prompt)
print("\n--- Qwen Response ---")
print(response)
```
### Using vLLM (for optimized FP8 inference)

This model, quantized to FP8 with llm-compressor, is designed for efficient inference with vLLM on newer NVIDIA GPUs.

Prerequisites:

- A recent version of vLLM that supports the compressed-tensors format.
- A compatible NVIDIA GPU with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell, or newer); see the check below.
- Docker and the NVIDIA Container Toolkit installed.
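To confirm that your GPU meets the compute-capability requirement, a quick check from Python (assuming PyTorch with CUDA is installed):

```python
import torch

# FP8 inference as described here needs compute capability 8.9 or higher
# (Ada Lovelace, Hopper, Blackwell, or newer).
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("FP8-capable:", (major, minor) >= (8, 9))
```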
Running with Docker (Recommended):

The following command starts a vLLM OpenAI-compatible server with this quantized model:
```bash
# 1. Set your Hugging Face token (optional, but recommended)
# export HF_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"

# 2. Run the vLLM Docker container.
#    Replace 'vllm/vllm-openai:latest' with a recent official build.
sudo docker run --gpus all \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:latest \
  --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic \
  --tokenizer-mode auto \
  --load-format auto \
  --trust-remote-code \
  --max-model-len 4096  # Optional: adjust based on your VRAM
```
Once running, the server exposes an OpenAI-compatible API at `http://localhost:8000/v1`. You can use any OpenAI client library (e.g., the `openai` Python package) or `curl` to send requests, for example:
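A minimal request with the `openai` Python client (the API key is a placeholder, since the server started above does not enforce authentication):

```python
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    messages=[
        {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
        {"role": "user", "content": "Write a Python function for a quicksort algorithm."},
    ],
    max_tokens=512,
    temperature=0.6,
)
print(response.choices[0].message.content)
```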
## Original Model Card (Qwen/Qwen2.5-Coder-32B-Instruct)

For more details on the base model, its capabilities, and licensing, please refer to the original model card: https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct