Model Overview
- Model Architecture: DeepSeek-R1-0528
- Input: Text
- Output: Text
- Supported Hardware Microarchitecture: AMD MI350/MI355
- ROCm: 7.0
- Operating System(s): Linux
- Inference Engine: SGLang/vLLM
- Model Optimizer: AMD-Quark (V0.10)
- Weight quantization: Per-channel, FP8E4M3, Static
- Activation quantization: Per-token, FP8E4M3, Dynamic
- Calibration Dataset: Pile
This model was built from the deepseek-ai DeepSeek-R1-0528 model by applying AMD-Quark FP8E4M3 PTPC (per-token activation, per-channel weight) quantization.
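The PTPC scheme listed above can be illustrated with a small pure-Python simulation. This is a sketch under stated assumptions, not AMD-Quark's implementation. It assumes the OCP `float8_e4m3fn` format (largest finite value 448), round-to-nearest-even, and absmax scaling (`scale = amax / 448`). The helper names `round_to_e4m3`, `quantize_rows`, `ptpc_linear` and `linear` are hypothetical. Each weight row (output channel) and each activation row (token) carries its own scale. As a result, the two scales factor out of every dot product as the outer product `s_x[t] * s_w[o]`.

```python
import math

E4M3_MAX = 448.0              # largest finite value of FP8 E4M3 (e4m3fn variant, assumed)
E4M3_MIN_NORMAL = 2.0 ** -6   # smallest normal E4M3 value

def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3 value (ties to even), saturating at +/-448."""
    a = abs(x)
    if a == 0.0:
        return 0.0
    if a >= E4M3_MIN_NORMAL:
        _, e = math.frexp(a)              # a = m * 2**e with 0.5 <= m < 1
        step = 2.0 ** (e - 1 - 3)         # 3 mantissa bits per binade
    else:
        step = 2.0 ** -9                  # fixed spacing of subnormals
    q = min(round(a / step) * step, E4M3_MAX)
    return math.copysign(q, x)

def quantize_rows(m):
    """One absmax scale per row; returns (quantized rows, scales)."""
    qs, scales = [], []
    for row in m:
        amax = max(abs(v) for v in row)
        s = amax / E4M3_MAX if amax > 0 else 1.0
        qs.append([round_to_e4m3(v / s) for v in row])
        scales.append(s)
    return qs, scales

def ptpc_linear(x, w):
    """Y = X @ W^T with X: [tokens, in] and W: [out, in], both simulated in FP8."""
    xq, sx = quantize_rows(x)   # per-token scales: dynamic, recomputed for every input
    wq, sw = quantize_rows(w)   # per-channel scales: static, computable once offline
    return [[sx[t] * sw[o] * sum(a * b for a, b in zip(xq[t], wq[o]))
             for o in range(len(w))] for t in range(len(x))]

def linear(x, w):
    """Full-precision reference for comparison."""
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]

x = [[0.5, -1.25, 3.0], [10.0, 0.1, -2.0]]   # 2 tokens, hidden size 3
w = [[0.2, 0.4, -0.1], [-1.5, 0.03, 0.7]]    # 2 output channels
print(ptpc_linear(x, w))
print(linear(x, w))
```

The two printed matrices should agree to within FP8 precision. The per-token scale adapts to outlier tokens, such as the second row with its large first entry, without affecting the other tokens.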
Model Quantization
The model was quantized from deepseek-ai/DeepSeek-R1-0528 using AMD-Quark. Weights are quantized to FP8 (E4M3) with static per-channel scales. Activations are quantized to FP8 (E4M3) with dynamic per-token scales computed at inference time.
Preprocessing requirement:
Before executing the quantization script below, the original FP8 model must first be dequantized to BFloat16. You can either perform the dequantization manually using this conversion script, or use the pre-converted BFloat16 model available at unsloth/DeepSeek-R1-0528-BF16.
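The dequantization step can be sketched as follows. This is a minimal illustration, not the referenced conversion script. It assumes that, as in the DeepSeek-V3/R1 checkpoint format, each FP8 weight is stored with one scale per 128×128 block (the `weight_scale_inv` tensors), and that this scale is multiplied back in. The FP8 values are represented here as already-decoded Python floats. The helpers `to_bfloat16` and `dequantize_blockwise_to_bf16` are hypothetical.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round x to the nearest BFloat16 value (ties to even; NaN handling omitted)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                   # round to nearest even on the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def dequantize_blockwise_to_bf16(w_fp8, scale_inv, block=128):
    """Multiply every element by the scale of the block it falls in, then cast to BF16."""
    rows, cols = len(w_fp8), len(w_fp8[0])
    return [[to_bfloat16(w_fp8[i][j] * scale_inv[i // block][j // block])
             for j in range(cols)] for i in range(rows)]

# Toy example with 2x2 blocks instead of 128x128.
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
s = [[0.5, 2.0], [1.0, 0.25]]
print(dequantize_blockwise_to_bf16(w, s, block=2))
```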
Quantization script:
# pip install amd-quark
from transformers import AutoTokenizer, AutoModelForCausalLM
from quark.torch import ModelQuantizer, export_safetensors
from quark.torch.quantization import FP8E4M3PerChannelSpec
from quark.torch.quantization.config.config import Config, QuantizationConfig
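ckpt_path = "unsloth/DeepSeek-R1-0528-BF16"  # BFloat16 checkpoint (see the preprocessing requirement)
# Keep the output head and the MoE router gates ("*mlp.gate") unquantized
exclude_layers = ["lm_head", "*mlp.gate"]
# Output directory, e.g. "DeepSeek-R1-0528-BF16-ptpc"
output_dir = ckpt_path.rstrip("/").split("/")[-1] + "-ptpc"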
# Load the original floating-point model
model = AutoModelForCausalLM.from_pretrained(ckpt_path, device_map="auto", torch_dtype="auto", trust_remote_code=True)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt_path)
# Set the quantization configuration
FP8_PER_CHANNEL_SPEC = FP8E4M3PerChannelSpec(is_dynamic=False, ch_axis=0).to_quantization_spec()
FP8_PER_TOKEN_DYNAMIC_SPEC = FP8E4M3PerChannelSpec(is_dynamic=True, ch_axis=1).to_quantization_spec()
W_FP8_PER_CHANNEL_STATIC_A_FP8_PER_TOKEN_DYNAMIC_CONFIG = QuantizationConfig(input_tensors=FP8_PER_TOKEN_DYNAMIC_SPEC, weight=FP8_PER_CHANNEL_SPEC)
quant_config = Config(global_quant_config=W_FP8_PER_CHANNEL_STATIC_A_FP8_PER_TOKEN_DYNAMIC_CONFIG, exclude=exclude_layers)
# Apply quantization
quantizer = ModelQuantizer(quant_config)
model = quantizer.quantize_model(model)
# Export quantized model
model = quantizer.freeze(model)
export_safetensors(model, output_dir)
tokenizer.save_pretrained(output_dir)
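To see which modules the `exclude_layers` patterns in the script select, the sketch below uses shell-style wildcard matching against fully qualified module names. This is an assumption: how AMD-Quark actually resolves `exclude` entries is not stated here. The module names follow the Hugging Face DeepSeek-V3/R1 naming and are illustrative. The helper `is_excluded` is hypothetical.

```python
from fnmatch import fnmatchcase

EXCLUDE_LAYERS = ["lm_head", "*mlp.gate"]  # same patterns as exclude_layers in the script

def is_excluded(module_name, patterns=EXCLUDE_LAYERS):
    """True if the full module name matches any wildcard pattern."""
    return any(fnmatchcase(module_name, p) for p in patterns)

for name in [
    "lm_head",                                 # output head
    "model.layers.3.mlp.gate",                 # MoE router gate
    "model.layers.0.mlp.gate_proj",            # dense MLP projection
    "model.layers.3.mlp.experts.0.gate_proj",  # expert projection
    "model.layers.3.self_attn.q_a_proj",       # attention projection
]:
    print(f"{name:42s} excluded={is_excluded(name)}")
```

Under this assumption, the pattern must match the whole name. As a result, `*mlp.gate` selects only the router gates and leaves the `gate_proj` linear layers of the dense MLPs and experts to be quantized.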
Accuracy
| Benchmark | DeepSeek-R1-0528 | DeepSeek-R1-0528-ptpc (this model) |
|---|---|---|
| GPQA-diamond | 80.72 | 80.05 |
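The table can be summarized as an absolute delta and a recovery ratio, which is the quantized score as a percentage of the baseline. The small helper `recovery` below is hypothetical and uses only the two GPQA-diamond numbers from the table.

```python
def recovery(baseline: float, quantized: float):
    """Return (delta in points, quantized score as % of baseline)."""
    return quantized - baseline, 100.0 * quantized / baseline

delta, pct = recovery(80.72, 80.05)
print(f"delta = {delta:+.2f} points, recovery = {pct:.2f}%")
# prints: delta = -0.67 points, recovery = 99.17%
```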
Reproduction
- Docker: rocm/vllm-private:rocm7.1_ubuntu22.04_vllm0.11.2_ptpc_fp8
- vLLM version: 0.10.1.1+rocm710
- aiter version: 0.1.6.post2.dev55+g59bd8ff2c
- lighteval version: 0.12.2
MODEL_ARGS="model_name=/amd/DeepSeek-R1-0528-ptpc,dtype=auto,tensor_parallel_size=8,max_model_length=71536,max_num_batched_tokens=65536,gpu_memory_utilization=0.9,generation_parameters={max_new_tokens:65536,temperature:0.6,top_p:0.95,seed:100}"
lighteval vllm $MODEL_ARGS "lighteval|gpqa:diamond|0"
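The `MODEL_ARGS` string packs engine settings and a nested `generation_parameters` block into one comma-separated value. The sketch below rebuilds it from two dictionaries so that individual settings are easier to read and change. It also computes how many tokens remain for the prompt. The helper `build_model_args` is hypothetical, and the prompt budget assumes that `max_model_length` bounds prompt plus generated tokens, as vLLM's context length does.

```python
def build_model_args(engine: dict, generation: dict) -> str:
    """Join engine settings as key=value and generation settings as {key:value,...}."""
    gen = ",".join(f"{k}:{v}" for k, v in generation.items())
    parts = [f"{k}={v}" for k, v in engine.items()]
    parts.append(f"generation_parameters={{{gen}}}")
    return ",".join(parts)

engine = {
    "model_name": "/amd/DeepSeek-R1-0528-ptpc",
    "dtype": "auto",
    "tensor_parallel_size": 8,
    "max_model_length": 71536,
    "max_num_batched_tokens": 65536,
    "gpu_memory_utilization": 0.9,
}
generation = {"max_new_tokens": 65536, "temperature": 0.6, "top_p": 0.95, "seed": 100}

model_args = build_model_args(engine, generation)
# Tokens left for the prompt once the full generation budget is reserved
prompt_budget = engine["max_model_length"] - generation["max_new_tokens"]
print(model_args)
print(prompt_budget)
```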
Deployment
This model can be deployed efficiently with the vLLM backend.
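The following is a minimal offline-inference sketch with vLLM's Python API (`vllm.LLM`, `SamplingParams`). It assumes a ROCm build of vLLM on eight GPUs and that the repo id `amd/DeepSeek-R1-0528-ptpc` (from the model tree) is reachable. It also assumes that vLLM reads the quantization settings from the checkpoint's config, so no quantization argument is passed. The helper names are hypothetical. `generate` passes raw prompt strings, so no chat template is applied.

```python
MODEL_ID = "amd/DeepSeek-R1-0528-ptpc"

def engine_kwargs(model: str = MODEL_ID, tensor_parallel_size: int = 8) -> dict:
    # Assumption: quantization settings are picked up from the checkpoint config.
    return {"model": model, "tensor_parallel_size": tensor_parallel_size}

def sampling_kwargs(max_tokens: int = 65536) -> dict:
    # Same sampling settings as the GPQA-diamond reproduction.
    return {"temperature": 0.6, "top_p": 0.95, "max_tokens": max_tokens, "seed": 100}

def generate(prompts, max_tokens: int = 4096):
    # Imported lazily: requires vLLM and the GPUs to be available.
    from vllm import LLM, SamplingParams
    llm = LLM(**engine_kwargs())
    params = SamplingParams(**sampling_kwargs(max_tokens))
    return [out.outputs[0].text for out in llm.generate(prompts, params)]
```

Usage on a suitable machine: `print(generate(["What is 17 * 23?"])[0])`.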
License
Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.
Model tree for amd/DeepSeek-R1-0528-ptpc
Base model
deepseek-ai/DeepSeek-R1-0528