DeepSeek-R1-0528-quantized.w4a16 / README.md

ekurtic

Update README.md

e9b0930 verified 5 months ago

preview code

raw

history blame

2.46 kB

metadata

license: mit
library_name: vllm
base_model:
  - deepseek-ai/DeepSeek-R1-0528
pipeline_tag: text-generation
tags:
  - deepseek
  - neuralmagic
  - redhat
  - llmcompressor
  - quantized
  - INT4
  - GPTQ

DeepSeek-R1-0528-quantized.w4a16

Model Overview

Model Architecture: DeepseekV3ForCausalLM
- Input: Text
- Output: Text
Model Optimizations:
- Activation quantization: None
- Weight quantization: INT4
Release Date: 05/30/2025
Version: 1.0
Model Developers: Red Hat (Neural Magic)

Model Optimizations

This model was obtained by quantizing weights of DeepSeek-R1-0528 to INT4 data type. This optimization reduces the number of bits used to represent weights from 8 to 4, reducing GPU memory requirements (by approximately 50%). Weight quantization also reduces disk size requirements by approximately 50%.

Deployment

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
model_id = "RedHatAI/DeepSeek-R1-0528-quantized.w4a16"
number_gpus = 8
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Give me a short introduction to large language model."
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(prompt, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.

Evaluation (More evals coming soon)

unquantized baseline on GSM8k

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9591|±  |0.0055|
|     |       |strict-match    |     5|exact_match|↑  |0.9568|±  |0.0056|

this INT4 quantized model on GSM8k

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9560|±  |0.0056|
|     |       |strict-match    |     5|exact_match|↑  |0.9553|±  |0.0057|