Eldar Kurtic committed on
Commit 9aa823f · 2 Parent(s): 5dc641c e9b0930

Merge branch 'main' of https://huggingface.co/RedHatAI/DeepSeek-R1-0528-quantized.w4a16

Files changed (1)
  1. README.md +60 -2
README.md CHANGED
@@ -1,4 +1,62 @@
- # More evals coming soon

  - unquantized baseline on GSM8k
  ```bash
@@ -14,4 +72,4 @@
  |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
  |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9560|± |0.0056|
  | | |strict-match | 5|exact_match|↑ |0.9553|± |0.0057|
- ```

+ ---
+ license: mit
+ library_name: vllm
+ base_model:
+ - deepseek-ai/DeepSeek-R1-0528
+ pipeline_tag: text-generation
+ tags:
+ - deepseek
+ - neuralmagic
+ - redhat
+ - llmcompressor
+ - quantized
+ - INT4
+ - GPTQ
+ ---
+ 
+ # DeepSeek-R1-0528-quantized.w4a16
+ 
+ ## Model Overview
+ - **Model Architecture:** DeepseekV3ForCausalLM
+ - **Input:** Text
+ - **Output:** Text
+ - **Model Optimizations:**
+   - **Activation quantization:** None
+   - **Weight quantization:** INT4
+ - **Release Date:** 05/30/2025
+ - **Version:** 1.0
+ - **Model Developers:** Red Hat (Neural Magic)
+ 
+ ### Model Optimizations
+ 
+ This model was obtained by quantizing the weights of [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) to the INT4 data type.
+ This optimization reduces the number of bits used to represent each weight from 8 to 4, cutting GPU memory requirements by approximately 50%.
+ Weight quantization also reduces disk size requirements by approximately 50%.
+ 
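+ As a rough sketch, a checkpoint like this can be produced with [llm-compressor](https://github.com/vllm-project/llm-compressor); the calibration dataset, sample count, and sequence length below are illustrative assumptions, not the exact recipe used for this release.
+ 
+ ```python
+ from llmcompressor import oneshot
+ from llmcompressor.modifiers.quantization import GPTQModifier
+ 
+ # Quantize Linear weights to INT4 with GPTQ while keeping activations at
+ # 16 bits (W4A16); the output head stays unquantized.
+ recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
+ 
+ oneshot(
+     model="deepseek-ai/DeepSeek-R1-0528",
+     dataset="open_platypus",        # assumed calibration set
+     recipe=recipe,
+     max_seq_length=2048,            # assumed
+     num_calibration_samples=512,    # assumed
+     output_dir="DeepSeek-R1-0528-quantized.w4a16",
+ )
+ ```
+ 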
+ ## Deployment
+ 
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
+ 
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+ 
+ model_id = "RedHatAI/DeepSeek-R1-0528-quantized.w4a16"
+ number_gpus = 8
+ 
+ sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)
+ 
+ # Build the prompt with the model's chat template.
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ 
+ llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
+ outputs = llm.generate(prompt, sampling_params)
+ 
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+ 
+ vLLM also supports OpenAI-compatible serving; see the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+ 
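+ As a minimal client sketch, assuming the server was started with something like `vllm serve RedHatAI/DeepSeek-R1-0528-quantized.w4a16 --tensor-parallel-size 8` and is listening on the default local port:
+ 
+ ```python
+ from openai import OpenAI
+ 
+ # Point the OpenAI client at the local vLLM server (assumed address/port).
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ 
+ completion = client.chat.completions.create(
+     model="RedHatAI/DeepSeek-R1-0528-quantized.w4a16",
+     messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
+     temperature=0.6,
+     top_p=0.95,
+     max_tokens=256,
+ )
+ print(completion.choices[0].message.content)
+ ```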
+ 
+ ## Evaluation (More evals coming soon)

  - unquantized baseline on GSM8k
  ```bash

  |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
  |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9560|± |0.0056|
  | | |strict-match | 5|exact_match|↑ |0.9553|± |0.0057|
+ ```
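+ 
+ A comparable GSM8k run can be sketched with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) Python API; the 5-shot setup matches the table above, while the vLLM backend settings are assumptions.
+ 
+ ```python
+ import lm_eval
+ 
+ # 5-shot GSM8k on the unquantized baseline, as reported in the table above.
+ results = lm_eval.simple_evaluate(
+     model="vllm",
+     model_args="pretrained=deepseek-ai/DeepSeek-R1-0528,tensor_parallel_size=8",
+     tasks=["gsm8k"],
+     num_fewshot=5,
+ )
+ print(results["results"]["gsm8k"])
+ ```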