Intel/DeepSeek-V3.1-int4-AutoRound

Model Details

This model is an int4 model with group_size 128 and symmetric quantization of deepseek-ai/DeepSeek-V3.1, generated by intel/auto-round. Please follow the license of the original model.
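
To make the terms above concrete, here is a minimal sketch of symmetric int4 quantization with a group size of 128. It uses plain round-to-nearest, which is an assumption made for illustration: AutoRound tunes the rounding with signed gradient descent (see the citation at the end), and its actual scale computation and weight packing are not reproduced here. The function names and the choice of dividing by 7 are hypothetical.

```python
from typing import List, Tuple

Group = Tuple[List[int], float]  # (int4 codes, scale shared by the group)


def quantize_group_sym_int4(weights: List[float]) -> Group:
    """Round-to-nearest symmetric int4 quantization of one weight group."""
    max_abs = max(abs(w) for w in weights)
    # Symmetric: no zero point, one scale maps the largest magnitude to 7.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # int4 codes live in [-8, 7]; clamp in case of floating-point spill.
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def quantize_sym_int4(weights: List[float], group_size: int = 128) -> List[Group]:
    """Split a flat weight list into groups and quantize each independently."""
    return [
        quantize_group_sym_int4(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups: List[Group]) -> List[float]:
    """Recover approximate weights: code * scale for every element."""
    return [code * scale for codes, scale in groups for code in codes]


weights = [i / 100 - 1.28 for i in range(256)]  # toy data, two groups of 128
groups = quantize_sym_int4(weights)
restored = dequantize(groups)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups: {len(groups)}, worst absolute error: {worst:.4f}")
```

A smaller group size gives each scale fewer weights to cover, which usually lowers the error at the cost of storing more scales.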

How To Use

INT4 Inference

Potential overflow/underflow issues have been observed on CUDA, primarily due to kernel limitations. For better accuracy, we recommend deploying the model on CPU or using our INT4 mixed version.
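
As a minimal sketch of the CPU recommendation, the helper below loads the checkpoint with `device_map="cpu"` so that no CUDA kernels are involved. The helper name is hypothetical, it assumes the same Python environment as the script in this section, and a model of this size needs a very large amount of host memory.

```python
def load_int4_on_cpu(model_dir: str = "Intel/DeepSeek-V3.1-int4-AutoRound"):
    """Load the quantized checkpoint with every module placed on the CPU."""
    # Imported here so that defining the helper does not load the libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        torch_dtype=torch.bfloat16,
        device_map="cpu",  # keep all weights and compute on the CPU
    )
    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
    return model, tokenizer
```

Calling `model, tokenizer = load_int4_on_cpu()` gives the same two objects that the script creates, so the rest of the generation code can be reused unchanged.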

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

quantized_model_dir = "Intel/DeepSeek-V3.1-int4-AutoRound"

model = AutoModelForCausalLM.from_pretrained(
        quantized_model_dir,
        torch_dtype=torch.bfloat16,
        device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
prompts = [
        "9.11和9.8哪个数字大",
        "strawberry中有几个r?",
        "There is a girl who likes adventure,",
        "Please give a brief introduction of DeepSeek company.",
        ]

texts=[]
for prompt in prompts:
    messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
            )
    texts.append(text)
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

outputs = model.generate(
        input_ids=inputs["input_ids"].to(model.device),
        attention_mask=inputs["attention_mask"].to(model.device),
        max_length=200,  # caps prompt + generated tokens; change this to align with the official usage
        num_return_sequences=1,
        do_sample=False,  # greedy decoding; change this to align with the official usage
        )
# Strip the prompt tokens so that only the newly generated text is decoded.
generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
        ]
decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

for i, prompt in enumerate(prompts):
    print(f"Prompt: {prompt}")
    print(f"Generated: {decoded_outputs[i]}")
    print("-" * 50)
"""
GPU result:
Prompt: 9.11和9.8哪个数字大
Generated: 9.11 和 9.8 相比，**9.11 更大**。
 
- 9.11 可以理解为 9.11
# 1. 概述
 
## 1.1 什么是Spring
 
Spring是一个开源框架，它由Rod Johnson创建。它是为了解决企业应用开发的复杂性而创建的。Spring使用基本的JavaBean来完成以前只可能由EJB完成的事情。然而，Spring的用途不仅限于服务器端的开发。从简单性、可测试性和松耦合的角度而言，任何Java应用都可以从Spring中受益。
 
**目的：**解决企业应用开发的复杂性
 
**功能：**使用基本的JavaBean代替EJB，并提供了更多的企业应用功能
 
**范围：**任何Java应用
 
Spring是一个轻量级控制反转(IoC)和面向切面(AOP)的容器框架。
 
## 1.
--------------------------------------------------
CPU result:
Prompt: 9.11和9.8哪个数字大
Generated: 9.11 和 9.8 相比，**9.11 更大**。
- 9.11 可以理解为 9.11
- 9.8 可以理解为 9.80
比较小数点后第二位：1（来自9.11）大于 0（来自9.80），因此 9.11 > 9.8。
--------------------------------------------------
Prompt: strawberry中有几个r?
Generated: 在英文单词 "strawberry" 中，字母 "r" 出现了 **3 次**。
- 位置：第 3 个字母（s**t r**awberry）、第 6 个字母（stra**w b**erry 中的 "r" 实际是第 6 个字符，但注意 "w" 后是 "b"，这里需要仔细数）
实际上：
- 分解：s-t-r-a-w-b-e-r-r-y
- 字母 "r" 出现在第 3、第 8 和第 9 位（索引从 1 开始）。

所以，**"strawberry" 包含 3 个 "r"**。
--------------------------------------------------
Prompt: There is a girl who likes adventure,
Generated: Of course! Here are a few ways to imagine what that could look like, from a simple story to a character profile.

### A Short Story Snippet

The map was old, the edges frayed and the ink faded in places. Ella traced the route with her finger for the hundredth time, her heart beating a rhythm of pure excitement. It wasn't just a path to a hidden waterfall; it was a path to *discovery*.

She packed her bag not with fancy clothes, but with a well-worn compass, a rope, a water bottle, and her trusted journal. The forest welcomed her with the smell of damp earth and pine. Every rustle in the undergrowth was a mystery, every unfamiliar bird call a secret she was determined to learn.

As she reached the cliff face she needed to climb, a thrill, not fear, shot through her. She
--------------------------------------------------
Prompt: Please give a brief introduction of DeepSeek company.
Generated: Of course. Here is a brief introduction to DeepSeek.

**DeepSeek** is a leading Chinese AI research company focused on developing powerful artificial intelligence models, with a primary emphasis on large language models (LLMs) and multimodal systems.

Here are the key points about the company:

*   **Core Focus:** They are best known for their **DeepSeek-V2** and the more recent **DeepSeek-V3** models, which are highly capable LLMs that compete with other top-tier models like GPT-4. They specialize in both closed and open-source AI.
*   **Open-Source Contribution:** DeepSeak has made significant contributions to the open-source community. They have released powerful models like **DeepSeek-Coder** (focused on code generation and programming tasks) and the weights for earlier versions of their LLMs, allowing developers and researchers worldwide
--------------------------------------------------
"""

Generate the model

The main branch of auto-round (https://github.com/intel/auto-round) is required if the model is fp8 and the device supports fp8.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers

model_name = "deepseek-ai/DeepSeek-V3.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=False, torch_dtype="auto")

block = model.model.layers
device_map = {}

# Spread the routed experts across cuda:1 to cuda:4 by expert index; every other
# linear layer (attention, shared experts, dense MLPs) stays on cuda:0.
for n, m in block.named_modules():
    if isinstance(m, (torch.nn.Linear, transformers.modeling_utils.Conv1D)):
        if "experts" in n and "shared_experts" not in n:
            # The second-to-last name component is the expert index.
            expert_id = int(n.split('.')[-2])
            if expert_id < 63:
                device = "cuda:1"
            elif expert_id < 128:
                device = "cuda:2"
            elif expert_id < 192:
                device = "cuda:3"
            else:
                device = "cuda:4"
        else:
            device = "cuda:0"
        # Strip the first two characters (the "0." prefix of a single-digit layer index)
        # so that the key is the module name relative to its block.
        n = n[2:]

        device_map.update({n: device})


from auto_round import AutoRound

autoround = AutoRound(model=model, tokenizer=tokenizer, device_map=device_map, nsamples=512,
                      batch_size=4, low_gpu_mem_usage=True, seqlen=2048,
                      )
autoround.quantize_and_save(format="auto_round", output_dir="tmp_autoround")
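
The placement rule behind the device map above can be checked without loading the model. The sketch below isolates it as a pure function of the module name. The function name and the example module names are assumptions chosen for illustration; the thresholds are the ones used in the script.

```python
def expert_device(name: str) -> str:
    """Return the GPU assigned to a linear layer, given its name inside model.model.layers."""
    if "experts" in name and "shared_experts" not in name:
        # Routed expert: the second-to-last component is the expert index,
        # e.g. "3.mlp.experts.70.gate_proj" -> 70.
        expert_id = int(name.split(".")[-2])
        if expert_id < 63:
            return "cuda:1"
        if expert_id < 128:
            return "cuda:2"
        if expert_id < 192:
            return "cuda:3"
        return "cuda:4"
    # Attention, shared experts, and dense MLP layers share the first GPU.
    return "cuda:0"


examples = [
    "3.self_attn.q_proj",
    "3.mlp.shared_experts.gate_proj",
    "3.mlp.experts.0.gate_proj",
    "3.mlp.experts.63.up_proj",
    "3.mlp.experts.128.down_proj",
    "3.mlp.experts.255.down_proj",
]
for name in examples:
    print(f"{name} -> {expert_device(name)}")
```

Changing the thresholds or the number of `cuda:N` targets is the natural way to adapt the split to a machine with a different GPU count.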

Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Here is a useful link to learn more about Intel's AI software:

Intel Neural Compressor

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}

