wenhuach and AngshucodeLEETCODE committed
Commit f7ebeb5 · verified · 1 parent: 1285735

Create README.md (#1)


- Create README.md (532dc08dfc36ad85f5a5748283cda88402609004)


Co-authored-by: Angshu karmakar <[email protected]>

Files changed (1):
  1. README.md (added, +124 −0)
# Model Card: Intel/gpt-oss-20b-int4-g64-rtn-AutoRound

## Model Details

- **Model Name**: Intel/gpt-oss-20b-int4-g64-rtn-AutoRound
- **Developer**: Intel, based on OpenAI's gpt-oss-20b
- **Release Date**: Not explicitly stated in available information
- **Model Type**: Mixed INT4 language model with symmetric quantization
- **Base Model**: openai/gpt-oss-20b
- **Quantization**: 4-bit integer (INT4) weights with group size 64, produced with Intel's AutoRound in Round-To-Nearest (RTN) mode, i.e. without algorithm tuning
- **License**: Apache 2.0
- **Model Size**: Approximately 20 billion total parameters; the much smaller figure (~1.8B) reported for the quantized checkpoint reflects INT4 weights packed into I32 tensors rather than the logical parameter count
- **Tensor Types**: I32, BF16, F16
- **Non-Expert Layers**: Fall back to 16-bit precision (BF16/F16)

This model is a quantized version of OpenAI's gpt-oss-20b, optimized for efficient inference on a range of hardware, including CPUs, Intel GPUs, and CUDA-enabled GPUs. It targets lower-latency and resource-constrained deployments while keeping the base model's Mixture-of-Experts (MoE) architecture: approximately 20 billion total parameters, of which about 3.6 billion are active per token ([model card](https://huggingface.co/Intel/gpt-oss-20b-int4-g64-rtn-AutoRound), [base README](https://huggingface.co/Intel/gpt-oss-20b-int4-rtn-AutoRound/blob/main/README.md)).

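As a quick sanity check, the quantization settings travel with the checkpoint in its `config.json`. A minimal sketch, assuming a recent `transformers` release with native gpt-oss support (the exact keys printed depend on what the repo ships):

```python
from transformers import AutoConfig

# Download only the config, not the weights, and inspect the quantization
# settings recorded by AutoRound (bits, group_size, sym, ...).
config = AutoConfig.from_pretrained("Intel/gpt-oss-20b-int4-g64-rtn-AutoRound")
print(config.quantization_config)
```
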
## Intended Use

- **Primary Use Cases**:
  - Local inference on consumer-grade hardware (e.g., desktops, laptops)
  - Specialized tasks requiring low-latency text generation
  - Research and experimentation in natural language processing
  - Agentic workflows with strong instruction following, tool use (e.g., web search, Python code execution), and reasoning capabilities
- **Supported Tasks**:
  - Text generation
  - Instruction following
  - Chain-of-thought reasoning
  - Structured outputs
- **Intended Users**:
  - Developers and researchers
  - Enterprises building AI applications
  - Hardware enthusiasts testing local inference performance

The model suits scenarios that require efficient deployment on resource-constrained devices, such as systems with 16GB of memory. It supports a context window of up to 131,072 tokens, with a recommended minimum of 16,384 tokens for reasoning tasks ([model card](https://huggingface.co/Intel/gpt-oss-20b-int4-g64-rtn-AutoRound), [benchmarks](https://www.hardware-corner.net/guides/gpt-oss-20b-gpu-benchamrks/)).

## How to Use

### Inference with Transformers

```python
from transformers import pipeline

model_id = "Intel/gpt-oss-20b-int4-g64-rtn-AutoRound"

# device_map="auto" spreads the weights across available accelerators;
# torch_dtype="auto" keeps the checkpoint's mixed INT4/BF16 precision.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain quantum mechanics clearly and concisely."}]
outputs = pipe(messages, max_new_tokens=512)
# With chat-style input, generated_text is the full message list;
# the last entry is the assistant's reply.
print(outputs[0]["generated_text"][-1])
```

### Inference with vLLM

```bash
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128
vllm serve Intel/gpt-oss-20b-int4-g64-rtn-AutoRound
```

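Once `vllm serve` is running, it exposes an OpenAI-compatible endpoint (by default on port 8000). A minimal client sketch, assuming the `openai` Python package and the default server address:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the api_key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Intel/gpt-oss-20b-int4-g64-rtn-AutoRound",
    messages=[{"role": "user", "content": "Summarize mixture-of-experts in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
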
### Inference with Ollama

```bash
ollama pull Intel/gpt-oss-20b-int4-g64-rtn-AutoRound
ollama run Intel/gpt-oss-20b-int4-g64-rtn-AutoRound
```

The model supports the [harmony response format](https://github.com/openai/gpt-oss) for consistent interaction; make sure it is applied when calling the model's generation API directly ([model card](https://huggingface.co/Intel/gpt-oss-20b-int4-AutoRound)).

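When calling `generate` directly instead of going through `pipeline`, the tokenizer's chat template applies the expected (harmony) formatting for you. A minimal sketch along those lines; prompt and generation parameters are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Intel/gpt-oss-20b-int4-g64-rtn-AutoRound"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Give one use case for MoE models."}]
# apply_chat_template renders the messages in the model's chat format
# and appends the assistant turn marker.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:]))
```
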
## Hardware Requirements

- **Minimum**: 16GB of VRAM for local inference (cards like the NVIDIA RTX 3090, with 24GB, comfortably exceed this)
- **Recommended**: A single 80GB GPU (e.g., NVIDIA H100, AMD MI300X) for optimal performance
- **Tested Platforms** (RTX 3090, 24GB VRAM):
  - Windows 11: up to a 36,000-token context
  - Linux: up to a 52,000-token context
- **Performance** (RTX 3090, MXFP4 build of gpt-oss-20b):
  - Windows: ~24–36 tokens/second (t/s) generation at 2,000–36,000-token contexts
  - Linux: ~55–114 t/s generation at 2,000–50,000-token contexts

Linux setups typically perform better thanks to lower VRAM overhead ([benchmarks](https://www.hardware-corner.net/guides/gpt-oss-20b-gpu-benchamrks/)). Note that the throughput figures above were measured on the original MXFP4 checkpoint and are indicative rather than specific to this INT4 build.

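The 16GB figure is easy to sanity-check with back-of-envelope arithmetic. A rough sketch; the expert/non-expert parameter split below is an assumption for illustration, not taken from the repo:

```python
# Rough weight-memory estimate for the INT4-quantized checkpoint.
total_params = 20.9e9       # gpt-oss-20b total parameters (approx.)
expert_params = 19.0e9      # assumed: the bulk of parameters live in the MoE experts
other_params = total_params - expert_params

int4_bytes = expert_params * 0.5        # 4 bits per expert weight
scale_bytes = expert_params / 64 * 2    # one 16-bit scale per group of 64 weights
bf16_bytes = other_params * 2           # non-expert layers kept at 16-bit

total_gb = (int4_bytes + scale_bytes + bf16_bytes) / 1e9
print(f"~{total_gb:.1f} GB of weights, before KV cache and activations")
```

This lands at roughly 14 GB of weights, which is consistent with a 16GB minimum once the KV cache and runtime overhead are added.
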
## Ethical Considerations and Limitations

- **Limitations**:
  - The model can produce factually incorrect output and should not be relied on for accuracy without verification.
  - It may generate biased, lewd, or offensive content owing to limitations in the pretrained model and fine-tuning datasets.
  - Quantization may slightly degrade quality compared with the full-precision model.
- **Ethical Considerations**:
  - Developers should perform safety testing before deployment to mitigate the risk of harmful outputs.
  - Users should be informed of the model's limitations and potential biases.
  - The model's open-weight nature allows fine-tuning, which could be misused to bypass safety mechanisms.

Seek legal advice before using the model for commercial purposes ([model card](https://huggingface.co/Intel/gpt-oss-20b-int4-AutoRound)).

## Training and Quantization Details

- **Base Model**: openai/gpt-oss-20b, a Mixture-of-Experts model with roughly 20 billion total parameters (~3.6 billion active per token).
- **Quantization Method**: Intel's AutoRound in RTN mode (no algorithm tuning), with group size 64 and symmetric INT4 quantization.
- **Weight Precision**:
  - MoE projection weights: INT4, group size 64, symmetric (the original OpenAI release ships these layers in MXFP4, about 4.25 bits per parameter)
  - Non-expert layers: BF16/F16 (16-bit fallback)
- **Training Data**: Not disclosed in available information.
- **Quantization Benefits**: A reduced memory footprint enables deployment on systems with as little as 16GB of memory.

AutoRound is developed by Intel and is closely related to the Intel Neural Compressor project; see Intel's documentation for details ([model card](https://huggingface.co/Intel/gpt-oss-20b-int4-AutoRound), [base README](https://huggingface.co/Intel/gpt-oss-20b-int4-rtn-AutoRound/blob/main/README.md)).

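For reference, a checkpoint with these settings could plausibly be produced with the `auto-round` library's RTN mode, where `iters=0` skips the signed-gradient tuning entirely. A sketch under those assumptions, not the exact recipe Intel used:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "openai/gpt-oss-20b"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# bits=4, group_size=64, sym=True match this card's settings;
# iters=0 selects plain RTN rounding with no algorithm tuning.
autoround = AutoRound(model, tokenizer, bits=4, group_size=64, sym=True, iters=0)
autoround.quantize_and_save("./gpt-oss-20b-int4-g64-rtn")
```
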
## Evaluation

- **Performance Metrics**: The model has been tested for inference speed on consumer hardware (e.g., RTX 3090) and shows competitive token-generation rates (see Hardware Requirements).
- **Safety Evaluations**: Per OpenAI's evaluations of gpt-oss-20b, the model does not reach high-risk capability thresholds in the Biological, Chemical, Cyber, or AI Self-Improvement categories, even under adversarial fine-tuning ([OpenAI model card](https://openai.com/index/gpt-oss-model-card/), [benchmarks](https://www.hardware-corner.net/guides/gpt-oss-20b-gpu-benchamrks/)).

## Citation

```bibtex
@article{cheng2023optimize,
  title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}
```