Create README.md (#1)
Create README.md (532dc08dfc36ad85f5a5748283cda88402609004)
Co-authored-by: Angshu karmakar <[email protected]>
README.md
ADDED

# Model Card: Intel/gpt-oss-20b-int4-g64-rtn-AutoRound

## Model Details

- **Model Name**: Intel/gpt-oss-20b-int4-g64-rtn-AutoRound
- **Developer**: Intel, based on OpenAI's gpt-oss-20b
- **Release Date**: Not explicitly stated in available information
- **Model Type**: Mixed INT4 language model with symmetric quantization
- **Base Model**: OpenAI/gpt-oss-20b
- **Quantization**: 4-bit integer (INT4) with group size 64, using Intel's AutoRound via Round-To-Nearest (RTN) without algorithm tuning
- **License**: Apache 2.0
- **Model Size**: Approximately 20 billion total parameters (~3.6 billion active per inference pass); quantization reduces the memory footprint rather than the parameter count
- **Tensor Types**: I32, BF16, F16
- **Non-Expert Layers**: Fallback to 16-bit precision (BF16/F16)

This model is a quantized version of OpenAI's gpt-oss-20b, optimized for efficient inference on various hardware, including CPUs, Intel GPUs, and CUDA-enabled GPUs. It is designed for lower latency and specialized use cases, leveraging a Mixture-of-Experts (MoE) architecture with approximately 20 billion total parameters, of which about 3.6 billion are active per inference pass ([model page](https://huggingface.co/Intel/gpt-oss-20b-int4-g64-rtn-AutoRound), [related model card](https://huggingface.co/Intel/gpt-oss-20b-int4-rtn-AutoRound/blob/main/README.md)).
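
As a quick orientation (a sketch added here, not from the original card), the quantization settings stored with the checkpoint can be inspected from its configuration; the exact field names depend on how the repository records its `quantization_config`:

```python
from transformers import AutoConfig

model_id = "Intel/gpt-oss-20b-int4-g64-rtn-AutoRound"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)

# Quantized checkpoints usually record bits, group size, and symmetry in config.json.
print(getattr(config, "quantization_config", "no quantization_config found"))
```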

## Intended Use

- **Primary Use Cases**:
  - Local inference on consumer-grade hardware (e.g., desktops, laptops)
  - Specialized tasks requiring low-latency text generation
  - Research and experimentation in natural language processing
  - Agentic workflows with strong instruction following, tool use (e.g., web search, Python code execution), and reasoning capabilities
- **Supported Tasks**:
  - Text generation
  - Instruction following
  - Chain-of-thought reasoning
  - Structured outputs
- **Intended Users**:
  - Developers and researchers
  - Enterprises building AI applications
  - Hardware enthusiasts testing local inference performance

The model is suitable for scenarios requiring efficient deployment on resource-constrained devices, such as those with 16GB of memory. It supports a context window of up to 131,072 tokens, with a recommended minimum of 16,384 tokens for reasoning tasks ([model page](https://huggingface.co/Intel/gpt-oss-20b-int4-g64-rtn-AutoRound), [benchmark guide](https://www.hardware-corner.net/guides/gpt-oss-20b-gpu-benchamrks/)).
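
As a rough illustration (not part of the original card), prompt length can be budgeted against the advertised window before generation; the prompt string below is a placeholder:

```python
from transformers import AutoTokenizer

model_id = "Intel/gpt-oss-20b-int4-g64-rtn-AutoRound"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

MAX_CONTEXT = 131_072  # advertised context window, in tokens

prompt = "Summarize the following report ..."  # placeholder prompt
n_tokens = len(tokenizer(prompt)["input_ids"])

# Leave headroom for the tokens the model will generate.
print(f"{n_tokens} prompt tokens; {MAX_CONTEXT - n_tokens} tokens left in the window")
```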

## How to Use

### Inference with Transformers
```python
from transformers import pipeline

model_id = "Intel/gpt-oss-20b-int4-g64-rtn-AutoRound"

# device_map="auto" places layers on the available GPU(s)/CPU automatically;
# torch_dtype="auto" keeps the precisions stored in the checkpoint.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
messages = [{"role": "user", "content": "Explain quantum mechanics clearly and concisely."}]
outputs = pipe(messages, max_new_tokens=512)
print(outputs[0]["generated_text"][-1])  # last message holds the assistant's reply
```

### Inference with vLLM
```bash
# Install the gpt-oss-enabled vLLM build, then serve the model.
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128
vllm serve Intel/gpt-oss-20b-int4-g64-rtn-AutoRound
```
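
Once the server is running, vLLM exposes an OpenAI-compatible endpoint (port 8000 by default). A minimal client call might look like the sketch below; it assumes the `openai` Python package is installed and is not part of the original card:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the api_key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Intel/gpt-oss-20b-int4-g64-rtn-AutoRound",
    messages=[{"role": "user", "content": "Give one sentence on INT4 quantization."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```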

### Inference with Ollama
```bash
ollama pull Intel/gpt-oss-20b-int4-g64-rtn-AutoRound
ollama run Intel/gpt-oss-20b-int4-g64-rtn-AutoRound
```

The model supports the harmony response format for consistent interaction. Ensure the appropriate format is applied when using direct model generation ([related model card](https://huggingface.co/Intel/gpt-oss-20b-int4-AutoRound), [gpt-oss repository](https://github.com/openai/gpt-oss)).
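
When calling the model directly instead of through the pipeline, the tokenizer's bundled chat template can render the messages in the model's expected format; the following is a minimal sketch (prompt and generation settings are illustrative, not from the original card):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Intel/gpt-oss-20b-int4-g64-rtn-AutoRound"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "List three uses of INT4 quantization."}]
# apply_chat_template renders the messages with the model's chat/response format.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256)

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```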

## Hardware Requirements

- **Minimum**: 16GB VRAM for local inference (e.g., NVIDIA RTX 3090)
- **Recommended**: Single 80GB GPU (e.g., NVIDIA H100, AMD MI300X) for optimal performance
- **Tested Platforms**:
  - Windows 11: Up to a 36,000-token context with 24GB VRAM (RTX 3090)
  - Linux: Up to a 52,000-token context with 24GB VRAM (RTX 3090)
- **Performance** (on RTX 3090, MXFP4 format):
  - Windows: ~24–36 tokens/second (t/s) generation at 2,000–36,000-token contexts
  - Linux: ~55–114 t/s generation at 2,000–50,000-token contexts

Linux setups typically offer better performance due to lower VRAM overhead ([benchmark guide](https://www.hardware-corner.net/guides/gpt-oss-20b-gpu-benchamrks/)).
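
As a back-of-the-envelope check (every number below is an illustrative assumption, not a figure from this card), a ~20B-parameter model with mostly ~4-bit weights has a weight footprint that sits below the VRAM figures quoted above, leaving room for KV cache and activations:

```python
# Rough weight-memory estimate; the expert/non-expert split is assumed.
total_params = 20e9
expert_fraction = 0.9   # assumed share of parameters in quantized MoE expert weights
bits_expert = 4.25      # ~4 bits per expert weight (plus scale overhead)
bits_other = 16         # BF16/F16 non-expert layers

weight_bytes = (total_params * expert_fraction * bits_expert
                + total_params * (1 - expert_fraction) * bits_other) / 8
print(f"~{weight_bytes / 1e9:.1f} GB of weights, before KV cache and activations")
```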

## Ethical Considerations and Limitations

- **Limitations**:
  - The model may produce factually incorrect outputs and should not be relied upon for factual accuracy without verification.
  - Potential for generating biased, lewd, or offensive content due to limitations in the pretrained model and fine-tuning datasets.
  - Quantization may slightly degrade performance compared to the full-precision model.
- **Ethical Considerations**:
  - Developers should perform safety testing before deployment to mitigate risks of harmful outputs.
  - Users should be informed of the model’s limitations and potential biases.
  - The model’s open-weight nature allows fine-tuning, which could be misused to bypass safety mechanisms.

Consult legal advice before using the model for commercial purposes ([related model card](https://huggingface.co/Intel/gpt-oss-20b-int4-AutoRound)).

## Training and Quantization Details

- **Base Model**: OpenAI/gpt-oss-20b, a Mixture-of-Experts model with 20 billion total parameters (~3.6 billion active per inference).
- **Quantization Method**: Intel’s AutoRound with RTN (no algorithm tuning), using group size 64 and symmetric quantization for INT4 precision.
- **Weight Precision**:
  - MoE projection weights: MXFP4 (4.25 bits per parameter)
  - Non-expert layers: BF16/F16 (16-bit)
- **Training Data**: Not disclosed in available information.
- **Quantization Benefits**: Reduces memory footprint, enabling deployment on systems with as little as 16GB of memory.

The model leverages Intel’s Neural Compressor for optimization. For more details, see Intel’s documentation ([related model card](https://huggingface.co/Intel/gpt-oss-20b-int4-AutoRound), [related README](https://huggingface.co/Intel/gpt-oss-20b-int4-rtn-AutoRound/blob/main/README.md)).
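
The exact quantization script is not reproduced in this card. The following is a minimal sketch of how an INT4, group-size-64, symmetric RTN export could be produced with the open-source auto-round package (argument names follow its public API; `iters=0` is assumed here to select plain round-to-nearest with no tuning):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound  # pip install auto-round

base_model = "openai/gpt-oss-20b"
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# bits=4, group_size=64, sym=True mirror the settings named in this card;
# iters=0 skips AutoRound's signed-gradient tuning, i.e. plain RTN.
autoround = AutoRound(model, tokenizer, bits=4, group_size=64, sym=True, iters=0)
autoround.quantize()
autoround.save_quantized("./gpt-oss-20b-int4-g64-rtn-AutoRound", format="auto_round")
```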

## Evaluation

- **Performance Metrics**: The model has been tested for inference speed on consumer hardware (e.g., RTX 3090), showing competitive token-generation rates (see Hardware Requirements).
- **Safety Evaluations**: Based on OpenAI’s evaluations of gpt-oss-20b, the model does not reach high-risk capability thresholds in the Biological, Chemical, Cyber, or AI Self-Improvement categories, even with adversarial fine-tuning ([OpenAI model card](https://openai.com/index/gpt-oss-model-card/), [benchmark guide](https://www.hardware-corner.net/guides/gpt-oss-20b-gpu-benchamrks/)).

## Citation

```bibtex
@article{cheng2023optimize,
  title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}
```