# Model Card: Intel/gpt-oss-20b-int4-g64-rtn-AutoRound

## Model Details

- **Model Name**: Intel/gpt-oss-20b-int4-g64-rtn-AutoRound
- **Developer**: Intel, based on OpenAI's gpt-oss-20b
- **Release Date**: Not explicitly stated in available information
- **Model Type**: Mixed-precision INT4 language model with symmetric quantization
- **Base Model**: OpenAI/gpt-oss-20b
- **Quantization**: 4-bit integer (INT4) with group size 64, using Intel's AutoRound via Round-To-Nearest (RTN) without algorithm tuning
- **License**: Apache 2.0
- **Model Size**: ~20 billion total parameters (~3.6 billion active per forward pass); quantization reduces the memory footprint, not the parameter count
- **Tensor Types**: I32, BF16, F16
- **Non-Expert Layers**: Fall back to 16-bit precision (BF16/F16)

This model is a quantized version of OpenAI's gpt-oss-20b, optimized for efficient inference on a range of hardware, including CPUs, Intel GPUs, and CUDA-enabled GPUs. It targets lower-latency and specialized use cases, leveraging a Mixture-of-Experts (MoE) architecture with approximately 20 billion total parameters, of which about 3.6 billion are active per inference pass. (Sources: [Intel/gpt-oss-20b-int4-g64-rtn-AutoRound](https://huggingface.co/Intel/gpt-oss-20b-int4-g64-rtn-AutoRound), [Intel/gpt-oss-20b-int4-rtn-AutoRound README](https://huggingface.co/Intel/gpt-oss-20b-int4-rtn-AutoRound/blob/main/README.md))

## Intended Use

- **Primary Use Cases**:
  - Local inference on consumer-grade hardware (e.g., desktops, laptops)
  - Specialized tasks requiring low-latency text generation
  - Research and experimentation in natural language processing
  - Agentic workflows with strong instruction following, tool use (e.g., web search, Python code execution), and reasoning capabilities
- **Supported Tasks**:
  - Text generation
  - Instruction following
  - Chain-of-thought reasoning
  - Structured outputs
- **Intended Users**:
  - Developers and researchers
  - Enterprises building AI applications
  - Hardware enthusiasts testing local inference performance

The model is suitable for deployment on resource-constrained devices, such as systems with 16GB of memory. It supports a context window of up to 131,072 tokens, with a recommended minimum of 16,384 tokens for reasoning tasks. (Sources: [Intel/gpt-oss-20b-int4-g64-rtn-AutoRound](https://huggingface.co/Intel/gpt-oss-20b-int4-g64-rtn-AutoRound), [hardware-corner.net benchmark guide](https://www.hardware-corner.net/guides/gpt-oss-20b-gpu-benchamrks/))

## How to Use

### Inference with Transformers

```python
from transformers import pipeline

model_id = "Intel/gpt-oss-20b-int4-g64-rtn-AutoRound"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain quantum mechanics clearly and concisely."}]

outputs = pipe(messages, max_new_tokens=512)
print(outputs[0]["generated_text"][-1])
```
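### Explicit Model Loading (Sketch)

For finer control over tokenization, device placement, and decoding, the checkpoint can also be loaded without the `pipeline` helper. The sketch below uses only standard `transformers` APIs (`AutoModelForCausalLM`, `AutoTokenizer`, `apply_chat_template`); the prompt and generation parameters are illustrative, not values recommended by Intel for this checkpoint.

```python
# Minimal sketch: explicit loading instead of pipeline().
# Generation settings below are illustrative, not tuned recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Intel/gpt-oss-20b-int4-g64-rtn-AutoRound"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep non-expert layers in BF16/F16 as packaged
    device_map="auto",    # place weights on available GPU(s), spilling to CPU if needed
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Summarize INT4 weight quantization in two sentences."}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```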
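### Reproducing the Quantization (Sketch)

The exact export recipe is not published in this card. The snippet below is a rough sketch of how a comparable INT4, group-size-64, symmetric RTN quantization could be produced with Intel's `auto-round` library; class and argument names follow the public `auto-round` API, and the assumption that `iters=0` selects plain round-to-nearest without tuning may differ across versions.

```python
# Rough sketch, not Intel's exact recipe. Assumes `pip install auto-round`;
# argument names and defaults may differ between auto-round versions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

base_model_id = "openai/gpt-oss-20b"  # assumed Hugging Face id of the base model
model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# bits=4, group_size=64, sym=True match this card; iters=0 is assumed to
# select plain round-to-nearest (RTN) with no algorithmic tuning.
autoround = AutoRound(model, tokenizer, bits=4, group_size=64, sym=True, iters=0)
autoround.quantize()
autoround.save_quantized("./gpt-oss-20b-int4-g64-rtn", format="auto_round")
```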
## Hardware Requirements

- **Minimum**: 16GB VRAM for local inference (e.g., NVIDIA RTX 3090)
- **Recommended**: Single 80GB GPU (e.g., NVIDIA H100, AMD MI300X) for optimal performance
- **Tested Platforms**:
  - Windows 11: up to a 36,000-token context with 24GB VRAM (RTX 3090)
  - Linux: up to a 52,000-token context with 24GB VRAM (RTX 3090)
- **Performance** (on RTX 3090, MXFP4 format):
  - Windows: ~24–36 tokens/second (t/s) generation at 2,000–36,000-token context
  - Linux: ~55–114 t/s generation at 2,000–50,000-token context

Linux setups typically offer better performance due to lower VRAM overhead. (Source: [hardware-corner.net benchmark guide](https://www.hardware-corner.net/guides/gpt-oss-20b-gpu-benchamrks/))

## Ethical Considerations and Limitations

- **Limitations**:
  - The model may produce factually incorrect outputs and should not be relied upon for factual accuracy without verification.
  - Potential for generating biased, lewd, or offensive content due to limitations in the pretrained model and fine-tuning datasets.
  - Quantization may slightly degrade performance compared to the full-precision model.
- **Ethical Considerations**:
  - Developers should perform safety testing before deployment to mitigate risks of harmful outputs.
  - Users should be informed of the model's limitations and potential biases.
  - The model's open-weight nature allows fine-tuning, which could be misused to bypass safety mechanisms.

Consult legal advice before using the model for commercial purposes. (Source: [Intel/gpt-oss-20b-int4-AutoRound](https://huggingface.co/Intel/gpt-oss-20b-int4-AutoRound))

## Training and Quantization Details

- **Base Model**: OpenAI/gpt-oss-20b, a Mixture-of-Experts model with 20 billion total parameters (~3.6 billion active per inference pass).
- **Quantization Method**: Intel's AutoRound with RTN (no algorithm tuning), using group size 64 and symmetric quantization for INT4 precision.
- **Weight Precision**:
  - MoE projection weights: MXFP4 (4.25 bits per parameter)
  - Non-expert layers: BF16/F16 (16-bit)
- **Training Data**: Not disclosed in available information.
- **Quantization Benefits**: Reduces the memory footprint, enabling deployment on systems with as little as 16GB of memory.

The model leverages Intel's Neural Compressor for optimization. For more details, see Intel's documentation. (Sources: [Intel/gpt-oss-20b-int4-AutoRound](https://huggingface.co/Intel/gpt-oss-20b-int4-AutoRound), [Intel/gpt-oss-20b-int4-rtn-AutoRound README](https://huggingface.co/Intel/gpt-oss-20b-int4-rtn-AutoRound/blob/main/README.md))

## Evaluation

- **Performance Metrics**: The model has been tested for inference speed on consumer hardware (e.g., RTX 3090), showing competitive token generation rates (see Hardware Requirements).
- **Safety Evaluations**: Based on OpenAI's evaluations of gpt-oss-20b, the model does not reach high-risk capability thresholds in the Biological, Chemical, Cyber, or AI Self-Improvement categories, even with adversarial fine-tuning. (Sources: [OpenAI gpt-oss model card](https://openai.com/index/gpt-oss-model-card/), [hardware-corner.net benchmark guide](https://www.hardware-corner.net/guides/gpt-oss-20b-gpu-benchamrks/))

## Citation

```bibtex
@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of LLMs},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}
```