# Llama-3.1-Nemotron-Ultra-253B-v1-MLX-Q5

This is a Q5 (5-bit) quantized version of NVIDIA's Llama-3.1-Nemotron-Ultra-253B-v1, converted for use with MLX on Apple Silicon.
## Model Details

- **Original Model**: nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
- **Quantization**: Q5 (5-bit), group size 64
- **Size**: 163.2 GB (Q5 quantized weights)
- **Peak Memory Usage**: ~175 GB when loaded (see the pre-flight check after this list)
- **Architecture**: DeciLM (NAS-optimized Llama variant)
- **Framework**: MLX 0.26.2+
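Before loading, it is worth confirming that the machine actually has that much unified memory. A minimal pre-flight sketch, assuming the third-party `psutil` package is installed (the 175 GB figure comes from the measurements in this card):

```python
import psutil  # third-party: pip install psutil

# The Q5 weights alone are ~163 GB and peak usage during generation is
# ~175 GB, so check total unified memory before attempting to load.
REQUIRED_GB = 175
total_gb = psutil.virtual_memory().total / 1024**3
if total_gb < REQUIRED_GB:
    raise RuntimeError(
        f"Need ~{REQUIRED_GB} GB of unified memory, found {total_gb:.0f} GB"
    )
```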
## Key Features

- **Neural Architecture Search (NAS)**: the block structure was derived via NAS rather than a uniform transformer stack
- **Variable Grouped Query Attention (VGQA)**: the number of KV heads varies from layer to layer
- **FFN Fusion** for improved efficiency
- **Dummy layers**: some attention/FFN blocks are replaced with no-op layers, reducing the memory footprint
- Optimized for Apple Silicon M-series chips
## Performance

Tested on a Mac Studio M3 Ultra (512 GB unified memory):

- **Generation speed**: ~3.86 tokens/sec
- **Prompt processing**: ~14.3 tokens/sec
- **Memory**: ~175 GB peak usage
- Works with the `mlx_lm` CLI tools (not yet compatible with LM Studio)
## Usage

### With MLX-LM:

```python
from mlx_lm import load, generate

# Loading pulls in ~163 GB of weights; expect ~175 GB peak memory use.
model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")
response = generate(model, tokenizer, prompt="Your prompt here", verbose=True)
```
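For chat-style use, the prompt should go through the tokenizer's chat template. A minimal sketch; the `"detailed thinking on"` system prompt follows NVIDIA's upstream model card for toggling reasoning mode and should be verified against the original documentation:

```python
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")

# System prompt toggling reasoning mode per the upstream NVIDIA card
# ("detailed thinking on" / "detailed thinking off"); treat as an assumption.
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Explain grouped query attention in two sentences."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```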
### Command Line:

```bash
uv run mlx_lm.generate \
  --model LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  --prompt "Your prompt here" \
  --max-tokens 1000
```
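`mlx_lm` also ships an OpenAI-compatible HTTP server, which can be a convenient way to run the model behind a local API. A sketch, assuming the default host and an arbitrary port:

```bash
# Serve the model over an OpenAI-compatible API on localhost:8080
uv run mlx_lm.server \
  --model LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  --port 8080
```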
## Conversion Details

- Converted with the MLX-LM quantization tools (see the example command below)
- Q5 quantization with group size 64
- DeciLM architecture specifics (variable GQA, dummy layers) preserved
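For reference, a command along these lines reproduces the conversion (flag names follow current `mlx_lm.convert` options and may vary between mlx-lm versions; the output path is illustrative):

```bash
# Quantize the original weights to 5-bit with group size 64
uv run mlx_lm.convert \
  --hf-path nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 \
  --mlx-path Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  -q --q-bits 5 --q-group-size 64
```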
## License

Same as the original model; see NVIDIA's license terms for Llama-3.1-Nemotron-Ultra-253B-v1.
## Acknowledgments

Thanks to NVIDIA for the original Nemotron model and to the MLX team for the framework.