---
license: other
base_model:
- nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
tags:
- mlx
- mlx-community
- DeciLMForCausalLM
- NAS
- reasoning
---
# Llama-3.1-Nemotron-Ultra-253B-v1-MLX-Q5
This is a Q5 quantized version of NVIDIA's Llama-3.1-Nemotron-Ultra-253B-v1 model, converted for use with MLX on Apple Silicon.
## Model Details
- **Original Model**: nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
- **Quantization**: Q5 (5-bit)
- **Size**: 163.2 GB (Q5 quantized weights)
- **Peak Memory Usage**: ~175 GB when loaded
- **Architecture**: DeciLM (NAS-optimized Llama variant)
- **Framework**: MLX 0.26.2+
## Key Features
- **Neural Architecture Search (NAS)** optimized model
- **Variable Grouped Query Attention (VGQA)**
- **FFN Fusion** for improved efficiency
- **Dummy layers** for reduced memory footprint
- Optimized for Apple Silicon M-series chips
## Performance
Tested on a Mac Studio M3 Ultra (512 GB RAM); a snippet for reproducing these measurements follows the list:
- **Generation speed**: ~3.86 tokens/sec
- **Prompt processing**: ~14.3 tokens/sec
- **Memory**: ~175 GB peak usage
- Works with the `mlx_lm` CLI tools (not yet compatible with LM Studio)
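The throughput figures above come from the verbose output of `mlx_lm`. A minimal sketch for reproducing them on your own machine is below; the prompt and token budget are placeholders, and `mx.get_peak_memory()` assumes a recent MLX release (older versions expose it as `mx.metal.get_peak_memory()`):

```python
import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")

# verbose=True prints prompt and generation tokens/sec after the run completes.
generate(
    model,
    tokenizer,
    prompt="Summarize FFN fusion in two sentences.",
    max_tokens=256,
    verbose=True,
)

# Peak memory in GB (on older MLX versions use mx.metal.get_peak_memory()).
print(f"Peak memory: {mx.get_peak_memory() / 1e9:.1f} GB")
```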
## Usage
### With MLX-LM:
```python
from mlx_lm import load, generate
# Load the Q5 weights and tokenizer from the Hub (about 163 GB on disk)
model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")
response = generate(model, tokenizer, prompt="Your prompt here", verbose=True)
```
### Command Line:
```bash
uv run mlx_lm.generate \
  --model LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  --prompt "Your prompt here" \
  --max-tokens 1000
```
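### Chat Template (Python):
For chat-style or reasoning prompts, you can apply the tokenizer's chat template before generating. This is a sketch; the `"detailed thinking on"` system prompt follows the reasoning-mode convention described on NVIDIA's model card and should be verified against the upstream documentation:

```python
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")

# "detailed thinking on" toggles reasoning mode per NVIDIA's model card
# (assumption - verify against the upstream documentation).
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Explain FFN fusion in one paragraph."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=1000, verbose=True)
```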
## Conversion Details
- Converted using MLX-LM quantization tools
- Q5 quantization with group size 64 (a reproduction sketch follows this list)
- Preserved DeciLM architecture specifics
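A conversion along these lines can be reproduced with the `convert` utility in `mlx_lm`. The call below is a sketch of the settings listed above, not the exact command that produced these weights; the paths are placeholders:

```python
from mlx_lm import convert

# Quantize the original weights to 5-bit with group size 64 (paths are placeholders).
convert(
    hf_path="nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
    mlx_path="Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5",
    quantize=True,
    q_bits=5,
    q_group_size=64,
)
```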
## License
Same as the original model; see NVIDIA's license terms for nvidia/Llama-3_1-Nemotron-Ultra-253B-v1.
## Acknowledgments
Thanks to NVIDIA for the original Nemotron model and the MLX team for the framework.