---
license: other
base_model:
- nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
tags:
- mlx
- mlx-community
- DeciLMForCausalLM
- NAS
- reasoning
---

# Llama-3.1-Nemotron-Ultra-253B-v1-MLX-Q5

This is a Q5 quantized version of NVIDIA's Llama-3.1-Nemotron-Ultra-253B-v1, converted for use with MLX on Apple Silicon.

## Model Details

- **Original Model**: nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
- **Quantization**: Q5 (5-bit, group size 64)
- **Size**: 163.2 GB (quantized weights)
- **Peak Memory Usage**: ~175 GB when loaded
- **Architecture**: DeciLM (NAS-optimized Llama variant)
- **Framework**: MLX 0.26.2+

## Key Features

- **Neural Architecture Search (NAS)**-optimized model
- **Variable Grouped Query Attention (VGQA)**
- **FFN Fusion** for improved efficiency
- **Dummy (skipped) layers** for a reduced memory footprint
- Optimized for Apple Silicon M-series chips

## Performance

Tested on a Mac Studio M3 Ultra (512 GB RAM); a minimal timing sketch for sanity-checking these numbers on other machines follows below.

- **Generation speed**: ~3.86 tokens/sec
- **Prompt processing**: ~14.3 tokens/sec
- **Memory**: ~175 GB peak usage

The model works with the `mlx_lm` CLI tools; it is not yet compatible with LM Studio.
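
A minimal sketch for checking throughput yourself (this is not the exact benchmark script used for the numbers above; the prompt and `max_tokens` budget are arbitrary, and `verbose=True` additionally makes `mlx_lm` print its own token-rate statistics):

```python
import time

from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")

prompt = "Explain FFN fusion in two sentences."
start = time.perf_counter()
# verbose=True prints mlx_lm's own prompt/generation tokens-per-sec figures
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
elapsed = time.perf_counter() - start

# Rough end-to-end rate (includes prompt processing time)
generated_tokens = len(tokenizer.encode(text))
print(f"{generated_tokens} tokens in {elapsed:.1f}s (~{generated_tokens / elapsed:.2f} tok/s end to end)")
```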

## Usage

### With MLX-LM:

```python
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")
response = generate(model, tokenizer, prompt="Your prompt here", verbose=True)
```
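
For chat-style prompting, it is usually better to run the messages through the tokenizer's chat template first. A minimal sketch, assuming the reasoning toggle works via the system prompt described in NVIDIA's original model card ("detailed thinking on" / "detailed thinking off"; verify against the upstream card):

```python
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")

messages = [
    # Reasoning-mode system prompt -- see NVIDIA's original model card
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Briefly explain grouped query attention."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```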

### Command Line:

```bash
uv run mlx_lm.generate \
  --model LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  --prompt "Your prompt here" \
  --max-tokens 1000
```

## Conversion Details

- Converted with the MLX-LM quantization tools (an example command is shown below)
- Q5 quantization with group size 64
- DeciLM architecture specifics preserved
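
For reference, a conversion along these lines can be reproduced with `mlx_lm.convert`. This is a sketch only: flag behavior may vary between mlx-lm releases, and the output path is a placeholder.

```bash
uv run mlx_lm.convert \
  --hf-path nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 \
  --mlx-path ./Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  -q --q-bits 5 --q-group-size 64
```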

## License

Released under the same terms as the original model; see NVIDIA's license for nvidia/Llama-3_1-Nemotron-Ultra-253B-v1.

## Acknowledgments

Thanks to NVIDIA for the original Nemotron model and to the MLX team for the framework.