# Llama-3.1-Nemotron-Ultra-253B-v1-MLX-Q5

This is a Q5 (5-bit) quantized version of NVIDIA's Llama-3.1-Nemotron-Ultra-253B-v1, converted for use with MLX on Apple Silicon.
## Model Details

- **Original Model**: nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
- **Quantization**: Q5 (5-bit), group size 64
- **Size**: 163.2 GB (Q5 quantized weights)
- **Peak Memory Usage**: ~175 GB when loaded (see the pre-flight check after this list)
- **Architecture**: DeciLM (NAS-optimized Llama variant)
- **Framework**: MLX 0.26.2+
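Before loading, it is worth confirming that the machine actually has that much unified memory. A minimal pre-flight sketch, assuming the third-party `psutil` package is installed (the 175 GB figure comes from the measurements in this card):

```python
import psutil  # third-party: pip install psutil

# The Q5 weights alone are ~163 GB and peak usage during generation is
# ~175 GB, so check total unified memory before attempting to load.
REQUIRED_GB = 175
total_gb = psutil.virtual_memory().total / 1024**3
if total_gb < REQUIRED_GB:
    raise RuntimeError(
        f"Need ~{REQUIRED_GB} GB of unified memory, found {total_gb:.0f} GB"
    )
```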
## Key Features

- **Neural Architecture Search (NAS)**: the block structure was derived via NAS rather than a uniform transformer stack
- **Variable Grouped Query Attention (VGQA)**: the number of KV heads varies from layer to layer
- **FFN Fusion** for improved efficiency
- **Dummy layers**: some attention/FFN blocks are replaced with no-op layers, reducing the memory footprint
- Optimized for Apple Silicon M-series chips
## Performance

Tested on a Mac Studio M3 Ultra (512 GB unified memory):

- **Generation speed**: ~3.86 tokens/sec
- **Prompt processing**: ~14.3 tokens/sec
- **Memory**: ~175 GB peak usage
- Works with the `mlx_lm` CLI tools (not yet compatible with LM Studio)
## Usage

### With MLX-LM:

```python
from mlx_lm import load, generate

# Loading pulls in ~163 GB of weights; expect ~175 GB peak memory use.
model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")
response = generate(model, tokenizer, prompt="Your prompt here", verbose=True)
```
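For chat-style use, the prompt should go through the tokenizer's chat template. A minimal sketch; the `"detailed thinking on"` system prompt follows NVIDIA's upstream model card for toggling reasoning mode and should be verified against the original documentation:

```python
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")

# System prompt toggling reasoning mode per the upstream NVIDIA card
# ("detailed thinking on" / "detailed thinking off"); treat as an assumption.
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Explain grouped query attention in two sentences."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```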
### Command Line:

```bash
uv run mlx_lm.generate \
  --model LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  --prompt "Your prompt here" \
  --max-tokens 1000
```
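`mlx_lm` also ships an OpenAI-compatible HTTP server, which can be a convenient way to run the model behind a local API. A sketch, assuming the default host and an arbitrary port:

```bash
# Serve the model over an OpenAI-compatible API on localhost:8080
uv run mlx_lm.server \
  --model LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  --port 8080
```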
## Conversion Details

- Converted with the MLX-LM quantization tools (see the example command below)
- Q5 quantization with group size 64
- DeciLM architecture specifics (variable GQA, dummy layers) preserved
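For reference, a command along these lines reproduces the conversion (flag names follow current `mlx_lm.convert` options and may vary between mlx-lm versions; the output path is illustrative):

```bash
# Quantize the original weights to 5-bit with group size 64
uv run mlx_lm.convert \
  --hf-path nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 \
  --mlx-path Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  -q --q-bits 5 --q-group-size 64
```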
## License

Same as the original model; see NVIDIA's license terms for Llama-3.1-Nemotron-Ultra-253B-v1.
## Acknowledgments

Thanks to NVIDIA for the original Nemotron model and to the MLX team for the framework.