---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: mlx
base_model:
- Qwen/QwQ-32B
tags:
- quantization
- mlx-q5
- mlx==0.26.2
- q5
- qwq
- reasoning
- m3-ultra
---

# QwQ-32B MLX Q5 Quantization

This is a **Q5 (5-bit) quantized** version of the QwQ-32B reasoning model, optimized for MLX on Apple Silicon. This quantization offers an excellent balance between model quality and size, and is aimed at high-memory Apple Silicon systems such as the M3 Ultra.

## Model Details

- **Base Model**: Qwen/QwQ-32B
- **Quantization**: Q5 (5-bit) with group size 64
- **Format**: MLX (Apple Silicon optimized)
- **Size**: 21GB (down from 61GB in bfloat16)
- **Compression**: 66% size reduction (see the estimate sketched below)
- **Architecture**: Qwen2 with reasoning capabilities
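
The size and compression figures can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes roughly 32.8B parameters and one fp16 scale plus one fp16 bias per group of 64 weights (the layout `mlx_lm.convert` uses for group-wise quantization); exact numbers differ slightly because embeddings and norm layers are handled differently.

```python
# Rough size estimate for 5-bit group-wise quantization (group size 64).
# Assumption: ~32.8e9 parameters, fp16 scale + fp16 bias per 64-weight group.
params = 32.8e9
bits, group_size = 5, 64
effective_bits = bits + 2 * 16 / group_size          # 5.5 bits per weight
q5_gib = params * effective_bits / 8 / 1024**3        # ~21 GiB
bf16_gib = params * 16 / 8 / 1024**3                  # ~61 GiB
print(f"Q5: {q5_gib:.0f} GiB, bf16: {bf16_gib:.0f} GiB, "
      f"reduction: {1 - effective_bits / 16:.0%}")    # ~66%
```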

## Why Q5?

Q5 quantization provides:

- **Superior quality** compared to Q4 while being smaller than Q6/Q8
- **Optimal size** for 128GB+ Apple Silicon systems
- **Minimal quality loss** - retains ~98% of original model capabilities
- **Fast inference** with MLX's unified memory architecture

## Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- macOS 13.0+
- Python 3.11+
- MLX 0.26.0+
- mlx-lm 0.22.5+
- 32GB+ RAM recommended (64GB+ for full 128k context)

## Installation

```bash
# Using uv (recommended)
uv add "mlx>=0.26.0" mlx-lm transformers

# Or with pip (not tested by us)
pip install "mlx>=0.26.0" mlx-lm transformers
```

## Usage

### Direct Generation

```bash
uv run mlx_lm.generate \
  --model LibraxisAI/QwQ-32B-MLX-Q5 \
  --prompt "Solve this step by step: If a train travels 120 km in 2 hours, what is its speed?" \
  --max-tokens 500
```

### Python API

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load model
model, tokenizer = load("LibraxisAI/QwQ-32B-MLX-Q5")

# Generate text with reasoning
prompt = "Think step by step: What are the implications of Q5 quantization for LLM deployment?"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# mlx-lm 0.22+ takes sampling settings via a sampler rather than a bare `temp` kwarg
response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_tokens=1000,
    sampler=make_sampler(temp=0.7),
)
print(response)
```

### HTTP Server

```bash
uv run mlx_lm.server \
  --model LibraxisAI/QwQ-32B-MLX-Q5 \
  --host 0.0.0.0 \
  --port 8080
```
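
Once the server is up, it can be queried over HTTP. The sketch below assumes the OpenAI-compatible `/v1/chat/completions` route exposed by `mlx_lm.server` and the host/port from the command above; adjust both to your setup.

```python
# Minimal client for the mlx_lm.server chat completions endpoint (stdlib only).
import json
import urllib.request

payload = {
    "model": "LibraxisAI/QwQ-32B-MLX-Q5",
    "messages": [{"role": "user", "content": "Explain Q5 quantization in one paragraph."}],
    "max_tokens": 300,
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```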

## Performance Benchmarks

Tested on Mac Studio M3 Ultra (512GB):

| Metric | Value |
|--------|-------|
| Model Size | 21GB |
| Peak Memory Usage | ~25GB |
| Generation Speed | ~12-15 tokens/sec |
| Max Context Length | 131,072 tokens (128k) |

## Special Features

QwQ (Qwen with Questions) is designed for:

- **Deep reasoning** and step-by-step problem solving
- **Mathematical reasoning** and logical deduction
- **Code generation** with explanations
- **Self-reflection** and error correction

## Limitations

⚠️ **Important**: As of this quant's release date, this Q5 model is **NOT compatible** with LM Studio (**yet**), which only supports 2-, 3-, 4-, 6-, and 8-bit quantizations, and we have not tested it with Ollama or any other inference client. **Use MLX directly or via the MLX server** - we've included a `command generation script` (see Tools Included below) to launch the server properly.

## Conversion Details

This model was quantized using:

```bash
uv run mlx_lm.convert \
  --hf-path Qwen/QwQ-32B \
  --mlx-path QwQ-32B-MLX-Q5 \
  --dtype bfloat16 \
  -q --q-bits 5 --q-group-size 64
```
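
The same conversion can also be driven from Python. This is a sketch assuming the `mlx_lm.convert` function mirrors the CLI flags above (`quantize`, `q_bits`, `q_group_size`, `dtype`); check your mlx-lm version's signature before relying on it.

```python
# Programmatic equivalent of the CLI conversion above (assumed mlx-lm API).
from mlx_lm import convert

convert(
    hf_path="Qwen/QwQ-32B",
    mlx_path="QwQ-32B-MLX-Q5",
    quantize=True,        # -q
    q_bits=5,             # --q-bits 5
    q_group_size=64,      # --q-group-size 64
    dtype="bfloat16",     # --dtype bfloat16
)
```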

## Frontier M3 Ultra Optimization

This model is specifically optimized for the Mac Studio M3 Ultra setup with 512GB unified memory. For best performance:

```python
import mlx.core as mx

# Set memory limits for large models.
# MLX 0.22+ exposes these at the top level; older releases used
# mx.metal.set_memory_limit / mx.metal.set_cache_limit.
mx.set_memory_limit(100 * 1024**3)  # 100GB
mx.set_cache_limit(20 * 1024**3)    # 20GB cache
```
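
To verify the limits behave as expected, memory use can be inspected after a generation pass. The snippet below assumes the top-level memory introspection helpers available in recent MLX releases (previously under `mx.metal`).

```python
import mlx.core as mx

# Report how much unified memory MLX currently holds and its high-water mark.
print(f"active: {mx.get_active_memory() / 1024**3:.1f} GiB")
print(f"peak:   {mx.get_peak_memory() / 1024**3:.1f} GiB")
```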

## Tools Included

We provide utility scripts for easy model management:

1. **convert-to-mlx.sh** - command-generation tool for converting any model to MLX format, with many customization options and Q5 quantization support on mlx>=0.26.0
2. **mlx-serve.sh** - launches the MLX server with custom parameters

## Historical Note

The LibraxisAI Q5 models were among the **first Q5 quantized MLX models** available on Hugging Face, pioneering the use of 5-bit quantization for Apple Silicon optimization.

## Citation

If you use this model, please cite:

```bibtex
@misc{qwq-32b-q5-mlx,
  author = {LibraxisAI},
  title = {QwQ-32B Q5 MLX - Reasoning Model for Apple Silicon},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/LibraxisAI/QwQ-32B-MLX-Q5}
}
```

## License

This model follows the original QwQ license (Apache-2.0). See the [base model card](https://huggingface.co/Qwen/QwQ-32B) for full details.

## Authors of the repository

- [Monika Szymanska](https://github.com/m-szymanska)
- [Maciej Gad, DVM](https://div0.space)

## Acknowledgments

- Apple MLX team and community for the amazing 0.26.0+ framework
- Qwen team for the innovative QwQ reasoning model
- Klaudiusz-AI 🐉