Wanted to try a SOTA coder model on my NVIDIA Thor Dev Kit. The full nvidia/MiniMax-M2.5-NVFP4 won't fit, and saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10 won't run on Thor for some reason. So I asked cloud MiniMax 2.7 to work out which experts the REAP model deletes (based on which router columns were removed) and to delete the same experts from the NVIDIA model. The resulting model seems to work fine for chat; I'll check it out for coding shortly. A full run command (a vLLM 0.19.0 wrapper) is part of the extras described below.

I have included an extras directory in the model files with the following tools:

  • The script used to delete experts from the NVIDIA model based on the structure of the saricles model; it should be reusable for transferring other REAP prunes between quants
  • A modified chat template that allows turning off reasoning through the same mechanism as Qwen 3.5: enable_thinking: false in kwargs
  • A modified reasoning parser that supports the new template
  • A C++ tool that aggressively clears memory and swaps out inactive processes before running the model, to max out context length; I get ~120K tokens with an FP8 KV cache
  • An example script that runs vLLM with parameters tuned for speed and memory efficiency (e.g. limited CUDA graph captures) and with the custom chat template/reasoning parser
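A minimal sketch of the expert-transfer idea (function and tensor names here are illustrative, not taken from the actual script): recover which expert indices survived REAP pruning by matching the pruned model's router gate rows back to the original's, then keep only the matching expert tensors in the quantized checkpoint. A real script would also renumber the surviving experts and rewrite the config.

```python
import numpy as np

def kept_expert_indices(full_router: np.ndarray, pruned_router: np.ndarray) -> list[int]:
    """Recover which experts survived pruning by matching each row of the
    pruned router gate matrix back to a row of the full router gate matrix."""
    kept = []
    for row in pruned_router:
        matches = np.where((full_router == row).all(axis=1))[0]
        kept.append(int(matches[0]))
    return kept

def filter_expert_tensors(tensor_names: list[str], kept: set[int]) -> list[str]:
    """Keep all non-expert tensors, plus only those expert tensors whose index
    survived. Assumes names like 'model.layers.0.mlp.experts.17.w1.weight'
    (illustrative naming; the real checkpoint layout may differ)."""
    out = []
    for name in tensor_names:
        parts = name.split(".")
        if "experts" in parts:
            idx = int(parts[parts.index("experts") + 1])
            if idx in kept:
                out.append(name)
        else:
            out.append(name)
    return out
```

In practice the router comparison runs per layer over the safetensors shards of both models, and the kept-index set drives which expert weights get copied into the new checkpoint.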
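With the modified chat template loaded by the server, reasoning can be toggled per request. A usage sketch against a locally running vLLM OpenAI-compatible endpoint (the base URL and model name are assumptions; vLLM forwards `chat_template_kwargs` from `extra_body` into the template, the same mechanism Qwen 3 models use):

```python
from openai import OpenAI

# assumes vLLM is serving this model on localhost:8000
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="catplusplus/MiniMax-M2.5-REAP-172B-A10B-NVFP4",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    # disable the reasoning block for this request
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```

Omitting `extra_body` (or passing `enable_thinking: true`) leaves reasoning on, and the modified reasoning parser splits the thinking block from the final answer as usual.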
Model size: 116B params (Safetensors) · Tensor types: BF16, F32, F8_E4M3, U8
