Wanted to try a SOTA coder model on my NVIDIA Thor Dev Kit. The full nvidia/MiniMax-M2.5-NVFP4 won't fit, and saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10 won't run on Thor for some reason. So I asked cloud MiniMax 2.7 to work out which experts the REAP model deletes (based on which router columns were removed) and to delete the same experts from the NVIDIA model. The resulting model seems to work fine for chat; I'll check it out for coding shortly. A full run command (a vLLM 0.19.0 wrapper) is part of the extras described below.

I have included an extras directory in the model files with the following tools:

  • The script used to delete experts from the NVIDIA model based on the structure of the saricles model; it should be reusable for transferring other REAP prunes between quants
  • A modified chat template that allows turning off reasoning through the same mechanism as Qwen 3.5: enable_thinking: false in kwargs
  • A modified reasoning parser that supports the new template
  • A C++ tool that aggressively clears memory and swaps out inactive processes before running the model, to max out context length; I get ~120K tokens with an FP8 KV cache
  • An example script that runs vLLM with parameters tuned for speed and memory efficiency (e.g. limited CUDA graph captures) and with the custom chat template/reasoning parser
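A minimal sketch of the expert-transfer idea (function and tensor names here are illustrative, not taken from the actual script): recover which expert indices survived REAP pruning by matching the pruned model's router gate rows back to the original's, then keep only the matching expert tensors in the quantized checkpoint. A real script would also renumber the surviving experts and rewrite the config.

```python
import numpy as np

def kept_expert_indices(full_router: np.ndarray, pruned_router: np.ndarray) -> list[int]:
    """Recover which experts survived pruning by matching each row of the
    pruned router gate matrix back to a row of the full router gate matrix."""
    kept = []
    for row in pruned_router:
        matches = np.where((full_router == row).all(axis=1))[0]
        kept.append(int(matches[0]))
    return kept

def filter_expert_tensors(tensor_names: list[str], kept: set[int]) -> list[str]:
    """Keep all non-expert tensors, plus only those expert tensors whose index
    survived. Assumes names like 'model.layers.0.mlp.experts.17.w1.weight'
    (illustrative naming; the real checkpoint layout may differ)."""
    out = []
    for name in tensor_names:
        parts = name.split(".")
        if "experts" in parts:
            idx = int(parts[parts.index("experts") + 1])
            if idx in kept:
                out.append(name)
        else:
            out.append(name)
    return out
```

In practice the router comparison runs per layer over the safetensors shards of both models, and the kept-index set drives which expert weights get copied into the new checkpoint.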
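With the modified chat template loaded by the server, reasoning can be toggled per request. A usage sketch against a locally running vLLM OpenAI-compatible endpoint (the base URL and model name are assumptions; vLLM forwards `chat_template_kwargs` from `extra_body` into the template, the same mechanism Qwen 3 models use):

```python
from openai import OpenAI

# assumes vLLM is serving this model on localhost:8000
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="catplusplus/MiniMax-M2.5-REAP-172B-A10B-NVFP4",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    # disable the reasoning block for this request
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```

Omitting `extra_body` (or passing `enable_thinking: true`) leaves reasoning on, and the modified reasoning parser splits the thinking block from the final answer as usual.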
Model size: 116B params (Safetensors) · Tensor types: BF16, F32, F8_E4M3, U8
