Instructions to use unsloth/MiMo-V2.5-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use unsloth/MiMo-V2.5-GGUF with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/MiMo-V2.5-GGUF",
    filename="BF16/MiMo-V2.5-BF16-00001-of-00014.gguf",
)
```
```python
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use unsloth/MiMo-V2.5-GGUF with llama.cpp:
Install from brew
```shell
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M
```
Install from WinGet (Windows)
```shell
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M
```
Use pre-built binary
```shell
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M
```
Build from source code
```shell
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M
```
Use Docker
```shell
docker model run hf.co/unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M
```
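With `llama-server` running via any of the options above, you can exercise its OpenAI-compatible API directly over HTTP. A minimal sketch, assuming the server is listening on its default port 8080:

```shell
# Query the local llama-server chat endpoint (default port 8080):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```

Any OpenAI-compatible client library can be pointed at the same `http://localhost:8080/v1` base URL.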
- LM Studio
- Jan
- Ollama
How to use unsloth/MiMo-V2.5-GGUF with Ollama:
```shell
ollama run hf.co/unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M
```
- Unsloth Studio
How to use unsloth/MiMo-V2.5-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```shell
curl -fsSL https://unsloth.ai/install.sh | sh

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/MiMo-V2.5-GGUF to start chatting
```
Install Unsloth Studio (Windows)
```shell
irm https://unsloth.ai/install.ps1 | iex

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/MiMo-V2.5-GGUF to start chatting
```
Using HuggingFace Spaces for Unsloth
```shell
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for unsloth/MiMo-V2.5-GGUF to start chatting
```
- Pi
How to use unsloth/MiMo-V2.5-GGUF with Pi:
Start the llama.cpp server
```shell
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M
```
Configure the model in Pi
```shell
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```

Add the following to `~/.pi/agent/models.json`:

```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M" }
      ]
    }
  }
}
```

Run Pi

```shell
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use unsloth/MiMo-V2.5-GGUF with Hermes Agent:
Start the llama.cpp server
```shell
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M
```
Configure Hermes
```shell
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M
```
Run Hermes
```shell
hermes
```
- Docker Model Runner
How to use unsloth/MiMo-V2.5-GGUF with Docker Model Runner:
```shell
docker model run hf.co/unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M
```
- Lemonade
How to use unsloth/MiMo-V2.5-GGUF with Lemonade:
Pull the model
```shell
# Download Lemonade from https://lemonade-server.ai/
lemonade pull unsloth/MiMo-V2.5-GGUF:UD-Q4_K_M
```
Run and chat with the model
```shell
lemonade run user.MiMo-V2.5-GGUF-UD-Q4_K_M
```
List all available models
```shell
lemonade list
```
Includes Unsloth chat template fixes! For llama.cpp, use `--jinja`.
Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.
The `config.json` and `tokenizer_config.json` files in this repository have been updated since the initial release. If you downloaded MiMo-V2.5 before this commit (4da2748), please re-pull or manually update these two files to ensure correct model behavior. Using the outdated config may lead to degraded model performance. We apologize for any inconvenience.

Quick fix:
```shell
hf download XiaomiMiMo/MiMo-V2.5 config.json tokenizer_config.json --local-dir ./MiMo-V2.5
```
MiMo-V2.5
1. Introduction
MiMo-V2.5 is a native omnimodal model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified architecture. Built upon the MiMo-V2-Flash backbone and extended with dedicated vision and audio encoders, it delivers robust performance across multimodal perception, long-context reasoning, and agentic workflows. Key features include:
- Hybrid Attention Architecture: Inherits the hybrid design from MiMo-V2-Flash, interleaving Sliding Window Attention (SWA) and Global Attention (GA) with a 5:1 ratio and a 128-token sliding window. This reduces KV-cache storage by nearly 6× while maintaining long-context performance via learnable attention sink bias.
- Native Omnimodal Encoders: Equipped with a 729M-param Vision Transformer (ViT) featuring hybrid window attention and a dedicated audio encoder initialized from the weights of MiMo-Audio, enabling high-quality image, video, and audio understanding.
- Multi-Token Prediction (MTP): Three lightweight MTP modules with dense FFNs accelerate inference via speculative decoding and improve RL training efficiency.
- Efficient Pre-Training: Trained on a total of ~48T tokens using FP8 mixed precision. The context window supports up to 1M tokens.
- Agentic Capabilities: Post-training incorporates SFT, large-scale agentic RL, and Multi-Teacher On-Policy Distillation (MOPD), achieving strong performance on agentic tasks and multimodal understanding benchmarks.
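The KV-cache saving from the hybrid attention design can be checked with back-of-the-envelope arithmetic using the layer counts from the model summary below (9 GA layers with 8 KV heads, 39 SWA layers with 4 KV heads, 128-token window); the all-global baseline and byte-level layout details are simplifying assumptions:

```python
# Rough KV-cache comparison: all-global baseline vs. hybrid SWA/GA stack.
# Layer/head counts are from the MiMo-V2.5 model summary; treating cache
# size as (layers x KV heads x cached tokens) is an illustrative assumption.

def kv_cache_slots(num_layers, kv_heads, cached_tokens):
    """Total cached (head, token) slots across a group of layers."""
    return num_layers * kv_heads * cached_tokens

CONTEXT = 1_000_000  # 1M-token context
WINDOW = 128         # SWA sliding window

# Baseline: all 48 layers global, 8 KV heads, full context cached
baseline = kv_cache_slots(48, 8, CONTEXT)

# Hybrid: 9 GA layers cache the full context; 39 SWA layers cache
# only the last 128 tokens with 4 KV heads
hybrid = kv_cache_slots(9, 8, CONTEXT) + kv_cache_slots(39, 4, WINDOW)

print(f"reduction: {baseline / hybrid:.1f}x")  # roughly 5.3x at 1M tokens
```

At long context the SWA terms become negligible, so the ratio approaches 48/9 ≈ 5.3, consistent with the "nearly 6×" figure above.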
Model Summary
- Architecture: Sparse MoE (Mixture of Experts), 310B total / 15B activated parameters
- Context Length: Up to 1M tokens
- Modalities: Text, Image, Video, Audio
- Vision Encoder: 729M-param ViT (28 layers: 24 SWA + 4 Full)
- Audio Encoder: 261M-param Audio Transformer (24 layers: 12 SWA + 12 Full)
- Multi-Token Prediction (MTP): 329M parameters, 3 layers
2. Downloads
| Model | Context Length | Download |
|---|---|---|
| MiMo-V2.5-Base | 256K | 🤗 HuggingFace 🤖 ModelScope |
| MiMo-V2.5 | 1M | 🤗 HuggingFace 🤖 ModelScope |
3. Evaluation Results
Multimodal Benchmarks
Coding & Agent Benchmarks
Long Context Benchmarks
4. Model Architecture
LLM Backbone
MiMo-V2.5's core language backbone inherits from the MiMo-V2-Flash architecture, a sparse MoE model with hybrid sliding window attention.
| Component | MiMo-V2.5-Pro | MiMo-V2.5 |
|---|---|---|
| Total Parameters | 1.02T | 310B |
| Activated Parameters | 42B | 15B |
| Hidden Size | 6144 | 4096 |
| Num Layers | 70 (1 dense + 69 MoE) | 48 (1 dense + 47 MoE) |
| Full Attention Layers | 10 | 9 |
| SWA Layers | 60 | 39 |
| Num Attention Heads | 128 | 64 |
| Num KV Heads | 8 (GQA) | 8 (GA) / 4 (SWA) |
| Head Dim (QK / V) | 192 / 128 | 192 / 128 |
| Routed Experts | 384 | 256 |
| Experts per Token | 8 | 8 |
| MoE Intermediate Size | 2048 | 2048 |
| Dense Intermediate Size | 16384 (layer 0 only) | 16384 (layer 0 only) |
| SWA Window Size | 128 | 128 |
| Max Context Length | 1M | 1M |
| MTP Layers | 3 | 3 |
Vision Encoder
We train a dedicated MiMo ViT that adopts sliding-window attention to enable efficient visual encoding.
| Configuration | Value |
|---|---|
| Total Layers | 28 |
| SWA Layers | 24 |
| Full Attention Layers | 4 |
| Window-Attention Pattern | [-1] + [0,0,0,0,1,1,1,1,-1] × 3 |
| Attention Heads (Q / KV) | 32 / 8 |
| Head Dimensions (QK / V) | 64 / 64 |
| Sliding Window Size (L / R) | 64 / 64 |
Window pattern notation: -1 = full attention, 0 = 1-D row window, 1 = 1-D column window.
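As a sanity check, the window pattern above expands mechanically to the layer counts given in the table; a minimal sketch:

```python
# Expand the MiMo ViT window-attention pattern:
# -1 = full attention, 0 = 1-D row window, 1 = 1-D column window.
pattern = [-1] + [0, 0, 0, 0, 1, 1, 1, 1, -1] * 3

print(len(pattern))                          # 28 layers total
print(pattern.count(-1))                     # 4 full-attention layers
print(pattern.count(0) + pattern.count(1))   # 24 SWA layers
```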
Audio Encoder
Our audio encoder is initialized from the weights of MiMo-Audio-Tokenizer and further finetuned to support high-quality audio understanding.
| Configuration | Value |
|---|---|
| Total Layers | 24 |
| SWA Layers | 12 |
| Full Attention Layers | 12 |
| Sliding Window Size | 128 |
| Attention Heads (Q / KV) | 16 / 16 |
| Head Dimensions (QK / V) | 64 / 64 |
5. Training Process
MiMo-V2.5 is trained on a total of ~48T tokens.
- Text Pre-training: We collect diverse text data for pre-training the LLM backbone.
- Projector Warmup: Short-duration warmup of multimodal projectors (audio and visual MLP projectors).
- Multimodal Pre-training: High-quality multimodal data collected for large-scale pretraining.
- SFT & Agentic Post Training: Supervised fine-tuning with diverse agentic data. During this stage, the context window is progressively extended from 32K → 256K → 1M.
- RL & MOPD Training: Reinforcement learning for improving perception, reasoning, and agentic capabilities.
6. Deployment
Since inference engines are continuously updated and optimized, this guide provides deployment examples for reference only. For the best performance, follow the referenced cookbooks below to get the latest best practices.
SGLang Deployment
For the best performance, we strongly recommend deploying using this approach, which is officially supported by the SGLang community. Please refer to SGLang MiMo-V2.5 Cookbook for the latest deployment guide.
The following is an example of running the model with SGLang, referenced from sgl-project/sglang#23811:
```shell
python3 -m sglang.launch_server \
  --model-path XiaomiMiMo/MiMo-V2.5 \
  --served-model-name mimo-v2.5 \
  --log-level-http warning \
  --enable-cache-report \
  --pp-size 1 \
  --dp-size 2 \
  --tp-size 8 \
  --enable-dp-attention \
  --moe-a2a-backend deepep \
  --deepep-mode auto \
  --decode-log-interval 1 \
  --page-size 1 \
  --host 0.0.0.0 \
  --port 9001 \
  --trust-remote-code \
  --watchdog-timeout 1000000 \
  --mem-fraction-static 0.65 \
  --chunked-prefill-size 16384 \
  --reasoning-parser qwen3 \
  --tool-call-parser mimo \
  --context-length 262144 \
  --collect-tokens-histogram \
  --enable-metrics \
  --load-balance-method round_robin \
  --allow-auto-truncate \
  --enable-metrics-for-all-schedulers \
  --quantization fp8 \
  --skip-server-warmup \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --disable-tokenizer-batch-decode \
  --mm-enable-dp-encoder \
  --attention-backend fa3 \
  --mm-attention-backend fa3
```
vLLM Deployment
For the best performance, we strongly recommend deploying using this approach, which is officially supported by the vLLM community. Please refer to vLLM MiMo-V2-Flash Cookbook for the latest deployment guide.
For local deployment, we recommend setting the sampling parameters to temperature=1.0, top_p=0.95.
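A minimal client-side sketch applying these sampling settings, assuming an OpenAI-compatible vLLM server is already running at `localhost:8000` (the port, API key, and served model name here are assumptions) and the `openai` Python package is installed:

```python
# Sketch: query a locally deployed OpenAI-compatible server with the
# recommended sampling parameters (temperature=1.0, top_p=0.95).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="XiaomiMiMo/MiMo-V2.5",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=1.0,
    top_p=0.95,
)
print(resp.choices[0].message.content)
```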
Citation
```bibtex
@misc{mimov25,
  title={MiMo-V2.5},
  year={2026},
  howpublished={\url{https://huggingface.co/collections/XiaomiMiMo/mimo-v25}},
}
```
Contact
For questions or feedback, reach us at mimo@xiaomi.com or join our community.