AxionML Kimi-K2.5-MXFP8

Developed by AxionML for open-source serving and deployment use cases, as part of its effort to provide ready-to-serve quantized models for the community.

This is an MXFP8-quantized version of moonshotai/Kimi-K2.5 (1T total parameters, 32B activated), quantized using NVIDIA TensorRT Model Optimizer. Weights and activations of linear layers are quantized to FP8, reducing disk size and GPU memory by ~2x compared to BF16.
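The ~2x figure follows from simple byte counting. A rough sketch (approximate; it ignores per-block scale overhead, the KV cache, and activation memory):

```python
# Back-of-the-envelope memory arithmetic for the ~2x reduction claim.
TOTAL_PARAMS = 1e12  # 1T total parameters

bf16_bytes = TOTAL_PARAMS * 2  # BF16: 2 bytes per weight
fp8_bytes = TOTAL_PARAMS * 1   # FP8 (E4M3): 1 byte per weight

print(f"BF16 weights:  ~{bf16_bytes / 1e12:.0f} TB")
print(f"MXFP8 weights: ~{fp8_bytes / 1e12:.0f} TB")
print(f"Reduction:     {bf16_bytes / fp8_bytes:.1f}x")
```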

About MXFP8 quantization: MXFP8 (Microscaling FP8) uses the E4M3 format with per-block scaling factors to maintain accuracy while halving memory footprint. Unlike coarser per-tensor schemes, microscaling applies fine-grained scaling over small element groups, preserving dynamic range across layers with heterogeneous activation distributions. On NVIDIA Hopper and Blackwell GPUs, FP8 Tensor Cores deliver up to 2x the throughput of BF16 with negligible accuracy loss for well-calibrated models.
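The per-block scaling idea can be sketched numerically. The following is a simplified illustration, not real E4M3 bit manipulation: the block size of 32 and the power-of-two shared scale follow the OCP MX convention, and E4M3 rounding is approximated by keeping 3 mantissa bits.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3
BLOCK = 32        # MX block size: one shared scale per 32 elements

def mxfp8_quantize_block(x):
    """Quantize one block: choose a power-of-two shared scale (E8M0-style)
    so the block's max magnitude fits in E4M3 range, then round elements."""
    amax = np.max(np.abs(x))
    if amax == 0:
        return np.zeros_like(x), 1.0
    # smallest power-of-two scale with amax / scale <= E4M3_MAX
    scale = 2.0 ** np.ceil(np.log2(amax / E4M3_MAX))
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    # crude stand-in for E4M3 rounding: keep 3 mantissa bits
    exp = np.floor(np.log2(np.maximum(np.abs(q), 1e-30)))
    step = 2.0 ** (exp - 3)
    q = np.round(q / step) * step
    return q, scale

def mxfp8_dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(BLOCK).astype(np.float32)
q, s = mxfp8_quantize_block(x)
xhat = mxfp8_dequantize(q, s)
rel_err = np.max(np.abs(x - xhat) / (np.abs(x) + 1e-8))
```

Because each 32-element block gets its own scale, an outlier in one block does not crush the resolution of the rest of the tensor — this is the advantage over per-tensor FP8 schemes.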

Ready for commercial and non-commercial use under the Modified MIT License.

Model Summary

| Property | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers | 61 (including 1 dense layer) |
| Number of Experts | 384 routed, 1 shared, 8 selected per token |
| Attention Mechanism | MLA (Multi-head Latent Attention) |
| Activation Function | SwiGLU |
| Vision Encoder | MoonViT (400M parameters) |
| Context Length | 256K |
| Vocabulary Size | 160K |

Evaluation Results (BF16 Baseline)

| Benchmark | Kimi K2.5 (Thinking) |
|---|---|
| **Reasoning & Knowledge** | |
| HLE-Full | 30.1 |
| HLE-Full (w/ tools) | 50.2 |
| AIME 2025 | 96.1 |
| HMMT 2025 (Feb) | 95.4 |
| IMO-AnswerBench | 81.8 |
| GPQA-Diamond | 87.6 |
| MMLU-Pro | 87.1 |
| **Image & Video** | |
| MMMU-Pro | 78.5 |
| CharXiv (RQ) | 77.5 |
| MathVision | 84.2 |
| MathVista (mini) | 90.1 |
| ZeroBench | 9 |
| **Coding** | |
| SWE-bench Verified | 65.4 |
| LiveCodeBench | 74.6 |
| Codeforces | 2131 |
| **Agentic** | |
| TAU-Bench (Airline) | 72.6 |
| TAU-Bench (Retail) | 68.4 |
| OSWorld (15 steps) | 41.2 |
| BrowserGym | 57.3 |

Scores are from the Kimi-K2.5 model card. MXFP8 quantization is expected to produce negligible accuracy degradation (<0.5%) on these benchmarks.

Quantization Details

This model was quantized by applying MXFP8 to the weights and activations of linear operators within transformer blocks. Vision encoder weights are kept in their original precision.
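The selection rule above — quantize linear projections inside transformer blocks, skip everything else — can be sketched as a simple name filter. The module names below are illustrative placeholders, not the checkpoint's actual tensor names:

```python
# Hypothetical sketch of the layer-selection rule: quantize linear ("proj")
# weights inside transformer blocks; keep embeddings, norms, router gates,
# and the vision encoder (MoonViT) in their original precision.
def should_quantize(name: str) -> bool:
    if name.startswith("vision_tower."):  # vision encoder stays unquantized
        return False
    if "proj" not in name:                # norms, embeddings, router gates
        return False
    return True

names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.3.mlp.experts.17.down_proj.weight",
    "model.layers.3.input_layernorm.weight",
    "vision_tower.blocks.0.attn.qkv_proj.weight",
    "model.embed_tokens.weight",
]
quantized = [n for n in names if should_quantize(n)]
```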

Usage

Deploy with SGLang

```shell
python3 -m sglang.launch_server \
    --model-path AxionML/Kimi-K2.5-MXFP8 \
    --tp 8 \
    --trust-remote-code
```
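Once the server is up, it exposes an OpenAI-compatible API (SGLang defaults to port 30000). A sketch of a chat-completion request payload — the endpoint URL and sampling parameters here are assumptions you should adjust for your deployment:

```python
import json

# Request body for POST http://localhost:30000/v1/chat/completions
# (send with any HTTP client, e.g. requests.post(url, data=body,
#  headers={"Content-Type": "application/json"})).
payload = {
    "model": "AxionML/Kimi-K2.5-MXFP8",  # must match the served model path
    "messages": [
        {"role": "user", "content": "Explain MXFP8 in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.6,
}
body = json.dumps(payload)
```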

Reproduce with ModelOpt

```shell
python3 examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path moonshotai/Kimi-K2.5 \
    --qformat mxfp8 \
    --export_path ./kimi-k2.5-mxfp8
```

Limitations

The base model was trained on data that may contain toxic language and societal biases. The quantized model inherits these limitations. It may generate inaccurate, biased, or offensive content. Please refer to the original model card for full details.

Safetensors model size: 1T params. Tensor types: BF16 · F8_E4M3 · U8.