AxionML Kimi-K2.5-MXFP8

Developed by AxionML for open-source serving and deployment use cases, as part of its effort to provide ready-to-serve quantized models for the community.

This is an MXFP8-quantized version of moonshotai/Kimi-K2.5 (1T total parameters, 32B activated), quantized using NVIDIA TensorRT Model Optimizer. Weights and activations of linear layers are quantized to FP8, reducing disk size and GPU memory by ~2x compared to BF16.
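The ~2x figure follows from simple byte counting. A rough sketch (approximate; it ignores per-block scale overhead, the KV cache, and activation memory):

```python
# Back-of-the-envelope memory arithmetic for the ~2x reduction claim.
TOTAL_PARAMS = 1e12  # 1T total parameters

bf16_bytes = TOTAL_PARAMS * 2  # BF16: 2 bytes per weight
fp8_bytes = TOTAL_PARAMS * 1   # FP8 (E4M3): 1 byte per weight

print(f"BF16 weights:  ~{bf16_bytes / 1e12:.0f} TB")
print(f"MXFP8 weights: ~{fp8_bytes / 1e12:.0f} TB")
print(f"Reduction:     {bf16_bytes / fp8_bytes:.1f}x")
```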

About MXFP8 quantization: MXFP8 (Microscaling FP8) uses the E4M3 format with per-block scaling factors to maintain accuracy while halving memory footprint. Unlike coarser per-tensor schemes, microscaling applies fine-grained scaling over small element groups, preserving dynamic range across layers with heterogeneous activation distributions. On NVIDIA Hopper and Blackwell GPUs, FP8 Tensor Cores deliver up to 2x the throughput of BF16 with negligible accuracy loss for well-calibrated models.
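The per-block scaling idea can be sketched numerically. The following is a simplified illustration, not real E4M3 bit manipulation: the block size of 32 and the power-of-two shared scale follow the OCP MX convention, and E4M3 rounding is approximated by keeping 3 mantissa bits.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3
BLOCK = 32        # MX block size: one shared scale per 32 elements

def mxfp8_quantize_block(x):
    """Quantize one block: choose a power-of-two shared scale (E8M0-style)
    so the block's max magnitude fits in E4M3 range, then round elements."""
    amax = np.max(np.abs(x))
    if amax == 0:
        return np.zeros_like(x), 1.0
    # smallest power-of-two scale with amax / scale <= E4M3_MAX
    scale = 2.0 ** np.ceil(np.log2(amax / E4M3_MAX))
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    # crude stand-in for E4M3 rounding: keep 3 mantissa bits
    exp = np.floor(np.log2(np.maximum(np.abs(q), 1e-30)))
    step = 2.0 ** (exp - 3)
    q = np.round(q / step) * step
    return q, scale

def mxfp8_dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(BLOCK).astype(np.float32)
q, s = mxfp8_quantize_block(x)
xhat = mxfp8_dequantize(q, s)
rel_err = np.max(np.abs(x - xhat) / (np.abs(x) + 1e-8))
```

Because each 32-element block gets its own scale, an outlier in one block does not crush the resolution of the rest of the tensor — this is the advantage over per-tensor FP8 schemes.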

Ready for commercial and non-commercial use under the Modified MIT License.

Model Summary

| Property | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers | 61 (including 1 dense layer) |
| Number of Experts | 384 routed, 1 shared, 8 selected per token |
| Attention Mechanism | MLA (Multi-head Latent Attention) |
| Activation Function | SwiGLU |
| Vision Encoder | MoonViT (400M parameters) |
| Context Length | 256K |
| Vocabulary Size | 160K |

Evaluation Results (BF16 Baseline)

| Benchmark | Kimi K2.5 (Thinking) |
|---|---|
| **Reasoning & Knowledge** | |
| HLE-Full | 30.1 |
| HLE-Full (w/ tools) | 50.2 |
| AIME 2025 | 96.1 |
| HMMT 2025 (Feb) | 95.4 |
| IMO-AnswerBench | 81.8 |
| GPQA-Diamond | 87.6 |
| MMLU-Pro | 87.1 |
| **Image & Video** | |
| MMMU-Pro | 78.5 |
| CharXiv (RQ) | 77.5 |
| MathVision | 84.2 |
| MathVista (mini) | 90.1 |
| ZeroBench | 9 |
| **Coding** | |
| SWE-bench Verified | 65.4 |
| LiveCodeBench | 74.6 |
| Codeforces | 2131 |
| **Agentic** | |
| TAU-Bench (Airline) | 72.6 |
| TAU-Bench (Retail) | 68.4 |
| OSWorld (15 steps) | 41.2 |
| BrowserGym | 57.3 |

Scores are from the Kimi-K2.5 model card. MXFP8 quantization is expected to produce negligible accuracy degradation (<0.5%) on these benchmarks.

Quantization Details

This model was quantized by applying MXFP8 to the weights and activations of linear operators within transformer blocks. Vision encoder weights are kept in their original precision.
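The selection rule above — quantize linear projections inside transformer blocks, skip everything else — can be sketched as a simple name filter. The module names below are illustrative placeholders, not the checkpoint's actual tensor names:

```python
# Hypothetical sketch of the layer-selection rule: quantize linear ("proj")
# weights inside transformer blocks; keep embeddings, norms, router gates,
# and the vision encoder (MoonViT) in their original precision.
def should_quantize(name: str) -> bool:
    if name.startswith("vision_tower."):  # vision encoder stays unquantized
        return False
    if "proj" not in name:                # norms, embeddings, router gates
        return False
    return True

names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.3.mlp.experts.17.down_proj.weight",
    "model.layers.3.input_layernorm.weight",
    "vision_tower.blocks.0.attn.qkv_proj.weight",
    "model.embed_tokens.weight",
]
quantized = [n for n in names if should_quantize(n)]
```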

Usage

Deploy with SGLang

```shell
python3 -m sglang.launch_server \
    --model-path AxionML/Kimi-K2.5-MXFP8 \
    --tp 8 \
    --trust-remote-code
```
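Once the server is up, it exposes an OpenAI-compatible API (SGLang defaults to port 30000). A sketch of a chat-completion request payload — the endpoint URL and sampling parameters here are assumptions you should adjust for your deployment:

```python
import json

# Request body for POST http://localhost:30000/v1/chat/completions
# (send with any HTTP client, e.g. requests.post(url, data=body,
#  headers={"Content-Type": "application/json"})).
payload = {
    "model": "AxionML/Kimi-K2.5-MXFP8",  # must match the served model path
    "messages": [
        {"role": "user", "content": "Explain MXFP8 in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.6,
}
body = json.dumps(payload)
```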

Reproduce with ModelOpt

```shell
python3 examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path moonshotai/Kimi-K2.5 \
    --qformat mxfp8 \
    --export_path ./kimi-k2.5-mxfp8
```

Limitations

The base model was trained on data that may contain toxic language and societal biases. The quantized model inherits these limitations. It may generate inaccurate, biased, or offensive content. Please refer to the original model card for full details.

Safetensors model size: 1T params. Tensor types: BF16 · F8_E4M3 · U8.