# AxionML Kimi-K2.5-MXFP8
Developed by AxionML for open-source serving and deployment use cases. Part of AxionML's effort to provide ready-to-serve quantized models for the community.
This is an MXFP8-quantized version of moonshotai/Kimi-K2.5 (1T total parameters, 32B activated), produced with the NVIDIA TensorRT Model Optimizer. The weights and activations of linear layers are quantized to FP8, reducing disk size and GPU memory footprint by roughly 2x compared to BF16.
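As a rough sanity check on the ~2x figure: MXFP8 stores one byte per weight plus one shared scale byte per 32-element block. The arithmetic below is illustrative only; a real checkpoint also holds embeddings, norms, and the unquantized vision encoder.

```python
# Back-of-envelope memory estimate for the ~2x reduction (illustrative only).
total_params = 1.0e12  # 1T parameters

bf16_bytes = total_params * 2                       # 2 bytes per BF16 weight
mxfp8_bytes = total_params * 1 + total_params / 32  # 1 byte/weight + 1 scale byte per 32-element block

print(f"BF16:  {bf16_bytes / 1e12:.2f} TB")
print(f"MXFP8: {mxfp8_bytes / 1e12:.2f} TB")
print(f"ratio: {bf16_bytes / mxfp8_bytes:.2f}x")   # ~1.94x once scale overhead is counted
```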
About MXFP8 quantization: MXFP8 (Microscaling FP8) stores elements in the FP8 E4M3 format, with each block of 32 elements sharing a power-of-two scale factor (E8M0). Unlike coarser per-tensor schemes, this fine-grained scaling preserves dynamic range across layers with heterogeneous activation distributions. On NVIDIA Hopper and Blackwell GPUs, FP8 Tensor Cores deliver up to 2x the throughput of BF16, with negligible accuracy loss for well-calibrated models.
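To make the microscaling idea concrete, here is a minimal NumPy sketch: one power-of-two scale per 32-element block, plus a crude stand-in for E4M3 rounding (real kernels handle subnormals, saturation, and NaNs properly; this is for illustration only).

```python
import numpy as np

E4M3_MAX = 448.0   # largest normal magnitude in FP8 E4M3
BLOCK = 32         # MX block size: one shared scale per 32 elements

def fake_e4m3(v: np.ndarray) -> np.ndarray:
    """Crude stand-in for E4M3 rounding: keep ~4 significant bits.
    Ignores subnormals and NaN handling; illustration only."""
    m, e = np.frexp(v)                    # v = m * 2**e with 0.5 <= |m| < 1
    m_q = np.round(m * 16.0) / 16.0       # round mantissa to 4 significant bits
    return np.clip(m_q * 2.0 ** e, -E4M3_MAX, E4M3_MAX)

def mxfp8_quant_dequant(x: np.ndarray) -> np.ndarray:
    """Per-block power-of-two scaling (E8M0-style), then fake E4M3 rounding."""
    blocks = x.reshape(-1, BLOCK)
    amax = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-30)
    # Power-of-two scale chosen so each block's max fits inside the E4M3 range.
    scale = 2.0 ** np.ceil(np.log2(amax / E4M3_MAX))
    q = fake_e4m3(blocks / scale)         # this is what would be stored in FP8
    return (q * scale).reshape(x.shape)   # dequantize for comparison

# Values spanning four orders of magnitude, split across 4 blocks.
x = np.linspace(0.01, 100.0, 4 * BLOCK)
deq = mxfp8_quant_dequant(x)
rel_err = np.abs(deq - x) / np.abs(x)
print(f"max relative error: {rel_err.max():.4f}")
```

Because each block gets its own scale, small-magnitude blocks keep full mantissa resolution instead of being squeezed by a single tensor-wide maximum.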
Ready for commercial and non-commercial use under the Modified MIT License.
## Model Summary

| Field | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers | 61 (including 1 dense layer) |
| Number of Experts | 384 routed, 1 shared, 8 selected per token |
| Attention Mechanism | MLA (Multi-head Latent Attention) |
| Activation Function | SwiGLU |
| Vision Encoder | MoonViT (400M parameters) |
| Context Length | 256K |
| Vocabulary Size | 160K |
## Evaluation Results (BF16 Baseline)
| Benchmark | Kimi K2.5 (Thinking) |
|---|---|
| Reasoning & Knowledge | |
| HLE-Full | 30.1 |
| HLE-Full (w/ tools) | 50.2 |
| AIME 2025 | 96.1 |
| HMMT 2025 (Feb) | 95.4 |
| IMO-AnswerBench | 81.8 |
| GPQA-Diamond | 87.6 |
| MMLU-Pro | 87.1 |
| Image & Video | |
| MMMU-Pro | 78.5 |
| CharXiv (RQ) | 77.5 |
| MathVision | 84.2 |
| MathVista (mini) | 90.1 |
| ZeroBench | 9 |
| Coding | |
| SWE-bench Verified | 65.4 |
| LiveCodeBench | 74.6 |
| Codeforces | 2131 |
| Agentic | |
| TAU-Bench (Airline) | 72.6 |
| TAU-Bench (Retail) | 68.4 |
| OSWorld (15 steps) | 41.2 |
| BrowserGym | 57.3 |
Scores are from the Kimi-K2.5 model card. MXFP8 quantization is expected to produce negligible accuracy degradation (<0.5%) on these benchmarks.
## Quantization Details
This model was quantized by applying MXFP8 to the weights and activations of linear operators within transformer blocks. Vision encoder weights are kept in their original precision.
- Quantization format: MXFP8 (E4M3 with microscaling)
- Calibration dataset: Nemotron-Post-Training-Dataset-v2
- Tool: TensorRT Model Optimizer
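The selection rule described above (quantize linear operators in transformer blocks, keep the vision encoder in original precision) can be sketched as a simple name filter. The module names and the `vision_tower.` prefix below are illustrative assumptions, not the model's actual parameter names.

```python
# Hypothetical sketch of the layer-selection rule: quantize linear layers in
# transformer blocks, skip the vision encoder and non-linear modules.
def should_quantize(module_name: str, module_type: str) -> bool:
    if module_type != "Linear":
        return False  # only linear operators are quantized
    if module_name.startswith("vision_tower."):
        return False  # vision encoder stays in its original precision
    return True

# Illustrative module names (not the real checkpoint layout).
modules = {
    "model.layers.0.self_attn.q_proj": "Linear",
    "model.layers.0.mlp.experts.0.up_proj": "Linear",
    "model.layers.0.input_layernorm": "RMSNorm",
    "vision_tower.blocks.0.attn.qkv": "Linear",
}
quantized = [name for name, kind in modules.items() if should_quantize(name, kind)]
print(quantized)
```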
## Usage

### Deploy with SGLang

```shell
python3 -m sglang.launch_server \
  --model-path AxionML/Kimi-K2.5-MXFP8 \
  --tp 8 \
  --trust-remote-code
```
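Once the server is running, SGLang exposes an OpenAI-compatible API (by default on `localhost:30000`). A minimal sketch of a chat-completion request payload, with the actual POST shown as a comment:

```python
import json

# Chat-completion payload for the OpenAI-compatible endpoint served by SGLang.
payload = {
    "model": "AxionML/Kimi-K2.5-MXFP8",
    "messages": [{"role": "user", "content": "Explain MXFP8 in one sentence."}],
    "max_tokens": 128,
    "temperature": 0.6,
}
body = json.dumps(payload)

# With the server running, send it with e.g.:
#   curl http://localhost:30000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @payload.json
print(body[:60])
```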
### Reproduce with ModelOpt

```shell
python3 examples/llm_ptq/hf_ptq.py \
  --pyt_ckpt_path moonshotai/Kimi-K2.5 \
  --qformat mxfp8 \
  --export_path ./kimi-k2.5-mxfp8
```
## Limitations
The base model was trained on data that may contain toxic language and societal biases. The quantized model inherits these limitations. It may generate inaccurate, biased, or offensive content. Please refer to the original model card for full details.