---
license: apache-2.0
library_name: transformers
---
## Introduction
Step3 is our cutting-edge multimodal reasoning model—built on a Mixture-of-Experts architecture with 321B total parameters and 38B active. It is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision–language reasoning. Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), Step3 maintains exceptional efficiency across both flagship and low-end accelerators.
## Step3 model card
| Config | Value |
|---|---|
| Number of Layers (Dense layer included) | 61 |
| Number of Dense Layers | 5 |
| Hidden Dimension | 7168 |
| Attention Mechanism | MFA |
| Low-rank Query Dimension | 2048 |
| Number of Query Heads | 64 |
| Head Dimension | 256 |
| Number of Experts | 48 |
| Selected Experts per Token | 3 |
| Number of Shared Experts | 1 |
| Max Context Length | 65536 |
| Tokenizer | DeepSeek-V3 |
| Total Parameters (LLM) | 316B |
| Activated Params per Token | 38B |
| Total Parameters (VLM) | 321B |
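To make the numbers above concrete, the following is a minimal, illustrative sketch (not the actual Step3 implementation) of how the attention and MoE dimensions in the table fit together: a query projection factorized through the 2048-dimensional low-rank space, and a router that activates 3 of the 48 routed experts per token (with 1 shared expert always active). All module and variable names here are hypothetical and chosen only to mirror the table.

```python
import torch
import torch.nn as nn

# Dimensions taken from the model card table above.
HIDDEN = 7168        # Hidden Dimension
Q_RANK = 2048        # Low-rank Query Dimension
N_HEADS = 64         # Number of Query Heads
HEAD_DIM = 256       # Head Dimension
N_EXPERTS = 48       # Number of routed Experts
TOP_K = 3            # Selected Experts per Token
N_SHARED = 1         # Number of Shared Experts (always active)

class LowRankQueryProjection(nn.Module):
    """One plausible reading of the table: queries are projected
    hidden -> 2048 -> heads * head_dim instead of a single full-rank matrix."""
    def __init__(self):
        super().__init__()
        self.down = nn.Linear(HIDDEN, Q_RANK, bias=False)
        self.up = nn.Linear(Q_RANK, N_HEADS * HEAD_DIM, bias=False)

    def forward(self, x):                          # x: [batch, seq, HIDDEN]
        q = self.up(self.down(x))
        return q.view(*x.shape[:-1], N_HEADS, HEAD_DIM)

def route_tokens(router_logits: torch.Tensor):
    """Pick the top-3 routed experts per token and normalize their weights;
    the shared expert is applied to every token regardless of routing."""
    weights, expert_ids = router_logits.topk(TOP_K, dim=-1)   # [tokens, 3]
    weights = weights.softmax(dim=-1)
    return expert_ids, weights

# Example: project a dummy batch of 4 tokens and route them across experts.
x = torch.randn(1, 4, HIDDEN)
q = LowRankQueryProjection()(x)                    # [1, 4, 64, 256]
ids, w = route_tokens(torch.randn(4, N_EXPERTS))   # [4, 3], [4, 3]
print(q.shape, ids.shape, w.shape)
```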
## Evaluation Results
| Category | Model | Total Params. | MMMU | MathVision | ZeroBench(sub) | DYNAMATH | SimpleVQA | HallusionBench | AIME25 | HMMT25 | CNMO24 | GPQA-Diamond | LiveCodeBench (24.8-25.5) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-Source VLM | Step3 | 321B | 74.2 | 64.8 | 23.0 | 50.1 | 62.2 | 64.2 | 82.9 | 70.0 | 83.7 | 73.0 | 67.1 |
| | ERNIE4.5 - thinking | 300B/424B | 70.0 | 47.6 | 22.5 | 46.9 | 59.8 | 60.0 | 35.1 | 40.5* | 75.5 | 76.8 | 38.8 |
| | GLM-4.1V-thinking | 9B | 68.0 | 49.4 | 22.8 | 41.9 | 48.1 | 60.8 | 13.3 | 6.7 | 25.0 | 47.4 | 24.2 |
| | MiMo-VL | 7B | 66.7 | 60.4 | 18.6 | 45.9 | 48.5 | 59.6 | 60.0 | 34.6 | 69.9 | 55.5 | 50.1 |
| | QvQ-72B-Preview | 72B | 70.3 | 35.9 | 15.9 | 30.7 | 40.3 | 50.8 | 22.7 | 49.5 | 47.3 | 10.9 | 24.1 |
| | LLaMA-Maverick | 400B | 73.4 | 47.2 | 22.8 | 47.1 | 45.4 | 57.1 | 19.2 | 8.91 | 41.6 | 69.8 | 33.9 |
| Open-Source LLM | MiniMax-M1-80k | 456B | - | - | - | - | - | - | 76.9 | - | - | 70.0 | 65.0 |
| | Qwen3-235B-A22B-Thinking | 235B | - | - | - | - | - | - | 81.5 | 62.5 | - | 71.1 | 65.9 |
| | DeepSeek R1-0528 | 671B | - | - | - | - | - | - | 87.5 | 79.4 | 86.9 | 81.0 | 73.3 |
| | Qwen3-235B-A22B-Thinking-2507 | 235B | - | - | - | - | - | - | 92.3 | 83.9 | - | 81.1 | - |
| Proprietary VLM | O3 | - | 82.9 | 72.8 | 25.2 | 58.1 | 59.8 | 60.1 | 88.9 | 70.1 | 86.7 | 83.3 | 75.8 |
| | Claude4 Sonnet (thinking) | - | 76.9 | 64.6 | 26.1 | 48.1 | 43.7 | 57.0 | 70.5 | - | - | 75.4 | 55.9 |
| | Claude4 Opus (thinking) | - | 79.8 | 66.1 | 25.2 | 49.3 | 47.2 | 59.9 | 75.5 | - | - | 79.6 | 56.6 |
| | Gemini 2.5 Flash (thinking)† | - | 73.2 | 57.3 | 20.1 | 57.1 | 61.1 | 65.2 | 72.0 | - | - | 82.8 | 61.9 |
| | Gemini 2.5 Pro | - | 81.7 | 73.3 | 30.8 | 56.3 | 66.8 | 66.8 | 88.0 | - | - | 86.4 | 71.8 |
| | Grok 4 | - | 80.9 | 70.3 | 22.5 | 40.7 | 55.9 | 64.8 | 98.8 | 93.9 | 85.5 | 87.5 | 79.3 |
Note: Parts of the evaluation results are reproduced using the same settings.
†: Evaluation results of Gemini 2.5 Flash (thinking) may be lower than real model performance, especially on MathVision, due to insufficient instruction following ability.
## Deployment
You can access Step3's API at https://platform.stepfun.com/, where we provide an OpenAI/Anthropic-compatible API.
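As a quick illustration of the OpenAI-compatible interface, the sketch below sends a simple chat request with the official `openai` Python client. The base URL `https://api.stepfun.com/v1` and the model identifier `step-3` are assumptions made for this example; confirm the exact endpoint and model name on the platform page above.

```python
from openai import OpenAI

# Assumed endpoint and model name; check https://platform.stepfun.com/ for the
# authoritative values and to create an API key.
client = OpenAI(
    api_key="YOUR_STEPFUN_API_KEY",
    base_url="https://api.stepfun.com/v1",
)

response = client.chat.completions.create(
    model="step-3",  # assumed model identifier
    messages=[
        {"role": "user", "content": "Briefly explain what a Mixture-of-Experts model is."},
    ],
)
print(response.choices[0].message.content)
```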
### Inference with Hugging Face Transformers
This section shows how to run inference with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers==4.54.0 as the development environment. Only bf16 inference is currently supported, and multi-patch image processing is enabled by default; this behavior is aligned with vLLM and SGLang.
```python
from transformers import AutoProcessor, AutoModelForCausalLM

# Remap checkpoint weight prefixes onto the transformers module layout.
key_mapping = {
    "^vision_model": "model.vision_model",
    r"^model(?!\.(language_model|vision_model))": "model.language_model",
    "vit_downsampler": "model.vit_downsampler",
    "vit_downsampler2": "model.vit_downsampler2",
    "vit_large_projector": "model.vit_large_projector",
}

model_path = "stepfun-ai/step3"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
    key_mapping=key_mapping,
)

# A single-turn conversation with one image and one text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "What's in this picture?"},
        ],
    },
]

# Tokenize the chat, run greedy decoding, and strip the prompt from the output.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

generate_ids = model.generate(**inputs, max_new_tokens=32768, do_sample=False)
decoded = processor.decode(
    generate_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(decoded)
```
### Inference with vLLM and SGLang
Our model checkpoints are provided in both bf16 and block-fp8 formats; you can find them on Hugging Face.
Currently, it is recommended to run Step3 on the following inference engines:
- vLLM
- SGLang
Deployment and request examples for vLLM and SGLang can be found in the Model Deployment Guide.
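For a quick local smoke test, the sketch below assumes a vLLM OpenAI-compatible server has already been launched for the checkpoint (for example with `vllm serve stepfun-ai/step3 --trust-remote-code --tensor-parallel-size 8`, adjusted to your hardware) and sends a vision-language request to it. The port, served model name, and launch flags here are illustrative; follow the Model Deployment Guide for the authoritative commands.

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible server started by vLLM (or SGLang), e.g.:
#   vllm serve stepfun-ai/step3 --trust-remote-code --tensor-parallel-size 8
# Port and served model name below are illustrative defaults.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="stepfun-ai/step3",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"}},
                {"type": "text", "text": "What's in this picture?"},
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```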
## Contact Us
If you have any questions, please reach out to us at [email protected].
## License
Both the code repository and the model weights are released under the Apache License (Version 2.0).
## Citation
```
@misc{step3system,
      title={Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding},
      author={StepFun Team},
      year={2025},
      eprint={2507.19427},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.19427},
}

@misc{step3blog,
      title={Step3: Cost-Effective Multimodal Intelligence},
      author={StepFun Team},
      url={https://stepfun.ai/research/step3},
}
```