---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
---
## Introduction
SmallThinker is a family of on-device native Mixture-of-Experts (MoE) language models, co-developed by the IPADS Lab at Shanghai Jiao Tong University and Zenergize. Designed from the ground up for resource-constrained environments, SmallThinker brings powerful, private, and low-latency AI directly to your personal devices, without relying on the cloud.
## Performance
| Model | MMLU | GPQA-Diamond | GSM8K | MATH-500 | IFEval | LiveBench | HumanEval | Average |
|---|---|---|---|---|---|---|---|---|
| SmallThinker-4BA0.6B-Instruct | 66.11 | 31.31 | 80.02 | 60.60 | 69.69 | 42.20 | 82.32 | 61.75 |
| Qwen3-0.6B | 43.31 | 26.77 | 62.85 | 45.60 | 58.41 | 23.10 | 31.71 | 41.67 |
| Qwen3-1.7B | 64.19 | 27.78 | 81.88 | 63.60 | 69.50 | 35.60 | 61.59 | 57.73 |
| Gemma3n-E2B | 63.04 | 20.20 | 82.34 | 58.60 | 73.20 | 27.90 | 64.63 | 55.70 |
| Llama-3.2-3B | 64.15 | 24.24 | 75.51 | 40.00 | 71.16 | 15.30 | 55.49 | 49.41 |
| Llama-3.2-1B-Instruct | 45.66 | 22.73 | 1.67 | 14.40 | 48.06 | 13.50 | 37.20 | 26.17 |
## Model Card
| Property | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 4B |
| Activated Parameters | 0.6B |
| Number of Layers | 32 |
| Attention Hidden Dimension | 1536 |
| MoE Hidden Dimension (per Expert) | 1408 |
| Number of Attention Heads | 12 |
| Number of Experts | 32 |
| Selected Experts per Token | 4 |
| Vocabulary Size | 151,936 |
| Context Length | 32K |
| Attention Mechanism | GQA |
| Activation Function | ReGLU |
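If you want to confirm these figures against the released checkpoint, you can inspect the model configuration directly. The following is a minimal sketch; the exact attribute names in `config.json` for this custom architecture may differ, so check the printed output rather than relying on specific keys:

```python
from transformers import AutoConfig

# trust_remote_code is needed because SmallThinker uses a custom MoE architecture.
config = AutoConfig.from_pretrained(
    "PowerInfer/SmallThinker-4BA0.6B-Instruct", trust_remote_code=True
)

# Prints the full configuration; the layer count, expert count, experts per token,
# and context length should match the table above.
print(config)
```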
## How to Run

### Transformers
The latest version of `transformers` is recommended; `transformers>=4.52.4` is required.

The following code snippet illustrates how to use the model to generate content from a given input.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

path = "PowerInfer/SmallThinker-4BA0.6B-Instruct"
device = "cuda"

# trust_remote_code is required because SmallThinker uses a custom MoE architecture.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language model."},
]

# Apply the chat template and move the prompt token ids to the target device.
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)

model_outputs = model.generate(
    model_inputs,
    do_sample=True,
    max_new_tokens=1024
)

# Strip the prompt tokens so only the newly generated tokens are decoded.
output_token_ids = [
    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
]

responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
```
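For interactive, on-device use you may prefer to print tokens as they are generated instead of waiting for the full completion. Below is a minimal sketch using the standard `transformers` `TextStreamer`, reusing `model`, `tokenizer`, and `model_inputs` from the snippet above; this is generic `generate` usage, not something specific to SmallThinker:

```python
from transformers import TextStreamer

# Stream decoded text to stdout as tokens are produced, skipping the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    model_inputs,
    do_sample=True,
    max_new_tokens=1024,
    streamer=streamer,
)
```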
### ModelScope
ModelScope adopts a Python API similar to (though not entirely identical to) Transformers. For basic usage, simply modify the first line of the code above as follows:
```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
```