---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
---
## Introduction
SmallThinker is a family of on-device native Mixture-of-Experts (MoE) language models, co-developed by the IPADS Lab at Shanghai Jiao Tong University and Zenergize. Designed from the ground up for resource-constrained environments, SmallThinker brings powerful, private, and low-latency AI directly to your personal devices, without relying on the cloud.
## Performance
| Model | MMLU | GPQA-Diamond | GSM8K | MATH-500 | IFEval | LiveBench | HumanEval | Average |
|---|---|---|---|---|---|---|---|---|
| SmallThinker-4BA0.6B-Instruct | 66.11 | 31.31 | 80.02 | 60.60 | 69.69 | 42.20 | 82.32 | 61.75 |
| Qwen3-0.6B | 43.31 | 26.77 | 62.85 | 45.60 | 58.41 | 23.10 | 31.71 | 41.67 |
| Qwen3-1.7B | 64.19 | 27.78 | 81.88 | 63.60 | 69.50 | 35.60 | 61.59 | 57.73 |
| Gemma3n-E2B | 63.04 | 20.20 | 82.34 | 58.60 | 73.20 | 27.90 | 64.63 | 55.70 |
| Llama-3.2-3B | 64.15 | 24.24 | 75.51 | 40.00 | 71.16 | 15.30 | 55.49 | 49.41 |
| Llama-3.2-1B-Instruct | 45.66 | 22.73 | 1.67 | 14.40 | 48.06 | 13.50 | 37.20 | 26.17 |
## Model Card
| Property | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 4B |
| Activated Parameters | 0.6B |
| Number of Layers | 32 |
| Attention Hidden Dimension | 1536 |
| MoE Hidden Dimension (per Expert) | 1408 |
| Number of Attention Heads | 12 |
| Number of Experts | 32 |
| Selected Experts per Token | 4 |
| Vocabulary Size | 151,936 |
| Context Length | 32K |
| Attention Mechanism | GQA |
| Activation Function | ReGLU |
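If you want to confirm these figures against the released checkpoint, you can inspect the model configuration directly. The following is a minimal sketch; the exact attribute names in `config.json` for this custom architecture may differ, so check the printed output rather than relying on specific keys:

```python
from transformers import AutoConfig

# trust_remote_code is needed because SmallThinker uses a custom MoE architecture.
config = AutoConfig.from_pretrained(
    "PowerInfer/SmallThinker-4BA0.6B-Instruct", trust_remote_code=True
)

# Prints the full configuration; the layer count, expert count, experts per token,
# and context length should match the table above.
print(config)
```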
## How to Run

### Transformers
The latest version of `transformers` is recommended; `transformers>=4.52.4` is required.

The following code snippet illustrates how to use the model to generate content from a given input.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

path = "PowerInfer/SmallThinker-4BA0.6B-Instruct"
device = "cuda"

# trust_remote_code is required because SmallThinker uses a custom MoE architecture.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language model."},
]

# Apply the chat template and move the prompt token ids to the target device.
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)

model_outputs = model.generate(
    model_inputs,
    do_sample=True,
    max_new_tokens=1024
)

# Strip the prompt tokens so only the newly generated tokens are decoded.
output_token_ids = [
    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
]

responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
```
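For interactive, on-device use you may prefer to print tokens as they are generated instead of waiting for the full completion. Below is a minimal sketch using the standard `transformers` `TextStreamer`, reusing `model`, `tokenizer`, and `model_inputs` from the snippet above; this is generic `generate` usage, not something specific to SmallThinker:

```python
from transformers import TextStreamer

# Stream decoded text to stdout as tokens are produced, skipping the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    model_inputs,
    do_sample=True,
    max_new_tokens=1024,
    streamer=streamer,
)
```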
### ModelScope
ModelScope adopts a Python API similar to (though not entirely identical to) Transformers. For basic usage, simply modify the first line of the code above as follows:
```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
```