PowerInfer / SmallThinker-4BA0.6B-Instruct


	---
	license: apache-2.0
	language:
	- en
	pipeline_tag: text-generation
	---
	## Introduction

	SmallThinker is a family of on-device native Mixture-of-Experts (MoE) language models,
	co-developed by IPADS and the School of AI at Shanghai Jiao Tong University together with Zenergize.
	Designed from the ground up for resource-constrained environments and local deployment,
	SmallThinker brings powerful, private, low-latency AI directly to your personal devices,
	without relying on the cloud.

	## Performance
	| Model | MMLU | GPQA-diamond | GSM8K | MATH-500 | IFEval | LiveBench | HumanEval | Average |
	| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
	| `SmallThinker-4BA0.6B-Instruct` | 66.11 | 31.31 | 80.02 | 60.60 | 69.69 | 42.20 | 82.32 | 61.75 |
	| `Qwen3-0.6B` | 43.31 | 26.77 | 62.85 | 45.60 | 58.41 | 23.10 | 31.71 | 41.67 |
	| `Qwen3-1.7B` | 64.19 | 27.78 | 81.88 | 63.60 | 69.50 | 35.60 | 61.59 | 57.73 |
	| `Gemma3nE2b-it` | 63.04 | 20.20 | 82.34 | 58.60 | 73.20 | 27.90 | 64.63 | 55.70 |
	| `Llama-3.2-3B-Instruct` | 64.15 | 24.24 | 75.51 | 40.00 | 71.16 | 15.30 | 55.49 | 49.41 |
	| `Llama-3.2-1B-Instruct` | 45.66 | 22.73 | 1.67 | 14.40 | 48.06 | 13.50 | 37.20 | 26.17 |
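
The README does not say how the Average column is computed. The sketch below assumes it is the unweighted mean of the seven benchmark scores; the reported values agree with that assumption to within 0.01. The scores are copied from two rows of the table, and the `mean` helper is illustrative only.

```python
# Scores copied from the table, in column order:
# MMLU, GPQA-diamond, GSM8K, MATH-500, IFEval, LiveBench, HumanEval
scores = {
    "SmallThinker-4BA0.6B-Instruct": ([66.11, 31.31, 80.02, 60.60, 69.69, 42.20, 82.32], 61.75),
    "Qwen3-1.7B": ([64.19, 27.78, 81.88, 63.60, 69.50, 35.60, 61.59], 57.73),
}


def mean(values):
    """Unweighted arithmetic mean (assumed, not stated in the README)."""
    return sum(values) / len(values)


for name, (values, reported) in scores.items():
    print(f"{name}: computed {mean(values):.2f}, reported {reported:.2f}")
```
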
	## Model Card

	<div align="center">

	| | |
	|:---:|:---:|
	| Architecture | Mixture-of-Experts (MoE) |
	| Total Parameters | 4B |
	| Activated Parameters | 0.6B |
	| Number of Layers | 32 |
	| Attention Hidden Dimension | 1536 |
	| MoE Hidden Dimension (per Expert) | 768 |
	| Number of Attention Heads | 12 |
	| Number of Experts | 32 |
	| Selected Experts per Token | 4 |
	| Vocabulary Size | 151,936 |
	| Context Length | 32K |
	| Attention Mechanism | GQA |
	| Activation Function | ReGLU |
	</div>
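
To make the "32 experts, 4 selected per token" and "ReGLU" entries concrete, here is a minimal pure-Python sketch of generic top-k softmax routing and the ReGLU activation. It is not SmallThinker's actual router, which may differ in details such as how weights are normalized. The function names `route_token` and `reglu` and the router scores are made up for illustration.

```python
import math

NUM_EXPERTS = 32  # "Number of Experts" from the table
TOP_K = 4         # "Selected Experts per Token" from the table


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k experts of one token.

    Generic top-k routing sketch; the real model's router may differ.
    """
    # Indices of the top_k largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over the selected experts only (max subtracted for stability).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def reglu(gate, up):
    """ReGLU: element-wise ReLU(gate) * up."""
    return [max(0.0, g) * u for g, u in zip(gate, up)]


logits = [math.sin(i) for i in range(NUM_EXPERTS)]  # made-up router scores
selected = route_token(logits)
print(selected)
print(f"experts active per token: {TOP_K}/{NUM_EXPERTS} = {TOP_K / NUM_EXPERTS:.3f}")
```

Only the selected experts run for a given token, which is why the activated parameter count (0.6B) is much smaller than the total (4B).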

	## How to Run

	### Transformers

	`transformers>=4.52.4` is required; the latest version of `transformers` is recommended.
	The following code snippet shows how to use the model to generate content from a given input.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	path = "PowerInfer/SmallThinker-4BA0.6B-Instruct"
	device = "cuda"

	# trust_remote_code=True allows the model's custom code to be loaded from the Hub
	tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

	messages = [
	    {"role": "user", "content": "Give me a short introduction to large language models."},
	]
	# Render the conversation with the model's chat template and tokenize it
	model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)

	model_outputs = model.generate(
	    model_inputs,
	    do_sample=True,
	    max_new_tokens=1024,
	)

	# generate() returns prompt + completion; keep only the newly generated tokens
	output_token_ids = [
	    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
	]

	response = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
	print(response)

	```
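
The list comprehension near the end of the snippet above exists because `generate` on a decoder-only model returns the prompt tokens followed by the new tokens. The toy example below shows the same slicing on plain Python lists. The `strip_prompt` helper and the token IDs are made up for illustration and are not part of the `transformers` API.

```python
def strip_prompt(output_ids, input_ids):
    """Drop the echoed prompt tokens from each generated sequence."""
    return [out[len(inp):] for out, inp in zip(output_ids, input_ids)]


# Toy token IDs: a 3-token prompt followed by 2 generated tokens.
prompt = [[101, 7, 42]]
generated = [[101, 7, 42, 900, 901]]
print(strip_prompt(generated, prompt))  # [[900, 901]]
```
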

	### ModelScope

	`ModelScope` adopts a Python API similar to (though not entirely identical to) that of `Transformers`. For basic usage, simply change the first line of the code above as follows:

	```python
	from modelscope import AutoModelForCausalLM, AutoTokenizer
	```
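
If a script should work with either package, one option is to check which one is installed before importing. The sketch below uses only the standard library. `pick_backend` is a hypothetical helper, not part of `ModelScope` or `Transformers`, and the preference order is an assumption.

```python
import importlib.util


def pick_backend(preferred=("modelscope", "transformers")):
    """Return the first importable package name from `preferred`, or None."""
    for name in preferred:
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


print(pick_backend())
```

The returned name can then be passed to `importlib.import_module` to obtain `AutoModelForCausalLM` and `AutoTokenizer` from whichever package is available.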