	---
	license: apache-2.0
	---
	## Introduction

	SmallThinker is a family of on-device native Mixture-of-Experts (MoE) language models built for local deployment,
	co-developed by IPADS and the School of AI at Shanghai Jiao Tong University, together with Zenergize AI.
	Designed from the ground up for resource-constrained environments,
	SmallThinker runs directly on your personal devices, keeping inference private and low-latency
	without relying on the cloud.

	## Performance


	For the MMLU evaluation, we use a 0-shot CoT setting.

	## Model Card

	<div align="center">

	| Architecture | Mixture-of-Experts (MoE) |
	|:---:|:---:|
	| Total Parameters | 21B |
	| Activated Parameters | 3B |
	| Number of Layers | 52 |
	| Attention Hidden Dimension | 2560 |
	| MoE Hidden Dimension (per Expert) | 768 |
	| Number of Attention Heads | 28 |
	| Number of KV Heads | 4 |
	| Number of Experts | 64 |
	| Selected Experts per Token | 6 |
	| Vocabulary Size | 151,936 |
	| Context Length | 16K |
	| Attention Mechanism | GQA |
	| Activation Function | ReGLU |
	</div>
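
A few ratios follow directly from the table above. The sketch below only does arithmetic on the listed values; the dictionary keys and the `derived_ratios` helper are illustrative names, not part of the model's configuration.

```python
# Values copied from the model card table; key names are illustrative.
CARD = {
    "total_params_b": 21,
    "activated_params_b": 3,
    "attention_heads": 28,
    "kv_heads": 4,
    "experts": 64,
    "experts_per_token": 6,
}


def derived_ratios(card):
    """Compute simple ratios implied by the model card values."""
    return {
        # GQA: how many query heads share one key/value head.
        "query_heads_per_kv_head": card["attention_heads"] // card["kv_heads"],
        # MoE: fraction of experts selected for each token.
        "expert_fraction_per_token": card["experts_per_token"] / card["experts"],
        # Fraction of all parameters that are active for each token.
        "activated_param_fraction": card["activated_params_b"] / card["total_params_b"],
    }


for name, value in derived_ratios(CARD).items():
    print(f"{name}: {value:.4g}")
```

With these values, 7 query heads share each KV head, 6 of 64 experts (about 9.4%) are selected per token, and roughly 14% of the parameters are activated per token. The activated-parameter share is larger than the expert share presumably because non-expert components such as attention and embeddings are used for every token.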

	## How to Run

	### Transformers

	`transformers>=4.53.3` is required; the latest version of `transformers` is recommended.
	The following code snippet illustrates how to use the model to generate content based on given inputs.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	path = "PowerInfer/SmallThinker-21BA3B-Instruct"
	device = "cuda"

	# trust_remote_code=True allows transformers to run model code shipped with the repository.
	tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	    path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True
	)

	messages = [
	    {"role": "user", "content": "Give me a short introduction to large language model."},
	]
	# Render the chat template, append the generation prompt, and tokenize to a tensor.
	model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)

	model_outputs = model.generate(
	    model_inputs,
	    do_sample=True,
	    max_new_tokens=1024,
	)

	# generate() returns prompt + completion; keep only the newly generated tokens.
	output_token_ids = [
	    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
	]

	response = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
	print(response)
	```
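
The snippet above slices the prompt off each output sequence before decoding, because `generate` on a decoder-only model returns the prompt tokens followed by the new tokens. The sketch below shows that slicing step in isolation on plain Python lists; `strip_prompt` is a hypothetical helper name and the token ids are made up for illustration.

```python
def strip_prompt(prompt_ids, output_ids):
    """Drop the echoed prompt from each generated sequence in a batch."""
    return [out[len(prompt):] for prompt, out in zip(prompt_ids, output_ids)]


# Made-up token ids: the first three ids of the output repeat the prompt.
prompt_ids = [[101, 7, 42]]
output_ids = [[101, 7, 42, 900, 901]]

print(strip_prompt(prompt_ids, output_ids))  # [[900, 901]]
```

The same indexing works on tensors, which is what the list comprehension in the snippet above relies on.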

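To confirm that the installed `transformers` meets the `>=4.53.3` requirement stated at the start of this section, a small standard-library check along these lines may help. The helper names (`release_tuple`, `meets_minimum`, `check_transformers`) are hypothetical, and the comparison is deliberately simple: it looks only at the numeric release segment and ignores pre-release suffixes such as `rc1` or `.dev0`.

```python
import re
from importlib import metadata

REQUIRED = "4.53.3"  # minimum version stated in this README


def release_tuple(version):
    """Extract the leading numeric release, e.g. '4.54.0.dev0' -> (4, 54, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, required=REQUIRED):
    """Return True if `installed` is at least `required` (suffixes ignored)."""
    a, b = release_tuple(installed), release_tuple(required)
    width = max(len(a), len(b))
    # Pad with zeros so that (5, 0) and (5, 0, 0) compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_transformers():
    """Report whether the installed transformers satisfies REQUIRED."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    status = "OK" if meets_minimum(installed) else f"too old, need >= {REQUIRED}"
    return f"transformers {installed}: {status}"


if __name__ == "__main__":
    print(check_transformers())
```

For stricter handling of pre-releases, a dedicated version parser would be a better fit than this sketch.
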
	### ModelScope

	`ModelScope` adopts a Python API similar to (though not entirely identical to) `Transformers`. For basic usage, simply replace the first line of the above code with:

	```python
	from modelscope import AutoModelForCausalLM, AutoTokenizer
	```
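
If the same script should work with either package installed, one option is to pick the import source at runtime. This is a minimal sketch, not an official pattern from either library: `pick_backend` and `load_auto_classes` are hypothetical helpers, and the sketch assumes only that both packages expose `AutoModelForCausalLM` and `AutoTokenizer` at the top level, as the import lines in this README do.

```python
import importlib
import importlib.util


def pick_backend(candidates=("modelscope", "transformers")):
    """Return the first installed package name from `candidates`, or None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def load_auto_classes(candidates=("modelscope", "transformers")):
    """Import AutoModelForCausalLM and AutoTokenizer from the first available backend."""
    name = pick_backend(candidates)
    if name is None:
        raise ImportError(f"none of these packages is installed: {', '.join(candidates)}")
    module = importlib.import_module(name)
    return module.AutoModelForCausalLM, module.AutoTokenizer
```

Calling `AutoModelForCausalLM, AutoTokenizer = load_auto_classes()` would then stand in for the import line, with the rest of the Transformers snippet unchanged. Since the two APIs are not entirely identical, this is only expected to cover basic usage.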