Llama-3.1-8B-Instruct-da8w8-torchao-v0.16.0
Model Overview
- Model Architecture: LlamaForCausalLM
- Input: Text
- Output: Text
- Source Model: Llama-3.1-8B-Instruct
- Supported Hardware: AMD EPYC (CPU inference)
- Preferred Operating System: Linux
- Inference Engine: vLLM v0.18.0
- Quantization Framework: TorchAO v0.16.0
- Quantization Method: 8-bit Dynamic Activation, 8-bit Weight Quantization, Symmetric
- Compatible Stack:
  - ZenDNN v5.2.1
  - zentorch v5.2.1
  - PyTorch v2.10.0
  - TorchAO v0.16.0
  - vLLM v0.18.0
Note: zentorch v5.2.1 for PyTorch v2.10.0 must be built from source.
This model was Built with Llama. This is a quantized version of Llama-3.1-8B-Instruct created by AMD using TorchAO for ZenDNN-optimized CPU inference.
Quantization
The model was produced using torchao as shown in the example below. Both activations and weights are quantized to INT8 with symmetric mapping. Activation scales are computed dynamically at runtime per token.
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8DynamicActivationInt8WeightConfig
from torchao.quantization.quant_primitives import MappingType
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
OUTPUT_DIR = "amd/Llama-3.1-8B-Instruct-da8w8-torchao-v0.16.0"
modules_to_skip = ["lm_head"]
quantization_config = TorchAoConfig(
    Int8DynamicActivationInt8WeightConfig(
        version=2,
        act_mapping_type=MappingType.SYMMETRIC,
    ),
    modules_to_not_convert=modules_to_skip,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16,
    device_map="cpu",
    quantization_config=quantization_config,
    trust_remote_code=True,
)
model.save_pretrained(OUTPUT_DIR, safe_serialization=False)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer.save_pretrained(OUTPUT_DIR)
# Smoke test
inputs = tokenizer("What are we having for dinner?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30, cache_implementation="static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
safe_serialization=False is required because torchao's quantized tensor subclasses cannot currently be serialized in the safetensors format.
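To make "symmetric, per-token dynamic" quantization concrete, here is a minimal pure-Python sketch of the arithmetic. It is illustrative only and is not torchao's implementation: the helper names are hypothetical, the clamp range of [-127, 127] is an assumption, and real kernels operate on tensors rather than nested lists.

```python
def quantize_per_token_symmetric(rows, qmax=127):
    """Quantize each row (one token's activations) to int8 with its own scale.

    Symmetric mapping means zero maps to zero and no zero-point is stored;
    the scale is derived from the largest absolute value in the row.
    """
    quantized, scales = [], []
    for row in rows:
        max_abs = max(abs(x) for x in row)
        # An all-zero row would give scale 0; fall back to 1.0 to avoid dividing by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q_row = [max(-qmax, min(qmax, round(x / scale))) for x in row]
        quantized.append(q_row)
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Map int8 values back to floats using the per-row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Two "tokens" with very different magnitudes each get their own scale.
activations = [[0.5, -1.0, 0.25], [10.0, 0.0, -5.0]]
q, scales = quantize_per_token_symmetric(activations)
restored = dequantize(q, scales)
```

Because the scale is recomputed for every row at call time, a token with small activations is not forced to share the range of a token with large ones, which is the point of computing activation scales dynamically per token.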
Quick Start
Requirements
pip install --extra-index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://wheels.vllm.ai/cpu/ \
    torch==2.10.0+cpu \
    vllm==0.18.0 \
    torchao==0.16.0 \
    transformers \
    huggingface_hub
CPU runtime libraries (only needed if not already present):
conda install -c conda-forge gperftools=2.17.2 llvm-openmp=18.1.8 --no-deps -y
Recommended environment variables
# vLLM CPU runtime tuning
export VLLM_CPU_KVCACHE_SPACE=40 # GB of host memory for KV cache
export VLLM_CPU_OMP_THREADS_BIND="0-63" # NUMA-local cores
# TorchInductor
export TORCHINDUCTOR_FREEZING=1
export TORCHINDUCTOR_AUTOGRAD_CACHE=1
# Required CPU runtime libraries
export LD_PRELOAD="<path to lib>/libtcmalloc_minimal.so.4:<path to lib>/libiomp5.so${LD_PRELOAD:+:$LD_PRELOAD}"
Locate the libraries with find / -name 'libtcmalloc_minimal.so.4' and find / -name 'libiomp5.so', then substitute the resulting directory for <path to lib>.
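With the requirements and environment in place, offline inference through vLLM's Python API might look like the sketch below. It assumes the standard `vllm.LLM` and `SamplingParams` interface; the helper names `runtime_env` and `generate` are hypothetical, and how zentorch is activated in the process is not covered here. LD_PRELOAD is deliberately left out because it only takes effect when exported in the shell before Python starts.

```python
import os

MODEL_ID = "amd/Llama-3.1-8B-Instruct-da8w8-torchao-v0.16.0"


def runtime_env(kv_cache_gb=40, core_range="0-63"):
    """Return the recommended tuning variables as a dict of strings."""
    return {
        "VLLM_CPU_KVCACHE_SPACE": str(kv_cache_gb),  # GB of host memory for KV cache
        "VLLM_CPU_OMP_THREADS_BIND": core_range,     # cores to bind OpenMP threads to
        "TORCHINDUCTOR_FREEZING": "1",
        "TORCHINDUCTOR_AUTOGRAD_CACHE": "1",
    }


def generate(prompt, max_tokens=64):
    """Run one chat-style generation on CPU and return the generated text."""
    # Set the variables before torch/vLLM are imported so they are picked up.
    os.environ.update(runtime_env())
    from vllm import LLM, SamplingParams  # imported lazily on purpose

    llm = LLM(model=MODEL_ID, dtype="bfloat16")
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.chat([{"role": "user", "content": prompt}], params)
    return outputs[0].outputs[0].text
```

Calling `generate("What is the capital of France?")` loads the model and returns a single completion. Adjust `core_range` and `kv_cache_gb` to match the NUMA layout and memory of the host.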
Evaluation
The model was evaluated against the BF16 (unquantized) baseline using lm-evaluation-harness with the vLLM engine.
| Benchmark | BF16 Baseline | DA8W8 (this model) | Relative Difference vs. BF16 |
|---|---|---|---|
| GSM8K (5-shot, exact-match strict) | 0.8453 | 0.8279 | -2.06% |
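The last column is a relative change with respect to the BF16 score, not an absolute difference in accuracy points. A short check of the arithmetic, using only the two scores from the table (the function name is illustrative):

```python
def relative_change_percent(baseline, candidate):
    """Percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100


bf16_score = 0.8453
da8w8_score = 0.8279

absolute_diff = da8w8_score - bf16_score
relative_diff = relative_change_percent(bf16_score, da8w8_score)

print(f"absolute: {absolute_diff:+.4f}")   # absolute: -0.0174
print(f"relative: {relative_diff:+.2f}%")  # relative: -2.06%
```

In absolute terms the quantized model loses about 1.74 accuracy points on GSM8K.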
Evaluation Command
lm_eval \
    --model vllm \
    --model_args pretrained=amd/Llama-3.1-8B-Instruct-da8w8-torchao-v0.16.0,tokenizer=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16 \
    --tasks gsm8k \
    --batch_size auto \
    --trust_remote_code \
    --num_fewshot 5 \
    --log_samples \
    --gen_kwargs "max_gen_toks=2048" \
    --apply_chat_template \
    --output_path .
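After the run, the harness writes a JSON results file under the output path. The sketch below shows one way to pull out the strict-match score. The key layout (`results` → `gsm8k` → `exact_match,strict-match`) is an assumption based on recent lm-evaluation-harness releases and should be checked against the file actually produced; the sample payload is illustrative and reuses the score reported above.

```python
import json


def gsm8k_strict_score(results):
    """Return the GSM8K strict-match score from a parsed results dict.

    Assumes the harness stores metrics under "<metric>,<filter>" keys.
    """
    return results["results"]["gsm8k"]["exact_match,strict-match"]


# Illustrative payload shaped like a results file; replace with
# json.load(open(<path to the results JSON>)) for a real run.
sample = json.loads(
    '{"results": {"gsm8k": {"exact_match,strict-match": 0.8279}}}'
)
print(gsm8k_strict_score(sample))
```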
Limitations
- Version Lock: This model is quantized with TorchAO v0.16.0 and is compatible only with PyTorch v2.10.0 / ZenDNN v5.2.1. It will not load correctly on other PyTorch versions.
- CPU Only: This model is optimized for AMD EPYC CPU inference via ZenDNN. It is not intended for GPU inference.
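Given the version lock, it can help to verify the environment before loading the model. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper names are hypothetical, and zentorch is omitted because it is built from source and its reported version string is not known here.

```python
from importlib import metadata

# Versions taken from the compatible stack listed in this card.
REQUIRED = {"torch": "2.10.0", "torchao": "0.16.0", "vllm": "0.18.0"}


def base_version(version):
    """Drop a local suffix such as '+cpu' so '2.10.0+cpu' compares as '2.10.0'."""
    return version.split("+", 1)[0]


def find_mismatches(required=REQUIRED, lookup=metadata.version):
    """Return {package: installed_version_or_None} for every package that differs."""
    mismatches = {}
    for package, expected in required.items():
        try:
            installed = lookup(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed is None or base_version(installed) != expected:
            mismatches[package] = installed
    return mismatches
```

An empty dict from `find_mismatches()` means the three pinned packages match; any entry names a package to reinstall at the pinned version.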
License
This model is distributed under the same license as the source model. See the LICENSE file for details.
Modifications copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved.