metadata
library_name: transformers
tags:
  - torchao
  - phi
  - phi4
  - nlp
  - code
  - math
  - chat
  - conversational
license: mit
language:
  - multilingual
base_model:
  - microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation

Phi-4-mini-instruct quantized by the PyTorch team with torchao int4 weight-only quantization (HQQ enabled).

Quantization Recipe

First, install the required packages:

pip install git+https://github.com/huggingface/transformers
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

We used the following code to produce the quantized model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "microsoft/Phi-4-mini-instruct"

from torchao.quantization import Int4WeightOnlyConfig
quant_config = Int4WeightOnlyConfig(group_size=128, use_hqq=True)
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to hub
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-int4wo-hqq"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)

# Manual Testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the echoed prompt
new_token_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
output_text = tokenizer.batch_decode(
    new_token_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0])

# Local Benchmark
import torch.utils.benchmark as benchmark
import torchao

def benchmark_fn(f, *args, **kwargs):
    # Manual warmup
    for _ in range(2):
        f(*args, **kwargs)

    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)",
        globals={"args": args, "kwargs": kwargs, "f": f},
        num_threads=torch.get_num_threads(),
    )
    # blocked_autorange().mean is the mean time per call, in seconds
    return f"{(t0.blocked_autorange().mean):.3f}"

torchao.quantization.utils.recommended_inductor_config_setter()
quantized_model = torch.compile(quantized_model, mode="max-autotune")
print(f"{save_to} model:", benchmark_fn(quantized_model.generate, **inputs, max_new_tokens=128))
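To give some intuition for what `group_size=128` means in the recipe above, here is a toy sketch of group-wise 4-bit weight quantization in plain Python. It assumes simple asymmetric min/max rounding per group. It is not torchao's implementation, it does not pack the codes, and it does not reproduce what `use_hqq=True` does when choosing the quantization parameters. The helper names are made up for illustration.

```python
import random

def quantize_group(values, n_bits=4):
    """Asymmetric min/max quantization of one group to unsigned n_bits codes."""
    qmax = 2 ** n_bits - 1                      # 15 for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    # Map each value to the nearest code, clamped to [0, qmax]
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Reconstruct approximate float values from the codes."""
    return [c * scale + lo for c in codes]

def quantize_weights(weights, group_size=128):
    """Split a flat weight list into groups, each with its own scale and offset."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]   # made-up weights
groups = quantize_weights(weights, group_size=128)
restored = [w for g in groups for w in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(groups)} groups, max abs error {max_err:.5f}")
```

Each group stores its own scale and offset, so a smaller group size tracks the weights more closely at the cost of more per-group metadata.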

Serving with vLLM

To serve the model with vLLM, use the same command as in the serving benchmarks:

vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
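Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server is listening on the default `http://localhost:8000`. The helper names `build_chat_request` and `chat` are hypothetical, and the `max_tokens` value simply mirrors the `max_new_tokens=128` used in the manual test.

```python
import json
import urllib.request

MODEL = "pytorch/Phi-4-mini-instruct-int4wo-hqq"

def build_chat_request(prompt, model=MODEL,
                       base_url="http://localhost:8000/v1", max_tokens=128):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running:
# print(chat("Hey, are you conscious? Can you talk to me?"))
```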

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model.

Install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install

baseline

lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8

int4wo-hqq

lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hqq --tasks hellaswag --device cuda:0 --batch_size 8

| Benchmark | Phi-4 mini-Ins | phi4-mini-int4wo |
|---|---|---|
| Popular aggregated benchmark | | |
| mmlu (0-shot) | | 63.56 |
| mmlu_pro (5-shot) | | 36.74 |
| Reasoning | | |
| arc_challenge (0-shot) | 56.91 | 54.86 |
| gpqa_main_zeroshot | 30.13 | 30.58 |
| HellaSwag | 54.57 | 53.54 |
| openbookqa | 33.00 | 34.40 |
| piqa (0-shot) | 77.64 | 76.33 |
| social_iqa | 49.59 | 47.90 |
| truthfulqa_mc2 (0-shot) | 48.39 | 46.44 |
| winogrande (0-shot) | 71.11 | 71.51 |
| Multilingual | | |
| mgsm_en_cot_en | 60.8 | 59.6 |
| Math | | |
| gsm8k (5-shot) | 81.88 | 74.37 |
| mathqa (0-shot) | 42.31 | 42.75 |
| Overall | TODO | TODO |
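The commands above only show `hellaswag`, while the table covers more tasks, some of them few-shot. As a rough sketch, the helper below builds the equivalent `lm_eval` command line for any task. It assumes the shot counts in the table labels were passed with lm-eval's `--num_fewshot` option, which this card does not state, and that the table's task names match lm-eval task names. `lm_eval_command` is a hypothetical helper, not part of lm-eval.

```python
import shlex

def lm_eval_command(model_id, task, num_fewshot=None,
                    device="cuda:0", batch_size=8):
    """Build an lm_eval command line in the same shape as the ones above."""
    cmd = ["lm_eval", "--model", "hf",
           "--model_args", f"pretrained={model_id}",
           "--tasks", task]
    if num_fewshot is not None:
        # Assumption: few-shot tasks were run with --num_fewshot
        cmd += ["--num_fewshot", str(num_fewshot)]
    cmd += ["--device", device, "--batch_size", str(batch_size)]
    return shlex.join(cmd)

for model_id in ("microsoft/Phi-4-mini-instruct",
                 "pytorch/Phi-4-mini-instruct-int4wo-hqq"):
    print(lm_eval_command(model_id, "gsm8k", num_fewshot=5))
```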

Model Performance

Our int4wo is only optimized for batch size 1, so expect a slowdown at larger batch sizes. We expect this model to be used in local server deployments for a single user or a few users, where decode tokens per second matters more than time to first token.
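To make the two metrics concrete, here is a minimal sketch that derives both from per-token arrival timestamps of a single streamed response. The function name and the timestamps are made up for illustration and are not measurements of this model.

```python
def decode_metrics(request_start, token_times):
    """Return (time_to_first_token, decode_tokens_per_second).

    request_start: time the request was sent, in seconds
    token_times:   arrival time of each generated token, in seconds
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    # Decode speed is measured over the tokens that follow the first one
    decode_span = token_times[-1] - token_times[0]
    n_decode = len(token_times) - 1
    tps = n_decode / decode_span if decode_span > 0 else float("nan")
    return ttft, tps

# Hypothetical stream: first token after 0.5 s, then one token every 20 ms
times = [0.5 + 0.02 * i for i in range(101)]
ttft, tps = decode_metrics(0.0, times)
print(f"TTFT {ttft:.2f} s, decode {tps:.1f} tokens/s")
```

For a single interactive user, the first number is paid once per request while the second governs how fast the rest of the answer streams in.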

Install the vLLM nightly build to pick up some recent changes:

pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Results (A100 machine)

| Benchmark (Latency) | Phi-4 mini-Ins | phi4-mini-int4wo-hqq |
|---|---|---|
| latency (batch_size=1) | 2.46s | 2.2s (12% speedup) |
| latency (batch_size=128) | 6.55s | 17s (60% slowdown) |
| serving (num_prompts=1) | 0.87 req/s | 1.05 req/s (20% speedup) |
| serving (num_prompts=1000) | 24.15 req/s | 5.64 req/s (77% slowdown) |

Note that latency results (benchmark_latency) are in seconds and serving results (benchmark_serving) are in requests per second. Int4 weight-only quantization is optimized for batch size 1 and short input and output token lengths. Stay tuned for models optimized for larger batch sizes or longer token lengths.
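The speedup and slowdown percentages can be approximately reproduced by treating each one as a relative change in throughput, where latency is inverted because lower is better. The card does not state the formula, so this is an assumption, and `throughput_change_pct` is a hypothetical helper.

```python
def throughput_change_pct(baseline, quantized, lower_is_better=False):
    """Relative throughput change in percent (positive means faster)."""
    if lower_is_better:
        # For latency, throughput is proportional to 1 / latency
        ratio = baseline / quantized
    else:
        ratio = quantized / baseline
    return (ratio - 1.0) * 100.0

# (label, baseline, int4wo-hqq, lower_is_better), values from the results table
rows = [
    ("latency (batch_size=1)", 2.46, 2.2, True),
    ("latency (batch_size=128)", 6.55, 17.0, True),
    ("serving (num_prompts=1)", 0.87, 1.05, False),
    ("serving (num_prompts=1000)", 24.15, 5.64, False),
]
changes = {label: throughput_change_pct(b, q, lib) for label, b, q, lib in rows}
for label, pct in changes.items():
    print(f"{label}: {pct:+.1f}%")
```

Under this reading the results land within a point or two of the rounded figures reported in the table.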

Benchmark (Memory, TODO)
Phi-4 mini-Ins phi4-mini-int4wo-hqq

Download dataset

Download the ShareGPT dataset:

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

Other datasets can be found at: https://github.com/vllm-project/vllm/tree/main/benchmarks

benchmark_latency

Run the following from the root of the vLLM source tree:

baseline

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1

int4wo-hqq

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-int4wo-hqq --batch-size 1

benchmark_serving

We also benchmarked the throughput in a serving environment.

Run the following from the root of the vLLM source tree:

baseline

Server:

vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3

Client:

python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1

int4wo-hqq

Server:

vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3

Client:

python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-int4wo-hqq --num-prompts 1