---
library_name: transformers
tags:
- torchao
- phi
- phi4
- nlp
- code
- math
- chat
- conversational
license: mit
language:
- multilingual
base_model:
- microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation
---

[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) model quantized by the PyTorch team with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao), using float8 dynamic activation quantization and float8 weight quantization (per-row granularity).
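
To make "per-row granularity" concrete, the sketch below simulates it in plain Python: each row of a matrix gets its own scale so that the row's largest magnitude maps to the top of the float8 range. It is an illustration only, not torchao's implementation. It assumes the e4m3 float8 format (largest finite value 448, 3 mantissa bits), and the helper names `round_to_e4m3`, `quantize_per_row`, and `dequantize_per_row` are hypothetical.

```python
import math

# Assumption: float8 e4m3fn, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def round_to_e4m3(x: float) -> float:
    """Simulate rounding a Python float to the nearest float8 e4m3fn value."""
    if x == 0.0:
        return 0.0
    # Saturate to the representable range.
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    _, e = math.frexp(abs(x))        # abs(x) = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)        # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)     # 3 mantissa bits -> 8 steps per binade
    return round(x / step) * step


def quantize_per_row(matrix):
    """Return (quantized rows, one scale per row)."""
    quantized, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.append([round_to_e4m3(v / scale) for v in row])
    return quantized, scales


def dequantize_per_row(quantized, scales):
    """Multiply each quantized row by its scale to recover approximate values."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Two rows with very different magnitudes: each keeps its own scale.
weights = [[0.01, -0.02, 0.005], [10.0, -3.0, 7.5]]
q_rows, row_scales = quantize_per_row(weights)
print("scales:", row_scales)
print("dequantized:", dequantize_per_row(q_rows, row_scales))
```

With a single scale for the whole matrix, the small first row would be pushed toward the bottom of the float8 range; a separate scale per row avoids that. "Dynamic" activation quantization means the activation scales are computed this way at inference time rather than stored in advance.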


# Quantization Recipe

First, install the required packages:

```
pip install git+https://github.com/huggingface/transformers
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
```

We used the following code to produce the quantized model:

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "microsoft/Phi-4-mini-instruct"

from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to hub
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-float8dq"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)

# Manual Testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
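generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens so the templated prompt is not echoed back
new_token_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
output_text = tokenizer.batch_decode(
    new_token_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0])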
```

# Serving with vllm
The model can be served with vLLM using the same command as in the serving benchmarks:
```
vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
```
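
Once the server is up, it can be queried over vLLM's OpenAI-compatible HTTP API. The following is a minimal client sketch using only the standard library. It assumes the server runs locally on vLLM's default port 8000; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: `vllm serve` is running on this machine with its default port.
BASE_URL = "http://localhost:8000/v1"
MODEL = "pytorch/Phi-4-mini-instruct-float8dq"


def build_chat_request(prompt, max_tokens=128):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL,  # must match the model name passed to `vllm serve`
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Hey, are you conscious? Can you talk to me?")` while the server is running returns the model's reply as a string.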

# Model Quality
We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

## baseline
```
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```

## float8dq
```
lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
```

| Benchmark                        | Phi-4 mini-Ins | phi4-mini-float8dq |
|----------------------------------|----------------|--------------------|
| **Popular aggregated benchmark** |                |                    |
| mmlu (0-shot)                    |                | x                  |
| mmlu_pro (5-shot)                |                | x                  |
| **Reasoning**                    |                |                    |
| arc_challenge (0-shot)           | 56.91          | x                  |
| gpqa_main_zeroshot               | 30.13          | x                  |
| HellaSwag                        | 54.57          | 54.55              |
| openbookqa                       | 33.00          | x                  |
| piqa (0-shot)                    | 77.64          | x                  |
| social_iqa                       | 49.59          | x                  |
| truthfulqa_mc2 (0-shot)          | 48.39          | x                  |
| winogrande (0-shot)              | 71.11          | x                  |
| **Multilingual**                 |                |                    |
| mgsm_en_cot_en                   | 60.8           | 60.0               |
| **Math**                         |                |                    |
| gsm8k (5-shot)                   | 81.88          | 80.89              |
| mathqa (0-shot)                  | 42.31          | 42.51              |
| **Overall**                      | **TODO**       | **TODO**           |
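
For the four benchmarks where both columns are filled in, the gap between the two models can be computed directly. The sketch below only restates numbers from the table; the variable names are illustrative, and it does not attempt to fill in the missing entries or the overall score.

```python
# (baseline score, quantized score) copied from the table above
scores = {
    "HellaSwag": (54.57, 54.55),
    "mgsm_en_cot_en": (60.8, 60.0),
    "gsm8k (5-shot)": (81.88, 80.89),
    "mathqa (0-shot)": (42.31, 42.51),
}

# Positive delta: the quantized model scored higher than the baseline.
deltas = {name: round(quant - base, 2) for name, (base, quant) in scores.items()}

for name, delta in deltas.items():
    print(f"{name:18s} {delta:+.2f}")
```

On these four tasks the quantized model stays within one point of the baseline.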

# Model Performance

## Results (H100 machine)
| Benchmark                  | Phi-4 mini-Ins | phi4-mini-float8dq        |
|----------------------------|----------------|---------------------------|
| latency (batch_size=1)     | 1.64s          | 1.41s (16% speedup)       |
| latency (batch_size=128)   | 3.1s           | 2.72s (14% speedup)       |
| serving (num_prompts=1)    | 1.35 req/s     | 1.57 req/s (16% speedup)  |
| serving (num_prompts=1000) | 66.68 req/s    | 80.53 req/s (21% speedup) |

Latency results (`benchmark_latency`) are reported in seconds; serving results (`benchmark_serving`) are reported in requests per second.
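
The speedup percentages in the table are consistent with taking the ratio of the two measurements: baseline over quantized for latency (lower is better), quantized over baseline for throughput (higher is better). A small sketch that reproduces them, with illustrative function names:

```python
def latency_speedup(baseline_s, quantized_s):
    """Percent speedup when lower is better (seconds)."""
    return (baseline_s / quantized_s - 1.0) * 100.0


def throughput_speedup(baseline_rps, quantized_rps):
    """Percent speedup when higher is better (requests per second)."""
    return (quantized_rps / baseline_rps - 1.0) * 100.0


print(round(latency_speedup(1.64, 1.41)))       # 16
print(round(latency_speedup(3.1, 2.72)))        # 14
print(round(throughput_speedup(1.35, 1.57)))    # 16
print(round(throughput_speedup(66.68, 80.53)))  # 21
```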

## Download dataset
Download the ShareGPT dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`

Other datasets can be found at: https://github.com/vllm-project/vllm/tree/main/benchmarks

## benchmark_latency

Run the following from the root of the `vllm` source tree:

### baseline
```
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
```

### float8dq
```
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1
```
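
The results table also reports latency at batch size 128, for which no command is shown. The sketch below generates the command for every model and batch size combination, assuming the batch-size-128 runs used the same flags with only `--batch-size` changed. The helper names are hypothetical.

```python
MODELS = {
    "baseline": "microsoft/Phi-4-mini-instruct",
    "float8dq": "pytorch/Phi-4-mini-instruct-float8dq",
}
# Assumption: batch size 128 was run with otherwise identical flags.
BATCH_SIZES = [1, 128]


def latency_command(model, batch_size):
    """Argument list for one benchmark_latency.py run."""
    return [
        "python", "benchmarks/benchmark_latency.py",
        "--input-len", "256",
        "--output-len", "256",
        "--model", model,
        "--batch-size", str(batch_size),
    ]


def all_latency_commands():
    """Map (label, batch_size) to the command to run."""
    return {
        (label, batch_size): latency_command(model, batch_size)
        for label, model in MODELS.items()
        for batch_size in BATCH_SIZES
    }


for key, command in all_latency_commands().items():
    print(key, " ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` from the `vllm` source root to execute the run.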

## benchmark_serving

We also benchmarked the throughput in a serving environment.

Run the following from the root of the `vllm` source tree:

### baseline
Server:
```
vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
```

Client:
```
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
```

### float8dq
Server:
```
vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
```

Client:
```
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1
```