English | [简体中文](README.md)

<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="./docs/source/assets/logos/angelslim_logo_light.png">
<img alt="AngelSlim" src="./docs/source/assets/logos/angelslim_logo.png" width=55%>
</picture>
</p>

<h3 align="center">
Dedicated to building a more intuitive, comprehensive, and efficient LLM compression toolkit.
</h3>

<p align="center">
📖 <a href="https://angelslim.readthedocs.io/">Documentation</a>   |   🤗 <a href="https://huggingface.co/AngelSlim">Hugging Face</a>   |   🤖 <a href="https://modelscope.cn/organization/AngelSlim">ModelScope</a>   |   💬 <a href="./docs/source/assets/angel_slim_wechat.png">WeChat</a>   |   🫨 <a href="https://discord.com/invite/dHVNeuNdFt">Discord</a>
<br>
</p>

## Table of Contents

- [Latest Updates](#latest-updates)
- [Key Features](#key-features)
- [Supported Models](#supported-models)
- [How to Use](#how-to-use)
  - [Install AngelSlim](#install-angelslim)
  - [Quick Start](#quick-start)
  - [Deployment & Evaluation](#deployment-and-testing)
- [Benchmark](#benchmark)
- [License](#license)
- [Citation](#citation)
- [Technical Discussion](#technical-discussion)

## 📣Latest Updates

- [25/08/04] We now support quantization for `Hunyuan 0.5B/1.8B/4B/7B` and the multimodal models `Qwen2.5VL 3B/7B/32B/72B`, including `FP8/INT4` algorithms. We also open-source Eagle3 model weights for the `Hunyuan 1.8B/4B/7B` series.
- [25/07/04] We now support quantization for `Hunyuan/Qwen2.5/Qwen3/DeepSeek-R1-Distill-Qwen` and other models, including `INT8/FP8/INT4` algorithms. We also open-source Eagle3 model weights for the `Qwen3` series.

Coming soon:

- [ ] Support W4A8 quantization for DeepSeek-R1.
- [ ] Release a new algorithm for speculative sampling.

## 🌟Key Features

- **Highly Integrated**: This toolkit integrates mainstream compression algorithms into a unified framework, offering developers one-click access with exceptional ease of use.
- **Continuous Innovation**: Beyond integrating widely used industry algorithms, we are continuously researching better compression algorithms, which will be open-sourced over time.
- **Performance-Driven**: We continuously optimize end-to-end performance of model compression workflows and algorithm deployment, for example enabling quantization of models such as Qwen3-235B and DeepSeek-R1 on a single GPU.

## 💼Supported Models

### Quantization

Quantization currently supports the following LLMs, including Hunyuan-Dense, Hunyuan-MoE, Qwen3-Dense, Qwen3-MoE, Qwen2.5, DeepSeek-R1-distilled Qwen models, and QwQ:

| Model | FP8-Dynamic | FP8-Static | INT8-Dynamic | INT4-GPTQ | INT4-AWQ |
| ----- | ----------- | ---------- | ------------ | --------- | -------- |
| [Hunyuan-Dense](https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Hunyuan-MoE](https://huggingface.co/collections/tencent/hunyuan-a13b-685ec38e5b46321e3ea7c4be) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen3-Dense](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen3-MoE](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen2.5](https://huggingface.co/collections/AngelSlim/qwen2-25-quant-68652d6cbdf5c0d4b1c4499a) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [DeepSeek-R1-Distill-Qwen](https://huggingface.co/collections/AngelSlim/deepseek-r1-distill-quant-68652f16a9c206b030b05f7f) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [QwQ](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |

### Speculative Decoding

#### Eagle3

Eagle3 weights for the Qwen3 and Hunyuan series models are now available.

| Qwen3 Models | Hunyuan Models |
| ------------ | -------------- |
| ✅ [Qwen3-1.7B](https://huggingface.co/AngelSlim/Qwen3-1.7B_eagle3) | ✅ [Hunyuan-1.8B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-1.8B-Instruct_eagle3) |
| ✅ [Qwen3-4B](https://huggingface.co/AngelSlim/Qwen3-4B_eagle3) | ✅ [Hunyuan-4B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-4B-Instruct_eagle3) |
| ✅ [Qwen3-8B](https://huggingface.co/AngelSlim/Qwen3-8B_eagle3) | ✅ [Hunyuan-7B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-7B-Instruct_eagle3) |
| ✅ [Qwen3-14B](https://huggingface.co/AngelSlim/Qwen3-14B_eagle3) | |
| ✅ [Qwen3-32B](https://huggingface.co/AngelSlim/Qwen3-32B_eagle3) | |
| ✅ [Qwen3-30B-A3B](https://huggingface.co/AngelSlim/Qwen3-a3B_eagle3) | |

## 🛎️How to Use

### Install AngelSlim

We recommend using `pip` to install the latest stable version of `AngelSlim`:

```shell
pip install angelslim
```

Alternatively, you can clone the repository and install from source:

```shell
cd AngelSlim && python setup.py install
```

For more detailed installation instructions, please refer to the [Installation Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/installation.html).

### Quick Start

After installing `AngelSlim`, you can quickly start by running the following script to perform static `FP8` quantization on the `Qwen3-1.7B` model:

* One-click Start

```shell
python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
```

This example loads the Hugging Face model, performs activation calibration using the `dataset` specified in the config file, and saves the quantized model weights.

* Code-based Start

To perform dynamic `FP8` quantization on `Qwen3-1.7B`:

```python
from angelslim.engine import Engine

slim_engine = Engine()
# Prepare model
slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B")
# Initialize compressor
slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
# Compress model
slim_engine.run()
# Save compressed model
slim_engine.save("./output")
```

For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).

### Deployment and Testing

#### 1. Offline Inference

If you need to load a quantized model via `transformers`, set `deploy_backend: huggingface` in the `global` section of the configuration before quantizing the model, or manually rename the `ignored_layers` field to `ignore` in the `config.json` of the quantized model output directory.

To test offline inference with a quantized model loaded via `transformers`, run the following command:

```shell
python deploy/offline.py $MODEL_PATH
```

Here `$MODEL_PATH` is the path to the quantized model output directory.
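
For illustration, here is a minimal sketch of loading the quantized output directly with `transformers`. It assumes the checkpoint loads like a regular Hugging Face causal-LM checkpoint; the `./output` path and the prompt are placeholders, and `deploy/offline.py` remains the supported entry point:

```python
# Hedged sketch: load the quantized output directory as a standard
# Hugging Face checkpoint and run a short generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./output"  # placeholder: quantized model output directory
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

inputs = tokenizer("Hello, AngelSlim!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```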

#### 2. API Service Deployment

After specifying the quantized model path `MODEL_PATH`, you can deploy an OpenAI-compatible API service using the following LLM inference frameworks:


**vLLM**

Use the following script to launch a [vLLM](https://github.com/vllm-project/vllm) server; version `vllm>=0.8.5.post1` is recommended. For MoE INT8 quantized models, `vllm>=0.9.0` is required.

```shell
bash deploy/run_vllm.sh $MODEL_PATH
```

**SGLang**

Use the following script to launch an [SGLang](https://github.com/sgl-project/sglang) server; version `sglang>=0.4.6.post1` is recommended.

```shell
bash deploy/run_sglang.sh $MODEL_PATH
```

#### 3. Service Invocation

Invoke requests via [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction):

```shell
bash deploy/openai.sh $MODEL_PATH
```
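
For reference, a minimal sketch of calling the deployed service with the `openai` Python client; the base URL, port, API key, and served model name are assumptions that depend on how the server was launched:

```python
# Hedged sketch: OpenAI-compatible chat request against a locally
# deployed vLLM/SGLang server (assumed to listen on localhost:8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="/path/to/quantized_model",  # placeholder: the served model name
    messages=[{"role": "user", "content": "Introduce AngelSlim in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```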
#### 4. Performance Evaluation

Evaluate the performance of the quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); version `lm-eval>=0.4.8` is recommended:

```shell
bash deploy/lm_eval.sh $MODEL_PATH
```
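
For a programmatic variant, here is a hedged sketch using the harness's Python API. `simple_evaluate` is part of `lm-eval` 0.4.x, but the model path, task list, and batch size below are placeholders rather than what `deploy/lm_eval.sh` actually runs:

```python
# Hedged sketch: evaluate a quantized checkpoint with lm-eval's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face backend
    model_args="pretrained=/path/to/quantized_model",  # placeholder path
    tasks=["gsm8k"],  # placeholder task list
    batch_size=8,
)
print(results["results"])
```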

For more details, please refer to the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).

## 📈 Benchmark

### (1) Quantization

Performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html).

#### Hunyuan Series Models

Benchmark results for `Hunyuan-Instruct` models with `FP8`, `INT4-AWQ`, and `INT4-GPTQ` quantization algorithms on `OlympiadBench`, `AIME 2024`, `DROP`, and `GPQA-Diamond`:

<table>
<thead>
<tr><th>Model</th><th>Quantization</th><th>OlympiadBench</th><th>AIME 2024</th><th>DROP</th><th>GPQA-Diamond</th></tr>
</thead>
<tbody>
<tr><td rowspan="4">Hunyuan-A13B-Instruct</td>
<td>BF16</td><td>82.7</td><td>87.3</td><td>91.1</td><td>71.2</td></tr>
<tr><td>FP8-Static</td><td>83.0</td><td>86.7</td><td>91.1</td><td>-</td></tr>
<tr><td>INT4-GPTQ</td><td>82.7</td><td>86.7</td><td>91.1</td><td>-</td></tr>
<tr><td>INT4-AWQ</td><td>82.6</td><td>85.6</td><td>91.0</td><td>-</td></tr>
</tbody>
<tbody>
<tr><td rowspan="4">Hunyuan-7B-Instruct</td>
<td>BF16</td><td>76.5</td><td>81.1</td><td>85.9</td><td>60.1</td></tr>
<tr><td>FP8-Static</td><td>76.6</td><td>80.9</td><td>86.0</td><td>60.1</td></tr>
<tr><td>INT4-GPTQ</td><td>76.2</td><td>81.0</td><td>85.7</td><td>60.0</td></tr>
<tr><td>INT4-AWQ</td><td>76.4</td><td>80.9</td><td>85.9</td><td>60.1</td></tr>
</tbody>
<tbody>
<tr><td rowspan="4">Hunyuan-4B-Instruct</td>
<td>BF16</td><td>73.1</td><td>78.3</td><td>78.2</td><td>61.1</td></tr>
<tr><td>FP8-Static</td><td>73.1</td><td>76.6</td><td>78.3</td><td>60.2</td></tr>
<tr><td>INT4-GPTQ</td><td>72.9</td><td>-</td><td>78.1</td><td>58.1</td></tr>
<tr><td>INT4-AWQ</td><td>72.8</td><td>-</td><td>78.2</td><td>-</td></tr>
</tbody>
<tbody>
<tr><td rowspan="4">Hunyuan-1.8B-Instruct</td>
<td>BF16</td><td>63.4</td><td>56.7</td><td>76.7</td><td>47.2</td></tr>
<tr><td>FP8-Static</td><td>62.5</td><td>55.2</td><td>75.1</td><td>47.7</td></tr>
<tr><td>INT4-GPTQ</td><td>60.9</td><td>-</td><td>73.0</td><td>44.4</td></tr>
<tr><td>INT4-AWQ</td><td>61.7</td><td>-</td><td>71.7</td><td>43.6</td></tr>
</tbody>
<tbody>
<tr><td rowspan="4">Hunyuan-0.5B-Instruct</td>
<td>BF16</td><td>29.6</td><td>17.2</td><td>52.8</td><td>23.3</td></tr>
<tr><td>FP8-Static</td><td>29.6</td><td>17.2</td><td>51.6</td><td>22.5</td></tr>
<tr><td>INT4-GPTQ</td><td>26.8</td><td>-</td><td>50.9</td><td>23.3</td></tr>
<tr><td>INT4-AWQ</td><td>26.3</td><td>-</td><td>48.9</td><td>23.3</td></tr>
</tbody>
</table>

#### Qwen3 Series Models

Benchmark results for Qwen3 series models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, `GSM8K`, and `HUMANEVAL`:

<table>
<thead>
<tr><th>Model</th><th>Quantization</th><th>CEVAL</th><th>MMLU</th><th>GSM8K</th><th>HUMANEVAL</th></tr>
</thead>
<tbody>
<tr><td rowspan="4">Qwen3-0.6B</td><td>BF16</td><td>45.84</td><td>47.21</td><td>42.99</td><td>19.51</td></tr>
<tr><td>FP8-Static</td><td>45.99</td><td>46.87</td><td>38.06</td><td>18.90</td></tr>
<tr><td>FP8-Dynamic</td><td>45.99</td><td>46.93</td><td>38.29</td><td>20.73</td></tr>
<tr><td>INT8-Dynamic</td><td>45.17</td><td>46.95</td><td>41.17</td><td>21.34</td></tr>
<tr><td rowspan="6">Qwen3-8B</td><td>BF16</td><td>79.27</td><td>74.78</td><td>87.79</td><td>63.41</td></tr>
<tr><td>FP8-Static</td><td>78.23</td><td>74.79</td><td>86.96</td><td>62.20</td></tr>
<tr><td>FP8-Dynamic</td><td>78.45</td><td>74.75</td><td>87.64</td><td>62.80</td></tr>
<tr><td>INT8-Dynamic</td><td>78.01</td><td>74.84</td><td>86.96</td><td>67.07</td></tr>
<tr><td>INT4-GPTQ</td><td>77.19</td><td>73.26</td><td>86.43</td><td>62.20</td></tr>
<tr><td>INT4-AWQ</td><td>76.15</td><td>73.59</td><td>86.96</td><td>63.41</td></tr>
<tr><td rowspan="6">Qwen3-14B</td><td>BF16</td><td>83.06</td><td>78.90</td><td>88.40</td><td>55.49</td></tr>
<tr><td>FP8-Static</td><td>82.62</td><td>78.57</td><td>89.46</td><td>57.32</td></tr>
<tr><td>FP8-Dynamic</td><td>82.24</td><td>78.92</td><td>88.32</td><td>52.44</td></tr>
<tr><td>INT8-Dynamic</td><td>81.87</td><td>78.13</td><td>86.28</td><td>56.10</td></tr>
<tr><td>INT4-GPTQ</td><td>81.05</td><td>78.02</td><td>87.34</td><td>57.93</td></tr>
<tr><td>INT4-AWQ</td><td>82.02</td><td>77.68</td><td>84.23</td><td>61.59</td></tr>
<tr><td rowspan="5">Qwen3-32B</td><td>BF16</td><td>86.55</td><td>82.00</td><td>74.53</td><td>37.80</td></tr>
<tr><td>FP8-Static</td><td>86.92</td><td>81.78</td><td>70.20</td><td>39.63</td></tr>
<tr><td>FP8-Dynamic</td><td>86.55</td><td>81.89</td><td>70.43</td><td>38.41</td></tr>
<tr><td>INT4-GPTQ</td><td>86.18</td><td>81.01</td><td>-</td><td>43.29</td></tr>
<tr><td>INT4-AWQ</td><td>86.18</td><td>81.54</td><td>-</td><td>36.59</td></tr>
<tr><td rowspan="4">Qwen3-30B-A3B</td><td>BF16</td><td>83.66</td><td>79.36</td><td>89.99</td><td>31.71</td></tr>
<tr><td>FP8-Static</td><td>83.95</td><td>79.47</td><td>89.01</td><td>31.10</td></tr>
<tr><td>FP8-Dynamic</td><td>84.10</td><td>79.40</td><td>89.16</td><td>32.93</td></tr>
<tr><td>INT8-Dynamic</td><td>83.36</td><td>79.48</td><td>89.16</td><td>34.15</td></tr>
<tr><td rowspan="4">Qwen3-235B-A22B</td><td>BF16</td><td>89.60</td><td>86.28</td><td>85.29</td><td>27.44</td></tr>
<tr><td>FP8-Static</td><td>89.67</td><td>86.19</td><td>86.96</td><td>27.44</td></tr>
<tr><td>FP8-Dynamic</td><td>89.67</td><td>86.18</td><td>85.22</td><td>28.05</td></tr>
<tr><td>INT8-Dynamic</td><td>88.93</td><td>86.20</td><td>86.20</td><td>23.78</td></tr>
<tr><td rowspan="5">QwQ-32B</td><td>BF16</td><td>85.74</td><td>82.03</td><td>73.31</td><td>42.68</td></tr>
<tr><td>FP8-Static</td><td>85.44</td><td>81.91</td><td>75.36</td><td>42.68</td></tr>
<tr><td>FP8-Dynamic</td><td>85.07</td><td>81.93</td><td>75.66</td><td>42.07</td></tr>
<tr><td>INT4-GPTQ</td><td>84.03</td><td>81.26</td><td>68.23</td><td>45.73</td></tr>
<tr><td>INT4-AWQ</td><td>83.58</td><td>81.01</td><td>68.69</td><td>43.29</td></tr>
</tbody>
</table>

#### Qwen2.5VL Series Models

Benchmark results for Qwen2.5VL series models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms (against the `BF16` baseline) on datasets including `MMMU_VAL`, `DocVQA_VAL`, and `ChartQA_TEST`:

<table>
<thead>
<tr><th>Model</th><th>Quantization</th><th>MMMU_VAL</th><th>DocVQA_VAL</th><th>ChartQA_TEST</th></tr>
</thead>
<tbody>
<tr><td rowspan="5">Qwen2.5VL-3B</td><td>BF16</td><td>47.11</td><td>78.57</td><td>80.32</td></tr>
<tr><td>FP8-Static</td><td>47.33</td><td>79.34</td><td>79.68</td></tr>
<tr><td>FP8-Dynamic</td><td>45.99</td><td>46.93</td><td>38.29</td></tr>
<tr><td>INT4-GPTQ</td><td>46.56</td><td>77.20</td><td>78.96</td></tr>
<tr><td>INT4-AWQ</td><td>45.78</td><td>-</td><td>79.60</td></tr>
<tr><td rowspan="5">Qwen2.5VL-7B</td><td>BF16</td><td>45.44</td><td>89.71</td><td>84.64</td></tr>
<tr><td>FP8-Static</td><td>47.00</td><td>89.83</td><td>85.92</td></tr>
<tr><td>FP8-Dynamic</td><td>47.22</td><td>89.80</td><td>88.64</td></tr>
<tr><td>INT4-GPTQ</td><td>46.67</td><td>90.45</td><td>-</td></tr>
<tr><td>INT4-AWQ</td><td>45.67</td><td>89.28</td><td>-</td></tr>
<tr><td rowspan="5">Qwen2.5VL-32B</td><td>BF16</td><td>57.00</td><td>90.03</td><td>-</td></tr>
<tr><td>FP8-Static</td><td>57.00</td><td>89.88</td><td>-</td></tr>
<tr><td>FP8-Dynamic</td><td>56.44</td><td>89.88</td><td>-</td></tr>
<tr><td>INT4-GPTQ</td><td>55.22</td><td>89.80</td><td>-</td></tr>
<tr><td>INT4-AWQ</td><td>55.22</td><td>90.30</td><td>-</td></tr>
<tr><td rowspan="5">Qwen2.5VL-72B</td><td>BF16</td><td>58.78</td><td>94.39</td><td>85.60</td></tr>
<tr><td>FP8-Static</td><td>57.89</td><td>94.41</td><td>85.84</td></tr>
<tr><td>FP8-Dynamic</td><td>58.67</td><td>94.38</td><td>85.60</td></tr>
<tr><td>INT4-GPTQ</td><td>57.56</td><td>94.46</td><td>86.48</td></tr>
<tr><td>INT4-AWQ</td><td>58.78</td><td>94.19</td><td>87.28</td></tr>
</tbody>
</table>

#### Other Models

Benchmark results for other models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, and `GSM8K`:

<table>
<thead>
<tr><th>Model</th><th>Quantization</th><th>CEVAL</th><th>MMLU</th><th>GSM8K</th></tr>
</thead>
<tbody>
<tr><td rowspan="3">Qwen2.5-1.5B-Instruct</td><td>BF16</td><td>67.01</td><td>60.05</td><td>54.28</td></tr>
<tr><td>FP8-Static</td><td>66.27</td><td>60.23</td><td>-</td></tr>
<tr><td>FP8-Dynamic</td><td>66.79</td><td>60.08</td><td>51.71</td></tr>
<tr><td rowspan="5">Qwen2.5-7B-Instruct</td><td>BF16</td><td>81.20</td><td>74.55</td><td>79.98</td></tr>
<tr><td>FP8-Static</td><td>81.13</td><td>74.03</td><td>79.30</td></tr>
<tr><td>FP8-Dynamic</td><td>80.31</td><td>74.07</td><td>79.00</td></tr>
<tr><td>INT4-GPTQ</td><td>79.05</td><td>73.05</td><td>74.75</td></tr>
<tr><td>INT4-AWQ</td><td>79.35</td><td>73.22</td><td>79.38</td></tr>
<tr><td rowspan="5">Qwen2.5-32B-Instruct</td><td>BF16</td><td>87.30</td><td>83.21</td><td>81.73</td></tr>
<tr><td>FP8-Static</td><td>87.59</td><td>83.08</td><td>81.58</td></tr>
<tr><td>FP8-Dynamic</td><td>87.30</td><td>83.04</td><td>81.58</td></tr>
<tr><td>INT4-GPTQ</td><td>86.70</td><td>82.45</td><td>82.03</td></tr>
<tr><td>INT4-AWQ</td><td>87.00</td><td>82.64</td><td>-</td></tr>
<tr><td rowspan="5">DeepSeek-R1-Distill-Qwen-7B</td><td>BF16</td><td>53.49</td><td>53.80</td><td>75.74</td></tr>
<tr><td>FP8-Static</td><td>53.57</td><td>54.17</td><td>76.19</td></tr>
<tr><td>FP8-Dynamic</td><td>52.97</td><td>54.13</td><td>74.15</td></tr>
<tr><td>INT4-GPTQ</td><td>51.86</td><td>52.44</td><td>75.89</td></tr>
<tr><td>INT4-AWQ</td><td>53.49</td><td>53.70</td><td>-</td></tr>
<tr><td rowspan="5">DeepSeek-R1-Distill-Qwen-14B</td><td>BF16</td><td>77.71</td><td>74.28</td><td>85.67</td></tr>
<tr><td>FP8-Static</td><td>77.56</td><td>74.66</td><td>86.73</td></tr>
<tr><td>FP8-Dynamic</td><td>76.82</td><td>74.63</td><td>87.11</td></tr>
<tr><td>INT4-GPTQ</td><td>74.29</td><td>72.37</td><td>84.61</td></tr>
<tr><td>INT4-AWQ</td><td>74.81</td><td>73.00</td><td>86.05</td></tr>
<tr><td rowspan="5">DeepSeek-R1-Distill-Qwen-32B</td><td>BF16</td><td>84.18</td><td>80.89</td><td>87.41</td></tr>
<tr><td>FP8-Static</td><td>83.43</td><td>80.90</td><td>87.57</td></tr>
<tr><td>FP8-Dynamic</td><td>83.73</td><td>81.10</td><td>86.43</td></tr>
<tr><td>INT4-GPTQ</td><td>84.10</td><td>79.80</td><td>86.73</td></tr>
<tr><td>INT4-AWQ</td><td>82.84</td><td>80.15</td><td>87.19</td></tr>
</tbody>
</table>

### (2) Speculative Decoding

#### Qwen3 Series Models

Benchmark results for Qwen3 series models with the `Eagle3` speculative decoding algorithm on `MT-bench`, `HumanEval`, `GSM8K`, and `Alpaca`. `Speedup` is measured relative to standard autoregressive decoding, and `τ` denotes the average acceptance length:

<table>
<thead>
<tr>
<th> </th><th> </th>
<th colspan="2" style="text-align: center; vertical-align: middle;">MT-bench</th>
<th colspan="2" style="text-align: center; vertical-align: middle;">HumanEval</th>
<th colspan="2" style="text-align: center; vertical-align: middle;">GSM8K</th>
<th colspan="2" style="text-align: center; vertical-align: middle;">Alpaca</th>
<th colspan="2" style="text-align: center; vertical-align: middle;">Mean</th></tr>
<tr><th>Temperature</th><th>Model</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th></tr>
</thead>
<tbody>
<tr><td rowspan="6"><strong>T=0</strong></td>
<td>Qwen3-1.7B</td><td>2.05x</td><td>2.81</td><td>2.07x</td><td>2.93</td><td>2.11x</td><td>2.98</td><td>1.93x</td><td>2.69</td><td>2.04x</td><td>2.85</td></tr>
<tr><td>Qwen3-4B</td><td>2.21x</td><td>3.01</td><td>2.36x</td><td>3.24</td><td>2.42x</td><td>3.13</td><td>2.32x</td><td>2.75</td><td>2.33x</td><td>3.03</td></tr>
<tr><td>Qwen3-8B</td><td>2.63x</td><td>3.65</td><td>2.76x</td><td>3.85</td><td>2.82x</td><td>3.90</td><td>2.62x</td><td>3.48</td><td>2.70x</td><td>3.72</td></tr>
<tr><td>Qwen3-14B</td><td>2.23x</td><td>3.30</td><td>2.53x</td><td>3.74</td><td>2.56x</td><td>3.79</td><td>2.16x</td><td>3.13</td><td>2.37x</td><td>3.49</td></tr>
<tr><td>Qwen3-32B</td><td>2.39x</td><td>2.78</td><td>2.37x</td><td>2.81</td><td>2.47x</td><td>2.92</td><td>2.42x</td><td>2.53</td><td>2.41x</td><td>2.76</td></tr>
<tr><td>Qwen3-30B-A3B</td><td>2.84x</td><td>3.63</td><td>2.27x</td><td>3.09</td><td>2.64x</td><td>3.42</td><td>2.83x</td><td>3.56</td><td>2.64x</td><td>3.42</td></tr>
<tr><td rowspan="6"><strong>T=1</strong></td>
<td>Qwen3-1.7B</td><td>1.74x</td><td>2.53</td><td>1.86x</td><td>2.70</td><td>1.82x</td><td>2.69</td><td>1.72x</td><td>2.46</td><td>1.93x</td><td>2.60</td></tr>
<tr><td>Qwen3-4B</td><td>1.93x</td><td>2.60</td><td>2.00x</td><td>2.84</td><td>2.11x</td><td>2.82</td><td>2.34x</td><td>2.50</td><td>1.75x</td><td>2.69</td></tr>
<tr><td>Qwen3-8B</td><td>1.98x</td><td>2.75</td><td>2.25x</td><td>3.11</td><td>2.31x</td><td>3.15</td><td>2.10x</td><td>2.76</td><td>2.90x</td><td>2.94</td></tr>
<tr><td>Qwen3-14B</td><td>1.71x</td><td>2.61</td><td>1.95x</td><td>2.87</td><td>2.04x</td><td>3.08</td><td>1.68x</td><td>2.55</td><td>2.90x</td><td>2.78</td></tr>
<tr><td>Qwen3-32B</td><td>1.62x</td><td>1.91</td><td>1.71x</td><td>2.05</td><td>1.78x</td><td>2.10</td><td>1.80x</td><td>1.95</td><td>1.62x</td><td>2.00</td></tr>
<tr><td>Qwen3-30B-A3B</td><td>1.91x</td><td>2.46</td><td>2.00x</td><td>2.64</td><td>1.90x</td><td>2.53</td><td>1.80x</td><td>2.32</td><td>1.90x</td><td>2.48</td></tr>
</tbody>
</table>

#### Hunyuan Series Models

Benchmark results for Hunyuan series models with the `Eagle3` speculative decoding algorithm on `MT-bench`, `HumanEval`, `GSM8K`, and `Alpaca`:

<table>
<thead>
<tr>
<th> </th><th> </th>
<th colspan="2" style="text-align: center; vertical-align: middle;">MT-bench</th>
<th colspan="2" style="text-align: center; vertical-align: middle;">HumanEval</th>
<th colspan="2" style="text-align: center; vertical-align: middle;">GSM8K</th>
<th colspan="2" style="text-align: center; vertical-align: middle;">Alpaca</th>
<th colspan="2" style="text-align: center; vertical-align: middle;">Mean</th></tr>
<tr><th>Temperature</th><th>Model</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th><th>Speedup</th><th>τ</th></tr>
</thead>
<tbody>
<tr><td rowspan="3"><strong>T=0</strong></td>
<td>Hunyuan-1.8B-Instruct</td><td>1.97x</td><td>2.90</td><td>2.58x</td><td>3.73</td><td>2.61x</td><td>3.71</td><td>1.71x</td><td>2.43</td><td>2.22x</td><td>3.19</td></tr>
<tr><td>Hunyuan-4B-Instruct</td><td>1.77x</td><td>2.60</td><td>2.64x</td><td>3.35</td><td>2.14x</td><td>3.17</td><td>1.72x</td><td>2.57</td><td>2.07x</td><td>2.92</td></tr>
<tr><td>Hunyuan-7B-Instruct</td><td>2.22x</td><td>3.58</td><td>3.59x</td><td>5.47</td><td>2.96x</td><td>4.68</td><td>1.64x</td><td>2.56</td><td>2.60x</td><td>4.07</td></tr>
<tr><td rowspan="3"><strong>T=1</strong></td>
<td>Hunyuan-1.8B-Instruct</td><td>1.58x</td><td>2.36</td><td>2.35x</td><td>3.56</td><td>2.23x</td><td>3.38</td><td>1.26x</td><td>1.87</td><td>1.86x</td><td>2.79</td></tr>
<tr><td>Hunyuan-4B-Instruct</td><td>1.36x</td><td>2.05</td><td>1.97x</td><td>2.86</td><td>1.72x</td><td>2.68</td><td>1.14x</td><td>1.76</td><td>1.55x</td><td>2.34</td></tr>
<tr><td>Hunyuan-7B-Instruct</td><td>1.90x</td><td>3.11</td><td>3.12x</td><td>5.09</td><td>2.74x</td><td>4.34</td><td>1.47x</td><td>2.39</td><td>2.31x</td><td>3.73</td></tr>
</tbody>
</table>

## 📝 License

The code for this project is open-sourced under the [License for AngelSlim](LICENSE).

## 🔗 Citation

```bibtex
@software{AngelSlim2025,
    title={{AngelSlim}},
    author={Tencent AngelSlim Project Contributors},
    year={2025},
    month={6},
    url={https://github.com/Tencent/AngelSlim}
}
```

## 💬 Technical Discussion

* AngelSlim is continuously iterating, and new features will be released soon. If you have any questions or suggestions, please open an issue on [GitHub Issues](https://github.com/Tencent/AngelSlim/issues) or join our [WeChat technical discussion group](./docs/source/assets/angel_slim_wechat.png).