Update README.md
README.md
CHANGED
|
@@ -13,6 +13,22 @@ For more details about SwiftKV and how to use it:
|
|
| 13 |
* 📝 [SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation (arXiv)](https://arxiv.org/abs/2410.03960)
|
| 14 |
* 🚀 [Getting started guide](https://github.com/Snowflake-Labs/vllm/tree/swiftkv/examples/swiftkv)
|
| 15 |
| 16 |
## Eval Metrics
|
| 17 |
|
| 18 |
For a full breakdown on evaluation metrics and performance impact please refer to our [blog](https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/) and [arXiv paper](https://arxiv.org/abs/2410.03960) but below we've outlined some relevant evaluation metrics.
|
|
@@ -27,7 +43,7 @@ For a full breakdown on evaluation metrics and performance impact please refer t
|
|
| 27 |
| Baseline | 82.00 | 77.90 | 80.40 | 54.56 | 67.90 | 70.63 | 82.56 | **73.71** |
|
| 28 |
| 50% SingleInputKV | 80.38 | 78.22 | 79.30 | 54.54 | 67.30 | 69.73 | 79.45 | **72.70** |
|
| 29 |
|
| 30 |
-
##
|
| 31 |
|
| 32 |
Instructions on how to use vLLM for both evaluation and performance benchmarks:
|
| 33 |
-
https://github.com/Snowflake-Labs/vllm/tree/swiftkv/examples/swiftkv
|
|
|
|
| 13 |
* 📝 [SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation (arXiv)](https://arxiv.org/abs/2410.03960)
|
| 14 |
* 🚀 [Getting started guide](https://github.com/Snowflake-Labs/vllm/tree/swiftkv/examples/swiftkv)
|
| 15 |
|
| 16 |
+
## Performance Metrics
|
| 17 |
+
|
| 18 |
+
To evaluate SwiftKV’s performance, we focus on the following key metrics (see more details in our [blog](https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/)):
|
| 19 |
+
* Combined throughput: The total number of input and output tokens processed per second. This determines:
|
| 20 |
+
* For batch processing, the time required to complete jobs.
|
| 21 |
+
* For interactive use, the volume of concurrent requests a system can handle.
|
| 22 |
+
* TTFT: The latency between a user request and receiving the first token in the response.
|
| 23 |
+
* TPOT: The latency between subsequent tokens after the first token.
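The three metrics above can be made concrete with a small sketch. This is not the benchmarking code used for the reported results; the `RequestTrace` record, the function names, and the sample timings are hypothetical and only illustrate how each metric is defined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timing record for one request (hypothetical structure, for illustration)."""
    arrival_s: float            # time the request was submitted
    num_input_tokens: int       # prompt length in tokens
    token_times_s: List[float]  # time at which each output token was emitted


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between the request and its first output token."""
    return trace.token_times_s[0] - trace.arrival_s


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between consecutive tokens after the first."""
    n = len(trace.token_times_s)
    if n < 2:
        return 0.0  # undefined for a single token; reported as 0 in this sketch
    return (trace.token_times_s[-1] - trace.token_times_s[0]) / (n - 1)


def combined_throughput(traces: List[RequestTrace]) -> float:
    """Input plus output tokens processed per second over the whole run."""
    total_tokens = sum(t.num_input_tokens + len(t.token_times_s) for t in traces)
    start = min(t.arrival_s for t in traces)
    end = max(t.token_times_s[-1] for t in traces)
    return total_tokens / (end - start)


# Made-up timings for two overlapping requests (not measured data).
traces = [
    RequestTrace(arrival_s=0.0, num_input_tokens=2000,
                 token_times_s=[0.5, 0.55, 0.60, 0.65]),
    RequestTrace(arrival_s=0.1, num_input_tokens=2000,
                 token_times_s=[0.7, 0.8, 0.9, 1.0]),
]

for i, t in enumerate(traces):
    print(f"request {i}: TTFT={ttft(t):.3f}s TPOT={tpot(t):.3f}s")
print(f"combined throughput: {combined_throughput(traces):.0f} tokens/s")
```

Because combined throughput counts input tokens as well as output tokens, long-prompt workloads are dominated by prefill cost, which is the part of inference SwiftKV targets.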
|
| 24 |
+
|
| 25 |
+
Combined input and output throughput for Llama 3.1 70B (left) and Llama 3.1 405B (right) across a range of input lengths (bottom).
|
| 26 |
+
<img src="figure-4-full.png" alt="performance plot of llama-405B w. swiftkv" width="800">
|
| 27 |
+
|
| 28 |
+
TTFT (top) and TPOT (bottom) for input lengths 2000 (left), 8000 (middle), and 32000 (right) for Llama 3.1 405B fp8 model. For each experiment, a range of different request arrival rates is simulated. Each request generates 256 output tokens.
|
| 29 |
+
<img src="figure-6.png" alt="performance plot of llama-405B w. swiftkv" width="700">
|
| 30 |
+
|
| 31 |
+
|
| 32 |
## Eval Metrics
|
| 33 |
|
| 34 |
For a full breakdown on evaluation metrics and performance impact please refer to our [blog](https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/) and [arXiv paper](https://arxiv.org/abs/2410.03960) but below we've outlined some relevant evaluation metrics.
|
|
|
|
| 43 |
| Baseline | 82.00 | 77.90 | 80.40 | 54.56 | 67.90 | 70.63 | 82.56 | **73.71** |
|
| 44 |
| 50% SingleInputKV | 80.38 | 78.22 | 79.30 | 54.54 | 67.30 | 69.73 | 79.45 | **72.70** |
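As a quick sanity check on the two rows above, the sketch below recomputes the bold final column, assuming it is the unweighted mean of the seven scores before it (the column headers fall outside this hunk, so that reading is an assumption). The variable names are illustrative only.

```python
# Scores copied from the two table rows above, excluding the bold final column.
baseline = [82.00, 77.90, 80.40, 54.56, 67.90, 70.63, 82.56]
swiftkv_50 = [80.38, 78.22, 79.30, 54.54, 67.30, 69.73, 79.45]


def mean(scores):
    """Unweighted arithmetic mean of a list of scores."""
    return sum(scores) / len(scores)


baseline_avg = round(mean(baseline), 2)
swiftkv_avg = round(mean(swiftkv_50), 2)
gap = round(baseline_avg - swiftkv_avg, 2)

print(f"Baseline average:          {baseline_avg:.2f}")  # 73.71
print(f"50% SingleInputKV average: {swiftkv_avg:.2f}")   # 72.70
print(f"Difference:                {gap:.2f} points")    # 1.01
```

Under that assumption the recomputed means match the table, and the 50% SingleInputKV row sits about one point below the baseline on average.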
|
| 45 |
|
| 46 |
+
## Get started by serving SwiftKV on vLLM
|
| 47 |
|
| 48 |
Instructions on how to use vLLM for both evaluation and performance benchmarks:
|
| 49 |
+
https://github.com/Snowflake-Labs/vllm/tree/swiftkv/examples/swiftkv
|