Any vLLM kernel optimization plans?

#1 by pty819

Thank you very much for your hard work. Based on my personal experience, the AutoRound mix method seems to give slightly better results than AWQ. However, I've observed some performance issues with AutoRound, likely because its kernel is not yet well optimized.

Here are the performance results on my machine:
Hardware: AMD EPYC 7763 64-core processor + 1x RTX A6000

AutoRound performance:
- Maximum throughput with 8 concurrent requests: 115 tokens/s; increasing the batch size beyond this does not improve throughput.
- Single-request throughput: 33 tokens/s

AWQ performance:
- Maximum throughput with 20 concurrent requests: 400+ tokens/s
- Single-request throughput: 55+ tokens/s
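For context, throughput numbers like these can be measured with a short script against vLLM's offline LLM API. The sketch below is only an approximation of my setup; the model path, prompt, batch size, and max_tokens are placeholders rather than the exact configuration used for the measurements above.

```python
import time
from vllm import LLM, SamplingParams

# Placeholder path -- substitute the AutoRound- or AWQ-quantized checkpoint being compared.
model_path = "path/to/quantized-model"

llm = LLM(model=model_path)
sampling = SamplingParams(temperature=0.0, max_tokens=256)

# 8 concurrent requests, matching the batch size where AutoRound throughput saturated above.
prompts = ["Summarize the benefits of weight-only quantization."] * 8

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

# Count only generated tokens, not prompt tokens.
generated = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")
```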

Given these findings, I wanted to ask whether there are any plans to contribute a more optimized AutoRound kernel to vLLM.
