add vllm deployment instructions
README.md
CHANGED
@@ -98,3 +98,37 @@ The accuracy (MMLU, 5-shot) and throughputs (tokens per second, TPS) benchmark r
We benchmarked with tensorrt-llm v0.13 on 8 H100 GPUs, using batch size 1024 for the throughputs with in-flight batching enabled. We achieved **~1.3x** speedup with FP8.
### Deploy with vLLM

To deploy the quantized checkpoint with [vLLM](https://github.com/vllm-project/vllm.git), follow the instructions below:
1. Install vLLM following the directions [here](https://github.com/vllm-project/vllm?tab=readme-ov-file#getting-started).
2. To use a Model Optimizer PTQ checkpoint with vLLM, pass the `quantization="modelopt"` flag when initializing the `LLM` engine.
Example deployment on H100:

```python
from vllm import LLM, SamplingParams

model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
sampling_params = SamplingParams(temperature=0.8, top_p=0.9)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

llm = LLM(model=model_id, quantization="modelopt")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
This model can also be deployed behind an OpenAI-compatible server via the vLLM backend; see the instructions [here](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server).
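Once such a server is running, any OpenAI client can talk to it. Below is a minimal sketch using the `openai` Python package; it assumes the server was started with something like `vllm serve nvidia/Llama-3.1-8B-Instruct-FP8 --quantization modelopt` and is listening on the default port 8000. The base URL, placeholder API key, and sampling values are illustrative assumptions, not required settings:

```python
# Minimal client sketch: query a locally running vLLM OpenAI-compatible server.
# Assumes the server was launched separately, e.g.:
#   vllm serve nvidia/Llama-3.1-8B-Instruct-FP8 --quantization modelopt
# and that the `openai` Python package is installed (`pip install openai`).
from openai import OpenAI

# vLLM listens on port 8000 by default; the API key can be any placeholder
# unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",  # must match the served model name
    prompt="The capital of France is",
    max_tokens=32,
    temperature=0.8,
)
print(response.choices[0].text)
```

The same server also exposes `/v1/chat/completions`, so chat-style requests work through `client.chat.completions.create(...)` as well.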