zhiyucheng committed · Commit b335a0b · verified · 1 Parent(s): decd0b2

add vllm deployment instructions

Files changed (1): README.md +34 -0
README.md CHANGED
@@ -98,3 +98,37 @@ The accuracy (MMLU, 5-shot) and throughputs (tokens per second, TPS) benchmark results
We benchmarked with tensorrt-llm v0.13 on 8 H100 GPUs, using batch size 1024 with in-flight batching enabled for the throughput measurements. We achieved a **~1.3x** speedup with FP8.

### Deploy with vLLM

To deploy the quantized checkpoint with [vLLM](https://github.com/vllm-project/vllm.git), follow the instructions below:

1. Install vLLM following the directions [here](https://github.com/vllm-project/vllm?tab=readme-ov-file#getting-started).
2. Pass `quantization="modelopt"` when initializing the `LLM` engine so that vLLM loads the Model Optimizer PTQ checkpoint correctly.

Example deployment on H100:

```python
from vllm import LLM, SamplingParams

# FP8 checkpoint quantized with NVIDIA TensorRT Model Optimizer.
model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
sampling_params = SamplingParams(temperature=0.8, top_p=0.9)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# quantization="modelopt" tells vLLM to load the Model Optimizer checkpoint.
llm = LLM(model=model_id, quantization="modelopt")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

This model can also be deployed with an OpenAI-compatible server via the vLLM backend; see the instructions [here](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server).
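
As a minimal sketch of that flow (not from the model card), the snippet below assumes the server was launched with `vllm serve nvidia/Llama-3.1-8B-Instruct-FP8 --quantization modelopt` on the default port 8000, and queries it with the official `openai` Python client; exact CLI flags can vary across vLLM versions, so check the linked docs for your release.

```python
# Sketch: query a vLLM OpenAI-compatible server, assumed to have been started with
#   vllm serve nvidia/Llama-3.1-8B-Instruct-FP8 --quantization modelopt
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default;
# the API key is ignored unless the server was started with one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",
    messages=[{"role": "user", "content": "What is FP8 quantization?"}],
    temperature=0.8,
    top_p=0.9,
)
print(response.choices[0].message.content)
```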