add vllm deployment instructions
README.md
CHANGED
@@ -98,3 +98,37 @@ The accuracy (MMLU, 5-shot) and throughputs (tokens per second, TPS) benchmark r
We benchmarked with tensorrt-llm v0.13 on 8 H100 GPUs, using batch size 1024 for the throughputs with in-flight batching enabled. We achieved **~1.3x** speedup with FP8.
### Deploy with vLLM

To deploy the quantized checkpoint with [vLLM](https://github.com/vllm-project/vllm.git), follow the instructions below:
1. Install vLLM following the directions [here](https://github.com/vllm-project/vllm?tab=readme-ov-file#getting-started).
2. To use a Model Optimizer PTQ checkpoint with vLLM, pass the `quantization="modelopt"` flag when initializing the `LLM` engine.
Example deployment on H100:

```python
from vllm import LLM, SamplingParams

model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
sampling_params = SamplingParams(temperature=0.8, top_p=0.9)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

llm = LLM(model=model_id, quantization="modelopt")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
This model can also be deployed behind an OpenAI-compatible server via the vLLM backend; see the instructions [here](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server).
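Once such a server is running, any OpenAI client can talk to it. Below is a minimal sketch using the `openai` Python package; it assumes the server was started with something like `vllm serve nvidia/Llama-3.1-8B-Instruct-FP8 --quantization modelopt` and is listening on the default port 8000. The base URL, placeholder API key, and sampling values are illustrative assumptions, not required settings:

```python
# Minimal client sketch: query a locally running vLLM OpenAI-compatible server.
# Assumes the server was launched separately, e.g.:
#   vllm serve nvidia/Llama-3.1-8B-Instruct-FP8 --quantization modelopt
# and that the `openai` Python package is installed (`pip install openai`).
from openai import OpenAI

# vLLM listens on port 8000 by default; the API key can be any placeholder
# unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",  # must match the served model name
    prompt="The capital of France is",
    max_tokens=32,
    temperature=0.8,
)
print(response.choices[0].text)
```

The same server also exposes `/v1/chat/completions`, so chat-style requests work through `client.chat.completions.create(...)` as well.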