pytorch/Qwen3-8B-INT4

Text Generation

text-generation-inference

jerryzh168 committed on May 17

Commit d6c04b6 · verified · 1 Parent(s): 7e49f57

Update README.md

Files changed (1)

README.md +24 -9

@@ -25,6 +25,29 @@ pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
 pip install torchao
 ```
+## Serving
+Then we can serve with the following command:
+```Shell
+# Server
+export MODEL=pytorch/Qwen3-8B-int4wo-hqq
+vllm serve $MODEL --tokenizer $MODEL -O3
+```
+```Shell
+# Client
+curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+  "model": "pytorch/Qwen3-8B-int4wo-hqq",
+  "messages": [
+    {"role": "user", "content": "Give me a short introduction to large language models."}
+  ],
+  "temperature": 0.6,
+  "top_p": 0.95,
+  "top_k": 20,
+  "max_tokens": 32768
+}'
+```
 ## Code Example
 ```Py
 from vllm import LLM, SamplingParams
@@ -60,14 +83,6 @@ if __name__ == '__main__':
 Note: please use `VLLM_DISABLE_COMPILE_CACHE=1` to disable compile cache when running this code, e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`, since there are some issues with the composability of compile in vLLM and torchao,
 this is expected be resolved in pytorch 2.8.
-## Serving
-Then we can serve with the following command:
-```Shell
-export MODEL=pytorch/Qwen3-8B-int4wo-hqq
-vllm serve $MODEL --tokenizer $MODEL -O3
-```
 # Inference with Transformers
 Install the required packages:
@@ -94,7 +109,7 @@ model = AutoModelForCausalLM.from_pretrained(
     trust_remote_code=True,
 )
 tokenizer = AutoTokenizer.from_pretrained(model_path)
 messages = [
     {"role": "system", "content": "You are a helpful AI assistant."},
     {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
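
The curl client added to the Serving section in this diff could also be issued from Python. The following is only a sketch, not part of the commit: the URL, model name, and sampling values are copied from the curl example, while the helper names `build_chat_payload` and `send_chat_request` and the `timeout` default are hypothetical choices made for illustration. It assumes the server started with `vllm serve` is reachable on `localhost:8000`.

```python
import json
import urllib.request

# Values copied from the curl example in the diff.
CHAT_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "pytorch/Qwen3-8B-int4wo-hqq"


def build_chat_payload(prompt, model=MODEL):
    """Build the same JSON body that the curl example sends (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "max_tokens": 32768,
    }


def send_chat_request(payload, url=CHAT_URL, timeout=600):
    """POST the payload as JSON and return the decoded response (hypothetical helper).

    The timeout is an assumed value; long generations may need more.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send_chat_request(build_chat_payload("Give me a short introduction to large language models."))` should return the parsed JSON response. For an OpenAI-compatible chat endpoint the generated text is normally found under `choices[0]["message"]["content"]`.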