jerryzh168 committed · verified
Commit d6c04b6 · 1 Parent(s): 7e49f57

Update README.md

Files changed (1)
  1. README.md +24 -9
README.md CHANGED
@@ -25,6 +25,29 @@ pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
 pip install torchao
 ```
 
+## Serving
+Then we can serve with the following command:
+```Shell
+# Server
+export MODEL=pytorch/Qwen3-8B-int4wo-hqq
+vllm serve $MODEL --tokenizer $MODEL -O3
+```
+
+```Shell
+# Client
+curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+  "model": "pytorch/Qwen3-8B-int4wo-hqq",
+  "messages": [
+    {"role": "user", "content": "Give me a short introduction to large language models."}
+  ],
+  "temperature": 0.6,
+  "top_p": 0.95,
+  "top_k": 20,
+  "max_tokens": 32768
+}'
+```
+
+
 ## Code Example
 ```Py
 from vllm import LLM, SamplingParams
@@ -60,14 +83,6 @@ if __name__ == '__main__':
 Note: please use `VLLM_DISABLE_COMPILE_CACHE=1` to disable compile cache when running this code, e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`, since there are some issues with the composability of compile in vLLM and torchao,
 this is expected be resolved in pytorch 2.8.
 
-## Serving
-Then we can serve with the following command:
-```Shell
-export MODEL=pytorch/Qwen3-8B-int4wo-hqq
-vllm serve $MODEL --tokenizer $MODEL -O3
-```
-
-
 # Inference with Transformers
 
 Install the required packages:
@@ -94,7 +109,7 @@ model = AutoModelForCausalLM.from_pretrained(
     trust_remote_code=True,
 )
 tokenizer = AutoTokenizer.from_pretrained(model_path)
-
+
 messages = [
     {"role": "system", "content": "You are a helpful AI assistant."},
     {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},