Update README.md
README.md CHANGED
````diff
@@ -30,7 +30,7 @@ Then we can serve with the following command:
 ```Shell
 # Server
 export MODEL=pytorch/Qwen3-8B-int4wo-hqq
-vllm serve $MODEL --tokenizer $MODEL -O3
+VLLM_DISABLE_COMPILE_CACHE=1 vllm serve $MODEL --tokenizer $MODEL -O3
 ```
 
 ```Shell
@@ -47,39 +47,6 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
 }'
 ```
 
-
-## Code Example
-```Py
-from vllm import LLM, SamplingParams
-
-# Sample prompts.
-prompts = [
-    "Hello, my name is",
-    "The president of the United States is",
-    "The capital of France is",
-    "The future of AI is",
-]
-# Create a sampling params object.
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-
-
-if __name__ == '__main__':
-    # Create an LLM.
-    llm = LLM(model="pytorch/Qwen3-8B-int4wo-hqq")
-    # Generate texts from the prompts.
-    # The output is a list of RequestOutput objects
-    # that contain the prompt, generated text, and other information.
-    outputs = llm.generate(prompts, sampling_params)
-    # Print the outputs.
-    print("\nGenerated Outputs:\n" + "-" * 60)
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print(f"Prompt: {prompt!r}")
-        print(f"Output: {generated_text!r}")
-        print("-" * 60)
-```
-
 Note: please use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache when running this code, e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`, since there are some issues with the composability of compile in vLLM and torchao;
 this is expected to be resolved in PyTorch 2.8.
````
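For reference, the serve-and-query flow after this change can be sketched as below. The curl request body is an assumed minimal OpenAI-compatible payload for illustration only; the README's actual curl command is elided in this diff, so the message content and payload shape here are hypothetical, not the exact command from the README.

```shell
# Launch the server with the compile cache disabled, per the change above.
export MODEL=pytorch/Qwen3-8B-int4wo-hqq
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve "$MODEL" --tokenizer "$MODEL" -O3

# In another shell: an assumed minimal chat-completions request against
# vLLM's OpenAI-compatible endpoint (payload is illustrative).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "pytorch/Qwen3-8B-int4wo-hqq",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```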