Update README.md
model.save_pretrained(output_dir, save_safetensors=True, save_compressed=False)
tokenizer.save_pretrained(output_dir)
```

## Inference

### Prerequisite

Install the latest vLLM nightly build:

```
pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly
```
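After installing, you can confirm the package is visible to Python before serving. A minimal sketch of such a check; it only inspects the installed distribution and does not import vllm itself:

```python
from importlib import metadata, util

# Look up vllm without importing it (a full import initializes GPU state,
# which is unnecessary for a version check).
spec = util.find_spec("vllm")
if spec is not None:
    msg = f"vllm {metadata.version('vllm')}"
else:
    msg = "vllm not installed; run the pip command above"
print(msg)
```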

### vllm

For Ampere devices, please use the `TRITON_ATTN_VLLM_V1` attention backend, e.g.:

```
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve cpatonn/gpt-oss-120b-BF16 --async-scheduling
```
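Once the server is up, vLLM exposes an OpenAI-compatible API (on port 8000 by default). A minimal sketch of the request body you would POST to it; the prompt text and `max_tokens` value here are illustrative:

```python
import json

# vLLM's OpenAI-compatible server listens on port 8000 by default.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "cpatonn/gpt-oss-120b-BF16",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,  # illustrative cap on the reply length
}

body = json.dumps(payload)
# With the server running, POST `body` to `url` with
# Content-Type: application/json, e.g.:
#   requests.post(url, data=body, headers={"Content-Type": "application/json"})
print(body)
```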

For further information, please visit this [guide](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html).

# gpt-oss-120b
<p align="center">