Rico committed
Commit c533b15 · 1 Parent(s): 7bf5511

[UPDATE] update files

Files changed (2)
  1. deploy_guidance.md +0 -210
  2. stepfun-logo.png +0 -0
deploy_guidance.md DELETED
@@ -1,210 +0,0 @@
# Step3 Model Deployment Guide

This document provides deployment guidance for the Step3 model.

Currently, our open-source deployment guide only includes the TP and DP+TP deployment methods. The AFD (Attn-FFN Disaggregated) approach mentioned in our [paper](https://arxiv.org/abs/2507.19427) is still under joint development with the open-source community to achieve optimal performance. Please stay tuned for updates on our open-source progress.

## Overview

Step3 is a 321B-parameter VLM with hardware-aware model-system co-design optimized for minimizing decoding costs.

Our FP8 version requires about 326 GB of memory. The smallest deployment unit for this version is 8xH20, with either Tensor Parallelism (TP) or Data Parallelism + Tensor Parallelism (DP+TP).

Our BF16 version requires about 642 GB of memory. The smallest deployment unit for this version is 16xH20, with either Tensor Parallelism (TP) or Data Parallelism + Tensor Parallelism (DP+TP).
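These figures line up with the parameter count: at 2 bytes per parameter, 321B parameters take roughly 642 GB, and at 1 byte per parameter roughly 321 GB, with the FP8 total slightly higher in practice. A quick back-of-envelope check (illustrative only; it ignores activations and the KV cache, which the serving engine allocates separately):

```python
# Rough weight-memory estimate from the 321B parameter count.
# Illustrative only: ignores activations and KV cache, and assumes a
# uniform bytes-per-parameter figure for each precision.
PARAMS = 321e9

print(f"BF16 weights: ~{PARAMS * 2 / 1e9:.0f} GB")  # ~642 GB -> 16xH20
print(f"FP8 weights:  ~{PARAMS * 1 / 1e9:.0f} GB")  # ~321 GB (~326 GB in practice) -> 8xH20
```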
## Deployment Options

### vLLM Deployment

Please make sure to use a nightly version of vLLM. For details, please refer to the [vLLM nightly installation doc](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#pre-built-wheels).

```bash
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
```

We recommend using the following commands to deploy the model:

**`max_num_batched_tokens` should be larger than 4096. If not set, it defaults to 8192.**
#### BF16 Model

##### Tensor Parallelism (Serving on 16xH20):

```bash
# Start ray on node 0 and node 1 first.

# node 0:
vllm serve /path/to/step3 \
    --tensor-parallel-size 16 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --trust-remote-code \
    --port $PORT_SERVING
```

##### Data Parallelism + Tensor Parallelism (Serving on 16xH20):

Step3 has only a single KV head, so attention data parallelism can be adopted to reduce KV cache memory usage (a back-of-envelope estimate follows the command below).

```bash
# Start ray on node 0 and node 1 first.

# node 0:
vllm serve /path/to/step3 \
    --data-parallel-size 16 \
    --tensor-parallel-size 1 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --trust-remote-code
```
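The intuition: with a single KV head, the per-token KV cache cannot be sharded across tensor-parallel ranks, so under pure TP every rank holds a full copy; attention data parallelism keeps each request's cache on one rank instead. A rough sketch of the effect, using purely illustrative model dimensions (not Step3's actual layer count or head size):

```python
# Purely illustrative numbers: the layer count, head dim, and dtype below
# are placeholders, NOT Step3's real configuration.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 61, 1, 128, 2

# K and V for one token across all layers:
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES

RANKS = 16
# Pure TP: a single KV head cannot be split, so all ranks store a copy.
tp_bytes = RANKS * kv_bytes_per_token
# Attention DP: each request's KV cache lives on exactly one rank.
dp_bytes = kv_bytes_per_token

print(f"KV bytes per token, cluster-wide: TP={tp_bytes}, DP={dp_bytes}")
```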
#### FP8 Model

##### Tensor Parallelism (Serving on 8xH20):

```bash
vllm serve /path/to/step3-fp8 \
    --tensor-parallel-size 8 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code
```

##### Data Parallelism + Tensor Parallelism (Serving on 8xH20):

```bash
vllm serve /path/to/step3-fp8 \
    --data-parallel-size 8 \
    --tensor-parallel-size 1 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --trust-remote-code
```

##### Key parameter notes:

* `reasoning-parser`: If enabled, reasoning content in the response will be parsed into a structured format.
* `tool-call-parser`: If enabled, tool call content in the response will be parsed into a structured format (see the sketch below).
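As a minimal sketch of what these parsers change on the client side: with `--reasoning-parser`, vLLM's OpenAI-compatible response carries the model's reasoning in a separate `reasoning_content` field alongside the final answer, and with `--enable-auto-tool-choice`/`--tool-call-parser`, tool invocations arrive as structured `tool_calls` rather than raw text. The field names follow vLLM's OpenAI-compatible API; the tool definition itself is a made-up example.

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# A made-up tool definition, just to trigger structured tool-call output.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="step3",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

msg = resp.choices[0].message
# Populated by --reasoning-parser (a vLLM extension field):
print("reasoning:", getattr(msg, "reasoning_content", None))
# Populated by --tool-call-parser when the model decides to call a tool:
for call in msg.tool_calls or []:
    print("tool call:", call.function.name, call.function.arguments)
```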
### SGLang Deployment

SGLang 0.4.10 or later is required.

```bash
pip3 install "sglang[all]>=0.4.10"
```

#### BF16 Model

##### Tensor Parallelism (Serving on 16xH20):

```bash
# Start ray on node 0 and node 1 first.

# node 0:
python -m sglang.launch_server \
    --model-path /path/to/step3 \
    --trust-remote-code \
    --tool-call-parser step3 \
    --reasoning-parser step3 \
    --tp 16
```

#### FP8 Model

##### Tensor Parallelism (Serving on 8xH20):

```bash
python -m sglang.launch_server \
    --model-path /path/to/step3-fp8 \
    --trust-remote-code \
    --tool-call-parser step3 \
    --reasoning-parser step3 \
    --tp 8
```

### TensorRT-LLM Deployment

[Coming soon...]
## Client Request Examples

Once the server is up, you can use the chat API as below:

```python
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="step3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://xxxxx.png"
                    },
                },
                {"type": "text", "text": "Please describe the image."},
            ],
        },
    ],
)
print("Chat response:", chat_response)
```
You can also upload base64-encoded local images:

```python
import base64

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Read the local image and wrap it in a data URL.
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_step = f"data:image;base64,{encoded_image_text}"

chat_response = client.chat.completions.create(
    model="step3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": base64_step
                    },
                },
                {"type": "text", "text": "Please describe the image."},
            ],
        },
    ],
)
print("Chat response:", chat_response)
```

Note: Our image preprocessing pipeline implements a multi-patch mechanism to handle large images. If the input image exceeds 728x728 pixels, the system automatically applies image cropping logic to split it into patches.
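If you want to know ahead of time whether a given image will trigger this server-side cropping, a small client-side check like the sketch below may help. It uses Pillow, takes the 728x728 threshold from the note above, and assumes "exceeds" means either dimension being larger; the patching itself still happens server-side.

```python
from PIL import Image  # pip install pillow

# Threshold from the note above; the actual patching happens server-side.
MAX_SIDE = 728

def will_be_patched(image_path: str) -> bool:
    """Return True if either dimension exceeds 728px (assumed cropping trigger)."""
    with Image.open(image_path) as img:
        width, height = img.size
    return width > MAX_SIDE or height > MAX_SIDE

print(will_be_patched("/path/to/local/image.png"))
```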
stepfun-logo.png DELETED
Binary file (7.29 kB)