Commit 96a5535
Parent: 70a8c87

doc: update doc for vLLM 256k support, align chinese doc with en doc.

Files changed:
- README.md: +33 -2
- README_CN.md: +99 -225
README.md
CHANGED

@@ -281,6 +281,38 @@ Support for this model has been added via this [PR 20114](https://github.com/vllm-project/vllm/pull/20114)
You can build and run vLLM from source after merging this pull request into your local repository.


+### Model Context Length Support
+
+The Hunyuan A13B model supports a maximum context length of **256K tokens (262,144 token positions)**. However, due to GPU memory constraints on most hardware setups, the default configuration in `config.json` limits the context length to **32K tokens** to prevent out-of-memory (OOM) errors.
+
+#### Extending Context Length to 256K
+
+To enable full 256K context support, you can manually modify the `max_position_embeddings` field in the model's `config.json` file as follows:
+
+```json
+{
+  ...
+  "max_position_embeddings": 262144,
+  ...
+}
+```
+
+When serving the model using **vLLM**, you can also explicitly set the maximum model length by adding the following flag to your server launch command:
+
+```bash
+--max-model-len 262144
+```
+
+#### Recommended Configuration for 256K Context Length
+
+The following configuration is recommended for deploying the model with 256K context length support on systems equipped with **NVIDIA H20 GPUs (96GB VRAM)**:
+
+| Model DType | KV-Cache Dtype | Number of Devices | Model Length |
+|----------------|----------------|--------------------|--------------|
+| `bfloat16` | `bfloat16` | 4 | 262,144 |
+
+> ⚠️ **Note:** Using FP8 quantization for KV-cache may impact generation quality. The above settings are suggested configurations for stable 256K-length service deployment.
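As an illustration of the `config.json` edit documented in the section above, the field can also be patched with a few lines of Python. This is a minimal sketch, assuming a locally downloaded copy of the model; the directory path is illustrative:

```python
import json
from pathlib import Path

# Illustrative path to a local copy of Hunyuan-A13B-Instruct; adjust to your own download location.
config_path = Path("Hunyuan-A13B-Instruct/config.json")

config = json.loads(config_path.read_text())
config["max_position_embeddings"] = 262144  # raise the context window from 32K to 256K

config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))
print("max_position_embeddings ->", config["max_position_embeddings"])
```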

#### Tool Calling with vLLM


@@ -331,7 +363,6 @@ docker run --gpus all \
-m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
```

-
## Contact Us

-If you would like to leave a message for our R&D and product teams, Welcome to contact our open-source team . You can also contact us via email ([email protected]).
+If you would like to leave a message for our R&D and product teams, you are welcome to contact our open-source team. You can also contact us via email ([email protected]).

README_CN.md
CHANGED

@@ -176,281 +176,155 @@ print(response)
Support for fp8 and int4 quantized models in TensorRT-LLM is in progress; stay tuned.


-### Docker:
-
-```shell
-# Pull the image
-docker pull hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm
-# Start the container
-docker run --name hunyuanLLM_infer -itd --privileged --user root --net=host --ipc=host --gpus=8 hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm
-```
-
-Note on Docker container privileges: starting the container in privileged mode (--privileged), as above, grants it elevated permissions and increases the risk of data leakage and cluster compromise. Avoid privileged mode unless it is strictly necessary; where it is required, perform a strict security assessment and implement appropriate monitoring and hardening measures.
-
-### BF16 Deployment
-
-BF16 can be deployed on 2 GPUs with more than 80 GB of memory each; for long-context workloads TP4 is recommended. Proceed as follows:
-
-Set the following environment variable before running the command:
-
-```shell
-export MODEL_PATH=PATH_TO_MODEL
-```
-
-#### Option 1: Command-line inference
-
-Below is a code snippet that uses `vLLM` to quickly query the chat model:
-
-Note on remote code execution protection for the vLLM component: if the trust-remote-code option below is enabled, vLLM is allowed to load and execute code from the remote model repository, which may lead to execution of malicious code. Unless your use case explicitly requires it, keep this option disabled to reduce potential security risks.
-
-```python
-import os
-from typing import List, Optional
-from vllm import LLM, SamplingParams
-from vllm.inputs import PromptType
-from transformers import AutoTokenizer
-
-model_path = os.environ.get('MODEL_PATH')
-tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
-
-llm = LLM(model=model_path,
-          tokenizer=model_path,
-          trust_remote_code=True,
-          dtype='bfloat16',
-          tensor_parallel_size=4,
-          gpu_memory_utilization=0.9)
-
-sampling_params = SamplingParams(
-    temperature=0.7, top_p=0.8, max_tokens=4096, top_k=20, repetition_penalty=1.05)
-
-messages = [
-    {
-        "role": "system",
-        "content": "You are a helpful assistant.",
-    },
-    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
-]
-
-tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
-
-dummy_inputs: List[PromptType] = [{
-    "prompt_token_ids": batch
-} for batch in tokenized_chat.numpy().tolist()]
-
-outputs = llm.generate(dummy_inputs, sampling_params)
-
-# Print the outputs.
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-```
-
-Remember to change `${LOCAL_IP}` and `${MODEL_PATH}` in `openapi.sh` to the values for your service.
-
-Image: the deployment image is the same as for BF16.
-
-To deploy the Int8 weight-only version of the HunYuan-A13B model, just set the environment variable in `run_server_int8.sh`:
-
-```shell
-export MODEL_PATH=PATH_TO_BF16_MODEL
-```
-
-```shell
-sh run_server_int8.sh
-```
-
-```shell
-sh openapi.sh
-```
-
-```shell
-sh run_server_int4.sh
-```
-
-```shell
-sh openapi.sh
-```
-
-To deploy the W8A8C8 version of the HunYuan-A13B model, just set the environment variable in `run_server_int8.sh`:
-
-```shell
-export MODEL_PATH=PATH_TO_FP8_MODEL
-```
-
-```shell
-sh run_server_fp8.sh
-```
-
-This section reports efficiency test results for deploying the models (original and quantized) with vLLM, i.e. inference speed (tokens/s) at different batch sizes. Test environment: Tencent Cloud, H80 (96G) GPU x number of GPUs:
-
-```python
-python3 benchmark_throughput.py --backend vllm \
- --input-len 2048 \
- --output-len 14336 \
- --model $MODEL_PATH \
- --tensor-parallel-size $TP \
- --use-v2-block-manager \
- --async-engine \
- --trust-remote-code \
- --num_prompts $BATCH_SIZE \
- --max-num-seqs $BATCH_SIZE
-```
-
-|------|-----------------------------|-----------|-------------------------|---------------------|----------------------|----------------------|
-| vLLM | Hunyuan-A13B-Instruct | 8 | 2048 | 190.84 | 1246.54 | 1981.99 |
-| vLLM | Hunyuan-A13B-Instruct | 4 | 2048 | 158.90 | 779.10 | 1301.75 |
-| vLLM | Hunyuan-A13B-Instruct | 2 | 2048 | 111.72 | 327.31 | 346.54 |
-| vLLM | Hunyuan-A13B-Instruct(int8 weight only) | 2 | 2048 | 109.10 | 444.17 | 721.93 |
-| vLLM | Hunyuan-A13B-Instruct(W8A8C8-FP8) | 2 | 2048 | 91.83 | 372.01 | 617.70 |
-| vLLM | Hunyuan-A13B-Instruct(W8A8C8-FP8) | 1 | 2048 | 60.07 | 148.80 | 160.41 |
-
-#### Option 1: Command-line inference
-
-```python
-import sglang as sgl
-from transformers import AutoTokenizer
-
-model_path = os.environ.get('MODEL_PATH')
-
-tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
-
-messages = [
-    {
-        "role": "system",
-        "content": "You are a helpful assistant.",
-    },
-    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
-]
-prompts = []
-prompts.append(tokenizer.apply_chat_template(
-    messages,
-    tokenize=False,
-    add_generation_prompt=True
-))
-print(prompts)
-
-llm = sgl.Engine(
-    model_path=model_path,
-    tp_size=4,
-    trust_remote_code=True,
-    mem_fraction_static=0.7,
-)
-
-outputs = llm.generate(prompts, sampling_params)
-for prompt, output in zip(prompts, outputs):
-    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
-```
-
-```python
-import openai
-client = openai.Client(
-    base_url="http://localhost:30000/v1", api_key="EMPTY")
-
-    extra_body={"top_p": 0.8, "top_k": 20}
-)
-print(response)
-```
-
-#### FP8/Int4 quantized model deployment:
-Support for fp8 and int4 quantized models in sglang is in progress; stay tuned.
+## vLLM Deployment
+
+### Docker Image
+
+We provide a pre-built Docker image that contains vLLM 0.8.5 with support for this model. Official vLLM support is still under active development. **Note: this image requires CUDA 12.8.**
+
+- Get started as follows:
+
+```
+docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-vllm
+or
+docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm
+```
+
+- Download the model files:
+  - Huggingface: vLLM will download the model automatically.
+  - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct`
+
+- Start the API server (model downloaded from Huggingface):
+
+```bash
+docker run --rm --ipc=host \
+    -v ~/.cache:/root/.cache/ \
+    --security-opt seccomp=unconfined \
+    --net=host \
+    --gpus=all \
+    -it \
+    -e VLLM_USE_V1=0 \
+    --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
+    -m vllm.entrypoints.openai.api_server \
+    --host 0.0.0.0 \
+    --tensor-parallel-size 4 \
+    --port 8000 \
+    --model tencent/Hunyuan-A13B-Instruct \
+    --trust_remote_code
+```
+
+- Start the API server (model downloaded from ModelScope):
+
+```bash
+docker run --rm --ipc=host \
+    -v ~/.cache/modelscope:/root/.cache/modelscope \
+    --security-opt seccomp=unconfined \
+    --net=host \
+    --gpus=all \
+    -it \
+    -e VLLM_USE_V1=0 \
+    --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
+    -m vllm.entrypoints.openai.api_server \
+    --host 0.0.0.0 \
+    --tensor-parallel-size 4 \
+    --port 8000 \
+    --model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/ \
+    --trust_remote_code
+```
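Once either container above is up, the OpenAI-compatible endpoint on port 8000 can be exercised from Python. A minimal sketch, assuming the server is reachable on localhost and using the Hugging Face model name from the launch command; the sampling values mirror the repository's earlier examples:

```python
# Requires the `openai` Python package; the server exposes an OpenAI-compatible API.
import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",  # matches the --model value used at launch
    messages=[{"role": "user", "content": "Write a short summary of the benefits of regular exercise"}],
    temperature=0.7,
    max_tokens=512,
    extra_body={"top_p": 0.8, "top_k": 20},
)
print(response.choices[0].message.content)
```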

+### Building from Source
+
+Support for this model has been submitted to the vLLM project via [PR 20114](https://github.com/vllm-project/vllm/pull/20114).
+
+You can build and run vLLM from source after merging this PR into your local repository.
+
+### Model Context Length Support
+
+The Hunyuan A13B model supports a maximum context length of **256K tokens (262,144 positions)**. However, due to GPU memory constraints on most hardware setups, the default `config.json` limits the context length to **32K tokens** to avoid out-of-memory (OOM) errors.
+
+#### Extending the Context Length to 256K
+
+To enable full 256K context support, manually modify the `max_position_embeddings` field in the model's `config.json` file as follows:
+
+```json
+{
+  ...
+  "max_position_embeddings": 262144,
+  ...
+}
+```
+
+When serving with **vLLM**, you can also explicitly set the maximum model length by adding the following flag to the launch command:
+
+```bash
+--max-model-len 262144
+```
+
+#### Recommended Configuration for 256K Context Length
+
+The following configuration is recommended for deploying a 256K-context service on systems equipped with **NVIDIA H20 GPUs (96 GB VRAM)**:
+
+| Model DType | KV-Cache DType | Number of Devices | Model Length |
+|----------------|-------------------|------------|--------------|
+| `bfloat16` | `bfloat16` | 4 | 262,144 |
+
+> ⚠️ **Note:** Quantizing the KV cache to FP8 may affect generation quality. The settings above are the recommended configuration for stable deployment of a 256K-length service.
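The `--max-model-len` flag above applies to the OpenAI-compatible server; when driving vLLM directly from Python, the corresponding knob is the `max_model_len` argument. A minimal sketch, assuming 4 GPUs and an illustrative local model path, not an official recommendation:

```python
from vllm import LLM, SamplingParams

# Illustrative local path; config.json must allow 262,144 positions (see above).
llm = LLM(
    model="./Hunyuan-A13B-Instruct",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=4,   # matches the recommended 4-device H20 setup
    max_model_len=262144,     # equivalent of --max-model-len 262144
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=1024)
outputs = llm.generate(["Summarize the benefits of regular exercise."], sampling_params)
print(outputs[0].outputs[0].text)
```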

+### Tool Calling with vLLM
+
+To support agent-based workflows and function-calling capabilities, the model includes dedicated parsing mechanisms for handling tool calls and internal reasoning steps.
+
+For a complete example of how to implement and use these capabilities in agent scenarios, see our example code on GitHub:
+🔗 [Hunyuan A13B Agent Example](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/)
+
+When deploying the model with **vLLM**, the following parameters can be used to configure tool-parsing behavior:
+
+| Parameter | Value |
+|-------------------------|--------------------------------------------------------------------|
+| `--tool-parser-plugin` | [local Hunyuan A13B tool parser file](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/hunyuan_tool_parser.py) |
+| `--tool-call-parser` | `hunyuan` |
+
+These settings enable vLLM to correctly parse and route the tool calls generated by the model according to the expected format.
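For illustration, a client-side request against a server launched with the tool-parsing flags above could look like the following sketch. The `get_weather` function schema is hypothetical, and the endpoint and model name assume the launch command shown earlier:

```python
import json
import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool schema, used purely to illustrate the request shape.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Shenzhen today?"}],
    tools=tools,
    tool_choice="auto",
)

# With --tool-call-parser hunyuan, tool calls come back as structured objects.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```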

+### Reasoning Parser
+
+Reasoning-parser support for the Hunyuan A13B model in vLLM is currently still under development.
+
+## SGLang
+
+### Docker Image
+
+We also provide a Docker image built on the latest version of SGLang.
+
+Get started as follows:
+
+- Pull the Docker image:
+
+```
+docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang
+or
+docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang
+```
+
+- Start the API server:
+
+```bash
+docker run --gpus all \
+    --shm-size 32g \
+    -p 30000:30000 \
+    --ipc=host \
+    docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang \
+    -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
+```
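The SGLang server above also exposes an OpenAI-compatible endpoint on port 30000. A minimal client sketch, assuming the container is running locally; the sampling values mirror the example this commit removes from the old section, and the model name is an assumption that may need adjusting:

```python
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # SGLang serves the launched model under a default name; adjust if needed
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
    ],
    temperature=0.7,
    max_tokens=512,
    extra_body={"top_p": 0.8, "top_k": 20},
)
print(response.choices[0].message.content)
```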

## Interactive Web Demo
A web demo of hunyuan-A13B is now available. Visit https://hunyuan.tencent.com/?model=hunyuan-a13b to try out the model.

-<br>
-
-## Citation
-If you find our work helpful, feel free to cite our <a href="report/Hunyuan_A13B_Technical_Report.pdf">technical report</a>!
-
-<br>
-
## Contact Us
+If you would like to leave a message for our R&D and product teams, you are welcome to contact the Tencent Hunyuan LLM team. You can also reach us by email ([email protected]).