Asher committed
Commit c864579 · Parent(s): e568489

doc update: add china docker mirror for vllm.

Files changed:
- README.md +18 -3
- README_CN.md +18 -3
README.md CHANGED

@@ -221,15 +221,30 @@ trtllm-serve \

 ### vLLM

-#### Docker Image
-We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. The official vllm release is currently under development, **note: cuda 12.
+#### Inference from Docker Image
+We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. Official support in upstream vLLM is still under development. **Note: CUDA 12.4 is required for this Docker image.**

-- To get started:

+- To get started, download the Docker image:
+
+**From Docker Hub:**
 ```
 docker pull hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
 ```

+**From the China mirror (thanks to [CNB](https://cnb.cool/ "CNB.cool")):**
+
+
+First, pull the image from CNB:
+```
+docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b/hunyuan-infer-vllm-cuda12.4:v1
+```
+
+Then, re-tag the image so that the name matches the scripts below:
+```
+docker tag docker.cnb.cool/tencent/hunyuan/hunyuan-a13b/hunyuan-infer-vllm-cuda12.4:v1 hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
+```
+
 - Download the model files:
   - Hugging Face: downloaded automatically by vLLM.
   - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct`
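The hunk above covers only pulling the image and fetching the weights; the launch script itself sits later in the README (the Chinese file's second hunk shows it starts with `docker run --rm --ipc=host \`). As a rough, hedged sketch of how the pulled image is typically started, assuming the Hugging Face repo id `tencent/Hunyuan-A13B-Instruct`, a single-node GPU setup, and stock vLLM OpenAI-server flags (none of these are taken from this commit):

```
# Illustrative launch only; adjust GPU flags, ports, volumes, and the model id to your setup.
docker run --rm --ipc=host --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1 \
  vllm serve tencent/Hunyuan-A13B-Instruct \
    --host 0.0.0.0 --port 8000 \
    --trust-remote-code
```

With the Hugging Face path, vLLM downloads the weights on first start; a model fetched via ModelScope can be served the same way by passing its local directory instead of the repo id.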
README_CN.md CHANGED

@@ -178,16 +178,31 @@ print(response)

 ## vLLM Deployment

-### Docker
+### Inference from Docker Image

 We provide a Docker image based on the official vLLM 0.8.5 release for quick deployment and testing. **Note: this image requires CUDA 12.4.**

-
+- First, download the Docker image:

+**From Docker Hub:**
 ```
 docker pull hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
 ```

+**China mirror:**
+
+For faster downloads, you can also pull the image from CNB; thanks to [CNB](https://cnb.cool/) for the support:
+
+1. Pull the image:
+```
+docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b/hunyuan-infer-vllm-cuda12.4:v1
+```
+
+2. Re-tag the image (optional, so that the name matches the scripts below):
+```
+docker tag docker.cnb.cool/tencent/hunyuan/hunyuan-a13b/hunyuan-infer-vllm-cuda12.4:v1 hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
+```
+
 - Download the model files:
   - Hugging Face: downloaded automatically by vLLM.
   - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct`
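Not part of the diff, but a quick sanity check after the re-tag step above: both names are tags on the same image, so `docker images` should report the same IMAGE ID for each.

```
# After "docker tag", the CNB name and the Docker Hub name point at one image;
# both listings should show the same IMAGE ID.
docker images hunyuaninfer/hunyuan-infer-vllm-cuda12.4
docker images docker.cnb.cool/tencent/hunyuan/hunyuan-a13b/hunyuan-infer-vllm-cuda12.4
```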
@@ -238,7 +253,7 @@ docker run --rm --ipc=host \

 ### Supported Context Length

-The Hunyuan A13B model supports a maximum context length of **256K token
+The Hunyuan A13B model supports a maximum context length of **256K tokens (262,144 tokens)**. However, because of the GPU memory limits of most hardware configurations, the default `config.json` caps the context length at **32K tokens** to avoid out-of-memory (OOM) errors.

 #### Extending the Context Length to 256K

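The "Extending the Context Length to 256K" section itself is not part of this hunk, but on the vLLM side the usual knob is `--max-model-len`. A hedged sketch of requesting the full window at serve time (model id assumed as above; 256K needs far more GPU memory than the 32K default):

```
# Illustrative only: ask vLLM for the full 262,144-token window instead of the 32K default.
# Expect much higher KV-cache memory use; multiple GPUs / tensor parallelism may be needed.
vllm serve tencent/Hunyuan-A13B-Instruct \
  --trust-remote-code \
  --max-model-len 262144
```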