This model is intended to be used as an accelerator for llama 13B (chat).

The underlying implementation of the paged-attention KV cache and the speculator can be found in https://github.com/foundation-model-stack/fms-extras.

A production implementation built on `fms-extras` can be found in https://github.com/tdoublep/text-generation-inference/tree/speculative-decoding.
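
The speed-up comes from speculative decoding: the small speculator drafts a few tokens ahead, and the base llama 13B model verifies the whole draft in one batched forward pass, keeping the longest accepted prefix. Below is a minimal greedy sketch of that accept loop; it is illustrative only, and `target_model`, `speculator`, and their call signatures are assumptions rather than the fms-extras API.

```python
import torch

def speculative_step(target_model, speculator, prefix: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One greedy speculative-decoding step (illustrative sketch).

    target_model / speculator are assumed callables mapping a 1-D tensor of
    token ids to per-position logits; they stand in for the real models.
    """
    # 1) The cheap speculator drafts k candidate tokens autoregressively.
    draft = prefix
    for _ in range(k):
        logits = speculator(draft)                        # (seq_len, vocab)
        draft = torch.cat([draft, logits[-1].argmax().view(1)])

    # 2) The big model scores the entire draft in ONE batched forward pass.
    target_choice = target_model(draft[:-1]).argmax(-1)   # its next-token picks

    # 3) Accept drafted tokens while they match the big model's own picks.
    n = prefix.numel()
    accepted = 0
    while accepted < k and draft[n + accepted] == target_choice[n + accepted - 1]:
        accepted += 1

    # Even on a full rejection we still gain the big model's corrected token.
    bonus = target_choice[n + accepted - 1].view(1)
    return torch.cat([prefix, draft[n:n + accepted], bonus])
```

Because verification is a single forward pass over all k drafted positions, every accepted token amortizes the cost of the large model.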

## Samples

### Production Server Sample

*To try this out running in a production-like environment, please use the pre-built docker image:*

#### Setup

```bash
docker pull docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-local/tgis-os:spec.7
docker run -d --rm --gpus all \
    ...
git clone --branch speculative-decoding --single-branch https://github.com/tdoublep/text-generation-inference
cd text-generation-inference/integration_tests
make gen-client
pip install . --no-cache-dir
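# Sanity check (standard docker CLI, nothing repo-specific): the server
# container started above should show up here.
docker ps --filter ancestor=docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-local/tgis-os:spec.7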
```

#### Run Sample

```bash
python sample_client.py
```
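
`sample_client.py` relies on the gRPC client generated by `make gen-client` above. The real stub and message names come from the repository's proto files; the sketch below is hypothetical throughout (module, stub, method, field, and port are all assumptions) and is shown only to convey the general shape of such a client.

```python
import grpc

# HYPOTHETICAL names -- the real ones are generated from the protos
# by `make gen-client` and will differ.
from generation_pb2 import GenerationRequest            # assumed message type
from generation_pb2_grpc import GenerationServiceStub   # assumed service stub

channel = grpc.insecure_channel("localhost:8033")       # port is an assumption
stub = GenerationServiceStub(channel)
print(stub.Generate(GenerationRequest(text="Hello!")))  # assumed method/field
```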

### Minimal Sample

*To try this out with the fms-native compiled model, please execute the following:*

#### Install

```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy
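# Optional sanity check; assumes the package imports as `fms_extras`.
python -c "import fms_extras, transformers; print(transformers.__version__)"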
```

#### Run Sample

##### batch_size=1 (compile + cudagraphs)

```bash
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    ... \
    --compile_mode=reduce-overhead
```
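
The `--compile_mode=reduce-overhead` flag presumably forwards to the `torch.compile` mode of the same name (an assumption about the script's internals). For reference, that PyTorch API looks like the sketch below; "reduce-overhead" enables CUDA graphs, which amortize kernel-launch overhead and help most on small-batch decode loops.

```python
import torch

def toy_decode_step(x: torch.Tensor) -> torch.Tensor:
    # Stand-in workload for one small decode step.
    return torch.nn.functional.relu(x @ x.T)

# mode="reduce-overhead" captures CUDA graphs after warm-up runs.
compiled = torch.compile(toy_decode_step, mode="reduce-overhead")

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(16, 16, device=device)
for _ in range(3):   # early calls warm up / capture; later calls replay
    out = compiled(x)
print(out.shape)
```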

##### batch_size=1 (compile)

```bash
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    ... \
    --compile
```

##### batch_size=4 (compile)

```bash
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    ...
```