Commit · 646bb29
1 Parent(s): f8182d3
Update README.md

README.md CHANGED

@@ -28,7 +28,7 @@ See the [usage instructions](#usage-example) for how to inference this model wit

 ## Performance Comparison

-#### Latency for
+#### Latency for token generation

 Below is average latency of generating a token using a prompt of varying size using NVIDIA A100-SXM4-80GB GPU:

@@ -67,13 +67,13 @@ from transformers import AutoConfig, AutoTokenizer
 sess = InferenceSession("Mistral-7B-v0.1.onnx", providers = ["CUDAExecutionProvider"])
 config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

-
+model = ORTModelForCausalLM(sess, config, use_cache = True, use_io_binding = True)

 tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

 inputs = tokenizer("Instruct: What is a fermi paradox?\nOutput:", return_tensors="pt")

-outputs =
+outputs = model.generate(**inputs)

 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
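Put together, the usage example after this commit reads roughly as the sketch below. Only the `transformers` import is visible in the hunk context, so the `onnxruntime` and `optimum.onnxruntime` imports are assumptions based on the identifiers used in the snippet.

```python
# Sketch of the full README usage example after this change.
# The first two imports are assumed; the diff only shows the transformers import.
from onnxruntime import InferenceSession
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

# Load the exported ONNX graph on the CUDA execution provider.
sess = InferenceSession("Mistral-7B-v0.1.onnx", providers=["CUDAExecutionProvider"])
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

# Wrap the session so it exposes the transformers-style generate() API,
# with KV cache and IO binding enabled as in the README.
model = ORTModelForCausalLM(sess, config, use_cache=True, use_io_binding=True)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
inputs = tokenizer("Instruct: What is a fermi paradox?\nOutput:", return_tensors="pt")

# Generate a completion and decode it back to text.
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```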