TheBloke/Mistral-7B-v0.1-AWQ


TheBloke committed on Sep 29, 2023

Commit eb57837 · 1 Parent(s): 21da534

Update README.md

Files changed (1)

README.md +40 -1

@@ -49,7 +49,9 @@ AWQ is an efficient, accurate and blazing-fast low-bit weight quantization metho
 These are experimental first AWQs for the brand-new model format, Mistral.
-They will not work from vLLM or TGI. They can only be used from AutoAWQ, and they require installing both AutoAWQ and Transformers from Github. More details are below.
 <!-- description end -->
 <!-- repositories-available start -->
@@ -84,6 +86,43 @@ Models are released as sharded safetensors files.
 <!-- README_AWQ.md-provided-files end -->
 <!-- README_AWQ.md-use-from-python start -->
 ## How to use this AWQ model from Python code

 These are experimental first AWQs for the brand-new model format, Mistral.
+As of September 29th, 2023, they are supported by AutoAWQ and by vLLM (version 0.2).
+Using them from AutoAWQ requires installing both AutoAWQ and Transformers from GitHub. More details are below.
 <!-- description end -->
 <!-- repositories-available start -->
 <!-- README_AWQ.md-provided-files end -->
+<!-- README_AWQ.md-use-from-vllm start -->
+## Serving this model from vLLM
+Make sure you are using vLLM version 0.2.
+Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).
+- When using vLLM as a server, pass the `--quantization awq` parameter, for example:
+```shell
+python3 -m vllm.entrypoints.api_server --model TheBloke/Mistral-7B-v0.1-AWQ --quantization awq --dtype float16
+```
+- When using vLLM from Python code, pass the `quantization="awq"` parameter, for example:
+```python
+from vllm import LLM, SamplingParams
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM(model="TheBloke/Mistral-7B-v0.1-AWQ", quantization="awq", dtype="float16")
+outputs = llm.generate(prompts, sampling_params)
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
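Once the vLLM API server from the shell example is running, it can be queried over HTTP. The following is a minimal client sketch, not part of the commit. It assumes the server listens on `localhost:8000` and exposes a `/generate` endpoint that accepts a JSON body with a `prompt` field plus sampling parameters and returns a JSON object with a `text` list; check the vLLM documentation for your version before relying on these details. The helper names `build_payload` and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumption: default host, port, and endpoint of vllm.entrypoints.api_server.
API_URL = "http://localhost:8000/generate"


def build_payload(prompt, max_tokens=64, temperature=0.8, top_p=0.95):
    """Build the JSON body: the prompt plus sampling parameters."""
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def generate(prompt, url=API_URL, **sampling):
    """POST the prompt to the server and return the list of generated texts.

    Assumption: the response is a JSON object with a "text" list.
    """
    body = json.dumps(build_payload(prompt, **sampling)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]
```

With the server running, `generate("The capital of France is", max_tokens=32)` would send one request and return the generated completions. The sampling values mirror those in the Python example above.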
+<!-- README_AWQ.md-use-from-vllm end -->
 <!-- README_AWQ.md-use-from-python start -->
 ## How to use this AWQ model from Python code