update notes on inference

README.md +35 -1

@@ -10,14 +10,48 @@ tags:
 - llama
 - llama-2
 - hosted inference
 ---
 # Llama 2 - hosted inference
 This is simply an 8-bit version of the Llama-2-7B model.
 - 8-bits allows the model to be below 10 GB
 - This allows for hosted inference of the model on the model's home page
-~
 Below follows information on the original Llama 2 model...

 - llama
 - llama-2
 - hosted inference
+- 8 bit
+- 8bit
+- 8-bit
 ---
 # Llama 2 - hosted inference
 This is simply an 8-bit version of the Llama-2-7B model.
 - 8-bits allows the model to be below 10 GB
 - This allows for hosted inference of the model on the model's home page
+- Note that hosted inference may be slow unless you have a Hugging Face Pro plan.
+
+If you want to run inference yourself (e.g. in a Colab notebook), you can try:
+```
+!pip install -q -U git+https://github.com/huggingface/accelerate.git
+!pip install -q -U bitsandbytes
+!pip install -q -U git+https://github.com/huggingface/transformers.git
+
+from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
+
+model_id = 'Trelis/Llama-2-7b-chat-hf-hosted-inference-8bit'
+# The tokenizer must be loaded explicitly; stream() below depends on it.
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# device_map='auto' lets accelerate place the weights on the available GPU.
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto')
+
+# Llama 2 inference
+def stream(user_prompt):
+    system_prompt = 'You are a helpful assistant that provides accurate and concise responses'
+    # Llama 2 chat markers for the instruction and system sections.
+    B_INST, E_INST = "[INST]", "[/INST]"
+    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
+    prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}\n\n"
+    inputs = tokenizer([prompt], return_tensors="pt").to("cuda:0")
+    streamer = TextStreamer(tokenizer)
+    # generate() still returns the usual output; the streamer also prints the generated text to stdout.
+    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)
+
+stream('Count to ten')
+```
 Below follows information on the original Llama 2 model...
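
The prompt string assembled inside `stream` in the diff above can be checked without a GPU or a model download. The following is a minimal sketch that factors that one line into a standalone helper; `build_llama2_prompt` and `DEFAULT_SYSTEM_PROMPT` are hypothetical names introduced here for illustration and are not part of the README or of `transformers`.

```python
# Llama 2 chat markers, copied from the README snippet.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

# Hypothetical constant holding the system prompt used in the README snippet.
DEFAULT_SYSTEM_PROMPT = (
    "You are a helpful assistant that provides accurate and concise responses"
)


def build_llama2_prompt(user_prompt: str,
                        system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Wrap a system message and a user message in the Llama 2 chat markers.

    Hypothetical helper: it reproduces the f-string from the README's
    stream() function and does nothing else.
    """
    return (
        f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}"
        f"{user_prompt.strip()} {E_INST}\n\n"
    )
```

Calling `print(repr(build_llama2_prompt("Count to ten")))` shows the exact string, including newlines, that would be passed to the tokenizer.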