Update certainty_lora/README.md
certainty_lora/README.md (+35 -61)

pipeline_tag: text-generation
library_name: transformers
---

# Granite 3.3 8B Instruct - Uncertainty LoRA

Welcome to Granite Experiments!

## Model Summary

**Granite 3.3 8b Instruct - Uncertainty** is a LoRA adapter for [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct), adding the capability to provide calibrated certainty scores when answering questions, when prompted to do so, in addition to retaining the full abilities of the [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct) model.

- **Developer:** IBM Research
- **Model type:** LoRA adapter for [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct)
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
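
For orientation, the adapter expresses certainty as a single digit (0-9) generated after a `certainty` role, which the quickstart below converts to a percentage in 10% steps. The snippet is only a reading of that example code (the helper name `certainty_percent` is ours), so treat it as illustrative rather than a formal specification:

```python
# Illustrative mapping taken from the quickstart example below:
# the generated digit d (0-9) corresponds to a certainty of 5 + 10*d percent.
def certainty_percent(uq_score: int) -> int:
    return 5 + uq_score * 10  # 0 -> 5%, 5 -> 55%, 9 -> 95%
```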

### Model Sources

<!-- Provide the basic links for the model. -->

- **Paper:** The **Granite Uncertainty 3.3 8b** model is finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819).

## Usage

### Quickstart Example

The following code describes how to use the Granite Uncertainty model to answer questions and obtain intrinsic calibrated certainty scores. Note that a generic system prompt is included; it is not necessary and can be modified as needed.

```python
import torch, os
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel, PeftConfig

token = os.getenv("HF_MISTRAL_TOKEN")
BASE_NAME = "ibm-granite/granite-3.3-8b-instruct"
LORA_NAME = "ibm-granite/granite-uncertainty-3.3-8b-lora"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the base model and the uncertainty LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side='left', trust_remote_code=True, token=token)
model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map="auto")
model_UQ = PeftModel.from_pretrained(model_base, LORA_NAME)

question = "What is IBM Research?"
print("Question: " + question)
question_chat = [
    {
        "role": "user",
        "content": question
    },
]

# Generate answer with base model
input_text = tokenizer.apply_chat_template(question_chat, tokenize=False, add_generation_prompt=True)

# Tokenize the prompt and generate the answer
inputs = tokenizer(input_text, return_tensors="pt")
output = model_base.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=600)
output_text = tokenizer.decode(output[0])
answer = output_text.split("assistant<|end_of_role|>")[1]
print("Answer: " + answer)

# Generate certainty score with the LoRA adapter
uq_generation_prompt = "<|start_of_role|>certainty<|end_of_role|>"
uq_chat = [
    {
        "role": "system",
        "content": ""
    },
    {
        "role": "user",
        "content": question
    },
    {
        "role": "assistant",
        "content": answer
    },
]

uq_text = tokenizer.apply_chat_template(uq_chat, tokenize=False) + uq_generation_prompt
# Remove the automatic system prompt
string_to_remove = tokenizer.apply_chat_template(uq_chat[0:1], tokenize=False, add_generation_prompt=False)
input_text = input_text[len(string_to_remove):]
uq_text = uq_text[len(string_to_remove):]

# Tokenize and generate the single certainty token
inputs = tokenizer(uq_text, return_tensors="pt")
output = model_UQ.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=1)
output_text = tokenizer.decode(output[0])

# Extract score (a digit 0-9, mapped to 5%-95%)
uq_score = int(output_text[-1])
print("Certainty: " + str(5 + uq_score * 10) + "%")
```
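
If you need to score many question-answer pairs, the two-step flow above can be wrapped in a small helper. The sketch below only reuses calls already shown in the quickstart and assumes `tokenizer`, `model_base`, `model_UQ`, and `device` from that example are in scope; the function name `answer_with_certainty` is a hypothetical convenience, not part of the model's API.

```python
def answer_with_certainty(question, max_new_tokens=600):
    """Sketch: answer with the base model, then score the answer with the LoRA adapter."""
    # 1) Generate the answer with the base model
    chat = [{"role": "user", "content": question}]
    text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt")
    out = model_base.generate(inputs["input_ids"].to(device),
                              attention_mask=inputs["attention_mask"].to(device),
                              max_new_tokens=max_new_tokens)
    answer = tokenizer.decode(out[0]).split("assistant<|end_of_role|>")[1]

    # 2) Ask the LoRA adapter for a single certainty token
    uq_chat = [
        {"role": "system", "content": ""},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
    uq_text = tokenizer.apply_chat_template(uq_chat, tokenize=False) + "<|start_of_role|>certainty<|end_of_role|>"
    prefix = tokenizer.apply_chat_template(uq_chat[0:1], tokenize=False, add_generation_prompt=False)
    uq_text = uq_text[len(prefix):]  # drop the automatic system prompt, as in the quickstart
    uq_inputs = tokenizer(uq_text, return_tensors="pt")
    uq_out = model_UQ.generate(uq_inputs["input_ids"].to(device),
                               attention_mask=uq_inputs["attention_mask"].to(device),
                               max_new_tokens=1)
    uq_score = int(tokenizer.decode(uq_out[0])[-1])
    return answer, 5 + uq_score * 10  # certainty in percent

# Example:
# answer, certainty = answer_with_certainty("What is IBM Research?")
# print(f"{certainty}% certain: {answer}")
```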

## Evaluation

The model was evaluated on the [MMLU](https://huggingface.co/datasets/cais/mmlu) datasets (not used in training). Shown are the [Expected Calibration Error (ECE)](https://towardsdatascience.com/expected-calibration-error-ece-a-step-by-step-visual-explanation-with-python-code-c3e9aa12937d) values for each task, for the base model (Granite-3.3-8b-instruct) and Granite-Uncertainty-3.3-8b.
The average ECE across tasks for our method is 0.064 (out of 1) and is consistently low across tasks (maximum task ECE 0.10), compared to the base model average ECE of 0.20 and maximum task ECE of 0.60. Note that our ECE of 0.064 is smaller than the gap between the quantized certainty outputs (10% quantization steps). Additionally, the zero-shot performance on the MMLU tasks does not degrade, averaging at 89%.
<!-- This section describes the evaluation protocols and provides the results. -->
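
For reference, ECE over the 10% certainty bins used here can be computed along the following lines. This is a generic sketch of the standard ECE definition (with placeholder inputs), not the evaluation code behind the numbers above:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin-weighted average of |accuracy - mean confidence| per confidence bin."""
    confidences = np.asarray(confidences, dtype=float)  # e.g. 0.05, 0.15, ..., 0.95
    correct = np.asarray(correct, dtype=float)          # 1.0 if the answer was correct, else 0.0
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / len(confidences)) * gap
    return ece

# Example with placeholder data:
# expected_calibration_error([0.95, 0.55, 0.85], [1, 0, 1])
```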

## Training Details

The **Granite Uncertainty 3.3 8b** model is a LoRA adapter finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819).