Update certainty_lora/README.md
certainty_lora/README.md (+35 -61)

pipeline_tag: text-generation
library_name: transformers
---

# Granite 3.3 8B Instruct - Uncertainty LoRA

Welcome to Granite Experiments!

## Model Summary

**Granite 3.3 8b Instruct - Uncertainty** is a LoRA adapter for [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct), adding the capability to provide calibrated certainty scores when answering questions, when prompted to do so, in addition to retaining the full abilities of the [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct) model.

- **Developer:** IBM Research
- **Model type:** LoRA adapter for [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct)
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
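
For orientation, the adapter expresses certainty as a single digit (0-9) generated after a `certainty` role, which the quickstart below converts to a percentage in 10% steps. The snippet is only a reading of that example code (the helper name `certainty_percent` is ours), so treat it as illustrative rather than a formal specification:

```python
# Illustrative mapping taken from the quickstart example below:
# the generated digit d (0-9) corresponds to a certainty of 5 + 10*d percent.
def certainty_percent(uq_score: int) -> int:
    return 5 + uq_score * 10  # 0 -> 5%, 5 -> 55%, 9 -> 95%
```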

### Model Sources

<!-- Provide the basic links for the model. -->

- **Paper:** The **Granite Uncertainty 3.3 8b** model is finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819).

## Usage

### Quickstart Example

The following code describes how to use the Granite Uncertainty model to answer questions and obtain intrinsic calibrated certainty scores. Note that a generic system prompt is included; it is not necessary and can be modified as needed.

```python
import torch, os
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel, PeftConfig

token = os.getenv("HF_MISTRAL_TOKEN")
BASE_NAME = "ibm-granite/granite-3.3-8b-instruct"
LORA_NAME = "ibm-granite/granite-uncertainty-3.3-8b-lora"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the base model and the uncertainty LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side='left', trust_remote_code=True, token=token)
model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map="auto")
model_UQ = PeftModel.from_pretrained(model_base, LORA_NAME)

question = "What is IBM Research?"
print("Question: " + question)
question_chat = [
    {
        "role": "user",
        "content": question
    },
]

# Generate answer with base model
input_text = tokenizer.apply_chat_template(question_chat, tokenize=False, add_generation_prompt=True)

# Tokenize the prompt and generate the answer
inputs = tokenizer(input_text, return_tensors="pt")
output = model_base.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=600)
output_text = tokenizer.decode(output[0])
answer = output_text.split("assistant<|end_of_role|>")[1]
print("Answer: " + answer)

# Generate certainty score with the LoRA adapter
uq_generation_prompt = "<|start_of_role|>certainty<|end_of_role|>"
uq_chat = [
    {
        "role": "system",
        "content": ""
    },
    {
        "role": "user",
        "content": question
    },
    {
        "role": "assistant",
        "content": answer
    },
]

uq_text = tokenizer.apply_chat_template(uq_chat, tokenize=False) + uq_generation_prompt
# Remove the automatic system prompt
string_to_remove = tokenizer.apply_chat_template(uq_chat[0:1], tokenize=False, add_generation_prompt=False)
input_text = input_text[len(string_to_remove):]
uq_text = uq_text[len(string_to_remove):]

# Tokenize and generate the single certainty token
inputs = tokenizer(uq_text, return_tensors="pt")
output = model_UQ.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=1)
output_text = tokenizer.decode(output[0])

# Extract score (a digit 0-9, mapped to 5%-95%)
uq_score = int(output_text[-1])
print("Certainty: " + str(5 + uq_score * 10) + "%")
```
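
If you need to score many question-answer pairs, the two-step flow above can be wrapped in a small helper. The sketch below only reuses calls already shown in the quickstart and assumes `tokenizer`, `model_base`, `model_UQ`, and `device` from that example are in scope; the function name `answer_with_certainty` is a hypothetical convenience, not part of the model's API.

```python
def answer_with_certainty(question, max_new_tokens=600):
    """Sketch: answer with the base model, then score the answer with the LoRA adapter."""
    # 1) Generate the answer with the base model
    chat = [{"role": "user", "content": question}]
    text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt")
    out = model_base.generate(inputs["input_ids"].to(device),
                              attention_mask=inputs["attention_mask"].to(device),
                              max_new_tokens=max_new_tokens)
    answer = tokenizer.decode(out[0]).split("assistant<|end_of_role|>")[1]

    # 2) Ask the LoRA adapter for a single certainty token
    uq_chat = [
        {"role": "system", "content": ""},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
    uq_text = tokenizer.apply_chat_template(uq_chat, tokenize=False) + "<|start_of_role|>certainty<|end_of_role|>"
    prefix = tokenizer.apply_chat_template(uq_chat[0:1], tokenize=False, add_generation_prompt=False)
    uq_text = uq_text[len(prefix):]  # drop the automatic system prompt, as in the quickstart
    uq_inputs = tokenizer(uq_text, return_tensors="pt")
    uq_out = model_UQ.generate(uq_inputs["input_ids"].to(device),
                               attention_mask=uq_inputs["attention_mask"].to(device),
                               max_new_tokens=1)
    uq_score = int(tokenizer.decode(uq_out[0])[-1])
    return answer, 5 + uq_score * 10  # certainty in percent

# Example:
# answer, certainty = answer_with_certainty("What is IBM Research?")
# print(f"{certainty}% certain: {answer}")
```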

## Evaluation

The model was evaluated on the [MMLU](https://huggingface.co/datasets/cais/mmlu) datasets (not used in training). Shown are the [Expected Calibration Error (ECE)](https://towardsdatascience.com/expected-calibration-error-ece-a-step-by-step-visual-explanation-with-python-code-c3e9aa12937d) values for each task, for the base model (Granite-3.3-8b-instruct) and Granite-Uncertainty-3.3-8b.
The average ECE across tasks for our method is 0.064 (out of 1) and is consistently low across tasks (maximum task ECE 0.10), compared to the base model average ECE of 0.20 and maximum task ECE of 0.60. Note that our ECE of 0.064 is smaller than the gap between the quantized certainty outputs (10% quantization steps). Additionally, the zero-shot performance on the MMLU tasks does not degrade, averaging at 89%.
<!-- This section describes the evaluation protocols and provides the results. -->
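
For reference, ECE over the 10% certainty bins used here can be computed along the following lines. This is a generic sketch of the standard ECE definition (with placeholder inputs), not the evaluation code behind the numbers above:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin-weighted average of |accuracy - mean confidence| per confidence bin."""
    confidences = np.asarray(confidences, dtype=float)  # e.g. 0.05, 0.15, ..., 0.95
    correct = np.asarray(correct, dtype=float)          # 1.0 if the answer was correct, else 0.0
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / len(confidences)) * gap
    return ece

# Example with placeholder data:
# expected_calibration_error([0.95, 0.55, 0.85], [1, 0, 1])
```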

## Training Details

The **Granite Uncertainty 3.3 8b** model is a LoRA adapter finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819).