kgreenewald committed on
Commit d224cea · verified · 1 Parent(s): ab79fae

Update certainty_lora/README.md

Files changed (1)
  1. certainty_lora/README.md +35 -61
certainty_lora/README.md CHANGED
@@ -6,7 +6,7 @@ pipeline_tag: text-generation
 library_name: transformers
 ---
 
- # Granite 3.3 8B Instruct - Uncertainty aLoRA
+ # Granite 3.3 8B Instruct - Uncertainty LoRA
 
 Welcome to Granite Experiments!
 
@@ -15,29 +15,20 @@ Think of Experiments as a preview of what's to come. These projects are still un
 
 ## Model Summary
 
- **Granite 3.3 8b Instruct - Uncertainty** is an Activated LoRA (aLoRA) adapter for [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct),
- adding the capability to provide calibrated certainty scores when answering questions when prompted, in addition to retaining the full abilities of the [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.2-8b-instruct) model.
+ **Granite 3.3 8b Instruct - Uncertainty** is a LoRA adapter for [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct),
+ adding the capability to provide calibrated certainty scores when answering questions when prompted, in addition to retaining the full abilities of the [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct) model.
 
 - **Developer:** IBM Research
- - **Model type:** Activated LoRA adapter for [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct)
+ - **Model type:** LoRA adapter for [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct)
 - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
- ## Activated LoRA
- Activated LoRA (aLoRA) is a new low rank adapter architecture that allows for reusing existing base model KV cache for more efficient inference.
-
- [Paper](https://arxiv.org/abs/2504.12397)
-
- [IBM Research Blogpost](https://research.ibm.com/blog/inference-friendly-aloras)
-
- [Github - needed to run inference](https://github.com/IBM/activated-lora)
-
 
 ### Model Sources
 
 <!-- Provide the basic links for the model. -->
 
 
- **UQ method** The **Granite Uncertainty 3.3 8b** model is finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819)
+ - **Paper:** The **Granite Uncertainty 3.3 8b** model is finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819)
 
 
 ## Usage
@@ -86,87 +77,70 @@ Scenario 2. Predicting the certainty score from the question only, *prior* to ge
 
 ### Quickstart Example
 
- The following code describes how to use the Granite Uncertainty model to answer questions and obtain intrinsic calibrated certainty scores. Note that no system prompt is used.
-
- The code required for Activated LoRA is on [Github](https://github.com/IBM/activated-lora)
-
- Prior to running the code below, either clone the repo or install as
-
- ```
- pip install git+ssh://git@github.com:IBM/activated-lora.git
- ```
+ The following code describes how to use the Granite Uncertainty model to answer questions and obtain intrinsic calibrated certainty scores. Note that a generic system prompt is included; this is not necessary and can be modified as needed.
 
- Note that two generation options are shown - one illustrating the KV cache reuse ability of aLoRA (faster), and another showing the simplest generation call (slower).
 ```python
 import torch,os
- from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
- from alora.peft_model_alora import aLoRAPeftModelForCausalLM
- from alora.config import aLoraConfig
- from alora.tokenize_alora import tokenize_alora
-
- REUSE_CACHE = False
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ from peft import PeftModel, PeftConfig
 
 token = os.getenv("HF_MISTRAL_TOKEN")
 BASE_NAME = "ibm-granite/granite-3.3-8b-instruct"
- LORA_NAME = "ibm-granite/granite-3.3-8b-alora-uncertainty"
+ LORA_NAME = "ibm-granite/granite-uncertainty-3.3-8b-lora"
 device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
 
 # Load model
 tokenizer = AutoTokenizer.from_pretrained(BASE_NAME,padding_side='left',trust_remote_code=True, token=token)
 model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME,device_map="auto")
- model_UQ = aLoRAPeftModelForCausalLM.from_pretrained(model_base, LORA_NAME)
+ model_UQ = PeftModel.from_pretrained(model_base, LORA_NAME)
 
 question = "What is IBM Research?"
 print("Question:" + question)
 question_chat = [
- {
- "role": "user",
- "content": question
- },
+ {
+ "role": "user",
+ "content": question
+ },
 ]
 
 # Generate answer with base model
 input_text = tokenizer.apply_chat_template(question_chat,tokenize=False,add_generation_prompt=True)
- # Remove default system prompt
- len_sys = len(input_text.split("<|start_of_role|>user")[0])
- input_text = input_text[len_sys:]
+
 
 #tokenize
 inputs = tokenizer(input_text, return_tensors="pt")
- if REUSE_CACHE: #save KV cache for future aLoRA call
-     prompt_cache = DynamicCache()
-     with model_UQ.disable_adapter():
-         output_dict = model_base.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=600,past_key_values = prompt_cache, return_dict_in_generate=True)
-     answer_cache = output_dict.past_key_values
-     output = output_dict.sequences
- else: #simplest call
-     with model_UQ.disable_adapter():
-         output = model_UQ.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=600)
+ output = model_base.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=600)
 output_text = tokenizer.decode(output[0])
 answer = output_text.split("assistant<|end_of_role|>")[1]
 print("Answer: " + answer)
 
 # Generate certainty score
 uq_generation_prompt = "<|start_of_role|>certainty<|end_of_role|>"
- uq_chat = question_chat + [
+ uq_chat = [
+ {
+ "role": "system",
+ "content": ""
+ },
+ {
+ "role": "user",
+ "content": question
+ },
 {
 "role": "assistant",
 "content": answer
 },
 ]
 
- uq_text = tokenizer.apply_chat_template(uq_chat,tokenize=False)
- uq_text = uq_text[len_sys:]
- # tokenize and generate
- inputs, alora_offsets = tokenize_alora(tokenizer,uq_text, uq_generation_prompt)
+ uq_text = tokenizer.apply_chat_template(uq_chat,tokenize=False) + uq_generation_prompt
+ # remove automatic system prompt
+ string_to_remove = tokenizer.apply_chat_template(uq_chat[0:1], tokenize=False,add_generation_prompt=False)
+ input_text = input_text[len(string_to_remove):]
+ uq_text = uq_text[len(string_to_remove):]
 
- if REUSE_CACHE: #reuse KV cache from earlier answer generation
-     output = model_UQ.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=1,alora_offsets=alora_offsets,past_key_values=answer_cache)
- else: #simplest call
-     output = model_UQ.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=1,alora_offsets=alora_offsets)
+ # tokenize and generate
+ inputs = tokenizer(uq_text, return_tensors="pt")
+ output = model_UQ.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=1)
 output_text = tokenizer.decode(output[0])
-
- # Extract score
 uq_score = int(output_text[-1])
 print("Certainty: " + str(5 + uq_score * 10) + "%")
 ```
@@ -174,7 +148,7 @@ print("Certainty: " + str(5 + uq_score * 10) + "%")
 
 ## Evaluation
 
- The model was evaluated on the [MMLU](https://huggingface.co/datasets/cais/mmlu) datasets (not used in training). Shown are the [Expected Calibration Error (ECE)](https://towardsdatascience.com/expected-calibration-error-ece-a-step-by-step-visual-explanation-with-python-code-c3e9aa12937d) for each task, for the base model (Granite-3.2-8b-instruct) and Granite-Uncertainty-3.2-8b.
+ The model was evaluated on the [MMLU](https://huggingface.co/datasets/cais/mmlu) datasets (not used in training). Shown are the [Expected Calibration Error (ECE)](https://towardsdatascience.com/expected-calibration-error-ece-a-step-by-step-visual-explanation-with-python-code-c3e9aa12937d) for each task, for the base model (Granite-3.3-8b-instruct) and Granite-Uncertainty-3.3-8b.
 The average ECE across tasks for our method is 0.064 (out of 1) and is consistently low across tasks (maximum task ECE 0.10), compared to the base model average ECE of 0.20 and maximum task ECE of 0.60. Note that our ECE of 0.064 is smaller than the gap between the quantized certainty outputs (10% quantization steps). Additionally, the zero-shot performance on the MMLU tasks does not degrade, averaging at 89%.
 <!-- This section describes the evaluation protocols and provides the results. -->
 
@@ -185,7 +159,7 @@ The average ECE across tasks for our method is 0.064 (out of 1) and is consisten
 
 
 ## Training Details
- The **Granite Uncertainty 3.3 8b** model is an aLoRA adapter finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819).
+ The **Granite Uncertainty 3.3 8b** model is a LoRA adapter finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819).
 
 
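The Usage section referenced in the hunk above distinguishes two scenarios, but the updated quickstart only demonstrates scoring an answer after it has been generated. Below is a minimal sketch of Scenario 2, predicting the certainty score from the question alone before generating an answer. It reuses the `tokenizer`, `model_UQ`, `uq_generation_prompt`, `question`, and `device` objects from the quickstart; the exact chat layout the released adapter expects for question-only scoring is an assumption here, not something the model card specifies.

```python
# Minimal sketch of Scenario 2: question-only certainty, before generating an answer.
# Assumes tokenizer, model_UQ, uq_generation_prompt, question, and device are already
# defined as in the quickstart; the prompt layout below is an assumption.
pre_chat = [
    {"role": "system", "content": ""},
    {"role": "user", "content": question},
]
pre_text = tokenizer.apply_chat_template(pre_chat, tokenize=False) + uq_generation_prompt
# Strip the automatically inserted system prompt, mirroring the quickstart.
strip = tokenizer.apply_chat_template(pre_chat[0:1], tokenize=False, add_generation_prompt=False)
pre_text = pre_text[len(strip):]

pre_inputs = tokenizer(pre_text, return_tensors="pt")
pre_output = model_UQ.generate(
    pre_inputs["input_ids"].to(device),
    attention_mask=pre_inputs["attention_mask"].to(device),
    max_new_tokens=1,
)
pre_score = int(tokenizer.decode(pre_output[0])[-1])  # single certainty digit 0-9
print("Pre-generation certainty: " + str(5 + pre_score * 10) + "%")
```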
 
 
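For context on the Evaluation numbers, Expected Calibration Error compares stated confidence with empirical accuracy inside confidence bins. The following is a generic illustrative sketch using ten equal-width bins, matching the 10% quantization steps of the certainty scores; the binning and weighting choices are assumptions for illustration, not the exact protocol behind the reported 0.064.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Generic ECE: per-bin |accuracy - mean confidence|, weighted by bin size.
    # The equal-width binning here is an illustrative assumption.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Certainty digits 0-9 map to 5%, 15%, ..., 95% as in the quickstart.
digits = [9, 7, 8, 3, 9]                      # adapter outputs (hypothetical)
confs = [(5 + d * 10) / 100 for d in digits]  # 0.95, 0.75, 0.85, 0.35, 0.95
hits = [1, 1, 1, 0, 1]                        # answer correctness (hypothetical)
print(round(expected_calibration_error(confs, hits), 3))  # -> 0.17
```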