---
license: apache-2.0
---

# LoRA Adapter for Query Rewrite

Welcome to Granite Experiments!

Think of Experiments as a preview of what's to come. These projects are still under development, but we wanted to let the open-source community take them for a spin! Use them, break them, and help us build what's next for Granite – we'll keep an eye out for feedback and questions. Happy exploring!

Just a heads-up: Experiments are forever evolving, so we can't commit to ongoing support or guarantee performance.

# Model Summary 


This is a LoRA adapter for [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct) that is fine-tuned for the query rewrite task: 

    Given a multi-turn conversation between a user and an AI assistant, decontextualize the last 
    user utterance (query) by rewriting it (whenever necessary) into an equivalent version that 
    is standalone and can be understood by itself.

While this adapter is general purpose, it is especially effective in RAG settings where its ability to rewrite a user query into a standalone version directly improves the retriever performance, which in turn improves the answer generation performance. 

- **Developer:** IBM Research
- **Model type:** LoRA adapter for [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct)
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)


## Intended use

This LoRA adapter adds the ability to rewrite the last user query in a multi-turn conversation. Typically, the rewrite is a form of expansion that inlines into the query any implicit references to entities, concepts, or parts of the conversation made in previous turns (by either the user or the AI assistant). Such expansion can include coreference resolution (i.e., replacing pronouns with the actual entities) and the handling of ellipsis, the common linguistic phenomenon where parts of a sentence or phrase are omitted by the user but can be understood from context (e.g., for whom, of what, with respect to something discussed above).

As a result of the expansion, the query becomes a standalone query that is still equivalent in meaning to what the user asked in the last turn. The rewritten query can be sent to downstream tasks (e.g., to a retriever in a RAG setting) as a better replacement for the original user query, without the need for the (potentially very long) conversation context.
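
For example, in the conversation used in the HuggingFace quickstart below, the user first states that Tim Cook is the CEO of Apple Inc. and then follows up with "and for Microsoft?"; the adapter rewrites this follow-up into the standalone query "Who is the CEO of Microsoft?".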

**Note**: Even though one main application of query rewrite is in RAG settings, this LoRA adapter can also be used to rewrite user questions for other conversational use cases (e.g., accessing a database, or other APIs, or tools). As such, the adapter does not need any RAG documents (which may be present in the context in a RAG setting) and uses only the dialog turns, i.e., what is said between the user and the assistant.

**Model input**: The input to the model consists of:
1. A list of conversational turns that can alternate between the `user` and `assistant` roles
2. The final user query, extracted and placed in a special `query_to_rewrite` role
3. A rewrite instruction that includes the JSON formatting requirements

We provide the query to rewrite in a separate role for clearer delineation.

The simplest way to invoke the LoRA adapter for query rewrite is through the [granite.io](https://github.com/ibm-granite/granite-io) framework, where the adapter is wrapped in a `QueryRewriteIOProcessor` that runs on top of vLLM and abstracts away the lower-level details of calling the adapter. See the following quickstart example code. Before running the script, set the `lora_model_name` constant (or `LORA_NAME` in the HuggingFace example further below) to the path of the directory into which you downloaded the LoRA adapter. The download process is explained [here](https://huggingface.co/ibm-granite/granite-3.3-8b-rag-agent-lib#quickstart-example).

## Quickstart Example Using [Granite IO](https://github.com/ibm-granite/granite-io)
```python
# Imports go here
from granite_io.io.query_rewrite import QueryRewriteIOProcessor
from granite_io.io.granite_3_3.input_processors.granite_3_3_input_processor import (
    Granite3Point3Inputs,
)
from granite_io.backend.vllm_server import LocalVLLMServer
from granite_io import make_backend

# Constants go here
base_model_name = "ibm-granite/granite-3.3-8b-instruct"
lora_model_name = "PATH_TO_DOWNLOADED_DIRECTORY"
run_server = True

if run_server:
    # Start by firing up a local vLLM server and connecting a backend instance to it.
    server = LocalVLLMServer(
        base_model_name, lora_adapters=[(lora_model_name, lora_model_name)]
    )
    server.wait_for_startup(200)
    lora_backend = server.make_lora_backend(lora_model_name)
    backend = server.make_backend()
else:  # if not run_server
    # Use an existing server.
    # Modify the constants here as needed.
    openai_base_url = "http://localhost:55555/v1"
    openai_api_key = "granite_intrinsics_1234"
    openai_base_model_name = base_model_name
    openai_lora_model_name = lora_model_name
    backend = make_backend(
        "openai",
        {
            "model_name": openai_base_model_name,
            "openai_base_url": openai_base_url,
            "openai_api_key": openai_api_key,
        },
    )
    lora_backend = make_backend(
        "openai",
        {
            "model_name": openai_lora_model_name,
            "openai_base_url": openai_base_url,
            "openai_api_key": openai_api_key,
        },
    )

# Create an example chat completion with a short conversation.
chat_input = Granite3Point3Inputs.model_validate(
    {
        "messages": [
            {"role": "assistant", "content": "Welcome to pet questions!"},
            {
                "role": "user",
                "content": "I have two pets, a dog named Rex and a cat named Lucy.",
            },
            {
                "role": "assistant",
                "content": "Great, what would you like to share about them?",
            },
            {
                "role": "user",
                "content": "Rex spends a lot of time in the backyard and outdoors, "
                "and Luna is always inside.",
            },
            {
                "role": "assistant",
                "content": "Sounds good! Rex must love exploring outside, while Lucy "
                "probably enjoys her cozy indoor life.",
            },
            {
                "role": "user",
                "content": "But is he more likely to get fleas because of that?",
            },
        ],
        "generate_inputs": {"temperature": 0.0},
    }
)

# Instantiate the I/O processor for the LoRA adapter, pointing it at the LoRA backend
io_proc = QueryRewriteIOProcessor(lora_backend)

# Pass our example input through the I/O processor and retrieve the result.
# Note: acreate_chat_completion() is a coroutine, so run this inside an async
# context (e.g., a notebook) or wrap the call with asyncio.run().
chat_result = await io_proc.acreate_chat_completion(chat_input)
print(chat_result.results[0].next_message.model_dump_json(indent=2))

# Free up GPU resources
if "server" in locals():
    server.shutdown()
```



## Quickstart Example Using HuggingFace

A more involved alternative is to use the LoRA adapter directly instead of invoking it through granite.io. The invocation sequence is slightly more complex (and abstracted away in the granite.io framework). This model uses a special format where:
1. The conversation history is formatted normally
2. The final user query is extracted and placed in a `query_to_rewrite` role
3. A special rewrite role with JSON instructions is appended

The exact format is:
```
<conversation history>
<|start_of_role|>query_to_rewrite<|end_of_role|>FINAL_USER_QUERY_HERE<|end_of_text|>
<|start_of_role|>rewrite: Given the conversation history above and the specific query provided in the 'query_to_rewrite' role, rewrite that query into a standalone question that captures the user's intent without requiring the conversation context. If the query is already clear and standalone, output it as is. Your output should be a JSON structure with the rewritten question:

```json
{
    "rewritten_question": "YOUR_REWRITTEN_QUESTION_HERE"
}
```<|end_of_role|>
```

**Model output**: When prompted with the above format, the model generates a JSON object that contains a field with the actual rewritten question.
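
For instance, for the Tim Cook conversation used in the example below, the expected output is a JSON object of the form:

```json
{
    "rewritten_question": "Who is the CEO of Microsoft?"
}
```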

Use the code below to get started with the model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import json, re

INSTRUCTION_TEXT = "Given the conversation history above and the specific query provided in the 'query_to_rewrite' role, rewrite that query into a standalone question that captures the user's intent without requiring the conversation context. If the query is already clear and standalone, output it as is. "

JSON = """Your output should be a JSON structure with the rewritten question:

```json
{
    "rewritten_question": "YOUR_REWRITTEN_QUESTION_HERE"
}
```"""

REWRITE_PROMPT = "<|start_of_role|>rewrite: " + INSTRUCTION_TEXT + JSON + "<|end_of_role|>"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BASE_NAME = "ibm-granite/granite-3.3-8b-instruct"
LORA_NAME = "PATH_TO_DOWNLOADED_DIRECTORY"

tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side='left', trust_remote_code=True) 
model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map='auto') 
model_rewrite = PeftModel.from_pretrained(model_base, LORA_NAME)

# Input conversation
conv = [
    {
        "role": "user",
        "content": "Tim Cook is the CEO of Apple Inc."
    },
    {
        "role": "assistant",
        "content": "Yes, Tim Cook is the Chief Executive Officer of Apple Inc."
    },
    {
        "role": "user",
        "content": "and for Microsoft?"
    }
]

# Extract the final user query
final_user_query = conv[-1]["content"]

# Generate the query rewrite for the last turn in the above conversation
conv = [{"role": "system", "content": ""}] + conv
conversation_text = tokenizer.apply_chat_template(conv, tokenize=False)

# Add the query_to_rewrite role with the final user query
query_to_rewrite_role = f"<|start_of_role|>query_to_rewrite<|end_of_role|>{final_user_query}<|end_of_text|>\n"
input_text = conversation_text + query_to_rewrite_role + REWRITE_PROMPT

inputs = tokenizer(input_text, return_tensors="pt")

output = model_rewrite.generate(inputs["input_ids"].to(device), 
                               attention_mask=inputs["attention_mask"].to(device), 
                               max_new_tokens=80)
# Decode only the newly generated tokens (the prompt itself also contains the JSON template)
output_text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Regex pattern to extract the JSON with the rewrite from the output of the model
pattern = r'\{\s*"[^"]+"\s*:\s*"[^"]*"\s*\}'
match_js = re.findall(pattern, output_text)[0]

try:
    # Parse the JSON and extract the rewrite
    rewrite = json.loads(match_js)['rewritten_question']
except Exception:
    # Fall back to naive string parsing if the JSON is not well formed
    rewrite = match_js.split("\"rewritten_question\": ", 1)[1]

print(f"Rewrite: {rewrite}\n")
# Rewrite: Who is the CEO of Microsoft?
```

## Training Details

The training data contains two kinds of examples: 1) standalone examples, which teach the adapter to refrain from rewriting user questions that are already standalone, and 2) non-standalone examples, which cover a diversity of patterns and teach the adapter to expand the user turn so that it becomes standalone.

### Training Data

The training data uses the publicly available Cloud corpus of technical documentation pages from [MT-RAG](https://arxiv.org/abs/2501.03468). Based on this corpus of documents, we constructed a dataset consisting of high-quality, human-created conversations, where the last turn of the conversation comes in two versions: a non-standalone version and the corresponding standalone version.

The training dataset is proprietary and was obtained in collaboration with a third-party company that contracted the human annotators.

### Robustness to System Prompts

In a typical Retrieval-Augmented Generation (RAG) setup, different researchers or practitioners may use various system prompts tailored to their specific use cases. To enhance the LoRA adapter's robustness against these variations, we generate three distinct versions of each training sample, each paired with a different system prompt. This expanded and diversified training dataset is then used to train the LoRA adapters, improving their ability to handle diverse prompt styles effectively.

System prompts used:

- **Version 1:** `<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024. Today's Date: May 20, 2025. You are Granite, developed by IBM. You are a helpful AI assistant. <|end_of_text|>`

- **Version 2:** `<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024. Today's Date: May 20, 2025. You are Granite, developed by IBM. Write the response to the user's input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.<|end_of_text|>`

- **Version 3:** An empty system prompt (no instructions provided).

This approach ensures that our LoRA adapters remain effective and reliable across varying system prompt formats commonly encountered in real-world RAG applications.

#### Training Hyperparameters

The LoRA adapter was fine-tuned using PEFT under the following regime: rank = 32, learning rate = 3e-6, number of epochs = 25, and linear learning rate scheduler.
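
A minimal sketch of such a setup with the `peft` and `transformers` libraries is shown below. Only the rank, learning rate, number of epochs, and scheduler reflect the values reported above; the target modules, LoRA alpha, batch size, and dataset handling are illustrative assumptions (the actual training dataset is proprietary, as noted earlier).

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

# Illustrative sketch only: r, learning_rate, num_train_epochs, and lr_scheduler_type
# match the values reported above; everything else is an assumption.
base = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.3-8b-instruct")
lora_config = LoraConfig(
    r=32,                                 # LoRA rank used for this adapter
    lora_alpha=32,                        # assumption, not reported above
    target_modules=["q_proj", "v_proj"],  # assumption, not reported above
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

training_args = TrainingArguments(
    output_dir="query-rewrite-lora",
    learning_rate=3e-6,
    num_train_epochs=25,
    lr_scheduler_type="linear",
    per_device_train_batch_size=1,        # assumption
)
# trainer = Trainer(model=model, args=training_args, train_dataset=...)  # proprietary dataset
# trainer.train()
```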

## Evaluation

### Evaluation of retriever 

We evaluate Recall@k on the [MT-RAG](https://arxiv.org/abs/2501.03468) benchmark, under various query rewrite strategies for the retriever. All retrieved passages are obtained using the Elser retriever with the same settings as in the above paper. In addition to the LoRA adapter, we include several other baselines, including no-rewrite (where we send the last user turn to the retriever as-is), Mixtral rewrites, as well as gold rewrites (human-created). 

We evaluate on three different test sets: a) the full MT-RAG dataset (842 data points, i.e., last user turns); b) the non-standalone subset of the MT-RAG dataset, consisting of the 260 (out of 842) last user turns that were annotated by humans as non-standalone (i.e., dependent on the prior context); and c) the standalone subset of the MT-RAG dataset, which is the complementary subset with all the last user turns that were annotated by humans as standalone.
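
As a reminder of the metric, the sketch below shows the per-query computation of Recall@k, assuming each query is annotated with a set of gold relevant passages; the reported numbers average this quantity over all queries in a test set (see the MT-RAG paper for the exact evaluation protocol):

```python
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Fraction of the gold (relevant) passages that appear among the top-k retrieved passages."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & gold_ids) / len(gold_ids)
```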

a. Evaluation of Recall@k on full MT-RAG dataset.

| Strategy                    | Recall@5        | Recall@10         |  Recall@20   |
| --------------------------- | --------------- | ----------------- | ------------ |
| No rewrite                  |  0.486          | 0.587             |  0.665       |
| Mixtral 8x7b rewrite        |  0.522          | 0.642             |  0.72        |
| Gold rewrite                |  0.563          | 0.674             |  0.747       |
| Granite 3.3-8b LoRA rewrite |  0.563          | 0.682             |  0.762       |

b.  Evaluation of Recall@k on the non-standalone subset of MT-RAG.

| Strategy                    | Recall@5        | Recall@10         |  Recall@20   |
| --------------------------- | --------------- | ----------------- | ------------ |
| No rewrite                  |  0.263          | 0.338             | 0.435      | 
| Mixtral 8x7b rewrite        |  0.362         | 0.488             | 0.574      |
| Gold rewrite                |  0.479         | 0.582             | 0.662      |
| Granite 3.3-8b LoRA rewrite |    0.445      |    0.556        |     0.648   |

c.  Evaluation of Recall@k on the standalone subset of MT-RAG.

| Strategy                    | Recall@5        | Recall@10         |  Recall@20   |
| --------------------------- | --------------- | ----------------- | ------------ |
| No rewrite                  |  0.609         | 0.723            | 0.792        | 
| Mixtral 8x7b rewrite        |  0.613         | 0.733             | 0.809        |
| Gold rewrite                |  0.609         | 0.723            | 0.792        |
| Granite 3.3-8b LoRA rewrite |   0.628     |    0.751       |   0.824    |

If we focus on Recall@20 numbers, as one instance of the metric, there is an overall 9.7 percentage point jump when using query rewrite with the Granite 3.3-8b LoRA adapter versus the no-rewrite strategy. This jump is more pronounced on the non-standalone subset, where query rewrite with the Granite 3.3-8b LoRA adapter leads to an improvement of over 21 percentage points over the no-rewrite strategy. We can also observe that the numbers with the LoRA rewrites are very close to those obtained with the gold rewrites on the non-standalone subset, and slightly better than the gold rewrites on the standalone subset (human annotators were instructed to leave a query unchanged when classifying it as standalone, but the LoRA adapter may still perform some rewriting, which turns out to further improve recall).

### Evaluation of answer generation 

We evaluate answer generation quality using the top-k passages retrieved under the various query rewrite strategies. We use k = 20 here, but similar trends hold for other values of k. Granite-3.3-8b-instruct serves as the answer generator, and we report three answer-quality metrics: [RAGAS](https://arxiv.org/abs/2309.15217) Faithfulness on the answerable subset of MT-RAG; [JAFS](https://arxiv.org/abs/2504.11704), which rewards the model for correctly abstaining on unanswerable queries (full credit) and for providing faithful answers on answerable queries (partial credit based on RAGAS Faithfulness); and the [RAD-Bench](https://arxiv.org/abs/2409.12558) score. We use the same three test sets as above.
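
To make the JAFS description above concrete, the sketch below shows a per-example scoring rule consistent with it; the handling of abstention on answerable queries is our assumption, and the JAFS paper should be consulted for the exact definition:

```python
def jafs_example_score(is_answerable: bool, model_abstained: bool,
                       ragas_faithfulness: float) -> float:
    """Hypothetical per-example JAFS score, following the description above."""
    if not is_answerable:
        # Full credit for correctly abstaining on an unanswerable query
        return 1.0 if model_abstained else 0.0
    # Partial credit on answerable queries, based on RAGAS Faithfulness of the answer
    return 0.0 if model_abstained else ragas_faithfulness
```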

a. Evaluation of answer quality on full MT-RAG dataset.

| Strategy                    | RAGAS-F   (Answerable Subset)       | RAD-Bench        |  JAFS            |
| --------------------------- | ---------------- | ---------------- | ---------------- |
| No rewrite                  |  0.793            |    0.678          |  0.664          |
| Mixtral 8x7b rewrite        |  0.78            |    0.679         |   0.682          |
| Gold rewrite                |  0.81            |   0.686         |  0.67          |
| Granite 3.3-8b LoRA rewrite |  0.874           |     0.698         | 0.722        |

b. Evaluation of answer quality on non-standalone MT-RAG subset.

| Strategy                    | RAGAS-F   (Answerable Subset)        | RAD-Bench        |   JAFS            |
| --------------------------- | ---------------- | ---------------- | ---------------- |
| No rewrite                  |  0.695          |   0.618           |  0.581        |
| Mixtral 8x7b rewrite        |  0.776          |   0.644           |  0.627      |
| Gold rewrite                |  0.786          |   0.661           |  0.634   |
| Granite 3.3-8b LoRA rewrite |  0.865          |     0.669         | 0.70     |    

c. Evaluation of answer quality on standalone subset of MT-RAG. 

| Strategy                    | RAGAS-F   (Answerable Subset)        | RAD-Bench        |   JAFS            |
| --------------------------- | ---------------- | ---------------- | ---------------- |
| No rewrite                  |  0.845          |   0.71           |  0.708        |
| Mixtral 8x7b rewrite        |  0.854            |   0.697           |  0.71     |
| Gold rewrite                  |  0.845          |   0.71           |  0.708        |
| Granite 3.3-8b LoRA rewrite |   0.88               |   0.713               |     0.734    |




As with Recall, similar observations can be made here as well. Specifically, on the full dataset, we see an 8.1 percentage points jump in RAGAS Faithfulness (from 0.793 to 0.874), a 2 percentage points jump in RAD-Bench score (from 0.678 to 0.698), and a 5.8 percentage points jump in JAFS (from 0.664 to 0.722) when using query rewrite with the Granite 3.3-8b LoRA adapter versus when using the no rewrite strategy. This improvement is more pronounced on the non-standalone subset, where query rewrite with the Granite 3.3-8b LoRA adapter leads to a 17 percentage points jump in RAGAS Faithfulness (from 0.695 to 0.865), a 5.1 percentage points jump in RAD-Bench score (from 0.618 to 0.669), and an 11.9 percentage points jump in JAFS (from 0.581 to 0.70). 

## Contact 
[Lucian Popa](mailto:[email protected])
[Krishnateja Killamsetty](mailto:[email protected])

### Framework versions

- PEFT 0.14.0