# Fine-tune IDEFICS3 on Visual Question Answering

In this notebook we will fine-tune IDEFICS3 on the VQAv2 dataset.

The transformers PR isn't merged yet, so we will install the branch that contains the Idefics3 implementation.

In [None]:
!git clone https://github.com/andimarafioti/transformers.git

In [2]:
%cd transformers

/home/merve/transformers




In [3]:
!git checkout idefics3

Previous HEAD position was a72b30fe0 hot fix for merve
Switched to branch 'idefics3'
Your branch is up to date with 'origin/idefics3'.


In [10]:
!git checkout a72b30fe06bba77d9df4c72fcea48bbdc0d812a5

Note: switching to 'a72b30fe06bba77d9df4c72fcea48bbdc0d812a5'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at a72b30fe0 hot fix for merve


In [None]:
!pip install -q "."

In [12]:
!pip install -q accelerate datasets peft bitsandbytes

In [13]:
!pip install -q flash-attn --no-build-isolation

We will push our model to the Hub, so we need to authenticate ourselves.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

We do not have to fine-tune every weight of the model. QLoRA trains a small low-rank adapter on top of a 4-bit quantized copy of the model, which saves a lot of memory. Two flags in the model-loading cell below select the training mode. For QLoRA, set `USE_QLORA` to True. For plain LoRA, set `USE_QLORA` to False and `USE_LORA` to True. For full fine-tuning, set both `USE_LORA` and `USE_QLORA` to False.
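
To see why quantization saves space, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. The parameter count is an assumed round figure for an 8B-parameter model, and the estimate ignores activations, gradients, optimizer state, and the small overhead of quantization constants.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: round figure for an 8B-parameter model

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit (bfloat16) weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit (NF4) quantized weights

print(f"bfloat16 weights: {bf16_gb:.1f} GB")
print(f"4-bit weights:    {nf4_gb:.1f} GB")
```

Full fine-tuning also needs gradients and optimizer state for every trainable weight, so the real gap between the modes is larger than this weights-only estimate suggests.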

In [1]:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "4" # you don't need this unless you work on a multigpu setup and need to use a specific index
# if you want to use multiple GPUs, use e.g. "2,4"
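
As a small illustration of what `CUDA_VISIBLE_DEVICES` does, the sketch below shows how the device indices seen inside the process relate to the physical GPU ids listed in the variable. The helper `logical_to_physical` is hypothetical and only mimics the renumbering; it does not query any GPU. Note that the variable has to be set before CUDA is initialized, which is why the cell above sets it before anything else runs.

```python
def logical_to_physical(cuda_visible_devices: str) -> dict:
    """Map the indices a process sees (cuda:0, cuda:1, ...) to the physical GPU ids."""
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    return dict(enumerate(ids))


print(logical_to_physical("4"))    # {0: '4'}
print(logical_to_physical("2,4"))  # {0: '2', 1: '4'}
```

With the setting `"4"`, the process sees a single device, `cuda:0`, which is physical GPU 4 in PCI bus order.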

We will load the VQAv2 dataset. For educational purposes we only load the validation split, then split it again and train on the smaller 20% portion.

In [9]:
from datasets import load_dataset
ds = load_dataset('merve/vqav2-small', trust_remote_code=True)

In [10]:
split_ds = ds["validation"].train_test_split(test_size=0.8)
train_ds = split_ds["train"]

In [11]:
train_ds

Dataset({
    features: ['multiple_choice_answer', 'question', 'image'],
    num_rows: 4287
})

In [None]:
import torch
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration

USE_LORA = False
USE_QLORA = False
model_id = "HuggingFaceM4/Idefics3-8B-Llama3"

processor = AutoProcessor.from_pretrained(
    model_id
)

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
        use_dora=False if USE_QLORA else True,
        init_lora_weights="gaussian"
    )
    lora_config.inference_mode = False
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
        
    model = Idefics3ForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=bnb_config if USE_QLORA else None,
        _attn_implementation="flash_attention_2",
        device_map="auto"
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, lora_config)
    print(model.get_nb_trainable_parameters())
else:
    model = Idefics3ForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        _attn_implementation="flash_attention_2",
    ).to("cuda")  # DEVICE was never defined; FlashAttention needs a CUDA device anyway
    
    # if you'd like to only fine-tune LLM
    for param in model.model.vision_model.parameters():
        param.requires_grad = False
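
To get a feel for how small a rank-8 adapter is, the sketch below counts the parameters plain LoRA adds to a single decoder layer. The layer shapes are assumptions based on the public Llama 3 8B configuration (hidden size 4096, MLP size 14336, 1024-dimensional key/value projections). The count covers only one language-model layer, and it ignores the extra magnitude vectors that DoRA adds.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


R = 8  # same rank as in the LoraConfig above

# Assumed (in_features, out_features) shapes for one Llama 3 8B decoder layer.
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

adapter_params = sum(lora_param_count(i, o, R) for i, o in shapes.values())
frozen_params = sum(i * o for i, o in shapes.values())

print(f"adapter parameters per layer: {adapter_params:,}")
print(f"frozen parameters per layer:  {frozen_params:,}")
print(f"trainable fraction:           {adapter_params / frozen_params:.4%}")
```

Because `target_modules` matches layers by name, the adapter may also attach to identically named projections outside the language model, so the total printed by `get_nb_trainable_parameters()` can be higher than this per-layer figure times the number of layers.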

Let's write our data collating function. We apply the chat template so that each question and its answer end up in a single formatted prompt the model can learn from. Then we pass the formatted prompts and the images to the processor, which tokenizes the text and preprocesses the images together.

In [12]:
# Look up the id of the "<image>" placeholder token so it can be excluded from the loss.
image_token_id = processor.tokenizer.additional_special_tokens_ids[
    processor.tokenizer.additional_special_tokens.index("<image>")
]


def collate_fn(examples):
    texts = []
    images = []
    for example in examples:
        image = example["image"]
        question = example["question"]
        answer = example["multiple_choice_answer"]
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Answer briefly."},
                    {"type": "image"},
                    {"type": "text", "text": question},
                ],
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": answer},
                ],
            },
        ]
        # The answer is already part of the conversation, so no generation prompt is added.
        text = processor.apply_chat_template(messages, add_generation_prompt=False)
        texts.append(text.strip())
        images.append([image])  # one list of images per example

    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

    # Labels are a copy of the input ids; padding and image tokens are set to -100
    # so that the loss ignores them.
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    labels[labels == image_token_id] = -100
    batch["labels"] = labels

    return batch
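
The label masking at the end of `collate_fn` can be illustrated with plain Python lists. This is only a sketch: the token ids below are hypothetical, and the list comprehension stands in for the boolean-mask assignment on tensors. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_labels(input_ids, pad_token_id, image_token_id):
    """Copy input_ids into labels, replacing padding and image tokens with IGNORE_INDEX."""
    return [
        IGNORE_INDEX if tok in (pad_token_id, image_token_id) else tok
        for tok in input_ids
    ]


# Hypothetical ids: 0 is padding, 99 is <image>, everything else is ordinary text.
input_ids = [99, 99, 11, 12, 13, 0, 0]
labels = mask_labels(input_ids, pad_token_id=0, image_token_id=99)
supervised = sum(1 for y in labels if y != IGNORE_INDEX)

print(labels)      # [-100, -100, 11, 12, 13, -100, -100]
print(supervised)  # 3
```

Only the three text positions are supervised here. In the real collator the question tokens are left unmasked as well, so the model is trained on the whole formatted conversation, not only on the answer.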


We can now define `TrainingArguments` and pass them to `Trainer`.

In [14]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=25,
    save_strategy="steps",
    save_steps=250,
    save_total_limit=1,
    optim="adamw_hf",  # for a paged 8-bit optimizer, pick "paged_adamw_8bit"
    #evaluation_strategy="epoch",
    bf16=True,
    output_dir="./idefics3-llama-vqav2",
    hub_model_id="idefics3-llama-vqav2",
    remove_unused_columns=False,
)
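
As a rough sanity check on these settings, the sketch below works out the effective batch size and the approximate number of optimizer steps in one epoch. It assumes a single visible GPU and that a trailing partial accumulation window is not counted as a step; the exact count may differ by one depending on the `Trainer` version.

```python
import math

num_examples = 4287        # rows in train_ds, as printed earlier
per_device_batch_size = 2  # per_device_train_batch_size
grad_accum_steps = 8       # gradient_accumulation_steps
num_gpus = 1               # assumption: a single visible GPU

# Number of examples that contribute to each optimizer update.
effective_batch = per_device_batch_size * grad_accum_steps * num_gpus

# Batches produced by the dataloader in one epoch (the last one may be partial).
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_gpus))

# Assumption: only full accumulation windows count as optimizer steps.
steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)

# With save_steps=250, checkpoints are written at multiples of 250.
checkpoint_steps = list(range(250, steps_per_epoch + 1, 250))

print(effective_batch, batches_per_epoch, steps_per_epoch, checkpoint_steps)
```

Under these assumptions the 50 warmup steps cover a sizeable share of the single epoch, and only one intermediate checkpoint is written before training ends.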


In [15]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=train_ds,
    #eval_dataset=test_ds,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


I'm running standalone scripts on top of tmux so the logs will not appear here. I will upload my training script to this repository.

In [None]:
trainer.train()

In [None]:
trainer.push_to_hub()