ZTE-AIM/7B-Curr-ReFT

+---
+license: apache-2.0
+datasets:
+- ZTE-AIM/Curr-ReFT-data
+base_model:
+- Qwen/Qwen2.5-VL-3B-Instruct
+- Qwen/Qwen2.5-VL-7B-Instruct
+pipeline_tag: image-text-to-text
+---
+## Curr-ReFT-data
+[\[📂 GitHub\]](https://github.com/ding523/Curr_REFT)
+[\[🤗 HF Dataset\]](https://huggingface.co/datasets/ZTE-AIM/Curr-ReFT-data)
+## Curr-ReFT-model
+[\[🤗 Curr-ReFT-3B\]](https://huggingface.co/ZTE-AIM/3B-Curr-ReFT)
+[\[🤗 Curr-ReFT-7B\]](https://huggingface.co/ZTE-AIM/7B-Curr-ReFT)
+## Model Overview
+This is a multimodal large language model fine-tuned from Qwen2.5-VL with our **Curr-ReFT** method. Training has two stages: Curriculum Reinforcement Learning, which gradually increases task complexity, followed by Rejected-Sample-based Self-improvement, which maintains the model's foundational capabilities.
+The model improves vision-language understanding and reasoning, and is intended for complex tasks such as visual reasoning, detailed image understanding, and multimodal problem-solving. It can serve as a general multimodal assistant across diverse domains, with improved accuracy and contextual awareness.
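To make the idea of "gradually increasing task complexity" concrete, here is a minimal sketch of curriculum ordering. It is not the authors' training code: the `Sample` class, the integer `difficulty` label, the `curriculum_order` helper, and the three example task types are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Sample:
    prompt: str
    difficulty: int  # hypothetical label: 0 is the easiest stage


def curriculum_order(samples: List[Sample]) -> Iterator[List[Sample]]:
    """Yield one list of samples per difficulty stage, easiest stage first."""
    for level in sorted({s.difficulty for s in samples}):
        yield [s for s in samples if s.difficulty == level]


# Illustrative data only; the real task mix is defined by the training dataset.
samples = [
    Sample("open-ended question", 2),
    Sample("yes/no question", 0),
    Sample("multiple-choice question", 1),
]

stages = [[s.prompt for s in stage] for stage in curriculum_order(samples)]
print(stages)
```

A trainer following this pattern would finish each stage before moving to the next, so the policy sees harder tasks only after the easier ones.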
+## Training Configuration
+- Framework: Training uses the open-source **R1-V** library with **Qwen2.5-VL-Instruct** as the base model. Curr-ReFT is released in two sizes: 3B and 7B.
+The GRPO training configuration is as follows:
+```yaml
+max_pixels: 401408
+per_device_train_batch_size: 1
+gradient_accumulation_steps: 1
+learning_rate: 1.0e-5
+num_train_epochs: 1.0
+lr_scheduler_type: cosine
+bf16: true
+flash_attn: fa2
+```
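As a rough check on these values, the sketch below derives two quantities from the configuration. It assumes each visual token covers a 28×28 pixel area, which matches Qwen2.5-VL's 14-pixel patches with 2×2 merging, and it uses a hypothetical device count of 8 that is not stated in this card. The helper names are illustrative, not part of R1-V.

```python
# Values copied from the GRPO configuration above.
MAX_PIXELS = 401408
PER_DEVICE_TRAIN_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 1

# Assumption: one visual token per 28x28 pixel area (14-pixel patches merged 2x2).
PIXELS_PER_VISUAL_TOKEN = 28 * 28


def max_visual_tokens(max_pixels: int) -> int:
    """Upper bound on visual tokens per image for a given pixel budget."""
    return max_pixels // PIXELS_PER_VISUAL_TOKEN


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int) -> int:
    """Samples contributing to one optimizer step across all devices."""
    return per_device * grad_accum * num_devices


print(max_visual_tokens(MAX_PIXELS))
# num_devices=8 is a hypothetical value, not taken from the model card.
print(
    effective_batch_size(
        PER_DEVICE_TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS, num_devices=8
    )
)
```

Under that assumption, `max_pixels` caps each image at 512 visual tokens, since 401408 = 512 × 784.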
+## Usage
+You can load the model using the Hugging Face `transformers` library:
+```python
+from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+import torch
+from qwen_vl_utils import process_vision_info
+MODEL_ID = "ZTE-AIM/7B-Curr-ReFT"  # Hub repo ID; use "ZTE-AIM/3B-Curr-ReFT" for the 3B model
+processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
+model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    MODEL_ID,
+    trust_remote_code=True,
+    torch_dtype=torch.bfloat16
+).to("cuda").eval()
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "<your image path>"},
+            {"type": "text", "text": "Hint: Please answer the question and provide the final answer at the end. Question: Which number do you have to write in the last daisy?"},
+        ],
+    }
+]
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+inputs = inputs.to(model.device)
+generated_ids = model.generate(**inputs, max_new_tokens=4096)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
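When running several questions, it can help to build the `messages` payload with a small helper. The sketch below reuses the hint wording from the example above; the `build_messages` function and the `HINT` constant are hypothetical conveniences, not part of `transformers` or `qwen_vl_utils`.

```python
from typing import Any, Dict, List

# Hint text copied from the usage example above.
HINT = "Hint: Please answer the question and provide the final answer at the end."


def build_messages(image_path: str, question: str) -> List[Dict[str, Any]]:
    """Build a single-turn chat payload with one image and one question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": f"{HINT} Question: {question}"},
            ],
        }
    ]


# "<your image path>" is a placeholder, as in the example above.
messages = build_messages(
    "<your image path>",
    "Which number do you have to write in the last daisy?",
)
print(messages[0]["content"][1]["text"])
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the example above. Note that `output_text` there is a list with one decoded string per input.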
+## Institution
+- ZTE-AIM
+- University of Science and Technology of China
+## Model Contact
+- [email protected]
+- [email protected]
+- [email protected]