ZTE-AIM/7B-Curr-ReFT

Image-Text-to-Text

Upload README.md

0c4ef77 verified 7 months ago


	---
	license: apache-2.0
	datasets:
	- ZTE-AIM/Curr-ReFT-data
	base_model:
	- Qwen/Qwen2.5-VL-3B-Instruct
	- Qwen/Qwen2.5-VL-7B-Instruct
	pipeline_tag: image-text-to-text
	---

	## Curr-ReFT-data
	[\[📂 GitHub\]](https://github.com/ding523/Curr_REFT)
	[\[🤗 HF Dataset\]](https://huggingface.co/datasets/ZTE-AIM/Curr-ReFT-data)
	## Curr-ReFT-model
	[\[🤗 Curr-ReFT-3B\]](https://huggingface.co/ZTE-AIM/3B-Curr-ReFT)
	[\[🤗 Curr-ReFT-7B\]](https://huggingface.co/ZTE-AIM/7B-Curr-ReFT)
	## Model Overview

	This is a multimodal large language model fine-tuned from Qwen2.5-VL with our Curr-ReFT methodology. Training has two stages: Curriculum Reinforcement Learning, which gradually increases task complexity, followed by Rejected-Sample-based Self-improvement, which maintains the model's foundational capabilities.
	The model improves vision-language understanding and reasoning, which makes it well suited to complex tasks such as visual reasoning, detailed image understanding, and multimodal problem-solving. It is intended as an AI assistant for multimodal reasoning across diverse domains, with improved accuracy and contextual awareness.
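
As a rough illustration of the curriculum idea described above (training on tasks in order of increasing complexity), the following minimal sketch sorts training stages by difficulty and visits them in that order. The stage names, the `difficulty` ranks, and the `train_one_stage` callback are all hypothetical placeholders, not the actual Curr-ReFT stages or training code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CurriculumStage:
    name: str        # hypothetical stage label
    difficulty: int  # hypothetical rank; lower means easier


def run_curriculum(
    stages: List[CurriculumStage],
    train_one_stage: Callable[[CurriculumStage], None],
) -> List[str]:
    """Visit stages from easiest to hardest and return the visit order."""
    visited = []
    for stage in sorted(stages, key=lambda s: s.difficulty):
        train_one_stage(stage)  # placeholder for one RL training stage
        visited.append(stage.name)
    return visited


# Stages are deliberately listed out of order to show the sorting step.
stages = [
    CurriculumStage("hard", 3),
    CurriculumStage("easy", 1),
    CurriculumStage("medium", 2),
]
log = []
order = run_curriculum(stages, lambda s: log.append(f"training on {s.name}"))
print(order)
```

The second stage (Rejected-Sample-based Self-improvement) is not sketched here, because this card does not describe how samples are selected.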

	## Training Configuration
	- Framework: Training uses the open-source R1-V library, with Qwen2.5-VL-Instruct as the base model. Curr-ReFT is released in two variants: 3B and 7B.

	The training configuration for GRPO is as follows:
	```yaml
	max_pixels: 401408
	per_device_train_batch_size: 1
	gradient_accumulation_steps: 1
	learning_rate: 1.0e-5
	num_train_epochs: 1.0
	lr_scheduler_type: cosine
	bf16: true
	flash_attn: fa2
	```
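
To make two of these numbers more concrete, the sketch below derives the visual token budget implied by `max_pixels` and the effective batch size implied by the batch settings. It assumes that one visual token covers a 28x28 pixel area, as in the Qwen2.5-VL image processor; the GPU count is a hypothetical value, since this card does not state the hardware used.

```python
# Values copied from the GRPO configuration above.
MAX_PIXELS = 401408
PER_DEVICE_TRAIN_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 1

# Assumption: each visual token corresponds to a 28x28 pixel area
# (14-pixel patches merged 2x2), as in the Qwen2.5-VL image processor.
PIXELS_PER_VISUAL_TOKEN = 28 * 28


def max_visual_tokens(max_pixels: int) -> int:
    """Upper bound on visual tokens per image for a given pixel budget."""
    return max_pixels // PIXELS_PER_VISUAL_TOKEN


def effective_batch_size(num_gpus: int) -> int:
    """Samples contributing to one optimizer step across all devices."""
    return PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * num_gpus


print(max_visual_tokens(MAX_PIXELS))     # 512
print(effective_batch_size(num_gpus=8))  # 8; num_gpus=8 is hypothetical
```

Under that assumption, `max_pixels: 401408` equals 512 * 28 * 28, so each image is resized to at most 512 visual tokens.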

	## Usage

	You can load the model with the Hugging Face `transformers` library. The example also uses `process_vision_info` from the `qwen_vl_utils` package:

	```python
	from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
	import torch
	from qwen_vl_utils import process_vision_info

	# Hub ID of this checkpoint; use "ZTE-AIM/3B-Curr-ReFT" for the 3B variant
	MODEL_ID = "ZTE-AIM/7B-Curr-ReFT"
	processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
	model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
	    MODEL_ID,
	    trust_remote_code=True,
	    torch_dtype=torch.bfloat16,
	).to("cuda").eval()

	# A single-turn conversation with one image and one text prompt
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "image", "image": "<your image path>"},
	            {"type": "text", "text": "Hint: Please answer the question and provide the final answer at the end. Question: Which number do you have to write in the last daisy?"},
	        ],
	    }
	]

	# Preparation for inference: render the chat template and load the image
	text = processor.apply_chat_template(
	    messages, tokenize=False, add_generation_prompt=True
	)
	image_inputs, video_inputs = process_vision_info(messages)
	inputs = processor(
	    text=[text],
	    images=image_inputs,
	    videos=video_inputs,
	    padding=True,
	    return_tensors="pt",
	)
	inputs = inputs.to(model.device)

	# Generate, then strip the prompt tokens so only the new tokens are decoded
	generated_ids = model.generate(**inputs, max_new_tokens=4096)
	generated_ids_trimmed = [
	    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	output_text = processor.batch_decode(
	    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
	)
	print(output_text)
	```
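
When asking several questions, it can help to build the `messages` structure with a small helper. The sketch below reproduces the prompt format from the example above; `build_messages` and `HINT` are hypothetical names introduced here for illustration and are not part of `transformers` or `qwen_vl_utils`.

```python
# Hint prefix copied from the usage example above.
HINT = "Hint: Please answer the question and provide the final answer at the end."


def build_messages(image_path: str, question: str) -> list:
    """Return a single-turn chat with one image and one hinted question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": f"{HINT} Question: {question}"},
            ],
        }
    ]


messages = build_messages(
    "<your image path>",
    "Which number do you have to write in the last daisy?",
)
print(messages[0]["content"][1]["text"])
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` in the same way as the hand-written `messages` above.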



	## Institution
	- ZTE-AIM
	- University of Science and Technology of China

	## Model Contact
	- [email protected]
	- [email protected]
	- [email protected]