Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation

💻 GitHub   |    📑 Paper   

Model Summary

This model is part of the Sparrow project. It is a video-LLM fine-tuned from the image-LLM MiniCPM-Llama3-V-2_5.

Abstract: Recent years have seen the success of Multimodal Large Language Models (MLLMs) in the domain of vision understanding. The success of these models can largely be attributed to the dominant scaling law, which states that larger parameter sizes and data volumes contribute to better performance. Notably, data scaling has been primarily driven by automatic data pipelines, which focus on the self-instruction of LLMs. This paradigm has long been taken for granted, yet the effectiveness of scaling with such data has remained largely unexamined. In this context, this work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Our primary study approach involves fine-tuning pre-trained image-LLMs with video data and examining learning efficiency through data scaling. Results from our preliminary experiments reveal a low learning efficiency phenomenon when simply scaling up video data samples, which, through our probing, can be ascribed to a lack of instruction diversity. Aiming at this issue, we propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Mixing these synthetic samples with the video data enables a more efficient training scheme. Through comprehensive experiments, we demonstrate that our proposed method achieves performance comparable to or even superior to that of baselines trained with significantly more samples. Meanwhile, we find that incorporating these synthetic samples can enhance long video understanding without requiring training on long video data.
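
At a high level, the augmentation renders the text of a pure-text instruction sample into a sequence of images, so the sample mimics the multi-frame structure of a video while the instruction and answer remain textual. The snippet below is a minimal illustrative sketch of this idea using PIL; the chunking scheme, font, layout, and the render_text_to_frames helper are assumptions for demonstration, not the exact settings used by Sparrow.

from PIL import Image, ImageDraw
import textwrap

def render_text_to_frames(context: str, num_frames: int = 8, size=(448, 448)):
    """Render a long text context into a sequence of image 'frames'.
    Illustrative only: chunking, font, and layout are assumptions,
    not the exact settings used by Sparrow."""
    # Split the context into roughly equal chunks, one per frame
    chunk_len = max(1, -(-len(context) // num_frames))  # ceiling division
    chunks = [context[i:i + chunk_len] for i in range(0, len(context), chunk_len)]
    frames = []
    for chunk in chunks:
        img = Image.new("RGB", size, color="white")
        draw = ImageDraw.Draw(img)
        # Wrap the chunk so it fits inside the frame (default PIL font)
        draw.text((10, 10), textwrap.fill(chunk, width=48), fill="black")
        frames.append(img)
    return frames

# A pure-text instruction sample becomes a video-like training sample:
# the rendered frames carry the context, the question/answer stay as text.
context = "A long passage taken from a pure-text instruction sample ..."
synthetic_sample = {
    "frames": render_text_to_frames(context),
    "question": "What is the main topic of the passage?",
    "answer": "...",
}

Per the abstract, mixing such synthetic samples with real video data yields a more efficient training scheme than scaling up video samples alone.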

Sample Usage

This model is designed for video-language understanding. You can load it using the transformers library. Ensure trust_remote_code=True is set for proper model loading. For video input, you will typically provide a list of image frames (PIL Images).

Prerequisites: You might need decord to easily load video frames. Install it via pip install decord.

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from PIL import Image
import torch
import numpy as np
from decord import VideoReader, cpu # For video loading

# Load model and processor
model_id = "VITA-MLLM/Sparrow-Llama3-V-2_5" # Replace with the actual model ID if different
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16, # Use bfloat16 for better performance/memory
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# --- Example: Load video frames ---
video_path = "path/to/your/video.mp4" # <--- IMPORTANT: Replace with your video file path!
video_frames = []
try:
    vr = VideoReader(video_path, ctx=cpu(0))
    # Sample a maximum of 32 frames uniformly for demonstration
    total_frames = len(vr)
    num_frames_to_sample = min(total_frames, 32)
    frame_indices = np.linspace(0, total_frames - 1, num_frames_to_sample, dtype=int)

    video_frames = [Image.fromarray(vr[i].asnumpy()) for i in frame_indices]
    print(f"Loaded {len(video_frames)} frames from {video_path}")
except Exception as e:
    print(f"Could not load video from {video_path}: {e}")
    print("Using placeholder images for demonstration. Please provide a valid video file.")
    video_frames = [Image.new("RGB", (224, 224), color="blue")] * 4 # Fallback to placeholder images


# --- Prepare prompt with video frames ---
# The <video> tag is specific to MiniCPM-V models for indicating video/image input.
# It should be repeated for each image frame provided.
messages = [
    {"role": "user", "content": "<video>" * len(video_frames) + "
Describe this video in detail."}
]

# Apply chat template and tokenize inputs
inputs = processor.apply_chat_template(
    messages,
    video=video_frames, # Pass the list of PIL Images here
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
)

# Move inputs to appropriate device (e.g., GPU)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# --- Generate response ---
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image_pixel_values=inputs["image_pixel_values"], # Essential for vision inputs
        max_new_tokens=256, # Adjust as needed
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# Decode and print the output
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
# Clean up any potential chat template artifacts at the beginning/end
response = response.split('<|start_header_id|>assistant<|end_header_id|>')[-1].strip()

print("
Generated Response:")
print(response)

License

Model License

  • The code in this repo is released under the Apache-2.0 License.
  • The usage of MiniCPM-V series model weights must strictly follow MiniCPM Model License.md.
  • The models and weights of MiniCPM are completely free for academic research. After filling out a "questionnaire" for registration, they are also available for free commercial use.

Statement

  • As an LLM, MiniCPM-Llama3-V 2.5 generates content by learning from a large amount of text, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-Llama3-V 2.5 does not represent the views and positions of the model developers.
  • We will not be liable for any problems arising from the use of the MiniCPM-V open-source model, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the misguidance, misuse, dissemination, or improper use of the model.

Training Dataset

  • 100K video instruction data from Video-ChatGPT
  • 100K video caption data from ShareGemini
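
For orientation, the sketch below shows one way the two sources listed above could be combined into a single fine-tuning mixture. The file names (videochatgpt_instructions.json, sharegemini_captions.json) and the JSON schema are hypothetical placeholders, not the project's actual data layout.

import json
import random

# Hypothetical file names and schema; the actual data files and format
# used in the Sparrow project may differ.
with open("videochatgpt_instructions.json") as f:
    video_instruction_data = json.load(f)  # ~100K instruction samples
with open("sharegemini_captions.json") as f:
    video_caption_data = json.load(f)      # ~100K caption samples

# Concatenate and shuffle the two sources into a single ~200K-sample mixture
train_samples = video_instruction_data + video_caption_data
random.seed(42)
random.shuffle(train_samples)
print(f"Total training samples: {len(train_samples)}")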