Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input
Paper
•
2408.15542
•
Published
Kangaroo has been released. Please check out our paper, blog and github for details.
We introduce Kangaroo, a powerful Multimodal Large Language Model designed for long-context video understanding. Our presented Kangaroo model shows remarkable performance across diverse video understanding tasks including video caption, QA and conversation. Generally, our key contributions in this work can be summarized as follows:
See our github page
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("KangarooGroup/kangaroo")
model = AutoModelForCausalLM.from_pretrained(
"KangarooGroup/kangaroo",
torch_dtype=torch.bfloat16,
trust_remote_code=True,
)
model = model.to("cuda")
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
video_path = "/path/to/video"
# Round 1
query = "Give a brief description of the video."
out, history = model.chat(video_path=video_path,
query=query,
tokenizer=tokenizer,
max_new_tokens=512,
eos_token_id=terminators,
do_sample=True,
temperature=0.6,
top_p=0.9,)
print('Assitant: \n', out)
# Round 2
query = "What happend at the end of the video?"
out, history = model.chat(video_path=video_path,
query=query,
history=history,
tokenizer=tokenizer,
max_new_tokens=512,
eos_token_id=terminators,
do_sample=True,
temperature=0.6,
top_p=0.9,)
print('Assitant: \n', out)
If you find it useful for your research , please cite related papers/blogs using this BibTeX:
@misc{kangaroogroup,
title={Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input},
url={https://kangaroogroup.github.io/Kangaroo.github.io/},
author={Jiajun Liu and Yibing Wang and Hanghang Ma and Xiaoping Wu and Xiaoqi Ma and Jie Hu},
month={July},
year={2024}
}
Totally Free + Zero Barriers + No Login Required