# Model Card for Flex-Omni-7B

**Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators**

Flex-Omni-7B is an 11B-parameter multimodal evaluator capable of handling not only vision-language tasks but also audio-based evaluations, something traditional vision-language models cannot do. It inherits the reasoning-by-text paradigm from Flex-Judge, enabling strong performance across modalities, and even outperforms models like Gemini-2.0-Flash on audio benchmarks such as MOS (mean opinion score) prediction and speech scoring. Unlike vision-language judges, Flex-Omni-7B unifies vision, language, and audio reasoning within a single framework.
## Model Description
- We propose Flex-Judge, a reasoning-guided multimodal evaluator that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats.
- Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge evaluation.
## Model Sources
- **Repository:** https://github.com/jongwooko/flex-judge
- **Paper:** [Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators](https://arxiv.org/abs/2505.18601)
## Uses
For more comprehensive usage examples and implementation details, please refer to our official repository.
### Requirements
```bash
pip install git+https://github.com/huggingface/transformers  # Qwen2.5-Omni support requires a recent transformers build
pip install accelerate
pip install qwen-omni-utils[decord] -U
pip install vllm
pip install datasets
```
### Using vLLM
We recommend using `vllm` rather than `transformers` here, as it substantially improves inference speed. The results reported in our paper are based on the `vllm` library.
```python
from datasets import load_dataset
from vllm import LLM, SamplingParams

# Load the model on the available device(s)
llm = LLM(
    "jongwooko/Flex-Omni-7B",
    tensor_parallel_size=4,  # adjust to the number of available GPUs
    limit_mm_per_prompt={"image": 1},  # maximum number of images accepted per prompt
)

sampling_params = SamplingParams(
    max_tokens=4096,
    temperature=0.2,
    top_p=0.95,
)

# Example: pairwise judging on a VL-RewardBench sample
example = load_dataset('MMInstruction/VL-RewardBench', split='test')[0]
question, image = example["query"], example["image"]
answer1, answer2 = example["response"]

# System prompt for Flex-Judge
SYSTEM_PROMPT = (
    "You are a helpful assistant. The assistant first performs a detailed, "
    "step-by-step reasoning process in its mind and then provides the user with "
    "the answer. The reasoning process and answer are enclosed within <think> "
    "reasoning process here, explaining each step of your evaluation for both "
    "assistants </think><answer> answer here </answer>. Now the user asks you "
    "to judge the performance of two AI assistants in response to the question. "
    "Score assistants 1-10 (higher=better). Criteria includes helpfulness, "
    "relevance, accuracy, and level of detail. Avoid order, length, style or "
    "other bias. After thinking, when you finally reach a conclusion, clearly "
    "provide your evaluation scores within <answer> </answer> tags, i.e., for "
    "example, <answer>3</answer><answer>5</answer>"
)

instruction = (
    f"<|vision_start|><|IMAGE|><|vision_end|>\n\n[Question]\n{question}\n\n"
    f"[Assistant 1's Answer]\n{answer1}\n\n[Assistant 2's Answer]\n{answer2}"
)

prompt = (
    f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
    f"<|im_start|>user\n{instruction}<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n\n"
)

inputs = {"prompt": prompt, "multi_modal_data": {"image": [image]}}

# Inference: generate the judge's reasoning and scores
outputs = llm.generate([inputs], sampling_params=sampling_params)
output_text = outputs[0].outputs[0].text
print(output_text)
```
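
The system prompt asks the judge to place its two scores inside `<answer>` tags (e.g. `<answer>3</answer><answer>5</answer>`). As a minimal sketch, the scores can be recovered from `output_text` with a regular expression; the `parse_scores` helper below is illustrative and not part of the repository:

```python
import re

def parse_scores(text: str):
    """Extract integer scores from <answer>...</answer> tags in the judge output."""
    return [int(m) for m in re.findall(r"<answer>\s*(\d+)\s*</answer>", text)]

scores = parse_scores(output_text)  # e.g. [3, 5] for "<answer>3</answer><answer>5</answer>"
if len(scores) == 2:
    if scores[0] == scores[1]:
        print(f"Tie at {scores[0]}")
    else:
        winner = 1 if scores[0] > scores[1] else 2
        print(f"Assistant {winner} wins ({scores[0]} vs. {scores[1]})")
```

Because `llm.generate` accepts a list of inputs, the same prompt-building pattern extends directly to batched evaluation over the full benchmark split.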
## Citation

**BibTeX:**
```bibtex
@article{ko2025flex,
  title={Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators},
  author={Ko, Jongwoo and Kim, Sungnyun and Cho, Sungwoo and Yun, Se-Young},
  journal={arXiv preprint arXiv:2505.18601},
  year={2025}
}
```
## Base Model

- [Qwen/Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B)