Model Card for Flex-Omni-7B

Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators

Flex-Omni-7B is a 10.7B-parameter multimodal evaluator that handles not only vision-language tasks but also audio-based evaluations, which traditional vision-language evaluators cannot do. It inherits the reasoning-by-text paradigm from Flex-Judge, unifying vision, language, and audio reasoning within a single framework, and it outperforms models such as Gemini-2.0-Flash on audio benchmarks including MOS prediction and speech scoring.

Model Description

  • We propose Flex-Judge, a reasoning-guided multimodal evaluator that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats.
  • Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge evaluation.

Model Sources

  • Paper: https://arxiv.org/abs/2505.18601

Uses

For more comprehensive usage examples and implementation details, please refer to our official repository.

Requirements

pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
pip install accelerate
pip install qwen-omni-utils[decord] -U
pip install vllm
pip install datasets

Using vLLM

We recommend using vLLM instead of transformers for faster inference. The results reported in our paper are based on the vLLM library.

from datasets import load_dataset
from vllm import LLM, SamplingParams

# Load the model on the available device(s)
llm = LLM(
    "jongwooko/Flex-Omni-7B",
    tensor_parallel_size=4,            # number of GPUs used for tensor parallelism
    limit_mm_per_prompt={"image": 1},  # maximum number of images accepted per prompt
)
sampling_params = SamplingParams(
    max_tokens=4096,
    temperature=0.2,
    top_p=0.95,
)

# Example: one pairwise comparison from VL-RewardBench (query, image, two responses)
example = load_dataset('MMInstruction/VL-RewardBench', split='test')[0]
question, image = example["query"], example["image"]
answer1, answer2 = example["response"]

# System prompt for Flex-Judge
SYSTEM_PROMPT = (
    "You are a helpful assistant. The assistant first performs a detailed, "
    "step-by-step reasoning process in its mind and then provides the user with "
    "the answer. The reasoning process and answer are enclosed within <think> "
    "reasoning process here, explaining each step of your evaluation for both "
    "assistants </think><answer> answer here </answer>. Now the user asks you "
    "to judge the performance of two AI assistants in response to the question. "
    "Score assistants 1-10 (higher=better). Criteria includes helpfulness, "
    "relevance, accuracy, and level of detail. Avoid order, length, style or "
    "other bias. After thinking, when you finally reach a conclusion, clearly "
    "provide your evaluation scores within <answer> </answer> tags, i.e., for "
    "example, <answer>3</answer><answer>5</answer>"
)

instruction = (
    f"<|vision_start|><|IMAGE|><|vision_end|>\n\n[Question]\n{question}\n\n"
    f"[Assistant 1's Answer]\n{answer1}\n\n[Assistant 2's Answer]\n{answer2}"
)
prompt = (
    f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
    f"<|im_start|>user\n{instruction}<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n\n"
)
inputs = {"prompt": prompt, "multi_modal_data": {"image": [image]}}

# Inference: generate the evaluation output
outputs = llm.generate([inputs], sampling_params=sampling_params)
output_text = outputs[0].outputs[0].text
print(output_text)
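
The system prompt asks the judge to place its two scores inside <answer> </answer> tags. The snippet below is a minimal sketch for pulling those scores out of output_text; the parse_scores helper and its fallback behavior are our own illustration rather than part of the official repository.

import re

def parse_scores(text):
    """Extract the two per-assistant scores from <answer>...</answer> tags.

    Assumes the model follows the format requested in SYSTEM_PROMPT,
    e.g. "<answer>3</answer><answer>5</answer>". Returns None when two
    numeric answers cannot be found.
    """
    matches = re.findall(r"<answer>\s*(\d+(?:\.\d+)?)\s*</answer>", text)
    if len(matches) < 2:
        return None
    return float(matches[0]), float(matches[1])

scores = parse_scores(output_text)
if scores is not None:
    print(f"Assistant 1: {scores[0]}, Assistant 2: {scores[1]}")
else:
    print("Could not find two <answer> tags in the model output.")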

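Flex-Omni-7B also accepts audio inputs through the same vLLM interface. The sketch below adapts the pairwise-judging setup above to an audio-grounded comparison; the audio placeholder tokens, the example question and answers, and the my_speech.wav path are all illustrative assumptions (librosa, used here only to load the waveform, is not listed in the requirements above), so please check the official repository for the exact audio prompt template.

import librosa

# Re-create the judge with audio inputs enabled
# (alternatively, pass limit_mm_per_prompt={"image": 1, "audio": 1} when building llm above).
audio_llm = LLM(
    "jongwooko/Flex-Omni-7B",
    tensor_parallel_size=4,
    limit_mm_per_prompt={"audio": 1},  # maximum number of audio clips per prompt
)

# Load a local speech sample; "my_speech.wav" is a placeholder path.
audio_array, sampling_rate = librosa.load("my_speech.wav", sr=16000)

# The audio placeholder tokens below are an assumption; check the official
# repository for the exact chat template used for audio evaluation.
audio_instruction = (
    "<|audio_bos|><|AUDIO|><|audio_eos|>\n\n[Question]\nDescribe the content "
    "of this audio clip.\n\n[Assistant 1's Answer]\nA person is reading a "
    "short news passage in a quiet room.\n\n[Assistant 2's Answer]\nThis is "
    "instrumental music with no speech."
)
audio_prompt = (
    f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
    f"<|im_start|>user\n{audio_instruction}<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n\n"
)
audio_inputs = {
    "prompt": audio_prompt,
    "multi_modal_data": {"audio": [(audio_array, sampling_rate)]},
}
outputs = audio_llm.generate([audio_inputs], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
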
Citation

BibTeX:

@article{ko2025flex,
  title={Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators},
  author={Ko, Jongwoo and Kim, Sungnyun and Cho, Sungwoo and Yun, Se-Young},
  journal={arXiv preprint arXiv:2505.18601},
  year={2025}
}