---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- Kwai-Keye/Thyme-SFT
- Kwai-Keye/Thyme-RL
language:
- en
license: mit
metrics:
- accuracy
pipeline_tag: image-text-to-text
library_name: transformers
---
# Thyme: Think Beyond Images

[📖 Home Page] [📖 GitHub Repo] [📖 Technical Report]
[📊 Thyme SFT Model] [📊 Thyme RL Model] [📝 SFT Data] [📝 RL Data]
## 🔥 News

**2025.08.15**: 🌟 We are excited to introduce **Thyme: Think Beyond Images**. Thyme goes beyond the traditional "thinking with images" paradigm by autonomously generating and executing diverse image-processing and computational operations as executable code, significantly enhancing performance on high-resolution perception and complex reasoning tasks. Leveraging a novel two-stage training strategy that combines supervised fine-tuning with reinforcement learning, and empowered by the GRPO-ATS algorithm, Thyme achieves a balance between reasoning exploration and code-execution precision.

We have provided the usage instructions, training code, and evaluation code in the GitHub repo.
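To make the "executable code" idea concrete, here is a minimal, illustrative sketch of the kind of crop-and-zoom operation Thyme writes into its reasoning traces (compare the trace in the usage example below). This is not Thyme's actual generated code; the input path, crop coordinates, and zoom factor are hypothetical placeholders.

```python
# Illustrative only: the kind of snippet Thyme emits inside its <code> blocks and
# runs in a sandbox. Paths, coordinates, and the scale factor are hypothetical.
import cv2

image_path = "input.jpg"               # hypothetical input image
x1, y1, x2, y2 = 420, 610, 560, 660    # hypothetical region of interest (e.g., a license plate)

img = cv2.imread(image_path)
crop = img[y1:y2, x1:x2]

# Upscale the crop so fine details (characters, digits) become legible.
zoomed = cv2.resize(crop, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)

processed_path = "processed_crop.jpg"
cv2.imwrite(processed_path, zoomed)
print(processed_path)  # the sandbox returns this path so the model can inspect the new image
```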
## Usage

Here's how to use the Thyme model for inference with the 🤗 Transformers library:
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Thyme is built on Qwen2.5-VL, so we use the Qwen2.5-VL model class.
# Default: load the model on the available device(s).
model_id = "Kwai-Keye/Thyme-RL"  # or "Kwai-Keye/Thyme-SFT"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Example: using the image from the GitHub repo's usage example.
# For your own inference, replace this with a local image path or a URL.
# image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/object_detection_example.png"
image_path = "https://huggingface.co/Kwai-Keye/Thyme-RL/resolve/main/17127.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_path,
            },
            {
                "type": "text",
                "text": (
                    "Question: What is the plate number of the blue car in the picture?\n"
                    "Options:\n"
                    "A. S OT 911\n"
                    "B. S TQ 119\n"
                    "C. S QT 911\n"
                    "D. B QT 119\n"
                    "E. This image doesn't feature the plate number.\n"
                    "Please select the correct answer from the options above."
                ),
            },
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate the output. Thyme's reasoning traces (thinking plus generated
# code) are long, so allow substantially more new tokens than for a plain-answer model.
generated_ids = model.generate(**inputs, max_new_tokens=2048)

# Decode only the newly generated tokens
input_token_len = inputs.input_ids.shape[1]
generated_ids_trimmed = generated_ids[:, input_token_len:]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
)[0]  # get the first (and only) output string
print(output_text)
# Expected output (simplified; the actual output may vary slightly due to model
# variability and sandbox details):
#
# <think>To determine the plate number of the blue car in the image, we need to focus on the license plate located near the bottom front of the vehicle. ...
# <code>
# ```python
# import cv2
# ...
# print(processed_path)
# ```
# </code>
# <sandbox_output>
# ...
# </sandbox_output>
# Upon examining the cropped and zoomed-in image of the license plate, it becomes clear that the characters are "S QT 911". This matches option C. Therefore, the correct answer is C. S QT 911.</think>
# <answer> **C. S QT 911** </answer>
```
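The `<code>`/`<sandbox_output>` tags in the trace come from Thyme's execute-and-reinject loop: generated code is run in a sandbox and its output is fed back to the model before it continues reasoning. The full sandbox and inference pipeline are provided in the GitHub repo; the snippet below is only a minimal, hypothetical sketch of such a loop, continuing the usage example above. The helper name, the plain `subprocess` execution, and the re-prompting format are assumptions, not the repo's actual API.

```python
import re
import subprocess
import tempfile
from typing import Optional

def run_generated_code(reply: str) -> Optional[str]:
    """Run the first fenced Python block found in a model reply and return its output.

    Simplified stand-in for Thyme's sandbox: the real pipeline (see the GitHub repo)
    adds isolation, timeouts, and richer result handling.
    """
    match = re.search(r"```python\n(.*?)```", reply, re.DOTALL)
    if match is None:
        return None  # the model produced no code this turn
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(match.group(1))
        script_path = f.name
    result = subprocess.run(
        ["python", script_path], capture_output=True, text=True, timeout=30
    )
    return (result.stdout or result.stderr).strip()

# Hypothetical continuation of the usage example above: execute the generated code,
# feed the result back as a <sandbox_output> turn, and generate again so the model
# can reason over the processed image it just created.
sandbox_result = run_generated_code(output_text)
if sandbox_result is not None:
    messages.append({"role": "assistant", "content": [{"type": "text", "text": output_text}]})
    messages.append({
        "role": "user",
        "content": [{"type": "text", "text": f"<sandbox_output>{sandbox_result}</sandbox_output>"}],
    })
    # ...then rebuild `inputs` with apply_chat_template / process_vision_info and call
    # model.generate again, exactly as in the first turn.
```

In practice the sandbox should isolate execution (containers, timeouts, resource limits) and attach any image written by the code to the follow-up turn so the model can inspect it.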
## Citation

If you find Thyme useful in your research or applications, please cite our paper:

```bibtex
@misc{zhang2025thymethinkimages,
      title={Thyme: Think Beyond Images},
      author={Yi-Fan Zhang and Xingyu Lu and Shukang Yin and Chaoyou Fu and Wei Chen and Xiao Hu and Bin Wen and Kaiyu Jiang and Changyi Liu and Tianke Zhang and Haonan Fan and Kaibing Chen and Jiankang Chen and Haojie Ding and Kaiyu Tang and Zhang Zhang and Liang Wang and Fan Yang and Tingting Gao and Guorui Zhou},
      year={2025},
      eprint={2508.11630},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.11630},
}
```
## Related Projects
Explore other related work from our team:
- Kwai Keye-VL
- R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
- MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
- Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction