Add `transformers` library tag, model card title, and sample usage
This PR significantly improves the model card for the **Thyme: Think Beyond Images** model by:
- **Adding `library_name: transformers` to the metadata**: This tag enables the automated "how to use" widget on the Hugging Face Hub, providing users with a quick-start guide. Evidence for `transformers` compatibility comes from the `pip install "transformers<4.55"` dependency listed in the GitHub README's `Environment Setup` and from the `Qwen2_5_VLProcessor` and `Qwen2_5_VLForConditionalGeneration` classes indicated in the model's configuration files (see the sanity-check sketch below).
- **Adding a prominent model card title**: The title `# Thyme: Think Beyond Images` is added at the top for better readability and clear identification of the model.
- **Including a `Usage` section with a `transformers` code snippet**: This provides a practical, runnable example for users to quickly get started with inference using the `transformers` library, reinforcing the `library_name` tag.
- **Updating the BibTeX citation**: The citation in the "Citation" section has been updated to the more complete version found in the official GitHub repository, including all authors and relevant arXiv details.
The existing links to the project page, GitHub repository, and arXiv paper (under "Technique Report") remain unchanged, as per the guidelines regarding pre-existing arXiv links.
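
As a quick sanity check of the `transformers` compatibility evidence, the declared architecture can be read directly from the hosted config. A minimal sketch (the expected class name comes from the configuration-file evidence above and is an assumption, not re-verified here):

```python
# Minimal sketch: read the architecture declared in the model's config.
# Assumes the hosted config lists Qwen2_5_VLForConditionalGeneration,
# per the evidence cited above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Kwai-Keye/Thyme-RL")
print(config.architectures)  # expected: ['Qwen2_5_VLForConditionalGeneration']
```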
````diff
@@ -1,16 +1,20 @@
 ---
-license: mit
+base_model:
+- Qwen/Qwen2.5-VL-7B-Instruct
 datasets:
 - Kwai-Keye/Thyme-SFT
 - Kwai-Keye/Thyme-RL
 language:
 - en
+license: mit
 metrics:
 - accuracy
-base_model:
-- Qwen/Qwen2.5-VL-7B-Instruct
 pipeline_tag: image-text-to-text
+library_name: transformers
 ---
+
+# Thyme: Think Beyond Images
+
 <div align="center">
 <img src="https://cdn-uploads.huggingface.co/production/uploads/685ba798484e3233f5ff6f11/dxBp6TmwqwNuBuJR9gfQC.png" width="40%" alt="Thyme Logo">
 </div>
@@ -27,7 +31,7 @@ pipeline_tag: image-text-to-text
 </div></font>
 
 ## 🔥 News
-* **`2025.08.15`** 🌟 We are excited to introduce **Thyme: Think Beyond Images**. Thyme transcends traditional ``thinking with images'' paradigms by autonomously generating and executing diverse image processing and computational operations through executable code, significantly enhancing performance on high-resolution perception and complex reasoning tasks. Leveraging a novel two-stage training strategy that combines supervised fine-tuning with reinforcement learning and empowered by the innovative GRPO-ATS algorithm, Thyme achieves a sophisticated balance between reasoning exploration
+* **`2025.08.15`** 🌟 We are excited to introduce **Thyme: Think Beyond Images**. Thyme transcends traditional ``thinking with images'' paradigms by autonomously generating and executing diverse image processing and computational operations through executable code, significantly enhancing performance on high-resolution perception and complex reasoning tasks. Leveraging a novel two-stage training strategy that combines supervised fine-tuning with reinforcement learning and empowered by the innovative GRPO-ATS algorithm, Thyme achieves a sophisticated balance between reasoning exploration and code execution precision.
 
 <div align="center">
 <img src="https://cdn-uploads.huggingface.co/production/uploads/685ba798484e3233f5ff6f11/c_D7uX3RT1WUANDRB70ZC.png" width="100%" alt="Thyme Logo">
@@ -35,15 +39,109 @@ pipeline_tag: image-text-to-text
 
 We have provided the usage instructions, training code, and evaluation code in the [GitHub repo](https://github.com/yfzhang114/Thyme).
 
+## Usage
+
+Here's how to use the `Thyme` model for inference with the 🤗 Transformers library:
+
+```python
+from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+from qwen_vl_utils import process_vision_info
+
+# Default: Load the model on the available device(s)
+model_id = "Kwai-Keye/Thyme-RL"  # Or Kwai-Keye/Thyme-SFT
+model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    model_id, torch_dtype="auto", device_map="auto"
+)
+processor = AutoProcessor.from_pretrained(model_id)
+
+# Example: Using an image from the GitHub repo's usage example.
+# For actual inference, replace with your local image path or a URL, e.g.
+# image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/object_detection_example.png"
+image_path = "https://huggingface.co/Kwai-Keye/Thyme-RL/resolve/main/17127.jpg"  # Example image from the paper's GitHub
+
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "image",
+                "image": image_path,
+            },
+            {
+                "type": "text",
+                "text": """Question: What is the plate number of the blue car in the picture?
+Options:
+A. S OT 911
+B. S TQ 119
+C. S QT 911
+D. B QT 119
+E. This image doesn't feature the plate number.
+Please select the correct answer from the options above.""",
+            },
+        ],
+    }
+]
+
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+inputs = inputs.to(model.device)
+
+# Inference: Generation of the output
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+
+# Decode only the newly generated tokens
+input_token_len = inputs.input_ids.shape[1]
+generated_ids_trimmed = generated_ids[:, input_token_len:]
+
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
+)[0]  # Get the first (and only) output string
+
+print(output_text)
+# Expected output (simplified, actual output may vary slightly due to model variability and sandbox details):
+# <think>To determine the plate number of the blue car in the image, we need to focus on the license plate located near the bottom front of the vehicle. ...
+# <code>
+# ```python
+# import cv2
+# ...
+# print(processed_path)
+# ```
+# </code>
+# <sandbox_output>
+# ...
+# </sandbox_output>
+# Upon examining the cropped and zoomed-in image of the license plate, it becomes clear that the characters are "S QT 911". This matches option C. Therefore, the correct answer is C. S QT 911.</think>
+# <answer> **C. S QT 911** </answer>
+```
+
 ## Citation
 
 If you find Thyme useful in your research or applications, please cite our paper:
 
 ```bibtex
-@
-
-
-
-
+@misc{zhang2025thymethinkimages,
+      title={Thyme: Think Beyond Images},
+      author={Yi-Fan Zhang and Xingyu Lu and Shukang Yin and Chaoyou Fu and Wei Chen and Xiao Hu and Bin Wen and Kaiyu Jiang and Changyi Liu and Tianke Zhang and Haonan Fan and Kaibing Chen and Jiankang Chen and Haojie Ding and Kaiyu Tang and Zhang Zhang and Liang Wang and Fan Yang and Tingting Gao and Guorui Zhou},
+      year={2025},
+      eprint={2508.11630},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2508.11630},
 }
-```
+```
+
+## Related Projects
+Explore other related work from our team:
+
+- [Kwai Keye-VL](https://github.com/Kwai-Keye/Keye)
+- [R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning](https://github.com/yfzhang114/r1_reward)
+- [MM-RLHF: The Next Step Forward in Multimodal LLM Alignment](https://mm-rlhf.github.io/)
+- [MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?](https://github.com/yfzhang114/MME-RealWorld)
+- [MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs](https://arxiv.org/abs/2411.15296)
+- [Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models](https://github.com/yfzhang114/SliME)
+- [VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction](https://github.com/VITA-MLLM/VITA)
````