Add `transformers` library tag, model card title, and sample usage

#1
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +108 -10
README.md CHANGED
@@ -1,16 +1,20 @@
  ---
- license: mit
+ base_model:
+ - Qwen/Qwen2.5-VL-7B-Instruct
  datasets:
  - Kwai-Keye/Thyme-SFT
  - Kwai-Keye/Thyme-RL
  language:
  - en
+ license: mit
  metrics:
  - accuracy
- base_model:
- - Qwen/Qwen2.5-VL-7B-Instruct
  pipeline_tag: image-text-to-text
+ library_name: transformers
  ---
+
+ # Thyme: Think Beyond Images
+
  <div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/685ba798484e3233f5ff6f11/dxBp6TmwqwNuBuJR9gfQC.png" width="40%" alt="Thyme Logo">
  </div>
@@ -27,7 +31,7 @@
  </div></font>

  ## 🔥 News
- * **`2025.08.15`** 🌟 We are excited to introduce **Thyme: Think Beyond Images**. Thyme transcends traditional ``thinking with images'' paradigms by autonomously generating and executing diverse image processing and computational operations through executable code, significantly enhancing performance on high-resolution perception and complex reasoning tasks. Leveraging a novel two-stage training strategy that combines supervised fine-tuning with reinforcement learning and empowered by the innovative GRPO-ATS algorithm, Thyme achieves a sophisticated balance between reasoning exploration and code execution precision.
+ * **`2025.08.15`** 🌟 We are excited to introduce **Thyme: Think Beyond Images**. Thyme transcends the traditional "thinking with images" paradigm by autonomously generating and executing diverse image processing and computational operations through executable code, significantly enhancing performance on high-resolution perception and complex reasoning tasks. Leveraging a novel two-stage training strategy that combines supervised fine-tuning with reinforcement learning, and empowered by the innovative GRPO-ATS algorithm, Thyme achieves a sophisticated balance between reasoning exploration and code execution precision.

  <div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/685ba798484e3233f5ff6f11/c_D7uX3RT1WUANDRB70ZC.png" width="100%" alt="Thyme Logo">
@@ -35,15 +39,109 @@

  We have provided the usage instructions, training code, and evaluation code in the [GitHub repo](https://github.com/yfzhang114/Thyme).

+ ## Usage
+
+ Here's how to use the `Thyme` model for inference with the 🤗 Transformers library:
+
+ ```python
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Thyme is built on Qwen2.5-VL-7B-Instruct, so it loads with the Qwen2.5-VL model class.
+ # device_map="auto" spreads the model across the available device(s).
+ model_id = "Kwai-Keye/Thyme-RL"  # or "Kwai-Keye/Thyme-SFT"
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_id, torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Example image from this repository; for your own inference, replace it with a local
+ # image path or another URL.
+ image_path = "https://huggingface.co/Kwai-Keye/Thyme-RL/resolve/main/17127.jpg"
+
+ question = (
+     "Question: What is the plate number of the blue car in the picture?\n"
+     "Options:\n"
+     "A. S OT 911\n"
+     "B. S TQ 119\n"
+     "C. S QT 911\n"
+     "D. B QT 119\n"
+     "E. This image doesn't feature the plate number.\n"
+     "Please select the correct answer from the options above."
+ )
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image_path},
+             {"type": "text", "text": question},
+         ],
+     }
+ ]
+
+ # Preparation for inference
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs = inputs.to(model.device)
+
+ # Inference: Thyme's reasoning traces (thinking plus generated code) can be long,
+ # so allow a generous number of new tokens.
+ generated_ids = model.generate(**inputs, max_new_tokens=2048)
+
+ # Decode only the newly generated tokens, keeping Thyme's special tags
+ input_token_len = inputs["input_ids"].shape[1]
+ generated_ids_trimmed = generated_ids[:, input_token_len:]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
+ )[0]  # first (and only) output string
+
+ print(output_text)
+ # Note: a single `generate` call only returns the model's text, including any code it
+ # proposes; the sandboxed code execution used by the full Thyme pipeline is handled by
+ # the inference scripts in the GitHub repo.
+ #
+ # Expected output (simplified; actual output may vary due to model variability and sandbox details):
+ # <think>To determine the plate number of the blue car in the image, we need to focus on the license plate located near the bottom front of the vehicle. ...
+ # <code>
+ # ```python
+ # import cv2
+ # ...
+ # print(processed_path)
+ # ```
+ # </code>
+ # <sandbox_output> ... </sandbox_output>
+ # Upon examining the cropped and zoomed-in image of the license plate, it becomes clear that the characters are "S QT 911". This matches option C. Therefore, the correct answer is C. S QT 911.</think>
+ # <answer> **C. S QT 911** </answer>
+ ```
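+
+ If you only need the final choice, you can post-process `output_text` from the snippet above: as in the expected output, Thyme wraps its conclusion in `<answer>...</answer>` tags. Below is a minimal parsing sketch; the exact tag layout is an assumption based on that sample and may vary across checkpoints.
+
+ ```python
+ import re
+
+ def extract_answer(generation: str) -> str:
+     """Return the text inside the last <answer>...</answer> block, or the full string if no tag is found."""
+     matches = re.findall(r"<answer>(.*?)</answer>", generation, flags=re.DOTALL)
+     return matches[-1].strip() if matches else generation.strip()
+
+ print(extract_answer(output_text))  # e.g. "**C. S QT 911**"
+ ```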
+
  ## Citation

  If you find Thyme useful in your research or applications, please cite our paper:

  ```bibtex
- @article{zhang2025thyme,
-   title={Thyme: Think Beyond Images},
-   author={Kwai Keye},
-   journal={arXiv preprint},
-   year={2025}
+ @misc{zhang2025thymethinkimages,
+   title={Thyme: Think Beyond Images},
+   author={Yi-Fan Zhang and Xingyu Lu and Shukang Yin and Chaoyou Fu and Wei Chen and Xiao Hu and Bin Wen and Kaiyu Jiang and Changyi Liu and Tianke Zhang and Haonan Fan and Kaibing Chen and Jiankang Chen and Haojie Ding and Kaiyu Tang and Zhang Zhang and Liang Wang and Fan Yang and Tingting Gao and Guorui Zhou},
+   year={2025},
+   eprint={2508.11630},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV},
+   url={https://arxiv.org/abs/2508.11630},
  }
- ```
+ ```
+
+ ## Related Projects
+ Explore other related work from our team:
+
+ - [Kwai Keye-VL](https://github.com/Kwai-Keye/Keye)
+ - [R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning](https://github.com/yfzhang114/r1_reward)
+ - [MM-RLHF: The Next Step Forward in Multimodal LLM Alignment](https://mm-rlhf.github.io/)
+ - [MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?](https://github.com/yfzhang114/MME-RealWorld)
+ - [MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs](https://arxiv.org/abs/2411.15296)
+ - [Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models](https://github.com/yfzhang114/SliME)
+ - [VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction](https://github.com/VITA-MLLM/VITA)