Add `transformers` library tag, model card title, and sample usage

#1
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +108 -10
README.md CHANGED
@@ -1,16 +1,20 @@
  ---
- license: mit
+ base_model:
+ - Qwen/Qwen2.5-VL-7B-Instruct
  datasets:
  - Kwai-Keye/Thyme-SFT
  - Kwai-Keye/Thyme-RL
  language:
  - en
+ license: mit
  metrics:
  - accuracy
- base_model:
- - Qwen/Qwen2.5-VL-7B-Instruct
  pipeline_tag: image-text-to-text
+ library_name: transformers
  ---
+
+ # Thyme: Think Beyond Images
+
  <div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/685ba798484e3233f5ff6f11/dxBp6TmwqwNuBuJR9gfQC.png" width="40%" alt="Thyme Logo">
  </div>
@@ -27,7 +31,7 @@
  </div></font>

  ## 🔥 News
- * **`2025.08.15`** 🌟 We are excited to introduce **Thyme: Think Beyond Images**. Thyme transcends traditional ``thinking with images'' paradigms by autonomously generating and executing diverse image processing and computational operations through executable code, significantly enhancing performance on high-resolution perception and complex reasoning tasks. Leveraging a novel two-stage training strategy that combines supervised fine-tuning with reinforcement learning and empowered by the innovative GRPO-ATS algorithm, Thyme achieves a sophisticated balance between reasoning exploration and code execution precision.
+ * **`2025.08.15`** 🌟 We are excited to introduce **Thyme: Think Beyond Images**. Thyme transcends the traditional "thinking with images" paradigm by autonomously generating and executing diverse image processing and computational operations through executable code, significantly enhancing performance on high-resolution perception and complex reasoning tasks. Leveraging a novel two-stage training strategy that combines supervised fine-tuning with reinforcement learning, and empowered by the innovative GRPO-ATS algorithm, Thyme achieves a sophisticated balance between reasoning exploration and code execution precision.

  <div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/685ba798484e3233f5ff6f11/c_D7uX3RT1WUANDRB70ZC.png" width="100%" alt="Thyme Logo">
@@ -35,15 +39,109 @@

  We have provided the usage instructions, training code, and evaluation code in the [GitHub repo](https://github.com/yfzhang114/Thyme).

+ ## Usage
+
+ Here's how to use the `Thyme` model for inference with the 🤗 Transformers library:
+
+ ```python
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Thyme is built on Qwen2.5-VL-7B-Instruct, so it loads with the Qwen2.5-VL model class.
+ # device_map="auto" spreads the model across the available device(s).
+ model_id = "Kwai-Keye/Thyme-RL"  # or "Kwai-Keye/Thyme-SFT"
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_id, torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Example image from this repository; for your own inference, replace it with a local
+ # image path or another URL.
+ image_path = "https://huggingface.co/Kwai-Keye/Thyme-RL/resolve/main/17127.jpg"
+
+ question = (
+     "Question: What is the plate number of the blue car in the picture?\n"
+     "Options:\n"
+     "A. S OT 911\n"
+     "B. S TQ 119\n"
+     "C. S QT 911\n"
+     "D. B QT 119\n"
+     "E. This image doesn't feature the plate number.\n"
+     "Please select the correct answer from the options above."
+ )
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image_path},
+             {"type": "text", "text": question},
+         ],
+     }
+ ]
+
+ # Preparation for inference
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs = inputs.to(model.device)
+
+ # Inference: Thyme's reasoning traces (thinking plus generated code) can be long,
+ # so allow a generous number of new tokens.
+ generated_ids = model.generate(**inputs, max_new_tokens=2048)
+
+ # Decode only the newly generated tokens, keeping Thyme's special tags
+ input_token_len = inputs["input_ids"].shape[1]
+ generated_ids_trimmed = generated_ids[:, input_token_len:]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
+ )[0]  # first (and only) output string
+
+ print(output_text)
+ # Note: a single `generate` call only returns the model's text, including any code it
+ # proposes; the sandboxed code execution used by the full Thyme pipeline is handled by
+ # the inference scripts in the GitHub repo.
+ #
+ # Expected output (simplified; actual output may vary due to model variability and sandbox details):
+ # <think>To determine the plate number of the blue car in the image, we need to focus on the license plate located near the bottom front of the vehicle. ...
+ # <code>
+ # ```python
+ # import cv2
+ # ...
+ # print(processed_path)
+ # ```
+ # </code>
+ # <sandbox_output> ... </sandbox_output>
+ # Upon examining the cropped and zoomed-in image of the license plate, it becomes clear that the characters are "S QT 911". This matches option C. Therefore, the correct answer is C. S QT 911.</think>
+ # <answer> **C. S QT 911** </answer>
+ ```
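+
+ If you only need the final choice, you can post-process `output_text` from the snippet above: as in the expected output, Thyme wraps its conclusion in `<answer>...</answer>` tags. Below is a minimal parsing sketch; the exact tag layout is an assumption based on that sample and may vary across checkpoints.
+
+ ```python
+ import re
+
+ def extract_answer(generation: str) -> str:
+     """Return the text inside the last <answer>...</answer> block, or the full string if no tag is found."""
+     matches = re.findall(r"<answer>(.*?)</answer>", generation, flags=re.DOTALL)
+     return matches[-1].strip() if matches else generation.strip()
+
+ print(extract_answer(output_text))  # e.g. "**C. S QT 911**"
+ ```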
+
  ## Citation

  If you find Thyme useful in your research or applications, please cite our paper:

  ```bibtex
- @article{zhang2025thyme,
-   title={Thyme: Think Beyond Images},
-   author={Kwai Keye},
-   journal={arXiv preprint},
-   year={2025}
+ @misc{zhang2025thymethinkimages,
+   title={Thyme: Think Beyond Images},
+   author={Yi-Fan Zhang and Xingyu Lu and Shukang Yin and Chaoyou Fu and Wei Chen and Xiao Hu and Bin Wen and Kaiyu Jiang and Changyi Liu and Tianke Zhang and Haonan Fan and Kaibing Chen and Jiankang Chen and Haojie Ding and Kaiyu Tang and Zhang Zhang and Liang Wang and Fan Yang and Tingting Gao and Guorui Zhou},
+   year={2025},
+   eprint={2508.11630},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV},
+   url={https://arxiv.org/abs/2508.11630},
  }
- ```
+ ```
+
+ ## Related Projects
+ Explore other related work from our team:
+
+ - [Kwai Keye-VL](https://github.com/Kwai-Keye/Keye)
+ - [R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning](https://github.com/yfzhang114/r1_reward)
+ - [MM-RLHF: The Next Step Forward in Multimodal LLM Alignment](https://mm-rlhf.github.io/)
+ - [MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?](https://github.com/yfzhang114/MME-RealWorld)
+ - [MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs](https://arxiv.org/abs/2411.15296)
+ - [Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models](https://github.com/yfzhang114/SliME)
+ - [VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction](https://github.com/VITA-MLLM/VITA)