Falconss1 and nielsr (HF Staff) committed
Commit 1f143f1 · verified · 1 parent: 4312b91

Add library_name and pipeline_tag to metadata (#1)


- Add library_name and pipeline_tag to metadata (1624117566abe6f9b727ca1f78ecc4de8f2e3af7)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1):
1. README.md (+22, -10)
README.md CHANGED
@@ -1,12 +1,6 @@
 ---
-language: en
-tags:
-- video-understanding
-- reasoning
-- multimodal
-- reinforcement-learning
-- question-answering
-license: mit
+base_model:
+- Qwen/Qwen2.5-VL-7B-Instruct
 datasets:
 - CLEVRER
 - NExT-QA
@@ -15,10 +9,28 @@ datasets:
 - TempCompass
 - Video-MME
 - STAR
-base_model:
-- Qwen/Qwen2.5-VL-7B-Instruct
+language: en
+license: mit
+tags:
+- video-understanding
+- reasoning
+- multimodal
+- reinforcement-learning
+- question-answering
+library_name: transformers
+pipeline_tag: video-text-to-text
 ---
 
+# Paper title and link
+
+The model was presented in the paper [Reinforcing Video Reasoning with Focused Thinking](https://huggingface.co/papers/2505.24718).
+
+# Paper abstract
+
+The abstract of the paper is the following:
+
+Recent advancements in reinforcement learning, particularly through Group Relative Policy Optimization (GRPO), have significantly improved multimodal large language models for complex reasoning tasks. However, two critical limitations persist: 1) they often produce unfocused, verbose reasoning chains that obscure salient spatiotemporal cues, and 2) binary rewarding fails to account for partially correct answers, resulting in high reward variance and inefficient learning. In this paper, we propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density (estimated by intra-group information entropy), suppressing redundant tokens like generic reasoning prefixes. Furthermore, we reformulate RL training by shifting from single-choice to multi-choice QA tasks, where soft rewards enable finer-grained gradient estimation by distinguishing partial correctness. Additionally, we propose question-answer inversion, a data augmentation strategy to generate diverse multi-choice samples from existing benchmarks. Experiments demonstrate state-of-the-art performance on several video reasoning and general understanding benchmarks. Notably, TW-GRPO achieves 50.4% accuracy on CLEVRER (18.8% improvement over Video-R1) and 65.8% on MMVU. Our code is available at https://github.com/longmalongma/TW-GRPO.
+
 This repository contains the model as presented in "Reinforcing Video Reasoning with Focused Thinking".
 
 For training and evaluation, please refer to the Code: https://github.com/longmalongma/TW-GRPO
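
Since the updated metadata declares `library_name: transformers` with base model Qwen/Qwen2.5-VL-7B-Instruct, the checkpoint should load through the standard Qwen2.5-VL classes. A minimal sketch, assuming the inference flow documented for Qwen2.5-VL; the repo id below is a placeholder for this model's repository, and `qwen_vl_utils` is the helper package from the Qwen2.5-VL model card, neither of which is part of this commit:

```python
# Minimal sketch of loading the checkpoint via transformers, as implied by the
# new metadata (library_name: transformers, base_model Qwen/Qwen2.5-VL-7B-Instruct).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper documented for Qwen2.5-VL

model_id = "<namespace>/<model>"  # placeholder -- substitute this card's repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One video QA turn in the Qwen2.5-VL chat format, matching the pipeline_tag
# video-text-to-text: video plus question in, answer text out.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4"},
        {"type": "text", "text": "Which object collides with the cube first?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```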
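
The abstract's shift from binary to soft rewards on multi-choice QA can be pictured with a toy scorer. This is a hypothetical Jaccard-style sketch of partial credit, not the reward actually used by TW-GRPO; see the paper and the code repository for the real formulation:

```python
# Hypothetical illustration of a partial-credit ("soft") reward for multi-choice
# QA, in contrast to a binary 0/1 reward. The actual TW-GRPO reward may differ.
def soft_multichoice_reward(predicted: set[str], gold: set[str]) -> float:
    """Jaccard overlap between the predicted and gold option sets."""
    if not predicted and not gold:
        return 1.0
    if not predicted or not gold:
        return 0.0
    return len(predicted & gold) / len(predicted | gold)

# A partially correct answer earns partial credit instead of zero:
print(soft_multichoice_reward({"A", "B"}, {"A", "C"}))  # 1/3 ~= 0.33
print(soft_multichoice_reward({"A", "C"}, {"A", "C"}))  # 1.0
```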