Alexandre-Numind committed on
Commit 92ca68a · verified · 1 Parent(s): 53d419b

Update README.md

Files changed (1):
  1. README.md +105 -58
README.md CHANGED
@@ -1,66 +1,58 @@
 ---
 license: mit
 base_model: Qwen/Qwen2.5-VL-7B
- tags:
- - vision-language
- - document-to-markdown
- - reinforcement-learning
- - grpo
- - qwen2.5
- - markdown
 model_name: NuMarkdown-Qwen2.5-VL
- datasets:
- - NM-dev/markdown-input_output-v3
- - NM-dev/markdown-grpo-images3
- library_name: transformers
- pipeline_tag: text-generation
 ---
 
- # NuMarkdown-Qwen2.5-VL 🖋️📄 📝
 
- **NuMarkdown-Qwen2.5-VL** is the first reasoning vision-language model trained to converts documents into clean GitHub-flavoured Markdown.
- It is a fine-tune of **Qwen 2.5-VL-7B** using ~10 k synthetic doc-to-Markdown pairs, followed by a RL phase (GRPO) with a layout-centric reward.
 
- *(note: the number of thinking tokens can vary from 20% to 2X the number of token of the final answers)*
 
 ---
- ## Results
 
- (we plan to realease a markdown arena -similar to llmArena- for complex document to markdown task)
 
- ### Arena ranking (using trueskill-2 ranking system)
-
- | Rank | Model | μ | σ | μ |
- | ---- | --------------------------------------- | ----- | ---- | ------ |
- | 🥇 1 | **gemini-flash-reasoning** | 26.75 | 0.80 | 24.35 |
- | 🥈 2 | **NuMarkdown-reasoning** | 26.10 | 0.79 | 23.72 |
- | 🥉 3 | **NuMarkdown-reasoning-w/o\_reasoning** | 25.32 | 0.80 | 22.93 |
- | 4 | **OCRFlux-3B** | 24.63 | 0.80 | 22.22 |
- | 5 | **gpt-4o** | 24.48 | 0.80 | 22.08 |
- | 6 | **gemini-flash-w/o\_reasoning** | 24.11 | 0.79 | 21.74 |
- | 7 | **RolmoOCR** | 23.53 | 0.82 | 21.07 |
 
- ### Win-rate of our model against others models:
 
- <img src="bar plot.png" width="500"/>
 
- ### Matrix Win-rate:
 
- <img src="matrix.png" width="500"/>
 
 ---
 
- ## Training
-
- 1. **SFT**: One-epoch supervised fine-tune on synthetic reasoning trace generated from public PDFs (10K input/output pairs).
- 2. **RL (GRPO)**: RL pahse using a structure-aware reward (5K difficults image examples).
 
- *No proprietary data or prompts were used; see the [Datasets](#datasets) section for public sources only.*
 
- ## Quick start: 🤗 Transformers
 
 ```python
 from __future__ import annotations
@@ -69,11 +61,11 @@ import torch
 from PIL import Image
 from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
 
- model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"
 
 processor = AutoProcessor.from_pretrained(
     model_id,
-     trust_remote_code=True,
 )
 
 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
@@ -85,37 +77,92 @@ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
 )
 
 img = Image.open("invoice_scan.png").convert("RGB")
- messages = [{
-     "role": "user",
-     "content": [
-         {"type": "image"},
-     ],
- }]
- prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- enc = processor(text=prompt, images=[img], return_tensors="pt").to(model.device)
 
- with torch.no_grad():
-     out = model.generate(**enc, max_new_tokens=5000)
 
- print(processor.decode(out[0].split("<answer>")[1].split("</answer>")[0], skip_special_tokens=True))
 ```
 
- ## VLLM:
 ```python
 from PIL import Image
 from vllm import LLM, SamplingParameters
 from transformers import AutoProcessor
 
 model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"
 llm = LLM(model=model_id, trust_remote_code=True, dtype="bfloat16")
 proc = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
 
 img = Image.open("invoice_scan.png")
- prompt = proc(text="Convert this to Markdown with reasoning.", image=img,
-               return_tensors="np")  # numpy arrays for vLLM
 
- params = SamplingParameters(max_tokens=1024, temperature=0.8, top_p=0.95)
- result = llm.generate([{"prompt": prompt}], params)[0].outputs[0].text.split("<answer>")[1].split("</answer>")[0]
 print(result)
- ```
 
 ---
+
 license: mit
 base_model: Qwen/Qwen2.5-VL-7B
 model_name: NuMarkdown-Qwen2.5-VL
+
 ---
 
+ # NuMarkdown-Qwen2.5-VL 🖋️📄 📝
 
+ **NuMarkdown-Qwen2.5-VL** is the **first reasoning vision‑language model** that converts semi‑structured **documents and PDF scans into clean GitHub‑flavoured Markdown**, with layout preserved and an optional chain‑of‑thought explaining each step.
 
+ > *“From messy scans to tidy `.md` in one shot.”*
 
 ---
 
 
+ ## Overview
 
+ * **Architecture:** fine‑tune of [Qwen 2.5‑VL‑7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B).
+ * **Training data:** 10 k synthetic doc‑to‑Markdown pairs + 5 k challenging images.
+ * **Reasoning tokens:** at inference the model emits roughly 20 %–2 × as many thinking tokens as final‑answer tokens.
+ * **License:** MIT, free for commercial & research use.
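The thinking-token ratio above has a practical consequence for decoding budgets. As a rough sketch (the `generation_budget` helper and its 20 % safety slack are illustrative assumptions, not part of the model card):

```python
def generation_budget(expected_answer_tokens: int, safety: float = 1.2) -> int:
    """Worst-case token budget from the card's stated range: the model may
    think up to 2x the answer length, so reserve answer + 2x answer tokens,
    plus a small safety margin. Illustrative helper, not an official API."""
    worst_case_thinking = 2 * expected_answer_tokens
    return int((expected_answer_tokens + worst_case_thinking) * safety)

budget = generation_budget(1000)  # budget for a ~1000-token page of Markdown
```

Setting `max_new_tokens` well below such a budget risks truncating the answer mid-table.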
 
 
 
 
 
 
 
 
+ ---
 
+ ## Results
 
+ ### 🏆 Arena ranking — *TrueSkill‑2 (μ − 3σ)*
 
+ | Rank | Model | μ | σ | μ − 3σ |
+ | ---- | -------------------------------------- | ----- | ---- | ------ |
+ | 🥇 1 | **gemini‑flash‑reasoning** | 26.75 | 0.80 | 24.35 |
+ | 🥈 2 | **NuMarkdown‑reasoning** | 26.10 | 0.79 | 23.72 |
+ | 🥉 3 | **NuMarkdown‑reasoning‑w/o reasoning** | 25.32 | 0.80 | 22.93 |
+ | 4 | **OCRFlux‑3B** | 24.63 | 0.80 | 22.22 |
+ | 5 | **gpt‑4o** | 24.48 | 0.80 | 22.08 |
+ | 6 | **gemini‑flash‑w/o reasoning** | 24.11 | 0.79 | 21.74 |
+ | 7 | **RolmoOCR** | 23.53 | 0.82 | 21.07 |
 
+ ### Win‑rate plots
 
+ | | |
+ | :----------------------------------------------: | :---------------------------------------: |
+ | ![Bar‑plot of pairwise win‑rate](bar_plot.png) | ![Matrix win‑rate heat‑map](matrix.png) |
 
 ---
 
+ ## Training procedure
 
+ 1. **Supervised fine‑tuning (SFT)** – one epoch on 10 k synthetic pairs generated from public PDFs.
+ 2. **Reinforcement Learning (GRPO)** – 5 k difficult images with a **structure‑aware** reward focusing on layout fidelity.
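The card does not publish the exact reward function; purely to illustrate what a structure-aware reward can look like, here is a toy version (all names hypothetical) that compares only layout tokens such as headings, table pipes, and list markers:

```python
import difflib
import re

def layout_reward(pred_md: str, ref_md: str) -> float:
    """Toy structure-aware reward in [0, 1]: reduce each Markdown line to a
    layout token (heading, table row, list item, or plain text) and score
    the similarity of the two token sequences. Illustrative only."""
    def skeleton(md: str) -> list[str]:
        tokens = []
        for line in md.splitlines():
            m = re.match(r"\s*(#{1,6}|\||[-*+] |\d+\. )", line)
            tokens.append(m.group(1) if m else "txt")
        return tokens
    return difflib.SequenceMatcher(None, skeleton(pred_md), skeleton(ref_md)).ratio()
```

A reward like this ignores the cell text entirely, so it pushes the policy toward reproducing the document's layout rather than paraphrasing its content.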
 
+ ---
 
+ ## Quick start — 🤗 Transformers
 
 ```python
 from __future__ import annotations
 
 import torch
 from PIL import Image
 from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
 
+ model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"
 
 processor = AutoProcessor.from_pretrained(
     model_id,
+     trust_remote_code=True,
 )
 
 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
     model_id,
 )
 
 img = Image.open("invoice_scan.png").convert("RGB")
+ messages = [
+     {
+         "role": "user",
+         "content": [{"type": "image"}],
+     }
+ ]
+
+ prompt = processor.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
 
+ inputs = processor(
+     text=prompt,
+     images=[img],
+     return_tensors="pt",
+ ).to(model.device)
 
+ with torch.no_grad():
+     outputs = model.generate(**inputs, max_new_tokens=5_000)
+
+ # Decode first, then cut the Markdown out of the <answer> ... </answer> block.
+ raw = processor.decode(outputs[0], skip_special_tokens=True)
+ print(raw.split("<answer>")[1].split("</answer>")[0])
 ```
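The trailing `split` chain assumes the model always emits both tags; a slightly more defensive extraction (the `extract_answer` helper is a hypothetical add-on, not part of the model card):

```python
import re

def extract_answer(decoded: str) -> str:
    """Pull the final Markdown out of an <answer>...</answer> block.

    Falls back to the whole decoded string when the model emitted no tags.
    Hypothetical helper; the tag names follow the README's convention.
    """
    match = re.search(r"<answer>(.*?)</answer>", decoded, re.DOTALL)
    return match.group(1).strip() if match else decoded.strip()
```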
 
+ ---
+
+ ## Quick start — vLLM
 
 ```python
 from PIL import Image
 from vllm import LLM, SamplingParams
 from transformers import AutoProcessor
 
 model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"
+
 llm = LLM(model=model_id, trust_remote_code=True, dtype="bfloat16")
 proc = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
 
 img = Image.open("invoice_scan.png")
 
+ # Build a chat-template prompt with an image placeholder,
+ # then hand vLLM the raw image via multi_modal_data.
+ prompt = proc.apply_chat_template(
+     [{"role": "user", "content": [{"type": "image"}]}],
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+
+ params = SamplingParams(
+     max_tokens=1_024,
+     temperature=0.8,
+     top_p=0.95,
+ )
+
+ result = (
+     llm.generate([{"prompt": prompt, "multi_modal_data": {"image": img}}], params)[0]
+     .outputs[0]
+     .text.split("<answer>")[1]
+     .split("</answer>")[0]
+ )
+
 print(result)
+ ```
149
+
150
+ ---
151
+
152
+ ## Citation
153
+
154
+ If you use **NuMarkdown‑Qwen2.5‑VL** in your research, please cite the model:
155
+
156
+ ```bibtex
157
+ @software{NuMarkdown-Qwen2.5-VL,
158
+ title = {NuMarkdown-Qwen2.5-VL: Vision-language reasoning model for doc-to-Markdown},
159
+ author = {NM-dev},
160
+ year = 2025,
161
+ url = {https://huggingface.co/NM-dev/NuMarkdown-Qwen2.5-VL},
162
+ license = {MIT}
163
+ }
164
+ ```
165
+
166
+ ---
167
+
168
+ *Last updated: 2025‑08‑04*