Update README.md
README.md CHANGED

@@ -21,21 +21,34 @@ pipeline_tag: text-generation
**NuMarkdown-Qwen2.5-VL** is the first reasoning vision-language model trained to convert documents into clean GitHub-flavoured Markdown.
It is a lightweight fine-tune of **Qwen 2.5-VL-7B** on ~10k synthetic doc-to-Markdown pairs, followed by an RL phase (GRPO) with a layout-centric reward.

(Note: the number of thinking tokens can vary from 20% to 2x the number of tokens in the final answer.)
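Because the model reasons before answering, downstream code usually keeps only the final Markdown. A minimal sketch of that post-processing, assuming the output wraps the final Markdown in `<answer>…</answer>` tags as the usage snippets do (`extract_answer` and the `<think>` tag in the example string are illustrative, not part of the released API):

```python
def extract_answer(output: str) -> str:
    """Return the final Markdown from a raw generation.

    Assumes the model emits its reasoning first and wraps the final
    Markdown in <answer>...</answer> tags; falls back to the raw text
    if the tags are missing (e.g. the generation was truncated).
    """
    start, end = "<answer>", "</answer>"
    if start in output and end in output:
        return output.split(start, 1)[1].split(end, 1)[0].strip()
    return output.strip()

raw = "<think>the table has two columns</think><answer>| a | b |</answer>"
print(extract_answer(raw))  # -> | a | b |
```

Falling back to the raw text keeps the caller from crashing when a long reasoning trace eats the whole token budget before the `<answer>` tag appears.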
---
## Results

(We plan to release a Markdown arena, similar to LMArena, for comparing models on complex table-to-Markdown conversion.)
### Arena ranking (using the TrueSkill 2 ranking system)

| Rank | Model | μ | σ | μ − 3σ |
| ---- | --------------------------------------- | ----- | ---- | ------ |
| 🥇 1 | **gemini-flash-reasoning** | 26.75 | 0.80 | 24.35 |
| 🥈 2 | **NuMarkdown-reasoning** | 26.10 | 0.79 | 23.72 |
| 🥉 3 | **NuMarkdown-reasoning-w/o\_reasoning** | 25.32 | 0.80 | 22.93 |
| 4 | **OCRFlux-3B** | 24.63 | 0.80 | 22.22 |
| 5 | **gpt-4o** | 24.48 | 0.80 | 22.08 |
| 6 | **gemini-flash-w/o\_reasoning** | 24.11 | 0.79 | 21.74 |
| 7 | **RolmoOCR** | 23.53 | 0.82 | 21.07 |
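The μ − 3σ column is the conservative TrueSkill estimate: models with uncertain ratings (large σ) are ranked pessimistically. A quick sketch reproducing the ranking from the table above (the `ratings` dict is just the table's values inlined, not an API):

```python
# Conservative TrueSkill-style ranking: score each model by mu - 3*sigma,
# so a model must be both strong (high mu) and well-measured (low sigma)
# to rank highly.
ratings = {
    "gemini-flash-reasoning":             (26.75, 0.80),
    "NuMarkdown-reasoning":               (26.10, 0.79),
    "NuMarkdown-reasoning-w/o_reasoning": (25.32, 0.80),
    "OCRFlux-3B":                         (24.63, 0.80),
    "gpt-4o":                             (24.48, 0.80),
    "gemini-flash-w/o_reasoning":         (24.11, 0.79),
    "RolmoOCR":                           (23.53, 0.82),
}

# Sort by the conservative estimate, highest first.
leaderboard = sorted(ratings, key=lambda m: ratings[m][0] - 3 * ratings[m][1], reverse=True)
for rank, model in enumerate(leaderboard, 1):
    mu, sigma = ratings[model]
    print(f"{rank}. {model}: {mu - 3 * sigma:.2f}")
```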
### Win rate of our model against other models:

<img src="bar plot.png" width="500"/>

### Win-rate matrix:

<img src="matrix.png" width="500"/>
---
@@ -82,9 +95,9 @@ prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_

```python
enc = processor(text=prompt, images=[img], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**enc, max_new_tokens=5000)

# Decode the full sequence first, then pull the final Markdown out of the <answer> tags.
result = processor.decode(out[0], skip_special_tokens=True)
print(result.split("<answer>")[1].split("</answer>")[0])
```
@@ -103,6 +116,6 @@ prompt = proc(text="Convert this to Markdown with reasoning.", image=img,

```python
              return_tensors="np")  # numpy arrays for vLLM

params = SamplingParams(max_tokens=1024, temperature=0.8, top_p=0.95)
result = llm.generate([{"prompt": prompt}], params)[0].outputs[0].text.split("<answer>")[1].split("</answer>")[0]
print(result)
```