Update README.md

README.md +105 -58

@@ -1,66 +1,58 @@
 ---
 license: mit
 base_model: Qwen/Qwen2.5-VL-7B
-tags:
-  - vision-language
-  - document-to-markdown
-  - reinforcement-learning
-  - grpo
-  - qwen2.5
-  - markdown
 model_name: NuMarkdown-Qwen2.5-VL
-datasets:
-  - NM-dev/markdown-input_output-v3
-  - NM-dev/markdown-grpo-images3
-library_name: transformers
-pipeline_tag: text-generation
 ---
-# NuMarkdown-Qwen2.5-VL 🖋️📄 → 📝
-**NuMarkdown-Qwen2.5-VL** is the first reasoning vision-language model trained to convert documents into clean GitHub-flavoured Markdown.
-It is a fine-tune of **Qwen 2.5-VL-7B** using ~10 k synthetic doc-to-Markdown pairs, followed by an RL phase (GRPO) with a layout-centric reward.
-*(note: the number of thinking tokens can vary from 20% to 2× the number of tokens in the final answer)*
 ---
-## Results
-(we plan to release a Markdown arena, similar to llmArena, for complex document-to-Markdown tasks)
-### Arena ranking (using trueskill-2 ranking system)
-| Rank | Model                                   | μ     | σ    | μ − 3σ |
-| ---- | --------------------------------------- | ----- | ---- | ------ |
-| 🥇 1 | **gemini-flash-reasoning**              | 26.75 | 0.80 | 24.35  |
-| 🥈 2 | **NuMarkdown-reasoning**                | 26.10 | 0.79 | 23.72  |
-| 🥉 3 | **NuMarkdown-reasoning-w/o\_reasoning** | 25.32 | 0.80 | 22.93  |
-| 4    | **OCRFlux-3B**                          | 24.63 | 0.80 | 22.22  |
-| 5    | **gpt-4o**                              | 24.48 | 0.80 | 22.08  |
-| 6    | **gemini-flash-w/o\_reasoning**         | 24.11 | 0.79 | 21.74  |
-| 7    | **RolmoOCR**                            | 23.53 | 0.82 | 21.07  |
-### Win-rate of our model against other models:
-<img src="bar plot.png" width="500"/>
-### Matrix Win-rate:
-<img src="matrix.png" width="500"/>
 ---
-## Training
-1. **SFT**: One-epoch supervised fine-tune on synthetic reasoning traces generated from public PDFs (10K input/output pairs).
-2. **RL (GRPO)**: RL phase using a structure-aware reward (5K difficult image examples).
-*No proprietary data or prompts were used; see the [Datasets](#datasets) section for public sources only.*
-## Quick start: 🤗 Transformers
 ```python
 from __future__ import annotations
@@ -69,11 +61,11 @@ import torch
 from PIL import Image
 from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
-model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"
 processor = AutoProcessor.from_pretrained(
     model_id,
-    trust_remote_code=True,
 )
 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
@@ -85,37 +77,92 @@ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
 )
 img = Image.open("invoice_scan.png").convert("RGB")
-messages = [{
-    "role": "user",
-    "content": [
-        {"type": "image"},
-    ],
-}]
-prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-enc = processor(text=prompt, images=[img], return_tensors="pt").to(model.device)
-with torch.no_grad():
-    out = model.generate(**enc, max_new_tokens=5000)
-text = processor.decode(out[0], skip_special_tokens=True)  # decode token IDs before any string handling
-print(text.split("<answer>")[1].split("</answer>")[0])
 ```
-## VLLM:
 ```python
 from PIL import Image
 from vllm import LLM, SamplingParams
 from transformers import AutoProcessor
 model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"
 llm  = LLM(model=model_id, trust_remote_code=True, dtype="bfloat16")
 proc = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
 img = Image.open("invoice_scan.png")
-messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Convert this to Markdown with reasoning."}]}]
-prompt = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)  # text prompt; the image is passed separately
-params = SamplingParams(max_tokens=1024, temperature=0.8, top_p=0.95)
-result = llm.generate([{"prompt": prompt, "multi_modal_data": {"image": img}}], params)[0].outputs[0].text.split("<answer>")[1].split("</answer>")[0]
 print(result)
-```

 ---
 license: mit
 base_model: Qwen/Qwen2.5-VL-7B
 model_name: NuMarkdown-Qwen2.5-VL
 ---
+# NuMarkdown‑Qwen2.5‑VL 🖋️📄 → 📝
+**NuMarkdown‑Qwen2.5‑VL** is the **first reasoning vision‑language model** that converts semi‑structured **documents and PDF scans into clean GitHub‑flavoured Markdown**, with layout preserved and an optional chain‑of‑thought explaining each step.
+> *“From messy scans to tidy `.md` in one shot.”*
 ---
+## Overview
+* **Architecture:** fine‑tune of [Qwen 2.5‑VL‑7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B).
+* **Training data:** 10 k synthetic doc‑to‑Markdown pairs + 5 k challenging images.
+* **Reasoning tokens:** at inference, the reasoning trace runs from roughly 20 % to 2 × the length of the final answer.
+* **License:** MIT – free for commercial & research use.
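The reasoning-token ratio above has a practical consequence for the generation limit: the total output is the answer plus a reasoning trace of roughly 20 % to 2 × its length. The sketch below turns that into a rough token budget. The function name `generation_budget` is hypothetical, and the bounds are only as reliable as the approximate ratio stated in the overview.

```python
def generation_budget(answer_tokens: int) -> tuple[int, int]:
    """Estimate (low, high) total output tokens for a given answer length.

    Assumes the reasoning trace is between 20 % and 2x the answer length,
    as stated in the overview; real outputs may fall outside this range.
    """
    if answer_tokens < 0:
        raise ValueError("answer_tokens must be non-negative")
    # Integer ceiling of answer_tokens / 5 (the 20 % lower bound on reasoning).
    low_reasoning = -(-answer_tokens // 5)
    high_reasoning = 2 * answer_tokens
    return answer_tokens + low_reasoning, answer_tokens + high_reasoning


low, high = generation_budget(1000)
print(f"expect between {low} and {high} generated tokens")
```

Under this estimate, a page whose Markdown needs about 1,000 tokens could require up to about 3,000 generated tokens, so a limit below that may cut the output off before the final answer is complete.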
+---
+## Results
+### 🏆 Arena ranking — *Trueskill‑2 (μ − 3σ)*
+| Rank | Model                                  | μ     | σ    | μ − 3σ |
+| ---- | -------------------------------------- | ----- | ---- | ------ |
+| 🥇 1 | **gemini‑flash‑reasoning**             | 26.75 | 0.80 | 24.35  |
+| 🥈 2 | **NuMarkdown‑reasoning**               | 26.10 | 0.79 | 23.72  |
+| 🥉 3 | **NuMarkdown‑reasoning‑w/o reasoning** | 25.32 | 0.80 | 22.93  |
+| 4    | **OCRFlux‑3B**                         | 24.63 | 0.80 | 22.22  |
+| 5    | **gpt‑4o**                             | 24.48 | 0.80 | 22.08  |
+| 6    | **gemini‑flash‑w/o reasoning**         | 24.11 | 0.79 | 21.74  |
+| 7    | **RolmoOCR**                           | 23.53 | 0.82 | 21.07  |
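The last column of the table is a conservative score: the mean rating μ minus three standard deviations σ. As a minimal sketch, the ranking can be recomputed from the rounded μ and σ values shown above. Results can differ from the published column by about 0.01, presumably because the table was computed from unrounded ratings (an assumption). The name `conservative_score` is illustrative only.

```python
# Rounded (mu, sigma) pairs copied from the arena table.
ratings = {
    "gemini-flash-reasoning": (26.75, 0.80),
    "NuMarkdown-reasoning": (26.10, 0.79),
    "NuMarkdown-reasoning-w/o reasoning": (25.32, 0.80),
    "OCRFlux-3B": (24.63, 0.80),
    "gpt-4o": (24.48, 0.80),
    "gemini-flash-w/o reasoning": (24.11, 0.79),
    "RolmoOCR": (23.53, 0.82),
}


def conservative_score(mu: float, sigma: float, k: float = 3.0) -> float:
    """Lower bound on skill: mean rating minus k standard deviations."""
    return mu - k * sigma


# Sort models from best to worst by the conservative score.
leaderboard = sorted(
    ratings,
    key=lambda name: conservative_score(*ratings[name]),
    reverse=True,
)

for rank, name in enumerate(leaderboard, start=1):
    mu, sigma = ratings[name]
    print(f"{rank}. {name}: {conservative_score(mu, sigma):.2f}")
```

Ranking by μ − 3σ rather than μ alone penalises models whose rating is still uncertain, which is why the column is used as the sort key.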
+### Win‑rate plots
+|                                                  |                                           |
+| :----------------------------------------------: | :---------------------------------------: |
+| ![Bar‑plot of pairwise win‑rate](bar_plot.png) | ![Matrix win‑rate heat‑map](matrix.png) |
 ---
+## Training procedure
+1. **Supervised fine‑tuning (SFT)** – one epoch on 10 k synthetic pairs generated from public PDFs.
+2. **Reinforcement Learning (GRPO)** – 5 k difficult images with a **structure‑aware** reward focusing on layout fidelity.
+---
+## Quick start — 🤗 Transformers
 ```python
 from __future__ import annotations
 from PIL import Image
 from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"
 processor = AutoProcessor.from_pretrained(
     model_id,
+    trust_remote_code=True,
 )
 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
 )
 img = Image.open("invoice_scan.png").convert("RGB")
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}],
+    }
+]
+prompt = processor.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+)
+inputs = processor(
+    text=prompt,
+    images=[img],
+    return_tensors="pt",
+).to(model.device)
+with torch.no_grad():
+    outputs = model.generate(**inputs, max_new_tokens=5_000)
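+# Decode the generated token IDs first; the tensor itself has no .split().
+text = processor.decode(outputs[0], skip_special_tokens=True)
+# Keep only the final answer between the <answer> tags.
+print(text.split("<answer>")[1].split("</answer>")[0])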
 ```
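The quick start pulls the final answer out of the decoded text by splitting on the `<answer>` tags, which raises an `IndexError` if the model never emits the opening tag (for example when generation is cut off during reasoning). A more tolerant helper might look like the sketch below. The name `extract_answer` is hypothetical, and the only format assumption is the `<answer>…</answer>` wrapper used in the snippet above.

```python
from __future__ import annotations


def extract_answer(text: str) -> str | None:
    """Return the Markdown inside <answer>...</answer>.

    Returns None if no answer was started. If the closing tag is missing
    (truncated generation), returns whatever follows the opening tag.
    """
    open_tag, close_tag = "<answer>", "</answer>"
    start = text.find(open_tag)
    if start == -1:
        return None
    body = text[start + len(open_tag):]
    end = body.find(close_tag)
    if end != -1:
        body = body[:end]
    return body.strip()


decoded = "Some reasoning about the layout.<answer>\n# Invoice\n</answer>"
print(extract_answer(decoded))
```

Returning `None` instead of raising lets the caller decide whether to retry with a larger token limit or fall back to the raw output.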
+---
+## Quick start — vLLM
 ```python
 from PIL import Image
 from vllm import LLM, SamplingParams
 from transformers import AutoProcessor
 model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"
 llm  = LLM(model=model_id, trust_remote_code=True, dtype="bfloat16")
 proc = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
 img = Image.open("invoice_scan.png")
+# Build the text prompt with the chat template; vLLM takes the image separately.
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image"},
+            {"type": "text", "text": "Convert this to Markdown with reasoning."},
+        ],
+    }
+]
+prompt = proc.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+)
+params = SamplingParams(
+    max_tokens=1_024,
+    temperature=0.8,
+    top_p=0.95,
+)
+result = (
+    llm.generate([{"prompt": prompt, "multi_modal_data": {"image": img}}], params)[0]
+    .outputs[0]
+    .text.split("<answer>")[1]
+    .split("</answer>")[0]
+)
 print(result)
+```
+---
+## Citation
+If you use **NuMarkdown‑Qwen2.5‑VL** in your research, please cite the model:
+```bibtex
+@software{NuMarkdown-Qwen2.5-VL,
+  title        = {NuMarkdown-Qwen2.5-VL: Vision-language reasoning model for doc-to-Markdown},
+  author       = {NM-dev},
+  year         = 2025,
+  url          = {https://huggingface.co/NM-dev/NuMarkdown-Qwen2.5-VL},
+  license      = {MIT}
+}
+```
+---
+*Last updated: 2025‑08‑04*