echo840 committed 541413d (verified) · 1 parent: cc449da

Update README.md

Files changed (1): README.md (+91 -12)
README.md CHANGED
@@ -3,7 +3,6 @@ license: apache-2.0
pipeline_tag: visual-document-retrieval
library_name: transformers
---
-
<div align="center" xmlns="http://www.w3.org/1999/html">
<h1 align="center">
MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
@@ -21,7 +20,8 @@ MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradi
> Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, Xiang Bai <br>
[![arXiv](https://img.shields.io/badge/Arxiv-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2506.05218)
[![Source_code](https://img.shields.io/badge/Code-Available-white)](README.md)
- [![Model Weight](https://img.shields.io/badge/Model_Weight-gray)](https://huggingface.co/echo840/MonkeyOCR)
+ [![Model Weight](https://img.shields.io/badge/HuggingFace-gray)](https://huggingface.co/echo840/MonkeyOCR)
+ [![Model Weight](https://img.shields.io/badge/ModelScope-green)](https://modelscope.cn/models/l1731396519/MonkeyOCR)
[![Demo](https://img.shields.io/badge/Demo-blue)](http://vlrlabmonkey.xyz:7685/)
@@ -33,9 +33,11 @@ MonkeyOCR adopts a Structure-Recognition-Relation (SRR) triplet paradigm, which
2. Compared to end-to-end models, our 3B-parameter model achieves the best average performance on English documents, outperforming models such as Gemini 2.5 Pro and Qwen2.5 VL-72B.
3. For multi-page document parsing, our method reaches a processing speed of 0.84 pages per second, surpassing MinerU (0.65) and Qwen2.5 VL-7B (0.12).

-
<img src="https://v1.ax1x.com/2025/06/05/7jQ3cm.png" alt="7jQ3cm.png" border="0" />

+ MonkeyOCR currently does not support photographed documents, but we will continue to improve it in future updates. Stay tuned!
+ Currently, our model is deployed on a single GPU, so if too many users upload files at the same time, issues like “This application is currently busy” may occur. We are actively working on supporting Ollama and other deployment solutions to ensure a smoother experience for more users. Also note that the processing time shown on the demo page does not reflect computation time alone; it includes result uploading and other overhead, so it may be longer during periods of high traffic. The inference speeds of MonkeyOCR, MinerU, and Qwen2.5 VL-7B were measured on an H800 GPU.
+
## News
* ```2025.06.05 ``` 🚀 We release MonkeyOCR, which supports the parsing of various types of Chinese and English documents.
 
@@ -52,23 +54,45 @@ cd MonkeyOCR

# Install pytorch, see https://pytorch.org/get-started/previous-versions/ for your cuda version
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
- pip install .
+ pip install -e .
```
### 2. Download Model Weights
+ Download our model from Hugging Face.
```python
pip install huggingface_hub

- python download_model.py
+ python tools/download_model.py
+ ```
+ You can also download our model from ModelScope.

+ ```python
+ pip install modelscope
+
+ python tools/download_model.py -t modelscope
```
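If you prefer to fetch the weights with the `huggingface_hub` API directly instead of the `tools/download_model.py` script, a minimal sketch looks like this (the `model_weight` target directory is an illustrative assumption, not a path mandated by the repo):

```python
# Sketch: fetch the MonkeyOCR weights directly via huggingface_hub.
# Assumption: downloading into ./model_weight; adjust to your own layout.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="echo840/MonkeyOCR",  # model repo on Hugging Face
    local_dir="model_weight",     # hypothetical target directory
)
print(f"Weights downloaded to: {path}")
```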
### 3. Inference
```bash
# Make sure you are in the MonkeyOCR directory
python parse.py path/to/your.pdf
- # Specify MonkeyChat path and model configs path
- python parse.py path/to/your.pdf -m model_weight/Recognition -c config.yaml
+ # or with an image as input
+ python parse.py path/to/your/image
+ # Specify output path and model configs path
+ python parse.py path/to/your.pdf -o ./output -c config.yaml
```
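To parse a whole folder of PDFs, you can wrap the CLI shown above in a small script. This is only a sketch: it assumes you run it from the MonkeyOCR directory, that `parse.py` accepts the flags exactly as shown above, and the `docs` input folder is hypothetical.

```python
# Sketch: run parse.py over every PDF in a folder via the CLI shown above.
# Assumption: invoked from the MonkeyOCR repo root; "docs" is a hypothetical input folder.
import subprocess
from pathlib import Path

for pdf in sorted(Path("docs").glob("*.pdf")):
    subprocess.run(
        ["python", "parse.py", str(pdf), "-o", "./output"],
        check=True,  # stop on the first failure
    )
```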
+
+ #### Output Results
+ MonkeyOCR generates three types of output files:
+
+ 1. **Processed Markdown File** (`your.md`): The final parsed document content in markdown format, containing text, formulas, tables, and other structured elements.
+ 2. **Layout Results** (`your_layout.pdf`): The layout results drawn on the original PDF.
+ 3. **Intermediate Block Results** (`your_middle.json`): A JSON file containing detailed information about all detected blocks, including:
+    - Block coordinates and positions
+    - Block content and type information
+    - Relationship information between blocks
+
+ These files provide both the final formatted output and detailed intermediate results for further analysis or processing.
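As a starting point for such post-processing, the sketch below loads a `_middle.json` file and prints a block summary. The schema is not documented here, so the `blocks`, `type`, and `bbox` keys are hypothetical placeholders; inspect a real output file and adjust the keys accordingly.

```python
# Sketch: summarize the intermediate block results.
# Assumption: a top-level "blocks" list with "type" and "bbox" keys
# (hypothetical names; check a real your_middle.json for the actual schema).
import json

with open("output/your_middle.json", encoding="utf-8") as f:
    middle = json.load(f)

blocks = middle.get("blocks", [])
print(f"{len(blocks)} blocks detected")
for block in blocks:
    print(block.get("type"), block.get("bbox"))
```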
+
- ### 4. Gradio demo
+ ### 4. Gradio Demo
```bash
# Prepare your env for gradio
pip install gradio==5.23.3
@@ -78,8 +102,49 @@ pip install pdf2image==1.17.0
# Start demo
python demo/demo_gradio.py
```
- Using the [LMDeploy](https://github.com/InternLM/lmdeploy), our model can run efficiently on an NVIDIA 3090 GPU.
+ ### Fix **shared memory error** on **RTX 3090 / 4090 / ...** GPUs (Optional)
+
+ Our 3B model runs efficiently on an NVIDIA RTX 3090. However, when using **LMDeploy** as the inference backend, you may encounter compatibility issues on **RTX 3090 / 4090** GPUs, particularly the following error:
+
+ ```
+ triton.runtime.errors.OutOfResources: out of resource: shared memory
+ ```
+
+ To work around this issue, you can apply the patch below:
+
+ ```bash
+ python tools/lmdeploy_patcher.py patch
+ ```
+
+ > ⚠️ **Note:** This command will modify LMDeploy's source code in your environment.
+ > To revert the changes, simply run:
+
+ ```bash
+ python tools/lmdeploy_patcher.py restore
+ ```
+
+ **Special thanks to [@pineking](https://github.com/pineking) for the solution!**
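This error typically appears because consumer GPUs expose less shared memory per block than data-center cards. A quick, generic way to check which card and compute capability you are running on (an RTX 3090 reports sm_86, an RTX 4090 sm_89):

```python
# Sketch: check the GPU name and compute capability, to judge whether
# the patch above is likely needed (RTX 3090 = sm_86, RTX 4090 = sm_89).
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{name}: compute capability sm_{major}{minor}")
```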
+
+ ### Switch inference backend (Optional)
+
+ You can switch the inference backend to `transformers` by following the steps below:
+
+ 1. Install the required dependency (if not already installed):
+ ```bash
+ # install flash attention 2, you can download the corresponding version from https://github.com/Dao-AILab/flash-attention/releases/
+ pip install flash-attn==2.7.4.post1 --no-build-isolation
+ ```
+ 2. Open the `model_configs.yaml` file
+ 3. Set `chat_config.backend` to `transformers`
+ 4. Adjust the `batch_size` according to your GPU's memory capacity to ensure stable performance
+
+ Example configuration:
+
+ ```yaml
+ chat_config:
+   backend: transformers
+   batch_size: 10  # Adjust based on your available GPU memory
+ ```
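To sanity-check the configuration before a long run, you can read it back with PyYAML (a sketch; it assumes `model_configs.yaml` sits in the repo root with the structure shown above, and that `pyyaml` is installed):

```python
# Sketch: verify which backend model_configs.yaml selects.
# Assumption: file lives in the repo root and uses the structure shown above.
import yaml

with open("model_configs.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

chat_cfg = cfg["chat_config"]
print("backend:", chat_cfg["backend"])      # expect "transformers" after the switch
print("batch_size:", chat_cfg.get("batch_size"))
```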
  ## Benchmark Results
 
@@ -582,15 +647,29 @@ Here are the evaluation results of our model on OmniDocBench. MonkeyOCR-3B uses

## Visualization Demo

- Demo Link: http://vlrlabmonkey.xyz:7685
+ Get a quick hands-on experience with our demo: http://vlrlabmonkey.xyz:7685
+
> Our demo is simple and easy to use:
>
> 1. Upload a PDF or image.
> 2. Click “Parse (解析)” to let the model perform structure detection, content recognition, and relationship prediction on the input document. The final output will be a markdown-formatted version of the document.
- > 3. Select a prompt and click “Chat (对话)” to let the model perform content recognition on the image based on the selected prompt.
+ > 3. Select a prompt and click “Test by prompt” to let the model perform content recognition on the image based on the selected prompt.
+
+ ### Example for formula document
+ <img src="https://v1.ax1x.com/2025/06/10/7jVLgB.jpg" alt="7jVLgB.jpg" border="0" />

+ ### Example for table document
+ <img src="https://v1.ax1x.com/2025/06/11/7jcOaa.png" alt="7jcOaa.png" border="0" />

+ ### Example for newspaper
+ <img src="https://v1.ax1x.com/2025/06/11/7jcP5V.png" alt="7jcP5V.png" border="0" />

+ ### Example for financial report
+ <img src="https://v1.ax1x.com/2025/06/11/7jc10I.png" alt="7jc10I.png" border="0" />
+ <img src="https://v1.ax1x.com/2025/06/11/7jcRCL.png" alt="7jcRCL.png" border="0" />

## Citing MonkeyOCR
 
@@ -615,4 +694,4 @@ We would like to thank [MinerU](https://github.com/opendatalab/MinerU), [DocLayo


## Copyright
- MonkeyDoc dataset was collected from public datasets, crawled from the internet, and obtained through our own photography. The current technical report only presents the results of the 3B model. If you are interested in larger one, please contact Prof. Yuliang Liu at [email protected].
+ Please don’t hesitate to share your feedback; it is a key motivation that drives us to continuously improve our framework. The current technical report only presents the results of the 3B model. Our model is intended for non-commercial use. If you are interested in a larger one, please contact us at [email protected] or [email protected].