OCRFlux is a multimodal large language model based toolkit for converting PDFs and images into clean, readable Markdown text.

Try the online demo: https://ocrflux.pdfparser.io/
# Functions

## On each page

- Convert into text with a natural reading order, even in the presence of multi-column layouts, figures, and insets
- Support for complicated tables and equations
- Automatically remove headers and footers

## Cross-page table/paragraph merging

- Cross-page table merging
- Cross-page paragraph merging
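To make concrete what cross-page paragraph merging has to handle, here is a deliberately naive, rule-based sketch. OCRFlux does this with its VLM rather than a heuristic; the function below is purely illustrative:

```python
def merge_cross_page_paragraphs(pages):
    """Naive illustration of cross-page paragraph merging: if a page ends
    mid-sentence, glue the next page's first paragraph onto it.
    (OCRFlux uses a learned model, not this heuristic.)"""
    merged = []
    for page in pages:
        paragraphs = [p for p in page.split("\n\n") if p.strip()]
        for i, para in enumerate(paragraphs):
            # Previous page ended without terminal punctuation:
            # continue that paragraph instead of starting a new one.
            if i == 0 and merged and not merged[-1].rstrip().endswith((".", "!", "?", ":")):
                merged[-1] = merged[-1].rstrip() + " " + para.lstrip()
            else:
                merged.append(para.strip())
    return "\n\n".join(merged)
```

A heuristic like this fails on pages that legitimately end with a complete sentence mid-paragraph, which is exactly why OCRFlux treats merge detection as a model prediction task.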
## Key features:

- Superior parsing quality on each page
- Native support for cross-page table/paragraph merging (to the best of our knowledge, this is the first open source project to support this feature)
- Based on a 3B parameter VLM, so it can run even on an RTX 3090 GPU
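A rough back-of-the-envelope calculation shows why a 24 GB card is enough; the 2-bytes-per-parameter and 0.8 memory-fraction figures below are illustrative assumptions (the 0.8 mirrors the `gpu_memory_utilization=0.8` setting used in the inference example in this README):

```python
# Back-of-the-envelope VRAM estimate for a 3B-parameter model in half
# precision. Illustrative assumptions, not measured numbers.
params = 3e9                           # 3B parameters
bytes_per_param = 2                    # bf16 / fp16
weights_gib = params * bytes_per_param / 1024**3

card_vram_gib = 24                     # RTX 3090
vllm_budget_gib = 0.8 * card_vram_gib  # cf. gpu_memory_utilization=0.8
kv_cache_gib = vllm_budget_gib - weights_gib

print(f"weights ~{weights_gib:.1f} GiB; ~{kv_cache_gib:.1f} GiB of the "
      f"vLLM budget left for KV cache and activations")
```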
## News

- Jun 17, 2025 - v0.1.0 - Initial public launch and demo.
## Usage

The best way to use this model is via the [OCRFlux toolkit](https://github.com/chatdoc-com/OCRFlux). The toolkit comes with an efficient inference setup via vllm that can handle millions of documents at scale.
### API for directly calling OCRFlux (New)

You can use the inference API to call OCRFlux directly from your code, without running an online vllm server:

```python
from vllm import LLM
from ocrflux.inference import parse

file_path = 'test.pdf'
# file_path = 'test.png'
llm = LLM(model="model_dir/OCRFlux-3B", gpu_memory_utilization=0.8, max_model_len=8192)
result = parse(llm, file_path)
document_markdown = result['document_text']
with open('test.md', 'w') as f:
    f.write(document_markdown)
```

### Docker Usage

Requirements:

- Docker with GPU support [(NVIDIA Toolkit)](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
- Pre-downloaded model: [OCRFlux-3B](https://huggingface.co/ChatDOC/OCRFlux-3B)

To run OCRFlux in a Docker container, use the following example command:

```bash
docker run -it --gpus all \
  -v /path/to/localworkspace:/localworkspace \
  -v /path/to/test_pdf_dir:/test_pdf_dir/ \
  -v /path/to/OCRFlux-3B:/OCRFlux-3B \
  chatdoc/ocrflux:latest /localworkspace --data /test_pdf_dir/* --model /OCRFlux-3B/
```
+
#### Viewing Results
|
| 80 |
+
Generate the final Markdown files by running the following command. Generated Markdown files will be in `./localworkspace/markdowns/DOCUMENT_NAME` directory.
|
| 81 |
+
|
| 82 |
+
```bash
|
| 83 |
+
python -m ocrflux.jsonl_to_markdown ./localworkspace
|
| 84 |
+
```
|
| 85 |
+
|
| 86 |
+
### Full documentation for the pipeline

```bash
python -m ocrflux.pipeline --help
usage: pipeline.py [-h] [--task {pdf2markdown,merge_pages,merge_tables}] [--data [DATA ...]] [--pages_per_group PAGES_PER_GROUP] [--max_page_retries MAX_PAGE_RETRIES]
                   [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--model MODEL] [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE]
                   [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM] [--skip_cross_page_merge] [--port PORT]
                   workspace

Manager for running millions of PDFs through a batch inference pipeline

positional arguments:
  workspace             The filesystem path where work will be stored, can be a local folder

options:
  -h, --help            show this help message and exit
  --data [DATA ...]     List of paths to files to process
  --pages_per_group PAGES_PER_GROUP
                        Aiming for this many pdf pages per work item group
  --max_page_retries MAX_PAGE_RETRIES
                        Max number of times we will retry rendering a page
  --max_page_error_rate MAX_PAGE_ERROR_RATE
                        Rate of allowable failed pages in a document, 1/250 by default
  --workers WORKERS     Number of workers to run at a time
  --model MODEL         The path to the model
  --model_max_context MODEL_MAX_CONTEXT
                        Maximum context length that the model was fine tuned under
  --model_chat_template MODEL_CHAT_TEMPLATE
                        Chat template to pass to vllm server
  --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
                        Dimension on longest side to use for rendering the pdf pages
  --skip_cross_page_merge
                        Whether to skip cross-page merging
  --port PORT           Port to use for the VLLM server
```
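As intuition for what `--pages_per_group` controls, grouping documents into work items of roughly that many pages can be sketched as follows (a simplified illustration, not the pipeline's actual scheduling logic):

```python
def group_pages(docs, pages_per_group):
    """Greedily pack (doc, n_pages) entries into work-item groups of
    roughly `pages_per_group` pages each. Simplified illustration only."""
    groups, current, count = [], [], 0
    for doc, n_pages in docs:
        current.append(doc)
        count += n_pages
        if count >= pages_per_group:
            groups.append(current)
            current, count = [], 0
    if current:
        groups.append(current)  # leftover partial group
    return groups
```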
## Code overview

There are some nice reusable pieces of the code that may be useful for your own projects:

- Processing millions of PDFs through our released model using VLLM - [pipeline.py](https://github.com/chatdoc-com/OCRFlux/blob/main/ocrflux/pipeline.py)
- Generating final Markdowns from jsonl files - [jsonl_to_markdown.py](https://github.com/chatdoc-com/OCRFlux/blob/main/ocrflux/jsonl_to_markdown.py)
- Evaluating the model on the single-page parsing task - [eval_page_to_markdown.py](https://github.com/chatdoc-com/OCRFlux/blob/main/eval/eval_page_to_markdown.py)
- Evaluating the model on the table parsing task - [eval_table_to_html.py](https://github.com/chatdoc-com/OCRFlux/blob/main/eval/eval_table_to_html.py)
- Evaluating the model on the paragraph/table merging detection task - [eval_element_merge_detect.py](https://github.com/chatdoc-com/OCRFlux/blob/main/eval/eval_element_merge_detect.py)
- Evaluating the model on the table merging task - [eval_html_table_merge.py](https://github.com/chatdoc-com/OCRFlux/blob/main/eval/eval_html_table_merge.py)

### Benchmark for single-page parsing

We ship two comprehensive benchmarks to help measure the performance of our OCR system in single-page parsing: