redmoe-ai-v1 and ariG23498 (HF Staff) committed
Commit 89eabff · verified · 1 parent: 4d84cbc

Adding `transformers` as a library, and also mentioning the `custom_code` tag (#29)

- Adding `transformers` as a library, and also mentioning the `custom_code` tag (9becf2f563569f966f1825ef59e0f0a3b46e56c1)


Co-authored-by: Aritra Roy Gosthipaty <[email protected]>
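
Note: the `custom_code` tag signals that this model loads repository-provided modeling code, which is why the committed snippet below passes `trust_remote_code=True`. As a quick way to confirm the metadata, here is a minimal sketch using `huggingface_hub`; the repo id `rednote-hilab/dots.ocr` is an assumption taken from the asset URLs in the README:

```py
# Minimal sketch: list the model card metadata this commit adds.
# Assumption: the repo id is rednote-hilab/dots.ocr (from the README asset URLs).
from huggingface_hub import model_info

info = model_info("rednote-hilab/dots.ocr")
print(info.library_name)  # should report "transformers" once the tag is live
print([t for t in info.tags if t in {"transformers", "custom_code"}])
```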

Files changed (1)
1. README.md +82 -1
README.md CHANGED
@@ -9,6 +9,8 @@ tags:
 - layout
 - table
 - formula
+- transformers
+- custom_code
 language:
 - en
 - zh
 
@@ -49,6 +51,85 @@ dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
 4. **Efficient and Fast Performance:** Built upon a compact 1.7B LLM, **dots.ocr** provides faster inference speeds than many other high-performing models based on larger foundations.


+## Usage with transformers
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoProcessor
+from qwen_vl_utils import process_vision_info
+# dict_promptmode_to_prompt holds the repo's predefined prompt templates;
+# the layout prompt used below is written out inline instead.
+from dots_ocr.utils import dict_promptmode_to_prompt
+
+model_path = "./weights/DotsOCR"
+# The custom_code tag on the repo is what makes trust_remote_code=True necessary.
+model = AutoModelForCausalLM.from_pretrained(
+    model_path,
+    attn_implementation="flash_attention_2",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True
+)
+processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+
+image_path = "demo/demo_image1.jpg"
+prompt = """Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.
+
+1. Bbox format: [x1, y1, x2, y2]
+
+2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].
+
+3. Text Extraction & Formatting Rules:
+    - Picture: For the 'Picture' category, the text field should be omitted.
+    - Formula: Format its text as LaTeX.
+    - Table: Format its text as HTML.
+    - All Others (Text, Title, etc.): Format their text as Markdown.
+
+4. Constraints:
+    - The output text must be the original text from the image, with no translation.
+    - All layout elements must be sorted according to human reading order.
+
+5. Final Output: The entire output must be a single JSON object.
+"""
+
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "image",
+                "image": image_path
+            },
+            {"type": "text", "text": prompt}
+        ]
+    }
+]
+
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+
+inputs = inputs.to("cuda")
+
+# Inference: generate the output, then strip the prompt tokens from each sequence
+generated_ids = model.generate(**inputs, max_new_tokens=24000)
+generated_ids_trimmed = [
+    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
+
 ### Performance Comparison: dots.ocr vs. Competing Models
 <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/chart.png" border="0" />
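
The prompt in the committed snippet constrains the response to a single JSON object whose layout elements carry a bbox, a category, and (except for `Picture`) the corresponding text. A minimal post-processing sketch follows; the exact key names (`bbox`, `category`, `text`) are assumed from the prompt wording rather than a documented schema:

```py
# Minimal sketch: parse the decoded generation from the snippet above.
# Assumption: elements expose "bbox", "category", and "text" keys, per the prompt wording.
import json

layout = json.loads(output_text[0])

# Accept either a bare list of elements or an object wrapping one.
elements = layout if isinstance(layout, list) else layout.get("elements", [])

for el in elements:
    x1, y1, x2, y2 = el["bbox"]  # [x1, y1, x2, y2], per the prompt's bbox format
    text = el.get("text", "")    # omitted for 'Picture' elements
    print(f"{el['category']} @ ({x1}, {y1}, {x2}, {y2}): {text[:60]}")
```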
 
 
@@ -1231,4 +1312,4 @@ We also thank [DocLayNet](https://github.com/DS4SD/DocLayNet), [M6Doc](https://g
 - **Performance Bottleneck:** Despite its 1.7B parameter LLM foundation, **dots.ocr** is not yet optimized for high-throughput processing of large PDF volumes.

 We are committed to achieving more accurate table and formula parsing, as well as enhancing the model's OCR capabilities for broader generalization, all while aiming for **a more powerful, more efficient model**. Furthermore, we are actively considering the development of **a more general-purpose perception model** based on Vision-Language Models (VLMs), which would integrate general detection, image captioning, and OCR tasks into a unified framework. **Parsing the content of the pictures in the documents** is also a key priority for our future work.
-We believe that collaboration is the key to tackling these exciting challenges. If you are passionate about advancing the frontiers of document intelligence and are interested in contributing to these future endeavors, we would love to hear from you. Please reach out to us via email at: [[email protected]].
+We believe that collaboration is the key to tackling these exciting challenges. If you are passionate about advancing the frontiers of document intelligence and are interested in contributing to these future endeavors, we would love to hear from you. Please reach out to us via email at: [[email protected]].