Files changed (1)
  1. reducto_RolmOCR.json +111 -0
reducto_RolmOCR.json ADDED
@@ -0,0 +1,111 @@
+ {
+   "bomFormat": "CycloneDX",
+   "specVersion": "1.6",
+   "serialNumber": "urn:uuid:d8411c65-830d-47b0-928a-db17d1885512",
+   "version": 1,
+   "metadata": {
+     "timestamp": "2025-06-05T09:41:39.920661+00:00",
+     "component": {
+       "type": "machine-learning-model",
+       "bom-ref": "reducto/RolmOCR-9488b32d-f36e-501e-bed2-4e214df3640f",
+       "name": "reducto/RolmOCR",
+       "externalReferences": [
+         {
+           "url": "https://huggingface.co/reducto/RolmOCR",
+           "type": "documentation"
+         }
+       ],
+       "modelCard": {
+         "modelParameters": {
+           "task": "image-text-to-text",
+           "architectureFamily": "qwen2_5_vl",
+           "modelArchitecture": "Qwen2_5_VLForConditionalGeneration",
+           "datasets": [
+             {
+               "ref": "allenai/olmOCR-mix-0225-60ff9e3d-6392-58a9-97f8-ebf183f689d7"
+             }
+           ]
+         },
+         "properties": [
+           {
+             "name": "library_name",
+             "value": "transformers"
+           },
+           {
+             "name": "base_model",
+             "value": "Qwen/Qwen2.5-VL-7B-Instruct"
+           }
+         ]
+       },
+       "authors": [
+         {
+           "name": "reducto"
+         }
+       ],
+       "licenses": [
+         {
+           "license": {
+             "id": "Apache-2.0",
+             "url": "https://spdx.org/licenses/Apache-2.0.html"
+           }
+         }
+       ],
+       "tags": [
+         "transformers",
+         "safetensors",
+         "qwen2_5_vl",
+         "image-text-to-text",
+         "conversational",
+         "dataset:allenai/olmOCR-mix-0225",
+         "base_model:Qwen/Qwen2.5-VL-7B-Instruct",
+         "base_model:finetune:Qwen/Qwen2.5-VL-7B-Instruct",
+         "license:apache-2.0",
+         "text-generation-inference",
+         "endpoints_compatible",
+         "region:us"
+       ]
+     }
+   },
+   "components": [
+     {
+       "type": "data",
+       "bom-ref": "allenai/olmOCR-mix-0225-60ff9e3d-6392-58a9-97f8-ebf183f689d7",
+       "name": "allenai/olmOCR-mix-0225",
+       "data": [
+         {
+           "type": "dataset",
+           "bom-ref": "allenai/olmOCR-mix-0225-60ff9e3d-6392-58a9-97f8-ebf183f689d7",
+           "name": "allenai/olmOCR-mix-0225",
+           "contents": {
+             "url": "https://huggingface.co/datasets/allenai/olmOCR-mix-0225",
+             "properties": [
+               {
+                 "name": "configs",
+                 "value": "Name of the dataset subset: 00_documents {\"split\": \"train_s2pdf\", \"path\": [\"train-s2pdf.parquet\"]}, {\"split\": \"eval_s2pdf\", \"path\": [\"eval-s2pdf.parquet\"]}"
+               },
+               {
+                 "name": "configs",
+                 "value": "Name of the dataset subset: 01_books {\"split\": \"train_iabooks\", \"path\": [\"train-iabooks.parquet\"]}, {\"split\": \"eval_iabooks\", \"path\": [\"eval-iabooks.parquet\"]}"
+               },
+               {
+                 "name": "license",
+                 "value": "odc-by"
+               }
+             ]
+           },
+           "governance": {
+             "owners": [
+               {
+                 "organization": {
+                   "name": "allenai",
+                   "url": "https://huggingface.co/allenai"
+                 }
+               }
+             ]
+           },
+           "description": "\n\t\n\t\t\n\t\tolmOCR-mix-0225\n\t\n\nolmOCR-mix-0225 is a dataset of ~250,000 PDF pages which have been OCRed into plain-text in a natural reading order using gpt-4o-2024-08-06 and a special\nprompting strategy that preserves any born-digital content from each page.\nThis dataset can be used to train, fine-tune, or evaluate your own OCR document pipeline.\nQuick links:\n\n\ud83d\udcc3 Paper\n\ud83e\udd17 Model\n\ud83d\udee0\ufe0f Code\n\ud83c\udfae Demo\n\n\n\t\n\t\t\n\t\n\t\n\t\tData Mix\n\t\n\n\n\t\n\t\n\t\n\t\tTable 1: Training set composition by source\n\t\n\n\n\t\n\t\t\nSource\nUnique\u2026 See the full description on the dataset page: https://huggingface.co/datasets/allenai/olmOCR-mix-0225."
+         }
+       ]
+     }
+   ]
+ }
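
In the file above, the model card's `datasets` entry holds only a `ref`, and that string matches the `bom-ref` of the `data` component listed under `components`. A minimal Python sketch of resolving that link might look like the following. The helper name `resolve_dataset_refs` and the trimmed `BOM_JSON` literal are illustrative assumptions, not part of this change; the literal copies only the fields needed from the added file.

```python
import json

# Trimmed copy of the added BOM: only the fields needed to resolve the ref.
# (Illustrative subset; the real file carries many more fields.)
BOM_JSON = """
{
  "metadata": {
    "component": {
      "name": "reducto/RolmOCR",
      "modelCard": {
        "modelParameters": {
          "datasets": [
            {"ref": "allenai/olmOCR-mix-0225-60ff9e3d-6392-58a9-97f8-ebf183f689d7"}
          ]
        }
      }
    }
  },
  "components": [
    {
      "type": "data",
      "bom-ref": "allenai/olmOCR-mix-0225-60ff9e3d-6392-58a9-97f8-ebf183f689d7",
      "name": "allenai/olmOCR-mix-0225"
    }
  ]
}
"""


def resolve_dataset_refs(bom):
    """Map each dataset ref on the model card to the matching component name.

    Hypothetical helper: refs with no matching component map to None.
    """
    # Index top-level components by their bom-ref for direct lookup.
    by_ref = {c["bom-ref"]: c for c in bom.get("components", []) if "bom-ref" in c}
    model = bom["metadata"]["component"]
    datasets = model.get("modelCard", {}).get("modelParameters", {}).get("datasets", [])
    resolved = {}
    for entry in datasets:
        ref = entry.get("ref")
        if ref is None:
            continue  # entries without a ref are skipped in this sketch
        component = by_ref.get(ref)
        resolved[ref] = component["name"] if component else None
    return resolved


bom = json.loads(BOM_JSON)
resolved = resolve_dataset_refs(bom)
for ref, name in resolved.items():
    print(f"{ref} -> {name}")
```

A `None` value in the result would indicate a dangling reference, which is one quick consistency check a reviewer could run against a generated BOM like this one.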