Books from the Survivor Library (mostly ~1920s & earlier) OCR'd with recent VLMs
BEEspoke Data
community
AI & ML interests
'an LLM is only as good as the dataset it was trained on' - Sun Tzu
Organization Card
🐝📊💁
🚧"raw" pretrained smol_llama checkpoints - WIP 🚧
BEE-spoke-data/smol_llama-101M-GQA
Text Generation • 0.1B • Updated • 2.7k • 30
BEE-spoke-data/smol_llama-81M-tied
Text Generation • 81.3M • Updated • 1.22k • 9
BEE-spoke-data/smol_llama-220M-GQA
Text Generation • 0.2B • Updated • 3.42k • 13
BEE-spoke-data/verysmol_llama-v11-KIx2
Text Generation • 58.1M • Updated • 1.12k • 4
models 57
BEE-spoke-data/neobert-100k-test
Fill-Mask • 0.1B • Updated • 15
BEE-spoke-data/tiny-random-MPNetForMaskedLM
Fill-Mask • 237k • Updated • 12
BEE-spoke-data/wordpiece-tokenizer-32k-en_code-msp
Updated
BEE-spoke-data/wordpiece-tokenizer-32k-en_code-orig
Updated
BEE-spoke-data/bpe-tokenizer-32k-smolNeoX
Updated
BEE-spoke-data/pegasus-x-base-synthsumm_open-16k
Summarization • 0.3B • Updated • 23 • 2
BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan
0.7B • Updated • 73
BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L2
Text Generation • 0.7B • Updated • 27
BEE-spoke-data/tFINE-900m-e16-d32-instruct_2e
0.9B • Updated • 44
BEE-spoke-data/tFINE-900m-instruct-orpo
0.9B • Updated • 17
datasets 82
BEE-spoke-data/govdocs1-pdf-source
Viewer • Updated • 235k • 2.26k • 3
BEE-spoke-data/govdocs1-by-extension
Viewer • Updated • 733k • 1.33k • 2
BEE-spoke-data/SurvivorLib-Nanonets-OCR-s
Viewer • Updated • 11.7k • 55 • 2
BEE-spoke-data/SurvivorLib-rolmOCR
Viewer • Updated • 13.3k • 49 • 1
BEE-spoke-data/napierone-pdf-nanonets-s
Viewer • Updated • 9.96k • 37
BEE-spoke-data/napierone-pdf-olmOCR
Viewer • Updated • 19k • 29
BEE-spoke-data/LONGCOT-merged-1M
Viewer • Updated • 1.7M • 86 • 2
BEE-spoke-data/cosmopedia-v2-mincols
Viewer • Updated • 39.1M • 148 • 1
BEE-spoke-data/reddit-title-body-hf
Viewer • Updated • 251M • 987 • 4
BEE-spoke-data/bigpatent-all
Viewer • Updated • 2.43M • 716