Daniel van Strien PRO
AI & ML interests
Machine Learning Librarian
Recent Activity
- Updated a dataset about 4 hours ago: data-is-better-together/fineweb-c-progress
- Updated a dataset about 21 hours ago: librarian-bots/dataset-columns
- Updated a dataset 3 days ago: librarian-bots/arxiv-metadata-snapshot
hub-tldr
Creating a smol model for tl;dr-ing the hub
- davanstrien/Smol-Hub-tldr (Text Generation • 0.4B • Updated • 34 • 11)
- Semantic Hugging Face Hub Search 🔎 (Running • 84): Find datasets and models using semantic search
- davanstrien/hub-tldr-dataset-summaries-llama (Viewer • Updated • 5k • 92 • 1)
- davanstrien/hub-tldr-model-summaries-llama (Viewer • Updated • 5k • 83 • 1)
synthetic-data-generation-demos
A collection of demos for various approaches to synthetic data generation
- Genstruct 7B 👀 (Runtime error • 8)
- Instruction Synthesizer 🐠 (Runtime error • Featured • 86): Generate instruction-response pairs from text
- Magpie 🐦 (Running on Zero • Featured • 72): Generate and rate instruction-response pairs
- Bonito 💬 (Runtime error • 11): Generate task-specific instructions and responses from text
Synthetic (text) Dataset Generation
Papers about synthetic dataset generation
- Better Synthetic Data by Retrieving and Transforming Existing Datasets (Paper • 2404.14361 • Published • 2)
- Generative AI for Synthetic Data Generation: Methods, Challenges and the Future (Paper • 2403.04190 • Published • 1)
- Best Practices and Lessons Learned on Synthetic Data for Language Models (Paper • 2404.07503 • Published • 31)
- A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models (Paper • 2404.14445 • Published)
Historic language modeling
This collection contains models, datasets and spaces related to historic language models, i.e. language models trained on historic data
Image Preference Optimization Datasets
Datasets suitable for Image Preference Optimization based on their column names
Reasoning Required?
- davanstrien/reasoning-required (Viewer • Updated • 5k • 277 • 19)
- davanstrien/ModernBERT-based-Reasoning-Required (Text Classification • 0.1B • Updated • 40 • 10)
- davanstrien/fineweb-with-reasoning-scores-and-topics (Viewer • Updated • 10k • 70 • 1)
- davanstrien/fine-reasoning-questions (Viewer • Updated • 244 • 127 • 19)
Maths reasoning
Maths reasoning datasets found using https://huggingface.co/spaces/librarian-bots/huggingface-datasets-semantic-search
sentence-transformers-from-synthetic-data
Example of using distilabel to generate synthetic triplet data for fine-tuning a Sentence Transformer model
haiku
🌸 A collection of synthetic datasets built to help open language models write better haikus through the use of DPO
Probably DPO datasets
A collection of datasets that probably support DPO
query-to-hub-datasets-viewer-project
Datasets Wrapped 2025: Reasoning
The reasoning datasets that defined 2025. Part 1 of Datasets Wrapped 2025. #DatasetsWrapped2025