All models and datasets relating to GEITje
Edwin Rijgersberg PRO
Rijgersberg
Recent Activity
liked a dataset 5 days ago
BramVanroy/CommonCrawl-CreativeCommons-recommended
reacted to BramVanroy's post with 👍 5 days ago
By popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do the filtering manually:
- C5f (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, samples with non-commercial licenses, and Wikipedia samples. Wikipedia samples are removed because you should probably get them from a more reliable source that provides better-parsed content.
These filters lead to a massive reduction in quantity. Document and token counts are given on the dataset pages.
liked a dataset 9 days ago
nvidia/Granary