bhatta1 committed on
Commit d3d9cef · verified · 1 Parent(s): 7f4dabe

Update README.md

Files changed (1)
  1. README.md +1 -13
README.md CHANGED
@@ -8,21 +8,9 @@ language:
 
  **Model Summary**
 
- Recently, IBM has introduced GneissWeb; a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. The models trained using GneissWeb dataset outperform those trained on FineWeb 1.1.0 by 2.14 percentage points in terms of average score computed on a set of 11 commonly used benchmarks.
-
  In order to be able to reproduce GneissWeb, we provide here GneissWeb.Med_classifier a medical category fastText classifier. This fastText model is used as part of the ensemble filter in GneissWeb to detect documents with medical content.
 
-
- **Intended Use**
-
- The fastText model takes as input text and classifies whether the text categorized as ''medical'' (labeled as `__label__hq`) or other categories''cc'' (labeled as `__label__cc`).
- The model can be used with python (please refer to [fasttext documentation](https://fasttext.cc/docs/en/python-module.html) for details on using fasttext classifiers)
- or with [IBM Data Prep Kit](https://github.com/IBM/data-prep-kit/) (DPK) (please refer to the [example notebook](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/gneissweb_classification/gneissweb_classification.ipynb) for using a fastText model with DPK).
-
- The GneissWeb ensemble filter uses the confidence score given to `__label__hq` for filtering documents based on an appropriately chosen threshold.
- The fastText model is used along with [GneissWeb.Edu_classifier](https://huggingface.co/ibm-granite/GneissWeb.Edu_classifier), [GneissWeb.Tech_classifier](https://huggingface.co/ibm-granite/GneissWeb.Tech_classifier), and [GneissWeb.Sci_classifier](https://huggingface.co/ibm-granite/GneissWeb.Sci_classifier) and other quality annotators.
-
-
+ Please refer to the [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) page for more details
 
       **Developers**: IBM Research
 
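
The Intended Use text removed by this commit says the GneissWeb ensemble filter thresholds the confidence score that the fastText model assigns to `__label__hq`. A minimal Python sketch of that step follows. It assumes the `fasttext` Python module's `load_model` and `predict(text, k=-1)` calls; the helper names, the model path, and the 0.5 threshold are illustrative assumptions, not values taken from the README.

```python
from typing import Sequence

HQ_LABEL = "__label__hq"  # label the README assigns to "medical" text


def hq_confidence(labels: Sequence[str], probs: Sequence[float]) -> float:
    """Return the probability attached to __label__hq, or 0.0 if it is absent."""
    for label, prob in zip(labels, probs):
        if label == HQ_LABEL:
            return float(prob)
    return 0.0


def score_text(model, text: str) -> float:
    """Score one document with an already loaded fastText model."""
    # fastText predicts one line at a time, so newlines are flattened first.
    labels, probs = model.predict(text.replace("\n", " "), k=-1)
    return hq_confidence(labels, probs)


def is_medical(score: float, threshold: float = 0.5) -> bool:
    """Apply a cutoff to the __label__hq score.

    threshold=0.5 is a placeholder: the README only says the threshold is
    "appropriately chosen" and does not give a value.
    """
    return score >= threshold


def load_classifier(path: str):
    """Load a fastText binary model; the path passed in is hypothetical."""
    import fasttext  # imported lazily so the helpers above run without it

    return fasttext.load_model(path)
```

With the model file downloaded locally, usage would look like `is_medical(score_text(load_classifier("path/to/model.bin"), document_text))`, where the path is a placeholder for wherever the classifier file is stored.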