Update README.md
README.md CHANGED
@@ -303,7 +303,7 @@ for output in outputs:
 
 The pre-training corpus comprises data from 35 European languages and 92 programming languages, with detailed data sources provided below.
 The initial 1.6 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportions to balance the representation
-and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). To this end, code and English data were downsampled to half,
+and give more importance to Spain’s co-official languages (Spanish, Catalan, Galician, and Basque). To this end, code and English data were downsampled to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
 During the subsequent training, the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
 This adjustment resulted in a total of 2.68 trillion tokens used across 2 epochs, distributed as outlined below:
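The re-weighting rule described in the changed paragraph can be spelled out with a minimal sketch. The snippet below is purely illustrative and is not the project's actual data pipeline: the source names and token counts are made-up placeholders, and only the multipliers (0.5x for code and English, 2x for Spain's co-official languages, 1x for everything else) come from the text above.

```python
# Illustrative sketch only: recompute per-source sampling proportions from raw
# token counts using the multipliers described in the README paragraph.
# Source names and token counts are hypothetical placeholders.

raw_tokens = {          # hypothetical raw token counts (billions)
    "code": 600.0,
    "en": 900.0,
    "es": 250.0,
    "ca": 40.0,
    "gl": 10.0,
    "eu": 8.0,
    "other_eu_langs": 500.0,
}

DOWNSAMPLED = {"code", "en"}            # halved
OVERSAMPLED = {"es", "ca", "gl", "eu"}  # Spain's co-official languages, 2x


def mixture_weights(tokens: dict[str, float]) -> dict[str, float]:
    """Return normalized sampling proportions after re-weighting."""
    adjusted = {}
    for source, count in tokens.items():
        if source in DOWNSAMPLED:
            adjusted[source] = count * 0.5
        elif source in OVERSAMPLED:
            adjusted[source] = count * 2.0
        else:
            adjusted[source] = count  # kept at its original proportion
    total = sum(adjusted.values())
    return {source: count / total for source, count in adjusted.items()}


if __name__ == "__main__":
    for source, weight in mixture_weights(raw_tokens).items():
        print(f"{source}: {weight:.3f}")
```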