---
language:
- nl
datasets:
- yhavinga/mc4_nl_cleaned
- yhavinga/ccmatrix
tags:
- t5
- translation
- seq2seq
pipeline_tag: translation
widget:
- text: "It is a painful and tragic spectacle that rises before me: I have drawn back the curtain from the rottenness of man. This word, in my mouth, is at least free from one suspicion: that it involves a moral accusation against humanity. It is used--and I wish to emphasize the fact again--without any moral significance: and this is so far true that the rottenness I speak of is most apparent to me precisely in those quarters where there has been most aspiration, hitherto, toward 'virtue' and 'godliness.'"
- text: "For once Fletcher’s sedate features showed a certain lightness. 'I believe I will linger awhile longer.' He indicated a holoscreen which was displaying the image from an external camera. Cloud-splattered landscape was rolling past, pastel greens, browns, and blues illuminated by Duke’s radiance. 'It is not often a mortal man is permitted to view a world over the shoulder of angels.'"
license: apache-2.0
---
|
|
|
|
|
# t5-small-24L-ccmatrix-multi
|
|
|
|
|
A [T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) sequence-to-sequence model
pre-trained from scratch on [cleaned Dutch 🇳🇱🇧🇪 mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned).

This **t5 eff** model has **249M** parameters.
It was pre-trained on the dataset
`mc4_nl_cleaned` config `large_en_nl` for **1** epoch and a duration of **4d10h**,
with a sequence length of **512**, batch size **128** and **851852** total steps.
Pre-training evaluation loss and accuracy are **1.18** and **0.74**.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Tokenizer

The model uses a cased SentencePiece tokenizer configured with the `Nmt, NFKC, Replace multi-space to single-space` normalizers
and has 32003 tokens.
It was trained on Dutch and English with scripts from the Hugging Face Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
See [tokenizer.json](./raw/main/tokenizer.json) for details.
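
For illustration (not part of the original card), the tokenizer can be loaded and inspected with 🤗 Transformers. The repo id `yhavinga/t5-small-24L-ccmatrix-multi` is assumed from this card's title.

```python
# Minimal sketch: load the tokenizer and inspect it.
# The repo id below is an assumption based on the card title.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-small-24L-ccmatrix-multi")
print(len(tokenizer))                               # the card reports 32003 tokens
print(tokenizer.tokenize("Het is een mooie dag."))  # cased SentencePiece pieces
```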
|
|
|
|
|
## Dataset

All models listed below are trained on
[cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
which is the original mC4, except that

* Documents containing words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
* Sentences with fewer than 3 words are removed
* Sentences containing a word of more than 1000 characters are removed
* Documents with fewer than 5 sentences are removed
* Documents containing "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
  "use of cookies", "use cookies", "elementen ontbreken" or "deze printversie" are removed.

The Dutch and English models are trained on a 50/50 mix of cleaned Dutch mC4 and English C4.
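
Purely as an illustration, the sketch below expresses the filtering rules above in Python. It is not the actual cleaning code; `BAD_WORDS` is an abbreviated stand-in for the LDNOOBW word lists and the sentence splitter is a simplification.

```python
# Hypothetical sketch of the mc4_nl_cleaned document filters described above.
import re

BAD_WORDS = {"badword1", "badword2"}  # abbreviated stand-in for the Dutch + English LDNOOBW lists
BANNED_PHRASES = ["javascript", "lorum ipsum", "terms of use", "privacy policy",
                  "cookie policy", "uses cookies", "use of cookies", "use cookies",
                  "elementen ontbreken", "deze printversie"]

def keep_sentence(sentence: str) -> bool:
    words = sentence.split()
    if len(words) < 3:                        # drop sentences with fewer than 3 words
        return False
    if any(len(w) > 1000 for w in words):     # drop sentences containing a word > 1000 characters
        return False
    return True

def keep_document(text: str) -> bool:
    lower = text.lower()
    if any(phrase in lower for phrase in BANNED_PHRASES):   # drop documents with banned phrases
        return False
    if any(w in BAD_WORDS for w in re.findall(r"\w+", lower)):  # drop documents with bad words
        return False
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if keep_sentence(s)]
    return len(sentences) >= 5                # drop documents with fewer than 5 sentences
```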
|
|
|
|
|
## Models

Three types of models have been trained. `t5-base-dutch` is the only model with an original T5 config.
The other model types, t5-v1.1 and t5-eff, use `gated-gelu` instead of `relu` as activation function
and were trained with a dropout of `0.0`, unless training would otherwise diverge (`t5-v1.1-large-dutch-cased`).
The t5-eff models differ mostly in the number of layers. The table below lists
the dimensions of these models. Note that `efficient` is a misnomer for models with few layers,
e.g. `t5-xl-4L-dutch-english-cased`, which is not efficient and is one of the worst models on downstream summarization.
|
|
|
|
|
| | t5-base-dutch | t5-v1.1-base-dutch-uncased | t5-v1.1-base-dutch-cased | t5-v1.1-large-dutch-cased | t5-v1_1-base-dutch-english-cased | t5-v1_1-base-dutch-english-cased-1024 | t5-small-24L-dutch-english | t5-xl-4L-dutch-english-cased | t5-base-36L-dutch-english-cased | t5-eff-xl-8l-dutch-english-cased | t5-eff-large-8l-dutch-english-cased |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| type | t5 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5 eff | t5 eff | t5 eff | t5 eff | t5 eff |
| d_model | 768 | 768 | 768 | 1024 | 768 | 768 | 512 | 2048 | 768 | 1024 | 1024 |
| d_ff | 3072 | 2048 | 2048 | 2816 | 2048 | 2048 | 1920 | 5120 | 2560 | 16384 | 4096 |
| num_heads | 12 | 12 | 12 | 16 | 12 | 12 | 8 | 32 | 12 | 32 | 16 |
| d_kv | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 64 |
| num_layers | 12 | 12 | 12 | 24 | 12 | 12 | 24 | 4 | 36 | 8 | 8 |
| num parameters | 223M | 248M | 248M | 783M | 248M | 248M | 250M | 585M | 729M | 1241M | 335M |
| feed_forward_proj | relu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu |
| dropout | 0.1 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 |
| dataset | mc4_nl_cleaned | mc4_nl_cleaned full | mc4_nl_cleaned full | mc4_nl_cleaned | mc4_nl_cleaned small_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl |
| tr. seq len | 512 | 1024 | 1024 | 512 | 512 | 1024 | 512 | 512 | 512 | 512 | 512 |
| batch size | 128 | 64 | 64 | 64 | 128 | 64 | 128 | 512 | 512 | 64 | 128 |
| total steps | 527500 | 1014525 | 1210154 | 2427498 | 2839630 | 1520k/3397024 | 851852 | 212963 | 212963 | 538k/1703705 | 851850 |
| epochs | 1 | 2 | 2 | 2 | 10 | 4 | 1 | 1 | 1 | 1 | 1 |
| duration | 2d9h | 5d5h | 6d6h | 8d13h | 11d18h | 9d1h | 4d10h | 6d1h | 17d15h | 4d19h | 3d23h |
| optimizer | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor |
| lr | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.009 | 0.005 | 0.005 |
| warmup | 10000 | 10000 | 10000 | 10000 | 10000 | 5000 | 20000 | 2500 | 1000 | 1500 | 1500 |
| eval loss | 1.38 | 1.20 | 0.96 | 1.07 | 1.11 | 1.13 | 1.18 | 1.27 | 1.05 | 1.3019 | 1.15 |
| eval acc | 0.70 | 0.73 | 0.78 | 0.76 | 0.75 | 0.74 | 0.74 | 0.72 | 0.76 | 0.71 | 0.74 |
|
|
|
|
|
## Evaluation on summarization

The models below have been evaluated on the downstream summarization task on 50K samples from the CNN Dailymail dataset.
All models were fine-tuned with the AdamW optimizer, a batch size of 128 and a constant learning rate of 1e-3 after a
warmup of 64 steps, with a label smoothing factor of 0.05.
Article and summary token lengths were set to 1024 and 142.
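
For reference, the sketch below mirrors these fine-tuning settings with the 🤗 Transformers `Seq2SeqTrainingArguments` API. It is an assumed reconstruction, not the authors' actual fine-tuning script; the output directory is a placeholder.

```python
# Hedged sketch of the summarization fine-tuning settings described above.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./summarization-finetune",     # placeholder path
    per_device_train_batch_size=128,           # batch size 128
    learning_rate=1e-3,                        # constant learning rate after warmup
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=64,
    label_smoothing_factor=0.05,
    predict_with_generate=True,
    generation_max_length=142,                 # summary token length; articles truncated to 1024 at tokenization
)
# AdamW is the default optimizer used by the Trainer.
```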
|
|
|
|
|
| | t5-base-dutch | t5-v1.1-base-dutch-uncased | t5-v1.1-base-dutch-cased | t5-v1_1-base-dutch-english-cased | t5-v1_1-base-dutch-english-cased-1024 | t5-small-24L-dutch-english | t5-xl-4L-dutch-english-cased | t5-base-36L-dutch-english-cased | t5-eff-large-8l-dutch-english-cased | mt5-base |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| rouge1 | 33.0313 | 33.8432 | 34.0906 | 33.1116 | 34.6465 | 34.376 | 30.8983 | 35.0931 | 33.9293 | 33.6466 |
| rouge2 | 12.9452 | 13.7706 | 13.6203 | 13.275 | 13.8525 | 13.8939 | 11.6005 | 14.3823 | 13.6274 | 13.1085 |
| rougeL | 23.7204 | 24.5642 | 24.7304 | 24.3561 | 24.721 | 25.2496 | 22.6536 | 25.3213 | 24.5595 | 23.909 |
| rougeLsum | 29.842 | 30.7783 | 31.1438 | 30.0548 | 31.6104 | 31.3838 | 27.8467 | 32.3526 | 30.952 | 30.5054 |
| gen_len | 90.488 | 91.832 | 92.122 | 89.583 | 98.333 | 90.442 | 92.342 | 96.832 | 95.057 | 96.312 |
| num parameters | 223M | 248M | 248M | 248M | 248M | 250M | 585M | 729M | 335M | 582M |
| samples_per_second | 3.195 | 3.039 | 3.0 | 3.216 | 2.974 | 1.594 | 2.47 | 0.623 | 3.087 | 1.201 |
|
|
|
|
|
## Translation models

The small 24L and base 36L models have been fine-tuned for translation on the CCMatrix dataset.
The models named `*-multi` support both directions of translation. The models are trained on CCMatrix only. Since this is
a very large dataset with over 100M Dutch-English sentence pairs, the models are trained on only a fraction of it;
refer to the table below for the training duration. Evaluation is performed on a held-out CCMatrix section, as well as
on Tatoeba and Opus Books. The `_bp` columns list the *brevity penalty*. The `avg_bleu` score is the BLEU score
averaged over all three evaluation datasets.

The translation metrics are listed in the table below:
|
|
|
|
|
| | t5-base-36L-ccmatrix-en-nl | t5-base-36L-ccmatrix-multi | t5-base-36L-ccmatrix-multi | t5-small-24L-ccmatrix-multi | t5-small-24L-ccmatrix-multi |
|:---|:---|:---|:---|:---|:---|
| id | 0 | 14 | 15 | 16 | 20 |
| source_lang | en | en | nl | en | nl |
| target_lang | nl | nl | en | nl | en |
| source_prefix | translate English to Dutch: | translate English to Dutch: | translate Dutch to English: | translate English to Dutch: | translate Dutch to English: |
| tatoeba_bp | 0.9898 | 0.9736 | 0.9435 | 0.9761 | 0.9407 |
| ccmatrix_bp | 0.9591 | 0.9536 | 0.9636 | 0.9518 | 0.9586 |
| opus_books_bp | 0.7478 | 0.7950 | 0.9363 | 0.7705 | 0.8871 |
| tatoeba_score | 50.63 | 46.58 | 52.82 | 46.42 | 51.68 |
| ccmatrix_score | 60.33 | 56.81 | 62.84 | 57.40 | 63.09 |
| opus_books_score | 10.41 | 13.48 | 24.93 | 12.93 | 23.42 |
| avg_bleu | 40.46 | 38.96 | 46.86 | 38.92 | 46.06 |
| total steps | 78125 | 390625 | 390625 | 390625 | 390625 |
| duration | 14h | 101h | 101h | 74h | 74h |
| num_parameters | 728928000 | 728928000 | 728928000 | 249991680 | 249991680 |
| label_smoothing_factor | 0.09 | 0.15 | 0.15 | 0.1 | 0.1 |
| learning_rate | 0.0001 | 5e-05 | 5e-05 | 0.0005 | 0.0005 |
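
A minimal usage sketch (not from the original card): translating with this checkpoint via 🤗 Transformers, using the task prefixes listed in the `source_prefix` row above. The repo id is assumed from this card's title.

```python
# Hedged usage sketch; the repo id below is assumed from the card title.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "yhavinga/t5-small-24L-ccmatrix-multi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The *-multi checkpoints handle both directions; the prefix selects the direction.
text = "translate English to Dutch: It is not often a mortal man is permitted to view a world over the shoulder of angels."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```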
|
|
|
|
|
## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/). The Hugging Face 🤗 ecosystem was also
instrumental in all parts of the training. Logging metrics to Weights & Biases made it possible to keep track of many
models and orchestrate hyper-parameter sweeps with insightful visualizations. I cannot imagine how I would
have completed this project otherwise.
|
|
The following repositories were helpful in setting up the TPU-VM
and getting an idea of sensible hyper-parameters for pre-training T5 models from scratch.

* [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
* [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
|
|
|
|
|
|