monsoon-nlp/byt5-dv


monsoon-nlp committed on Jul 7, 2021

Commit ce955e5 · 1 Parent(s): 682a3ba

add finetuning demo, caveats

Files changed (1)

README.md +17 -2

@@ -9,15 +9,30 @@ with ByT5, Google's new byte-level tokenizer strategy.
 Corpus: dv.wikipedia.org as of March 2020 (TFDS)
-Notebook: https://colab.research.google.com/drive/19Afq7CI6cOi1DaTpnQhBbEbnBzLSFHbH
+Notebook - Pretraining on Wikipedia: https://colab.research.google.com/drive/19Afq7CI6cOi1DaTpnQhBbEbnBzLSFHbH
 ## Demo
+Notebook - Finetuning on Maldivian news classification task: https://colab.research.google.com/drive/11u5SafR4bKICmArgDl6KQ9vqfYtDpyWp
+Current performance:
+- mBERT: 52%
+- byt5-dv (first run): 78%
+- dv-wave (ELECTRA): 89%
+- dv-muril: 90.7%
+- dv-labse: 91.3-91.5%
+Source of dataset: https://github.com/Sofwath/DhivehiDatasets
-## Todos
+## Work in progress - todos
 The Wikipedia corpus is too small for this language. In the future I would add
 OSCAR and Sofwath's Maldivian corpus, if I can rewrite the script to accept those
 as one TFDS dataset.
+This is based on ByT5-small ... we should try a larger model
+This needs more time for pretraining
+This needs better finetuning (reformatting batches to get all training data)
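
The README describes ByT5 as a byte-level tokenizer strategy. As a rough illustration of what that means for Dhivehi text, the sketch below encodes a short Thaana string into ByT5-style ids without loading any model. It assumes the convention used by the public ByT5 tokenizer (ids 0, 1 and 2 reserved for padding, end-of-sequence and unknown, so each UTF-8 byte maps to its value plus 3). The helper names `byt5_style_encode` and `byt5_style_decode` are hypothetical and are not taken from the notebooks linked in the diff.

```python
from typing import List

# Assumption: ids 0, 1, 2 are reserved (<pad>, </s>, <unk>), so byte b -> id b + 3.
BYTE_OFFSET = 3
EOS_ID = 1


def byt5_style_encode(text: str) -> List[int]:
    """Map each UTF-8 byte of `text` to an id and append the end-of-sequence id."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def byt5_style_decode(ids: List[int]) -> str:
    """Drop reserved ids and turn the remaining ids back into UTF-8 text."""
    data = bytes(i - BYTE_OFFSET for i in ids if BYTE_OFFSET <= i < BYTE_OFFSET + 256)
    return data.decode("utf-8", errors="ignore")


# "Dhivehi" written in Thaana script, given as escapes so the code points are explicit.
sample = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"
ids = byt5_style_encode(sample)

# Thaana code points (U+0780 to U+07BF) take two bytes each in UTF-8,
# so the id sequence is about twice as long as the character count.
print(len(sample), "characters ->", len(ids), "ids (including end-of-sequence)")
```

Because every Thaana character expands to two ids, input sequences for this model are roughly twice as long as the character count, which is worth keeping in mind when choosing a maximum sequence length for pretraining or finetuning.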