| language: dv | |
| # byt5-dv | |
| Pretrained from scratch on Dhivehi (the language of the Maldives) | |
| with ByT5, Google's new byte-level tokenization strategy, which reads raw UTF-8 bytes instead of a learned subword vocabulary. | |
| Corpus: dv.wikipedia.org as of March 2020 (TFDS) | |
| Notebook: https://colab.research.google.com/drive/19Afq7CI6cOi1DaTpnQhBbEbnBzLSFHbH | |
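To illustrate what byte-level tokenization means for Thaana script, here is a minimal sketch in plain Python. It is not the code from the notebook. It assumes the ByT5 convention that ids 0, 1, and 2 are reserved for `<pad>`, `</s>`, and `<unk>`, so each UTF-8 byte `b` maps to id `b + 3`. The helper names `byt5_style_encode` and `byt5_style_decode` are hypothetical.

```python
# Sketch of ByT5-style byte-level encoding (assumption: ids 0-2 are reserved
# for <pad>, </s>, <unk>, so a UTF-8 byte b becomes id b + 3).
SPECIAL_TOKEN_OFFSET = 3
EOS_ID = 1


def byt5_style_encode(text: str) -> list[int]:
    """Map text to byte ids and append the end-of-sequence id."""
    ids = [b + SPECIAL_TOKEN_OFFSET for b in text.encode("utf-8")]
    return ids + [EOS_ID]


def byt5_style_decode(ids: list[int]) -> str:
    """Drop special ids, shift the rest back to bytes, and decode as UTF-8."""
    raw = bytes(
        i - SPECIAL_TOKEN_OFFSET
        for i in ids
        if SPECIAL_TOKEN_OFFSET <= i < SPECIAL_TOKEN_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")


# "Dhivehi" written in Thaana, spelled with escapes so the code points are explicit.
sample = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"

ids = byt5_style_encode(sample)
print(len(sample), "code points ->", len(ids), "ids (including </s>)")
print(byt5_style_decode(ids) == sample)
```

Each Thaana code point takes two bytes in UTF-8, so byte-level sequences are roughly twice as long as the character count. No vocabulary has to be trained for the language, which is convenient when the corpus is small.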
| ## Demo | |
| ## Todos | |
| The Wikipedia corpus alone is too small to pretrain well on this language. In the future I would like to add | |
| OSCAR and Sofwath's Maldivian corpus, if I can rewrite the script to accept those | |
| as one TFDS dataset. | |
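As a rough idea of what "one dataset" could look like, the sketch below merges several text sources into a single stream of `{"text": ...}` records. It is plain Python, not TFDS, and it is not the notebook's script. The function name `merge_corpora`, the record shape, and the exact-match deduplication step are all assumptions made for illustration; the example lines are placeholders, not real corpus data.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator


def merge_corpora(*corpora: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {"text": ...} record per unique, non-empty line across all sources."""
    seen = set()
    for line in chain(*corpora):
        text = line.strip()
        # Skip blank lines and exact duplicates that appear in more than one source.
        if text and text not in seen:
            seen.add(text)
            yield {"text": text}


# Placeholder inputs standing in for Wikipedia, OSCAR, and another corpus.
wikipedia_lines = ["first sentence", "shared sentence "]
oscar_lines = ["shared sentence", "", "second sentence"]

for record in merge_corpora(wikipedia_lines, oscar_lines):
    print(record)
```

A generator like this could be wrapped by whatever dataset loader the training script expects, so the script only has to handle one source.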