---
language:
- he
tags:
- language model
license: apache-2.0
datasets:
- oscar
- wikipedia
---
# AlephBERT

## Hebrew Language Model

A state-of-the-art language model for Hebrew, based on BERT.
#### How to use

```python
from transformers import BertModel, BertTokenizerFast

alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

# if not fine-tuning, disable dropout by switching to evaluation mode
alephbert.eval()
```
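
Continuing from the snippet above, a minimal usage sketch: the Hebrew sentence and the choice of the `[CLS]` vector as a sentence-level representation are illustrative, not prescribed by the model card.

```python
import torch

# Encode a Hebrew sentence and run it through the model.
sentence = 'שלום עולם'
inputs = alephbert_tokenizer(sentence, return_tensors='pt')

with torch.no_grad():
    outputs = alephbert(**inputs)

# Contextual token embeddings: (batch_size, sequence_length, hidden_size)
token_embeddings = outputs.last_hidden_state

# The [CLS] vector is one common (illustrative) choice for a sentence representation.
cls_embedding = token_embeddings[:, 0, :]
print(cls_embedding.shape)
```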
## Training data

- OSCAR (10 GB of text, 20M sentences)
- Wikipedia dump (0.6 GB of text, 3M sentences)
- Tweets (7 GB of text, 70M sentences)
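
For reference only, the Hebrew portion of OSCAR can be loaded with the `datasets` library; the config name below is an assumption taken from the public OSCAR dataset card, and the exact dumps and preprocessing used for AlephBERT are not specified here.

```python
from datasets import load_dataset

# Hebrew portion of OSCAR (deduplicated config; an assumption, check the dataset card).
oscar_he = load_dataset('oscar', 'unshuffled_deduplicated_he', split='train')

print(oscar_he[0]['text'][:200])
# The Wikipedia and Twitter corpora require their own dumps/crawls and cleaning.
```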
## Training procedure

Trained on a DGX machine (8 V100 GPUs) using the standard Hugging Face training procedure.

To optimize training time, we split the data into 4 sections based on the maximum number of tokens per sentence:

1. num tokens < 32 (70M sentences)
2. 32 <= num tokens < 64 (12M sentences)
3. 64 <= num tokens < 128 (10M sentences)
4. 128 <= num tokens < 512 (70M sentences)

Each section was trained for 5 epochs with an initial learning rate of 1e-4. Total training time was 5 days.
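
A minimal sketch of the length-bucketing step described above: the bucket boundaries follow the list, while the variable names, the toy corpus, and the use of the released tokenizer to count tokens are assumptions for illustration.

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')

# Upper bounds of the four buckets listed above: <32, <64, <128, <512 tokens.
boundaries = [32, 64, 128, 512]

def bucket_index(sentence):
    # Token count includes [CLS]/[SEP] special tokens, so this is an approximation
    # of the sentence lengths quoted in the list above.
    num_tokens = len(tokenizer(sentence)['input_ids'])
    for i, upper in enumerate(boundaries):
        if num_tokens < upper:
            return i
    return len(boundaries) - 1  # anything longer falls into the last bucket

# Toy corpus for illustration; in practice these are the OSCAR/Wikipedia/Twitter sentences.
corpus = ['שלום עולם', 'זהו משפט ארוך יותר המשמש כאן רק לדוגמה']
buckets = [[] for _ in boundaries]
for sentence in corpus:
    buckets[bucket_index(sentence)].append(sentence)

# Each bucket can then be trained separately (5 epochs, lr 1e-4), padding only up to
# its own maximum length, which is what saves training time.
for i, bucket in enumerate(buckets):
    print(f'bucket {i} (max {boundaries[i]} tokens): {len(bucket)} sentences')
```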
## Eval
