+---
+language: en
+tags:
+- exbert
+---
+# OLM GPT-2 December 2022
+This is a more up-to-date version of the [original GPT-2](https://huggingface.co/gpt2).
+It also tends to outperform the original GPT-2 on standard benchmarks.
+It was trained on a cleaned December 2022 snapshot of Common Crawl and Wikipedia.
+This model was created as part of the OLM project, which aims to continuously train and release models that are up to date and comparable in standard language model performance to their static counterparts.
+This matters because we want our models to know about events such as COVID or a presidential election soon after they happen.
+## Intended uses
+You can use the raw model for text generation or fine-tune it to a downstream task.
+## How to use
+You can use this model directly with a pipeline for text generation. Because generation involves random sampling, we
+set a seed for reproducibility:
+```python
+>>> from transformers import pipeline, set_seed
+>>> # Pass bad_words_ids=[[0,2]] if you want this model to stay on topic.
+>>> # Otherwise, the model may generate start and end tokens followed by text that is
+>>> # unrelated to the preceding text.
+>>> generator = pipeline('text-generation', model='olm/olm-gpt2-dec-2022', bad_words_ids=[[0,2]])
+>>> set_seed(42)
+>>> # This example also shows that our model sometimes generates
+>>> # bloggy/spammy/web-y text, even though it gets higher evaluation results
+>>> # than the original GPT-2 across a variety of benchmarks. See the first output.
+>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
+TODO
+```
+Here is how to use this model to get the features of a given text in PyTorch:
+```python
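+from transformers import AutoTokenizer, AutoModelForCausalLM
+tokenizer = AutoTokenizer.from_pretrained('olm/olm-gpt2-dec-2022')
+model = AutoModelForCausalLM.from_pretrained('olm/olm-gpt2-dec-2022')
+text = "Replace me by any text you'd like."
+encoded_input = tokenizer(text, return_tensors='pt')
+# Request hidden states explicitly; a causal LM head returns only logits by default.
+output = model(**encoded_input, output_hidden_states=True)
+# Final-layer features, shape (batch, sequence_length, hidden_size).
+features = output.hidden_states[-1]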
+```
+## Dataset
+The model and tokenizer were trained with this [December 2022 cleaned Common Crawl dataset](TODO) plus this [December 2022 cleaned Wikipedia dataset](TODO).\
+The tokenized version of these concatenated datasets is [here](https://huggingface.co/datasets/olm/olm-december-2022-tokenized-1024).\
+The datasets were created with this [repo](https://github.com/huggingface/olm-datasets).
+## Training
+The model was trained according to the OLM GPT2 instructions at this [repo](https://github.com/huggingface/olm-training).
+## Evaluation results
+The model achieves the following results without any fine-tuning (zero-shot):
+| Task        | Metric     | Original GPT2       | OLM GPT2 Dec 2022 (Ours) | Significance of Difference (two-tailed p-value) |
+|:------------|:-----------|--------------------:|-------------------------:|----------------------------------:|
+|rte          |acc         |0.5307               |0.5199                    |                             |
+|piqa         |acc/acc_norm|0.6289/0.6251        |**0.6692**/**0.6665**     |            |
+|copa         |acc         |0.6400               |0.6800                    |                             |
+|record       |f1/em       |0.7094/0.7026        |0.6884/0.6818            |             |
+|boolq        |acc         |0.4872               |0.6021                |                        |
+|cb           |acc/f1      |0.4101/0.2619        |0.3393/0.1840            |/NA                          |
+|hellaswag    |acc/acc_norm|0.2892/0.3114        |0.3079/0.3482     |              |
+|mrpc         |acc/f1      |0.5662/0.6911        |0.6814/0.8099     |              |
+|multirc      |acc         |0.0189               |0.0220                    |                            |
+|lambada      |ppl/acc     |40.0554/0.3256       |28.3359/0.3699   |             |
+|wsc          |acc         |0.4327               |0.3654                   |                            |
+|wic          |acc         |0.4922               |0.5000                      |                            |
+|mnli         |acc         |0.3372               |0.3501                |                         |
+|qnli         |acc         |0.5017               |0.4946                   |                             |
+|cola         |mcc         |0.0126               |0.0000                    |                            |
+|triviaqa     |acc         |0.0151               |0.0181                |                        |
+|winogrande   |acc         |0.5162               |0.5051                   |                            |
+|webqs        |acc         |0.0030               |0.0079                |                        |
+|arc_easy     |acc/acc_norm|0.4381/0.3948        |0.4693/0.4230     |              |
+|arc_challenge|acc/acc_norm|0.1903/0.2270        |0.2090/0.2398            |                   |
+To get these results, we used the [EleutherAI evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness),
+which can produce results that differ from those reported in the GPT-2 paper. The p-values are computed from the standard errors reported by the evaluation harness, under a normal distribution assumption.
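
As a rough illustration of how a two-tailed p-value can be derived from two scores and their standard errors under a normal assumption, here is a minimal sketch. It is not necessarily the exact procedure used for the table: the function name `two_tailed_p_value` is hypothetical, the two scores are treated as independent estimates, and the standard errors in the usage line are placeholders rather than values from the harness.

```python
import math


def two_tailed_p_value(score_a, stderr_a, score_b, stderr_b):
    """Two-tailed p-value for the difference between two scores.

    Assumes both scores are approximately normal and independent.
    """
    # Standard error of the difference between two independent estimates.
    stderr_diff = math.sqrt(stderr_a ** 2 + stderr_b ** 2)
    if stderr_diff == 0:
        raise ValueError("at least one standard error must be non-zero")
    z = (score_b - score_a) / stderr_diff
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# piqa accuracies from the table above; the standard errors (0.011) are
# hypothetical placeholders, not values reported by the harness.
p = two_tailed_p_value(0.6289, 0.011, 0.6692, 0.011)
print(f"two-tailed p-value: {p:.4f}")
```

Because both models are scored on the same test examples, a paired test would be a stricter choice; the independent-sample form above is only the simplest reading of "stderr plus a normal distribution assumption".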