---
language: en
tags:
- exbert
---

# GPT-2
This is a more up-to-date version of the original GPT-2: a model pretrained on English text with a causal language modeling (CLM) objective.
## Intended uses & limitations
You can use the raw model for text generation or fine-tune it to a downstream task; a fine-tuning sketch follows below. See the model hub to look for fine-tuned versions on a task that interests you.
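As a hedged illustration of fine-tuning for causal language modeling with the `Trainer` API (the dataset choice, hyperparameters, and sequence length below are placeholders, not recommendations from this card):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained('olm/olm-gpt2-oct-2022')
model = AutoModelForCausalLM.from_pretrained('olm/olm-gpt2-oct-2022')

# GPT-2 tokenizers have no pad token; reuse EOS so batches can be padded.
tokenizer.pad_token = tokenizer.eos_token

# Placeholder corpus: substitute your own text dataset.
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
dataset = dataset.filter(lambda x: len(x['text']) > 0)

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=['text'])

# mlm=False selects the causal LM objective (labels are the shifted inputs).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='olm-gpt2-finetuned', num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```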
## How to use
You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:
```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='olm/olm-gpt2-oct-2022')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
```
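The pipeline returns a list of dictionaries, one per generated sequence, each with a `generated_text` key. Sampling behavior can be adjusted through the usual `generate` keyword arguments, which the pipeline forwards to the model; the values below are illustrative, not settings from the original card:

```python
>>> outputs = generator(
...     "Hello, I'm a language model,",
...     max_length=30,
...     num_return_sequences=5,
...     do_sample=True,   # sample instead of greedy decoding
...     top_k=50,         # restrict sampling to the 50 most likely tokens
...     temperature=0.9,  # soften the next-token distribution slightly
... )
>>> print(outputs[0]['generated_text'])
```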
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model from the same checkpoint.
tokenizer = AutoTokenizer.from_pretrained('olm/olm-gpt2-oct-2022')
model = AutoModelForCausalLM.from_pretrained('olm/olm-gpt2-oct-2022')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
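The forward pass above returns the language-modeling logits; if you want the contextual token embeddings themselves, you can request the hidden states. A minimal sketch (using the last layer is an assumption here; pick whichever layer suits your task):

```python
import torch

# Request the per-layer activations in addition to the logits.
with torch.no_grad():
    output = model(**encoded_input, output_hidden_states=True)

# output.hidden_states is a tuple: the embedding output plus one tensor per
# layer, each of shape (batch_size, sequence_length, hidden_size).
features = output.hidden_states[-1]  # last layer's token representations
```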
## Dataset
The model and tokenizer were trained with this October 2022 cleaned Common Crawl dataset plus this October 2022 cleaned Wikipedia dataset. The tokenized version of these concatenated datasets is here. The datasets were created with this repo.
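If you want to inspect the training data, it can be loaded with the `datasets` library. The identifier below is a placeholder, not the actual id; substitute the dataset linked above:

```python
from datasets import load_dataset

# Hypothetical identifier: replace with the actual OLM dataset id linked above.
# Streaming avoids downloading the full dataset just to peek at examples.
dataset = load_dataset('olm/olm-october-2022-tokenized', split='train', streaming=True)
print(next(iter(dataset)))
```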
## Training
The model was trained according to the GPT2 instructions at this repo.
## Evaluation results
The model achieves the following results without any fine-tuning (zero-shot):
| Task | Metric | Original GPT2 | OLM GPT2 (Ours) | Significance (two-tailed p-value) |
|---|---|---|---|---|
| rte | acc | 0.5307 | 0.5415 | 0.7188 |
| piqa | acc/acc_norm | 0.6289/0.6251 | 0.6638/0.6670 | 0.0020/0.0002 |
| copa | acc | 0.6400 | 0.6900 | 0.3000 |
| record | f1/em | 0.7094/0.7026 | 0.6874/0.6810 | 0.0000/0.0000 |
| boolq | acc | 0.4872 | 0.5606 | 0.0000 |
| cb | acc/f1 | 0.4101/0.2619 | 0.3571/0.1754 | 0.4193/NA |
| hellaswag | acc/acc_norm | 0.2892/0.3114 | 0.3076/0.3491 | 0.0000/0.0000 |
| mrpc | acc/f1 | 0.5662/0.6911 | 0.6495/0.7741 | 0.0007/0.0002 |
| multirc | acc | 0.0189 | 0.0115 | 0.0959 |
| lambada | ppl/acc | 40.0554/0.3256 | 28.6733/0.3625 | 0.0000/0.0000 |
| wsc | acc | 0.4327 | 0.3654 | 0.1679 |
| wic | acc | 0.4922 | 0.5000 | 0.6924 |
| mnli | acc | 0.3372 | 0.3471 | 0.0384 |
| qnli | acc | 0.5017 | 0.4981 | 0.5884 |
| cola | mcc | 0.0126 | 0.0181 | 0.8614 |
| triviaqa | acc | 0.0151 | 0.0182 | 0.0048 |
| winogrande | acc | 0.5162 | 0.5114 | 0.7360 |
| webqs | acc | 0.0030 | 0.0108 | 0.0000 |
| arc_easy | acc/acc_norm | 0.4381/0.3948 | 0.4651/0.4247 | 0.0082/0.0029 |
| arc_challenge | acc/acc_norm | 0.1903/0.2270 | 0.1997/0.2329 | 0.4132/0.6256 |
To get these results, we used the Eleuther AI evaluation harness here. The harness can produce results a little different from those reported in the GPT2 paper. The p-values come from the standard errors (stderr) reported by the evaluation harness, combined with a normal distribution assumption.
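As a concrete reading of that last sentence, here is a minimal sketch of the implied significance test: a two-sample z-test treating each score as normally distributed with the stderr the harness reports. The stderr values in the example call are made up for illustration; only the two boolq accuracies come from the table above.

```python
import math

def two_tailed_p(score_a, se_a, score_b, se_b):
    """Two-tailed p-value for the difference of two scores, assuming each
    score is normally distributed with the stderr reported by the harness."""
    z = (score_a - score_b) / math.sqrt(se_a**2 + se_b**2)
    # P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))

# boolq accuracies from the table; the stderrs here are illustrative only.
print(two_tailed_p(0.5606, 0.0087, 0.4872, 0.0087))  # ~0.0000, as in the table
```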