Commit 0b718df (1 parent: f3dfcf4)

create updated readme

README.md ADDED (@@ -0,0 +1,277 @@)
---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
license: apache-2.0
datasets:
- the_pile
---

The *Pythia Scaling Suite* is a collection of models developed to facilitate
interpretability research. It contains two sets of eight models of sizes
70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
models: one trained on the Pile, and one trained on the Pile after the dataset
has been globally deduplicated. All 8 model sizes are trained on the exact
same data, in the exact same order. We also provide 154 intermediate
checkpoints per model, hosted on Hugging Face as branches.

The Pythia model suite was deliberately designed to promote scientific
research on large language models, especially interpretability research.
Despite not centering downstream performance as a design goal, we find the
models <a href="#evaluations">match or exceed</a> the performance of
similar and same-sized models, such as those in the OPT and GPT-Neo suites.

<details>
  <summary style="font-weight:600">Details on previous early release and naming convention.</summary>

Previously, we released an early version of the Pythia suite to the public.
However, we decided to retrain the model suite to address a few hyperparameter
discrepancies. This model card <a href="#changelog">lists the changes</a>;
see appendix B in the Pythia paper for further discussion. We found no
difference in benchmark performance between the two Pythia versions.
The old models are
[still available](https://huggingface.co/models?other=pythia_v0), but we
suggest the retrained suite if you are just starting to use Pythia.<br>
**This is the current release.**

Please note that all models in the *Pythia* suite were renamed in January
2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
comparing the old and new names</a> is provided in this model card, together
with exact parameter counts.
</details>
<br>

# Pythia-70M

## Model Details

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Language: English
- Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia) for training procedure, config files, and details on how to use.
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- License: Apache 2.0
- Contact: to ask questions about this model, join the [EleutherAI
  Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
  Please read the existing *Pythia* documentation before asking about it in the
  EleutherAI Discord. For general correspondence: [contact@eleuther.ai](mailto:contact@eleuther.ai).

<figure>

| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate         | Equivalent Models      |
| -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
| 70M          | 18,915,328           | 6      | 512       | 8     | 2M         | 1.0 x 10<sup>-3</sup> | —                      |
| 160M         | 85,056,000           | 12     | 768       | 12    | 4M         | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
| 410M         | 302,311,424          | 24     | 1024      | 16    | 4M         | 3.0 x 10<sup>-4</sup> | OPT-350M               |
| 1.0B         | 805,736,448          | 16     | 2048      | 8     | 2M         | 3.0 x 10<sup>-4</sup> | —                      |
| 1.4B         | 1,208,602,624        | 24     | 2048      | 16    | 4M         | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
| 2.8B         | 2,517,652,480        | 32     | 2560      | 32    | 2M         | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
| 6.9B         | 6,444,163,072        | 32     | 4096      | 32    | 2M         | 1.2 x 10<sup>-4</sup> | OPT-6.7B               |
| 12B          | 11,327,027,200       | 36     | 5120      | 40    | 2M         | 1.2 x 10<sup>-4</sup> | —                      |
<figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and
non-deduped models of a given size have the same hyperparameters. “Equivalent”
models have <b>exactly</b> the same architecture, and the same number of
non-embedding parameters.</figcaption>
</figure>

## Uses and Limitations

### Intended Use

The primary intended use of Pythia is research on the behavior, functionality,
and limitations of large language models. This suite is intended to provide
a controlled setting for performing scientific experiments. We also provide
154 checkpoints per model: initial `step0`, 10 log-spaced checkpoints
`step{1,2,4...512}`, and 143 evenly-spaced checkpoints from `step1000` to
`step143000`. These checkpoints are hosted on Hugging Face as branches. Note
that branch `143000` corresponds exactly to the model checkpoint on the `main`
branch of each model.
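
As a quick, unofficial way to inspect those branches, the `huggingface_hub`
client can list a repository's refs. The snippet below is a minimal sketch
(it assumes the `huggingface_hub` package is installed and is not part of the
official Pythia tooling):

```python
# Sketch: enumerate the checkpoint branches of one Pythia repository.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("EleutherAI/pythia-70m-deduped")

# Every checkpoint (step0, step1, ..., step143000) is a branch of the repo.
step_branches = sorted(
    (b.name for b in refs.branches if b.name.startswith("step")),
    key=lambda name: int(name.removeprefix("step")),
)
print(len(step_branches), "checkpoint branches")
print(step_branches[:5], "...", step_branches[-1])
```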

You may also further fine-tune and adapt Pythia-70M for deployment,
as long as your use is in accordance with the Apache 2.0 license. Pythia
models work with the Hugging Face [Transformers
Library](https://huggingface.co/docs/transformers/index). If you decide to use
pre-trained Pythia-70M as a basis for your fine-tuned model, please
conduct your own risk and bias assessment.
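
As a starting point for such fine-tuning, a standard Transformers `Trainer`
loop works with Pythia checkpoints. The sketch below is illustrative only and
not an official recipe: the placeholder corpus (`wikitext`), hyperparameters,
and output path are assumptions you should replace with your own.

```python
# Illustrative fine-tuning sketch (not an official recipe); the dataset,
# hyperparameters, and paths are placeholder choices.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    GPTNeoXForCausalLM,
    Trainer,
    TrainingArguments,
)

model_name = "EleutherAI/pythia-70m-deduped"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # the GPT-NeoX tokenizer has no pad token by default
model = GPTNeoXForCausalLM.from_pretrained(model_name)

# Placeholder corpus; swap in your own data and do your own risk/bias assessment.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) > 0)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./pythia-70m-finetuned",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=1e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```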

### Out-of-scope use

The Pythia Suite is **not** intended for deployment. It is not in itself
a product and cannot be used for human-facing interactions. For example,
the model may generate harmful or offensive text. Please evaluate the risks
associated with your particular use case.

Pythia models are English-language only, and are not suitable for translation
or generating text in other languages.

Pythia-70M has not been fine-tuned for downstream contexts in which
language models are commonly deployed, such as writing genre prose
or commercial chatbots. This means Pythia-70M will **not**
respond to a given prompt the way a product like ChatGPT does. This is because,
unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
Learning from Human Feedback (RLHF) to better “follow” human instructions.

### Limitations and biases

The core functionality of a large language model is to take a string of text
and predict the next token. The token the model deems statistically most
likely need not produce the most “accurate” text. Never rely on Pythia-70M to
produce factually accurate output.

This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
known to contain profanity and texts that are lewd or otherwise offensive.
See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
discussion of documented biases with regards to gender, religion, and race.
Pythia-70M may produce socially unacceptable or undesirable text, *even if*
the prompt itself does not include anything explicitly offensive.

If you plan on using text generated through, for example, the Hosted Inference
API, we recommend having a human curate the outputs of this language model
before presenting them to other people. Please inform your audience that the
text was generated by Pythia-70M.

### Quickstart

Pythia models can be loaded and used via the following code, demonstrated here
for the third `pythia-70m-deduped` checkpoint:

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# `revision` selects the checkpoint branch; `cache_dir` sets where the files are stored locally.
model = GPTNeoXForCausalLM.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

# Tokenize a prompt, generate a continuation, and decode it back to text.
inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
tokenizer.decode(tokens[0])
```

Revision/branch `step143000` corresponds exactly to the model checkpoint on
the `main` branch of each model.<br>
For more information on how to use all Pythia models, see [documentation on
GitHub](https://github.com/EleutherAI/pythia).

## Training

### Training data

[The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
English. It was created by EleutherAI specifically for training large language
models. It contains texts from 22 diverse sources, roughly broken down into
five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
methodology, and a discussion of ethical implications. Consult [the
datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
about the Pile and its component datasets. The Pile can be downloaded from
the [official website](https://pile.eleuther.ai/), or from a [community
mirror](https://the-eye.eu/public/AI/pile/).<br>
The Pile was **not** deduplicated before being used to train Pythia-70M.

### Training procedure

All models were trained on the exact same data, in the exact same order. Each
model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
model are saved every 2,097,152,000 tokens, spaced evenly throughout training,
from `step1000` to `step143000` (which is the same as `main`). In addition, we
provide frequent early checkpoints: `step0` and `step{1,2,4...512}`.
This corresponds to training for just under 1 epoch on the Pile for
non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.
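
As a sanity check, the token counts above follow directly from the batch size
and step count quoted in this card:

```python
# Check of the token counts quoted above (uses only numbers from this card).
batch_size_tokens = 2_097_152       # 2M tokens per optimizer step
total_steps = 143_000

print(batch_size_tokens * total_steps)   # 299892736000 tokens seen over the full run
print(batch_size_tokens * 1_000)         # 2097152000 tokens between evenly spaced checkpoints
```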

All *Pythia* models were trained for 143,000 steps at a batch size
of 2M (2,097,152 tokens).<br>
See [GitHub](https://github.com/EleutherAI/pythia) for more details on the
training procedure, including [how to reproduce
it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
Pythia uses the same tokenizer as [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
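
Because the tokenizer is shared, the two repositories' tokenizers are
interchangeable; a quick, unofficial way to confirm this for yourself:

```python
# Unofficial check that Pythia and GPT-NeoX-20B ship the same tokenizer.
from transformers import AutoTokenizer

pythia_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

sample = "The Pythia Scaling Suite is a collection of models."
assert pythia_tok(sample)["input_ids"] == neox_tok(sample)["input_ids"]
print(pythia_tok.vocab_size)  # both tokenizers report the same vocabulary size
```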

## Evaluations

All 16 *Pythia* models were evaluated using the [LM Evaluation
Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
the results by model and step at `results/json/*` in the [GitHub
repository](https://github.com/EleutherAI/pythia/tree/main/results/json/).<br>
Expand the sections below to see plots of evaluation results for all
Pythia and Pythia-deduped models compared with OPT and BLOOM.
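
To reproduce a single data point from these plots, the harness can also be
driven from Python. The call below is a hedged sketch against the
`simple_evaluate` API found in recent harness versions (v0.4+); argument names
have changed between releases, so adapt it to the version you install.

```python
# Hedged sketch: run one LM Evaluation Harness task on Pythia-70M.
# simple_evaluate is the entry point in recent harness versions (v0.4+);
# older releases used a main.py CLI instead.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-70m,revision=step143000",
    tasks=["lambada_openai"],
    batch_size=8,
)
print(results["results"]["lambada_openai"])
```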

<details>
  <summary>LAMBADA – OpenAI</summary>
  <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai_v1.png" style="width:auto"/>
</details>

<details>
  <summary>Physical Interaction: Question Answering (PIQA)</summary>
  <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa_v1.png" style="width:auto"/>
</details>

<details>
  <summary>WinoGrande</summary>
  <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande_v1.png" style="width:auto"/>
</details>

<details>
  <summary>AI2 Reasoning Challenge—Easy Set</summary>
  <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_easy_v1.png" style="width:auto"/>
</details>

<details>
  <summary>SciQ</summary>
  <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq_v1.png" style="width:auto"/>
</details>

## Changelog

This section compares the previously released
[Pythia v0](https://huggingface.co/models?other=pythia_v0) models with the
current models. See Appendix B of the Pythia paper for further discussion of
these changes and the motivation behind them. We found that retraining Pythia
had no impact on benchmark performance.

- All model sizes are now trained with a uniform batch size of 2M tokens.
Previously, the models of size 160M, 410M, and 1.4B parameters were trained
with batch sizes of 4M tokens.
- We added checkpoints at initialization (step 0) and steps {1,2,4,8,16,32,64,
128,256,512} in addition to every 1000 training steps.
- Flash Attention was used in the new retrained suite.
- We remedied a minor inconsistency that existed in the original suite: all
models of size 2.8B parameters or smaller had a learning rate (LR) schedule
which decayed to a minimum LR of 10% of the starting LR, but the 6.9B and
12B models used an LR schedule which decayed to a minimum LR of 0. In
the redone training runs, we rectified this inconsistency: all models are now
trained with the LR decaying to a minimum of 0.1× their maximum LR.

### Naming convention and parameter count

*Pythia* models were renamed in January 2023. It is possible that the old
naming convention still persists in some documentation by accident. The
current naming convention (70M, 160M, etc.) is based on total parameter count.

<figure style="width:32em">

| current Pythia suffix | old suffix | total params   | non-embedding params |
| --------------------: | ---------: | -------------: | -------------------: |
| 70M                   | 19M        | 70,426,624     | 18,915,328           |
| 160M                  | 125M       | 162,322,944    | 85,056,000           |
| 410M                  | 350M       | 405,334,016    | 302,311,424          |
| 1B                    | 800M       | 1,011,781,632  | 805,736,448          |
| 1.4B                  | 1.3B       | 1,414,647,808  | 1,208,602,624        |
| 2.8B                  | 2.7B       | 2,775,208,960  | 2,517,652,480        |
| 6.9B                  | 6.7B       | 6,857,302,016  | 6,444,163,072        |
| 12B                   | 13B        | 11,846,072,320 | 11,327,027,200       |
</figure>
