pstjohn committed on
Commit 039e602 · unverified · 1 Parent(s): 3fef429

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +262 -79
  2. config.json +1 -1
README.md CHANGED
@@ -14,67 +14,104 @@ tags:
  > library. Slight numerical differences may be observed between the original model and the optimized
  > model. For instructions on how to install TransformerEngine, please refer to the
  > [official documentation](https://github.com/NVIDIA/TransformerEngine?tab=readme-ov-file#installation).
- >
- > The original xformers-based models are available at [chandar-lab/AMPLIFY](https://huggingface.co/chandar-lab/AMPLIFY_350M).
-
- ## AMPLIFY
-
- AMPLIFY is an efficient, state-of-the-art protein language model pre-trained using masked language modeling on UniRef100, OAS, and SCOP ([UR100P](https://huggingface.co/datasets/chandar-lab/UR100P)). AMPLIFY can generate residue and protein embeddings, suggest mutations, differentiate disordered proteins from non-protein sequences, and much more. AMPLIFY is available in two sizes, 120M and 350M parameters, with the `_base` models not extended beyond 512 residues (Stage 1). The model architecture and pre-training procedure are detailed below. For more details, please refer to the [accompanying paper](https://www.biorxiv.org/content/10.1101/2024.09.23.614603v1).
-
- - [`AMPLIFY_350M`](https://huggingface.co/nvidia/AMPLIFY_350M)
- - [`AMPLIFY_350M_base`](https://huggingface.co/chandar-lab/AMPLIFY_350M_base)
- - [`AMPLIFY_120M`](https://huggingface.co/nvidia/AMPLIFY_120M)
- - [`AMPLIFY_120M_base`](https://huggingface.co/chandar-lab/AMPLIFY_120M_base)
-
- ### Model Description
-
- | | AMPLIFY 120M | AMPLIFY 350M |
- | :----------------------------- | -----------: | -----------: |
- | `hidden-size` | 640 | 960 |
- | `num-hidden-layers` | 24 | 32 |
- | `num-attention-heads` | 10 | 15 |
- | `intermediate-size` | 2560 | 3840 |
- | `max-position-embeddings` | 2048 | 2048 |
- | `vocab-size` | 27 | 27 |
- | `rope-theta` | 10000 | 10000 |
- | `dropout-prob` | 0 | 0 |
- | `embedding-init-range` | 0.02 | 0.02 |
- | `norm-eps` | 1.0e-05 | 1.0e-05 |
- | `hidden-act` | swiglu | swiglu |
- | `pre-activation-layer-norm` | true | true |
- | `layer-norm-after-embedding` | false | false |
- | `layer-norm-before-last-layer` | true | true |
- | `rms-norm` | true | true |
- | `ffn-bias` | false | false |
- | `attn-bias` | false | false |
-
- ### Training Description
-
- | | Stage 1 | Stage 2 |
- | :------------------ | ----------: | ---------------------------: |
- | `dataset` | UR100P | UR100P |
- | `max-steps` | 1000000 | 25000 (120M) or 50000 (350M) |
- | `max-length` | 512 | 2048 |
- | `optimizer` | adamw | adamw |
- | `lr` | 0.001 | 0.0001 |
- | `betas` | (0.9, 0.95) | (0.9, 0.95) |
- | `eps` | 1.0e-08 | 1.0e-08 |
- | `weight-decay` | 0.01 | 0.01 |
- | `scheduler` | cosinedecay | none |
- | `warmup-steps` | 1,000 | none |
- | `final-step` | 900,000 | none |
- | `gradient-clipping` | 1.0 | 1.0 |
- | `tf32` | true | true |
- | `mixed-precision` | bf16 | bf16 |
- | `padding` | max-length | max-length |
- | `random-truncate` | true | true |
- | `mask-probability` | 0.15 | 0.15 |
- | `total-batch-size` | 4096 | 4096 |
- | `deepspeed` | true | true |
- | `zero-stage` | 3 | 3 |
-
- ## Get Started

  ```python
  from transformers import AutoModel
@@ -83,7 +120,9 @@ from datasets import load_dataset

  # Load AMPLIFY and tokenizer
  model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)
- tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

  # Move the model to GPU (required due to Flash Attention)
  model = model.to("cuda")
@@ -107,20 +146,164 @@ for sample in dataset:
    break
  ```

- ## Citations
-
- If you find the models useful in your research, we ask that you cite the paper:
-
- ```bibtex
- @article{Fournier2024.09.23.614603,
-   title = {Protein Language Models: Is Scaling Necessary?},
-   author = {Fournier, Quentin and Vernon, Robert M. and van der Sloot, Almer and Schulz, Benjamin and Chandar, Sarath and Langmead, Christopher James},
-   year = {2024},
-   journal = {bioRxiv},
-   publisher = {Cold Spring Harbor Laboratory},
-   doi = {10.1101/2024.09.23.614603},
-   url = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603},
-   elocation-id = {2024.09.23.614603},
-   eprint = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603.full.pdf}
- }
- ```

  > library. Slight numerical differences may be observed between the original model and the optimized
  > model. For instructions on how to install TransformerEngine, please refer to the
  > [official documentation](https://github.com/NVIDIA/TransformerEngine?tab=readme-ov-file#installation).
+
+ # AMPLIFY (TransformerEngine-Optimized) Overview
+
+ ## Description:
+
+ AMPLIFY is an efficient, state-of-the-art protein language model (pLM). AMPLIFY can generate residue and protein embeddings, suggest mutations, and differentiate disordered proteins from non-protein sequences. AMPLIFY is available in two sizes, 120M and 350M parameters.
+
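+ As an illustrative sketch of the mutation-suggestion use case (not taken from the original card), one can mask a residue and rank alternative amino acids by the model's masked-language-modeling logits. The sequence, the mutated position, and the `.logits` output attribute below are assumptions; adapt them to the remote code that ships with this checkpoint.
+
+ ```python
+ # Hedged sketch: rank candidate substitutions at one position by masked-LM logits.
+ # Assumes the remote-code model returns an output object with a `.logits` field.
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True).to("cuda").eval()
+ tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)
+
+ sequence = "MSVVGIDLGFQSCYVAVARAGGIETIANEYSDRCTPACISFGPKNR"  # illustrative sequence
+ position = 10  # illustrative 0-based residue index to mutate
+
+ tokens = tokenizer(sequence, return_tensors="pt").to("cuda")
+ # Offset by 1 for the leading special token; verify against this tokenizer's layout.
+ tokens["input_ids"][0, position + 1] = tokenizer.mask_token_id
+
+ with torch.no_grad():
+     logits = model(tokens["input_ids"]).logits  # assumed output attribute
+
+ probs = logits[0, position + 1].softmax(dim=-1)
+ top = probs.topk(5)
+ for p, idx in zip(top.values.tolist(), top.indices.tolist()):
+     print(f"{tokenizer.decode([idx]):>4s}  {p:.3f}")
+ ```
+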
+ This version of the AMPLIFY model is optimized with NVIDIA's [TransformerEngine](https://github.com/NVIDIA/TransformerEngine) library. It is based on the original AMPLIFY model from Chandar Research Lab (CRL) and has identical weights and outputs (within numerical precision).
+
+ This model is ready for commercial/non-commercial use.
+
+ ## Third-Party Community Consideration
+
+ This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the Non-NVIDIA [AMPLIFY Model Card](https://huggingface.co/chandar-lab/AMPLIFY_350M).
+
+ ### License/Terms of Use:
+
+ AMPLIFY is provided under the [MIT license](https://github.com/chandar-lab/AMPLIFY/blob/main/LICENSE).
+
+ ### Deployment Geography:
+
+ Global
+
+ ### Use Case:
+
+ Protein design, mutation prediction, and function analysis.
+
+ ### Release Date:
+
+ Hugging Face 06/12/2025 via [https://huggingface.co/nvidia/AMPLIFY_350M](https://huggingface.co/nvidia/AMPLIFY_350M)
+
+ ## References:
+
+ - [Protein Language Models: Is Scaling Necessary?](https://www.biorxiv.org/content/biorxiv/early/2024/09/23/2024.09.23.614603.full.pdf): detailed information on the model architecture and training data.
+
+ ## Model Architecture:
+
+ **Architecture Type:** Transformer <br>
+ **Network Architecture:** ESM-2
+
+ **This model was developed based on:** [AMPLIFY](https://huggingface.co/chandar-lab/AMPLIFY_350M) <br>
+ **Number of model parameters:** 3.5 x 10^8
+
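+ The stated parameter count can be sanity-checked directly from the checkpoint; the snippet below is a minimal sketch using the same `AutoModel` entry point as the Get Started example.
+
+ ```python
+ # Hedged sketch: confirm the ~3.5e8 parameter count by summing parameter tensors.
+ from transformers import AutoModel
+
+ model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)
+ n_params = sum(p.numel() for p in model.parameters())
+ print(f"{n_params / 1e6:.1f}M parameters")  # expected to be roughly 350M
+ ```
+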
+ ## Input:
+
+ **Input Type:** Text (Protein Sequences) <br>
+ **Input Format:** String <br>
+ **Input Parameters:** One-Dimensional (1D) <br>
+ **Other Properties Related to Input:** Protein sequence represented as a string of canonical amino acids. The maximum context length is 2048 residues.
+
+ ## Output:
+
+ **Output Type:** Embeddings (Amino acid and sequence-level) <br>
+ **Output Format:** Numeric vector <br>
+ **Output Parameters:** One-Dimensional (1D) <br>
+ **Other Properties Related to Output:** Numeric vector with floating-point values corresponding to an embedding for each amino acid in the input protein sequence.
+
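+ Because the card lists both residue-level and sequence-level embeddings, the sketch below shows one common way to derive a single sequence vector by mean-pooling per-residue hidden states. The `output_hidden_states` flag and `hidden_states` attribute follow the standard Transformers convention and are assumptions about this repository's remote code; mean pooling is a convention, not something the card prescribes.
+
+ ```python
+ # Hedged sketch: per-residue embeddings from the final hidden layer, mean-pooled
+ # into a single sequence-level vector.
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True).to("cuda").eval()
+ tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)
+
+ sequence = "MSVVGIDLGFQSCYVAVARAGGIETIANEYSDRCTPACISFGPKNR"  # illustrative
+ tokens = tokenizer(sequence, return_tensors="pt").to("cuda")
+
+ with torch.no_grad():
+     output = model(tokens["input_ids"], output_hidden_states=True)  # assumed kwarg
+
+ residue_embeddings = output.hidden_states[-1][0]      # (num_tokens, hidden_size)
+ sequence_embedding = residue_embeddings.mean(dim=0)   # (hidden_size,)
+ print(residue_embeddings.shape, sequence_embedding.shape)
+ ```
+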
+ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
+
+ ## Software Integration:
+
+ **Runtime Engines:**
+
+ - Hugging Face Transformers
+
+ **Supported Hardware Microarchitecture Compatibility:**
+
+ - NVIDIA Ampere
+ - NVIDIA Blackwell
+ - NVIDIA Hopper
+
+ **Preferred Operating System(s):**
+
+ - Linux
+
+ The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
+
+ ## Model and checkpoint versions are noted below:
+
+ - [AMPLIFY_350M](https://huggingface.co/nvidia/AMPLIFY_350M) <br>
+ - [AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M) <br>
+
+ **Get Started**

  ```python
  from transformers import AutoModel

  # Load AMPLIFY and tokenizer
  model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained(
+     "nvidia/AMPLIFY_350M", trust_remote_code=True
+ )

  # Move the model to GPU (required due to Flash Attention)
  model = model.to("cuda")

    break
  ```

+ ## Training and Evaluation Datasets:
+
+ ## Training Datasets:
+
+ **Link:** [UniRef100](https://www.uniprot.org/uniref?query=identity%3A1.0)
+
+ **Data Modality:**
+
+ - Text (Protein Sequences)
+
+ **Text Training Data Size:**
+
+ - 1 Billion to 10 Trillion Tokens
+
+ **Data Collection Method:**
+
+ - Human
+
+ **Labeling Method:**
+
+ - N/A
+
+ **Properties (Quantity, Dataset Descriptions, Sensor(s)):** UniRef100 contains all records in the UniProt Knowledgebase and selected UniParc records. In UniRef100, identical sequences and subfragments are placed into a single cluster using the CD-HIT algorithm. The longest members of the cluster (seed sequences) are used to generate UniRef90. However, the longest sequence is not always the most informative. There is often more biologically relevant information and annotation (name, function, cross-references) available on other cluster members. All the proteins in each cluster are ranked to facilitate the selection of a biologically relevant representative for the cluster.
+
+ **Link:** [Observed Antibody Space (OAS)](https://opig.stats.ox.ac.uk/webapps/oas/downloads_paired/)
+
+ **Data Modality:**
+
+ - Text (Protein Sequences)
+
+ **Text Training Data Size:**
+
+ - 1 Billion to 10 Trillion Tokens
+
+ **Data Collection Method:**
+
+ - Human
+
+ **Labeling Method:**
+
+ - Human
+
+ **Properties:** The Observed Antibody Space (OAS) database is a project to collect and annotate immune repertoires for use in large-scale analysis. It currently contains over one billion sequences, from over 80 different studies. These repertoires cover diverse immune states, organisms (primarily human and mouse), and individuals.
+
+ **Link:** [Structural Classification of Proteins (SCOP)](https://www.ebi.ac.uk/pdbe/scop/download)
+
+ **Data Modality:**
+
+ - Text (Protein Sequences)
+
+ **Text Training Data Size:**
+
+ - 1 Billion to 10 Trillion Tokens
+
+ **Data Collection Method:**
+
+ - Hybrid: Human, Automated
+
+ **Labeling Method:**
+
+ - Hybrid: Human, Automated
+
+ **Properties:** The main levels of classification in SCOP are:
+
+ - Class: Groups proteins based on their secondary structure content, such as all-alpha, all-beta, alpha/beta, and alpha+beta.
+ - Fold: Proteins within the same fold have the same major secondary structures arranged in the same way with the same topological connections.
+ - Superfamily: Groups protein domains with a probable common evolutionary ancestry based on shared structural and functional features, even if sequence similarity is low.
+ - Family: Groups closely related proteins with clear evidence of a common evolutionary origin, often detectable through sequence comparison methods.
+ - Species: Represents a distinct protein sequence.
+ - Protein: Groups similar sequences with the same function.
+
+ ## Evaluation Datasets:
+
+ **Link:** [Continuous Automated Model EvaluatiOn (CAMEO)](https://pmc.ncbi.nlm.nih.gov/articles/PMC8673552/)
+
+ **Benchmark Score:** LR P@L of 20.9±15.7
+
+ **Data Collection Method:**
+
+ - Human
+
+ **Labeling Method:**
+
+ - N/A
+
+ **Properties:** The data is collected by taking sequences of protein structures that are about to be released weekly by the Protein Data Bank (PDB). These sequences are sent as "blind targets" to participating protein structure prediction servers, which then return their predictions.
+
+ **Link:** [CASP14 (Critical Assessment of Methods of Protein Structure Prediction)](https://pubmed.ncbi.nlm.nih.gov/34533838/)
+
+ **Benchmark Score:** LR P@L of 16.6±13.6
+
+ **Data Collection Method:**
+
+ - Human
+
+ **Labeling Method:**
+
+ - N/A
+
+ **Properties:** The data for CASP14 targets is collected from protein structures that are newly solved by experimental structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full, three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to participating research groups and servers, who must submit their predicted structures within a specific time frame.
+
+ **Link:** [CASP15 (Critical Assessment of Methods of Protein Structure Prediction)](https://pubmed.ncbi.nlm.nih.gov/37920879/)
+
+ **Benchmark Score:** LR P@L of 20.0±14.6
+
+ **Data Collection Method:**
+
+ - Human
+
+ **Labeling Method:**
+
+ - N/A
+
+ **Properties:** The data for CASP15 targets is collected from protein structures that are newly solved by experimental structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full, three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to participating research groups and servers, who must submit their predicted structures within a specific time frame.
+
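+ The benchmark scores above are long-range precision at L (LR P@L). Assuming the standard contact-prediction definition (the fraction of the top-L scored residue pairs with sequence separation of at least 24 that are true contacts, for a protein of length L), the metric can be sketched as below; this is the conventional formula, not the exact evaluation script behind the quoted numbers.
+
+ ```python
+ # Hedged sketch of long-range precision at L (LR P@L) for contact prediction.
+ import numpy as np
+
+ def lr_precision_at_l(pred: np.ndarray, truth: np.ndarray, min_sep: int = 24) -> float:
+     """pred: (L, L) contact scores; truth: (L, L) binary contact map."""
+     L = pred.shape[0]
+     i, j = np.triu_indices(L, k=min_sep)      # long-range residue pairs only
+     order = np.argsort(pred[i, j])[::-1][:L]  # top-L scored pairs
+     return float(truth[i[order], j[order]].mean())
+
+ # Illustrative usage with random data
+ rng = np.random.default_rng(0)
+ scores = rng.random((100, 100))
+ contacts = (rng.random((100, 100)) < 0.05).astype(int)
+ print(lr_precision_at_l(scores, contacts))
+ ```
+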
+ ## Inference:
+
+ **Acceleration Engine:**
+
+ - Hugging Face Transformers
+
+ **Test Hardware:**
+
+ - A100
+ - H100
+ - H200
+ - GB200
+
+ ## Ethical Considerations:
+
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+
+ Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and comply with applicable safety regulations and ethical standards.
+
+ Please report model quality, risk, security vulnerabilities, or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
config.json CHANGED
@@ -32,7 +32,7 @@
  "padded_vocab_size": 32,
  "pre_activation_layer_norm": true,
  "rms_norm": true,
- "transformers_version": "4.56.1",
+ "transformers_version": "4.56.2",
  "unk_token_id": 1,
  "vocab_path": "conf/tokenizer/amplify_vocab.txt",
  "vocab_size": 27