---
library_name: transformers
license: mit
datasets:
  - chandar-lab/UR100P
language:
  - en
tags:
  - biology
---

> [!NOTE]
> This model has been optimized using NVIDIA's [TransformerEngine](https://github.com/NVIDIA/TransformerEngine)
> library. Slight numerical differences may be observed between the original model and the optimized
> model. For instructions on how to install TransformerEngine, please refer to the
> [official documentation](https://github.com/NVIDIA/TransformerEngine?tab=readme-ov-file#installation).

# AMPLIFY (TransformerEngine-Optimized) Overview

## Description:

AMPLIFY is an efficient, state-of-the-art protein language model (pLM). AMPLIFY can generate residue and protein
embeddings, suggest mutations, and differentiate disordered proteins from non-protein sequences. AMPLIFY is available in
two sizes, 120M and 350M parameters.

This version of the AMPLIFY model is optimized with NVIDIA's
[TransformerEngine](https://github.com/NVIDIA/TransformerEngine) library. It is based on the original AMPLIFY model from
Chandar Research Lab (CRL) and has identical weights and outputs to within numerical precision.
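
As a quick sanity check of that claim, the two checkpoints can be compared directly. The sketch below is illustrative
only: it assumes both checkpoints load via `trust_remote_code=True`, that a CUDA device is available (the model relies
on Flash Attention), and that the forward pass returns an object exposing a `logits` tensor; the example sequence is
arbitrary.

```python
# Illustrative comparison of the optimized and original checkpoints.
# Assumption: the forward pass returns an object with a `logits` tensor;
# adjust to the actual output structure if it differs.
import torch
from transformers import AutoModel, AutoTokenizer

optimized = AutoModel.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True).to("cuda")
original = AutoModel.from_pretrained("chandar-lab/AMPLIFY_120M", trust_remote_code=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True)

sequence = "MSVVGIDLGFQSCYVAVARAGGIETIANEYSDRCTPACISFGPKNR"  # arbitrary example sequence
input_ids = tokenizer.encode(sequence, return_tensors="pt").to("cuda")

with torch.inference_mode():
    out_optimized = optimized(input_ids).logits
    out_original = original(input_ids).logits

# Expect a small, non-zero difference (numerical noise from the optimized kernels).
print("max abs difference:", (out_optimized - out_original).abs().max().item())
```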

This model is ready for commercial/non-commercial use.

## Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements
for this application and use case; see link to Non-NVIDIA [AMPLIFY Model
Card](https://huggingface.co/chandar-lab/AMPLIFY_120M).

### License/Terms of Use:

AMPLIFY is provided under the [MIT license](https://github.com/chandar-lab/AMPLIFY/blob/main/LICENSE).

### Deployment Geography:

Global

### Use Case:

Protein design, mutation prediction, and function analysis.

### Release Date:

Hugging Face 06/12/2025 via [https://huggingface.co/nvidia/AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M)

## References:

- [Protein Language Models: Is Scaling
  Necessary?](https://www.biorxiv.org/content/biorxiv/early/2024/09/23/2024.09.23.614603.full.pdf) - detailed
  information on the model architecture and training data.

## Model Architecture:

**Architecture Type:** Transformer
**Network Architecture:** ESM-2

**This model was developed based on:** [AMPLIFY](https://huggingface.co/chandar-lab/AMPLIFY_120M) <br>
**Number of model parameters:** 1.2 x 10^8

## Input:

**Input Type:** Text (Protein Sequences) <br>
**Input Format:** String <br>
**Input Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Input:** Protein sequence represented as a string of canonical amino acids. The maximum
context length is 2048 residues.
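
Sequences longer than 2048 residues should be truncated (or windowed) before inference. A minimal sketch, assuming the
AMPLIFY tokenizer supports the standard Hugging Face `truncation`/`max_length` arguments:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True)

# Hypothetical 4001-residue sequence, well beyond the 2048-residue context window.
long_sequence = "M" + "ACDEFGHIKLMNPQRSTVWY" * 200

# Cap the tokenized length at the model's maximum context; special tokens count toward the limit.
input_ids = tokenizer.encode(
    long_sequence,
    return_tensors="pt",
    truncation=True,
    max_length=2048,
)
print(input_ids.shape)  # expected: (1, 2048)
```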

## Output:

**Output Type:** Embeddings (Amino acid and sequence-level) <br>
**Output Format:** Numeric vector <br>
**Output Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Output:** Numeric vector with floating-point values corresponding to an embedding for each
amino acid in the input protein sequence.
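
A common pattern is to take per-residue embeddings from the final hidden states and mean-pool them into a single
sequence-level embedding. The sketch below assumes the remote-code model accepts `output_hidden_states=True` and returns
a `hidden_states` tuple following the usual Hugging Face convention; verify against the actual model outputs before
relying on it. Special tokens (e.g., BOS/EOS), if present, are included in the pooled average here and may be worth
excluding.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
input_ids = tokenizer.encode(sequence, return_tensors="pt").to("cuda")

with torch.inference_mode():
    # Assumption: the model accepts `output_hidden_states` and returns `hidden_states`.
    output = model(input_ids, output_hidden_states=True)

residue_embeddings = output.hidden_states[-1][0]     # (num_tokens, hidden_dim)
sequence_embedding = residue_embeddings.mean(dim=0)  # (hidden_dim,)
print(residue_embeddings.shape, sequence_embedding.shape)
```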

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware
(e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times
compared to CPU-only solutions.

## Software Integration:

**Runtime Engines:**

- Hugging Face Transformers

**Supported Hardware Microarchitecture Compatibility:**

- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper

**Preferred Operating System(s):**

- Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific
data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at
both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure
compliance with safety and ethical standards before deployment.

## Model Version(s):

- [AMPLIFY_350M](https://huggingface.co/nvidia/AMPLIFY_350M) <br>
- [AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M) <br>

**Get Started**

```python
from transformers import AutoModel
from transformers import AutoTokenizer
from datasets import load_dataset

# Load AMPLIFY and tokenizer
model = AutoModel.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    "nvidia/AMPLIFY_120M", trust_remote_code=True
)

# Move the model to GPU (required due to Flash Attention)
model = model.to("cuda")

# Load the UniProt validation set
dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")

for sample in dataset:
    # Protein
    print("Sample: ", sample["name"], sample["sequence"])

    # Tokenize the protein (renamed to avoid shadowing the built-in `input`)
    input_ids = tokenizer.encode(sample["sequence"], return_tensors="pt")
    print("Input: ", input_ids)

    # Move to the GPU and make a prediction
    input_ids = input_ids.to("cuda")
    output = model(input_ids)
    print("Output: ", output)

    break
```
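
Beyond embeddings, the masked-language-modeling head can be used to suggest mutations by masking a position and ranking
the predicted residues. The sketch below is illustrative and makes several assumptions that should be checked against
the actual model interface: that the tokenizer defines a mask token, that the forward pass exposes per-token `logits`,
that each residue maps to a single token, and that a BOS token occupies position 0.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
position = 10  # 0-based residue index to mutate (assumed to map to one token)

input_ids = tokenizer.encode(sequence, return_tensors="pt").to("cuda")
# Assumption: a BOS token occupies index 0, so residue i sits at token index i + 1.
input_ids[0, position + 1] = tokenizer.mask_token_id

with torch.inference_mode():
    # Assumption: the forward pass returns per-token logits of shape (1, num_tokens, vocab_size).
    logits = model(input_ids).logits

# Rank candidate residues at the masked position by predicted probability.
probs = torch.softmax(logits[0, position + 1], dim=-1)
top = torch.topk(probs, k=5)
for token_id, prob in zip(top.indices.tolist(), top.values.tolist()):
    print(tokenizer.decode([token_id]), round(prob, 3))
```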

## Training and Evaluation Datasets:

## Training Datasets:

**Link:** [UniRef100](https://www.uniprot.org/uniref?query=identity%3A1.0)

**Data Modality:**

- Text (Protein Sequences)

**Text Training Data Size:**

- 1 Billion to 10 Trillion Tokens

**Data Collection Method:**

- Human

**Labeling Method:**

- N/A

**Properties (Quantity, Dataset Descriptions, Sensor(s)):** UniRef100 contains all records in the UniProt Knowledgebase
and selected UniParc records. In UniRef100, identical sequences and subfragments are placed into a single cluster using
the CD-HIT algorithm. The longest members of the cluster (seed sequences) are used to generate UniRef90. However, the
longest sequence is not always the most informative. There is often more biologically relevant information and
annotation (name, function, cross-references) available on other cluster members. All the proteins in each cluster are
ranked to facilitate the selection of a biologically relevant representative for the cluster.

**Link:** [Observed Antibody Space (OAS)](https://opig.stats.ox.ac.uk/webapps/oas/downloads_paired/)

**Data Modality:**

- Text (Protein Sequences)

**Text Training Data Size:**

- 1 Billion to 10 Trillion Tokens

**Data Collection Method:**

- Human

**Labeling Method:**

- Human

**Properties:** The Observed Antibody Space (OAS) database is a project to collect and annotate immune repertoires for
use in large-scale analysis. It currently contains over one billion sequences, from over 80 different studies. These
repertoires cover diverse immune states, organisms (primarily human and mouse), and individuals.

**Link:** [Structural Classification of Proteins (SCOP)](https://www.ebi.ac.uk/pdbe/scop/download)

**Data Modality:**

- Text (Protein Sequences)

**Text Training Data Size:**

- 1 Billion to 10 Trillion Tokens

**Data Collection Method:**

- Hybrid: Human, Automated

**Labeling Method:**

- Hybrid: Human, Automated

**Properties:** The main levels of classification in SCOP are:

- Class: Groups proteins based on their secondary structure content, such as all-alpha, all-beta, alpha/beta, and
  alpha+beta.
- Fold: Proteins within the same fold have the same major secondary structures arranged in the same way with the same
  topological connections.
- Superfamily: Groups protein domains with a probable common evolutionary ancestry based on shared structural and
  functional features, even if sequence similarity is low.
- Family: Groups closely related proteins with clear evidence of a common evolutionary origin, often detectable through
  sequence comparison methods.
- Species: Represents a distinct protein sequence.
- Protein: Groups similar sequences with the same function.

## Evaluation Datasets:

**Link:** [Continuous Automated Model EvaluatiOn (CAMEO)](https://pmc.ncbi.nlm.nih.gov/articles/PMC8673552/)

**Benchmark Score:** LR P@L (long-range precision at L) of 17.8 ± 14.1

**Data Collection Method:**

- Human

**Labeling Method:**

- N/A

**Properties:** The data is collected by taking sequences of protein structures that are about to be released weekly by
the Protein Data Bank (PDB). These sequences are sent as "blind targets" to participating protein structure prediction
servers, which then return their predictions.
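
For reference, long-range precision at L (LR P@L) is computed by taking the L highest-scoring predicted contacts among
residue pairs separated by at least 24 positions in sequence (the usual long-range threshold), where L is the protein
length, and reporting the fraction that are true contacts. A minimal sketch of the metric with hypothetical score and
contact matrices:

```python
import numpy as np

def long_range_precision_at_L(pred_scores: np.ndarray, true_contacts: np.ndarray, min_sep: int = 24) -> float:
    """Precision of the top-L long-range predicted contacts (L = sequence length).

    pred_scores:   (L, L) symmetric matrix of predicted contact scores.
    true_contacts: (L, L) binary matrix of ground-truth contacts.
    """
    L = pred_scores.shape[0]
    i, j = np.triu_indices(L, k=min_sep)              # residue pairs with separation >= min_sep
    order = np.argsort(pred_scores[i, j])[::-1][:L]   # indices of the top-L scoring pairs
    return float(true_contacts[i[order], j[order]].mean())

# Hypothetical example with random scores and contacts.
rng = np.random.default_rng(0)
L = 100
scores = rng.random((L, L)); scores = (scores + scores.T) / 2
contacts = (rng.random((L, L)) < 0.02).astype(int); contacts = np.maximum(contacts, contacts.T)
print(long_range_precision_at_L(scores, contacts))
```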

**Link:** [CASP14 (Critical Assessment of Methods of Protein Structure
Prediction)](https://pubmed.ncbi.nlm.nih.gov/34533838/)

**Benchmark Score:** LR P@L of 12.4±11.3

**Data Collection Method:**

- Human

**Labeling Method:**

- N/A

**Properties:** The data for CASP14 targets is collected from protein structures that are newly solved by experimental
structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full,
three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to
participating research groups and servers, who must submit their predicted structures within a specific time frame.

**Link:** [CASP15 (Critical Assessment of Methods of Protein Structure
Prediction)](https://pubmed.ncbi.nlm.nih.gov/37920879/)

**Benchmark Score:** LR P@L of 16.9±13.2

**Data Collection Method:**

- Human

**Labeling Method:**

- N/A

**Properties:** The data for CASP15 targets is collected from protein structures that are newly solved by experimental
structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full,
three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to
participating research groups and servers, who must submit their predicted structures within a specific time frame.

## Inference:

**Acceleration Engine:**

- Hugging Face Transformers

**Test Hardware:**

- A100
- H100
- H200
- GB200

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable
development for a wide array of AI applications. When downloaded or used in accordance with our terms of service,
developers should work with their internal model team to ensure this model meets requirements for the relevant industry
and use case and addresses unforeseen product misuse.

Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and
comply with applicable safety regulations and ethical standards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns
[here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).