---
library_name: transformers
license: mit
datasets:
- chandar-lab/UR100P
language:
- en
tags:
- biology
---
> [!NOTE]
> This model has been optimized using NVIDIA's [TransformerEngine](https://github.com/NVIDIA/TransformerEngine)
> library. Slight numerical differences may be observed between the original model and the optimized
> model. For instructions on how to install TransformerEngine, please refer to the
> [official documentation](https://github.com/NVIDIA/TransformerEngine?tab=readme-ov-file#installation).
# AMPLIFY (TransformerEngine-Optimized) Overview
## Description:
AMPLIFY is an efficient, state-of-the-art protein language model (pLM). It can generate residue and protein
embeddings, suggest mutations, and differentiate disordered proteins from non-protein sequences. AMPLIFY is available in
two sizes, 120M and 350M parameters.
This version of the AMPLIFY model is optimized with NVIDIA's
[TransformerEngine](https://github.com/NVIDIA/TransformerEngine) library. It is based on the original AMPLIFY model from
Chandar Research Lab (CRL), and (within numerical precision) has identical weights and outputs.
This model is ready for commercial/non-commercial use.
## Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements
for this application and use case; see link to Non-NVIDIA [AMPLIFY Model
Card](https://huggingface.co/chandar-lab/AMPLIFY_120M).
### License/Terms of Use:
AMPLIFY is provided under the [MIT license](https://github.com/chandar-lab/AMPLIFY/blob/main/LICENSE).
### Deployment Geography:
Global
### Use Case:
Protein design, mutation prediction, and function analysis.
### Release Date:
Hugging Face 06/12/2025 via [https://huggingface.co/nvidia/AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M)
## References:
- [Protein Language Models: Is Scaling
Necessary?](https://www.biorxiv.org/content/biorxiv/early/2024/09/23/2024.09.23.614603.full.pdf) - detailed
information on the model architecture and training data.
## Model Architecture:
**Architecture Type:** Transformer
**Network Architecture:** ESM-2
**This model was developed based on:** [AMPLIFY](https://huggingface.co/chandar-lab/AMPLIFY_120M) <br>
**Number of model parameters:** 1.2 x 10^8
## Input:
**Input Type:** Text (Protein Sequences) <br>
**Input Format:** String <br>
**Input Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Input:** Protein sequence represented as a string of canonical amino acids. The maximum
context length is 2048 residues.
## Output:
**Output Type:** Embeddings (Amino acid and sequence-level) <br>
**Output Format:** Numeric vector <br>
**Output Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Output:** Numeric vector with floating-point values corresponding to an embedding for each
amino acid in the input protein sequence.
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware
(e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times
compared to CPU-only solutions.
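The model emits one embedding per residue; a single sequence-level embedding is commonly derived by pooling over the residue axis. The sketch below illustrates that pooling step only, on a dummy tensor standing in for the model's hidden states (the hidden size of 640 for the 120M model and the mean-pooling choice are assumptions for illustration, not part of the model's API):

```python
import numpy as np

# Dummy per-residue embeddings standing in for AMPLIFY hidden states:
# a 10-residue sequence, assuming a hidden size of 640 for the 120M model.
rng = np.random.default_rng(0)
residue_embeddings = rng.standard_normal((10, 640)).astype(np.float32)

# One common way to get a sequence-level embedding: mean pooling
# over the residue (length) axis.
sequence_embedding = residue_embeddings.mean(axis=0)

print(sequence_embedding.shape)  # (640,)
```

Other pooling strategies (e.g. using a special-token embedding) are possible; mean pooling is simply a widely used default for pLM embeddings.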
## Software Integration:
**Runtime Engines:**
- Hugging Face Transformers
**Supported Hardware Microarchitecture Compatibility:**
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper
**Preferred Operating System(s):**
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific
data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at
both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure
compliance with safety and ethical standards before deployment.
## Model and checkpoint versions are noted below:
- [AMPLIFY_350M](https://huggingface.co/nvidia/AMPLIFY_350M) <br>
- [AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M) <br>
## Get Started
```python
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset

# Load AMPLIFY and its tokenizer
model = AutoModel.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True)

# Move the model to the GPU (required due to Flash Attention)
model = model.to("cuda")

# Load the UniProt validation set
dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")

for sample in dataset:
    # Protein
    print("Sample: ", sample["name"], sample["sequence"])

    # Tokenize the protein
    input_ids = tokenizer.encode(sample["sequence"], return_tensors="pt")
    print("Input: ", input_ids)

    # Move to the GPU and make a prediction
    input_ids = input_ids.to("cuda")
    output = model(input_ids)
    print("Output: ", output)

    break
```
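The "suggest mutations" use case mentioned above amounts to masking a position and ranking amino acids by the model's output logits at that position. The following is a minimal, model-free sketch of just the ranking step; the logits here are random stand-ins, not actual AMPLIFY outputs, and the top-k cutoff is an arbitrary choice for illustration:

```python
import numpy as np

# The 20 canonical amino acids, and dummy logits at one masked position.
# In practice, these logits would come from the model's output head.
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
rng = np.random.default_rng(0)
logits = rng.standard_normal(len(amino_acids))

# Convert logits to probabilities with a numerically stable softmax.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Rank candidate substitutions by probability; the top-k entries
# are the suggested mutations at this position.
top_k = 3
ranked = sorted(zip(amino_acids, probs), key=lambda pair: pair[1], reverse=True)
for aa, p in ranked[:top_k]:
    print(f"{aa}: {p:.3f}")
```

With the real model, the same ranking would be applied to the logits returned for a masked token in the input sequence.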
## Training and Evaluation Datasets:
## Training Datasets:
**Link:** [UniRef100](https://www.uniprot.org/uniref?query=identity%3A1.0)
**Data Modality:**
- Text (Protein Sequences)
**Text Training Data Size:**
- 1 Billion to 10 Trillion Tokens
**Data Collection Method:**
- Human
**Labeling Method:**
- N/A
**Properties (Quantity, Dataset Descriptions, Sensor(s)):** UniRef100 contains all records in the UniProt Knowledgebase
and selected UniParc records. In UniRef100, identical sequences and subfragments are placed into a single cluster using
the CD-HIT algorithm. The longest members of the cluster (seed sequences) are used to generate UniRef90. However, the
longest sequence is not always the most informative. There is often more biologically relevant information and
annotation (name, function, cross-references) available on other cluster members. All the proteins in each cluster are
ranked to facilitate the selection of a biologically relevant representative for the cluster.
**Link:** [Observed Antibody Space (OAS)](https://opig.stats.ox.ac.uk/webapps/oas/downloads_paired/)
**Data Modality:**
- Text (Protein Sequences)
**Text Training Data Size:**
- 1 Billion to 10 Trillion Tokens
**Data Collection Method:**
- Human
**Labeling Method:**
- Human
**Properties:** The Observed Antibody Space (OAS) database is a project to collect and annotate immune repertoires for
use in large-scale analysis. It currently contains over one billion sequences, from over 80 different studies. These
repertoires cover diverse immune states, organisms (primarily human and mouse), and individuals.
**Link:** [Structural Classification of Proteins (SCOP)](https://www.ebi.ac.uk/pdbe/scop/download)
**Data Modality:**
- Text (Protein Sequences)
**Text Training Data Size:**
- 1 Billion to 10 Trillion Tokens
**Data Collection Method:**
- Hybrid: Human, Automated
**Labeling Method:**
- Hybrid: Human, Automated
**Properties:** The main levels of classification in SCOP are:
- Class: Groups proteins based on their secondary structure content, such as all-alpha, all-beta, alpha/beta, and
alpha+beta.
- Fold: Proteins within the same fold have the same major secondary structures arranged in the same way with the same
topological connections.
- Superfamily: Groups protein domains with a probable common evolutionary ancestry based on shared structural and
functional features, even if sequence similarity is low.
- Family: Groups closely related proteins with clear evidence of a common evolutionary origin, often detectable through
sequence comparison methods.
- Species: Represents a distinct protein sequence.
- Protein: Groups similar sequences with the same function.
## Evaluation Datasets:
**Link:** [Continuous Automated Model EvaluatiOn (CAMEO)](https://pmc.ncbi.nlm.nih.gov/articles/PMC8673552/)
**Benchmark Score:** LR P@L of 17.8±14.1
**Data Collection Method:**
- Human
**Labeling Method:**
- N/A
**Properties:** The data is collected by taking sequences of protein structures that are about to be released weekly by
the Protein Data Bank (PDB). These sequences are sent as "blind targets" to participating protein structure prediction
servers, which then return their predictions.
**Link:** [CASP14 (Critical Assessment of Methods of Protein Structure
Prediction)](https://pubmed.ncbi.nlm.nih.gov/34533838/)
**Benchmark Score:** LR P@L of 12.4±11.3
**Data Collection Method:**
- Human
**Labeling Method:**
- N/A
**Properties:** The data for CASP14 targets is collected from protein structures that are newly solved by experimental
structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full,
three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to
participating research groups and servers, who must submit their predicted structures within a specific time frame.
**Link:** [CASP15 (Critical Assessment of Methods of Protein Structure
Prediction)](https://pubmed.ncbi.nlm.nih.gov/37920879/)
**Benchmark Score:** LR P@L of 16.9±13.2
**Data Collection Method:**
- Human
**Labeling Method:**
- N/A
**Properties:** The data for CASP15 targets is collected from protein structures that are newly solved by experimental
structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full,
three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to
participating research groups and servers, who must submit their predicted structures within a specific time frame.
## Inference:
**Acceleration Engine:**
- Hugging Face Transformers
**Test Hardware:**
- A100
- H100
- H200
- GB200
## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable
development for a wide array of AI applications. When downloaded or used in accordance with our terms of service,
developers should work with their internal model team to ensure this model meets requirements for the relevant industry
and use case and addresses unforeseen product misuse.
Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and
comply with applicable safety regulations and ethical standards.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns
[here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).