---
library_name: transformers
license: mit
datasets:
- chandar-lab/UR100P
language:
- en
tags:
- biology
---

> [!NOTE]
> This model has been optimized using NVIDIA's [TransformerEngine](https://github.com/NVIDIA/TransformerEngine)
> library. Slight numerical differences may be observed between the original model and the optimized
> model. For instructions on how to install TransformerEngine, please refer to the
> [official documentation](https://github.com/NVIDIA/TransformerEngine?tab=readme-ov-file#installation).

# AMPLIFY (TransformerEngine-Optimized) Overview

## Description:

AMPLIFY is an efficient, state-of-the-art protein language model (pLM). AMPLIFY can generate residue and protein
embeddings, suggest mutations, and differentiate disordered proteins from non-protein sequences. AMPLIFY is available
in two sizes, 120M and 350M parameters.
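
Because AMPLIFY is a masked protein language model, mutation suggestion can be framed as masked-token prediction: mask
a residue and rank the amino acids the model proposes for that position. The sketch below is illustrative rather than
an official recipe; it assumes the tokenizer prepends a single special token to the sequence and that the model output
exposes a `logits` field, so verify both against the model's remote code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

sequence = "MSVVGIDLGFQSCYVAVARAGGIETIANEYSDRCTPACISFGPKNR"  # example sequence
position = 10  # 0-indexed residue to mutate

# Mask one residue and rank replacement amino acids by model probability.
ids = tokenizer.encode(sequence, return_tensors="pt")
ids[0, position + 1] = tokenizer.mask_token_id  # +1 assumes one leading special token (check the tokenizer)
with torch.no_grad():
    logits = model(ids.to("cuda")).logits  # assumes the output exposes `logits`
top = torch.topk(logits[0, position + 1], k=5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))
```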

This version of the AMPLIFY model is optimized with NVIDIA's
[TransformerEngine](https://github.com/NVIDIA/TransformerEngine) library. It is based on the original AMPLIFY model from
Chandar Research Lab (CRL), and (within numerical precision) has identical weights and outputs.

This model is ready for commercial/non-commercial use.
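
Since the optimized model should match the original within numerical precision, a quick sanity check is to run both
checkpoints on the same sequence and compare outputs. This is a minimal sketch, assuming both repositories load through
`AutoModel` with `trust_remote_code=True` and return an output with a `logits` field; expect small element-wise
differences rather than exact equality.

```python
import torch
from transformers import AutoModel, AutoTokenizer

sequence = "MSVVGIDLGFQSCYVAVARAGGIETIANEYSDRCTPACISFGPKNR"

logits = []
for repo in ("chandar-lab/AMPLIFY_350M", "nvidia/AMPLIFY_350M"):
    tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    model = AutoModel.from_pretrained(repo, trust_remote_code=True).to("cuda")
    ids = tokenizer.encode(sequence, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits.append(model(ids).logits.float())  # assumes a `logits` field

# Maximum absolute deviation between the original and optimized models.
print(torch.max(torch.abs(logits[0] - logits[1])).item())
```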

## Third-Party Community Consideration

This model is not owned or developed by NVIDIA. It has been developed and built to a third-party's requirements for
this application and use case; see the non-NVIDIA [AMPLIFY Model
Card](https://huggingface.co/chandar-lab/AMPLIFY_350M).

### License/Terms of Use:

AMPLIFY is provided under the [MIT license](https://github.com/chandar-lab/AMPLIFY/blob/main/LICENSE).

### Deployment Geography:

Global

### Use Case:

Protein design, mutation prediction, and function analysis.

### Release Date:

Hugging Face 06/12/2025 via [https://huggingface.co/nvidia/AMPLIFY_350M](https://huggingface.co/nvidia/AMPLIFY_350M)

## References:

- [Protein Language Models: Is Scaling Necessary?](https://www.biorxiv.org/content/biorxiv/early/2024/09/23/2024.09.23.614603.full.pdf) - detailed information on the model architecture and training data.

## Model Architecture:

**Architecture Type:** Transformer <br>
**Network Architecture:** ESM-2

**This model was developed based on:** [AMPLIFY](https://huggingface.co/chandar-lab/AMPLIFY_350M) <br>
**Number of model parameters:** 3.5 x 10^8

## Input:

**Input Type:** Text (Protein Sequences) <br>
**Input Format:** String <br>
**Input Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Input:** Protein sequence represented as a string of canonical amino acids. The maximum
context length is 2048 residues.
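
Sequences longer than the 2048-residue window must be truncated (or split into windows) before inference. A minimal
sketch, assuming the tokenizer supports the standard transformers truncation arguments:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

sequence = "MAGV" * 1000  # 4000 residues, longer than the context window
ids = tokenizer.encode(sequence, return_tensors="pt", truncation=True, max_length=2048)
print(ids.shape)  # capped at 2048 tokens, including any special tokens
```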

## Output:

**Output Type:** Embeddings (Amino acid and sequence-level) <br>
**Output Format:** Numeric vector <br>
**Output Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Output:** Numeric vector with floating-point values corresponding to an embedding for each
amino acid in the input protein sequence.
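
A common way to reduce the per-residue embeddings to a single sequence-level vector is to mean-pool the final hidden
layer over residue positions. The sketch below is a hedged example, not the documented API: it assumes the forward pass
accepts `output_hidden_states=True`, returns a `hidden_states` tuple, and brackets the sequence with special tokens;
confirm against the model's remote code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

ids = tokenizer.encode("MSVVGIDLGFQSCYVAVARAGG", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model(ids, output_hidden_states=True)  # assumed kwarg; see remote code

residue_embeddings = out.hidden_states[-1][0]             # (tokens, hidden_dim)
protein_embedding = residue_embeddings[1:-1].mean(dim=0)  # drop special tokens, mean-pool
print(protein_embedding.shape)
```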

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware
(e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times
compared to CPU-only solutions.

## Software Integration:

**Runtime Engines:**

- Hugging Face Transformers

**Supported Hardware Microarchitecture Compatibility:**

- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper

**Preferred Operating System(s):**

- Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific
data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at
both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure
compliance with safety and ethical standards before deployment.

## Model Version(s):

- [AMPLIFY_350M](https://huggingface.co/nvidia/AMPLIFY_350M)
- [AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M)

**Get Started**

```python
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset

# Load AMPLIFY and tokenizer
model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

# Move the model to GPU (required due to Flash Attention)
model = model.to("cuda")

# Load the UniProt validation set
dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")

for sample in dataset:
    # Protein
    print("Sample: ", sample["name"], sample["sequence"])

    # Tokenize the protein
    inputs = tokenizer.encode(sample["sequence"], return_tensors="pt")
    print("Input: ", inputs)

    # Move to the GPU and make a prediction
    inputs = inputs.to("cuda")
    output = model(inputs)
    print("Output: ", output)

    break
```

## Training and Evaluation Datasets:

## Training Datasets:

**Link:** [UniRef100](https://www.uniprot.org/uniref?query=identity%3A1.0)

**Data Modality:**

- Text (Protein Sequences)

**Text Training Data Size:**

- 1 Billion to 10 Trillion Tokens

**Data Collection Method:**

- Human

**Labeling Method:**

- N/A

**Properties (Quantity, Dataset Descriptions, Sensor(s)):** UniRef100 contains all records in the UniProt Knowledgebase
and selected UniParc records. In UniRef100, identical sequences and subfragments are placed into a single cluster using
the CD-HIT algorithm. The longest members of the cluster (seed sequences) are used to generate UniRef90. However, the
longest sequence is not always the most informative. There is often more biologically relevant information and
annotation (name, function, cross-references) available on other cluster members. All the proteins in each cluster are
ranked to facilitate the selection of a biologically relevant representative for the cluster.

**Link:** [Observed Antibody Space (OAS)](https://opig.stats.ox.ac.uk/webapps/oas/downloads_paired/)

**Data Modality:**

- Text (Protein Sequences)

**Text Training Data Size:**

- 1 Billion to 10 Trillion Tokens

**Data Collection Method:**

- Human

**Labeling Method:**

- Human

**Properties:** The Observed Antibody Space (OAS) database is a project to collect and annotate immune repertoires for
use in large-scale analysis. It currently contains over one billion sequences, from over 80 different studies. These
repertoires cover diverse immune states, organisms (primarily human and mouse), and individuals.

**Link:** [Structural Classification of Proteins (SCOP)](https://www.ebi.ac.uk/pdbe/scop/download)

**Data Modality:**

- Text (Protein Sequences)

**Text Training Data Size:**

- 1 Billion to 10 Trillion Tokens

**Data Collection Method:**

- Hybrid: Human, Automated

**Labeling Method:**

- Hybrid: Human, Automated

**Properties:** The main levels of classification in SCOP are:

- Class: Groups proteins based on their secondary structure content, such as all-alpha, all-beta, alpha/beta, and
  alpha+beta.
- Fold: Proteins within the same fold have the same major secondary structures arranged in the same way with the same
  topological connections.
- Superfamily: Groups protein domains with a probable common evolutionary ancestry based on shared structural and
  functional features, even if sequence similarity is low.
- Family: Groups closely related proteins with clear evidence of a common evolutionary origin, often detectable through
  sequence comparison methods.
- Protein: Groups similar sequences with the same function.
- Species: Represents a distinct protein sequence.

## Evaluation Datasets:

**Link:** [Continuous Automated Model EvaluatiOn (CAMEO)](https://pmc.ncbi.nlm.nih.gov/articles/PMC8673552/)

**Benchmark Score:** LR P@L of 20.9±15.7

**Data Collection Method:**

- Human

**Labeling Method:**

- N/A

**Properties:** The data is collected by taking sequences of protein structures that are about to be released weekly by
the Protein Data Bank (PDB). These sequences are sent as "blind targets" to participating protein structure prediction
servers, which then return their predictions.
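
The LR P@L scores reported for CAMEO above and the CASP sets below are long-range precision at L: the fraction of true
contacts among the top-L predicted residue pairs, where L is the sequence length and only pairs far apart in sequence
are counted. A minimal sketch of the metric, assuming the conventional minimum sequence separation of 24 residues:

```python
import numpy as np

def long_range_precision_at_L(pred: np.ndarray, contacts: np.ndarray, min_sep: int = 24) -> float:
    """Precision of the top-L long-range contact predictions.

    pred:     (L, L) predicted contact probabilities
    contacts: (L, L) boolean ground-truth contact map
    min_sep:  minimum sequence separation |i - j| to count as long-range
    """
    L = pred.shape[0]
    i, j = np.triu_indices(L, k=min_sep)    # long-range pairs only
    top = np.argsort(pred[i, j])[::-1][:L]  # top-L scoring pairs
    return float(contacts[i[top], j[top]].mean())
```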

**Link:** [CASP14 (Critical Assessment of Methods of Protein Structure
Prediction)](https://pubmed.ncbi.nlm.nih.gov/34533838/)

**Benchmark Score:** LR P@L of 16.6±13.6

**Data Collection Method:**

- Human

**Labeling Method:**

- N/A

**Properties:** The data for CASP14 targets is collected from protein structures that are newly solved by experimental
structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full,
three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to
participating research groups and servers, who must submit their predicted structures within a specific time frame.

**Link:** [CASP15 (Critical Assessment of Methods of Protein Structure
Prediction)](https://pubmed.ncbi.nlm.nih.gov/37920879/)

**Benchmark Score:** LR P@L of 20.0±14.6

**Data Collection Method:**

- Human

**Labeling Method:**

- N/A

**Properties:** The data for CASP15 targets is collected from protein structures that are newly solved by experimental
structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full,
three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to
participating research groups and servers, who must submit their predicted structures within a specific time frame.

## Inference:

**Acceleration Engine:**

- Hugging Face Transformers

**Test Hardware:**

- A100
- H100
- H200
- GB200

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable
development for a wide array of AI applications. When downloaded or used in accordance with our terms of service,
developers should work with their internal model team to ensure this model meets requirements for the relevant industry
and use case and addresses unforeseen product misuse.

Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and
comply with applicable safety regulations and ethical standards.

Please report model quality, risk, security vulnerabilities, or NVIDIA AI Concerns
[here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).