---
library_name: transformers
license: mit
datasets:
- chandar-lab/UR100P
language:
- en
tags:
- biology
---
> [!NOTE]
> This model has been optimized using NVIDIA's [TransformerEngine](https://github.com/NVIDIA/TransformerEngine)
> library. Slight numerical differences may be observed between the original model and the optimized
> model. For instructions on how to install TransformerEngine, please refer to the
> [official documentation](https://github.com/NVIDIA/TransformerEngine?tab=readme-ov-file#installation).
# AMPLIFY (TransformerEngine-Optimized) Overview
## Description:
AMPLIFY is an efficient, state-of-the-art protein language model (pLM). It can generate residue and protein
embeddings, suggest mutations, and differentiate disordered proteins from non-protein sequences. AMPLIFY is available in
two sizes, 120M and 350M parameters.
This version of the AMPLIFY model is optimized with NVIDIA's
[TransformerEngine](https://github.com/NVIDIA/TransformerEngine) library. It is based on the original AMPLIFY model from
Chandar Research Lab (CRL), and (within numerical precision) has identical weights and outputs.
This model is ready for commercial/non-commercial use.
## Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements
for this application and use case; see link to Non-NVIDIA [AMPLIFY Model
Card](https://huggingface.co/chandar-lab/AMPLIFY_120M).
### License/Terms of Use:
AMPLIFY is provided under the [MIT license](https://github.com/chandar-lab/AMPLIFY/blob/main/LICENSE).
### Deployment Geography:
Global
### Use Case:
Protein design, mutation prediction, and function analysis.
### Release Date:
Hugging Face 06/12/2025 via [https://huggingface.co/nvidia/AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M)
## References:
- [Protein Language Models: Is Scaling
Necessary?](https://www.biorxiv.org/content/biorxiv/early/2024/09/23/2024.09.23.614603.full.pdf) - detailed
information on the model architecture and training data.
## Model Architecture:
**Architecture Type:** Transformer
**Network Architecture:** ESM-2
**This model was developed based on:** [AMPLIFY](https://huggingface.co/chandar-lab/AMPLIFY_120M) <br>
**Number of model parameters:** 1.2 x 10^8
## Input:
**Input Type:** Text (Protein Sequences) <br>
**Input Format:** String <br>
**Input Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Input:** Protein sequence represented as a string of canonical amino acids. The maximum
context length is 2048 residues.
## Output:
**Output Type:** Embeddings (Amino acid and sequence-level) <br>
**Output Format:** Numeric vector <br>
**Output Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Output:** Numeric vector with floating-point values corresponding to an embedding for each
amino acid in the input protein sequence.
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware
(e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times
compared to CPU-only solutions.
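The model emits one embedding per residue; a single sequence-level embedding is commonly derived by pooling over the residue axis. The sketch below illustrates that pooling step only, on a dummy tensor standing in for the model's hidden states (the hidden size of 640 for the 120M model and the mean-pooling choice are assumptions for illustration, not part of the model's API):

```python
import numpy as np

# Dummy per-residue embeddings standing in for AMPLIFY hidden states:
# a 10-residue sequence, assuming a hidden size of 640 for the 120M model.
rng = np.random.default_rng(0)
residue_embeddings = rng.standard_normal((10, 640)).astype(np.float32)

# One common way to get a sequence-level embedding: mean pooling
# over the residue (length) axis.
sequence_embedding = residue_embeddings.mean(axis=0)

print(sequence_embedding.shape)  # (640,)
```

Other pooling strategies (e.g. using a special-token embedding) are possible; mean pooling is simply a widely used default for pLM embeddings.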
## Software Integration:
**Runtime Engines:**
- Hugging Face Transformers
**Supported Hardware Microarchitecture Compatibility:**
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper
**Preferred Operating System(s):**
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific
data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at
both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure
compliance with safety and ethical standards before deployment.
## Model and checkpoint versions are noted below:
- [AMPLIFY_350M](https://huggingface.co/nvidia/AMPLIFY_350M) <br>
- [AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M) <br>
## Get Started
```python
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset

# Load AMPLIFY and its tokenizer
model = AutoModel.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True)

# Move the model to the GPU (required due to Flash Attention)
model = model.to("cuda")

# Load the UniProt validation set
dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")

for sample in dataset:
    # Protein
    print("Sample: ", sample["name"], sample["sequence"])

    # Tokenize the protein
    input_ids = tokenizer.encode(sample["sequence"], return_tensors="pt")
    print("Input: ", input_ids)

    # Move to the GPU and make a prediction
    input_ids = input_ids.to("cuda")
    output = model(input_ids)
    print("Output: ", output)

    break
```
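The "suggest mutations" use case mentioned above amounts to masking a position and ranking amino acids by the model's output logits at that position. The following is a minimal, model-free sketch of just the ranking step; the logits here are random stand-ins, not actual AMPLIFY outputs, and the top-k cutoff is an arbitrary choice for illustration:

```python
import numpy as np

# The 20 canonical amino acids, and dummy logits at one masked position.
# In practice, these logits would come from the model's output head.
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
rng = np.random.default_rng(0)
logits = rng.standard_normal(len(amino_acids))

# Convert logits to probabilities with a numerically stable softmax.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Rank candidate substitutions by probability; the top-k entries
# are the suggested mutations at this position.
top_k = 3
ranked = sorted(zip(amino_acids, probs), key=lambda pair: pair[1], reverse=True)
for aa, p in ranked[:top_k]:
    print(f"{aa}: {p:.3f}")
```

With the real model, the same ranking would be applied to the logits returned for a masked token in the input sequence.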
## Training and Evaluation Datasets:
## Training Datasets:
**Link:** [UniRef100](https://www.uniprot.org/uniref?query=identity%3A1.0)
**Data Modality:**
- Text (Protein Sequences)
**Text Training Data Size:**
- 1 Billion to 10 Trillion Tokens
**Data Collection Method:**
- Human
**Labeling Method:**
- N/A
**Properties (Quantity, Dataset Descriptions, Sensor(s)):** UniRef100 contains all records in the UniProt Knowledgebase
and selected UniParc records. In UniRef100, identical sequences and subfragments are placed into a single cluster using
the CD-HIT algorithm. The longest members of the cluster (seed sequences) are used to generate UniRef90. However, the
longest sequence is not always the most informative. There is often more biologically relevant information and
annotation (name, function, cross-references) available on other cluster members. All the proteins in each cluster are
ranked to facilitate the selection of a biologically relevant representative for the cluster.
**Link:** [Observed Antibody Space (OAS)](https://opig.stats.ox.ac.uk/webapps/oas/downloads_paired/)
**Data Modality:**
- Text (Protein Sequences)
**Text Training Data Size:**
- 1 Billion to 10 Trillion Tokens
**Data Collection Method:**
- Human
**Labeling Method:**
- Human
**Properties:** The Observed Antibody Space (OAS) database is a project to collect and annotate immune repertoires for
use in large-scale analysis. It currently contains over one billion sequences, from over 80 different studies. These
repertoires cover diverse immune states, organisms (primarily human and mouse), and individuals.
**Link:** [Structural Classification of Proteins (SCOP)](https://www.ebi.ac.uk/pdbe/scop/download)
**Data Modality:**
- Text (Protein Sequences)
**Text Training Data Size:**
- 1 Billion to 10 Trillion Tokens
**Data Collection Method:**
- Hybrid: Human, Automated
**Labeling Method:**
- Hybrid: Human, Automated
**Properties:** The main levels of classification in SCOP are:
- Class: Groups proteins based on their secondary structure content, such as all-alpha, all-beta, alpha/beta, and
alpha+beta.
- Fold: Proteins within the same fold have the same major secondary structures arranged in the same way with the same
topological connections.
- Superfamily: Groups protein domains with a probable common evolutionary ancestry based on shared structural and
functional features, even if sequence similarity is low.
- Family: Groups closely related proteins with clear evidence of a common evolutionary origin, often detectable through
sequence comparison methods.
- Species: Represents a distinct protein sequence.
- Protein: Groups similar sequences with the same function.
## Evaluation Datasets:
**Link:** [Continuous Automated Model EvaluatiOn (CAMEO)](https://pmc.ncbi.nlm.nih.gov/articles/PMC8673552/)
**Benchmark Score:** LR P@L of 17.8±14.1
**Data Collection Method:**
- Human
**Labeling Method:**
- N/A
**Properties:** The data is collected by taking sequences of protein structures that are about to be released weekly by
the Protein Data Bank (PDB). These sequences are sent as "blind targets" to participating protein structure prediction
servers, which then return their predictions.
**Link:** [CASP14 (Critical Assessment of Methods of Protein Structure
Prediction)](https://pubmed.ncbi.nlm.nih.gov/34533838/)
**Benchmark Score:** LR P@L of 12.4±11.3
**Data Collection Method:**
- Human
**Labeling Method:**
- N/A
**Properties:** The data for CASP14 targets is collected from protein structures that are newly solved by experimental
structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full,
three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to
participating research groups and servers, who must submit their predicted structures within a specific time frame.
**Link:** [CASP15 (Critical Assessment of Methods of Protein Structure
Prediction)](https://pubmed.ncbi.nlm.nih.gov/37920879/)
**Benchmark Score:** LR P@L of 16.9±13.2
**Data Collection Method:**
- Human
**Labeling Method:**
- N/A
**Properties:** The data for CASP15 targets is collected from protein structures that are newly solved by experimental
structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full,
three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to
participating research groups and servers, who must submit their predicted structures within a specific time frame.
## Inference:
**Acceleration Engine:**
- Hugging Face Transformers
**Test Hardware:**
- A100
- H100
- H200
- GB200
## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable
development for a wide array of AI applications. When downloaded or used in accordance with our terms of service,
developers should work with their internal model team to ensure this model meets requirements for the relevant industry
and use case and addresses unforeseen product misuse.
Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and
comply with applicable safety regulations and ethical standards.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns
[here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).