sarahalamdari committed on
Commit
cc4008c
·
verified ·
1 Parent(s): 9dc4787

Update README.md

Files changed (1)
  1. README.md +139 -25
README.md CHANGED
@@ -4,18 +4,9 @@
4
 
5
  # Model Card for Dayhoff
6
 
7
- In this work, we combined genomic-derived protein sequences, metagenomics, structure-based synthetic sequences, and MSAs to create the Dayhoff Atlas of protein data and language models.
8
- We first created a large-scale natural protein dataset, GigaRef, by combining and reclustering sequences from metagenomic databases with UniRef100. With 3.3B sequences in 1.7B clusters, GigaRef is the largest open dataset of natural proteins to date.
9
-
10
- To infuse the benefits of protein structure information into sequence space, we generated the first large-scale structure-based synthetic dataset, called BackboneRef, by sampling 240,830 backbone structures from a structure-based generative model and then using them to design a total of 46M synthetic sequences.
11
- Using UniRef, GigaRef, BackboneRef, and 16M MSAs from OpenProteinSet, we then trained the Dayhoff series of PLMs, which use a a hybrid state-space-model (SSM) and transformer architecture along with a mixture-of-experts (MoE) mechanism to enable the long context lengths needed to combine single sequences and MSAs at scale.
12
- Dayhoff models make accurate zero-shot predictions of mutations effects, generate sequences conditioned on aligned or unaligned homologs, and generate shorter Cas9s that preserve the functional domain architecture.
13
-
14
- Larger models, metagenomic sequences, and structure-based augmentation all increased the expression rates of unconditional generations in E. coli.
15
- Finally, we generated, characterized, and release 16M synthetic sequences as DayhoffRef
16
-
17
- Dayhoff is described in this [preprint](preprint); if you use the code from this repository or the results, please cite the preprint.
18
 
 
19
 
20
  ## Model Details
21
 
@@ -33,16 +24,17 @@ Dayhoff is described in this [preprint](preprint); if you use the code from this
33
 
34
  ### Downstream Use
35
 
36
- * Protein Language Model Training: Training protein language models, to generate new protein sequences, predict mutation effects, and design functional proteins.
37
- * Zero-shot Prediction: Predicting the functional impact of mutations.
38
- * Sequence Generation: Generating new protein sequences unconditionally, or conditioned on homologs for designing proteins with desired properties.
39
- * Synthetic Sequence Generation: Exploring novel protein structures and functions using synthetic sequences.
40
 
 
 
 
 
41
 
42
- ## Bias, Risks, and Limitations
43
 
44
- The [software/model] described in this repository is provided for research and development use only. The [software/model] is not intended for use in clinical decision-making or for any other clinical use, and the performance of model for clinical use has not been established. You bear sole responsibility for any use of this [software/model], including incorporation into any product intended for clinical use. 
45
 
 
46
 
47
  ## How to Get Started with the Model
48
 
@@ -73,7 +65,131 @@ For detailed instructions on package usage, please refer to the README in model
73
 
74
  ### Results
75
 
76
- Dayhoff models make accurate zero-shot predictions of mutation effects, generate sequences conditioned on aligned or unaligned homologs, and generate shorter Cas9s that preserve the functional domain architecture. Larger models, metagenomic sequences, and structure-based augmentation all increased the expression rates of unconditional generations in E. coli
77
 
78
  ## Technical Specifications
79
 
@@ -83,15 +199,13 @@ Dayhoff models make accurate zero-shot predictions of mutation effects, generate
83
  * 3B-parameter models: trained on 176 NVIDIA H100 GPUs using Fully Sharded Data Parallel in hybrid-shard mode.
84
 
85
 
86
- ## Citation
87
-
88
- **BibTeX:**
89
- If you use this model in your work, please cite it as follows:
90
 
91
- <ADD INFO>
92
 
 
93
 
94
- ## Model Card Authors
95
 
96
- Samir Char, Sarah A. Alamdari
97
 
 
 
4
 
5
  # Model Card for Dayhoff
6
 
7
+ Dayhoff is an Atlas of both protein sequence data and generative language models: a centralized resource that brings together 3.3 billion protein sequences across 1.7 billion clusters of metagenomic and natural protein sequences (GigaRef), 46 million structure-derived synthetic sequences (BackboneRef), and 16 million multiple sequence alignments (OpenProteinSet). The Dayhoff models can natively predict zero-shot mutation effects on fitness, scaffold structural motifs by conditioning on evolutionary or structural context, and perform guided generation of novel proteins within specified families. Learning from metagenomic and structure-based synthetic data from the Dayhoff Atlas increased the cellular expression rates of generated proteins, highlighting the real-world value of expanding the scale, diversity, and novelty of protein sequence data.
 
 
 
 
 
 
 
 
 
 
8
 
9
+ The Dayhoff architecture is a hybrid of state-space Mamba layers and Transformer self-attention, interleaved with Mixture-of-Experts modules to maximize capacity while preserving efficiency. It natively handles long contexts, allowing both single sequences and unrolled MSAs to be modeled. Trained with an autoregressive objective in both N→C and C→N directions, Dayhoff supports order-agnostic infilling and scales to billions of parameters.
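As a brief sketch of what the two training directions mean (the notation here is illustrative and not taken from the preprint): for a sequence $x = (x_1, \dots, x_L)$ written from N-terminus to C-terminus, an N→C model factorizes the sequence probability left to right, while a C→N model factorizes the same sequence right to left.

```latex
p_{\mathrm{N \to C}}(x) = \prod_{i=1}^{L} p\left(x_i \mid x_{<i}\right),
\qquad
p_{\mathrm{C \to N}}(x) = \prod_{i=1}^{L} p\left(x_i \mid x_{>i}\right)
```

Both are valid autoregressive factorizations of the same distribution; they differ only in which residues are available as context when each position is predicted.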
10
 
11
  ## Model Details
12
 
 
24
 
25
  ### Downstream Use
26
 
27
+ Dayhoff is intended for broad research use in protein language modeling. The models have been used and assessed on the following capabilities:
 
 
 
28
 
29
+ 1. Unconditional design of protein sequences
30
+ 2. Zero-shot mutation effect prediction on [ProteinGym](https://proteingym.org/)
31
+ 3. Designing scaffolds for structural motifs in sequence space on the [RFDiffusion](https://www.nature.com/articles/s41586-023-06415-8) benchmark and [MotifBench](https://arxiv.org/abs/2502.12479)
32
+ 4. Homolog conditioning with Dayhoff-3b-GR-HM and Dayhoff-3b-GR-HM-c
33
 
 
34
 
35
+ ## Bias, Risks, and Limitations
36
 
37
+ This model should not be used to generate anything other than a protein sequence or a set of homologous protein sequences. It is not meant for natural language or for other biological sequences, such as DNA sequences. Generated sequences are not guaranteed to be realistic, and it remains difficult to generate high-quality sequences with no sequence homology to any natural sequence.
38
 
39
  ## How to Get Started with the Model
40
 
 
65
 
66
  ### Results
67
 
68
+ See the [preprint](https://aka.ms/dayhoff/preprint) for the latest benchmark results and evaluations.
69
+
70
+ **Model perplexity on held-out test sequences for Dayhoff models.**
71
+
72
+ | Model | UniRef50 | GigaRef | Aligned homologs | Unaligned homologs |
73
+ |------------------|---------:|--------:|-----------------:|-------------------:|
74
+ | 170m-UR50 | 11.62 | 11.88 | | |
75
+ | 170m-UR90 | 11.52 | 11.85 | | |
76
+ | 170m-GR | 13.67 | 9.36 | | |
77
+ | 170m-UR50-BRn | 11.78 | 12.03 | | |
78
+ | 170m-UR50-BRq | 11.67 | 11.91 | | |
79
+ | 170m-UR50-BRu | 11.66 | 11.87 | | |
80
+ | 3b-UR90 | 8.95 | 9.64 | | |
81
+ | 3b-GR-HM | 11.95 | 6.68 | 4.34 | 4.60 |
82
+ | 3b-GR-HM-c | 10.11 | 9.21 | 3.57 | 3.56 |
83
+
84
+
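For readers unfamiliar with the metric, the following is a minimal sketch of how perplexity is conventionally computed from per-token log-probabilities. It assumes natural-log probabilities averaged per token; the exact tokenization and averaging used for the table above are described in the preprint, and the function name `perplexity` is illustrative only.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the negative mean per-token natural-log probability.

    `token_log_probs` is assumed to hold log p(token | context) for every
    token of the held-out sequences (a hypothetical input format).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Sanity check: a model that assigns uniform probability over the 20
# standard amino acids has perplexity 20 (up to floating-point error).
uniform = [math.log(1 / 20)] * 50
print(round(perplexity(uniform), 2))  # 20.0
```

Lower is better: a perplexity of 20 corresponds to guessing uniformly among the 20 standard amino acids, so values well below that indicate the model has learned structure in the sequences.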
85
+ **Quality of generated sequences** as measured by ESMFold pLDDT and scPerplexity. Dataset statistics are for 1024 randomly-sampled sequences. Model statistics are for 1024 generations at T=1 in the N-to-C direction.
86
+
87
+ | Model or dataset | pLDDT (mean ± s.d.) | scPerplexity (mean ± s.d.) |
88
+ |-------------------------|---------------------|----------------------------|
89
+ | **Natural sequences** | | |
90
+ | UniRef50 | 0.653 ± 0.196 | 9.45 ± 2.89 |
91
+ | GigaRef-clusters | 0.619 ± 0.199 | 9.69 ± 2.83 |
92
+ | GigaRef-singletons | 0.561 ± 0.201 | 10.07 ± 2.88 |
93
+ | **Generated sequences** | | |
94
+ | 170m-UR50 | 0.421 ± 0.132 | 11.97 ± 2.14 |
95
+ | 170m-UR90 | 0.407 ± 0.125 | 12.12 ± 2.14 |
96
+ | 170m-GR | 0.422 ± 0.129 | 11.83 ± 2.12 |
97
+ | 170m-UR50-BRu | 0.441 ± 0.157 | 11.71 ± 2.18 |
98
+ | 170m-UR50-BRq | 0.434 ± 0.152 | 11.72 ± 2.24 |
99
+ | 170m-UR50-BRn | 0.432 ± 0.131 | 11.77 ± 2.24 |
100
+ | 3b-UR90 | 0.454 ± 0.150 | 11.79 ± 2.38 |
101
+ | 3b-GR-HM | 0.406 ± 0.126 | 11.50 ± 2.16 |
102
+ | 3b-GR-HM-c | 0.423 ± 0.132 | 11.91 ± 2.18 |
103
+
104
+
105
+
106
+ **ProteinGym zero-shot performance** Spearman’s correlation coefficient on ProteinGym substitutions and indels.
107
+
108
+ | Input | Model | Parameters | Substitutions | Indels |
109
+ |------------------------|----------------|-----------:|--------------:|-------:|
110
+ | **Single sequence** | 170m-UR50 | 170M | 0.353 | 0.479 |
111
+ | | 170m-UR90 | 170M | 0.354 | 0.483 |
112
+ | | 170m-GR | 170M | 0.199 | 0.292 |
113
+ | | 170m-UR50-BRu | 170M | 0.341 | 0.476 |
114
+ | | 170m-UR50-BRq | 170M | 0.356 | 0.477 |
115
+ | | 170m-UR50-BRn | 170M | 0.341 | 0.478 |
116
+ | | 3b-UR90 | 3B | 0.394 | 0.497 |
117
+ | | 3b-GR-HM | 3B | 0.328 | 0.423 |
118
+ | | 3b-GR-HM-c | 3B | 0.417 | 0.466 |
119
+ | **Aligned homologs** | 3b-GR-HM-c | 3B | 0.368 | NA |
120
+ | **Unaligned homologs** | 3b-GR-HM-c | 3B | 0.372 | 0.401 |
121
+
122
+
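The Spearman correlations above compare a model-derived score for each variant against its measured fitness. The sketch below shows the rank-correlation step only, in pure Python with average ranks for ties; how the per-variant model score is obtained (for example, from sequence log-likelihoods) is an assumption left outside this example, and all function names and numbers are illustrative.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(model_scores, measured_fitness):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(model_scores), average_ranks(measured_fitness))


# Hypothetical toy data: scores that order the variants perfectly.
print(spearman([-3.2, -1.1, -0.4, 0.9], [0.1, 0.5, 0.7, 1.3]))  # 1.0
```

Because only ranks matter, the model score does not need to be calibrated to the assay's units; it only needs to order variants consistently with the measurements.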
123
+ **RFDiffusion Benchmark Performance** Motif scaffolding performance, problems solved, successes out of 100, and MotifBench score.
124
+
125
+ | Problem | 170m-UR50 | 170m-UR90 | 170m-GR | 170m-UR50-BRn | 170m-UR50-BRq | 170m-UR50-BRu | 3b-UR90 | 3b-GR-HM | 3b-GR-HM-c | EvoDiff-Seq |
126
+ |--------------------|---------:|---------:|--------:|-------------:|-------------:|-------------:|-------:|--------:|----------:|-----------:|
127
+ | 1PRW | 62 | 72 | 81 | 95 | 91 | 90 | 94 | 81 | 79 | 82 |
128
+ | 1BCF | 0 | 0 | 5 | 0 | 0 | 0 | 10 | 8 | 0 | 7 |
129
+ | 5TPN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
130
+ | 5IUS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
131
+ | 3IXT | 12 | 17 | 12 | 14 | 18 | 12 | 18 | 11 | 14 | 20 |
132
+ | 5YUI | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
133
+ | 1QJG | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
134
+ | 1YCR | 2 | 5 | 0 | 6 | 7 | 6 | 2 | 3 | 4 | 2 |
135
+ | 2KL8 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
136
+ | 7MRX_60 | 1 | 0 | 0 | 0 | 0 | 2 | 42 | 0 | 9 | 0 |
137
+ | 7MRX_85 | 0 | 0 | 0 | 0 | 0 | 0 | 19 | 1 | 1 | 0 |
138
+ | 7MRX_128 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
139
+ | 4JHW | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
140
+ | 4ZYP | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
141
+ | 5WN9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
142
+ | 6VW1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
143
+ | 5TRV_short | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
144
+ | 5TRV_med | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
145
+ | 5TRV_long | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
146
+ | 6E6R_short | 2 | 2 | 1 | 3 | 3 | 2 | 14 | 7 | 8 | 6 |
147
+ | 6E6R_med | 0 | 1 | 2 | 0 | 0 | 2 | 4 | 0 | 2 | 0 |
148
+ | 6E6R_long | 0 | 1 | 0 | 0 | 0 | 1 | 3 | 0 | 1 | 0 |
149
+ | 6EXZ_short | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
150
+ | 6EXZ_med | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
151
+ | 6EXZ_long | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
152
+ | **Problems solved** | **6** | **8** | **6** | **5** | **4** | **10** | **10** | **7** | **9** | **6** |
153
+ | **Successes** | **80** | **100** | **102** | **119** | **119** | **118** | **207** | **112** | **119** | **118** |
154
+ | **Score** | **9.65** | **12.25** | **6.10** | **7.26** | **10.62** | **14.36** | **16.32** | **11.90** | **14.14** | **7.67** |
155
+
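The "Problems solved" and "Successes" summary rows can be reproduced from the per-problem counts. The sketch below does this for the 170m-UR50 column of the table above, assuming a problem counts as solved when at least one of its 100 designs succeeds (this matches the table but is an inference, not a stated definition). The MotifBench score is not recomputed here because its formula is not given in this README.

```python
# Per-problem successes (out of 100) for 170m-UR50, copied from the table above.
UR50_170M = {
    "1PRW": 62, "1BCF": 0, "5TPN": 0, "5IUS": 0, "3IXT": 12,
    "5YUI": 0, "1QJG": 0, "1YCR": 2, "2KL8": 0, "7MRX_60": 1,
    "7MRX_85": 0, "7MRX_128": 0, "4JHW": 0, "4ZYP": 0, "5WN9": 0,
    "6VW1": 1, "5TRV_short": 0, "5TRV_med": 0, "5TRV_long": 0,
    "6E6R_short": 2, "6E6R_med": 0, "6E6R_long": 0,
    "6EXZ_short": 0, "6EXZ_med": 0, "6EXZ_long": 0,
}


def summarize(successes_per_problem):
    """Return (problems solved, total successes) for one model column."""
    solved = sum(1 for n in successes_per_problem.values() if n > 0)
    total = sum(successes_per_problem.values())
    return solved, total


print(summarize(UR50_170M))  # (6, 80)
```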
156
+ **MotifBench Benchmark Performance** Motif scaffolding performance, problems solved, successes out of 100, and MotifBench score.
157
+
158
+ | Problem | 170m-UR50 | 170m-UR90 | 170m-GR | 170m-UR50-BRn | 170m-UR50-BRq | 170m-UR50-BRu | 3b-UR90 | 3b-GR-HM | 3b-GR-HM-c | EvoDiff-Seq |
159
+ |------------|----------:|----------:|--------:|-------------:|-------------:|-------------:|--------:|---------:|-----------:|------------:|
160
+ | 01_1LDB | 1 | 1 | 3 | 0 | 0 | 1 | 20 | 2 | 12 | 0 |
161
+ | 02_1ITU | 4 | 33 | 4 | 1 | 1 | 4 | 37 | 57 | 48 | 0 |
162
+ | 03_2CGA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
163
+ | 04_5WN9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
164
+ | 05_5ZE9 | 0 | 1 | 21 | 0 | 0 | 0 | 16 | 40 | 9 | 0 |
165
+ | 06_6E6R | 1 | 1 | 1 | 1 | 2 | 1 | 6 | 3 | 1 | 2 |
166
+ | 07_6E6R | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 |
167
+ | 08_7AD5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
168
+ | 09_7CG5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
169
+ | 10_7WRK | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
170
+ | 11_3TQB | 4 | 11 | 3 | 4 | 3 | 7 | 40 | 8 | 26 | 0 |
171
+ | 12_4JHW | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
172
+ | 13_4JHW | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
173
+ | 14_5IUS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
174
+ | 15_7A8S | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
175
+ | 16_7BNY | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
176
+ | 17_7DGW | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
177
+ | 18_7MQQ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
178
+ | 19_7MQQ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
179
+ | 20_7UWL | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
180
+ | 21_1B73 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
181
+ | 22_1BCF | 0 | 0 | 3 | 0 | 0 | 0 | 20 | 9 | 0 | 19 |
182
+ | 23_1MPY | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
183
+ | 24_1QY3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
184
+ | 25_2RKX | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
184
+ | 26_3B5V | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
185
+ | 27_4XOJ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
187
+ | 28_5YUI | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
188
+ | 29_6CPA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
189
+ | 30_7UWL | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
190
+ | **Problems solved**| **4**| **5**| **6**| **4**| **3**| **4**| **7**| **6**| **5**| **2** |
191
+ | **Successes**| **10**| **47**| **35**| **8**| **6**| **13**| **141**| **119**| **96**| **21** |
192
+ | **Score** | **2.33**| **2.92**| **4.33**| **2.75**| **2.17**| **2.75**| **8.36**| **4.96**| **4.48**| **1.58** |
193
 
194
  ## Technical Specifications
195
 
 
199
  * 3B-parameter models: trained on 176 NVIDIA H100 GPUs using Fully Sharded Data Parallel in hybrid-shard mode.
200
 
201
 
202
+ ## Responsible AI Considerations
 
 
 
203
 
204
+ The intended use of this model is to generate high-quality, realistic protein sequences or sets of homologous protein sequences. Generations can be designed from scratch or conditioned on partial sequences in both N→C and C→N directions.
205
 
206
+ The code and datasets released in this repository are provided for research and development use only. They are not intended for use in clinical decision-making or for any other clinical use, and the performance of these models for clinical use has not been established. You bear sole responsibility for any use of these models, data and software, including incorporation into any product intended for clinical use.
207
 
 
208
 
209
+ ## Citation
210
 
211
+ If you use the code, data, models, or results, please cite our [preprint](https://aka.ms/dayhoff/preprint).