Update README.md
README.md
CHANGED
@@ -30,6 +30,7 @@ an autoregressive fashion, using low‑temperature sampling to produce classific
 ### Training Data
 ChatNT was instruction‑tuned on a unified corpus covering 27 diverse tasks from DNA, RNA and proteins, spanning multiple species, tissues and biological processes.
 This amounted to 605 million DNA tokens (≈ 3.6 billion bases) and 273 million English tokens, sampled uniformly over tasks for 2 billion instruction tokens.
+Examples of questions and sequences for each task, as well as additional task information, can be found in [Datasets_overview.csv](Datasets_overview.csv).
 
 ### Tokenization
 DNA inputs are broken into overlapping 6‑mer tokens and padded or truncated to 2048 tokens (~ 12 kb). English prompts and
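The figures in the Tokenization context line can be checked with a small sketch. This is not ChatNT's tokenizer: the function names, the `<pad>` token, and the choice to drop trailing bases that do not fill a full 6-mer are all assumptions made for illustration. The sketch exposes the window step as a `stride` parameter because the quoted numbers depend on it: 2048 tokens cover about 12 kb (2048 × 6 = 12,288 bases) only when consecutive 6-mers do not share bases, whereas 6-mers shifted by one base would cover 2,053 bases. The same factor of 6 links 605 million DNA tokens to roughly 3.6 billion bases.

```python
from typing import List

PAD = "<pad>"  # hypothetical padding token; the real vocabulary may differ


def kmer_tokenize(seq: str, k: int = 6, stride: int = 6) -> List[str]:
    """Split a DNA string into k-mers taken every `stride` bases.

    Assumption: trailing bases that do not fill a full k-mer are dropped.
    A real tokenizer may handle them differently.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def pad_or_truncate(tokens: List[str], max_len: int = 2048, pad: str = PAD) -> List[str]:
    """Cut the token list to max_len, then right-pad it up to max_len."""
    tokens = tokens[:max_len]
    return tokens + [pad] * (max_len - len(tokens))


def bases_covered(n_tokens: int, k: int = 6, stride: int = 6) -> int:
    """Number of bases spanned by n_tokens k-mers placed `stride` bases apart."""
    return 0 if n_tokens == 0 else (n_tokens - 1) * stride + k


if __name__ == "__main__":
    tokens = kmer_tokenize("ACGTACGTACGT")      # two 6-mers with stride 6
    fixed = pad_or_truncate(tokens, max_len=4)  # padded to a fixed length
    print(fixed)
    print(bases_covered(2048))                  # stride 6
    print(bases_covered(2048, stride=1))        # stride 1
```

With the default `stride=6`, `bases_covered(2048)` gives 12,288, which matches the "~ 12 kb" in the README line.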