---
license: cc-by-nc-sa-4.0
widget:
- text: ACCTGA<mask>TTCTGAGTC
tags:
- DNA
- biology
- genomics
- segmentation
---
# segment-nt-multi-species

Segment-NT-multi-species is a segmentation model that leverages the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomic elements in a sequence at single-nucleotide resolution. It was obtained by finetuning the [Segment-NT](https://huggingface.co/InstaDeepAI/segment_nt) model on a dataset encompassing the human genome as well as the genomes of 5 additional species: mouse, chicken, fly, zebrafish and worm.

For finetuning on the multi-species genomes, we curated a dataset from a subset of the annotations used to train **Segment-NT**, mainly because only this subset of annotations is
available for these species. The annotations therefore cover the 7 main gene elements available from [Ensembl](https://www.ensembl.org/index.html), namely protein-coding gene, 5’UTR, 3’UTR, intron, exon,
and splice acceptor and donor sites.


**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
- **Paper:** [Segmenting the genome at single-nucleotide resolution with DNA foundation models]() TODO: Add link to preprint

### How to use

Until its next release, the `transformers` library needs to be installed from source with the following command in order to use the models:
```bash
pip install --upgrade git+https://github.com/huggingface/transformers.git
```

The following snippet retrieves both the logits and embeddings from a dummy DNA sequence.
```python
# Load model and tokenizer
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)
model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)

# Choose the length to which the input sequences are padded. By default, the
# model max length is chosen, but feel free to decrease it as the time taken to
# obtain the embeddings increases significantly with it.
# The number of DNA tokens (excluding the prepended CLS token) needs to be divisible by
# 2 to the power of the number of downsampling blocks, i.e. 4.
max_length = 12 + 1

assert (max_length - 1) % 4 == 0, (
    "The number of DNA tokens (excluding the prepended CLS token) needs to be divisible by "
    "2 to the power of the number of downsampling blocks, i.e. 4.")

# Create dummy DNA sequences and tokenize them
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Infer
attention_mask = tokens != tokenizer.pad_token_id
outs = model(
    tokens,
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Obtain the logits over the genomic features
logits = outs.logits.detach()
# Transform them into probabilities
probabilities = torch.nn.functional.softmax(logits, dim=-1)
print(f"Probabilities shape: {probabilities.shape}")

# Get probabilities associated with intron
idx_intron = model.config.features.index("intron")
probabilities_intron = probabilities[:,:,idx_intron]
print(f"Intron probabilities shape: {probabilities_intron.shape}")
```
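
The same pattern extends to any element type listed in `model.config.features`. Below is a minimal, hedged sketch that reuses `model`, `probabilities` and `torch` from the snippet above; the 0.5 cutoff and the chosen feature names are illustrative assumptions rather than values from the paper.
```python
# Illustrative follow-up (assumptions noted above): binary masks for several features.
features_of_interest = ["exon", "intron"]  # assumed to appear in model.config.features

for name in features_of_interest:
    idx = model.config.features.index(name)
    # Same indexing convention as in the snippet above: the feature axis is the third dimension.
    feature_probs = probabilities[:, :, idx]
    # Illustrative 0.5 cutoff to turn per-position probabilities into a segmentation mask.
    feature_mask = feature_probs > 0.5
    print(f"{name}: mask of shape {tuple(feature_mask.shape)}")
```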


## Training data

The **segment-nt-multi-species** model was finetuned on the human, mouse, chicken, fly, zebrafish and worm genomes. For each species, a subset of chromosomes is held out as a
validation set for training monitoring and as a test set for the final evaluation.

## Training procedure

### Preprocessing

The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which tokenizes sequences as 6-mer tokens as described in the [Tokenization](https://github.com/instadeepai/nucleotide-transformer#tokenization-abc) section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model are then of the form:

```
<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
```
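
This tokenization can be checked directly. The following is a small sketch, assuming the tokenizer exposes the standard `transformers` API used in the inference snippet above; the printed output is what we would expect under standard NT 6-mer tokenization.
```python
# Minimal sketch: inspect how a DNA sequence is split into 6-mer tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "InstaDeepAI/segment_nt_multi_species", trust_remote_code=True
)

sequence = "ACGTGTACGTGCACGGACGACTAGTCAGCA"  # 30 nucleotides = 5 six-mers
token_ids = tokenizer(sequence)["input_ids"]
print(tokenizer.convert_ids_to_tokens(token_ids))
# Expected to print the CLS token followed by the five 6-mers shown above.
```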

### Training

The model was finetuned on a DGX H100 node with 8 GPUs on a total of 8B tokens for 3 days.


### Architecture

The model is composed of the [nucleotide-transformer-v2-500m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) encoder, from which we removed
the language model head and replaced it with a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these
blocks consists of 2 convolutional layers with 1,024 and 2,048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters
to 562M.
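
For readers who want a concrete picture of this head, the following PyTorch sketch mirrors the description above: 2 downsampling and 2 upsampling convolutional blocks with 1,024 and 2,048 kernels, followed by a per-position projection onto the annotated element types. It is an illustration only, not the released implementation; the kernel size, the skip-connection layout, the embedding dimension of 1024 and the projection onto 7 feature classes are assumptions.
```python
# Illustrative sketch of a 1D U-Net segmentation head; NOT the released implementation.
# Channel sizes follow the description above; kernel size, skip connections, embed_dim
# and num_features are assumptions made for the example.
import torch
import torch.nn as nn


class UNet1DHead(nn.Module):
    def __init__(self, embed_dim: int = 1024, num_features: int = 7):
        super().__init__()

        def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
            # Each block is made of 2 convolutional layers, as described above.
            return nn.Sequential(
                nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(),
            )

        # Two downsampling blocks with 1,024 and 2,048 kernels respectively.
        self.down1 = conv_block(embed_dim, 1024)
        self.down2 = conv_block(1024, 2048)
        self.pool = nn.MaxPool1d(kernel_size=2)

        # Two upsampling blocks restoring the original sequence length.
        self.upconv1 = nn.ConvTranspose1d(2048, 1024, kernel_size=2, stride=2)
        self.up1 = conv_block(1024 + 2048, 1024)
        self.upconv2 = nn.ConvTranspose1d(1024, 1024, kernel_size=2, stride=2)
        self.up2 = conv_block(1024 + 1024, embed_dim)

        # Per-position projection onto the genomic feature classes.
        self.head = nn.Conv1d(embed_dim, num_features, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim) token embeddings from the NT encoder.
        x = x.transpose(1, 2)                    # -> (batch, embed_dim, L)
        e1 = self.down1(x)                       # (batch, 1024, L)
        e2 = self.down2(self.pool(e1))           # (batch, 2048, L / 2)
        bottom = self.pool(e2)                   # (batch, 2048, L / 4), hence L % 4 == 0
        u1 = self.up1(torch.cat([self.upconv1(bottom), e2], dim=1))  # (batch, 1024, L / 2)
        u2 = self.up2(torch.cat([self.upconv2(u1), e1], dim=1))      # (batch, embed_dim, L)
        return self.head(u2).transpose(1, 2)     # (batch, seq_len, num_features)


# Shape check on dummy embeddings: 2 sequences of 8 tokens (length divisible by 4).
print(UNet1DHead()(torch.randn(2, 8, 1024)).shape)  # torch.Size([2, 8, 7])
```
In this layout, each of the two pooling steps halves the token dimension, which is why the inference snippet above requires the number of DNA tokens to be divisible by 2 to the power of the number of downsampling blocks, i.e. 4.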

### BibTeX entry and citation info

#TODO: Add bibtex citation here
```bibtex

```