# Tokenizing DNA Sequences

Tokenizers for DNA sequences, built for the enigma-1.5b model.
## Overview

DNA (Deoxyribonucleic Acid) is made of 4 nucleobases: Adenine, Thymine, Guanine, and Cytosine, written A, T, G, and C. Just as the alphabet is the most basic unit of English text, these nucleobases are the most basic units of DNA, so we tokenize on the basis of these characters and their combinations. This means our initial vocab is `['A', 'T', 'G', 'C']` instead of the 256 UTF-8 byte values.

Read more about DNA: [Wikipedia/DNA](https://en.wikipedia.org/wiki/DNA)
## Tokenizers
### Base Level

This tokenizer is very basic: like a per-character tokenizer, it enumerates every unique character present in the training file. In our case that means only the 4 bases, plus '`\n`' and 4 special tokens represented as single characters: P, M, U, and S for padding, mask, unknown, and space, respectively.

```python
self.init_vocab = {"\n": 1, "A": 2, "T": 3, "G": 4, "C": 5, "P": 6, "M": 7, "U": 8, "S": 9}
```
For encoding and decoding, two lookup tables, `string_to_index` & `index_to_string`, map each character to a number from 1 to 9 and back; the decoder takes those numbers and returns the joined string of the corresponding characters. Note `start=1`, so the indices line up with `init_vocab`:

```python
self.string_to_index = {ch: i for i, ch in enumerate(self.chars, start=1)}
self.index_to_string = {i: ch for i, ch in enumerate(self.chars, start=1)}
```
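Putting the pieces together, here is a minimal, self-contained sketch of the base-level round trip. The standalone `encode`/`decode` functions and the unknown-character fallback are illustrative, not the repo's exact class API:

```python
# Character-level DNA tokenizer sketch (illustrative, standalone version
# of the base-level scheme described above).
chars = ["\n", "A", "T", "G", "C", "P", "M", "U", "S"]
string_to_index = {ch: i for i, ch in enumerate(chars, start=1)}
index_to_string = {i: ch for i, ch in enumerate(chars, start=1)}

def encode(seq: str) -> list[int]:
    # Map each character to its index; anything unexpected falls back to 'U' (unknown).
    return [string_to_index.get(ch, string_to_index["U"]) for ch in seq]

def decode(ids: list[int]) -> str:
    # Join the characters corresponding to each index back into a string.
    return "".join(index_to_string[i] for i in ids)

print(encode("ATGC\n"))                      # [2, 3, 4, 5, 1]
print(decode(encode("ATGC\n")) == "ATGC\n")  # True
```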
### K-Mer Tokenization

Say we have a long DNA sequence. This tokenizer splits it into chunks of consecutive bases, each chunk of length `k_mer`, which defaults to 4.

The `build_vocab()` function then builds a vocab from all tokenized sequences by storing them in a dictionary, with each unique chunk as key and its index as value. Finally, the generated vocab can be saved with `save_model()` and loaded back later for use.
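The splitting step is easy to picture. Here is a hedged sketch of it, assuming non-overlapping chunks; the helper names `split_kmers` and `build_vocab_from` are illustrative, not the repo's exact API:

```python
def split_kmers(seq: str, k_mer: int = 4) -> list[str]:
    # Cut the sequence into consecutive, non-overlapping chunks of length k_mer;
    # the last chunk may be shorter if len(seq) is not a multiple of k_mer.
    return [seq[i:i + k_mer] for i in range(0, len(seq), k_mer)]

def build_vocab_from(sequences: list[str], k_mer: int = 4) -> dict[str, int]:
    # Each unique k-mer becomes a key; its insertion order becomes its index.
    vocab: dict[str, int] = {}
    for seq in sequences:
        for kmer in split_kmers(seq, k_mer):
            vocab.setdefault(kmer, len(vocab))
    return vocab

print(split_kmers("ATGCATGGC"))           # ['ATGC', 'ATGG', 'C']
print(build_vocab_from(["ATGCATGCGGTA"])) # {'ATGC': 0, 'GGTA': 1}
```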
```python
tokenizer.load_model('../tokenizer/trained models/base_5k.json')
```
I used this tokenizer to train a decoder-only model; here is how to use it:
```python
from tokenizer import KMerTokenizer

tokenizer = KMerTokenizer(k_mers=5)
tokenizer.build_vocab([train_data])
tokenizer.save_model('../tokenizer/trained models')

encoded_tokens = tokenizer.encode(test_data)
decoded_tokens = tokenizer.decode(encoded_tokens)
```
### Sub-K-Mer Level |
|
It works kind of same as BPE tokenizer, however has some changes in the way it builds its vocab. It first splits it's training into sequences containing only 4 consecutive letters of DNA (same as K-Mer tokenizer with k=4) and then it trains the tokenizer to build new merges based on the frequency of those pairs, like it would have done with the BPE tokenizer. |
|
It can be trained quiet easily and then model file can be saved in two different files; *'.model': contains merges* & *'.json: contains vocab'*. |
|
Encoding and decoding works same as the BPE. |
|
```python
from tokenizer import KmerPairTokenizer

tokenizer = KmerPairTokenizer()
tokenizer.train(train_data)
tokenizer.save_model('../tokenizer/trained models')

encoded_tokens = tokenizer.encode(test_data)
decoded_tokens = tokenizer.decode(encoded_tokens)
```
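For intuition, the merge-building idea can be sketched in a few lines: count adjacent pairs of k-mer tokens and greedily merge the most frequent pair, BPE-style. This is an illustrative sketch, not the repo's exact implementation:

```python
from collections import Counter

def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
    # Count every adjacent pair of tokens and return the most frequent one.
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    # Replace every occurrence of `pair` with its concatenation.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["ATGC", "GGTA", "ATGC", "GGTA", "TTAC"]
pair = most_frequent_pair(tokens)  # ('ATGC', 'GGTA'), seen twice
print(merge_pair(tokens, pair))    # ['ATGCGGTA', 'ATGCGGTA', 'TTAC']
```

Repeating this until a target vocab size is reached yields the merge table stored in the *'.model'* file.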
This tokenizer mostly works, but it has one problem in its `decode()` function: it outputs more tokens than were actually present, meaning:

```python
test_data == decoded_tokens  # False
```

I'll try to fix this and make it work soon, but for now it's not suitable for use, at least not for decoding.