enigma1 / tokenizer /tokenizer.md

added tokenizer files

f5eb6b9 verified over 1 year ago

3.49 kB

	# tokenizing DNA sequences

	Tokenizers for DNA tokenization for enigma-1.5b model.
	## Overview
	DNA-(Dexoy-ribo Nucleic Acid) has 4 nucleobases named Adenine, Thymine, Guanine, Cytosine or A, T, G, C. Just like in english we have most basic things: alphabets, in DNA, these nucleobases are most basic things. We need to tokenize them on the basis of these pairs and characters. So this means, our initial vocab is going to be ['A', 'T', 'G', 'C'] instead of 256 utf-8 characters.

	Read more about DNA: [Wikipedia/DNA](https://en.wikipedia.org/wiki/DNA)

	![dna seq](https://www.genome.gov/sites/default/files/media/images/tg/DNA.jpg)

	## Tokenizer:

	### Base Level
	It's very basic in working, just like per-character tokenizer which enumerates each and every unique character present in the train file. In our case, we'll have only 4-bases along with '`\n`' and 4-special tokens represented as characters. P, M, U, S as padding, mask, unknown & space token, respectively.

	```python
	self.init_vocab = {"\n": 1, "A": 2, "T": 3, "G": 4, "C": 5, "P": 6, "M": 7, "U": 8, "S": 9}
	```

	For encoding and decoding purpose, two functions `string_to_index` & `index_to_string` convert each character into a number from 1 to 9 and decoder takes those 1 to 9 numbers and returns the joint string of respective characters.
	```python
	self.string_to_index = {ch: i for i, ch in enumerate(self.chars)}
	self.index_to_string = {i: ch for i, ch in enumerate(self.chars)}
	```

	### K-Mer Tokenization
	Let's say we have a long sequence of DNA. This tokenizer splits that sequence into sections of consecutively occurring bases, and each section has length of value equal to `k_mer` which is by default set to 4.
	`build_vocab()` function then builds a vocab out of all tokenized sequences by storing them into a new dictionary, seq as key and index as value. And finally, you can save the generated vocab using `save_model()` function and can be loaded later for use.
	```python
	tokenizer.load_model('../tokenizer/trained models/base_5k.json')
	```
	I used this tokenizer to train decoder-only model, here is how to use it:
	```python
	from tokenizer import KMerTokenizer

	tokenizer = KMerTokenizer(k_mers=5)
	tokenizer.build_vocab([train_data])
	tokenizer.save_model('../tokenizer/trained models')

	encoded_tokens = tokenizer.encode(test_data)
	decoded_tokens = tokenizer.decode(encoded_tokens)
	```

	### Sub-K-Mer Level
	It works kind of same as BPE tokenizer, however has some changes in the way it builds its vocab. It first splits it's training into sequences containing only 4 consecutive letters of DNA (same as K-Mer tokenizer with k=4) and then it trains the tokenizer to build new merges based on the frequency of those pairs, like it would have done with the BPE tokenizer.
	It can be trained quiet easily and then model file can be saved in two different files; '.model': contains merges & '.json: contains vocab'.
	Encoding and decoding works same as the BPE.
	```python
	from tokenizer import KmerPairTokenizer

	tokenizer = KmerPairTokenizer()
	tokenizer.train(train_data)
	tokenizer.save_model('../tokenizer/trained models')

	encoded_tokens = tokenizer.encode(test_data)
	decoded_tokens = tokenizer.decode(encoded_tokens)
	```
	This tokenizer works fine but it has one problem in decode function, it outputs more tokens than actual present tokens, means:
	```shell
	test_data == decoded_tokens is False
	```
	I'll try to fix it and make this work soon, but for now, it's not suitable for use, at-least not for decoding.