|
--- |
|
library_name: transformers |
|
tags: [] |
|
--- |
|
|
|
# pcuenq/Hunyuan-7B-Instruct-tokenizer |
|
|
|
This is a `transformers` fast tokenizer for [mlx-community/Hunyuan-7B-Instruct-3bit](https://huggingface.co/mlx-community/Hunyuan-7B-Instruct-3bit/blob/main/tokenizer_config.json).
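You can load it directly with `AutoTokenizer` — a minimal usage sketch:

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pcuenq/Hunyuan-7B-Instruct-tokenizer")

# Round-trip a sample string
ids = tokenizer.encode("Hello, world!")
print(tokenizer.decode(ids, skip_special_tokens=True))
```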
|
|
|
## Conversion |
|
|
|
We used this code to convert the tokenizer from the original `tiktoken` format: |
|
|
|
```py
from huggingface_hub import snapshot_download
from tokenization_hy import *
from tokenizers import normalizers
from transformers import PreTrainedTokenizerFast
from transformers.convert_slow_tokenizer import TikTokenConverter

# Download the tiktoken vocab, the original tokenizer code, and the configs
snapshot_download(
    "mlx-community/Hunyuan-7B-Instruct-3bit",
    local_dir=".",
    allow_patterns=["hy.tiktoken", "tokenization_hy.py", "tokenizer_config.json", "special_tokens_map.json"],
)

original = HYTokenizer.from_pretrained(".")

# Convert the tiktoken vocab, reusing the pre-tokenization regex and the
# special tokens defined in tokenization_hy.py
converter = TikTokenConverter(
    vocab_file="hy.tiktoken",
    pattern=PAT_STR,
    additional_special_tokens=[t[1] for t in SPECIAL_TOKENS],
)
converted = converter.converted()
converted.normalizer = normalizers.NFC()

# Wrap in a fast tokenizer, copy the chat template, and push to the Hub
t_fast = PreTrainedTokenizerFast(
    tokenizer_object=converted,
    model_input_names=original.model_input_names,
    model_max_length=256 * 1024,
    clean_up_tokenization_spaces=False,
)
t_fast.chat_template = original.chat_template
t_fast.push_to_hub("Hunyuan-7B-Instruct-tokenizer")
```
|
|
|
## Verification

We checked that the fast tokenizer produces the same token ids as the original, and that decoding round-trips, on multilingual text (XNLI) and on code (a `codeparrot/github-code` subset):
|
|
|
```py
from datasets import load_dataset
from tqdm import tqdm
from tokenization_hy import HYTokenizer
from transformers import AutoTokenizer

original = HYTokenizer.from_pretrained("mlx-community/Hunyuan-7B-Instruct-3bit")
t_fast = AutoTokenizer.from_pretrained("pcuenq/Hunyuan-7B-Instruct-tokenizer")

# Testing on XNLI

xnli = load_dataset("xnli", "all_languages", split="validation")

def verify(lang, text):
    encoded_original = original.encode(text)
    encoded_fast = t_fast.encode(text)
    assert encoded_fast == encoded_original, f"Fast encode error: {lang} - {text}"
    decoded = original.decode(encoded_original)
    decoded_fast = t_fast.decode(encoded_fast, skip_special_tokens=True)
    assert decoded_fast == decoded, f"Fast decode error: {lang} - {text}"

for p in tqdm(xnli["premise"]):
    for lang, text in p.items():
        verify(lang, text)

# Testing on codeparrot subset

ds = load_dataset("codeparrot/github-code", streaming=True, trust_remote_code=True, split="train")

iterator = iter(ds)
for _ in tqdm(range(1000)):
    item = next(iterator)
    code = item["code"]
    lang = item["language"]
    verify(lang, code)
```