
TFM-tokenizer

TFM-tokenizer is trained on SmallCorpus and supports table understanding, document retrieval, tool invocation, and reasoning.

The tokenizer was trained on 2M samples drawn from the following mixture:

  • Web-EN 50%
  • Web-ZH 20%
  • TextBook-EN 15%
  • TextBook-ZH 5%
  • Code 5%
  • Math 5%

How to use

With text
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

text = "I am TFM, a table foundation model."

# The call returns a BatchEncoding with input_ids and attention_mask.
encoding = tokenizer([text], return_tensors="pt")
print(encoding)
{
    'input_ids': tensor([[128000,     40,   1097,    350,  26691,     11,    264,   2007,  16665,   1646,     13]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
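
Assuming TFMTokenizerFast keeps the standard PreTrainedTokenizerFast decoding behavior, the ids round-trip back to text. A minimal sketch; the leading id 128000 appears to be a BOS special token and is dropped here:

print(tokenizer.decode(encoding["input_ids"][0], skip_special_tokens=True))
# expected: I am TFM, a table foundation model.
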
With table
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

table = [
    ["Name", "Age", "City"],
    ["Jingze", "21", "Guangzhou"],
]

# Returns token ids plus per-token row and column indices.
encoding = tokenizer.batch_process_tables([table])
print(encoding)
{
    'input_ids': tensor([[  678, 17166, 13020,    41,   287,  3059,  1691, 17198,   526, 52865]]),
    'row_ids': tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]),
    'col_ids': tensor([[0, 1, 2, 0, 0, 0, 1, 2, 2, 2]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
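
Alongside the flat input_ids, row_ids and col_ids record the 0-indexed row and column each token came from: the header cells "Name", "Age", "City" are row 0, the data row is row 1, and multi-token cells such as "Guangzhou" share a single column id.
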
With conversation
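TFMTokenizerFast is loaded through the Transformers tokenizer interface, so if it ships a chat template, a conversation can be encoded with the standard apply_chat_template method. This is a minimal sketch under that assumption; the messages below are illustrative.

from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

# Standard Hugging Face chat format: a list of role/content messages.
messages = [
    {"role": "user", "content": "What can TFM do with a table?"},
    {"role": "assistant", "content": "It can answer questions over rows and columns."},
    {"role": "user", "content": "Show me how."},
]

# Renders the messages with the tokenizer's chat template;
# add_generation_prompt=True appends the assistant turn prefix.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
)
print(input_ids)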


With documents
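Recent Transformers versions let apply_chat_template take a documents argument for chat templates with a retrieval-augmented generation section. Whether TFM's template renders documents is an assumption; the title/text keys follow the standard Transformers RAG example.

from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

# Retrieved passages, passed to the chat template alongside the messages.
documents = [
    {"title": "TFM overview", "text": "TFM is a table foundation model trained on SmallCorpus."},
]

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What corpus was TFM trained on?"}],
    documents=documents,
    add_generation_prompt=True,
    return_tensors="pt",
)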


With tools
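apply_chat_template also accepts a tools argument; in current Transformers, a plain Python function with type hints and a docstring is converted into a JSON schema for the template. The get_row_count function below is hypothetical, and whether TFM's template has a tool section is an assumption.

from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

def get_row_count(table_name: str) -> int:
    """
    Get the number of rows in a named table.

    Args:
        table_name: The name of the table to inspect.
    """
    ...

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "How many rows does the users table have?"}],
    tools=[get_row_count],  # the schema is derived from the signature and docstring
    add_generation_prompt=True,
    return_tensors="pt",
)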


With reasoning
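Extra keyword arguments to apply_chat_template are forwarded to the chat template, and some templates use a flag such as enable_thinking to toggle an explicit reasoning trace. Whether TFM's template defines such a flag is an assumption.

from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Which city appears most often in this table?"}],
    add_generation_prompt=True,
    enable_thinking=True,  # hypothetical template flag, forwarded to the template
    return_tensors="pt",
)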


With documents, tools, and reasoning
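Since documents, tools, and template flags are independent arguments to apply_chat_template, they combine in a single call, under the same assumptions as the three sections above (get_row_count is the hypothetical tool defined earlier).

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Using the overview, how many rows does the users table have?"}],
    documents=[{"title": "TFM overview", "text": "TFM is trained on SmallCorpus."}],
    tools=[get_row_count],
    add_generation_prompt=True,
    enable_thinking=True,  # hypothetical, as above
    return_tensors="pt",
)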

