
TFM-tokenizer

TFM-tokenizer is trained on SmallCorpus and supports table understanding, document retrieval, tool invocation, and reasoning.

The tokenizer was trained on 2M samples drawn from the following mixture:

  • Web-EN 50%
  • Web-ZH 20%
  • TextBook-EN 15%
  • TextBook-ZH 5%
  • Code 5%
  • Math 5%

How to use

With text
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

text = "I am TFM, a table foundation model."

# The call returns a BatchEncoding with input_ids and attention_mask.
encoding = tokenizer([text], return_tensors="pt")
print(encoding)
{
    'input_ids': tensor([[128000,     40,   1097,    350,  26691,     11,    264,   2007,  16665,   1646,     13]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
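
Assuming TFMTokenizerFast keeps the standard PreTrainedTokenizerFast decoding behavior, the ids round-trip back to text. A minimal sketch; the leading id 128000 appears to be a BOS special token and is dropped here:

print(tokenizer.decode(encoding["input_ids"][0], skip_special_tokens=True))
# expected: I am TFM, a table foundation model.
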
With table
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

table = [
    ["Name", "Age", "City"],
    ["Jingze", "21", "Guangzhou"],
]

# Returns token ids plus per-token row and column indices.
encoding = tokenizer.batch_process_tables([table])
print(encoding)
{
    'input_ids': tensor([[  678, 17166, 13020,    41,   287,  3059,  1691, 17198,   526, 52865]]),
    'row_ids': tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]),
    'col_ids': tensor([[0, 1, 2, 0, 0, 0, 1, 2, 2, 2]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
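
Alongside the flat input_ids, row_ids and col_ids record the 0-indexed row and column each token came from: the header cells "Name", "Age", "City" are row 0, the data row is row 1, and multi-token cells such as "Guangzhou" share a single column id.
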
With conversation
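TFMTokenizerFast is loaded through the Transformers tokenizer interface, so if it ships a chat template, a conversation can be encoded with the standard apply_chat_template method. This is a minimal sketch under that assumption; the messages below are illustrative.

from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

# Standard Hugging Face chat format: a list of role/content messages.
messages = [
    {"role": "user", "content": "What can TFM do with a table?"},
    {"role": "assistant", "content": "It can answer questions over rows and columns."},
    {"role": "user", "content": "Show me how."},
]

# Renders the messages with the tokenizer's chat template;
# add_generation_prompt=True appends the assistant turn prefix.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
)
print(input_ids)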


With documents
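Recent Transformers versions let apply_chat_template take a documents argument for chat templates with a retrieval-augmented generation section. Whether TFM's template renders documents is an assumption; the title/text keys follow the standard Transformers RAG example.

from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

# Retrieved passages, passed to the chat template alongside the messages.
documents = [
    {"title": "TFM overview", "text": "TFM is a table foundation model trained on SmallCorpus."},
]

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What corpus was TFM trained on?"}],
    documents=documents,
    add_generation_prompt=True,
    return_tensors="pt",
)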


With tools
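apply_chat_template also accepts a tools argument; in current Transformers, a plain Python function with type hints and a docstring is converted into a JSON schema for the template. The get_row_count function below is hypothetical, and whether TFM's template has a tool section is an assumption.

from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

def get_row_count(table_name: str) -> int:
    """
    Get the number of rows in a named table.

    Args:
        table_name: The name of the table to inspect.
    """
    ...

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "How many rows does the users table have?"}],
    tools=[get_row_count],  # the schema is derived from the signature and docstring
    add_generation_prompt=True,
    return_tensors="pt",
)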


With reasoning
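Extra keyword arguments to apply_chat_template are forwarded to the chat template, and some templates use a flag such as enable_thinking to toggle an explicit reasoning trace. Whether TFM's template defines such a flag is an assumption.

from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Which city appears most often in this table?"}],
    add_generation_prompt=True,
    enable_thinking=True,  # hypothetical template flag, forwarded to the template
    return_tensors="pt",
)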


With documents, tools, and reasoning
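Since documents, tools, and template flags are independent arguments to apply_chat_template, they combine in a single call, under the same assumptions as the three sections above (get_row_count is the hypothetical tool defined earlier).

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Using the overview, how many rows does the users table have?"}],
    documents=[{"title": "TFM overview", "text": "TFM is trained on SmallCorpus."}],
    tools=[get_row_count],
    add_generation_prompt=True,
    enable_thinking=True,  # hypothetical, as above
    return_tensors="pt",
)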

