# TFM-tokenizer
TFM-tokenizer is trained on SmallCorpus and supports table understanding, document retrieval, tool invocation, and reasoning.
It was trained on 2M samples drawn from:
- Web-EN 50%
- Web-ZH 20%
- TextBook-EN 15%
- TextBook-ZH 5%
- Code 5%
- Math 5%
## How to use

### With text
```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

text = "I am TFM, a table foundation model."
input_ids = tokenizer([text], return_tensors="pt")
print(input_ids)
```

```
{
    'input_ids': tensor([[128000, 40, 1097, 350, 26691, 11, 264, 2007, 16665, 1646, 13]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```
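The returned ids can be mapped back to text with the usual decoding interface. A minimal sketch, assuming `TFMTokenizerFast` exposes the standard `decode` method of a Hugging Face fast tokenizer:

```python
# Round-trip the ids back to text (skip_special_tokens drops the leading special token).
decoded = tokenizer.decode(input_ids["input_ids"][0], skip_special_tokens=True)
print(decoded)
```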
### With table
```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

table = [
    ["Name", "Age", "City"],
    ["Jingze", "21", "Guangzhou"],
]
input_ids = tokenizer.batch_process_tables([table])
print(input_ids)
```

```
{
    'input_ids': tensor([[ 678, 17166, 13020, 41, 287, 3059, 1691, 17198, 526, 52865]]),
    'row_ids': tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]),
    'col_ids': tensor([[0, 1, 2, 0, 0, 0, 1, 2, 2, 2]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```
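In the table encoding, `row_ids` and `col_ids` give each token's row and column index, with the header row at index 0 in the example above. A minimal sketch of regrouping the flattened sequence into cells, assuming the fields shown above and the standard `decode` method:

```python
# Regroup the flattened token sequence into (row, col) cells and decode each one
# (assumes the standard `decode` method of a Hugging Face fast tokenizer).
ids = input_ids["input_ids"][0].tolist()
rows = input_ids["row_ids"][0].tolist()
cols = input_ids["col_ids"][0].tolist()

cells = {}
for tid, r, c in zip(ids, rows, cols):
    cells.setdefault((r, c), []).append(tid)

for (r, c), toks in sorted(cells.items()):
    print(f"row {r}, col {c}: {tokenizer.decode(toks)}")
```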
### With conversation
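No conversation example is provided yet. If the tokenizer ships a chat template, the generic Hugging Face `apply_chat_template` interface would be the natural entry point; the sketch below is an assumption, not a confirmed API of TFM-tokenizer:

```python
# Sketch: chat-style input via the generic Hugging Face chat-template API.
messages = [
    {"role": "user", "content": "Summarize the table above."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
print(input_ids)
```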
### With documents
### With tools
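No tool-invocation example is provided yet. Recent `transformers` releases let tool schemas be passed to `apply_chat_template` via the `tools=` argument; whether TFM-tokenizer's template supports this is an assumption, and `get_weather` below is a hypothetical tool:

```python
# Sketch: pass a hypothetical tool definition through the chat template.
def get_weather(city: str):
    """Return the current weather for a city.

    Args:
        city: Name of the city.
    """
    ...

messages = [{"role": "user", "content": "What's the weather in Guangzhou?"}]
input_ids = tokenizer.apply_chat_template(
    messages, tools=[get_weather], add_generation_prompt=True, return_tensors="pt"
)
```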
### With reasoning
### With documents, tools, and reasoning