vocab.txt
Where can I find the vocab.txt for this multilingual model?
The vocabulary is based on SentencePiece rather than WordPiece, which is what BERT uses.
You can use the following code to print the vocab:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')
print(tokenizer.vocab)
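If a plain-text dump is wanted, the mapping can be written out one token per line, ordered by token id. The following is only a minimal sketch: `write_vocab_file` is a hypothetical helper, the toy dictionary stands in for the real token-to-id mapping (for example the value of `tokenizer.get_vocab()`), and the one-token-per-line layout is an assumption. Such a file will not make a WordPiece-based codebase tokenize text the way SentencePiece does.

```python
import tempfile
from pathlib import Path


def write_vocab_file(vocab, path):
    """Write tokens to `path`, one per line, ordered by token id.

    `vocab` is a token -> id mapping. Hypothetical helper for illustration.
    Returns the number of tokens written.
    """
    tokens_by_id = sorted(vocab.items(), key=lambda item: item[1])
    lines = [token for token, _ in tokens_by_id]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Toy vocabulary standing in for the real mapping (assumption for the demo).
toy_vocab = {"▁world": 2, "<s>": 0, "▁hello": 1}

out_path = Path(tempfile.mkdtemp()) / "vocab_dump.txt"
count = write_vocab_file(toy_vocab, out_path)
written = out_path.read_text(encoding="utf-8").splitlines()
print(count, written)
```

With the real tokenizer, the call would be `write_vocab_file(tokenizer.get_vocab(), "vocab_dump.txt")`.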
@intfloat Thank you. So I could write vocab.txt myself from the tokenizer.vocab value? I don't understand why the multilingual E5 models don't ship with a vocab.txt the way the English E5 model does.
I am asking because I am trying to convert this model to ggml format using bert.cpp, which requires vocab.txt.
As far as I know, only BERT-based models have a vocab.txt; models such as T5 and XLM-RoBERTa do not include this file.
The multilingual E5 models are based on XLM-RoBERTa, not BERT,
so I don't think you should try to run this model with a BERT codebase.
@intfloat This model supports 94 languages. How can I restrict it to specific languages from that list? I only need 40 of them.