
license: apache-2.0
widget:
  - text: Thủ đô của nước Việt Nam là <mask> Nội.
    example_title: Example 1
  - text: Cà phê được trồng nhiều ở khu vực Tây <mask> của Việt Nam.
    example_title: Example 2

CafeBERT: A Pre-Trained Language Model for Vietnamese (NAACL-2024 Findings)

The pre-trained CafeBERT model is a state-of-the-art language model for Vietnamese. (The name comes from "cafe", or coffee, a popular morning drink in Vietnam.)

CafeBERT is a large-scale multilingual language model with strong support for Vietnamese. The model is based on XLM-RoBERTa, a state-of-the-art multilingual language model, and is enhanced with a large Vietnamese corpus covering many domains, including Wikipedia and newspapers. CafeBERT performs strongly on the VLUE benchmark and on other tasks such as machine reading comprehension, text classification, natural language inference, and part-of-speech tagging.

The general architecture and experimental results of CafeBERT can be found in our paper (NAACL 2024 Findings).

Please CITE our paper when CafeBERT is used to help produce published results or is incorporated into other software.

Installation

Install the transformers and sentencepiece packages, plus PyTorch, which the usage example imports:

pip install transformers
pip install sentencepiece
pip install torch

Example usage

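from transformers import AutoModel, AutoTokenizer
import torch

# Load the pre-trained encoder and its tokenizer from the Hugging Face Hub
model = AutoModel.from_pretrained('uitnlp/CafeBERT')
tokenizer = AutoTokenizer.from_pretrained('uitnlp/CafeBERT')

# Tokenize one sentence and return PyTorch tensors
encoding = tokenizer('Cà phê được trồng nhiều ở khu vực Tây Nguyên của Việt Nam.', return_tensors='pt')

# Run a forward pass without tracking gradients (inference only)
with torch.no_grad():
    output = model(**encoding)

# output.last_hidden_state has shape (batch, sequence_length, hidden_size)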
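
The forward pass above returns one vector per token. If a single vector per sentence is needed (for example, for similarity search or as classifier input), one common option is to average the token vectors while ignoring padding. The sketch below is an illustration only: mean pooling is an assumption on our part, not a method described for CafeBERT, and the helper names `mean_pool` and `embed_sentences` are hypothetical.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, ignoring padding positions."""
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden_size)
    counts = mask.sum(dim=1).clamp(min=1.0)          # guard against division by zero
    return summed / counts


def embed_sentences(sentences, model_name="uitnlp/CafeBERT"):
    """Hypothetical helper: encode sentences and mean-pool the token vectors."""
    # Imported here so that mean_pool can be used without transformers installed
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # padding=True pads every sentence to the longest one in the batch
    encoding = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoding)

    return mean_pool(output.last_hidden_state, encoding["attention_mask"])
```

Calling `embed_sentences(['Cà phê được trồng nhiều ở khu vực Tây Nguyên của Việt Nam.', 'Thủ đô của nước Việt Nam là Hà Nội.'])` downloads the checkpoint on first use and returns a tensor with one row per sentence. Whether mean pooling or the first-token vector works better is task dependent and worth checking on your own data.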