---
tags:
- generation
language:
- multilingual
- cs
- en
---

# mT5-base for Primed Czech+English Generative Question Answering

This is the [mt5-base](https://huggingface.co/google/mt5-base) model with an LM head for generating extractive answers,
given a small set of 2-5 demonstrations (i.e. primes).

## Priming

Note that **this is a priming model** that expects a **set of demonstrations** of your task of interest,
similarly to GPT-3.
Rather than aiming to perform well on conventional question answering, it learns to extrapolate the pattern of the given demonstrations
to novel tasks, such as Named Entity Recognition or Keyword Extraction.
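
For instance, priming the model for keyword extraction could look roughly like the following (a purely illustrative prompt of our own; the exact input patterns used in training are listed under Data & Training below):

```python
# Illustrative only: a hypothetical primed input for keyword extraction,
# following the Question/Context/Answer pattern described below.
input_text = """
Question: What are the keywords of the text? Context: The mT5 model was pre-trained on the multilingual mC4 corpus. Answer: mT5, mC4,
Question: What are the keywords of the text? Context: SQuAD is a reading-comprehension dataset built from Wikipedia articles. Answer:
"""
```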
					
						
## Data & Training

This model was trained on a combination of [English SQuAD 1.1](https://huggingface.co/datasets/squad)
and [Czech SQAD 3.0](https://lindat.cz/repository/xmlui/handle/11234/1-3069)
Question Answering datasets.

To allow the model to rely on the trend given in the demonstrations, we've **clustered** the samples by the question word(s)
in English SQuAD and by the category in Czech SQAD, and used examples from the same cluster as demonstrations
of the task during training.

The specific algorithm for selecting these demonstrations makes a big difference in the model's ability to extrapolate
to new tasks and will be shared in an upcoming article; stay tuned!
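
As a rough illustration of the clustering step (this is our own sketch, not the authors' selection algorithm), grouping English SQuAD samples by their leading question word could look like this:

```python
from collections import defaultdict

from datasets import load_dataset

# Illustration only: cluster English SQuAD samples by their leading question word
# ("what", "who", "when", ...); the actual demonstration-selection algorithm used
# for training is not reproduced here.
squad = load_dataset("squad", split="train[:1000]")

clusters = defaultdict(list)
for sample in squad:
    question_word = sample["question"].strip().split()[0].lower()
    clusters[question_word].append(sample)

# Demonstrations for a training sample would then be drawn from its own cluster,
# e.g. a few other "who" questions for a "who" question.
print(sorted(((len(v), k) for k, v in clusters.items()), reverse=True)[:5])
```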
					
						
For Czech SQAD 3.0, the original contexts (i.e. whole Wikipedia pages) were limited to a maximum of 8000 characters
per sequence of prime demonstrations.
The pre-processing script for Czech SQAD is available [here](https://huggingface.co/gaussalgo/xlm-roberta-large_extractive-QA_en-cs/blob/main/parse_czech_squad.py).
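
The linked script is authoritative; as a minimal sketch of the limit itself (assuming a plain character cut-off):

```python
# Minimal sketch, assuming a simple character cut-off; the exact policy
# (how the 8000-character budget is applied across a demonstration sequence)
# is defined in the linked parse_czech_squad.py script.
MAX_CONTEXT_CHARS = 8000

def truncate_context(context: str, limit: int = MAX_CONTEXT_CHARS) -> str:
    """Trim an over-long Wikipedia page to the character budget."""
    return context[:limit]
```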
					
						
For training the model (and hence also intended for inference), we've used the following patterns with 2-7 demonstrations:

For English samples:

*input*:
```
Question: {Q1} Context: {C1} Answer: {A1},
Question: {Q2} Context: {C2} Answer: {A2},
[...possibly more demonstrations...]

Question: {Q} Context: {C} Answer:
```
=> *target*:
```
{A}
```

For Czech samples:

*input*:
```
Otázka: {Q1} Kontext: {C1} Odpověď: {A1},
Otázka: {Q2} Kontext: {C2} Odpověď: {A2},
[...possibly more demonstrations...]

Otázka: {Q} Kontext: {C} Odpověď:
```
=> *target*:
```
{A}
```
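
For convenience, a primed input in either language can be assembled programmatically. A minimal sketch (the helper and its signature are our own, not part of the model repository):

```python
# A hypothetical helper (not part of the model repository) for assembling
# primed inputs in the patterns above.
def build_primed_input(demonstrations, question, context, lang="en"):
    """demonstrations: a list of (question, context, answer) tuples."""
    if lang == "en":
        q_label, c_label, a_label = "Question:", "Context:", "Answer:"
    else:  # Czech pattern
        q_label, c_label, a_label = "Otázka:", "Kontext:", "Odpověď:"

    demo_block = "\n".join(
        f"{q_label} {q} {c_label} {c} {a_label} {a},"
        for q, c, a in demonstrations
    )
    # The final sample is separated by a blank line and ends with an open answer label.
    query = f"{q_label} {question} {c_label} {context} {a_label}"
    return f"{demo_block}\n\n{query}"


input_text = build_primed_input(
    [("Who wrote Hamlet?", "Hamlet is a tragedy by William Shakespeare.", "William Shakespeare")],
    question="Who wrote The Trial?",
    context="The Trial is a novel written by Franz Kafka.",
)
```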
					
						
The best checkpoint was picked to maximize the model's zero-shot performance on Named Entity Recognition
on an out-of-distribution domain of texts and labels.

## Intended uses & limitations

This model is intended for few-shot application to any text extraction task in English and Czech where the prompt can be stated
as a natural question. E.g., to use this model for extracting customer names from text,
prompt it with demonstrations in the following format:

```python
input_text = """
Question: What is the customer's name? Context: Origin: Barack Obama, Customer id: Bill Moe.
Answer: Bill Moe,
Question: What is the customer's name? Context: Customer id: Barack Obama, if not deliverable, return to Bill Clinton.
Answer:
"""
```

Note that despite its size, English SQuAD has a variety of reported biases,
conditioned by the relative position or type of the answer in the context, that can affect the model's performance on new data
(see, e.g., [L. Mikula (2022)](https://is.muni.cz/th/adh58/?lang=en), Chap. 4.1).

## Usage

Here is how to use this model to answer a question on a given context, using 🤗 Transformers in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("gaussalgo/mt5-base-priming-QA_en-cs")
model = AutoModelForSeq2SeqLM.from_pretrained("gaussalgo/mt5-base-priming-QA_en-cs")

# For the expected format of input_text, see the Intended uses & limitations section above
inputs = tokenizer(input_text, return_tensors="pt")

# Generate the answer to the final, unanswered question in the prime
outputs = model.generate(**inputs)

print("Answer:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
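
As a further, purely illustrative end-to-end example using the Czech pattern (the demonstrations and the `max_new_tokens` value are our own choices, not prescribed by the model card):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("gaussalgo/mt5-base-priming-QA_en-cs")
model = AutoModelForSeq2SeqLM.from_pretrained("gaussalgo/mt5-base-priming-QA_en-cs")

# Illustrative primed input following the Czech pattern from Data & Training.
input_text = (
    "Otázka: Kdo je autorem díla? Kontext: Román Proces napsal Franz Kafka. Odpověď: Franz Kafka,\n"
    "Otázka: Kdo je autorem díla? Kontext: Hamlet je tragédie Williama Shakespeara.\n"
    "Odpověď:"
)

inputs = tokenizer(input_text, return_tensors="pt")
# max_new_tokens is an illustrative setting; extracted answers are usually short.
outputs = model.generate(**inputs, max_new_tokens=16)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```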
					
						