nreimers

metadata

language: en
datasets:
  - sentence-transformers/reddit-title-body
  - sentence-transformers/embedding-training-data
widget:
  - text: >-
      answer2question: Python is an interpreted, high-level and general-purpose
      programming language. Python's design philosophy emphasizes code
      readability with its notable use of significant whitespace. Its language
      constructs and object-oriented approach aim to help programmers write
      clear, logical code for small and large-scale projects.
license: apache-2.0

doc2query/all-with_prefix-t5-base-v1

Usage

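from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and the fine-tuned model from the Hugging Face Hub
model_name = 'doc2query/all-with_prefix-t5-base-v1'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# The prefix selects which type of output the model generates
prefix = "answer2question"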
text = "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."

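# The model expects the input in the form "<prefix>: <text>"
text = prefix+": "+text

# Truncate the input to 384 word pieces, the limit used during training
input_ids = tokenizer.encode(text, max_length=384, truncation=True, return_tensors='pt')

# Sample 5 candidate outputs of up to 64 word pieces each using nucleus (top-p) sampling
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=5)

print("Text:")
print(text)

print("\nGenerated Queries:")
for i, output in enumerate(outputs):
    query = tokenizer.decode(output, skip_special_tokens=True)
    print(f'{i + 1}: {query}')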

Training

This model was obtained by fine-tuning google/t5-v1_1-base for 575k training steps. The training script is train_script.py in this repository.

Input text was truncated to 384 word pieces. Output text was generated with a limit of 64 word pieces.

The model was trained on a large collection of datasets. For the exact dataset names and weights, see data_config.json in this repository. Most of the datasets are available at https://huggingface.co/sentence-transformers.

Among others, the datasets include:

(title, body) pairs from Reddit
(title, body) pairs and (title, answer) pairs from StackExchange and Yahoo Answers!
(title, review) pairs from Amazon reviews
(query, paragraph) pairs from MS MARCO, NQ, and GooAQ
(question, duplicate_question) pairs from Quora and WikiAnswers
(title, abstract) pairs from S2ORC
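
As a rough illustration of what per-dataset weights can mean, the sketch below draws dataset names with probability proportional to a weight. It is not the logic of train_script.py, and the configuration layout, the dataset names, and the helper `sample_dataset_names` are all hypothetical; the real names and weights are in data_config.json.

```python
import random

# Hypothetical configuration: these names and weights are illustrative only
# and are NOT taken from data_config.json.
EXAMPLE_CONFIG = [
    {"name": "dataset_a", "weight": 3},
    {"name": "dataset_b", "weight": 1},
    {"name": "dataset_c", "weight": 0},
]


def sample_dataset_names(config, num_samples, seed=0):
    """Draw dataset names with probability proportional to their weight."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same draws
    names = [entry["name"] for entry in config]
    weights = [entry["weight"] for entry in config]
    return rng.choices(names, weights=weights, k=num_samples)


if __name__ == "__main__":
    draws = sample_dataset_names(EXAMPLE_CONFIG, num_samples=1000)
    for name in ("dataset_a", "dataset_b", "dataset_c"):
        print(name, draws.count(name))
```

Under this scheme a dataset with a larger weight contributes more training pairs, and a weight of 0 excludes a dataset entirely.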

Prefix

This model was trained with prefixes: you start the input text with a specific prefix that defines what type of output text you would like to receive. Depending on the prefix, the output differs.

For example, the text about Python above produces the following outputs:

Prefix	Output
answer2question	Why should I use python in my business? ; What is the difference between Python and.NET? ; what is the python design philosophy?
review2title	Python a powerful and useful language ; A new and improved programming language ; Object-oriented, practical and accessibl
abstract2title	Python: A Software Development Platform ; A Research Guide for Python X: Conceptual Approach to Programming ; Python : Language and Approach
text2query	is python a low level language? ; what is the primary idea of python? ; is python a programming language?

These are all the available prefixes:

text2reddit
question2title
answer2question
abstract2title
review2title
news2title
text2query
question2question

For the datasets and weights used for each prefix, see data_config.json in this repository.
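
Because an unrecognized prefix is just ordinary input text to the model, it can help to check the prefix before building the input. The following is a minimal sketch: the helper names `build_input` and `build_inputs_for_all_prefixes` are hypothetical and not part of this repository, and the "prefix: text" format is the one used in the usage example.

```python
# All prefixes listed in this README.
AVAILABLE_PREFIXES = (
    "text2reddit",
    "question2title",
    "answer2question",
    "abstract2title",
    "review2title",
    "news2title",
    "text2query",
    "question2question",
)


def build_input(prefix: str, text: str) -> str:
    """Return the model input in the 'prefix: text' form (hypothetical helper)."""
    if prefix not in AVAILABLE_PREFIXES:
        raise ValueError(
            f"Unknown prefix {prefix!r}; expected one of: {', '.join(AVAILABLE_PREFIXES)}"
        )
    return f"{prefix}: {text}"


def build_inputs_for_all_prefixes(text: str) -> dict:
    """Map every available prefix to its prefixed input string (hypothetical helper)."""
    return {prefix: build_input(prefix, text) for prefix in AVAILABLE_PREFIXES}
```

The string returned by `build_input` can be passed to `tokenizer.encode` exactly as in the usage example, and `build_inputs_for_all_prefixes` gives a convenient way to compare the outputs of all prefixes on the same text.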