---
language:
- en
tags:
- retrieval
- document_expansion
datasets:
- irds:msmarco-passage
library_name: pyterrier
---

A Doc2Query model based on `t5-base` and trained on MS MARCO. This is a version of [the checkpoint released by the original authors](https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-passage/t5-base.zip), converted to PyTorch format and ready for use in [`pyterrier_doc2query`](https://github.com/terrierteam/pyterrier_doc2query).

**Creating a transformer**

```python
import pyterrier as pt
pt.init()  # start the Terrier (Java) backend
from pyterrier_doc2query import Doc2Query

# Load this checkpoint by its model ID
doc2query = Doc2Query('macavaney/doc2query-t5-base-msmarco')
```

**Transforming documents**

```python
import pandas as pd

# Generated queries are returned in a new `querygen` column
doc2query(pd.DataFrame([
    {'docno': '0', 'text': 'Hello Terrier!'},
    {'docno': '1', 'text': 'Doc2Query expands queries with potentially relevant queries.'},
]))
# docno                                               text                                           querygen
#     0                                     Hello Terrier!  hello terrier what kind of dog is a terrier wh...
#     1  Doc2Query expands queries with potentially rel...  can dodoc2query extend query query? what is do...
```

**Indexing transformed documents**

```python
doc2query.append = True  # append querygen to text
indexer = pt.IterDictIndexer('./my_index', fields=['text'])
pipeline = doc2query >> indexer  # expand each document, then index it
pipeline.index([
    {'docno': '0', 'text': 'Hello Terrier!'},
    {'docno': '1', 'text': 'Doc2Query expands queries with potentially relevant queries.'},
])
```
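
Conceptually, setting `append` folds the generated queries into the `text` field, so a standard indexer sees the expanded document. The sketch below illustrates that idea in plain Python. The helper `append_querygen` and the single-space separator are assumptions made for illustration, not the library's implementation.

```python
def append_querygen(doc: dict, sep: str = ' ') -> dict:
    """Hypothetical helper: return a copy of `doc` whose text ends with its querygen."""
    expanded = dict(doc)  # shallow copy, so the input record is left untouched
    querygen = expanded.get('querygen', '')
    if querygen:
        # Assumed separator: a single space between the text and the generated queries
        expanded['text'] = expanded['text'] + sep + querygen
    return expanded


doc = {'docno': '0', 'text': 'Hello Terrier!', 'querygen': 'what kind of dog is a terrier'}
expanded = append_querygen(doc)
print(expanded['text'])
```

Because the example pipeline indexes only the `text` field, the generated queries become searchable once they are part of that field.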

**Expanding and indexing a dataset**

```python
# Reuses the `pipeline` built in the indexing example
dataset = pt.get_dataset('irds:vaswani')
pipeline.index(dataset.get_corpus_iter())
```
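
Generating queries with T5 for every document is compute-intensive, so it can help to try a pipeline on a handful of documents first. The sketch below assumes the corpus iterator behaves like an ordinary Python generator of dicts. `fake_corpus_iter` is a hypothetical stand-in so the snippet runs without PyTerrier.

```python
from itertools import islice


def fake_corpus_iter():
    """Hypothetical stand-in for a corpus iterator: yields dicts with docno and text."""
    for i in range(1000):
        yield {'docno': str(i), 'text': f'document number {i}'}


# Keep only the first 3 documents; the rest of the generator is never consumed
sample = list(islice(fake_corpus_iter(), 3))
print([d['docno'] for d in sample])
```

With the real objects, the same idea would presumably read `pipeline.index(islice(dataset.get_corpus_iter(), 3))`, assuming `pipeline.index` accepts any iterable of document dicts.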

## References

- [Nogueira20]: Rodrigo Nogueira and Jimmy Lin. From doc2query to docTTTTTquery. https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf
- [Macdonald20]: Craig Macdonald and Nicola Tonellotto. Declarative Experimentation in Information Retrieval using PyTerrier. In Proceedings of ICTIR 2020. https://arxiv.org/abs/2007.14271