michael-guenther committed on
Commit
9349adf
·
verified ·
1 Parent(s): 8e64b38

Update vidore_eval.md

  1. vidore_eval.md +9 -1
vidore_eval.md CHANGED
@@ -15,4 +15,12 @@ vidore-benchmark evaluate-retriever \
  --collection-name jinaai/document-screenshot-retrieval-benchmark-small-684831c022c53b21c313b449 \
  --dataset-format qa \
  --split test
- ```
+ ```
+
+ ## Evaluate Pure Text Retrieval Models on Refined Vidore Tasks
+
+ The original Vidore datasets contain multiple text chunks per image, which makes it possible to evaluate text retrieval models on them.
+ Those text chunks are extracted from the document pages using different tools such as [Unstructured](https://github.com/Unstructured-IO/unstructured), OCR models, and LLMs.
+ To evaluate text retrieval models on our filtered versions of the Vidore datasets, you can use the datasets in the collection `https://huggingface.co/collections/jinaai/jina-vdr-vidoreocr-tasks-6852cfc55ccf837e7fecfa1b`.
+
+ It is also possible to evaluate jina-embeddings-v4 and other vision retrieval models on them. However, this takes more time and should lead to the same evaluation results as running the versions of the datasets in the Jina VDR collection.
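
Not part of the commit: the added section says each page image comes with several text chunks, so a text retriever has to turn chunk-level scores into a page-level ranking before it can be scored. The sketch below is one possible way to do that, under stated assumptions. It takes the maximum chunk score per page and reports nDCG@5. Both choices are assumptions for illustration and are not taken from `vidore_eval.md`. The embeddings are toy vectors, and `rank_pages` and `ndcg_at_k` are hypothetical helper names.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_pages(query_vec, chunks):
    """Rank pages by their best-matching chunk.

    chunks: list of (page_id, embedding) pairs, several per page.
    Max-over-chunks aggregation is an assumption, not the documented method.
    """
    best = defaultdict(lambda: float("-inf"))
    for page_id, vec in chunks:
        best[page_id] = max(best[page_id], cosine(query_vec, vec))
    return sorted(best, key=best.get, reverse=True)


def ndcg_at_k(ranked, relevant, k=5):
    """Binary-relevance nDCG@k for a single query."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, page_id in enumerate(ranked[:k])
        if page_id in relevant
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0


# Toy data: page_a has two chunks, the other pages have one each.
chunks = [
    ("page_a", [1.0, 0.0]),
    ("page_a", [0.6, 0.8]),
    ("page_b", [0.0, 1.0]),
    ("page_c", [-1.0, 0.0]),
]
query = [0.6, 0.8]

ranking = rank_pages(query, chunks)
score = ndcg_at_k(ranking, relevant={"page_a"}, k=5)
print(ranking, score)
```

In a real run the toy vectors would be replaced by embeddings from the text retrieval model under test, and the chunks would come from the datasets in the collection linked above.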