About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Manuel Faysse
- sl:arxiv_num : 2407.01449
- sl:arxiv_published : 2024-06-27T15:45:29Z
- sl:arxiv_summary : Documents are visually rich structures that convey information through text, as well as tables, figures, page layouts, or fonts. While modern document retrieval systems exhibit strong performance on query-to-text matching, they struggle to exploit visual cues efficiently, hindering their performance on practical document retrieval applications such as Retrieval Augmented Generation. To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieving tasks spanning multiple domains, languages, and settings. The inherent shortcomings of modern systems motivate the introduction of a new retrieval model architecture, ColPali, which leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
- sl:arxiv_title : ColPali: Efficient Document Retrieval with Vision Language Models
- sl:arxiv_updated : 2024-07-02T13:02:58Z
- sl:bookmarkOf : https://arxiv.org/abs/2407.01449
- sl:creationDate : 2024-09-07
- sl:creationTime : 2024-09-07T13:56:46Z
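
Note on the "late interaction matching mechanism" named in the summary: the record does not describe it, so the following is only a minimal sketch of the general ColBERT-style MaxSim scoring rule that the term usually refers to, not the paper's implementation. For each query-token embedding, take its best dot-product match among the page's embeddings, then sum those maxima. The function names (`late_interaction_score`, `rank_pages`) and the toy vectors are hypothetical, chosen for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_vecs: Sequence[Vector],
                           page_vecs: Sequence[Vector]) -> float:
    """MaxSim-style score: for every query vector, keep the highest
    similarity to any page vector, then sum over query vectors."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs: Sequence[Vector],
               pages: Sequence[Sequence[Vector]]) -> List[int]:
    """Return page indices ordered from highest to lowest score."""
    scores = [late_interaction_score(query_vecs, page) for page in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy data (assumed, not from the paper): 2-d embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0]]   # matches both query vectors well
page_b = [[0.5, 0.5]]               # one blurry vector for the whole page

ranking = rank_pages(query, [page_b, page_a])
```

Because each page keeps one vector per token or image patch instead of a single pooled vector, page embeddings can be computed offline and only this cheap max-and-sum step runs at query time.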