About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Jinhyuk Lee
- sl:arxiv_num : 2304.01982
- sl:arxiv_published : 2023-04-04T17:37:06Z
- sl:arxiv_summary : Multi-vector retrieval models such as ColBERT [Khattab and Zaharia, 2020]
allow token-level interactions between queries and documents, and hence achieve
state of the art on many information retrieval benchmarks. However, their
non-linear scoring function cannot be scaled to millions of documents,
necessitating a three-stage process for inference: retrieving initial
candidates via token retrieval, accessing all token vectors, and scoring the
initial candidate documents. The non-linear scoring function is applied over
all token vectors of each candidate document, making the inference process
complicated and slow. In this paper, we aim to simplify the multi-vector
retrieval by rethinking the role of token retrieval. We present XTR,
ConteXtualized Token Retriever, which introduces a simple, yet novel, objective
function that encourages the model to retrieve the most important document
tokens first. The improvement to token retrieval allows XTR to rank candidates
only using the retrieved tokens rather than all tokens in the document, and
enables a newly designed scoring stage that is two-to-three orders of magnitude
cheaper than that of ColBERT. On the popular BEIR benchmark, XTR advances the
state-of-the-art by 2.8 nDCG@10 without any distillation. Detailed analysis
confirms our decision to revisit the token retrieval stage, as XTR demonstrates
much better recall of the token retrieval stage compared to ColBERT.@en
- sl:arxiv_title : Rethinking the Role of Token Retrieval in Multi-Vector Retrieval@en
- sl:arxiv_updated : 2023-04-04T17:37:06Z
- sl:bookmarkOf : https://arxiv.org/abs/2304.01982
- sl:creationDate : 2023-04-05
- sl:creationTime : 2023-04-05T08:33:18Z
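
The summary above contrasts scoring a candidate over all of its token vectors with scoring it over only the tokens that token retrieval returned. A minimal sketch of that contrast follows, in pure Python. The function names, the toy vectors, and the `retrieved` mapping (query token index to retrieved document token indices) are all hypothetical. Giving a query token with no retrieved document token a contribution of 0.0 is a simplifying assumption made here, not something the summary states; the paper's actual scoring may treat that case differently.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product between two token vectors."""
    return sum(a * b for a, b in zip(u, v))


def sum_of_max_score(query: List[Vector], doc: List[Vector]) -> float:
    """ColBERT-style late interaction: for each query token, take the
    best-matching document token over ALL document tokens, then sum."""
    return sum(max(dot(q, d) for d in doc) for q in query)


def retrieved_only_score(
    query: List[Vector],
    doc: List[Vector],
    retrieved: Dict[int, List[int]],
) -> float:
    """Same sum-of-max, restricted to document tokens that were retrieved.

    retrieved[i] lists the indices of this document's tokens that token
    retrieval returned for query token i. A query token with no retrieved
    token contributes 0.0 (an assumption made for this sketch).
    """
    total = 0.0
    for i, q in enumerate(query):
        hits = retrieved.get(i, [])
        if hits:
            total += max(dot(q, doc[j]) for j in hits)
    return total


# Toy example: two query tokens, three document tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

full = sum_of_max_score(query, doc)
partial = retrieved_only_score(query, doc, {0: [0], 1: [1]})
```

In this toy case the restricted score can only be lower than or equal to the full score, because each maximum is taken over a subset of the document tokens. That is why the quality of the token retrieval stage matters when the remaining tokens are never loaded.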