About this document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Ori Ram
- sl:arxiv_num : 2212.10380
- sl:arxiv_published : 2022-12-20T16:03:25Z
- sl:arxiv_summary : Dual encoders are now the dominant architecture for dense retrieval. Yet, we
have little understanding of how they represent text, and why this leads to
good performance. In this work, we shed light on this question via
distributions over the vocabulary. We propose to interpret the vector
representations produced by dual encoders by projecting them into the model's
vocabulary space. We show that the resulting distributions over vocabulary
tokens are intuitive and contain rich semantic information. We find that this
view can explain some of the failure cases of dense retrievers. For example,
the inability of models to handle tail entities can be explained via a tendency
of the token distributions to forget some of the tokens of those entities. We
leverage this insight and propose a simple way to enrich query and passage
representations with lexical information at inference time, and show that this
significantly improves performance compared to the original model in
out-of-domain settings.@en
- sl:arxiv_title : What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary@en
- sl:arxiv_updated : 2022-12-20T16:03:25Z
- sl:bookmarkOf : https://arxiv.org/abs/2212.10380
- sl:creationDate : 2022-12-21
- sl:creationTime : 2022-12-21T18:32:12Z
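The projection described in the abstract — interpreting a dense vector by scoring it against every token embedding and normalizing — can be sketched as below. This is a minimal illustration with random stand-in matrices, not the paper's actual models: `emb` plays the role of the model's (vocabulary-sized) output embedding matrix and `q` the role of an encoded query or passage vector.

```python
import numpy as np

def vocab_projection(vec, emb_matrix):
    """Project a dense representation into a distribution over the
    vocabulary: score it against every token embedding, then softmax."""
    logits = emb_matrix @ vec          # (vocab_size,) similarity scores
    logits -= logits.max()             # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 1000, 64
emb = rng.normal(size=(vocab_size, hidden_dim))  # stand-in embedding matrix
q = rng.normal(size=hidden_dim)                  # stand-in encoded query

dist = vocab_projection(q, emb)
top_tokens = np.argsort(dist)[::-1][:5]          # most probable token ids
```

In the paper's setting, inspecting `top_tokens` (mapped back to strings) is what reveals which lexical items a representation retains or "forgets", e.g. missing tokens of a tail entity.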