About this document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Ori Ram
- sl:arxiv_num : 2212.10380
- sl:arxiv_published : 2022-12-20T16:03:25Z
- sl:arxiv_summary : Dual encoders are now the dominant architecture for dense retrieval. Yet, we
have little understanding of how they represent text, and why this leads to
good performance. In this work, we shed light on this question via
distributions over the vocabulary. We propose to interpret the vector
representations produced by dual encoders by projecting them into the model's
vocabulary space. We show that the resulting distributions over vocabulary
tokens are intuitive and contain rich semantic information. We find that this
view can explain some of the failure cases of dense retrievers. For example,
the inability of models to handle tail entities can be explained via a tendency
of the token distributions to forget some of the tokens of those entities. We
leverage this insight and propose a simple way to enrich query and passage
representations with lexical information at inference time, and show that this
significantly improves performance compared to the original model in
out-of-domain settings.@en
- sl:arxiv_title : What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary@en
- sl:arxiv_updated : 2022-12-20T16:03:25Z
- sl:bookmarkOf : https://arxiv.org/abs/2212.10380
- sl:creationDate : 2022-12-21
- sl:creationTime : 2022-12-21T18:32:12Z
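The projection described in the abstract — interpreting a dense vector by scoring it against every token embedding and normalizing — can be sketched as below. This is a minimal illustration with random stand-in matrices, not the paper's actual models: `emb` plays the role of the model's (vocabulary-sized) output embedding matrix and `q` the role of an encoded query or passage vector.

```python
import numpy as np

def vocab_projection(vec, emb_matrix):
    """Project a dense representation into a distribution over the
    vocabulary: score it against every token embedding, then softmax."""
    logits = emb_matrix @ vec          # (vocab_size,) similarity scores
    logits -= logits.max()             # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 1000, 64
emb = rng.normal(size=(vocab_size, hidden_dim))  # stand-in embedding matrix
q = rng.normal(size=hidden_dim)                  # stand-in encoded query

dist = vocab_projection(q, emb)
top_tokens = np.argsort(dist)[::-1][:5]          # most probable token ids
```

In the paper's setting, inspecting `top_tokens` (mapped back to strings) is what reveals which lexical items a representation retains or "forgets", e.g. missing tokens of a tail entity.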