About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Sebastian Hofstätter
- sl:arxiv_num : 2010.02666
- sl:arxiv_published : 2020-10-06T12:35:53Z
- sl:arxiv_summary : Retrieval and ranking models are the backbone of many applications such as
web search, open-domain QA, or text-based recommender systems. The query-time
latency of neural ranking models depends largely on the architecture and on
deliberate design choices that trade off effectiveness for higher efficiency.
This focus on low query latency in a rising number of efficient ranking
architectures makes them feasible for production deployment. In machine
learning, an increasingly common approach to closing the effectiveness gap of
more efficient models is to apply knowledge distillation from a large teacher
model to a smaller student model. We find that different ranking architectures
tend to produce output scores of different magnitudes. Based on this finding,
we propose a cross-architecture training procedure with a margin-focused loss
(Margin-MSE) that adapts knowledge distillation to the varying score output
distributions of different BERT and non-BERT passage ranking architectures. We
apply the teacher's scores as additional fine-grained labels to existing
training triples of the MSMARCO-Passage collection. We evaluate our procedure
for distilling knowledge from state-of-the-art concatenated BERT models to four
different efficient architectures (TK, ColBERT, PreTT, and a BERT CLS dot
product model). We show that, across the evaluated architectures, our
Margin-MSE knowledge distillation significantly improves re-ranking
effectiveness without compromising efficiency. Additionally, we show that our
general distillation method improves nearest-neighbor-based index retrieval
with the BERT dot product model, offering results competitive with specialized
and much more costly training methods. To benefit the community, we publish the
teacher-score training files in a ready-to-use package.@en
- sl:arxiv_title : Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation@en
- sl:arxiv_updated : 2021-01-22T16:24:52Z
- sl:bookmarkOf : https://arxiv.org/abs/2010.02666
- sl:creationDate : 2021-12-16
- sl:creationTime : 2021-12-16T13:37:29Z
- sl:relatedDoc : http://www.semanlink.net/doc/2021/12/2112_07577_gpl_generative_ps
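
Below is a minimal sketch, assuming PyTorch, of the Margin-MSE loss the abstract describes: the student is supervised to match the teacher's score *margin* between the positive and negative passage of each training triple, so the absolute score scales of different architectures do not need to agree. Function and variable names are illustrative, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def margin_mse_loss(student_pos: torch.Tensor,
                    student_neg: torch.Tensor,
                    teacher_pos: torch.Tensor,
                    teacher_neg: torch.Tensor) -> torch.Tensor:
    # Supervise only the margin (pos - neg), not the raw scores:
    # this is what lets a teacher and a student with different
    # score magnitudes be trained together.
    return F.mse_loss(student_pos - student_neg,
                      teacher_pos - teacher_neg)

# Illustrative usage with random scores for a batch of training triples.
# The teacher scores are deliberately on a different scale to show that
# only the margins matter.
if __name__ == "__main__":
    batch = 32
    s_pos, s_neg = torch.randn(batch), torch.randn(batch)
    t_pos, t_neg = torch.randn(batch) * 10, torch.randn(batch) * 10
    loss = margin_mse_loss(s_pos, s_neg, t_pos, t_neg)
    print(loss.item())
```

In practice the teacher's positive and negative scores would come from the published teacher-score training files for the MSMARCO-Passage triples, while the student scores come from the efficient model being trained.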