2022-07-12T09:16:33Z Johannes M. van Hulst Arjen P. de Vries 2020-06-02T22:51:17Z Krisztian Balog [2006.01969] REL: An Entity Linker Standing on the Shoulders of Giants Koen Dercksen Entity linking is a standard component in modern retrieval systems that is often performed by third-party toolkits. Despite the plethora of open source options, it is difficult to find a single system that has a modular architecture where certain components may be replaced, does not depend on external sources, can easily be updated to newer Wikipedia versions, and, most important of all, has state-of-the-art performance. The REL system presented in this paper aims to fill that gap. Building on state-of-the-art neural components from natural language processing research, it is provided as a Python package as well as a web API. We also report on an experimental comparison against both well-established systems and the current state-of-the-art on standard entity linking benchmarks. 2022-05-02T11:53:59Z 2205.00820 Emma J. Gerritse Pre-trained language models such as BERT have been a key ingredient to achieve state-of-the-art results on a variety of tasks in natural language processing and, more recently, also in information retrieval. Recent research even claims that BERT is able to capture factual knowledge about entity relations and properties, the information that is commonly obtained from knowledge graphs. This paper investigates the following question: Do BERT-based entity retrieval models benefit from additional entity information stored in knowledge graphs? To address this research question, we map entity embeddings into the same input space as a pre-trained BERT model and inject these entity embeddings into the BERT model. This entity-enriched language model is then employed on the entity retrieval task. We show that the entity-enriched BERT model improves effectiveness on entity-oriented queries over a regular BERT model, establishing a new state-of-the-art result for the entity retrieval task, with substantial improvements for complex natural language queries and queries requesting a list of entities with a certain property. Additionally, we show that the entity information provided by our entity-enriched model particularly helps queries related to less popular entities. Last, we observe empirically that the entity-enriched BERT models enable fine-tuning on limited training data, which otherwise would not be feasible due to the known instabilities of BERT in few-sample fine-tuning, thereby contributing to data-efficient training of BERT for entity search. 2022-07-12T08:18:56Z Entity-aware Transformers for Entity Search 2022-07-12 Arjen P. de Vries [2205.00820] Entity-aware Transformers for Entity Search Faegheh Hasibi > **Do BERT-based entity retrieval models benefit from additional entity information stored in knowledge graphs?** To address this research question, we map entity embeddings into the same input space as a pre-trained BERT model and inject these entity embeddings into the BERT model. This entity-enriched language model is then employed on the entity retrieval task. > we observe empirically that the entity-enriched BERT models **enable fine-tuning on limited training data**, which otherwise would not be feasible due to the known instabilities of BERT in few-sample fine-tuning Uses [Wikipedia2Vec](tag:wikipedia2vec) as graph embedding method 2022-05-02T11:53:59Z Emma J. Gerritse Faegheh Hasibi > REL detects mentions using Flair embeddings. REL performs candidate selection based on Wikipedia2Vec embeddings, and entity disambiguation based on latent relations between entity mentions in the text [src](doc:2022/07/2205_00820_entity_aware_trans)
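The Entity-aware Transformers entry above describes mapping Wikipedia2Vec entity embeddings into BERT's input space and injecting them into the model. A minimal, hypothetical sketch of that injection idea (the dimensions, the linear projection and the additive combination are assumptions for illustration, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class EntityEnrichedEmbeddings(nn.Module):
    """Add projected entity embeddings to token embeddings at mention positions.

    Illustrative only: a learned linear map takes Wikipedia2Vec vectors into the
    encoder's input space; the additive injection is an assumption, not the paper's
    exact design.
    """
    def __init__(self, entity_dim: int = 300, hidden_dim: int = 768):
        super().__init__()
        self.project = nn.Linear(entity_dim, hidden_dim)

    def forward(self, token_embeds, entity_embeds, mention_mask):
        # token_embeds:  (batch, seq_len, hidden_dim) from the LM's embedding layer
        # entity_embeds: (batch, seq_len, entity_dim), zeros where no entity is linked
        # mention_mask:  (batch, seq_len, 1), 1.0 at positions covered by a linked mention
        return token_embeds + mention_mask * self.project(entity_embeds)

# toy usage with random tensors standing in for real BERT / Wikipedia2Vec outputs
layer = EntityEnrichedEmbeddings()
tokens = torch.randn(2, 16, 768)
entities = torch.randn(2, 16, 300)
mask = torch.zeros(2, 16, 1)
mask[:, 3:6] = 1.0               # pretend tokens 3-5 form a linked entity mention
enriched = layer(tokens, entities, mask)
print(enriched.shape)            # torch.Size([2, 16, 768])
```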
Johannes M. van Hulst REL: An Entity Linker Standing on the Shoulders of Giants 2022-07-12 2006.01969 2020-06-02T22:51:17Z Leshem Choshen sur Twitter : "Computational (Chomskian) hierarchies can predict OOD capabilities..." 2022-07-11 2022-07-11T11:10:31Z About a paper by DeepMind ["Neural Networks and the Chomsky Hierarchy"](https://arxiv.org/abs/2207.02098) > for our subset of tasks, RNNs and Transformers fail to generalize on non-regular tasks... only networks augmented with structured memory (such as a stack or memory tape) can successfully generalize on context-free and context-sensitive tasks > By defending purchasing power without targeting the most modest households, the opposition parties are jeopardising the necessary withdrawal from fossil fuels and showing culpable short-sightedness. 2022-07-27T23:30:54Z 2022-07-27 Transition énergétique : une occasion manquée 2022-07-18T11:39:48Z Michael A. Hedderich 1807.00745 2022-07-18 Michael A. Hedderich Automatically created labels can deteriorate a classifier's performance > approach to training a neural network with **a combination of a small amount of clean data and a larger set of automatically annotated, noisy instances** > > We model the noise explicitly using a **noise layer** that is added to the network architecture. This allows us to directly optimize the network weights using standard techniques. After training, the noise layer is not needed anymore, removing any added complexity. [related blog post](https://www.roxanne-euproject.org/news/blog/making-natural-language-processing-work-for-little-training-data) Dietrich Klakow 2018-07-22T06:01:14Z [1807.00745] Training a Neural Network in a Low-Resource Setting on Automatically Annotated Noisy Data Manually labeled corpora are expensive to create and often not available for low-resource languages or domains. Automatic labeling approaches are an alternative way to obtain labeled data in a quicker and cheaper way. However, these labels often contain more errors which can deteriorate a classifier's performance when trained on this data. We propose a noise layer that is added to a neural network architecture. This allows modeling the noise and training on a combination of clean and noisy data. We show that in a low-resource NER task we can improve performance by up to 35% by using additional, noisy data and handling the noise. 2018-07-02T15:35:02Z Training a Neural Network in a Low-Resource Setting on Automatically Annotated Noisy Data 2022-07-22 2022-07-22T15:36:30Z AdapterHub: A Framework for Adapting Transformers | Towards Data Science > Snorkel's process is as follows. First, a developer writes labelling functions and evaluates them on a small set of labelled training data. Snorkel allows us to evaluate the accuracy and coverage of all our labelling functions, and their overlaps and conflicts with each other. Next, it trains a generative label model over these labelling functions that learns how best to combine them. Finally, this label model outputs probabilistic labels that we can use to train an end model. 2022-07-18 2022-07-18T11:06:41Z Dealing with Data Scarcity in Natural Language Processing | by Yves Peirsman | NLPTown | Medium (2019) Insecticide - Comment l'agrochimie a tué les insectes **Since 1990, the insect population in Europe is estimated to have fallen by 75%**.
As captivating as it is alarming, this international investigation points to the role of neonicotinoids, neurotoxic insecticides, in the ongoing ecological disaster. Neonicotinoids: introduced in Japan in the 1990s. Used as a preventive seed treatment, they spread throughout the plant to protect it from pests. Their market, held by a handful of multinationals (Syngenta, Bayer-Monsanto, BASF), is estimated at between 3 and 4 billion dollars worldwide. Pressure on researchers, policymakers and regulatory authorities, funding of studies favourable to their products, biased approval tests: the agrochemical lobbies muddy the waters to keep things at a standstill. After banning them in 2018, France has provisionally re-authorised neonicotinoids for the treatment of sugar beet.
### Convincing alternatives
Based on Stéphane Foucart's investigation **Et le monde devint silencieux – Comment l'agrochimie a détruit les insectes** (Éditions du Seuil, 2019), the film retraces the history of neonicotinoids and deciphers their effects... It also highlights the strategies industry uses to protect its profits, while looking at convincing alternatives: in the Po valley, in Italy, **the agronomist [Lorenzo Furlan](doc:2022/07/lorenzo_furlan_«_l_assurance_) has set up a mutual fund to compensate for the occasional (and very rare) yield losses caused by reducing pesticide use**. Directed by: Sylvain Lepetit, Miyuki Droz Aramaki, Sébastien Séga. Countries: France, Belgium. Year: 2021 2022-07-05T21:44:32Z 2022-07-05 2022-07-21T01:50:41Z Uri Alon sur Twitter : "New repo: kNN-Transformers..." 2022-07-21 adapter-hub/adapter-transformers: Huggingface Transformers + Adapters 2022-07-22 2022-07-22T15:27:52Z Yihong Chen 2022-07-23 Pontus Stenetorp ReFactorGNNs: Revisiting Factorisation-based Models from a Message-Passing Perspective 2022-07-20T15:39:30Z Pushkar Mishra Pasquale Minervini Factorisation-based Models (FMs), such as DistMult, have enjoyed enduring success for Knowledge Graph Completion (KGC) tasks, often outperforming Graph Neural Networks (GNNs). However, unlike GNNs, FMs struggle to incorporate node features and to generalise to unseen nodes in inductive settings. Our work bridges the gap between FMs and GNNs by proposing ReFactorGNNs. This new architecture draws upon both modelling paradigms, which previously were largely thought of as disjoint. Concretely, using a message-passing formalism, we show how FMs can be cast as GNNs by reformulating the gradient descent procedure as message-passing operations, which forms the basis of our ReFactorGNNs. Across a multitude of well-established KGC benchmarks, our ReFactorGNNs achieve comparable transductive performance to FMs, and state-of-the-art inductive performance while using an order of magnitude fewer parameters.
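The abstract above uses DistMult as the canonical factorisation-based model for KGC. As background (not the ReFactorGNN reformulation itself), a toy sketch of DistMult's trilinear scoring function over randomly initialised embeddings:

```python
import torch

def distmult_score(head, rel, tail):
    # DistMult: score(h, r, t) = sum_i  e_h[i] * w_r[i] * e_t[i]
    return (head * rel * tail).sum(dim=-1)

# toy KG with 5 entities and 2 relations, 32-dimensional embeddings
num_entities, num_relations, dim = 5, 2, 32
entity_emb = torch.nn.Embedding(num_entities, dim)
relation_emb = torch.nn.Embedding(num_relations, dim)

# score a single triple (entity 0, relation 1, entity 3)
h = entity_emb(torch.tensor([0]))
r = relation_emb(torch.tensor([1]))
t = entity_emb(torch.tensor([3]))
print(distmult_score(h, r, t))            # one score for the triple

# link prediction: rank all entities as candidate tails for (0, relation 1, ?)
all_tails = entity_emb.weight             # (num_entities, dim)
scores = distmult_score(h, r, all_tails)  # broadcasting gives (num_entities,)
print(scores.topk(k=3).indices)           # top-3 candidate tail entities
```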
Sebastian Riedel Luca Franceschi 2022-07-23T12:57:37Z 2207.09980 [2207.09980] ReFactorGNNs: Revisiting Factorisation-based Models from a Message-Passing Perspective 2022-07-21T13:33:26Z Yihong Chen 2022-07-21 2022-07-21T01:57:58Z neulab/knn-transformers: PyTorch code for the RetoMaton paper: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022), including an implementation of kNN-LM and kNN-MT > a Hugging Face's transformers implementation of k-nearest-neighbor-based language models and machine translation models, designed to be easy and useful in research, and for experimenting with new ideas in kNN-based models. cf. - [[1911.00172] Generalization through Memorization: Nearest Neighbor Language Models](doc:2019/12/_1911_00172_generalization_thr) - [[2201.12431] Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval](doc:2022/07/2201_12431_neuro_symbolic_lan) 2201.12431 Junxian He 2022-07-21T09:58:40Z Uri Alon Uri Alon Dan Roth 2022-06-09T16:55:32Z Graham Neubig 2022-07-21 Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval Frank F. Xu Retrieval-based language models (R-LM) model the probability of natural language text by combining a standard language model (LM) with examples retrieved from an external datastore at test time. While effective, a major bottleneck of using these models in practice is the computationally costly datastore search, which can be performed as frequently as every time step. In this paper, we present RetoMaton - retrieval automaton - which approximates the datastore search, based on (1) saving pointers between consecutive datastore entries, and (2) clustering of entries into "states". This effectively results in a weighted finite automaton built on top of the datastore, instead of representing the datastore as a flat list. The creation of the automaton is unsupervised, and a RetoMaton can be constructed from any text collection: either the original training corpus or from another domain. Traversing this automaton at inference time, in parallel to the LM inference, reduces its perplexity by up to 1.85, or alternatively saves up to 83% of the nearest neighbor searches over $k$NN-LM (Khandelwal et al., 2020) without hurting perplexity. Our code and trained models are available at https://github.com/neulab/retomaton . 2022-01-28T21:38:56Z Sudipta Sengupta [2201.12431] Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval > The key ingredient of R-LMs is their ability to utilize training examples at test time without having to rely on the information encoded in the model’s weights only. 
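As background for the kNN-LM / RetoMaton entries above, a toy sketch of the kNN-LM interpolation: the parametric LM's next-token distribution is mixed with a distribution induced from the k nearest (context vector, next token) pairs in an external datastore. The datastore here is random; in practice the keys are hidden states saved while running the LM over a corpus, and `lam` is a tuned hyperparameter.

```python
import torch
import torch.nn.functional as F

# toy datastore of (context vector, next-token id) pairs, as in kNN-LM
vocab_size, dim, n_entries = 100, 16, 1000
keys = torch.randn(n_entries, dim)
values = torch.randint(0, vocab_size, (n_entries,))

def knn_lm_probs(query, lm_logits, k=8, lam=0.25, temperature=1.0):
    """Interpolate the LM distribution with a kNN distribution over the datastore."""
    p_lm = F.softmax(lm_logits, dim=-1)
    # retrieve the k nearest keys (squared L2 distance)
    dists = ((keys - query) ** 2).sum(dim=-1)
    knn_dist, knn_idx = dists.topk(k, largest=False)
    # turn negative distances into a distribution over the retrieved next tokens
    weights = F.softmax(-knn_dist / temperature, dim=-1)
    p_knn = torch.zeros(vocab_size).index_add_(0, values[knn_idx], weights)
    # final distribution: lambda * p_kNN + (1 - lambda) * p_LM
    return lam * p_knn + (1 - lam) * p_lm

query = torch.randn(dim)            # current context representation
lm_logits = torch.randn(vocab_size)
probs = knn_lm_probs(query, lm_logits)
print(probs.sum())                  # ~1.0
```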
2022-07-11T17:04:48Z 2022-07-11 Recent Advances in Retrieval-Augmented Text Generation
### [Retrieval augmented LM](tag:retrieval_augmented_lm)
> Motivation of Retrieval-augmented LM: Store knowledge in LM -> Store knowledge in non-parametric index
>
> Three types:
> - KNN-LM: Token-level and Interpolation-based model [Generalization through Memorization: Nearest Neighbor Language Models](doc:2019/12/_1911_00172_generalization_thr)
>   - Explicitly memorizing the training data helps generation
>   - LMs can scale to larger text collections without the added cost of training, by simply adding the data to the index
>   - A single LM can adapt to multiple domains without in-domain training, by adding domain-specific data to the index
> - [REALM](tag:realm): Document-level and Joint-Training model
> - RETRO: Chunk-level, Frozen-Retriever, huge index model [Improving Language Models by Retrieving from Trillions of Tokens | DeepMind](doc:2021/12/improving_language_models_by_re)

2022-07-04T08:16:25Z 2022-07-04 Jean de Nyandwi sur Twitter : "Efficient Transformers: A Survey..." 2022-07-08 2022-07-08T12:28:53Z cs224n Lecture 13: Integrating Knowledge in Language Models No Language Left Behind 2022-07-06 2022-07-06T20:57:57Z [tweet](https://twitter.com/vedanujg/status/1544925973635690497?s=20&t=ZunLNurhmN7aHDmnzPO5yQ) 10 Best African Language Datasets for Data Science Projects 2022-07-14T11:42:48Z 2022-07-14 Télescope James-Webb : « Le niveau de détail est époustouflant », la première image dévoilée 2022-07-12 2022-07-12T08:11:35Z 2022-07-20T14:10:11Z Bojan Tunguz sur Twitter : "Does anyone know of any recent NLP/NLG work on “text corpus summarization”?" 2022-07-20 <https://github.com/allenai/primer> Ankita Rajaram Naik Michael Glass > Recent models such as RAG and REALM have introduced retrieval into conditional generation. These models incorporate neural initial retrieval from a corpus of passages. We build on this line of research, proposing Re2G, which combines both neural initial retrieval and reranking into a BART-based sequence-to-sequence generation. Our reranking approach also permits merging retrieval results from sources with incomparable scores, enabling an ensemble of BM25 and neural initial retrieval. > > To train our system end-to-end, we introduce a novel variation of knowledge distillation to train the initial retrieval, reranker, and generation using only ground truth on the target sequence output. > > Large gains in four diverse tasks: zero-shot slot filling, question answering, fact checking and dialog, with relative gains of 9% to 34% over the previous SotA on the KILT leaderboard. [Code](https://github.com/IBM/kgi-slot-filling) 2022-07-14T11:37:46Z [2207.06300] Re2G: Retrieve, Rerank, Generate 2022-07-13T15:51:40Z 2022-07-13T15:51:40Z Alfio Gliozzo 2207.06300 Re2G: Retrieve, Rerank, Generate 2022-07-14 Gaetano Rossiello As demonstrated by GPT-3 and T5, transformers grow in capability as parameter spaces become larger and larger. However, for tasks that require a large amount of knowledge, non-parametric memory allows models to grow dramatically with a sub-linear increase in computational cost and GPU memory requirements. Recent models such as RAG and REALM have introduced retrieval into conditional generation. These models incorporate neural initial retrieval from a corpus of passages. We build on this line of research, proposing Re2G, which combines both neural initial retrieval and reranking into a BART-based sequence-to-sequence generation. Our reranking approach also permits merging retrieval results from sources with incomparable scores, enabling an ensemble of BM25 and neural initial retrieval. To train our system end-to-end, we introduce a novel variation of knowledge distillation to train the initial retrieval, reranker, and generation using only ground truth on the target sequence output. We find large gains in four diverse tasks: zero-shot slot filling, question answering, fact-checking, and dialog, with relative gains of 9% to 34% over the previous state-of-the-art on the KILT leaderboard. We make our code available as open source at https://github.com/IBM/kgi-slot-filling/tree/re2g. Md Faisal Mahbub Chowdhury Pengshan Cai Michael Glass
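A self-contained toy sketch of the retrieve, rerank, generate pipeline described in the Re2G entry above. The three scoring functions are crude stand-ins (assumptions) for BM25, a dense retriever, and a cross-encoder reranker; the point is that candidates from sources with incomparable scores are merged as a set and put on a single scale by the reranker before generation:

```python
# Toy sketch in the spirit of Re2G, not its actual components.
corpus = {
    0: "Paris is the capital of France",
    1: "Berlin is the capital of Germany",
    2: "The Eiffel Tower is in Paris",
}

def lexical_score(query, passage):            # stand-in for BM25
    return len(set(query.lower().split()) & set(passage.lower().split()))

def dense_score(query, passage):              # stand-in for a dense retriever
    q, p = query.lower().split(), passage.lower().split()
    return sum(p.count(w) for w in q) / (len(q) * len(p))

def rerank_score(query, passage):             # stand-in for a cross-encoder reranker
    return lexical_score(query, passage) + 10 * dense_score(query, passage)

def retrieve_rerank_generate(query, k_initial=2, k_final=2):
    # 1. initial retrieval from two sources whose scores are not comparable
    bm25_hits = sorted(corpus, key=lambda i: lexical_score(query, corpus[i]), reverse=True)[:k_initial]
    dense_hits = sorted(corpus, key=lambda i: dense_score(query, corpus[i]), reverse=True)[:k_initial]
    # 2. merge by taking the union of candidates, then rerank them on a single scale
    candidates = list(dict.fromkeys(bm25_hits + dense_hits))
    top = sorted(candidates, key=lambda i: rerank_score(query, corpus[i]), reverse=True)[:k_final]
    # 3. a seq2seq generator (e.g. BART) would condition on query + top passages;
    #    here we just return the assembled generator input
    return query + " [SEP] " + " ".join(corpus[i] for i in top)

print(retrieve_rerank_generate("capital of France"))
```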
python - certificate verify failed: unable to get local issuer certificate - Stack Overflow 2022-07-21T15:51:51Z 2022-07-21 > StATIK uses Language Models to extract the semantic information from text descriptions, while using Message Passing Neural Networks to capture the structural information. > Structure is incorporated through a Message Passing Neural Network (MPNN) (Gilmer et al., 2017) that aggregates information from a neighborhood defined around each entity, while textual information is incorporated through a pretrained language model such as BERT. KGs are dynamic (new entities are added) -> we want an inductive KG completion model (able to generalize to unseen entities) 2022-07-17 2022-07-17T00:01:23Z StATIK: Structure and Text for Inductive Knowledge Graph Completion - ACL Anthology (2022) 2022-07-12T13:05:32Z 2022-07-12 « Avec l’érosion de la biodiversité, nous perdons des milliers et des milliers de mondes » [Simple Local Attentions Remain Competitive for Long-Context Tasks](https://arxiv.org/abs/2112.07210) 2022-07-18 2022-07-18T14:33:04Z Christopher Manning sur Twitter : "This seems like an important contribution to the external validity of the (big) recent line of work on long-context transformer models" 2022-06-13T23:40:34Z [2206.06520] Memory-Based Model Editing at Scale 2022-07-07 2206.06520 2022-07-07T16:16:11Z Eric Mitchell 2022-06-13T23:40:34Z Editing knowledge of a Language Model without retraining it. Eric Mitchell Christopher D. Manning Memory-Based Model Editing at Scale Antoine Bosselut Charles Lin Chelsea Finn Even the largest neural networks make errors, and once-correct predictions can become invalid as the world changes. Model editors make local updates to the behavior of base (pre-trained) models to inject updated knowledge or correct undesirable behaviors. Existing model editors have shown promise, but also suffer from insufficient expressiveness: they struggle to accurately model an edit's intended scope (examples affected by the edit), leading to inaccurate predictions for test inputs loosely related to the edit, and they often fail altogether after many edits. As a higher-capacity alternative, we propose Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model (SERAC), which stores edits in an explicit memory and learns to reason over them to modulate the base model's predictions as needed. To enable more rigorous evaluation of model editors, we introduce three challenging language model editing problems based on question answering, fact-checking, and dialogue generation. We find that only SERAC achieves high performance on all three problems, consistently outperforming existing approaches to model editing by a significant margin. Code, data, and additional project information will be made available at https://sites.google.com/view/serac-editing.
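A toy sketch of the semi-parametric editing idea in the SERAC abstract above: edits live in an explicit memory, a scope check decides whether a query is affected by a stored edit, and affected queries are routed to a counterfactual model while everything else goes to the untouched base model. All three components below are hypothetical stand-ins, not the paper's trained models:

```python
# Toy sketch of memory-based model editing in the spirit of SERAC.
edit_memory = []   # list of (edit_input, edit_label) pairs

def base_model(query):
    return f"base answer to: {query}"

def in_scope(query, edit_input):
    # stand-in scope classifier: word overlap above a threshold
    q, e = set(query.lower().split()), set(edit_input.lower().split())
    return len(q & e) / max(len(e), 1) > 0.5

def counterfactual_model(query, edit_input, edit_label):
    # stand-in for a model that answers *as if* the stored edit were true
    return f"answer to '{query}' given that '{edit_input}' -> '{edit_label}'"

def edited_model(query):
    for edit_input, edit_label in edit_memory:
        if in_scope(query, edit_input):
            return counterfactual_model(query, edit_input, edit_label)
    return base_model(query)

edit_memory.append(("Who is the CEO of Acme?", "Jane Doe"))
print(edited_model("Who is the CEO of Acme?"))           # routed to the counterfactual model
print(edited_model("What is the capital of France?"))    # untouched base model
```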
2020-04-17T17:16:08Z > the story of how we put words into computers 1902.06006 [1902.06006] Contextual Word Representations: A Contextual Introduction 2022-07-08T14:56:29Z Contextual Word Representations: A Contextual Introduction Noah A. Smith This introduction aims to tell the story of how we put words into computers. It is part of the story of the field of natural language processing (NLP), a branch of artificial intelligence. It targets a wide audience with a basic understanding of computer programming, but avoids a detailed mathematical treatment, and it does not present any algorithms. It also does not focus on any particular application of NLP such as translation, question answering, or information extraction. The ideas presented here were developed by many researchers over many decades, so the citations are not exhaustive but rather direct the reader to a handful of papers that are, in the author's view, seminal. After reading this document, you should have a general understanding of word vectors (also known as word embeddings): why they exist, what problems they solve, where they come from, how they have changed over time, and what some of the open questions about them are. Readers already familiar with word vectors are advised to skip to Section 5 for the discussion of the most recent advance, contextual word vectors. 2019-02-15T23:28:36Z Noah A. Smith 2022-07-08 2022-07-26 2022-07-26T22:11:14Z Sparql Secrets In Jena-Fuseki - DataScienceCentral.com Espèces menacées : les scientifiques en alerte | CNRS Le journal 2022-07-04T13:23:43Z 2022-07-04 2022-07-14T11:33:33Z 2022-07-14 [P] Sioyek 1.4 | Academic PDF Viewer : MachineLearning 2022-07-12 Prompting: Better Ways of Using Language Models for NLP Tasks > Starting from BERT (Devlin et al., 2019), fine-tuning pre-trained language models (LMs) with task-specific heads on downstream applications has become standard practice in NLP. However, the GPT-3 model with 175B parameters (Brown et al., 2020) has brought a new way of using LMs for downstream tasks: as the title “Language Models are Few-Shot Learners” suggests, GPT-3 can well handle a wide range of tasks with only a few examples by leveraging natural-language prompts and task demonstrations as context, while not updating the parameters in the underlying model. (see the few-shot prompt sketch below) 2022-07-12T18:29:11Z 2022-07-23T20:37:52Z 2022-07-23 The Metamorphosis of Prime Intellect > interpretable "stack traces" of thought. <https://arxiv.org/abs/2207.10342> 2022-07-23 2022-07-23T01:25:22Z Andrej Karpathy sur Twitter : "Language Model Cascades" Lorenzo Furlan : « L'assurance récolte » comme alternative aux néonicotinoïdes - Pollinis 2022-07-05T22:56:56Z 2022-07-05 > The agronomist Lorenzo Furlan explains how setting up a collaborative insurance scheme against the economic risks of poor harvests would make it possible to drastically reduce pesticide use. Combined with specific farming practices, this tool is an alternative to the return of neonicotinoids demanded by sugar-beet growers. Andrej Karpathy sur Twitter : "For people wondering why, as a "vision person", I am interested in language models..." 2022-07-18T23:04:50Z 2022-07-18 [To Understand Language is to Understand Generalization | Eric Jang](doc:2022/07/to_understand_language_is_to_un) language models are engines of generalization
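The prompting note above describes GPT-3-style in-context learning: a natural-language instruction plus a few task demonstrations, with no parameter updates. A minimal sketch of assembling such a few-shot prompt (the template and the sentiment examples are made up for illustration):

```python
# Minimal sketch of building a GPT-3-style few-shot prompt: an instruction,
# a handful of task demonstrations, and the new input. The model's parameters
# are never updated; all "learning" happens in the context window.

demonstrations = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
]

def build_prompt(new_input, instruction="Classify the sentiment of each review."):
    parts = [instruction, ""]
    for text, label in demonstrations:
        parts.append(f"Review: {text}\nSentiment: {label}\n")
    parts.append(f"Review: {new_input}\nSentiment:")
    return "\n".join(parts)

prompt = build_prompt("The plot made no sense at all.")
print(prompt)
# the resulting string would then be sent to a frozen LM, via an API or a local model
```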
2022-07-18T23:05:53Z 2022-07-18 To Understand Language is to Understand Generalization | Eric Jang 2022-07-08T09:28:48Z 2022-07-08 Andrej Karpathy: Books (Sci-fi) Cohere 2022-07-08 2022-07-08T08:56:24Z > "Making NLP part of every developer's toolkit" Tom Hope 2022-07-07 2022-05-16T22:55:45Z [Tara Safavi sur Twitter : "CascadER, a new knowledge graph (KG) link prediction method leveraging structured relations + unstructured text..."](doc:2022/07/tara_safavi_sur_twitter_casc) [2205.08012] CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction 2205.08012 2022-07-07T08:50:22Z Tara Safavi sur Twitter : "CascadER, a new knowledge graph (KG) link prediction method leveraging structured relations + unstructured text..." 2022-07-07 > for improved scientific discovery, entity recommendation, and hypothesis generation. [[2205.08012] CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction](doc:2022/07/2205_08012_cascader_cross_mo) 2022-07-07T14:44:59Z 2022-05-16T22:55:45Z Tara Safavi Doug Downey Tara Safavi Knowledge graph (KG) link prediction is a fundamental task in artificial intelligence, with applications in natural language processing, information retrieval, and biomedicine. Recently, promising results have been achieved by leveraging cross-modal information in KGs, using ensembles that combine knowledge graph embeddings (KGEs) and contextual language models (LMs). However, existing ensembles are either (1) not consistently effective in terms of ranking accuracy gains or (2) impractically inefficient on larger datasets due to the combinatorial explosion problem of pairwise ranking with deep language models. In this paper, we propose a novel tiered ranking architecture CascadER to maintain the ranking accuracy of full ensembling while improving efficiency considerably. CascadER uses LMs to rerank the outputs of more efficient base KGEs, relying on an adaptive subset selection scheme aimed at invoking the LMs minimally while maximizing accuracy gain over the KGE. Extensive experiments demonstrate that CascadER improves MRR by up to 9 points over KGE baselines, setting new state-of-the-art performance on four benchmarks while improving efficiency by one or more orders of magnitude over competitive cross-modal baselines. Our empirical analyses reveal that diversity of models across modalities and preservation of individual models' confidence signals help explain the effectiveness of CascadER, and suggest promising directions for cross-modal cascaded architectures. Code and pretrained models are available at https://github.com/tsafavi/cascader.
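A toy sketch of the tiered ranking idea in the CascadER abstract above: a cheap KGE scores every candidate tail entity, the expensive text-based LM is invoked only on a small top subset, and the two scores are combined for that subset. Both scorers are random stand-ins, and the fixed-size subset is a simplification of the paper's adaptive selection scheme:

```python
import torch

num_entities = 1000

def kge_scores(head, relation):
    # stand-in for a knowledge graph embedding model scoring all candidate tails
    torch.manual_seed(head * 31 + relation)
    return torch.randn(num_entities)

def lm_score(head, relation, tail):
    # stand-in for a (much slower) text-based LM scorer for a single triple
    return float(torch.randn(1))

def cascaded_ranking(head, relation, n_rerank=20, alpha=0.5):
    base = kge_scores(head, relation)
    top = base.topk(n_rerank).indices                 # only these reach the LM tier
    combined = base.clone()
    for tail in top.tolist():
        combined[tail] = alpha * base[tail] + (1 - alpha) * lm_score(head, relation, tail)
    return combined.argsort(descending=True)

print(cascaded_ranking(head=3, relation=1)[:5])       # top-5 predicted tail entities
```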
Mike Lewis Dani Yogatama Questions Are All You Need to Train a Dense Passage Retriever 2022-06-21T18:16:31Z Devendra Singh Sachan 2206.10658 2022-06-21T18:16:31Z > **approach for training dense retrieval models that does not require any labeled training data**. Dense retrieval is a central challenge for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples. > > ART, in contrast, only requires access to unpaired inputs and outputs (e.g. questions and potential answer documents). > > It uses a new document-retrieval autoencoding scheme, where
> 1. an input question is used to retrieve a set of evidence documents, and
> 2. the documents are then used to compute the probability of reconstructing the original question.
>
> Training for retrieval based on question reconstruction enables effective unsupervised learning of both document and question encoders, which can be later incorporated into complete Open QA systems without any further finetuning.

[Tweet](doc:2022/07/devendra_singh_sachan_sur_twitt) > Given an input question, ART first retrieves a small set of possible evidence documents. It then reconstructs the original question by attending to these documents. > > The key idea in ART is to consider the retrieved documents as a noisy representation of the original question and question reconstruction probability as a way of denoising that provides soft-labels for how likely each document is to have been the correct result Refers to [[IZACARD 2012.04584] Distilling Knowledge from Reader to Retriever for Question Answering](doc:2020/12/2012_04584_distilling_knowled) [Arxiv](doc:2022/07/2206_10658_questions_are_all_) Devendra Singh Sachan sur Twitter : "ART (Autoencoding-based Retriever Training), an unsupervised method to train a dense retriever that only uses questions and a collection of unpaired documents as the training data." 2022-07-06 2022-07-06T23:15:50Z We introduce ART, a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data. Dense retrieval is a central challenge for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples. ART, in contrast, only requires access to unpaired inputs and outputs (e.g. questions and potential answer documents). It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question. Training for retrieval based on question reconstruction enables effective unsupervised learning of both document and question encoders, which can be later incorporated into complete Open QA systems without any further finetuning. Extensive experiments demonstrate that ART obtains state-of-the-art results on multiple QA retrieval benchmarks with only generic initialization from a pre-trained language model, removing the need for labeled data and task-specific losses. Joelle Pineau 2022-07-06T23:39:29Z Manzil Zaheer Devendra Singh Sachan [2206.10658] Questions Are All You Need to Train a Dense Passage Retriever Luke Zettlemoyer 2022-07-06
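A toy sketch of the training signal described in the ART notes above: the retriever's distribution over its retrieved documents is pushed, via a KL term, toward soft labels given by how well each document lets a (frozen) model reconstruct the question. Embeddings and reconstruction log-likelihoods are random stand-ins for the real question/passage encoders and the cross-attention reconstruction scorer:

```python
import torch
import torch.nn.functional as F

batch, k, dim = 4, 8, 32
q_emb = torch.randn(batch, dim, requires_grad=True)      # question encoder output
d_emb = torch.randn(batch, k, dim, requires_grad=True)   # top-k retrieved doc embeddings
recon_loglik = torch.randn(batch, k)                      # log p(question | doc_i), frozen scorer

# retriever distribution over the k retrieved documents (dot-product scores)
retriever_logits = torch.einsum("bd,bkd->bk", q_emb, d_emb)
log_p_retriever = F.log_softmax(retriever_logits, dim=-1)

# soft labels: documents that reconstruct the question well get more mass
soft_labels = F.softmax(recon_loglik, dim=-1)

# KL(soft_labels || retriever) as the training loss for the question/doc encoders
loss = F.kl_div(log_p_retriever, soft_labels, reduction="batchmean")
loss.backward()
print(float(loss))
```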