About this document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Michael Glass
- sl:arxiv_num : 1909.04120
- sl:arxiv_published : 2019-09-09T19:32:31Z
- sl:arxiv_summary : BERT (Bidirectional Encoder Representations from Transformers) and related
pre-trained Transformers have provided large gains across many language
understanding tasks, achieving a new state-of-the-art (SOTA). BERT is
pre-trained on two auxiliary tasks: Masked Language Model and Next Sentence
Prediction. In this paper we introduce a new pre-training task inspired by
reading comprehension and an effort to avoid encoding general knowledge in the
transformer network itself. We find significant and consistent improvements
over both BERT-BASE and BERT-LARGE on multiple reading comprehension (MRC) and
paraphrasing datasets. Specifically, our proposed model has strong empirical
evidence as it obtains SOTA results on Natural Questions, a new benchmark MRC
dataset, outperforming BERT-LARGE by 3 F1 points on short answer prediction. We
also establish a new SOTA in HotpotQA, improving answer prediction F1 by 4 F1
points and supporting fact prediction by 1 F1 point. Moreover, we show that our
pre-training approach is particularly effective when training data is limited,
improving the learning curve by a large amount.@en
- sl:arxiv_title : Span Selection Pre-training for Question Answering@en
- sl:arxiv_updated : 2019-09-09T19:32:31Z
- sl:bookmarkOf : https://arxiv.org/abs/1909.04120
- sl:creationDate : 2019-09-18
- sl:creationTime : 2019-09-18T17:26:33Z
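The abstract only describes the new pre-training task at a high level (cloze-style instances "inspired by reading comprehension"). Below is a minimal, hypothetical sketch of how such span-selection instances could be constructed: blank a term out of a query sentence and pair it with a passage that contains the same term as the answer span. The function name, field names, and the term-matching heuristic are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of building one span-selection pre-training instance:
# a query sentence with its answer term blanked out, paired with a passage
# that contains the term, plus the character offsets of the answer span.
# Names and heuristics are illustrative assumptions, not the paper's pipeline.

def make_span_selection_example(query_sentence: str,
                                passage: str,
                                answer_term: str,
                                blank_token: str = "[BLANK]"):
    """Blank `answer_term` in the query and locate it in the passage."""
    if answer_term not in query_sentence or answer_term not in passage:
        return None  # unusable pair; a real pipeline would filter these out

    query = query_sentence.replace(answer_term, blank_token, 1)
    start = passage.index(answer_term)
    end = start + len(answer_term)
    return {
        "query": query,               # cloze-style question
        "passage": passage,           # context expected to contain the answer
        "answer_span": (start, end),  # character offsets of the answer in the passage
        "answer_text": answer_term,
    }


if __name__ == "__main__":
    example = make_span_selection_example(
        query_sentence="BERT is pre-trained with the Masked Language Model objective.",
        passage="Devlin et al. train BERT using the Masked Language Model "
                "and Next Sentence Prediction tasks.",
        answer_term="Masked Language Model",
    )
    print(example)
```

A model pre-trained on instances like these must find the answer in the passage rather than recall it from its parameters, which is consistent with the abstract's stated goal of avoiding encoding general knowledge in the Transformer network itself.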