About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Jack W. Rae
- sl:arxiv_num : 1911.05507
- sl:arxiv_published : 2019-11-13T14:36:01Z
- sl:arxiv_summary : We present the Compressive Transformer, an attentive sequence model which
compresses past memories for long-range sequence learning. We find the
Compressive Transformer obtains state-of-the-art language modelling results in
the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc
respectively. We also find it can model high-frequency speech effectively and
can be used as a memory mechanism for RL, demonstrated on an object matching
task. To promote the domain of long-range sequence learning, we propose a new
open-vocabulary language modelling benchmark derived from books, PG-19.@en
- sl:arxiv_title : Compressive Transformers for Long-Range Sequence Modelling@en
- sl:arxiv_updated : 2019-11-13T14:36:01Z
- sl:bookmarkOf : https://arxiv.org/abs/1911.05507
- sl:creationDate : 2020-02-11
- sl:creationTime : 2020-02-11T08:48:20Z
- sl:relatedDoc : http://www.semanlink.net/doc/2020/02/a_new_model_and_dataset_for_lon
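The abstract above only names the core mechanism in passing: rather than discarding old activations once they fall outside the attention window, the Compressive Transformer compresses them into a smaller secondary memory. The sketch below is a minimal illustration of that idea, assuming mean pooling as the compression function (one of the compression functions the paper discusses); the buffer sizes, the `compress` helper, and the update step are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def compress(old_activations: np.ndarray, rate: int) -> np.ndarray:
    """Compress a block of evicted activations by mean pooling.

    old_activations: (seq_len, d_model) block falling out of the ordinary memory.
    rate: compression rate c; every c consecutive vectors are averaged into one.
    Mean pooling is used here purely for illustration.
    """
    seq_len, d_model = old_activations.shape
    assert seq_len % rate == 0, "block length must be divisible by the compression rate"
    return old_activations.reshape(seq_len // rate, rate, d_model).mean(axis=1)

# Toy update step: evict the oldest segment from the ordinary memory and
# append its compressed version to the compressive memory (FIFO on both).
d_model, segment, rate = 16, 8, 4
memory = np.random.randn(32, d_model)          # ordinary (uncompressed) memory
compressive_memory = np.zeros((0, d_model))    # starts empty

evicted, memory = memory[:segment], memory[segment:]
compressive_memory = np.concatenate([compressive_memory, compress(evicted, rate)])
print(memory.shape, compressive_memory.shape)  # (24, 16) (2, 16)
```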