About this document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Nikita Kitaev
- sl:arxiv_num : 2001.04451
- sl:arxiv_published : 2020-01-13T18:38:28Z
- sl:arxiv_summary : Large Transformer models routinely achieve state-of-the-art results on a
number of tasks but training these models can be prohibitively costly,
especially on long sequences. We introduce two techniques to improve the
efficiency of Transformers. For one, we replace dot-product attention by one
that uses locality-sensitive hashing, changing its complexity from O($L^2$) to
O($L\log L$), where $L$ is the length of the sequence. Furthermore, we use
reversible residual layers instead of the standard residuals, which allows
storing activations only once in the training process instead of $N$ times,
where $N$ is the number of layers. The resulting model, the Reformer, performs
on par with Transformer models while being much more memory-efficient and much
faster on long sequences.@en
- sl:arxiv_title : Reformer: The Efficient Transformer@en
- sl:arxiv_updated : 2020-02-18T16:01:18Z
- sl:bookmarkOf : https://arxiv.org/abs/2001.04451
- sl:creationDate : 2020-06-29
- sl:creationTime : 2020-06-29T19:04:03Z
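The abstract above mentions reversible residual layers, which let the Reformer store activations only once rather than once per layer: each layer's inputs can be recomputed exactly from its outputs during the backward pass. A minimal sketch of that coupling, with hypothetical placeholder functions `F` and `G` standing in for the attention and feed-forward sublayers (they are illustrations, not the paper's actual layers):

```python
import numpy as np

# RevNet-style reversible residual block, the idea the Reformer abstract
# refers to. F and G are arbitrary fixed functions here, chosen only so
# the forward/inverse round trip can be demonstrated.

def F(x):
    return np.tanh(x)

def G(x):
    return 0.5 * x

def rev_forward(x1, x2):
    """Forward pass: (x1, x2) -> (y1, y2)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2):
    """Recover the inputs from the outputs, so per-layer
    activations need not be kept in memory."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1 = np.array([0.1, -0.4, 2.0])
x2 = np.array([1.5, 0.0, -0.7])
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
```

Because the inverse is exact (up to floating-point error), `(r1, r2)` matches `(x1, x2)`, which is what makes the memory saving described in the abstract possible.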