Semanlink - [2307.08621] Retentive Network: A Successor to Transformer for Large Language Models

[2307.08621] Retentive Network: A Successor to Transformer for Large Language Models

About This Document

sl:arxiv_author :
sl:arxiv_firstAuthor : Yutao Sun
sl:arxiv_num : 2307.08621
sl:arxiv_published : 2023-07-17T16:40:01Z
sl:arxiv_summary : In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.@en
sl:arxiv_title : Retentive Network: A Successor to Transformer for Large Language Models@en
sl:arxiv_updated : 2023-07-19T05:56:42Z
sl:bookmarkOf : https://arxiv.org/abs/2307.08621
sl:creationDate : 2023-07-20
sl:creationTime : 2023-07-20T23:43:53Z
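
The summary above says retention can be computed in a parallel form (for training) and a recurrent form (for constant-cost decoding). The following is a minimal sketch of why the two agree. It assumes a simplified single-head formulation with a scalar decay `gamma`, and omits the position rotation and normalization used in the paper. The function names, `gamma = 0.9`, and the toy inputs are illustrative assumptions, not taken from the released code.

```python
# Simplified single-head retention in pure Python.
# Assumptions: scalar decay gamma, no position rotation, no normalization.

def retention_parallel(Q, K, V, gamma):
    """O[n] = sum over m <= n of gamma**(n - m) * (Q[n] . K[m]) * V[m]."""
    d_v = len(V[0])
    out = []
    for n in range(len(Q)):
        row = [0.0] * d_v
        for m in range(n + 1):  # causal mask: only positions m <= n contribute
            weight = gamma ** (n - m) * sum(q * k for q, k in zip(Q[n], K[m]))
            for j in range(d_v):
                row[j] += weight * V[m][j]
        out.append(row)
    return out

def retention_recurrent(Q, K, V, gamma):
    """S_n = gamma * S_{n-1} + K[n]^T V[n];  O[n] = Q[n] S_n."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of sequence length
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gamma * S[i][j] + k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out

# Toy inputs (made up for illustration): 3 tokens, 2-d queries/keys/values.
Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

par = retention_parallel(Q, K, V, 0.9)
rec = retention_recurrent(Q, K, V, 0.9)
max_diff = max(abs(a - b) for ra, rb in zip(par, rec) for a, b in zip(ra, rb))
print(max_diff < 1e-9)
```

The recurrent version keeps only the state `S`, whose size does not grow with the number of tokens already processed, which is what makes per-token decoding cost constant in this simplified setting.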

File info

Bookmark of: https://arxiv.org/abs/2307.08621

Documents with similar tags (experimental)

microsoft/semantic-kernel: Integrate cutting-edge LLM technology quickly and easily into your apps

2023-10-19 About

[2306.04640] ModuleFormer: Modularity Emerges from Mixture-of-Experts

2023-09-16 About

Sebastien Bubeck on X: "How far does one billion parameters take you? ... Releasing phi-1.5, a 1.3B parameter LLM exhibiting emergent behaviors surprisingly close to much larger LLMs"

2023-09-12 About

[2002.06275] TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for Efficient Retrieval

2023-08-27 About

[2305.12517] Retrieving Texts based on Abstract Descriptions

2023-06-15 About

[2306.07174] Augmenting Language Models with Long-Term Memory

2023-06-13 About

[2305.15294] Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy

2023-05-26 About

[2303.17651] Self-Refine: Iterative Refinement with Self-Feedback

2023-04-03 About

[2106.09685] LoRA: Low-Rank Adaptation of Large Language Models

2023-03-21 About

[2302.08091] Do We Still Need Clinical Language Models?

2023-02-17 About

[2203.14465] STaR: Bootstrapping Reasoning With Reasoning

2023-02-07 About

[2206.02743] A Neural Corpus Indexer for Document Retrieval

2023-01-18 About

[2002.01808] K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters

2023-01-12 About

[2212.02623] Unifying Vision, Text, and Layout for Universal Document Processing

2022-12-07 About

[2008.12813] HittER: Hierarchical Transformers for Knowledge Graph Embeddings

2022-06-30 About

[2004.05119] Beyond Fine-tuning: Few-Sample Sentence Embedding Transfer

2022-03-31 About

[1911.02655] Towards Domain Adaptation from Limited Data for Question Answering Using Deep Neural Networks

2021-11-19 About

[2106.13474] Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains

2021-10-21 About

[2004.09095] The State and Fate of Linguistic Diversity and Inclusion in the NLP World

2021-10-03 About

[2007.15779] Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

2021-04-11 About

[2001.09522] TaxoExpan: Self-supervised Taxonomy Expansion with Position-Enhanced Graph Neural Network

How to add a set of new concepts to an existing taxonomy.

[Tweet](https://twitter.com/mickeyjs6/status/1253772146142216194?s=20) [GitHub](https://github.com/mickeystroller/TaxoExpan)

> we study the taxonomy expansion task: given an existing taxonomy and a set of new emerging concepts, we aim to automatically expand the taxonomy to incorporate these new concepts (without changing the existing relations in the given taxonomy).

> To the best of our knowledge, this is the first study on **how to expand an existing directed acyclic graph (as we model a taxonomy as a DAG) using self-supervised learning**.

A self-supervised framework that uses the existing taxonomy as its training data: it learns a model to predict whether a query concept is the direct hyponym of an anchor concept.
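
As a rough illustration of how an existing taxonomy can serve as its own training data, the sketch below turns each existing edge into one training instance: the child is the query, its parent is the positive anchor, and negatives are sampled from the remaining nodes. The toy taxonomy, the function names, and the rule of excluding the query's descendants from the negatives are assumptions made for this example, not details confirmed by the note above.

```python
import random

# Hypothetical toy taxonomy stored as a child -> parent map.
PARENT = {"mammal": "animal", "bird": "animal",
          "dog": "mammal", "cat": "mammal", "eagle": "bird"}

def ancestors(parent_of, node):
    """Return the chain of ancestors of node, nearest first."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain

def make_training_instances(parent_of, num_negatives, seed=0):
    """Build one instance per existing edge: query, positive anchor, sampled negative anchors."""
    rng = random.Random(seed)
    nodes = sorted(set(parent_of) | set(parent_of.values()))
    instances = []
    for query, positive in sorted(parent_of.items()):
        # Assumption: negatives exclude the query, its true parent, and its descendants.
        candidates = [n for n in nodes
                      if n != query and n != positive
                      and query not in ancestors(parent_of, n)]
        negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
        instances.append({"query": query, "positive": positive, "negatives": negatives})
    return instances

instances = make_training_instances(PARENT, num_negatives=2)
for inst in instances:
    print(inst)
```

Each instance already groups one positive pair with its negatives, which is the shape a group-wise ranking objective needs.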

> 2 techniques:
>
> 1. a **position-enhanced graph neural network that encodes the local structure of an anchor concept** in the existing taxonomy,
> 2. a noise-robust training objective that enables the learned model to be insensitive to the label noise in the self-supervision data.

Regarding point 1: uses a [GNN](/tag/graph_neural_networks.html) to model the "ego network" of concepts (potential "siblings" and "grandparents" of the query concept).

> Regular GNNs fail to distinguish nodes with different relative positions to the query (i.e., some nodes are grand parents of the query while the others are siblings of the query). To address this limitation, we present a simple but effective enhancement to inject such position information into GNNs using position embedding. We show that such embedding can be easily integrated with existing GNN architectures (e.g., [GCN](/tag/graph_convolutional_networks) and GAT) and significantly boosts the prediction performance
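
To make the idea of position information concrete, here is a small sketch that labels each node in an anchor's ego network with its position relative to a query attached under that anchor, then appends a position embedding to the node's initial features. Concatenation is an assumption about how the injection is done, and the taxonomy, feature values, and one-hot position vectors are all made up for illustration (in a trained model the position embeddings would be learned).

```python
# Hypothetical toy taxonomy (child -> parent) and 2-d initial concept embeddings.
PARENT = {"mammal": "animal", "bird": "animal", "dog": "mammal", "cat": "mammal"}
FEATURES = {"animal": [0.1, 0.2], "mammal": [0.3, 0.4], "bird": [0.5, 0.6],
            "dog": [0.7, 0.8], "cat": [0.9, 1.0]}

# One vector per relative position; fixed one-hot values stand in for learned embeddings.
POSITION_EMBEDDING = {"grandparent": [1.0, 0.0, 0.0],
                      "parent": [0.0, 1.0, 0.0],
                      "sibling": [0.0, 0.0, 1.0]}

def ego_network(anchor):
    """Return (node, position) pairs; positions are relative to a query placed under anchor."""
    nodes = []
    if anchor in PARENT:
        nodes.append((PARENT[anchor], "grandparent"))  # anchor's parent
    nodes.append((anchor, "parent"))                   # the anchor itself
    nodes.extend((child, "sibling")                    # anchor's existing children
                 for child in sorted(PARENT) if PARENT[child] == anchor)
    return nodes

def position_enhanced_features(anchor):
    """Concatenate each node's features with the embedding of its relative position."""
    return {node: FEATURES[node] + POSITION_EMBEDDING[position]
            for node, position in ego_network(anchor)}

inputs = position_enhanced_features("mammal")
for node, vector in inputs.items():
    print(node, vector)
```

These enriched vectors would then be the input node features of whatever GNN layers (GCN, GAT, ...) encode the ego network.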

Regarding point 2: uses InfoNCE loss, cf. [Contrastive Predictive Coding](/doc/?uri=https%3A%2F%2Farxiv.org%2Fabs%2F1807.03748)

> Instead of predicting whether each individual ⟨query concept, anchor concept⟩ pair is positive or not, we first group all pairs sharing the same query concept into a single training instance and learn a model to select the positive pair among other negative ones from the group.
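
The group-wise objective quoted above can be sketched as a softmax cross-entropy over the matching scores of one group, which is the usual form of the InfoNCE loss. The function name and the example scores are hypothetical; how the scores themselves are produced is not shown here.

```python
import math

def info_nce_loss(scores, positive_index):
    """Negative log-probability of the positive pair under a softmax over the group's scores."""
    top = max(scores)  # subtract the max for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum_exp - scores[positive_index]

# One training instance: matching scores of a query against 1 positive and 3 negative anchors.
scores = [2.0, 0.5, -1.0, 0.0]  # made-up values; index 0 is the positive anchor
loss_good = info_nce_loss(scores, positive_index=0)
loss_bad = info_nce_loss(scores, positive_index=2)
print(loss_good < loss_bad)
```

The loss is small when the positive anchor outscores the negatives in its group and grows when a negative is ranked above it, so a single mislabeled pair only shifts the ranking within one group rather than acting as an independent binary label.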

(Hmm, this reminds me of something.)

> assume each concept (in existing taxonomy + set of new concepts) has an initial embedding vector learned from some text associated with this concept.

To keep things tractable, the method only attempts to find a single parent node for each new concept.

2020-04-25 About

[2002.02925] BERT-of-Theseus: Compressing BERT by Progressive Module Replacing

2020-02-10 About

[2001.01447] Improving Entity Linking by Modeling Latent Entity Type Information

2020-01-09 About

[1912.08904] Macaw: An Extensible Conversational Information Seeking Platform

2020-01-01 About

[1901.11504] Multi-Task Deep Neural Networks for Natural Language Understanding

2019-02-17 About

[1810.00438] Parameter-free Sentence Embedding via Orthogonal Basis

2018-10-06 About