About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Sanjeev Arora
- sl:arxiv_num : 2307.15936
- sl:arxiv_published : 2023-07-29T09:22:54Z
- sl:arxiv_summary : A major driver of AI products today is the fact that new skills emerge in
language models when their parameter set and training corpora are scaled up.
This phenomenon is poorly understood, and a mechanistic explanation via
mathematical analysis of gradient-based training seems difficult. The current
paper takes a different approach, analysing emergence using the famous (and
empirical) Scaling Laws of LLMs and a simple statistical framework.
Contributions include: (a) A statistical framework that relates cross-entropy
loss of LLMs to competence on the basic skills that underlie language tasks.
(b) Mathematical analysis showing that the Scaling Laws imply a strong form of
inductive bias that allows the pre-trained model to learn very efficiently. We
informally call this {\em slingshot generalization} since naively viewed it
appears to give competence levels at skills that violate usual generalization
theory. (c) A key example of slingshot generalization, that competence at
executing tasks involving $k$-tuples of skills emerges essentially at the same
scaling and same rate as competence on the elementary skills themselves.@en
- sl:arxiv_title : A Theory for Emergence of Complex Skills in Language Models@en
- sl:arxiv_updated : 2023-11-06T00:36:24Z
- sl:bookmarkOf : https://arxiv.org/abs/2307.15936
- sl:creationDate : 2024-02-24
- sl:creationTime : 2024-02-24T00:11:29Z
- sl:relatedDoc : http://www.semanlink.net/doc/2024/02/new_theory_suggests_chatbots_ca