Semanlink - [2306.04640] ModuleFormer: Modularity Emerges from Mixture-of-Experts

About This Document

sl:arxiv_firstAuthor : Yikang Shen
sl:arxiv_num : 2306.04640
sl:arxiv_published : 2023-06-07T17:59:57Z
sl:arxiv_summary : Large Language Models (LLMs) have achieved remarkable results. However, existing models are expensive to train and deploy, and it is also difficult to expand their knowledge beyond pre-training data without forgetting previous knowledge. This paper proposes a new neural network architecture, ModuleFormer, that leverages modularity to improve the efficiency and flexibility of large language models. ModuleFormer is based on the Sparse Mixture of Experts (SMoE). Unlike the previous SMoE-based modular language model, which requires domain-labeled data to learn domain-specific experts, ModuleFormer can induce modularity from uncurated data with its new load balancing and concentration losses. ModuleFormer is a modular architecture that includes two different types of modules: new stick-breaking attention heads and feedforward experts. Different modules are sparsely activated, conditioned on the input token, during training and inference. In our experiment, we found that the modular architecture enables three important abilities for large pre-trained language models: 1) Efficiency, since ModuleFormer only activates a subset of its modules for each input token, thus it could achieve the same performance as dense LLMs with more than two times throughput; 2) Extendability, ModuleFormer is more immune to catastrophic forgetting than dense LLMs and can be easily extended with new modules to learn new knowledge that is not included in the training data; 3) Specialisation, finetuning ModuleFormer could specialize a subset of modules to the finetuning task and the task-unrelated modules could be easily pruned for a lightweight deployment.@en
sl:arxiv_title : ModuleFormer: Modularity Emerges from Mixture-of-Experts@en
sl:arxiv_updated : 2023-09-11T19:31:26Z
sl:bookmarkOf : https://arxiv.org/abs/2306.04640
sl:creationDate : 2023-09-16
sl:creationTime : 2023-09-16T00:15:56Z

Documents with similar tags (experimental)

[2405.06394] Memory Mosaics

2024-05-17

fast.ai - Can LLMs learn from a single example?

2023-10-21

[2307.08621] Retentive Network: A Successor to Transformer for Large Language Models

2023-07-20

[2305.12517] Retrieving Texts based on Abstract Descriptions

2023-06-15

[2303.17651] Self-Refine: Iterative Refinement with Self-Feedback

2023-04-03

[2302.08091] Do We Still Need Clinical Language Models?

2023-02-17

[2203.14465] STaR: Bootstrapping Reasoning With Reasoning

2023-02-07

[2212.01340] Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking

2022-12-06

[2210.13952] KnowGL: Knowledge Generation and Linking from Text

2022-11-13

[2010.00711] A Survey of the State of Explainable AI for Natural Language Processing

2022-09-08

[2208.05388] ATLAS: Universal Function Approximator for Memory Retention

2022-08-28

[2201.00042] Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments

2022-06-26

[2203.10581] Cluster & Tune: Boost Cold Start Performance in Text Classification

[Leshem Choshen on Twitter: "Labelled data is scarce, what can we do?..."](doc:2022/04/leshem_choshen_sur_twitter_l)

> **One-sentence Summary**: we suggest adding an unsupervised intermediate classification step, before fine-tuning and after pretraining BERT, and show that it improves performance in data-constrained cases.

> for text classification cold start (when labeled data is scarce), **add an intermediate unsupervised classification task**, between the pretraining and fine-tuning phases: perform clustering and train the pre-trained model on predicting the cluster labels.

> this additional classification phase can significantly improve performance, mainly for **topical classification** tasks

> we use an efficient clustering technique, that relies on simple Bag Of Words (BOW) representations, to partition the unlabeled training data into relatively homogeneous clusters of text instances.
>
> Next, we treat these clusters as labeled data for an intermediate text classification task, and train the pre-trained model – with or without additional MLM pretraining – with respect to this multi-class problem, prior to the final fine-tuning over the actual target-task labels

> The underlying intuition is that inter-training the model over a related text classification task would be more beneficial compared to MLM inter-training, which focuses on different textual entities, namely predicting the identity of a single token.
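
The clustering step quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions: the notes do not name the clustering algorithm, so plain k-means over bag-of-words counts is swapped in here, and the helper names `bow_vectors` and `kmeans_labels` are hypothetical. The resulting cluster ids would then serve as labels for the intermediate classification task, before the final fine-tuning on the real labels.

```python
from collections import Counter


def bow_vectors(texts):
    """Turn each text into a bag-of-words count vector over a shared vocabulary."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for word, count in Counter(toks).items():
            vec[index[word]] = float(count)
        vectors.append(vec)
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(vectors, k, iterations=20):
    """Plain k-means (an assumed stand-in) with deterministic farthest-first init."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Pick the point farthest from its nearest existing centroid.
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(farthest)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy unlabeled data: two sports sentences, two finance sentences.
texts = [
    "the team won the football match",
    "the football team lost the match",
    "the bank raised interest rates",
    "interest rates at the bank fell",
]
pseudo_labels = kmeans_labels(bow_vectors(texts), k=2)

# (text, cluster id) pairs: the training set for the intermediate task.
training_pairs = list(zip(texts, pseudo_labels))
```

On this toy input the two sports sentences share one cluster id and the two finance sentences share the other, which is the kind of topical signal the intermediate task is meant to exploit.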

2022-04-06

[2108.13934] Robust Retrieval Augmented Generation for Zero-shot Slot Filling

2022-01-19

[1802.07569] Continual Lifelong Learning with Neural Networks: A Review

2020-01-01

[1909.04120] Span Selection Pre-training for Question Answering

> a **new pre-training task inspired by reading comprehension** and an **effort to avoid encoding general knowledge in the transformer network itself**

Current transformer architectures store general knowledge in their parameters, which leads to large models and long pre-training times. It would be better to offload the general-knowledge requirement to a sparsely activated network.

"Span selection" is added as an auxiliary task. The query is a sentence drawn from a corpus with one term replaced by a special token, [BLANK]; the replaced term is the answer term. The passage is relevant, as determined by a BM25 search, and answer-bearing (it contains the answer term). Unlike BERT’s cloze task, where the answer must be drawn from the model itself, here the answer is found in a passage using language understanding.
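
As a rough illustration of how one such training instance might be assembled, the sketch below blanks out the answer term, scores candidate passages with a hand-rolled Okapi BM25, and keeps the best-scoring passage that contains the answer. The function names (`bm25_scores`, `make_instance`), the toy passages, and the parameters `k1=1.5`, `b=0.75` are assumptions for the example, not details taken from the paper.

```python
import math
from collections import Counter

BLANK = "[BLANK]"


def tokenize(text):
    """Lowercase and split on whitespace, treating '.' and ',' as separators."""
    return text.lower().replace(".", " ").replace(",", " ").split()


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    n = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n
    # Document frequency: in how many documents each word appears.
    df = Counter(w for d in docs_tokens for w in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for w in set(query_tokens):
            if w not in tf:
                continue
            idf = math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            norm = tf[w] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[w] * (k1 + 1) / norm
        scores.append(score)
    return scores


def make_instance(sentence, answer_term, passages):
    """Blank out the answer term, then pick the best BM25 passage that contains it."""
    query = sentence.replace(answer_term, BLANK, 1)
    query_tokens = [w for w in tokenize(query) if w != BLANK.lower()]
    scores = bm25_scores(query_tokens, [tokenize(p) for p in passages])
    # Keep only answer-bearing passages.
    candidates = [
        (s, p) for s, p in zip(scores, passages) if answer_term.lower() in p.lower()
    ]
    if not candidates:
        return None
    _, passage = max(candidates, key=lambda sp: sp[0])
    start = passage.lower().index(answer_term.lower())
    return {
        "query": query,
        "passage": passage,
        "answer_span": (start, start + len(answer_term)),
    }


passages = [
    "Berlin is the capital and largest city of Germany.",
    "The capital of France is Paris, which lies on the Seine.",
    "France is famous for its cheese.",
]
instance = make_instance("Paris is the capital of France.", "Paris", passages)
```

Here the model would be trained to point at `answer_span` inside `passage` rather than to recall the answer from its own parameters. A real pipeline would also need to exclude the source sentence itself from the retrieved passages.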

> **We hope to progress to a model of general purpose language modeling that uses an indexed long term memory to retrieve world knowledge, rather than holding it in the densely activated transformer encoder layers.**

2019-09-18

[1801.06146] Universal Language Model Fine-tuning for Text Classification

2018-01-19