About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Suchin Gururangan
- sl:arxiv_num : 2303.14177
- sl:arxiv_published : 2023-03-24T17:38:58Z
- sl:arxiv_summary : Large language models are typically trained densely: all parameters are
updated with respect to all inputs. This requires synchronization of billions
of parameters across thousands of GPUs. We introduce a simple but effective
method to asynchronously train large, sparse language models on arbitrary text
corpora. Our method clusters a corpus into sets of related documents, trains a
separate expert language model on each cluster, and combines them in a sparse
ensemble for inference. This approach generalizes embarrassingly parallel
training by automatically discovering the domains for each expert, and
eliminates nearly all the communication overhead of existing sparse language
models. Our technique outperforms dense baselines on multiple corpora and
few-shot tasks, and our analysis shows that specializing experts to meaningful
clusters is key to these gains. Performance also improves with the number of
experts and size of training data, suggesting this is a highly efficient and
accessible approach to training large language models.@en
- sl:arxiv_title : Scaling Expert Language Models with Unsupervised Domain Discovery@en
- sl:arxiv_updated : 2023-03-24T17:38:58Z
- sl:bookmarkOf : https://arxiv.org/abs/2303.14177
- sl:creationDate : 2023-03-27
- sl:creationTime : 2023-03-27T23:25:12Z
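
Below is a minimal sketch (not the authors' code) of the pipeline the abstract describes: cluster a corpus into related documents, train one expert per cluster, and sparsely ensemble the experts at inference. It assumes TF-IDF features and k-means for the unsupervised domain discovery step, and the "experts" are stand-in document subsets so the example stays self-contained; the paper's actual implementation may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

corpus = [
    "the stock market rallied after the earnings report",
    "the patient was prescribed antibiotics for the infection",
    "the team scored in the final minute of the match",
    "investors worried about rising interest rates",
]

# 1) Unsupervised domain discovery: embed documents and cluster them.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
k = 2
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# 2) Train one expert per cluster (embarrassingly parallel, no cross-expert
#    synchronization). Here each "expert" is just the documents assigned to
#    its cluster; a real system would train a separate LM on each subset.
experts = {c: [doc for doc, lab in zip(corpus, kmeans.labels_) if lab == c]
           for c in range(k)}

# 3) Sparse ensemble at inference: weight experts by how close the query
#    context is to each cluster center, keeping only the top-k experts.
def expert_weights(context: str, top_k: int = 1) -> np.ndarray:
    q = kmeans.transform(vectorizer.transform([context]))[0]  # distance to centers
    scores = np.exp(-q)                      # closer cluster -> higher score
    keep = np.argsort(scores)[-top_k:]       # sparsify to the top-k experts
    weights = np.zeros(k)
    weights[keep] = scores[keep] / scores[keep].sum()
    return weights

print(expert_weights("quarterly earnings beat expectations"))
```

The returned weights would then scale each expert's next-token distribution; because only the top-k experts get nonzero weight, most experts never need to be loaded or synchronized for a given input.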