About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Jingfei Du
- sl:arxiv_num : 2010.02194
- sl:arxiv_published : 2020-10-05T17:52:25Z
- sl:arxiv_summary : Unsupervised pre-training has led to much recent progress in natural language
understanding. In this paper, we study self-training as another way to leverage
unlabeled data through semi-supervised learning. To obtain additional data for
a specific task, we introduce SentAugment, a data augmentation method which
computes task-specific query embeddings from labeled data to retrieve sentences
from a bank of billions of unlabeled sentences crawled from the web. Unlike
previous semi-supervised methods, our approach does not require in-domain
unlabeled data and is therefore more generally applicable. Experiments show
that self-training is complementary to strong RoBERTa baselines on a variety of
tasks. Our augmentation approach leads to scalable and effective self-training
with improvements of up to 2.6% on standard text classification benchmarks.
Finally, we also show strong gains on knowledge-distillation and few-shot
learning.@en
- sl:arxiv_title : Self-training Improves Pre-training for Natural Language Understanding@en
- sl:arxiv_updated : 2020-10-05T17:52:25Z
- sl:bookmarkOf : https://arxiv.org/abs/2010.02194
- sl:creationDate : 2021-03-12
- sl:creationTime : 2021-03-12T06:17:22Z
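
The abstract describes SentAugment as computing a task-specific query embedding from labeled data and retrieving similar sentences from a large unlabeled sentence bank. The sketch below is a minimal, hypothetical illustration of that retrieval step, not the authors' code: the `encode` function, embedding size, and `k` are placeholder assumptions standing in for the paper's sentence encoder and billion-sentence bank.

```python
# Hypothetical sketch of SentAugment-style retrieval: build a task-specific
# query embedding from labeled examples, then pull the nearest unlabeled
# sentences from a sentence bank by cosine similarity.
import numpy as np

def encode(sentences):
    # Placeholder encoder: stands in for the paper's sentence embeddings;
    # returns random vectors so the sketch runs end to end.
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(sentences), 256))

def retrieve(labeled_sentences, sentence_bank, k=100):
    # Task-specific query embedding: mean of the labeled-example embeddings.
    query = encode(labeled_sentences).mean(axis=0)
    bank_emb = encode(sentence_bank)
    # Cosine similarity between the query and every bank sentence.
    query /= np.linalg.norm(query)
    bank_emb /= np.linalg.norm(bank_emb, axis=1, keepdims=True)
    scores = bank_emb @ query
    top = np.argsort(-scores)[:k]
    return [sentence_bank[i] for i in top]

# Per the abstract, retrieved sentences would then be pseudo-labeled by a
# teacher model (e.g. a fine-tuned RoBERTa baseline) and used to train a
# student model (self-training / knowledge distillation).
```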