About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Cody Coleman
- sl:arxiv_num : 2007.00077
- sl:arxiv_published : 2020-06-30T19:46:10Z
- sl:arxiv_summary : Many active learning and search approaches are intractable for industrial
settings with billions of unlabeled examples. Existing approaches, such as
uncertainty sampling or information density, search globally for the optimal
examples to label, scaling linearly or even quadratically with the unlabeled
data. However, in practice, data is often heavily skewed; only a small fraction
of collected data will be relevant for a given learning task. For example, when
identifying rare classes, detecting malicious content, or debugging model
performance, the ratio of positive to negative examples can be 1 to 1,000 or
more. In this work, we exploit this skew in large training datasets to reduce
the number of unlabeled examples considered in each selection round by only
looking at the nearest neighbors to the labeled examples. Empirically, we
observe that learned representations effectively cluster unseen concepts,
making active learning very effective and substantially reducing the number of
viable unlabeled examples. We evaluate several active learning and search
techniques in this setting on three large-scale datasets: ImageNet, Goodreads
spoiler detection, and OpenImages. For rare classes, active learning methods
need as little as 0.31% of the labeled data to match the average precision of
full supervision. By limiting active learning methods to only consider the
immediate neighbors of the labeled data as candidates for labeling, we need
only process as little as 1% of the unlabeled data while achieving similar
reductions in labeling costs as the traditional global approach. This process
of expanding the candidate pool with the nearest neighbors of the labeled set
can be done efficiently and reduces the computational complexity of selection
by orders of magnitude.
- sl:arxiv_title : Similarity Search for Efficient Active Learning and Search of Rare Concepts
- sl:arxiv_updated : 2020-06-30T19:46:10Z
- sl:bookmarkOf : https://arxiv.org/abs/2007.00077
- sl:creationDate : 2020-07-02
- sl:creationTime : 2020-07-02T15:31:34Z
- sl:relatedDoc : http://www.semanlink.net/doc/2020/06/facebookresearch_faiss_a_libra
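
The summary above describes restricting each selection round to the nearest neighbors of the already labeled examples instead of scanning all unlabeled data. The following is a minimal sketch of that idea, not the authors' implementation. It assumes embeddings are precomputed, uses brute-force NumPy distances as a stand-in for a similarity-search index such as Faiss (the related document), and uses uncertainty sampling on binary scores as the selection rule. All function names (`knn_indices`, `expand_candidate_pool`, `select_most_uncertain`) are hypothetical.

```python
import numpy as np


def knn_indices(embeddings, query_ids, k):
    """Brute-force k nearest neighbors (squared Euclidean) for each query row.

    Assumption: a real system would replace this with an index lookup.
    """
    queries = embeddings[query_ids]
    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2, computed for all pairs at once.
    d2 = (
        (queries ** 2).sum(axis=1, keepdims=True)
        - 2.0 * queries @ embeddings.T
        + (embeddings ** 2).sum(axis=1)
    )
    return np.argsort(d2, axis=1)[:, :k]


def expand_candidate_pool(embeddings, labeled_ids, k):
    """Return the unlabeled neighbors of the labeled set, as a sorted array."""
    labeled_ids = np.asarray(labeled_ids)
    # Ask for k + 1 neighbors because each labeled point is typically
    # its own nearest neighbor; labeled points are removed afterwards.
    neighbors = knn_indices(embeddings, labeled_ids, k + 1)
    pool = set(neighbors.ravel().tolist()) - set(labeled_ids.tolist())
    return np.array(sorted(pool), dtype=int)


def select_most_uncertain(scores, pool, batch_size):
    """Uncertainty sampling restricted to the pool.

    `scores` holds a model's predicted probability of the positive class
    for every example; only entries indexed by `pool` are considered.
    """
    order = np.argsort(np.abs(scores[pool] - 0.5))
    return pool[order[:batch_size]]
```

Each round only scores the examples in the pool, so the cost of selection depends on the number of labeled examples and `k` rather than on the size of the full unlabeled set. After labeling the selected batch, the pool would be expanded again with the neighbors of the newly labeled examples.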