About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Jang Hyun Cho
- sl:arxiv_num : 1910.01348
- sl:arxiv_published : 2019-10-03T08:14:13Z
- sl:arxiv_summary : In this paper, we present a thorough evaluation of the efficacy of knowledge
distillation and its dependence on student and teacher architectures. Starting
with the observation that more accurate teachers often don't make good
teachers, we attempt to tease apart the factors that affect knowledge
distillation performance. We find crucially that larger models do not often
make better teachers. We show that this is a consequence of mismatched
capacity, and that small students are unable to mimic large teachers. We find
typical ways of circumventing this (such as performing a sequence of knowledge
distillation steps) to be ineffective. Finally, we show that this effect can be
mitigated by stopping the teacher's training early. Our results generalize
across datasets and models.@en
- sl:arxiv_title : On the Efficacy of Knowledge Distillation@en
- sl:arxiv_updated : 2019-10-03T08:14:13Z
- sl:bookmarkOf : https://arxiv.org/abs/1910.01348
- sl:creationDate : 2020-06-06
- sl:creationTime : 2020-06-06T17:20:52Z
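As context for the abstract above: knowledge distillation trains a small student network to match a larger teacher's softened output distribution. The sketch below shows the standard distillation objective of Hinton et al. (2015), which this paper evaluates; the temperature and weighting values are illustrative assumptions, not figures taken from the paper.

```python
# Minimal sketch of the standard knowledge-distillation loss
# (Hinton et al., 2015). `temperature` and `alpha` are generic
# hyperparameters chosen for illustration, not values from this paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 4.0,
                      alpha: float = 0.9) -> torch.Tensor:
    """Weighted sum of a soft-target KL term and hard-label cross-entropy."""
    # Soften both distributions with the same temperature.
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # KL(teacher || student); the T^2 factor rescales gradients so the
    # soft term stays comparable to the cross-entropy term.
    kd_term = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```

The paper's findings concern how well this objective works as the teacher grows larger or is stopped early, not a change to the loss itself.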