About This Document
- sl:arxiv_author : Canwen Xu
- sl:arxiv_firstAuthor : Canwen Xu
- sl:arxiv_num : 2002.02925
- sl:arxiv_published : 2020-02-07T17:52:16Z
- sl:arxiv_summary : In this paper, we propose a novel model compression approach to effectively
compress BERT by progressive module replacing. Our approach first divides the
original BERT into several modules and builds their compact substitutes. Then,
we randomly replace the original modules with their substitutes to train the
compact modules to mimic the behavior of the original ones. We progressively
increase the probability of replacement throughout training. In this way, our
approach enables a deeper level of interaction between the original and compact
models and smooths the training process. Compared to previous knowledge
distillation approaches for BERT compression, our approach uses only one
loss function and one hyper-parameter, sparing the effort of
hyper-parameter tuning. Our approach outperforms existing knowledge
distillation approaches on the GLUE benchmark, offering a new perspective on
model compression.@en
- sl:arxiv_title : BERT-of-Theseus: Compressing BERT by Progressive Module Replacing@en
- sl:arxiv_updated : 2020-03-25T15:20:44Z
- sl:bookmarkOf : https://arxiv.org/abs/2002.02925
- sl:creationDate : 2020-02-10
- sl:creationTime : 2020-02-10T21:50:03Z
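
The summary above describes a curriculum in which original modules are swapped for compact substitutes with a replacement probability that rises during training, driven by a single task loss. Below is a minimal PyTorch sketch of that schedule, not the paper's released code: `TheseusEncoder`, `linear_schedule`, the toy layer shapes, and the linear curriculum's `base_p` are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class TheseusEncoder(nn.Module):
    """Toy predecessor/successor stack for progressive module replacing."""

    def __init__(self, dim: int, n_modules: int):
        super().__init__()
        # Predecessor modules stand in for (frozen) pairs of original layers;
        # successor modules are their half-size compact substitutes.
        self.predecessors = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                          nn.Linear(dim, dim), nn.ReLU())
            for _ in range(n_modules))
        self.successors = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
            for _ in range(n_modules))
        for p in self.predecessors.parameters():
            p.requires_grad = False  # only the compact modules are trained
        self.replace_prob = 0.0      # raised by the curriculum during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for pred, succ in zip(self.predecessors, self.successors):
            if self.training:
                # Per-module Bernoulli draw: use the substitute with prob. p.
                x = succ(x) if torch.rand(1).item() < self.replace_prob else pred(x)
            else:
                x = succ(x)  # inference runs the fully compacted model
        return x

def linear_schedule(step: int, total_steps: int, base_p: float = 0.3) -> float:
    """Replacement probability rising linearly from base_p to 1.0 (assumed form)."""
    return min(1.0, base_p + (1.0 - base_p) * step / total_steps)

model = TheseusEncoder(dim=16, n_modules=3)
head = nn.Linear(16, 2)  # toy classification head
trainable = [p for p in model.parameters() if p.requires_grad] + list(head.parameters())
opt = torch.optim.Adam(trainable, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # the single task loss; no distillation terms

total_steps = 100
model.train()
for step in range(total_steps):
    model.replace_prob = linear_schedule(step, total_steps)
    x = torch.randn(8, 16)          # toy batch
    y = torch.randint(0, 2, (8,))   # toy labels
    loss = loss_fn(head(model(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

At inference the mixing is dropped and only the successor stack runs, which is what makes the result a compressed model; the only tuning knob is the replacement schedule itself, consistent with the one-loss, one-hyper-parameter claim in the abstract.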