About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Ananya Kumar
- sl:arxiv_num : 2202.10054
- sl:arxiv_published : 2022-02-21T09:03:34Z
- sl:arxiv_summary : When transferring a pretrained model to a downstream task, two popular
methods are full fine-tuning (updating all the model parameters) and linear
probing (updating only the last linear layer -- the "head"). It is well known
that fine-tuning leads to better accuracy in-distribution (ID). However, in
this paper, we find that fine-tuning can achieve worse accuracy than linear
probing out-of-distribution (OOD) when the pretrained features are good and the
distribution shift is large. On 10 distribution shift datasets
(Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR → STL, CIFAR10.1, FMoW,
ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on
average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We
show theoretically that this tradeoff between ID and OOD accuracy arises even
in a simple setting: fine-tuning overparameterized two-layer linear networks.
We prove that the OOD error of fine-tuning is high when we initialize with a
fixed or random head -- this is because while fine-tuning learns the head, the
lower layers of the neural network change simultaneously and distort the
pretrained features. Our analysis suggests that the easy two-step strategy of
linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning
heuristic, combines the benefits of both fine-tuning and linear probing.
Empirically, LP-FT outperforms both fine-tuning and linear probing on the above
datasets (1% better ID, 10% better OOD than full fine-tuning).@en
- sl:arxiv_title : Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution@en
- sl:arxiv_updated : 2022-02-21T09:03:34Z
- sl:bookmarkOf : https://arxiv.org/abs/2202.10054
- sl:creationDate : 2022-05-01
- sl:creationTime : 2022-05-01T08:15:47Z
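
The abstract above describes the two-step LP-FT recipe: first train only the linear head on frozen pretrained features (linear probing), then fine-tune all parameters starting from that probed head. Below is a minimal sketch of that recipe in PyTorch, assuming a torchvision ResNet-50 backbone and a standard classification loop; the learning rates, epoch counts, and function name `lp_ft` are illustrative placeholders, not the paper's actual code or settings.

```python
# Hypothetical sketch of LP-FT (linear probing, then full fine-tuning).
# Hyperparameters and the backbone choice are assumptions for illustration.
import torch
import torch.nn as nn
import torchvision.models as models

def lp_ft(train_loader, num_classes, lp_epochs=10, ft_epochs=10, device="cuda"):
    # Pretrained backbone with a fresh linear head for the downstream task.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    model.to(device)
    criterion = nn.CrossEntropyLoss()

    # Step 1: linear probing -- freeze the backbone, train only the head,
    # so the pretrained features are left untouched.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.fc.parameters():
        p.requires_grad = True
    head_opt = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
    for _ in range(lp_epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            head_opt.zero_grad()
            criterion(model(x), y).backward()
            head_opt.step()

    # Step 2: full fine-tuning, initialized from the probed head rather than
    # a random one, typically with a smaller learning rate so the lower
    # layers distort the pretrained features less.
    for p in model.parameters():
        p.requires_grad = True
    ft_opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    for _ in range(ft_epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            ft_opt.zero_grad()
            criterion(model(x), y).backward()
            ft_opt.step()
    return model
```

The point of the ordering is the one the abstract makes: starting full fine-tuning from a fixed or random head forces large feature changes while the head is still being learned, whereas starting from an already-fitted head lets fine-tuning adjust features only where needed.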