About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Ananya Kumar
- sl:arxiv_num : 2202.10054
- sl:arxiv_published : 2022-02-21T09:03:34Z
- sl:arxiv_summary : When transferring a pretrained model to a downstream task, two popular
methods are full fine-tuning (updating all the model parameters) and linear
probing (updating only the last linear layer -- the "head"). It is well known
that fine-tuning leads to better accuracy in-distribution (ID). However, in
this paper, we find that fine-tuning can achieve worse accuracy than linear
probing out-of-distribution (OOD) when the pretrained features are good and the
distribution shift is large. On 10 distribution shift datasets
(Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR → STL, CIFAR10.1, FMoW,
ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on
average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We
show theoretically that this tradeoff between ID and OOD accuracy arises even
in a simple setting: fine-tuning overparameterized two-layer linear networks.
We prove that the OOD error of fine-tuning is high when we initialize with a
fixed or random head -- this is because while fine-tuning learns the head, the
lower layers of the neural network change simultaneously and distort the
pretrained features. Our analysis suggests that the easy two-step strategy of
linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning
heuristic, combines the benefits of both fine-tuning and linear probing.
Empirically, LP-FT outperforms both fine-tuning and linear probing on the above
datasets (1% better ID, 10% better OOD than full fine-tuning).@en
- sl:arxiv_title : Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution@en
- sl:arxiv_updated : 2022-02-21T09:03:34Z
- sl:bookmarkOf : https://arxiv.org/abs/2202.10054
- sl:creationDate : 2022-05-01
- sl:creationTime : 2022-05-01T08:15:47Z
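
The abstract above describes the two-step LP-FT recipe: first train only the linear head on frozen pretrained features (linear probing), then fine-tune all parameters starting from that probed head. Below is a minimal sketch of that recipe in PyTorch, assuming a torchvision ResNet-50 backbone and a standard classification loop; the learning rates, epoch counts, and function name `lp_ft` are illustrative placeholders, not the paper's actual code or settings.

```python
# Hypothetical sketch of LP-FT (linear probing, then full fine-tuning).
# Hyperparameters and the backbone choice are assumptions for illustration.
import torch
import torch.nn as nn
import torchvision.models as models

def lp_ft(train_loader, num_classes, lp_epochs=10, ft_epochs=10, device="cuda"):
    # Pretrained backbone with a fresh linear head for the downstream task.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    model.to(device)
    criterion = nn.CrossEntropyLoss()

    # Step 1: linear probing -- freeze the backbone, train only the head,
    # so the pretrained features are left untouched.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.fc.parameters():
        p.requires_grad = True
    head_opt = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
    for _ in range(lp_epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            head_opt.zero_grad()
            criterion(model(x), y).backward()
            head_opt.step()

    # Step 2: full fine-tuning, initialized from the probed head rather than
    # a random one, typically with a smaller learning rate so the lower
    # layers distort the pretrained features less.
    for p in model.parameters():
        p.requires_grad = True
    ft_opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    for _ in range(ft_epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            ft_opt.zero_grad()
            criterion(model(x), y).backward()
            ft_opt.step()
    return model
```

The point of the ordering is the one the abstract makes: starting full fine-tuning from a fixed or random head forces large feature changes while the head is still being learned, whereas starting from an already-fitted head lets fine-tuning adjust features only where needed.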