About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Abhishek Panigrahi
- sl:arxiv_num : 2302.06600
- sl:arxiv_published : 2023-02-13T18:55:52Z
- sl:arxiv_summary : Pre-trained language models can be fine-tuned to solve diverse NLP tasks,
including in few-shot settings. Thus fine-tuning allows the model to quickly
pick up task-specific "skills," but there has been limited study of where
these newly-learnt skills reside inside the massive model. This paper
introduces the term skill localization for this problem and proposes a
solution. Given the downstream task and a model fine-tuned on that task, a
simple optimization is used to identify a very small subset of parameters
($\sim0.01$% of model parameters) responsible for ($>95$%) of the model's
performance, in the sense that grafting the fine-tuned values for just this
tiny subset onto the pre-trained model gives performance almost as good as the
fine-tuned model. While reminiscent of recent works on parameter-efficient
fine-tuning, the novel aspects here are that: (i) No further re-training is
needed on the subset (unlike, say, with lottery tickets). (ii) Notable
improvements are seen over vanilla fine-tuning with respect to calibration of
predictions in-distribution ($40$-$90$% error reduction) as well as the quality
of predictions out-of-distribution (OOD). In models trained on multiple tasks,
a stronger notion of skill localization is observed, where the sparse regions
corresponding to different tasks are almost disjoint, and their overlap (when
it happens) is a proxy for task similarity. Experiments suggest that
localization via grafting can assist certain forms of continual learning.@en
- sl:arxiv_title : Task-Specific Skill Localization in Fine-tuned Language Models@en
- sl:arxiv_updated : 2023-07-02T01:55:00Z
- sl:bookmarkOf : https://arxiv.org/abs/2302.06600
- sl:creationDate : 2023-08-25
- sl:creationTime : 2023-08-25T22:52:04Z
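
The grafting step described in the summary above (copying the fine-tuned values of a tiny parameter subset onto the pre-trained checkpoint, with no further re-training) can be sketched roughly as below. This is a minimal illustration assuming PyTorch state dicts and a hypothetical per-tensor boolean `mask` marking the identified subset; it is not the paper's actual code.

```python
# Minimal sketch of parameter "grafting" (illustrative, not the paper's implementation):
# keep fine-tuned values only on a tiny, pre-identified subset of entries;
# every other parameter keeps its pre-trained value.
import torch

def graft(pretrained_state: dict, finetuned_state: dict, mask: dict) -> dict:
    """Return a grafted state dict.

    `mask` maps a parameter name to a boolean tensor marking the sparse subset
    (~0.01% of entries, per the abstract) whose fine-tuned values are retained.
    """
    grafted = {}
    for name, theta_pre in pretrained_state.items():
        m = mask.get(name)
        if m is None:
            # no grafted entries in this tensor: keep pre-trained values
            grafted[name] = theta_pre.clone()
        else:
            # elementwise mix: fine-tuned where mask is True, pre-trained elsewhere
            grafted[name] = torch.where(m, finetuned_state[name], theta_pre)
    return grafted
```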