'Bizarro World' - The Boston Globe
2022-03-18

Story of a guy accidentally discovering his wife is the world's best Tetris player.

« Mining Secrets » : nouvelles révélations sur les pratiques controversées d'un géant de l'industrie minière au Guatemala
2022-03-06

Nigeria, De la préhistoire à l'indépendance, par Robert Cornevin (Le Monde diplomatique, décembre 1975)
2022-03-19

(((ل()(ل() 'yoav))))👾 sur Twitter : "our attempt at producing large-scale, sense-annotated corpora, with automatically derived word senses ..."
2022-03-28

[2004.05119] Beyond Fine-tuning: Few-Sample Sentence Embedding Transfer (2020)
Siddhant Garg, Rohit Kumar Sharma, Yingyu Liang
2022-03-31

Fine-tuning (FT) pre-trained sentence embedding models on small datasets has been shown to have limitations. In this paper we show that concatenating the embeddings from the pre-trained model with those from a simple sentence embedding model trained only on the target data, can improve over the performance of FT for few-sample tasks. To this end, a linear classifier is trained on the combined embeddings, either by freezing the embedding model weights or training the classifier and embedding models end-to-end. We perform evaluation on seven small datasets from NLP tasks and show that our approach with end-to-end training outperforms FT with negligible computational overhead. Further, we also show that sophisticated combination techniques like CCA and KCCA do not work as well in practice as concatenation. We provide theoretical analysis to explain this empirical observation.
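A minimal sketch of the concatenation idea above, assuming a pre-trained sentence encoder from `sentence-transformers` and using TF-IDF + SVD as a stand-in for the paper's "simple sentence embedding model trained only on the target data"; the model name, data and dimensions are illustrative, not taken from the paper.

```python
# Hypothetical sketch: concatenate pre-trained sentence embeddings with embeddings
# learned only on the small target dataset, then train a linear classifier on top
# (the frozen-encoder variant of the idea).
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["great movie", "terrible plot", "loved it", "boring and slow"]
train_labels = [1, 0, 1, 0]

# 1) Embeddings from a pre-trained model (kept frozen here).
pretrained = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model name
emb_pretrained = pretrained.encode(train_texts)

# 2) A simple embedding model fit only on the target data (TF-IDF + SVD stand-in).
tfidf = TfidfVectorizer().fit(train_texts)
svd = TruncatedSVD(n_components=2).fit(tfidf.transform(train_texts))
emb_target = svd.transform(tfidf.transform(train_texts))

# 3) Concatenate both views and train a linear classifier on the combined features.
features = np.concatenate([emb_pretrained, emb_target], axis=1)
clf = LogisticRegression(max_iter=1000).fit(features, train_labels)

def predict(texts):
    x = np.concatenate(
        [pretrained.encode(texts), svd.transform(tfidf.transform(texts))], axis=1
    )
    return clf.predict(x)

print(predict(["what a wonderful film", "awful pacing"]))
```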
Adding New Words into a Language Model using Parameters of Known Words with Similar Behavior (2018)
2022-03-21

[2203.13088] Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction (2022)
Sebastian Hofstätter, Omar Khattab, Sophia Althammer, Mete Sertkan, Allan Hanbury
2022-03-30
[tweet](https://twitter.com/s_hofstaetter/status/1508803785317138435)

Recent progress in neural information retrieval has demonstrated large gains in effectiveness, while often sacrificing the efficiency and interpretability of the neural model compared to classical approaches. This paper proposes ColBERTer, a neural retrieval model using contextualized late interaction (ColBERT) with enhanced reduction. Along the effectiveness Pareto frontier, ColBERTer's reductions dramatically lower ColBERT's storage requirements while simultaneously improving the interpretability of its token-matching scores. To this end, ColBERTer fuses single-vector retrieval, multi-vector refinement, and optional lexical matching components into one model. For its multi-vector component, ColBERTer reduces the number of stored vectors per document by learning unique whole-word representations for the terms in each document and learning to identify and remove word representations that are not essential to effective scoring. We employ an explicit multi-task, multi-stage training to facilitate using very small vector dimensions. Results on the MS MARCO and TREC-DL collection show that ColBERTer can reduce the storage footprint by up to 2.5x, while maintaining effectiveness. With just one dimension per token in its smallest setting, ColBERTer achieves index storage parity with the plaintext size, with very strong effectiveness results. Finally, we demonstrate ColBERTer's robustness on seven high-quality out-of-domain collections, yielding statistically significant gains over traditional retrieval baselines.

Math Behind Graph Neural Networks - Rishabh Anand
2022-03-21

Andrew Trask about large language models: "The 'bigness' is a temporary flaw, not a permanent feature of progress"
2022-03-13

Building Transformer-Based Entity Linking System | by izuna385 | Medium (2021)
2022-03-23

> In this article, we will create two simple entity linking systems based on Bi-encoder. The former is based on surface-based candidate generation (CG), and the latter on Approximate Nearest Neighbor Search (ANNSearch).
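A rough sketch of the ANN-search flavour of bi-encoder entity linking mentioned in the article, assuming a sentence-transformers encoder and FAISS; the entity descriptions, model name and index type are placeholders, not taken from the post.

```python
# Hypothetical bi-encoder entity linking with nearest-neighbor search:
# encode entity descriptions once, index them, then link mentions by vector search.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed bi-encoder

entities = {
    "Q90": "Paris, capital city of France",
    "Q167646": "Paris, genus of flowering plants",
    "Q830149": "Paris Hilton, American media personality",
}
entity_ids = list(entities.keys())

# Build the entity index (normalized vectors so inner product = cosine similarity).
entity_vecs = encoder.encode(list(entities.values()), normalize_embeddings=True)
index = faiss.IndexFlatIP(entity_vecs.shape[1])  # exact search; swap in IndexHNSWFlat for true ANN
index.add(np.asarray(entity_vecs, dtype="float32"))

def link(mention_with_context, top_k=2):
    q = encoder.encode([mention_with_context], normalize_embeddings=True)
    scores, idx = index.search(np.asarray(q, dtype="float32"), top_k)
    return [(entity_ids[i], float(s)) for i, s in zip(idx[0], scores[0])]

print(link("She flew to Paris to see the Eiffel Tower."))
```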
Domain Adaptation — Sentence-Transformers documentation
2022-03-31

Domain Adaptation with BERT-based Domain Classification and Data Selection - ACL Anthology (2019)
2022-03-16

GuideToTransformersDomainAdaptation.ipynb - Colaboratory
2022-03-18

> end-to-end workflow of domain adaptation, where we domain-adapt a transformer model for biomedical NLP applications

Studio Ousia sur Twitter : "Now using LUKE is easier than ever!" / Twitter
2022-03-15

Document Matching for Job Descriptions | Semantic Scholar (2021)
2022-03-09

> We train a document encoder to match online job descriptions to one of many standardized job roles from Singapore's Skills Framework. The encoder generates semantically meaningful document encodings from textual descriptions of job roles, which are then compared using Cosine Similarity to determine matching. During training, we implement the methodology used by Sentence-BERT, fine tuning pre-trained BERT models using a siamese network architecture on labelled document pairs.

(((ل()(ل() 'yoav))))👾 sur Twitter : "... another step in understanding how transformer-based LMs work..."
2022-03-30

> Very succinctly, we show that individual columns in the feedforward matrices at different layers contribute to shifting the prediction towards specific concepts, *which we can interpret*.
> We show that a token representation can be viewed as a changing distribution over the output vocabulary

[2202.14037] Understanding Contrastive Learning Requires Incorporating Inductive Biases (2022)
Nikunj Saunshi, Jordan Ash, Surbhi Goel, Dipendra Misra, Cyril Zhang, Sanjeev Arora, Sham Kakade, Akshay Krishnamurthy
2022-03-05

Contrastive learning is a popular form of self-supervised learning that encourages augmentations (views) of the same input to have more similar representations compared to augmentations of different inputs. Recent attempts to theoretically explain the success of contrastive learning on downstream classification tasks prove guarantees depending on properties of *augmentations* and the value of *contrastive loss* of representations. We demonstrate that such analyses, that ignore *inductive biases* of the function class and training algorithm, cannot adequately explain the success of contrastive learning, even *provably* leading to vacuous guarantees in some settings. Extensive experiments on image and text domains highlight the ubiquity of this problem -- different function classes and algorithms behave very differently on downstream tasks, despite having the same augmentations and contrastive losses. Theoretical analysis is presented for the class of linear representations, where incorporating inductive biases of the function class allows contrastive learning to work with less stringent conditions compared to prior analyses.

Séquencer l'ADN de l'environnement : une technique qui fait sa révolution
2022-03-15

Sentence Embedding Fine-tuning for the French Language | by La Javaness R&D | Feb, 2022 | Medium
2022-03-31

« Vouloir produire plus au nom de l'indépendance agricole, c'est comme vouloir mettre plus d'automobiles sur les routes au nom des économies d'énergie »
2022-03-20

AK sur Twitter : "HyperMixer: An MLP-based Green AI Alternative to Transformers"
2022-03-16

[2004.09813] Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation (2020)
Nils Reimers, Iryna Gurevych
2022-03-18

We present an easy and efficient method to extend existing sentence embedding models to new languages. This allows to create multilingual versions from previously monolingual models. The training is based on the idea that a translated sentence should be mapped to the same location in the vector space as the original sentence. We use the original (monolingual) model to generate sentence embeddings for the source language and then train a new system on translated sentences to mimic the original model. Compared to other methods for training multilingual sentence embeddings, this approach has several advantages: It is easy to extend existing models with relatively few samples to new languages, it is easier to ensure desired properties for the vector space, and the hardware requirements for training is lower. We demonstrate the effectiveness of our approach for 50+ languages from various language families. Code to extend sentence embeddings models to more than 400 languages is publicly available.
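A compact sketch of the distillation idea in the abstract above: a multilingual student is trained so that its embeddings of both a source sentence and its translation match the monolingual teacher's embedding of the source sentence. The model names, the tiny parallel corpus and the projection layer are assumptions for illustration; the practical recipe lives in sentence-transformers' multilingual training examples.

```python
# Hypothetical sketch of multilingual knowledge distillation for sentence embeddings:
# student(src) and student(translation) are pushed towards teacher(src) with an MSE loss.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

teacher_name = "sentence-transformers/all-MiniLM-L6-v2"  # assumed monolingual teacher
student_name = "xlm-roberta-base"                        # assumed multilingual student

tok_t = AutoTokenizer.from_pretrained(teacher_name)
tok_s = AutoTokenizer.from_pretrained(student_name)
teacher = AutoModel.from_pretrained(teacher_name).eval()
student = AutoModel.from_pretrained(student_name)

def embed(model, tok, sentences):
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state                 # [B, T, H]
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)               # mean pooling

# Tiny illustrative parallel corpus (English source, French translation).
parallel = [("The cat sleeps.", "Le chat dort."), ("I like tea.", "J'aime le thé.")]

# Projection needed here because the assumed teacher and student hidden sizes differ.
proj = nn.Linear(student.config.hidden_size, teacher.config.hidden_size)
optimizer = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=2e-5)

for src, tgt in parallel:  # one pass over the pairs, for illustration only
    with torch.no_grad():
        target = embed(teacher, tok_t, [src])                  # teacher embedding of the source
    pred_src = proj(embed(student, tok_s, [src]))
    pred_tgt = proj(embed(student, tok_s, [tgt]))
    loss = nn.functional.mse_loss(pred_src, target) + nn.functional.mse_loss(pred_tgt, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```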
Révolte des Taiping — Wikipédia
2022-03-24

Between twenty and thirty million deaths between 1851 and 1864.

Domain adaptation of word embeddings through the exploitation of in-domain corpora and knowledge bases (PhD Thesis 2021)
PhD thesis of Hicham El Boukkouri, Université Paris-Saclay. [Github](https://github.com/helboukkouri/phd-code)
2022-03-23, 2022-03-18

### Goal

Given a target specialized domain, improve the quality of general-domain word representations using in-domain corpora and/or knowledge bases.

### Contributions

#### A method for specializing general-domain embeddings in a [Low-Resource](tag:nlp_low_resource_scenarios) context

> - train static representations on the task corpus,
> - resume the pre-training of general-domain contextual embeddings on the same task corpus,
> - finally, combine both static and contextual representations into one final model

#### Tackling the issue of using a general-domain vocabulary in a specialized domain

#### Evaluation of re-training vs. training from scratch on specialized corpora using a specialized vocabulary

Training from scratch is better, but not by much: re-training from a general model is still appropriate, as it is less expensive and leads to comparable, although slightly lower, performance.

#### Regarding subword-based tokenization systems

> we argue that they are inconvenient in practice

-> CharacterBERT, a variant of BERT that uses ELMo's character-based system instead of WordPieces. More convenient to use, with superior robustness to misspellings.

#### Ways to specialize general-domain representations using knowledge bases

A strong baseline using a simple method relying on graph embeddings and concatenation, using only the is_a relation.

> both static and contextual embeddings may effectively be specialized using this simple approach

#### Knowledge Injection Modules (KIM) that inject the knowledge representations directly within the BERT-like models' architecture

### Notes

> our experiments focused on a single setting (i.e. the medical domain and the English language)

> meta-embeddings, an approach that consists in combining different sets of representations for achieving improved performance

NLP | How to add a domain-specific vocabulary (new tokens) to a subword tokenizer already trained like BERT WordPiece | by Pierre Guillou | Medium
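On the theme of the post above, a minimal sketch (not the article's own code) of how domain-specific tokens are typically added to an already-trained WordPiece tokenizer with Hugging Face Transformers; the example tokens are invented.

```python
# Hypothetical sketch: extend a pretrained tokenizer with domain-specific tokens
# and resize the model's embedding matrix accordingly.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Invented domain terms that the original WordPiece vocabulary would over-split.
new_tokens = ["angiopathy", "bronchiectasis", "covid-19"]
num_added = tokenizer.add_tokens(new_tokens)

# New rows are appended (randomly initialized) to the embedding matrix; they only
# become meaningful after further pre-training or fine-tuning on in-domain text.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")

print(tokenizer.tokenize("Patient with bronchiectasis and covid-19."))
```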
[2006.05987] Revisiting Few-sample BERT Fine-tuning (2020)
Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, Yoav Artzi
2022-03-21

This paper is a study of fine-tuning of BERT contextual representations, with focus on commonly observed instabilities in few-sample scenarios. We identify several factors that cause this instability: the common use of a non-standard optimization method with biased gradient estimation; the limited applicability of significant parts of the BERT network for down-stream tasks; and the prevalent practice of using a pre-determined, and small number of training iterations. We empirically test the impact of these factors, and identify alternative practices that resolve the commonly observed instability of the process. In light of these observations, we re-visit recently proposed methods to improve few-sample fine-tuning with BERT and re-evaluate their effectiveness. Generally, we observe the impact of these methods diminishes significantly with our modified process.

> A study of fine-tuning of BERT contextual representations, with focus on commonly observed instabilities in few-sample scenarios.

> The most commonly used optimizer for fine-tuning BERT is BERTADAM, a modified version of the ADAM first-order stochastic optimization method. It differs from the original ADAM algorithm (Kingma & Ba, 2014) in omitting a bias correction step.
>
> ... We observe that the bias correction omission influences the learning rate, especially early in the fine-tuning process, and is one of the primary reasons for instability in fine-tuning BERT

This is bad when fine-tuning with fewer than 10K samples. The problem is present in many open source libraries, including the official implementation and Hugging Face's Transformers.

How to solve it in Hugging Face?

> HuggingFace Transformers AdamW has the correct_bias parameter set to True by default. Still, it's worth noting the importance this parameter serves. [src](doc:2022/08/on_stability_of_few_sample_tran)
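A small illustration of the point above, assuming the (now legacy) transformers.AdamW optimizer; the model and hyperparameters are placeholders.

```python
# Hypothetical sketch: make sure bias correction is enabled when fine-tuning BERT
# on a small dataset (the BERTAdam variant omits it, which destabilizes few-sample runs).
from transformers import AdamW, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# correct_bias=True is the default in transformers' AdamW; correct_bias=False
# reproduces the BERTAdam behaviour criticized in the paper.
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=True)

# torch.optim.AdamW always applies the Adam bias correction, so it is a safe drop-in:
# import torch; optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```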
Jason Weston sur Twitter : "SeeKeR: An open source search-augmented language model"
2022-03-25

Stanford NLP Group sur Twitter : "...how to use AI systems to augment the work of humans in interactive systems"
2022-03-05

[2105.00828] Memorisation versus Generalisation in Pre-trained Language Models (2021)
Michael Tänzer, Sebastian Ruder, Marek Rei
2022-03-30

State-of-the-art pre-trained language models have been shown to memorise facts and perform well with limited amounts of training data. To gain a better understanding of how these models learn, we study their generalisation and memorisation capabilities in noisy and low-resource scenarios. We find that the training of these models is almost unaffected by label noise and that it is possible to reach near-optimal results even on extremely noisy datasets. However, our experiments also show that they mainly learn from high-frequency patterns and largely fail when tested on low-resource tasks such as few-shot learning and rare entity recognition. To mitigate such limitations, we propose an extension based on prototypical networks that improves performance in low-resource named entity recognition tasks.

> State-of-the-art pre-trained language models have been shown to memorise facts and perform well with limited amounts of training data....
> However, our experiments also show that they **mainly learn from high-frequency patterns and largely fail when tested on low-resource tasks such as few-shot learning and rare entity recognition**.

Old World trade routes c. 1490
2022-03-15

John Ledyard
2022-03-10

"I have trodden the world, I have laughed at fear, I have mocked danger. Hordes of savages, scorched deserts, the frozen north, eternal ice and stormy seas: what have I not come through unscathed?" (quoted after T.C. Boyle) [Dartmouth alumni magazine](https://archive.dartmouthalumnimagazine.com/article/1990/11/1/john-ledyard-1776)

[1910.06294] Training Compact Models for Low Resource Entity Tagging using Pre-trained Language Models (2019)
Peter Izsak, Shira Guskin, Moshe Wasserblat
2022-03-31

Training models on low-resource named entity recognition tasks has been shown to be a challenge, especially in industrial applications where deploying updated models is a continuous effort and crucial for business operations. In such cases there is often an abundance of unlabeled data, while labeled data is scarce or unavailable. Pre-trained language models trained to extract contextual features from text were shown to improve many natural language processing (NLP) tasks, including scarcely labeled tasks, by leveraging transfer learning. However, such models impose a heavy memory and computational burden, making it a challenge to train and deploy such models for inference use. In this work-in-progress we combined the effectiveness of transfer learning provided by pre-trained masked language models with a semi-supervised approach to train a fast and compact model using labeled and unlabeled examples. Preliminary evaluations show that the compact models can achieve competitive accuracy with 36x compression rate when compared with a state-of-the-art pre-trained language model, and run significantly faster in inference, allowing deployment of such models in production environments or on edge devices.

NAVER LABS Europe : "@Nils_Reimers of @huggingface on 'Unsupervised domain adaptation for neural search'"
2022-03-09

Unsupervised Training of Retrievers Using GenQ (The Art of Asking Questions with GenQ) | Pinecone
2022-03-09
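A rough sketch of the GenQ idea from the Pinecone article above: generate synthetic queries for unlabeled passages with a T5 query-generation model, producing (query, passage) pairs on which a dense retriever can then be fine-tuned. The checkpoint name and passages are assumptions, not taken from the article.

```python
# Hypothetical GenQ-style data generation: T5 writes queries for unlabeled passages,
# yielding (query, passage) training pairs for an unsupervised dense retriever.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "BeIR/query-gen-msmarco-t5-base-v1"  # assumed query-generation checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

passages = [
    "The Eiffel Tower was completed in 1889 and is 330 metres tall.",
    "Python's GIL prevents multiple native threads from executing bytecode at once.",
]

pairs = []
for passage in passages:
    inputs = tokenizer(passage, truncation=True, return_tensors="pt")
    outputs = model.generate(
        **inputs, max_length=64, do_sample=True, top_p=0.95, num_return_sequences=3
    )
    for out in outputs:
        query = tokenizer.decode(out, skip_special_tokens=True)
        pairs.append((query, passage))  # positive training pair for the retriever

# `pairs` can then be fed to a bi-encoder trained with, e.g., a ranking loss.
print(pairs[:3])
```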
[2101.12294] Combining pre-trained language models and structured knowledge (2021)
Pedro Colon-Hernandez, Catherine Havasi, Jason Alonso, Matthew Huggins, Cynthia Breazeal
2022-03-25

In recent years, transformer-based language models have achieved state of the art performance in various NLP benchmarks. These models are able to extract mostly distributional information with some semantics from unstructured text, however it has proven challenging to integrate structured information, such as knowledge graphs into these models. We examine a variety of approaches to integrate structured knowledge into current language models and determine challenges, and possible opportunities to leverage both structured and unstructured information sources. From our survey, we find that there are still opportunities at exploiting adapter-based injections and that it may be possible to further combine various of the explored approaches into one system.

Coût, gestion des déchets et sécurité : huit questions que pose le retour annoncé du nucléaire en France
2022-03-04

[2006.00632] Neural Unsupervised Domain Adaptation in NLP---A Survey (2020)
Alan Ramponi, Barbara Plank
2022-03-30

Deep neural networks excel at learning from labeled data and achieve state-of-the-art results on a wide array of Natural Language Processing tasks. In contrast, learning from unlabeled data, especially under domain shift, remains a challenge. Motivated by the latest advances, in this survey we review neural unsupervised domain adaptation techniques which do not require labeled target domain data. This is a more challenging yet a more widely applicable setup. We outline methods, from early traditional non-neural methods to pre-trained model transfer. We also revisit the notion of domain, and we uncover a bias in the type of Natural Language Processing tasks which received most attention. Lastly, we outline future directions, particularly the broader need for out-of-distribution generalization of future NLP.

Retraining roberta-base using the RoBERTa MLM Procedure | Medium
2022-03-18
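As a companion to the post above, a minimal sketch (assumed, not the post's own code) of continuing masked-language-model pre-training of roberta-base on an in-domain corpus with the Hugging Face Trainer; the corpus and hyperparameters are placeholders.

```python
# Hypothetical sketch: continue MLM pre-training of roberta-base on domain text.
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Tiny in-domain corpus stands in for a real domain-specific text file.
corpus = ["The patient presented with acute bronchiectasis.",
          "CT imaging revealed bilateral infiltrates."]

class LineDataset(torch.utils.data.Dataset):
    """Wraps tokenized lines so the Trainer can iterate over them."""
    def __init__(self, lines):
        self.enc = tokenizer(lines, truncation=True, max_length=128)
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}

# Dynamic masking, 15% of tokens, as in the RoBERTa MLM procedure.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-domain", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to="none"),
    train_dataset=LineDataset(corpus),
    data_collator=collator,
)
trainer.train()
trainer.save_model("roberta-domain")
```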
Document Representation | SpringerLink
2022-03-10

[2110.10778] Contrastive Document Representation Learning with Graph Attention Networks (2021)
Peng Xu, Xiaofei Ma, Bing Xiang, Xinchi Chen, Zhiheng Huang
2022-03-10

Recent progress in pretrained Transformer-based language models has shown great success in learning contextual representation of text. However, due to the quadratic self-attention complexity, most of the pretrained Transformers models can only handle relatively short text. It is still a challenge when it comes to modeling very long documents. In this work, we propose to use a graph attention network on top of the available pretrained Transformers model to learn document embeddings. This graph attention network allows us to leverage the high-level semantic structure of the document. In addition, based on our graph document model, we design a simple contrastive learning strategy to pretrain our models on a large amount of unlabeled corpus. Empirically, we demonstrate the effectiveness of our approaches in document classification and document retrieval tasks.

> most of the pretrained Transformers models can only handle relatively short text. It is still a challenge when it comes to modeling very long documents. In this work, we propose to use a graph attention network on top of the available pretrained Transformers model to learn document embeddings

[2203.06169] LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval (2022)
Canwen Xu, Daya Guo, Nan Duan, Julian McAuley
2022-03-29

In this paper, we propose LaPraDoR, a pretrained dual-tower dense retriever that does not require any supervised data for training. Specifically, we first present Iterative Contrastive Learning (ICoL) that iteratively trains the query and document encoders with a cache mechanism. ICoL not only enlarges the number of negative instances but also keeps representations of cached examples in the same hidden space. We then propose Lexicon-Enhanced Dense Retrieval (LEDR) as a simple yet effective way to enhance dense retrieval with lexical matching. We evaluate LaPraDoR on the recently proposed BEIR benchmark, including 18 datasets of 9 zero-shot text retrieval tasks. Experimental results show that LaPraDoR achieves state-of-the-art performance compared with supervised dense retrieval models, and further analysis reveals the effectiveness of our training strategy and objectives. Compared to re-ranking, our lexicon-enhanced approach can be run in milliseconds (22.5x faster) while achieving superior performance.

Ananya Kumar sur Twitter : "How should you fine-tune a large pretrained model (CLIP, SimCLR) robustly?..."
2022-03-03

> We find that standard fine-tuning can do poorly out-of-distribution (test data ≠ fine-tuning data). Our analysis leads to a simple fix

Sentence Transformer Fine-Tuning (SetFit): Outperforming GPT-3 on few-shot Text-Classification while being 1600 times smaller | by Moshe Wasserblat (2021-12)
2022-03-31

Fine-tuning an SBERT on a classification task (which, in the end, produces an SBERT).

> **Few-shot text classification based on fine-tuning a Sentence Transformer with task-specific data** that can easily be implemented with the sentence-transformers library

> Surprisingly, we did not find any work that performed an end-to-end ST fine-tuning for text classification in a Siamese manner.

[COLAB](https://colab.research.google.com/github/MosheWasserb/SetFit/blob/main/SetFit_SST_2.ipynb), [Nils Reimers sur Twitter](doc:2022/03/nils_reimers_sur_twitter_gre)

Nils Reimers sur Twitter : "Great post on SetFit"
2022-03-31

About [Sentence Transformer Fine-Tuning (SetFit): Outperforming GPT-3 on few-shot Text-Classification while being 1600 times smaller | by Moshe Wasserblat](doc:2022/03/sentence_transformer_fine_tunin)

> - Outperforms GPT-3 in few-shot text-classification (50 labeled examples, secret test set)
> - 1600 times smaller
> - Can be run on your CPU
> - No limitation on the number of training examples
> - Just few lines of code needed
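A condensed sketch of the SetFit recipe described in the post above, under the usual two-step reading: (1) fine-tune a Sentence Transformer in a siamese/contrastive fashion on pairs built from the few labeled examples, (2) train a simple classifier (here logistic regression) on the resulting embeddings. The base model, pair construction and hyperparameters are illustrative, not the post's exact settings.

```python
# Hypothetical SetFit-style few-shot classifier:
# 1) contrastive fine-tuning of a Sentence Transformer on label-derived pairs,
# 2) a logistic-regression head trained on the fine-tuned embeddings.
from itertools import combinations

from sentence_transformers import InputExample, SentenceTransformer, losses
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader

texts = ["loved this film", "a wonderful ride", "utterly boring", "a waste of time"]
labels = [1, 1, 0, 0]

# Build sentence pairs: similar (1.0) if same label, dissimilar (0.0) otherwise.
pairs = [InputExample(texts=[texts[i], texts[j]], label=float(labels[i] == labels[j]))
         for i, j in combinations(range(len(texts)), 2)]

model = SentenceTransformer("paraphrase-MiniLM-L6-v2")  # assumed base model
loader = DataLoader(pairs, shuffle=True, batch_size=4)
model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))],
          epochs=1, warmup_steps=0, show_progress_bar=False)

# Classification head on top of the (now task-adapted) embeddings.
clf = LogisticRegression(max_iter=1000).fit(model.encode(texts), labels)
print(clf.predict(model.encode(["really enjoyable", "dull and slow"])))
```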
ddangelov/Top2Vec: Top2Vec learns jointly embedded topic, document and word vectors.
2022-03-10

> Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors.
>
> "Update: Pre-trained Universal Sentence Encoders and BERT Sentence Transformer now available for embedding."

> **The assumption the algorithm makes is that many semantically similar documents are indicative of an underlying topic**. The first step is to create a joint embedding of document and word vectors. Once documents and words are embedded in a vector space the goal of the algorithm is to find dense clusters of documents, then identify which words attracted those documents together. Each dense area is a topic and the words that attracted the documents to the dense area are the topic words.

> Once you train the Top2Vec model you can:
> - ...
> - Get **hierarchical topics**.
> - Search topics by keywords.
> - Search documents by topic, by keywords.
> - Find similar words, similar documents.

Referred to by [BERTopic](doc:2022/03/maartengr_bertopic_leveraging_)

MaartenGr/BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.
2022-03-10

> topic modeling technique that leverages 🤗 transformers and [c-TF-IDF](https://github.com/MaartenGr/cTFIDF) to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

Refers to [Top2Vec](doc:2022/03/ddangelov_top2vec_top2vec_lear). [youtube](https://www.youtube.com/watch?v=Qub3PrFvauI), [tweet](https://twitter.com/JayAlammar/status/1594681648121102336?s=20&t=R0G_LrajK9WBtzypwXtD7Q)

[2203.14655] Few-Shot Learning with Siamese Networks and Label Tuning (2022)
Thomas Müller, Guillermo Pérez-Torró, Marc Franco-Salvador
2022-03-30
[Github](https://tinyurl.com/label-tuning), [Tweet](doc:2022/03/thomas_muller_sur_twitter_pa)

We study the problem of building text classifiers with little or no training data, commonly known as zero and few-shot text classification. In recent years, an approach based on neural textual entailment models has been found to give strong results on a diverse range of tasks. In this work, we show that with proper pre-training, Siamese Networks that embed texts and labels offer a competitive alternative. These models allow for a large reduction in inference cost: constant in the number of labels rather than linear. Furthermore, we introduce label tuning, a simple and computationally efficient approach that allows to adapt the models in a few-shot setup by only changing the label embeddings. While giving lower performance than model fine-tuning, this approach has the architectural advantage that a single encoder can be shared by many different tasks.

> the problem of building text classifiers with little or no training data.
>
> In recent years, an approach based on neural textual entailment models has been found to give strong results on a diverse range of tasks. (cf. #[NLI](tag:nli), using the input text as the premise and the text representing the label as the hypothesis)

> In this work, we show that **with proper pre-training, Siamese Networks that embed texts and labels** offer a competitive alternative.
>
> We introduce **label tuning: fine-tuning the label embeddings only**. While giving lower performance than model fine-tuning (which updates all params of the model), this approach has the architectural advantage that a single encoder can be shared by many different tasks (we only fine-tune the label embeddings)

> The drop in quality can be compensated by using a variant of **[Knowledge distillation](tag:knowledge_distillation)**

Thomas Müller sur Twitter : "paper & code of a novel light-weight few-shot model based on sentence embeddings..."
2022-03-30

> The idea is simple: It's well known that you can use sentence embedding models to build zero-shot models by encoding the input text and a label description. You can improve quality by fine-tuning the encoder. Instead of tuning the entire encoder **you can just tune the label embeddings**.

[Paper](doc:2022/03/2203_14655_few_shot_learning_)
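A toy sketch of label tuning as described above, with the sentence encoder frozen and only the label-embedding matrix updated; the encoder name, label descriptions and training data are made up for illustration and do not come from the paper or its code.

```python
# Hypothetical label-tuning sketch: freeze the sentence encoder, initialize one
# embedding per label from its textual description, and fine-tune only those
# label embeddings with a cross-entropy loss over cosine-similarity scores.
import torch
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed frozen encoder

label_descriptions = ["This text is about sports.", "This text is about politics."]
train_texts = ["The team won the championship.", "Parliament passed the new budget."]
train_labels = torch.tensor([0, 1])

with torch.no_grad():  # the encoder stays frozen throughout
    text_emb = torch.tensor(encoder.encode(train_texts, normalize_embeddings=True))
    init_label_emb = torch.tensor(encoder.encode(label_descriptions, normalize_embeddings=True))

label_emb = torch.nn.Parameter(init_label_emb.clone())  # the only trainable weights
optimizer = torch.optim.Adam([label_emb], lr=1e-2)

for _ in range(50):
    logits = text_emb @ torch.nn.functional.normalize(label_emb, dim=-1).T  # cosine scores
    loss = torch.nn.functional.cross_entropy(logits / 0.1, train_labels)    # 0.1 = temperature
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

def classify(texts):
    with torch.no_grad():
        emb = torch.tensor(encoder.encode(texts, normalize_embeddings=True))
        scores = emb @ torch.nn.functional.normalize(label_emb, dim=-1).T
    return scores.argmax(dim=-1)

print(classify(["The striker scored twice."]))  # expected: label 0 (sports)
```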