Qu’est-ce qui a tué Kawabata ?
2022-04-17 2022-04-17T17:45:30Z
> Un long tunnel entre les deux régions, et voici qu’on était dans le pays de neige

Human Language Understanding & Reasoning | American Academy of Arts and Sciences (2022)
2022-04-14 2022-04-14T23:48:33Z
> theory of reference: the meaning of a word, phrase, or sentence is the set of objects or situations in the world that it describes

Gary Marcus 🇺🇦 sur Twitter : "Hybrid neurosymbolic system beats end-to-end deep learning in autonomous driving study from Caltech." / Twitter
2022-04-06 2022-04-06T15:39:45Z

Target encoding done the right way • Max Halford
2022-04-16 2022-04-16T15:45:59Z

Tengyu Ma sur Twitter : "Pretraining is ≈SoTA for domain adaptation..."
2022-04-13 2022-04-13T00:30:50Z
> just do contrastive learning on *all* unlabeled data + finetune on source labels. Features are NOT domain-invariant, but disentangle class & domain info to enable transfer.

BigScience Research Workshop sur Twitter : "how architectures & pretraining objectives impact zero-shot performance."
2022-04-15 2022-04-15T23:34:47Z

« La cause environnementale a complètement disparu après seulement dix-huit mois du mandat d’Emmanuel Macron »
2022-04-03 2022-04-03T15:45:38Z

[2008.11228] A simple method for domain adaptation of sentence embeddings
Anna Kruspe. arXiv 2008.11228 (2020-08-25T18:31:08Z). 2022-04-01 2022-04-01T14:07:28Z
Pre-trained sentence embeddings have been shown to be very useful for a variety of NLP tasks. Due to the fact that training such embeddings requires a large amount of data, they are commonly trained on a variety of text data. An adaptation to specific domains could improve results in many cases, but such a finetuning is usually problem-dependent and poses the risk of over-adapting to the data used for adaptation. In this paper, we present a simple universal method for finetuning Google's Universal Sentence Encoder (USE) using a Siamese architecture. We demonstrate how to use this approach for a variety of data sets and present results on different data sets representing similar problems. The approach is also compared to traditional finetuning on these data sets. As a further advantage, the approach can be used for combining data sets with different annotations. We also present an embedding finetuned on all data sets in parallel.

[2008.09470] Top2Vec: Distributed Representations of Topics
Dimo Angelov. arXiv 2008.09470 (2020-08-19T20:58:27Z). 2022-04-28 2022-04-28T12:08:34Z
Topic modeling is used for discovering latent semantic structure, usually referred to as topics, in a large collection of documents. The most widely used methods are Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis. Despite their popularity they have several weaknesses. In order to achieve optimal results they often require the number of topics to be known, custom stop-word lists, stemming, and lemmatization. Additionally these methods rely on bag-of-words representation of documents which ignore the ordering and semantics of words. Distributed representations of documents and words have gained popularity due to their ability to capture semantics of words and documents. We present $\texttt{top2vec}$, which leverages joint document and word semantic embedding to find $\textit{topic vectors}$. This model does not require stop-word lists, stemming or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors with distance between them representing semantic similarity. Our experiments demonstrate that $\texttt{top2vec}$ finds topics which are significantly more informative and representative of the corpus trained on than probabilistic generative models.
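As a quick illustration of the workflow the abstract describes, here is a minimal sketch using the `top2vec` package released by the author. The call signature (`Top2Vec(documents=...)`, `get_num_topics()`, `get_topics()`) follows the project's README and may differ across versions; the 20 Newsgroups corpus is only a stand-in dataset.

```python
# Minimal Top2Vec sketch: raw documents in, topics out, with the number of
# topics discovered automatically (no stop-word lists, stemming or lemmatization).
from sklearn.datasets import fetch_20newsgroups
from top2vec import Top2Vec

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data

model = Top2Vec(documents=docs, speed="learn", workers=4)

print(model.get_num_topics())                        # found automatically
topic_words, word_scores, topic_nums = model.get_topics()
print(topic_words[0][:10])                           # top words of the first topic
```

Because topic, document, and word vectors are jointly embedded, similarity queries between them reduce to nearest-neighbour lookups in that shared space.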
Réconcilier sécurité alimentaire et environnement
2022-04-01 2022-04-01T11:48:36Z

Tanishq Mathew Abraham sur Twitter : "Awesome and surprising things you can do with Jupyter Notebooks"
2022-04-27 2022-04-27T15:50:15Z

manvir singh sur Twitter : "In the 1970s & 80s, anthropologists working in small-scale, non-industrial societies fastidiously noted down what people were doing throughout the day. I’ve been exploring the data & am struck by one of the most popular activities: doing nothing. [thread]"
2022-04-04 2022-04-04T22:51:50Z

Dennis Meadows : « Il faut mettre fin à la croissance incontrôlée, le cancer de la société »
2022-04-08 2022-04-08T12:02:23Z

I.A.B sur Twitter : "When we "know the meaning" of a word, what is it that we know? For example, what does knowing the words "dolphin" and "tiger" entail?..."
2022-04-15 2022-04-15T23:22:16Z

[2204.11428] Personal Research Knowledge Graphs
Sudakshina Dutta, Prantika Chakraborty, Debarshi Kumar Sanyal. arXiv 2204.11428 (2022-04-25T04:31:33Z). 2022-04-30 2022-04-30T08:59:59Z
Maintaining research-related information in an organized manner can be challenging for a researcher. In this paper, we envision personal research knowledge graphs (PRKGs) as a means to represent structured information about the research activities of a researcher. PRKGs can be used to power intelligent personal assistants, and personalize various applications. We explore what entities and relations could be potentially included in a PRKG, how to extract them from various sources, and how to share a PRKG within a research group.
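The paper stays at the level of a vision. Purely as an illustration of what "structured information about the research activities of a researcher" could look like, here is a tiny RDF sketch with rdflib; the vocabulary (prkg:Researcher, prkg:isReading, prkg:worksOn, ...) is invented for this example and is not a schema proposed by the paper.

```python
# Illustrative only: a hand-rolled PRKG fragment as RDF triples
# (hypothetical vocabulary, not the paper's schema).
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

PRKG = Namespace("http://example.org/prkg/")

g = Graph()
g.bind("prkg", PRKG)

g.add((PRKG.alice, RDF.type, PRKG.Researcher))
g.add((PRKG.paper_2204_11428, RDF.type, PRKG.Paper))
g.add((PRKG.paper_2204_11428, RDFS.label, Literal("Personal Research Knowledge Graphs")))
g.add((PRKG.alice, PRKG.isReading, PRKG.paper_2204_11428))
g.add((PRKG.alice, PRKG.worksOn, PRKG.projectX))
g.add((PRKG.paper_2204_11428, PRKG.relevantTo, PRKG.projectX))

print(g.serialize(format="turtle"))
```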
Où déjeuner à Paris pour moins de 20 euros ?
2022-04-18 2022-04-18T19:35:53Z

Les Ammonites, princesses de Villers-sur-Mer | Philippe Courville
2022-04-06 2022-04-06T00:57:55Z
In the early seventies, I got a very fine pyritic ammonite from Les Vaches Noires (through an exchange with the friend who had found it). It had been identified as "Goliathiceras (Pachycardioceras) Repletum (Maire)". Unfortunately it was lost a long time ago, and I don't know why I remember this name. But this is why I find the title of this paper beautiful and very appropriate.

Energy efficiency guru Amory Lovins: ‘It’s the largest, cheapest, safest, cleanest way to address the crisis’ | Energy efficiency | The Guardian
2022-04-10 2022-04-10T16:13:45Z

abhishek sur Twitter : "When was the last time you used SVM?"
2022-04-04 2022-04-04T22:04:58Z

Le GIEC appelle à des mesures immédiates et dans tous les secteurs pour « garantir un avenir vivable »
2022-04-04 2022-04-04T22:35:54Z

EASE: Entity-Aware Contrastive Learning of Sentence Embedding | Papers With Code
2022-04-08 2022-04-08T16:32:34Z
> Our experiments have demonstrated that entity supervision in EASE improves the quality of sentence embeddings both in the monolingual setting and, in particular, the multilingual setting.

[1909.00426] Global Entity Disambiguation with BERT
Ikuya Yamada, Koki Washio, Hiroyuki Shindo, Yuji Matsumoto. arXiv 1909.00426 (2019-09-01T16:29:53Z, 2022-04-12T05:37:42Z). 2022-04-18 2022-04-18T19:49:22Z
We propose a global entity disambiguation (ED) model based on BERT. To capture global contextual information for ED, our model treats not only words but also entities as input tokens, and solves the task by sequentially resolving mentions to their referent entities and using resolved entities as inputs at each step. We train the model using a large entity-annotated corpus obtained from Wikipedia. We achieve new state-of-the-art results on five standard ED datasets: AIDA-CoNLL, MSNBC, AQUAINT, ACE2004, and WNED-WIKI. The source code and model checkpoint are available at https://github.com/studio-ousia/luke.

SapienzaNLP/extend: Entity Disambiguation as text extraction (ACL 2022)
2022-04-19 2022-04-19T17:46:50Z
> we propose an extractive formulation, where a model receives as input the mention, its context and the text representation of each candidate, and has to extract the span corresponding to the representation of the entity that best matches the (mention, context) pair under consideration.

True Crime (1999 film) - Jugé coupable
2022-04-14 2022-04-14T23:20:32Z

Ramsri Goutham Golla sur Twitter : "Hi @Nils_Reimers For GPL you used "msmarco-distilbert-base-tas-b" model and ..."
2022-04-27 2022-04-27T22:17:10Z

Google AI Blog: Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance
2022-04-05 2022-04-05T22:16:07Z

[2204.08491] Active Learning Helps Pretrained Models Learn the Intended Task
Alex Tamkin, Dat Nguyen, Salil Deshpande, Jesse Mu, Noah Goodman. arXiv 2204.08491 (2022-04-18T18:00:19Z). 2022-04-20 2022-04-20T08:08:47Z
Models can fail in unpredictable ways during deployment due to task ambiguity, when multiple behaviors are consistent with the provided training data. An example is an object classifier trained on red squares and blue circles: when encountering blue squares, the intended behavior is undefined. We investigate whether pretrained models are better active learners, capable of disambiguating between the possible tasks a user may be trying to specify. Intriguingly, we find that better active learning is an emergent property of the pretraining process: pretrained models require up to 5 times fewer labels when using uncertainty-based active learning, while non-pretrained models see no or even negative benefit. We find these gains come from an ability to select examples with attributes that disambiguate the intended behavior, such as rare product categories or atypical backgrounds. These attributes are far more linearly separable in pretrained models' representation spaces vs non-pretrained models, suggesting a possible mechanism for this behavior.
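The gains reported above hinge on uncertainty-based active learning over a pretrained model's representations. The snippet below is a generic entropy-based selection step (my illustration, not the authors' code), with random features standing in for pretrained embeddings and a linear probe as the classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Generic uncertainty-based active-learning step: score each unlabeled example
# by predictive entropy and request labels for the top-k most uncertain ones.
def select_most_uncertain(clf, X_unlabeled, k=16):
    probs = clf.predict_proba(X_unlabeled)                     # (n, n_classes)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)   # predictive entropy
    return np.argsort(-entropy)[:k]                            # indices to send for labeling

# Toy usage with random features standing in for (pretrained) representations.
rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(40, 8))
y_labeled = rng.integers(0, 2, size=40)
X_pool = rng.normal(size=(500, 8))

clf = LogisticRegression().fit(X_labeled, y_labeled)
query_idx = select_most_uncertain(clf, X_pool, k=16)
print(query_idx)
```

The paper's observation that task-disambiguating attributes are more linearly separable in pretrained representations is what would make such a simple probe-plus-entropy ranking effective on top of a pretrained encoder.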
« L’enjeu est d’augmenter la production agricole en Afrique »
2022-04-01 2022-04-01T11:50:11Z

Jeremy Howard sur Twitter : "NLP competition at Kaggle about patent concept similarity...."
2022-04-15 2022-04-15T00:11:05Z

Papers with Code sur Twitter : "10 Recent Trends in Language Models In this thread..."
2022-04-25 2022-04-25T17:10:09Z

Nils Reimers sur Twitter : "A nice thread on generalization performance for Dense Retrieval models..."
2022-04-27 2022-04-27T16:13:08Z
> Dense retrieval models will perform badly for unseen queries
> How to solve it?
> - Either train on a lot more data (models & datasets exist: https://huggingface.co/sentence-transformers…)
> - Generate your own training data for your corpus: [GPL](tag:gpl_generative_pseudo_labeling)

Devendra Singh Sachan sur Twitter : "...Unsupervised Passage Re-ranker (UPR), an approach to re-rank retrieved passages for information retrieval tasks."
2022-04-18 2022-04-18T23:21:01Z

[2203.10581] Cluster & Tune: Boost Cold Start Performance in Text Classification
Eyal Shnarch, Leshem Choshen, Ariel Gera, Noam Slonim, Lena Dankin, Ranit Aharonov, Alon Halfon. arXiv 2203.10581 (2022-03-20T15:29:34Z). 2022-04-06 2022-04-06T01:22:32Z
[Leshem Choshen sur Twitter : "Labelled data is scarce, what can we do?..."](doc:2022/04/leshem_choshen_sur_twitter_l)
> **One-sentence Summary**: we suggest adding an unsupervised intermediate classification step, before fine-tuning and after pretraining BERT, and show it improves performance for data-constrained cases.
> for text classification cold start (when labeled data is scarce), **add an intermediate unsupervised classification task**, between the pretraining and fine-tuning phases:
> perform clustering and train the pre-trained model on predicting the cluster labels.
> this additional classification phase can significantly improve performance, mainly for **topical classification** tasks
> we use an efficient clustering technique that relies on simple Bag Of Words (BOW) representations, to partition the unlabeled training data into relatively homogeneous clusters of text instances.
>
> Next, we treat these clusters as labeled data for an intermediate text classification task, and train the pre-trained model – with or without additional MLM pretraining – with respect to this multi-class problem, prior to the final fine-tuning over the actual target-task labels.
> The underlying intuition is that inter-training the model over a related text classification task would be more beneficial compared to MLM inter-training, which focuses on different textual entities, namely predicting the identity of a single token.
In real-world scenarios, a text classification task often begins with a cold start, when labeled data is scarce. In such cases, the common practice of fine-tuning pre-trained models, such as BERT, for a target classification task, is prone to produce poor performance. We suggest a method to boost the performance of such models by adding an intermediate unsupervised classification task, between the pre-training and fine-tuning phases. As such an intermediate task, we perform clustering and train the pre-trained model on predicting the cluster labels. We test this hypothesis on various data sets, and show that this additional classification phase can significantly improve performance, mainly for topical classification tasks, when the number of labeled instances available for fine-tuning is only a couple of dozen to a few hundred.

Leshem Choshen sur Twitter : "Labelled data is scarce, what can we do?..."
2022-04-06 2022-04-06T01:18:22Z
> We can MLM on the unlabeled data, but You can do better: Cluster & Tune - **finetune on clusters as labels** [github](https://github.com/IBM/intermediate-training-using-clustering) ; Paper: [[2203.10581] Cluster & Tune: Boost Cold Start Performance in Text Classification](doc:2022/04/2203_10581_cluster_tune_bo)
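A hedged sketch of the recipe summarized in these two entries (the authors' code is the IBM repository linked above; this is a simplified re-implementation on toy data, not their pipeline): cluster the unlabeled pool with a cheap BOW representation, inter-train the pretrained model on the cluster ids, then fine-tune on the scarce target labels.

```python
import torch
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# In practice `unlabeled` is the full unlabeled training pool and `labeled` the
# couple dozen to few hundred annotated examples; tiny toy lists are used here.
unlabeled = ["cheap flights to rome", "hotel deals in paris", "pasta recipe",
             "how to bake bread", "best laptops 2022", "gpu benchmark results"]
labeled = ["weekend trip to lisbon", "sourdough starter tips"]
labels = [0, 1]

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

class TextDataset(torch.utils.data.Dataset):
    def __init__(self, texts, targets):
        self.enc = tok(texts, truncation=True, padding=True)
        self.targets = targets
    def __len__(self):
        return len(self.targets)
    def __getitem__(self, i):
        return {**{k: torch.tensor(v[i]) for k, v in self.enc.items()},
                "labels": torch.tensor(self.targets[i])}

def train(model, dataset, out_dir):
    args = TrainingArguments(out_dir, num_train_epochs=1,
                             per_device_train_batch_size=8)
    Trainer(model=model, args=args, train_dataset=dataset).train()
    model.save_pretrained(out_dir)

# 1) partition the unlabeled pool with a cheap BOW representation
n_clusters = 3
bow = TfidfVectorizer(stop_words="english").fit_transform(unlabeled)
cluster_ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(bow).tolist()

# 2) intermediate task: train the pretrained model to predict cluster ids
inter = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=n_clusters)
train(inter, TextDataset(unlabeled, cluster_ids), "cluster-intertrained")

# 3) final fine-tuning on the scarce target labels, starting from step 2
final = AutoModelForSequenceClassification.from_pretrained(
    "cluster-intertrained", num_labels=len(set(labels)), ignore_mismatched_sizes=True)
train(final, TextDataset(labeled, labels), "final-model")
```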
Tu Vu sur Twitter : "Enormous LMs like GPT-3 exhibit impressive few-shot performance, but w/ self-training a BERT base sized model can achieve much better results!"
2022-04-13 2022-04-13T13:37:58Z
> [[2109.06270] STraTA: Self-Training with Task Augmentation for Better Few-shot Learning](doc:2022/04/2109_06270_strata_self_train) [Github](https://github.com/google-research/google-research/tree/master/STraTA) [at HuggingFace](https://github.com/huggingface/transformers/tree/main/examples/research_projects/self-training-text-classification) -- Remark: Like [[2203.10581] Cluster & Tune: Boost Cold Start Performance in Text Classification](doc:2022/04/2203_10581_cluster_tune_bo), adds an intermediate fine-tuning step // TODO compare

[2109.06270] STraTA: Self-Training with Task Augmentation for Better Few-shot Learning
Tu Vu, Grady Simon, Quoc V. Le, Mohit Iyyer, Minh-Thang Luong. arXiv 2109.06270 (2021-09-13T19:14:01Z, 2022-04-12T16:44:16Z). 2022-04-14 2022-04-14T19:26:35Z
[Tu Vu sur Twitter](doc:2022/04/tu_vu_sur_twitter_enormous_l)
Despite their recent successes in tackling many NLP tasks, large-scale pre-trained language models do not perform as well in few-shot settings where only a handful of training examples are available. To address this shortcoming, we propose STraTA, which stands for Self-Training with Task Augmentation, an approach that builds on two key ideas for effective leverage of unlabeled data. First, STraTA uses task augmentation, a novel technique that synthesizes a large amount of data for auxiliary-task fine-tuning from target-task unlabeled texts. Second, STraTA performs self-training by further fine-tuning the strong base model created by task augmentation on a broad distribution of pseudo-labeled data. Our experiments demonstrate that STraTA can substantially improve sample efficiency across 12 few-shot benchmarks. Remarkably, on the SST-2 sentiment dataset, STraTA, with only 8 training examples per class, achieves comparable results to standard fine-tuning with 67K training examples. Our analyses reveal that task augmentation and self-training are both complementary and independently effective.
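STraTA's second ingredient is a plain self-training loop over pseudo-labeled data. The snippet below is a generic illustration of that loop only (task augmentation is omitted, and this is not the implementation linked above); a linear classifier over stand-in features plays the role of the fine-tuned base model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(16, 32))        # stands in for encoded labeled texts
y_labeled = rng.integers(0, 2, size=16)
X_unlabeled = rng.normal(size=(1000, 32))    # stands in for encoded unlabeled texts

# Self-training: repeatedly fit on labeled + confident pseudo-labeled data.
X_train, y_train = X_labeled.copy(), y_labeled.copy()
for step in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    probs = clf.predict_proba(X_unlabeled)
    conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
    keep = conf > 0.9                        # keep only confident pseudo-labels
    X_train = np.concatenate([X_labeled, X_unlabeled[keep]])
    y_train = np.concatenate([y_labeled, pseudo[keep]])
    print(f"round {step}: {keep.sum()} pseudo-labeled examples added")
```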
Ikuya Yamada sur Twitter : "Is entity representation effective to improve multilingual language models?..."
2022-04-13 2022-04-13T15:46:06Z
[[2110.08151] mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](doc:2022/04/2110_08151_mluke_the_power_o)
> mLUKE, an extension of [LUKE](tag:luke) based on 1M Wikidata entity embeddings shared across languages
> mLUKE solves downstream tasks by using its language-agnostic entity embeddings as inputs.
> entity representations are shared across languages during pretraining -> they are much more language-agnostic than word representations

[2110.08151] mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models
Ryokan Ri, Ikuya Yamada, Yoshimasa Tsuruoka. arXiv 2110.08151 (2021-10-15T15:28:38Z, 2022-03-30T14:27:20Z). 2022-04-17 2022-04-17T23:20:52Z
[Ikuya Yamada sur Twitter : "Is entity representation effective to improve multilingual language models?..."](doc:2022/04/ikuya_yamada_sur_twitter_is_)
[Github](https://github.com/studio-ousia/luke)
> Recent studies have shown that multilingual pretrained language models can be effectively improved with cross-lingual alignment information from Wikipedia entities. However, **existing methods only exploit entity information in pretraining and do not explicitly use entities in downstream tasks**. In this study, we explore the **effectiveness of leveraging entity representations for downstream cross-lingual tasks**.
>
> the key insight is that incorporating entity representations into the input allows us to extract more language-agnostic features.
> Entity representations are known to enhance language models in mono-lingual settings (Zhang et al., 2019: [ERNIE](tag:ernie.html); Peters et al., 2019: [[1909.04164] Knowledge Enhanced Contextual Word Representations](doc:2020/05/1909_04164_knowledge_enhanced); Wang et al., 2021: [[1911.06136] KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation](doc:2020/11/1911_06136_kepler_a_unified_); Xiong et al., 2020; Yamada et al., 2020: [[2010.01057] LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](doc:2020/11/2010_01057_luke_deep_context)), presumably by introducing real-world knowledge. We show that using entity representations facilitates cross-lingual transfer by providing language-independent features.
>
> Multilingual extension of LUKE. The model is trained with the multilingual masked language modeling (MLM) task as well as the masked entity prediction (MEP) task with Wikipedia entity embeddings.
> We investigate two ways of using the entity representations in cross-lingual transfer tasks (see the sketch after this entry):
> 1. perform entity linking for the input text, and append the detected entity tokens to the input sequence. The entity tokens are expected to provide language-independent features to the model.
> 2. use the entity [MASK] token from the MEP task as a language-independent feature extractor.

Recent studies have shown that multilingual pretrained language models can be effectively improved with cross-lingual alignment information from Wikipedia entities. However, existing methods only exploit entity information in pretraining and do not explicitly use entities in downstream tasks. In this study, we explore the effectiveness of leveraging entity representations for downstream cross-lingual tasks. We train a multilingual language model with 24 languages with entity representations and show the model consistently outperforms word-based pretrained models in various cross-lingual transfer tasks. We also analyze the model and the key insight is that incorporating entity representations into the input allows us to extract more language-agnostic features. We also evaluate the model with a multilingual cloze prompt task with the mLAMA dataset. We show that entity-based prompt elicits correct factual knowledge more likely than using only word representations. Our source code and pretrained models are available at https://github.com/studio-ousia/luke.
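A minimal sketch of strategy (1) above, using the mLUKE checkpoint the authors released through HuggingFace Transformers (MLukeTokenizer / LukeModel). The entity spans are hand-picked here rather than produced by an entity linker, and the usage pattern is assumed from the Transformers documentation rather than taken from the paper's experimental code.

```python
from transformers import LukeModel, MLukeTokenizer

tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-base")
model = LukeModel.from_pretrained("studio-ousia/mluke-base")

text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7), (17, 28)]   # character spans of "Beyoncé" and "Los Angeles"

# Appending entity tokens to the input: the tokenizer adds one entity token per
# span (filled with the [MASK] entity when no entity names are given).
inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
outputs = model(**inputs)

word_states = outputs.last_hidden_state            # contextualized word tokens
entity_states = outputs.entity_last_hidden_state   # one vector per appended entity
print(word_states.shape, entity_states.shape)
```

Each appended entity gets its own contextualized vector in `entity_last_hidden_state`; these are what the entry above refers to as language-independent features.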