Google AI Blog: From Vision to Language: Semi-supervised Learning in Action…at Scale (2021-07-14)
Semi-Supervised Distillation (SSD). First, the teacher model infers pseudo-labels on the unlabeled dataset, from which we then train a new teacher model (T') that is of equal or larger size than the original teacher model. This step, which is essentially self-training, is then followed by knowledge distillation to produce a smaller student model for production. (A minimal sketch of this recipe follows this group of entries.)

Davlan (David Adelani) @Huggingface (2021-07-29)
Includes an [xlm-roberta-base-finetuned-hausa](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-hausa) model (using data from [CC-100: Monolingual Datasets from Web Crawl Data](doc:2021/07/cc_100_monolingual_datasets_fr)).

CC-100: Monolingual Datasets from Web Crawl Data (2021-07-29)
An attempt to recreate the dataset used for training XLM-R ([[1911.02116] Unsupervised Cross-lingual Representation Learning at Scale](doc:2021/07/1911_02116_unsupervised_cross)).

[1911.02116] Unsupervised Cross-lingual Representation Learning at Scale (arXiv:1911.02116, submitted 2019-11-05, updated 2020-04-08; bookmarked 2021-07-29)
Authors: Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov.
Data: [CC-100: Monolingual Datasets from Web Crawl Data](doc:2021/07/cc_100_monolingual_datasets_fr)
Abstract: This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.

Rio Tinto blasting of 46,000-year-old Aboriginal sites compared to Islamic State's destruction in Palmyra - ABC News (May 2020) (2021-07-31)

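The SSD recipe in the Google AI Blog entry above is described only in prose. Below is a minimal, hypothetical PyTorch-style sketch of the two steps (pseudo-labeling / self-training, then distillation); the model objects and the data loader are placeholders, not anything taken from the blog post.

```python
import torch
import torch.nn.functional as F

def pseudo_label(teacher, unlabeled_loader):
    """Step 1: the teacher infers soft pseudo-labels on the unlabeled set."""
    teacher.eval()
    labeled = []
    with torch.no_grad():
        for x in unlabeled_loader:
            labeled.append((x, teacher(x).softmax(dim=-1)))
    return labeled

def fit_to_soft_targets(model, pseudo_batches, epochs=1, lr=1e-4):
    """Fit a model to another model's soft predictions via KL divergence.
    Used twice: self-training of the larger teacher T', then distillation
    of the small production student."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, targets in pseudo_batches:
            opt.zero_grad()
            loss = F.kl_div(F.log_softmax(model(x), dim=-1), targets,
                            reduction="batchmean")
            loss.backward()
            opt.step()
    return model

# Hypothetical usage (teacher, bigger_model, small_model and unlabeled_loader
# are assumed to exist):
# t_prime = fit_to_soft_targets(bigger_model, pseudo_label(teacher, unlabeled_loader))
# student = fit_to_soft_targets(small_model, pseudo_label(t_prime, unlabeled_loader))
```
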
[2102.11107] Towards Causal Representation Learning (arXiv:2102.11107, submitted 2021-02-22; bookmarked 2021-07-15)
Authors: Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, Yoshua Bengio.
This article reviews fundamental concepts of causal inference and relates them to crucial open problems of machine learning, including transfer learning and generalization, thereby assaying how causality can contribute to modern machine learning research. Related: [Making sense of raw input](doc:2021/05/making_sense_of_raw_input)
Abstract: The two fields of machine learning and graphical causality arose and developed separately. However, there is now cross-pollination and increasing interest in both fields to benefit from the advances of the other. In the present paper, we review fundamental concepts of causal inference and relate them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research. This also applies in the opposite direction: we note that most work in causality starts from the premise that the causal variables are given. A central problem for AI and causality is, thus, causal representation learning, the discovery of high-level causal variables from low-level observations. Finally, we delineate some implications of causality for machine learning and propose key research areas at the intersection of both communities.

[2006.07264] Low-resource Languages: A Review of Past Work and Future Challenges (arXiv:2006.07264, submitted 2020-06-12; bookmarked 2021-07-06)
Authors: Alexandre Magueresse, Vincent Carles, Evan Heetderks.
Tag: bof ("meh").
Abstract: A current problem in NLP is massaging and processing low-resource languages which lack useful training attributes such as supervised data, number of native speakers or experts, etc. This review paper concisely summarizes previous groundbreaking achievements made towards resolving this problem, and analyzes potential improvements in the context of the overall future research direction.

Les groupes sanguins de Neandertal et Denisova décryptés | CNRS (2021-07-29)
English: "The blood groups of Neanderthals and Denisovans deciphered."

Les enjeux de la publicité politique ciblée | CNRS Le journal (2021-07-17)
English: "The stakes of targeted political advertising."
Oana Goga, a researcher at @LIGLab, and her team developed the AdAnalyst tool, which allowed them to analyze the targeting of political advertising on #Facebook.
> Our measurements showed that advertisers can choose from more than 250,000 attributes, many of which are very specific and sometimes sensitive, such as "interest in anti-abortion movements" or "cancer awareness".

Lombok into eclipse (2021-07-28)

[2103.11811] MasakhaNER: Named Entity Recognition for African Languages (arXiv:2103.11811, submitted 2021-03-22, updated 2021-07-05; bookmarked 2021-07-06)
Authors: David Ifeoluwa Adelani, Nkiruka Odu, Jonathan Mukiibi, Kelechi Nwaike, Orevaoghene Ahia, Gerald Muriuki, Clemencia Siro, Ayodele Awokoya, Anuoluwapo Aremu, Sebastian Ruder, Deborah Nabagereka, Kelechi Ogueji, Israel Abebe Azime, Mouhamadane MBOUP, Paul Rayson, Salomey Osei, Ignatius Ezeani, Henok Tilaye, Chester Palen-Michel, Happy Buzaaba, Samba Ngom, Tobius Saul Bateesa, Degaga Wolde, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, Samuel Oyerinde, Daniel D'souza, Maurice Katusiime, Seid Muhie Yimam, Stephen Mayhew, Shamsuddeen Muhammad, Shruti Rijhwani, Verrah Otiende, Perez Ogayo, Mofetoluwa Adeyemi, Chiamaka Chukwuneke, Catherine Gitau, Bonaventure F. P. Dossou, Graham Neubig, Yvonne Wambui, Thierno Ibrahima DIOP, Dibora Gebreyohannes, Blessing Sibanda, Tajuddeen Gwadabe, Jade Abbott, Abdoulaye Faye, Temilola Oloyede, Joyce Nakatumba-Nabende, Tosin Adewumi, Jesujoba Alabi, Iroro Orife, Eric Peter Wairagala, Emmanuel Anebi, Derguene Mbaye, Chris Chinenye Emezue, Davis David, Victor Akinode, Rubungo Andre Niyongabo, Julia Kreutzer, Constantine Lignos.
Abstract: We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. We release the data, code, and models in order to inspire future research on African NLP.

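Since the entry above notes that the data, code and models are publicly released, here is a short, hedged sketch of loading one language split with the Hugging Face `datasets` library. The dataset id `masakhaner` and the language config code (`"yor"` for Yoruba) are assumptions about how the release is mirrored on the Hub, not details taken from the paper.

```python
from datasets import load_dataset

# Assumption: the MasakhaNER release is available on the Hugging Face Hub
# under the id "masakhaner", with one configuration per language (e.g. "yor").
masakhaner = load_dataset("masakhaner", "yor")

example = masakhaner["train"][0]
print(example["tokens"])    # the tokenized sentence
print(example["ner_tags"])  # integer-encoded entity labels

# Recover the human-readable tag names (PER / ORG / LOC / DATE scheme) from the features.
tag_names = masakhaner["train"].features["ner_tags"].feature.names
print(tag_names)
```
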
A la découverte de la cryptographie quantique (2021-07-16)
English: "Discovering quantum cryptography."

[2107.00676] A Primer on Pretrained Multilingual Language Models (arXiv:2107.00676, submitted 2021-07-01; bookmarked 2021-07-13)
Authors: Sumanth Doddapaneni, Gowtham Ramesh, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar.
> MLLMs are useful for bilingual tasks, particularly in low resource scenarios.
>
> The surprisingly good performance of MLLMs in crosslingual transfer as well as bilingual tasks motivates the hypothesis that MLLMs are learning universal patterns. However, our survey of the studies in this space indicates that there is no consensus yet.

Abstract: Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc. have emerged as a viable option for bringing the power of pretraining to a large number of languages. Given their success in zero shot transfer learning, there has emerged a large body of work in (i) building bigger MLLMs covering a large number of languages (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs (iii) analysing the performance of MLLMs on monolingual, zero shot crosslingual and bilingual tasks (iv) understanding the universal language patterns (if any) learnt by MLLMs and (v) augmenting the (often) limited capacity of MLLMs to improve their performance on seen or even unseen languages. In this survey, we review the existing literature covering the above broad areas of research pertaining to MLLMs. Based on our survey, we recommend some promising directions of future research.

A Moderate Proposal for Radically Better AI-powered Web Search (2021-07-10)

jeshraghian/snntorch: Deep learning with spiking neural networks in Python (2021-07-26)
A Python package for performing gradient-based learning with spiking neural networks.

[2010.06467] Pretrained Transformers for Text Ranking: BERT and Beyond (arXiv:2010.06467, submitted 2020-10-13; bookmarked 2021-07-09)
Authors: Jimmy Lin, Rodrigo Nogueira, Andrew Yates.
A 155-page paper!
- [Ranking metrics](tag:ranking_metrics), p. 23
- Keyword search, p. 35
  > most current applications of transformers for text ranking rely on keyword search in a multi-stage ranking architecture, which is the focus of Section 3.
- 3.3 From Passage to Document Ranking, p. 52 [#Long documents](tag:nlp_long_documents)

Abstract: The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has, without exaggeration, revolutionized the fields of natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage ranking architectures and learned dense representations that attempt to perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond the typical sentence-by-sentence processing approaches used in NLP, and techniques for addressing the tradeoff between effectiveness (result quality) and efficiency (query latency). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.

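The multi-stage ranking architecture that the entry above highlights (cheap keyword search as a first stage, a transformer reranker on top) can be sketched in a few lines. This is not code from the survey; `rank_bm25` and a Sentence-Transformers cross-encoder checkpoint are used here as stand-in choices for the two stages.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "BERT is a pretrained transformer encoder.",
    "BM25 is a classic lexical ranking function.",
    "Multi-stage ranking reranks candidates retrieved by keyword search.",
]
query = "how do transformers rerank keyword-search results?"

# Stage 1: inexpensive keyword retrieval over the whole corpus (BM25).
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(query.lower().split())
candidates = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:2]

# Stage 2: an expensive cross-encoder rescoring only the small candidate set.
# The checkpoint name is just an example, not one prescribed by the survey.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = reranker.predict([(query, corpus[i]) for i in candidates])
ranked = [corpus[i] for _, i in sorted(zip(rerank_scores, candidates), reverse=True)]
print(ranked)
```
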
« Au vu des forces économiques en présence, les abeilles et les pollinisateurs apparaissent indéfendables » (2021-07-04)
English: "Given the economic forces at play, bees and pollinators appear indefensible."

Une solution inspirée du cerveau pour éviter l'oubli catastrophique des IA | INSIS (2021-07-26)
English: "A brain-inspired solution to avoid catastrophic forgetting in AIs."

« Amkoullel, l'enfant peul », fresque vivante d'une jeunesse malienne au début du XXe siècle (2021-07-26)
English: "'Amkoullel, the Fula Child', a vivid fresco of a Malian youth at the start of the 20th century."

Livre : « Tout s'effondre », un hommage à l'Afrique anté-coloniale à l'heure de sa désagrégation (2021-07-14)
English: "Book: 'Things Fall Apart', a tribute to pre-colonial Africa at the moment of its disintegration."
"Things fall apart"
> As long as the lions do not have their own historians, the history of the hunt will always glorify the hunter.

raphaelsty/rebert: Renault Bert (2021-07-26)
MLM pre-training using an already pre-trained model, e.g. continuing the pre-training on Renault's texts. Inspired by [Retraining roberta-base using the RoBERTa MLM Procedure | Medium](doc:2022/03/retraining_roberta_base_using_t). (A minimal sketch of this kind of continued pre-training follows this group of entries.)

Thomas Piketty : « Face au régime chinois, la bonne réponse passe par une nouvelle forme de socialisme démocratique et participatif » (2021-07-10)
English: "Thomas Piketty: 'Faced with the Chinese regime, the right response is a new form of democratic and participatory socialism.'"

« Projet Pegasus » : révélations sur un système mondial d'espionnage de téléphones (2021-07-18)
English: "'Pegasus Project': revelations about a worldwide phone-spying system."

Practical Natural Language Processing for Low-Resource Languages (2021-07-06)

Du glyphosate aux SDHI, les ressorts de la controverse (2021-07-01)
English: "From glyphosate to SDHIs, the drivers of the controversy."
See [Pesticides et santé : les conclusions inquiétantes de l'expertise collective de l'Inserm](doc:2021/07/pesticides_et_sante_les_concl).

Pesticides et santé : les conclusions inquiétantes de l'expertise collective de l'Inserm (2021-07-01)
English: "Pesticides and health: the worrying conclusions of Inserm's collective expert review."
Cancers, cognitive disorders, neurodegenerative diseases, endometriosis… The experts commissioned by the French National Institute of Health and Medical Research (Inserm) have drawn up the most exhaustive picture to date of the effects of exposure to these products.

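The raphaelsty/rebert entry above describes continuing masked-language-model pre-training from an already pre-trained checkpoint on domain-specific text. A minimal sketch with Hugging Face `transformers` follows; the checkpoint choice, the `domain_corpus.txt` file name and all hyperparameters are placeholders, and this is not code from the rebert repository.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from an already pre-trained checkpoint and keep pre-training with the MLM objective.
checkpoint = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# "domain_corpus.txt" is a placeholder for the in-house text (e.g. company documents).
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)

# The collator applies dynamic masking (15% of tokens), i.e. the RoBERTa-style MLM procedure.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rebert-mlm", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
model.save_pretrained("rebert-mlm")
```
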
L'Insulte (film) (2021-07-21)
English: "The Insult (film)."

[2010.12309] A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios (arXiv:2010.12309, submitted 2020-10-23, updated 2021-04-09; bookmarked 2021-07-06)
Authors: Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen, Dietrich Klakow.
Low-resource scenarios: low-resource languages, but also non-standard domains and tasks. One key goal of this survey is to highlight the underlying assumptions. [Blog post](https://towardsdatascience.com/a-visual-guide-to-low-resource-nlp-d7b4c7b1a4bc)
Abstract: Deep neural networks and huge language models are becoming omnipresent in natural language applications. As they are known for requiring large amounts of training data, there is a growing body of work to improve the performance in low-resource settings. Motivated by the recent fundamental changes towards neural models and the popular pre-train and fine-tune paradigm, we survey promising approaches for low-resource natural language processing. After a discussion about the different dimensions of data availability, we give a structured overview of methods that enable learning when training data is sparse. This includes mechanisms to create additional labeled data like data augmentation and distant supervision as well as transfer learning settings that reduce the need for target supervision. A goal of our survey is to explain how these methods differ in their requirements as understanding them is essential for choosing a technique suited for a specific low-resource setting. Further key aspects of this work are to highlight open issues and to outline promising directions for future research.

Jacob Desvarieux, leader du groupe antillais Kassav', est mort (2021-07-31)
English: "Jacob Desvarieux, leader of the French Caribbean band Kassav', has died."

Nandan Thakur on Twitter: "@ikuyamada @Nils_Reimers Thanks @ikuyamad..." (2021-07-09)
Related to [UKPLab/beir: A Heterogeneous Benchmark for Information Retrieval.](doc:2021/07/ukplab_beir_a_heterogeneous_be) and [[2106.00882] Efficient Passage Retrieval with Hashing for Open-domain Question Answering](doc:2021/06/2106_00882_efficient_passage_).

UKPLab/beir: A Heterogeneous Benchmark for Information Retrieval. (2021-07-09)
> BEIR is a heterogeneous benchmark containing diverse IR tasks.
> Easy to use, evaluate your NLP-based retrieval models across 15+ diverse IR datasets.

[Paper](doc:2021/07/2104_08663_beir_a_heterogeno)

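The repository README quoted above promises easy zero-shot evaluation across its datasets. The sketch below follows the entry points I recall from that README (`GenericDataLoader`, `EvaluateRetrieval`, the `DenseRetrievalExactSearch` wrapper); the class names, dataset folder layout and model checkpoint should be treated as assumptions to check against the current repository.

```python
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Assumes one BEIR dataset (e.g. "scifact") has already been downloaded and
# unzipped into datasets/scifact, in the corpus/queries/qrels layout BEIR uses.
corpus, queries, qrels = GenericDataLoader(data_folder="datasets/scifact").load(split="test")

# Wrap a dense sentence encoder; the checkpoint name here is just an example.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

results = retriever.retrieve(corpus, queries)  # query_id -> {doc_id: score}
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # zero-shot nDCG@k on the chosen dataset
```
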
[2104.08663] BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models (arXiv:2104.08663, submitted 2021-04-17, updated 2021-04-28; bookmarked 2021-07-09)
Authors: Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, Iryna Gurevych.
[GitHub](doc:2021/07/ukplab_beir_a_heterogeneous_be)
> Our results show **BM25 is a robust baseline** and **Reranking-based models overall achieve the best zero-shot performances**, however, at high computational costs. In contrast, **Dense-retrieval models are computationally more efficient but often underperform other approaches**.

17 English evaluation datasets, 9 heterogeneous tasks (non-English left for future work).
Abstract: Neural IR models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their generalization capabilities. To address this, and to allow researchers to more broadly establish the effectiveness of their models, we introduce BEIR (Benchmarking IR), a heterogeneous benchmark for information retrieval. We leverage a careful selection of 17 datasets for evaluation spanning diverse retrieval tasks including open-domain datasets as well as narrow expert domains. We study the effectiveness of nine state-of-the-art retrieval models in a zero-shot evaluation setup on BEIR, finding that performing well consistently across all datasets is challenging. Our results show BM25 is a robust baseline and Reranking-based models overall achieve the best zero-shot performances, however, at high computational costs. In contrast, Dense-retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities. In this work, we extensively analyze different retrieval models and provide several suggestions that we believe may be useful for future work. BEIR datasets and code are available at https://github.com/UKPLab/beir.

Plotly: The front end for ML and data science models (2021-07-01)
> The premier low-code platform for ML & data science apps.

[Dash](https://plotly.com/dash/)

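As a reminder of what Plotly's "low-code app for a model" pitch looks like in practice, here is a minimal, hypothetical Dash app (Dash 2.x import style); the "model" is a stand-in function, not anything from Plotly's documentation.

```python
from dash import Dash, Input, Output, dcc, html

app = Dash(__name__)
app.layout = html.Div([
    html.H3("Toy model explorer"),
    dcc.Slider(id="x", min=0, max=10, step=1, value=3),
    html.Div(id="prediction"),
])

@app.callback(Output("prediction", "children"), Input("x", "value"))
def predict(x):
    # Placeholder for a real ML model's inference call.
    return f"model(x) = {2 * x + 1}"

if __name__ == "__main__":
    app.run_server(debug=True)
```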