Online speech recognition with wav2letter@anywhere (saved 2020-02-12)

Scalable Nearest Neighbor Search for Optimal Transport
Arturs Backurs, Yihe Dong, Piotr Indyk, Ilya Razenshteyn, Tal Wagner · arXiv 1910.04126 (submitted 2019-10-09, updated 2020-02-14; saved 2020-02-20)
The Optimal Transport (a.k.a. Wasserstein) distance is an increasingly popular similarity measure for rich data domains, such as images or text documents. This raises the necessity for fast nearest neighbor search with respect to this distance, a problem that poses a substantial computational bottleneck for various tasks on massive datasets. In this work, we study fast tree-based approximation algorithms for searching nearest neighbors w.r.t. the Wasserstein-1 distance. A standard tree-based technique, known as Quadtree, has been previously shown to obtain good results. We introduce a variant of this algorithm, called Flowtree, and formally prove it achieves asymptotically better accuracy. Our extensive experiments, on real-world text and image datasets, show that Flowtree improves over various baselines and existing methods in either running time or accuracy. In particular, its quality of approximation is in line with previous high-accuracy methods, while its running time is much faster. (A sketch of the underlying Quadtree estimate appears after this group of entries.)

CARAA on Twitter: "Probably the first photo of Notre Dame de Paris in 1838 !! (daguerreotype)" (saved 2020-02-16)

[Yoshua Bengio’s blog – first words](https://yoshuabengio.org/2020/02/10/fusce-risus/) (Yoshua Bengio, saved 2020-02-12)

Nucléaire : pourquoi la centrale de Flamanville ne produit plus d’électricité depuis six mois [Nuclear power: why the Flamanville plant has not produced electricity for six months] (saved 2020-02-26)

Is the future of Neural Networks Sparse? An Introduction (saved 2020-02-05)
> Are all the dimensions of my input vector interacting with all the others? Usually not. So going sparse may be useful.
> Convolutional layers are a smart and efficient way to implement a sparse transformation on an input tensor...
Some other networks contain much larger matrices that may benefit from sparsity: Transformers. But
> It’s hard to implement general sparse matrix computations on GPUs in an efficient way...
Easier if the matrices' non-zeros are grouped in small fixed-size blocks (see the block-sparsity sketch after this group of entries).

Enquête sur les usines d’antibiotiques indiennes, fabriques d’antibiorésistance (2018) [Investigation into India's antibiotic factories, breeding grounds of antibiotic resistance] (saved 2020-02-09)
> More than 90% of our antibiotics come out of Chinese or Indian factories, and part of their effluents ends up in the environment, creating hotbeds of antibiotic resistance.

[#Coronavirus: Pangolin's Revenge?](https://twitter.com/hyperfp/status/1226070112651948032?s=20) (saved 2020-02-16)
« Le pangolin tient-il sa revanche avec le nouveau coronavirus ? » ("Is the pangolin taking its revenge with the new coronavirus?")
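A reading note on the Flowtree entry above: below is a minimal NumPy sketch of the plain Quadtree estimate of Wasserstein-1 that the paper takes as its baseline (not the Flowtree refinement itself). The randomly shifted grid, the per-level weighting and the function names are my own illustrative choices, not code from the paper.

```python
import numpy as np

def quadtree_w1_estimate(x, wx, y, wy, levels=6, rng=None):
    """Crude Quadtree estimate of Wasserstein-1 between two weighted point
    clouds in [0, 1)^2.  x, y: (n, 2) point arrays; wx, wy: nonnegative
    weights summing to 1.  A random shift of the grid removes the bias that
    a fixed grid would introduce on average."""
    rng = np.random.default_rng(rng)
    shift = rng.random(2)                      # random translation of the grid
    est = 0.0
    for level in range(levels):
        cells = 2 ** level                     # cells per axis at this level
        side = 1.0 / cells                     # cell side length
        def cell_ids(pts):
            # flattened index of the cell containing each (shifted) point
            idx = np.floor(((pts + shift) % 1.0) * cells).astype(int)
            return idx[:, 0] * cells + idx[:, 1]
        mass_x = np.bincount(cell_ids(x), weights=wx, minlength=cells * cells)
        mass_y = np.bincount(cell_ids(y), weights=wy, minlength=cells * cells)
        # mass that must leave a cell pays roughly the cell's diameter
        est += side * np.abs(mass_x - mass_y).sum()
    return est

# toy usage: two small point clouds with uniform weights
a, b = np.random.rand(50, 2), np.random.rand(60, 2)
print(quadtree_w1_estimate(a, np.full(50, 1 / 50), b, np.full(60, 1 / 60)))
```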
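On the block-sparsity remark in "Is the future of Neural Networks Sparse?" above: a small SciPy illustration of the difference between scattering non-zeros anywhere and grouping them into fixed-size blocks (BSR format), the layout that maps much better onto GPU kernels. The sizes, densities and the 16x16 block shape are arbitrary choices for the example.

```python
import numpy as np
from scipy.sparse import bsr_matrix, random as sparse_random

rng = np.random.default_rng(0)

# Unstructured sparsity: ~5% non-zeros scattered anywhere in a 1024x1024 matrix.
unstructured = sparse_random(1024, 1024, density=0.05, format="csr", random_state=0)

# Block sparsity: roughly the same density, but non-zeros live in 16x16 tiles,
# so each surviving tile is a small dense matrix that hardware multiplies well.
blocks_per_axis = 1024 // 16
tile_mask = rng.random((blocks_per_axis, blocks_per_axis)) < 0.05   # keep ~5% of tiles
dense = np.where(np.kron(tile_mask, np.ones((16, 16))),
                 rng.standard_normal((1024, 1024)), 0.0)
block_sparse = bsr_matrix(dense, blocksize=(16, 16))

x = rng.standard_normal(1024)
y1 = unstructured @ x      # works, but with an irregular memory access pattern
y2 = block_sparse @ x      # same kind of product, over dense 16x16 sub-blocks
print(y1.shape, y2.shape, block_sparse.blocksize, block_sparse.nnz)
```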
A Primer in BERTology: What we know about how BERT works
Anna Rogers, Olga Kovaleva, Anna Rumshisky · arXiv 2002.12327 (submitted 2020-02-27; saved 2020-02-28)
Transformer-based models are now widely used in NLP, but we still do not understand a lot about their inner workings. This paper describes what is known to date about the famous BERT model (Devlin et al. 2019), synthesizing over 40 analysis studies. We also provide an overview of the proposed modifications to the model and its training regime. We then outline the directions for further research.
(Article praised on [twitter](https://twitter.com/dennybritz/status/1233343170596917248?s=20) by D. Britz and Y. Goldberg.)

Detecting Potential Topics In News Using BERT, CRF and Wikipedia
Swapnil Ashok Jadhav · arXiv 2002.11402 (submitted 2020-02-26, updated 2020-02-27; saved 2020-02-28)
For a news content distribution platform like Dailyhunt, Named Entity Recognition is a pivotal task for building better user recommendation and notification algorithms. Apart from identifying names, locations, organisations from the news for 13+ Indian languages and use them in algorithms, we also need to identify n-grams which do not necessarily fit in the definition of Named-Entity, yet they are important. For example, "me too movement", "beef ban", "alwar mob lynching". In this exercise, given an English language text, we are trying to detect case-less n-grams which convey important information and can be used as topics and/or hashtags for a news. Model is built using Wikipedia titles data, private English news corpus and BERT-Multilingual pre-trained model, Bi-GRU and CRF architecture. It shows promising results when compared with industry best Flair, Spacy and Stanford-caseless-NER in terms of F1 and especially Recall.

Machine Learning Crash Course | Google Developers (saved 2020-02-18)

Hugging Face on Twitter: DistilBERT-cased for Question Answering w/ just 3 lines of javascript (saved 2020-02-14)

My First NN Part 3. Multi-Layer Networks and Backpropagation | Scott H. Hawley (alt. blog via fastpages) (saved 2020-02-20)
> Here’s all the math for backprop written out & color-coded. This and other lessons I wrote in Colab are quickly becoming blog posts thanks to FastPages ( @HamelHusain ) and nbdev ( @GuggerSylvain & @jeremyphoward )!

Canwen Xu on Twitter: "WTF? We brutally dismember BERT and replace all his organs?" [paper](/doc/2020/02/_2002_02925_bert_of_theseus_c) (saved 2020-02-10)

BERT-of-Theseus: Compressing BERT by Progressive Module Replacing
Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, Ming Zhou · arXiv 2002.02925 (submitted 2020-02-07, updated 2020-03-25; saved 2020-02-10)
In this paper, we propose a novel model compression approach to effectively compress BERT by progressive module replacing. Our approach first divides the original BERT into several modules and builds their compact substitutes. Then, we randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules. We progressively increase the probability of replacement through the training. In this way, our approach brings a deeper level of interaction between the original and compact models, and smooths the training process. Compared to the previous knowledge distillation approaches for BERT compression, our approach leverages only one loss function and one hyper-parameter, liberating human effort from hyper-parameter tuning. Our approach outperforms existing knowledge distillation approaches on GLUE benchmark, showing a new perspective of model compression.
An approach to compress BERT by progressive module replacing.
> Compared to the previous knowledge distillation approaches for BERT compression, our approach leverages only one loss function and one hyper-parameter
[Github](https://github.com/JetRunner/BERT-of-Theseus)
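A toy PyTorch sketch of the progressive module replacing idea described in the BERT-of-Theseus entry above: frozen original blocks are swapped for their compact substitutes with a probability that is annealed towards 1 during training. This is a simplified, assumption-laden illustration (plain feed-forward blocks, one substitute per block), not the authors' implementation; see their [Github](https://github.com/JetRunner/BERT-of-Theseus) for the real thing.

```python
import torch
import torch.nn as nn

class TheseusEncoder(nn.Module):
    """Each 'predecessor' block is replaced by its smaller 'successor' with
    probability replace_prob during training; annealing replace_prob to 1
    lets the compact model gradually take over."""
    def __init__(self, predecessors, successors):
        super().__init__()
        assert len(predecessors) == len(successors)
        self.predecessors = nn.ModuleList(predecessors)   # frozen original modules
        self.successors = nn.ModuleList(successors)       # trainable compact modules
        for p in self.predecessors.parameters():
            p.requires_grad = False
        self.replace_prob = 0.5

    def forward(self, h):
        for pred, succ in zip(self.predecessors, self.successors):
            if self.training and torch.rand(()) < self.replace_prob:
                h = succ(h)          # compact substitute handles this block
            else:
                h = pred(h)          # original (frozen) block
        return h

# toy usage: 6 "original" blocks compressed into 6 smaller ones
d = 32
orig = [nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d)) for _ in range(6)]
small = [nn.Linear(d, d) for _ in range(6)]
model = TheseusEncoder(orig, small)
out = model(torch.randn(8, d))
model.replace_prob = 1.0   # by the end of training, only the compact modules remain
```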
No Fuss Distance Metric Learning using Proxies
Yair Movshovitz-Attias, Alexander Toshev, Thomas K. Leung, Sergey Ioffe, Saurabh Singh · arXiv 1703.07464 (submitted 2017-03-21, updated 2017-08-01; saved 2020-02-09)
We address the problem of distance metric learning (DML), defined as learning a distance consistent with a notion of semantic similarity. Traditionally, for this problem supervision is expressed in the form of sets of points that follow an ordinal relationship -- an anchor point $x$ is similar to a set of positive points $Y$, and dissimilar to a set of negative points $Z$, and a loss defined over these distances is minimized. While the specifics of the optimization differ, in this work we collectively call this type of supervision Triplets and all methods that follow this pattern Triplet-Based methods. These methods are challenging to optimize. A main issue is the need for finding informative triplets, which is usually achieved by a variety of tricks such as increasing the batch size, hard or semi-hard triplet mining, etc. Even with these tricks, the convergence rate of such methods is slow. In this paper we propose to optimize the triplet loss on a different space of triplets, consisting of an anchor data point and similar and dissimilar proxy points which are learned as well. These proxies approximate the original data points, so that a triplet loss over the proxies is a tight upper bound of the original loss. This proxy-based loss is empirically better behaved. As a result, the proxy-loss improves on state-of-art results for three standard zero-shot learning datasets, by up to 15% points, while converging three times as fast as other triplet-based losses.

> We address the problem of distance metric learning (DML), defined as learning a distance consistent with a notion of semantic similarity...
> Traditionally, supervision is expressed in the form of sets of points that follow an ordinal relationship – an anchor point x is similar to a set of positive points Y, and dissimilar to a set of negative points Z, and a loss defined over these distances is minimized.
> Triplet-Based methods are challenging to optimize (a main issue is the need for finding informative triplets).
>
> We propose to **optimize the triplet loss on a different space of triplets, consisting of an anchor data point and similar and dissimilar proxy points which are learned as well**. These proxies approximate the original data points, so that a triplet loss over the proxies is a tight upper bound of the original loss.

Mentioned in this [blog post](/doc/2020/01/training_a_speaker_embedding_fr):
> "**Proxy based triplet learning**": instead of generating triplets, we learn an embedding for each class and use the learnt embedding as a proxy for triplets as part of the training. In other words, we can train end to end without the computationally expensive step of resampling triplets after each network update.

Near the conclusion:
> Our formulation of Proxy-NCA loss produces a loss very similar to the standard cross-entropy loss used in classification. However, we arrive at our formulation from a different direction: we are not interested in the actual classifier and indeed discard the proxies once the model has been trained. Instead, the proxies are auxiliary variables, enabling more effective optimization of the embedding model parameters. **As such, our formulation not only enables us to surpass the state of the art in zero-shot learning, but also offers an explanation to the effectiveness of the standard trick of training a classifier, and using its penultimate layer’s output as the embedding.**
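A minimal PyTorch sketch of a Proxy-NCA-style loss as described in the notes above: one learned proxy per class, and a softmax/cross-entropy form over distances to the proxies, so no triplet mining is needed. Normalisation, scaling and initialisation choices here are mine, not the paper's.

```python
import torch
import torch.nn.functional as F

class ProxyNCA(torch.nn.Module):
    """One learned proxy per class; each embedding is attracted to its class
    proxy and repelled from the others, avoiding triplet mining altogether."""
    def __init__(self, n_classes, dim):
        super().__init__()
        self.proxies = torch.nn.Parameter(torch.randn(n_classes, dim) * 0.01)

    def forward(self, embeddings, labels):
        z = F.normalize(embeddings, dim=-1)       # unit-norm embeddings
        p = F.normalize(self.proxies, dim=-1)     # unit-norm proxies
        dist = torch.cdist(z, p) ** 2             # squared distances to all proxies
        # negative distances as logits: this is literally a cross-entropy loss,
        # which matches the observation quoted from the paper's conclusion
        return F.cross_entropy(-dist, labels)

# toy usage: 8 embeddings of dim 64, 5 classes
loss_fn = ProxyNCA(n_classes=5, dim=64)
emb = torch.randn(8, 64, requires_grad=True)
labels = torch.randint(0, 5, (8,))
loss = loss_fn(emb, labels)
loss.backward()
```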
The men who starved to death to save the world's seeds - Russia Beyond (saved 2020-02-12)
> During the siege of Leningrad, a group of Russian botanists starved to death rather than consume the greatest collection of seeds they were guarding. Nikolay Vavilov, the man who had collected the seeds, also died of hunger in Stalin’s gulag.

Contrastive Self-Supervised Learning | Ankesh Anand (2020) (saved 2020-02-15)
Methods that build representations by learning to encode what makes two things similar or different.
> an overview of how contrastive methods differ from other self-supervised learning techniques
> can we build representation learning algorithms that don’t concentrate on pixel-level details, and only encode high-level features sufficient enough to distinguish different objects?
Learn an encoder such that score(f(x), f(x+)) >> score(f(x), f(x-)), where x+ is similar to x (a "positive" example) and x- is a "negative" example. A softmax classifier can be used to optimize this (the "InfoNCE loss", which behaves like a cross-entropy loss); see the sketch just below.

Self-Supervised Representation Learning (saved 2020-02-15)
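A short PyTorch sketch of the InfoNCE objective mentioned in the contrastive-learning entry above: the positive and the negatives are scored against the anchor and fed to a cross-entropy loss whose "correct class" is always the positive. The temperature value and the normalisation are conventional choices, not prescribed by the post.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE as a softmax classifier over one positive and K negatives:
    maximise score(f(x), f(x+)) relative to score(f(x), f(x-)).
    anchor, positive: (B, D); negatives: (B, K, D)."""
    a = F.normalize(anchor, dim=-1)
    pos = F.normalize(positive, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    l_pos = (a * pos).sum(-1, keepdim=True)        # (B, 1) similarity to x+
    l_neg = torch.einsum("bd,bkd->bk", a, neg)     # (B, K) similarities to x-
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # the target class is always index 0 (the positive), so this is exactly
    # a cross-entropy loss, as noted in the entry
    target = torch.zeros(a.size(0), dtype=torch.long)
    return F.cross_entropy(logits, target)

loss = info_nce(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 16, 128))
```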
The Matrix Calculus You Need For Deep Learning
Terence Parr, Jeremy Howard · arXiv 1802.01528 (submitted 2018-02-05, updated 2018-07-02; saved 2020-02-19)
This paper is an attempt to explain all the matrix calculus you need in order to understand the training of deep neural networks. We assume no math knowledge beyond what you learned in calculus 1, and provide links to help you refresh the necessary math where needed. Note that you do not need to understand this material before you start learning to train and use deep learning in practice; rather, this material is for those who are already familiar with the basics of neural networks, and wish to deepen their understanding of the underlying math. Don't worry if you get stuck at some point along the way---just go back and reread the previous section, and try writing down and working through some examples. And if you're still stuck, we're happy to answer your questions in the Theory category at forums.fast.ai. Note: There is a reference section at the end of the paper summarizing all the key matrix calculus rules and terminology discussed here. See related articles at http://explained.ai
Related blog post: [The Math Behind Neural Networks](https://towardsdatascience.com/step-by-step-the-math-behind-neural-networks-490dc1f3cfd9)

Hugging Face: How to train a new language model from scratch using Transformers and Tokenizers (saved 2020-02-16)

Calling Java from Python - Stack Overflow (saved 2020-02-20)

Extractive Text Summarization Using spaCy in Python (saved 2020-02-09)

Transformers as Soft Reasoners over Language
Peter Clark, Oyvind Tafjord, Kyle Richardson · arXiv 2002.05867 (submitted 2020-02-14; saved 2020-02-17)
AI has long pursued the goal of having systems reason over *explicitly provided* knowledge, but building suitable representations has proved challenging. Here we explore whether transformers can similarly learn to reason (or emulate reasoning), but using rules expressed in language, thus bypassing a formal representation. We provide the first demonstration that this is possible, and characterize the extent of this capability. To do this, we use a collection of synthetic datasets that test increasing levels of reasoning complexity (number of rules, presence of negation, and depth of chaining). We find transformers appear to learn rule-based reasoning with high (99%) accuracy on these datasets, and in a way that generalizes to test data requiring substantially deeper chaining than in the training data (95%+ scores). We also demonstrate that the models transfer well to two hand-authored rulebases, and to rulebases paraphrased into more natural language. These findings are significant as it suggests a new role for transformers, namely as a limited "soft theorem prover" operating over explicit theories in language. This in turn suggests new possibilities for explainability, correctability, and counterfactual reasoning in question-answering. All datasets and a live demo are available at http://rule-reasoning.apps.allenai.org/
> AI has long pursued the goal of having systems reason over *explicitly provided* knowledge, but building suitable representations has proved challenging. Here we explore whether transformers can similarly learn to reason (or emulate reasoning), but **using rules expressed in language, thus bypassing a formal representation**.

ElastiK Nearest Neighbors (saved 2020-02-13)
Combining Locality-Sensitive Hashing and Elasticsearch for Scalable Online K-Nearest Neighbors Search. [Github](https://github.com/alexklibisz/elastik-nearest-neighbors), [Improved version](https://github.com/alexklibisz/elastiknn)

Machine Learning at the VU University Amsterdam (saved 2020-02-18)

fastai: A Layered API for Deep Learning
Jeremy Howard, Sylvain Gugger · arXiv 2002.04688 (submitted 2020-02-11, updated 2020-02-16; saved 2020-02-13)
Paper describing the fast.ai v2 API.
fastai is a deep learning library which provides practitioners with high-level components that can quickly and easily provide state-of-the-art results in standard deep learning domains, and provides researchers with low-level components that can be mixed and matched to build new approaches. It aims to do both things without substantial compromises in ease of use, flexibility, or performance. This is possible thanks to a carefully layered architecture, which expresses common underlying patterns of many deep learning and data processing techniques in terms of decoupled abstractions. These abstractions can be expressed concisely and clearly by leveraging the dynamism of the underlying Python language and the flexibility of the PyTorch library. fastai includes: a new type dispatch system for Python along with a semantic type hierarchy for tensors; a GPU-optimized computer vision library which can be extended in pure Python; an optimizer which refactors out the common functionality of modern optimizers into two basic pieces, allowing optimization algorithms to be implemented in 4-5 lines of code; a novel 2-way callback system that can access any part of the data, model, or optimizer and change it at any point during training; a new data block API; and much more. We have used this library to successfully create a complete deep learning course, which we were able to write more quickly than using previous approaches, and the code was more clear. The library is already in wide use in research, industry, and teaching. NB: This paper covers fastai v2, which is currently in pre-release at http://dev.fast.ai/
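To make the "layered API" claim in the fastai entry concrete, here is the kind of short, high-level training example the library is known for (cats vs dogs on the Oxford-IIIT Pets data). This is written from memory against fastai v2 (ImageDataLoaders.from_name_func, cnn_learner, fine_tune); check dev.fast.ai for the authoritative version.

```python
from fastai.vision.all import *

# download the Pets dataset; cat images have filenames starting with an uppercase letter
path = untar_data(URLs.PETS) / 'images'

def is_cat(x):
    return x[0].isupper()

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

learn = cnn_learner(dls, resnet34, metrics=error_rate)  # transfer learning from ImageNet
learn.fine_tune(1)                                      # one epoch of fine-tuning
```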
Jeremy Howard on Twitter: "The fastai paper (with @GuggerSylvain) covers v2..." (saved 2020-02-13)

L’évaluation officielle du glyphosate de nouveau mise en cause [The official evaluation of glyphosate called into question again] (saved 2020-02-24)
> A reanalysis of the tests supplied to the regulatory authorities indicates that the controversial herbicide is likely to trigger cancers in rodents.
> This conclusion is notable because these same tests, most of which were carried out by the manufacturers themselves, served as the basis for the opinions of the regulatory authorities, notably in Europe and the United States. Those authorities unanimously concluded, on the contrary, that glyphosate has no carcinogenic potential.
> Being confidential, the regulatory tests generally cannot be examined by the scientific community.

fastai/fastpages: An easy to use blogging platform, with enhanced support for Jupyter Notebooks. (saved 2020-02-25)

Joint Embedding of Words and Labels for Text Classification
Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, Lawrence Carin · arXiv 1805.04174 (ACL 2018; submitted 2018-05-10; saved 2020-02-18)
Word embeddings are effective intermediate representations for capturing semantic regularities between words, when learning the representations of text sequences. We propose to view text classification as a label-word joint embedding problem: each label is embedded in the same space with the word vectors. We introduce an attention framework that measures the compatibility of embeddings between text sequences and labels. The attention is learned on a training set of labeled samples to ensure that, given a text sequence, the relevant words are weighted higher than the irrelevant ones. Our method maintains the interpretability of word embeddings, and enjoys a built-in ability to leverage alternative sources of information, in addition to input text sequences. Extensive results on the several large text datasets show that the proposed framework outperforms the state-of-the-art methods by a large margin, in terms of both accuracy and speed.
> text classification as a label-word joint embedding problem: **each label is embedded in the same space with the word vectors**. We introduce an attention framework that measures the compatibility of embeddings between text sequences and labels. The attention is learned on a training set of labeled samples to ensure that, given a text sequence, the relevant words are weighted higher than the irrelevant ones.
(from the introduction:)
> For the task of text classification, labels play a central role of the final performance. A natural question to ask is how we can directly use label information in constructing the text-sequence representations
> The proposed LEAM (Label-Embedding Attentive Model) is implemented by jointly embedding the word and label in the same latent space, and **the text representations are constructed directly using the text-label compatibility**.
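A stripped-down PyTorch sketch of the LEAM idea from the entry above: labels are embedded in the same space as words, a word-label compatibility matrix yields attention weights over the sequence, and the attended representation is scored against the label embeddings. The real model adds a phrase-level convolution over the compatibility matrix and an MLP classifier; those parts are omitted here.

```python
import torch
import torch.nn.functional as F

def leam_attention(word_emb, label_emb):
    """word_emb: (B, T, D) word vectors; label_emb: (C, D) label vectors,
    both living in the same embedding space."""
    # cosine compatibility between every word and every label: (B, T, C)
    G = torch.einsum("btd,cd->btc",
                     F.normalize(word_emb, dim=-1),
                     F.normalize(label_emb, dim=-1))
    # per-word relevance = best compatibility over labels, softmaxed over time
    beta = F.softmax(G.max(dim=-1).values, dim=1)       # (B, T) attention weights
    z = torch.einsum("bt,btd->bd", beta, word_emb)      # attended text representation
    logits = z @ label_emb.t()                          # score text against each label
    return logits, beta

logits, attn = leam_attention(torch.randn(2, 7, 50), torch.randn(4, 50))
```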
How Much Knowledge Can You Pack Into the Parameters of a Language Model? (saved 2020-02-11)
Adam Roberts on Twitter: "New preprint: How Much Knowledge Can You Pack into the Parameters of a Language Model?..." [paper](/doc/2020/02/how_much_knowledge_can_you_pack)
> It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries.
Indeed, cf. Facebook's paper [Language Models as Knowledge Bases?](/doc/2019/09/_1909_01066_language_models_as)
> In this short paper, we measure the practical utility of this approach by fine-tuning pre-trained models to answer questions without access to any external context or knowledge.
> we show that a large language model pre-trained on unstructured text can attain competitive results on open-domain question answering benchmarks without any access to external knowledge
BUT:
> 1. state-of-the-art results are obtained only with the largest model, which has 11 billion parameters.
> 2. "open-book" models typically provide some indication of what information they accessed when answering a question, which provides a useful form of interpretability. In contrast, our model distributes knowledge in its parameters in an inexplicable way, which precludes this form of interpretability.
> 3. **the maximum-likelihood objective provides no guarantees as to whether a model will learn a fact or not.**
So, what's the point? To be compared with this [IBM paper](/doc/2019/09/_1909_04120_span_selection_pre): "a new pre-training task inspired by reading comprehension and an effort to avoid encoding general knowledge in the transformer network itself".

FastHugs | ntentional (saved 2020-02-19)
Notebook: fine-tune a text classification model with HuggingFace transformers and fastai-v2.

« Le Mariage de Roland », Victor Hugo, La Légende des Siècles, 1859. (saved 2020-02-20)
> Ils se battent - combat terrible !...
> Il dit, et déracine un chêne.
> Sire Olivier arrache un orme dans la plaine
> Nous lutterons ainsi que lions et panthères.
> Ne vaudrait-il pas mieux que nous devinssions frères ?
> Écoute, j’ai ma sœur, la belle Aude au bras blanc,
> Épouse-la. - Pardieu ! je veux bien, dit Roland.
> Et maintenant buvons, car l'affaire était chaude.
> C'est ainsi que Roland épousa la belle Aude.
Information Retrieval for HR (saved 2020-02-14)
Meetup NLP #6 – July 25, 2018. Ismael Belghiti, CTO @ Hiresweet:
> how different NLP techniques can be applied to compute a matching score between a profile and a job offer, comparing their performance on a dedicated ranking metric.

Label-Embedding for Image Classification
Zeynep Akata, Florent Perronnin, Zaid Harchaoui, Cordelia Schmid · arXiv 1503.08677 (submitted 2015-03-30, updated 2015-10-01; saved 2020-02-18)
Attributes act as intermediate representations that enable parameter sharing between classes, a must when training data is scarce. We propose to view attribute-based image classification as a label-embedding problem: each class is embedded in the space of attribute vectors. We introduce a function that measures the compatibility between an image and a label embedding. The parameters of this function are learned on a training set of labeled samples to ensure that, given an image, the correct classes rank higher than the incorrect ones. Results on the Animals With Attributes and Caltech-UCSD-Birds datasets show that the proposed framework outperforms the standard Direct Attribute Prediction baseline in a zero-shot learning scenario. Label embedding enjoys a built-in ability to leverage alternative sources of information instead of or in addition to attributes, such as e.g. class hierarchies or textual descriptions. Moreover, label embedding encompasses the whole range of learning settings from zero-shot learning to regular learning with a large number of labeled examples.

MinHash token filter | Elasticsearch Reference (saved 2020-02-14)
One solution to make similarity search more practical and computationally feasible involves hashing of documents, in a way that similar documents are more likely to produce the same hash code (locality sensitive hashing, LSH). Depending on what constitutes the similarity between documents, various LSH functions have been proposed. For Jaccard similarity, a popular LSH function is MinHash (see the sketch after this group of entries).

Ivan Maisky, Soviet diplomat: had the British and the French listened to him, the Second World War might perhaps have been avoided (saved 2020-02-26)

NLP Newsletter: The Annotated GPT-2, Understanding self-distillation, Haiku, GANILLA, Sparkwiki, Ethics in NLP, Torchmeta,… (saved 2020-02-24)

Distilling BERT models with spaCy - Towards Data Science (2019) (saved 2020-02-15)

Matching Resumes to Jobs via Deep Siamese Network | Companion Proceedings of the The Web Conference 2018 (saved 2020-02-10)
Siamese adaptation of a CNN, using a contrastive loss. The document embeddings of resumes and job descriptions (dim 200) are generated using [#Doc2Vec](/tag/doc2vec.html) and are given as inputs to the network (see the contrastive-loss sketch after this group of entries).

Siamese CNN for job–candidate matching (slides) (saved 2020-02-10)

A new model and dataset for long-range memory | DeepMind (saved 2020-02-11)
On the use of memory in deep learning, and how modelling language may be an ideal task for developing better memory architectures. [paper](/doc/2020/02/_1911_05507_compressive_transf)
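To illustrate the MinHash entry above (Elasticsearch's min_hash token filter does this internally): a small pure-Python MinHash, where two documents agree on each signature coordinate with probability equal to their Jaccard similarity. The choice of blake2b with a per-coordinate salt is just a convenient way to get a family of hash functions.

```python
import hashlib

def minhash_signature(tokens, num_hashes=64):
    """For each of num_hashes hash functions, keep the minimum hash value
    over the document's token set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "little")).digest(),
                "little")
            for t in set(tokens)))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # fraction of coordinates on which the two signatures agree
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox leaps over a lazy dog".split()
print(estimated_jaccard(minhash_signature(doc1), minhash_signature(doc2)))
```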
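And for the resume/job-matching entry above: a toy Siamese network with a contrastive loss over 200-dim Doc2Vec document vectors. The original work uses a Siamese CNN; an MLP encoder is used here purely to keep the sketch short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseHead(nn.Module):
    """The same encoder maps both documents (resume and job description);
    a contrastive loss pulls matching pairs together and pushes others apart."""
    def __init__(self, in_dim=200, out_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, out_dim))

    def forward(self, a, b):
        return self.encoder(a), self.encoder(b)

def contrastive_loss(za, zb, match, margin=1.0):
    """match = 1 for (resume, job) pairs that belong together, 0 otherwise."""
    d = F.pairwise_distance(za, zb)
    return (match * d.pow(2) + (1 - match) * F.relu(margin - d).pow(2)).mean()

model = SiameseHead()
resumes, jobs = torch.randn(16, 200), torch.randn(16, 200)   # stand-ins for Doc2Vec vectors
match = torch.randint(0, 2, (16,)).float()
loss = contrastive_loss(*model(resumes, jobs), match)
loss.backward()
```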
Compressive Transformers for Long-Range Sequence Modelling
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap · arXiv 1911.05507 (submitted 2019-11-13; saved 2020-02-11)
We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.
> the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning.
[Blog post](/doc/2020/02/a_new_model_and_dataset_for_lon)
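A toy illustration of the memory-compression idea behind the Compressive Transformer: instead of discarding activations that fall out of the attention window, older states are squashed into fewer slots. Average pooling stands in for the learned compression function of the paper, so this is only a cartoon of the mechanism.

```python
import torch
import torch.nn.functional as F

def compress_memory(memory, compression_rate=3):
    """memory: (T, D) past hidden states, oldest first.  Every
    `compression_rate` old states are pooled into one compressed slot;
    anything that doesn't fill a full group is kept as-is."""
    t = (memory.size(0) // compression_rate) * compression_rate
    old, recent = memory[:t], memory[t:]
    compressed = F.avg_pool1d(old.t().unsqueeze(0), kernel_size=compression_rate)
    return compressed.squeeze(0).t(), recent

past = torch.randn(12, 16)                       # 12 old states of width 16
compressed, leftover = compress_memory(past, 3)  # -> 4 compressed slots
print(compressed.shape, leftover.shape)          # torch.Size([4, 16]) torch.Size([0, 16])
```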