Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition (2004) (paper) (2017-07-11T16:58:42Z)
Lingo: an algorithm for clustering search results that emphasizes the quality of cluster descriptions. Implemented in Carrot2.

RDRPOSTagger: A Rule-based Part-of-Speech and Morphological Tagging Toolkit (2017-07-11T15:46:46Z)
An approach to automatically constructing tagging rules in the form of a binary tree. Python and Java.

An overview of word embeddings and their connection to distributional semantic models - AYLIEN (2016) (2017-07-20T15:43:09Z)
> While on the surface DSMs and word embedding models use varying algorithms to learn word representations – the former count, the latter predict – both types of model fundamentally act on the same underlying statistics of the data, i.e. the co-occurrence counts between words...
> These results are in contrast to the general consensus that word embeddings are superior to traditional methods. Rather, they indicate that it typically makes no difference whatsoever whether word embeddings or distributional methods are used. What really matters is that your hyperparameters are tuned and that you utilize the appropriate pre-processing and post-processing steps.

Good Morning England (The Boat That Rocked) (2017-07-16T23:09:44Z)

Some pre-trained word2vec models for French (2017-07-20T13:00:27Z)

Similarity module | Elasticsearch Reference (2017-07-19T14:38:13Z)

What is a simple but detailed explanation of Textrank? - Quora (2017-07-12T00:58:03Z)

Dive Into NLTK, Part V: Using Stanford Text Analysis Tools in Python – Text Mining Online (2017-07-11T18:16:16Z)
**Including how to use Java NLP tools in Python.** [In case of problems](https://gist.github.com/alvations/e1df0ba227e542955a8a)

```
# point NLTK at the Stanford tagger jar and its models directory
export CLASSPATH=/Users/fps/_fps/DeveloperTools/stanford-postagger-full/stanford-postagger.jar # NB: stanford-postagger.jar, not stanford-postagger-3.8.0.jar
export STANFORD_MODELS=/Users/fps/_fps/DeveloperTools/stanford-postagger-full/models
python
```

```
from nltk.tag import StanfordPOSTagger

# English model
st = StanfordPOSTagger('english-bidirectional-distsim.tagger')
st.tag('What is the airspeed of an unladen swallow ?'.split())

# French model
st = StanfordPOSTagger('french.tagger')
st.tag('Les plats servis sont toujours les mêmes et la qualité des plats est en nette baisse'.split())
```

Output:

```
[('Les', 'DET'), ('plats', 'NOUN'), ('servis', 'ADJ'), ('sont', 'VERB'), ('toujours', 'ADV'), ('les', 'DET'), ('mêmes', 'ADJ'), ('et', 'CONJ'), ('la', 'DET'), ('qualité', 'NOUN'), ('des', 'DET'), ('plats', 'NOUN'), ('est', 'VERB'), ('en', 'ADP'), ('nette', 'ADJ'), ('baisse', 'NOUN')]
```

Search-As-You-Type with Solr (2017-07-11T03:39:08Z)

Résumé Automatique Multi-Document Dynamique : État de l'Art (2015) (dynamic multi-document automatic summarization: state of the art) (2017-07-17T00:17:43Z)

Watson: Alchemy Language v1 API Explorer (2017-07-18T18:04:05Z)
The AlchemyLanguage API uses natural language processing technology and machine learning algorithms to extract semantic meta-data from content, such as information on people, places, companies, topics, facts, relationships, authors, and languages.

NLTK: Installing Third Party Software · nltk Wiki (2017-07-11T18:14:58Z)

Développement web : concilier sûreté et flexibilité avec le typage graduel (web development: reconciling safety and flexibility with gradual typing) (2017-07-05T00:13:56Z)

What are all possible pos tags of NLTK? - Stack Overflow (2017-07-11T14:50:14Z)

Penn Treebank P.O.S. Tags (2017-07-11T14:48:26Z)
Alphabetical list of part-of-speech tags used in the Penn Treebank Project.
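Apropos of the two POS-tag entries just above: NLTK can print the Penn Treebank tagset itself. A minimal sketch, assuming NLTK is installed and can download the `tagsets` resource:

```
# List the Penn Treebank POS tags that NLTK documents, with examples.
import nltk

nltk.download('tagsets')        # one-off: fetch the tagset documentation
nltk.help.upenn_tagset()        # print every Penn Treebank tag with definition and examples
nltk.help.upenn_tagset('JJ.*')  # regex filter: only the adjective tags (JJ, JJR, JJS)
```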
Dans la Silicon Valley, « l'Oracle » français de la complexité | tech and berries (in Silicon Valley, the French "Oracle" of complexity) (2017-07-05T18:36:07Z)

Document Similarity Analysis Using ElasticSearch and Python - Data Science Central (2017-07-19T14:23:50Z)

Eugène de Savoie (2017-07-12T00:05:30Z)

How to spot first stories on Twitter using Storm | Michael Vogiatzis (2017-07-26T13:28:53Z)

gensim: Similarity Queries using Annoy (Tutorial) (2017-07-10T19:15:18Z)
Using the Annoy (Approximate Nearest Neighbors Oh Yeah) library for similarity queries with a Word2Vec model built with gensim.

Cortical.io - Fast, precise, intuitive NLP (2017-07-10T14:57:06Z)
"Semantic fingerprint" representation of words.

IBM SPSS Text Analytics for Surveys (2017-07-13T10:38:21Z)

Brussels attacks Liam Fox's 'ignorant' remarks on chlorinated chicken | Politics | The Guardian (2017-07-30T01:22:47Z)

Ampoules de Lorenzini — Wikipédia (ampullae of Lorenzini) (2017-07-21T02:03:54Z)

From Frequency to Meaning: Vector Space Models of Semantics (2010) (2017-07-10T15:18:19Z)
Good survey of VSMs, of their three classes (based either on term-document, word-context, or pair-pattern matrices), and of their applications. A detailed look at a specific open-source project in each category.

How to get started with GCP | Google Cloud Platform (2017-07-12T16:52:19Z)

Word Embeddings and Their Challenges - AYLIEN (2017-07-20T15:49:59Z)
Perhaps the biggest problem with word2vec is the inability to handle unknown or out-of-vocabulary (OOV) words.

Beyond Cosine Similarity - Algorithms for Big Data (2017-07-21T11:44:52Z)

A brief overview of query/sentence similarity functions | searchivarius.org (2017-07-21T12:47:02Z)

Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models | Blog | Explosion AI (2016-11; bookmarked 2017-07-20T00:12:06Z)
> A four-step strategy for deep learning with text
> Word embeddings let you treat individual words as related units of meaning, rather than entirely distinct IDs. However, most NLP problems require understanding of longer spans of text, not just individual words. There's now a simple and flexible solution that is achieving excellent performance on a wide range of problems. After embedding the text into a sequence of vectors, bidirectional RNNs are used to encode the vectors into a sentence matrix. The rows of this matrix can be understood as token vectors — they are sensitive to the sentential context of the token. The final piece of the puzzle is called an attention mechanism. This lets you reduce the sentence matrix down to a sentence vector, ready for prediction.

Teaching a Computer to Read - Scripted (2017-07-10T18:32:29Z)

Sparse Distributed Memory: Principles and Operation (2017-07-25T16:00:35Z)

Text Summarizer - Text Summarization Online (2017-07-07T17:11:37Z)

Indexing by Latent Semantic Analysis - Deerwester et al. (1990) (2017-07-18T15:46:17Z)
The seminal LSI article. Cited more than 12,000 times.
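To make the Deerwester et al. idea concrete, a minimal LSA/LSI sketch using scikit-learn (the toy corpus and k=2 are my own illustrative choices, not from the paper): a tf-idf term-document matrix is reduced by truncated SVD, and documents are then compared in the latent space.

```
# Minimal LSA/LSI sketch: tf-idf matrix -> rank-k truncated SVD ->
# cosine similarity in the reduced "latent semantic" space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "relation of user perceived response time to error measurement",
    "the generation of random binary unordered trees",
    "the intersection graph of paths in trees",
]

X = TfidfVectorizer().fit_transform(docs)          # documents x terms
Z = TruncatedSVD(n_components=2).fit_transform(X)  # documents x k latent dimensions

# Documents can come out similar in the latent space even with few shared terms.
print(cosine_similarity(Z[:1], Z)[0])
```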
Baidu's self-driving tech plans revealed | Robohub (2017-07-10T11:34:38Z)

Latent semantic indexing ("Introduction to Information Retrieval", Manning 2008) (2017-07-19T09:54:04Z)
VSM: problem with synonymy and polysemy (e.g. synonyms are accorded separate dimensions). Could we use the co-occurrences of terms to capture the latent semantic associations of terms and alleviate these problems?
Concluding remarks:
- The computational cost of the SVD is significant: the biggest obstacle to the widespread adoption of LSI.
- One approach to this obstacle: build the LSI representation on a randomly sampled subset of the documents, following which the remaining documents are "folded in" (cf. the Gensim tutorial "[Random Projection (used as an option to speed up LSI)](https://radimrehurek.com/gensim/models/rpmodel.html)").
- As we reduce k, recall tends to increase, as expected.
- **Most surprisingly**, a value of k in the low hundreds can actually increase precision. **This appears to suggest that for a suitable value of *k*, LSI addresses some of the challenges of synonymy**.
- LSI works best in applications where there is little overlap between queries and documents. (??)
The experiments also documented some modes where LSI failed to match the effectiveness of more traditional indexes and score computations. LSI shares two basic drawbacks of vector space retrieval:
- no good way of expressing negations
- no way of enforcing Boolean conditions.
LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster, and the value that a document has on that dimension as its fractional membership in that cluster.

Dealing with Human Language | Elasticsearch: The Definitive Guide [master] (2017-07-18T14:49:17Z)

Online Generation of Locality Sensitive Hash Signatures (2017-07-26T13:45:31Z)

Divergence From Randomness (DFR) Framework (2017-07-21T12:34:32Z)

What happened to the Semantic Web? (2017-07-05T23:14:19Z)

[1405.4053] Distributed Representations of Sentences and Documents, Quoc V. Le, Tomas Mikolov (arXiv 1405.4053, 2014-05; bookmarked 2017-07-10T16:20:03Z)
Abstract: Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
Paragraph Vector: an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents. It represents each document by a dense vector which is trained to predict words in the document, overcoming the weaknesses of the [Bag Of Words](/tag/bag_of_words) model (word order, word semantics).

Can I use word2vec to train a machine learning classifier? - Quora (2017-07-20T13:42:49Z)
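A sketch of one common answer to that Quora question: average a document's word vectors and feed the result to any standard classifier. Toy data and parameters are mine; assumes gensim 4.x and scikit-learn.

```
# word2vec vectors as classifier features: represent a document by the
# mean of its word vectors, then train a linear classifier on top.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

texts = [["good", "food"], ["bad", "service"], ["great", "dishes"], ["poor", "quality"]]
labels = [1, 0, 1, 0]

w2v = Word2Vec(texts, vector_size=10, min_count=1, seed=0)

def doc_vector(tokens):
    """Average the vectors of the tokens the model knows (word order is lost)."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([doc_vector(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```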
La bactérie tueuse d'oliviers progresse en Europe - Le Temps (the olive-tree-killing bacterium is advancing in Europe) (2017-07-09T10:32:29Z)

[1510.00726] A Primer on Neural Network Models for Natural Language Processing, Yoav Goldberg (arXiv 1510.00726, 2015-10-02; bookmarked 2017-07-20T13:22:06Z)
Abstract: Over the past few years, neural networks have re-emerged as powerful machine-learning models, yielding state-of-the-art results in fields such as image recognition and speech processing. More recently, neural network models started to be applied also to textual natural language signals, again with very promising results. This tutorial surveys neural network models from the perspective of natural language processing research, in an attempt to bring natural-language researchers up to speed with the neural techniques. The tutorial covers input encoding for natural language tasks, feed-forward networks, convolutional networks, recurrent networks and recursive networks, as well as the computation graph abstraction for automatic gradient computation.

How to understand Locality Sensitive Hashing? - Stack Overflow (2017-07-26T01:35:31Z)

Pourquoi les Canadiens se moquent éperdument des 150 ans du Canada (why Canadians couldn't care less about Canada's 150th anniversary) (2017-07-01T20:01:31Z)
Patriotism, a thing for losers.

gensim: models.phrases – Phrase (collocation) detection (2017-07-10T19:05:37Z)
Automatically detect common phrases – aka multi-word expressions, word n-gram collocations – from a stream of sentences. [See also](http://www.markhneedham.com/blog/2015/02/12/pythongensim-creating-bigrams-over-how-i-met-your-mother-transcripts/#disqus_thread). A short usage sketch appears after the TextRank example below.

Word Meaning and Similarity - Stanford University (2017-07-20T00:09:07Z)
Thesaurus-based meaning vs. distributional models of meaning. Term-document matrix and term-context matrix: for the term-document matrix, use tf-idf instead of raw term counts; for the term-context matrix, use Positive Pointwise Mutual Information (PPMI: do words x and y co-occur more than they would if they were independent?).

Socrate, ennemi de la démocratie ? (Socrates, an enemy of democracy?) (2017-07-03T00:03:53Z)

Can I use word2vec representation to train a weka classifier? - Quora (2017-07-20T13:45:20Z)

Silicon Valley's push for universal basic income is — surprise! — totally self-serving - LA Times (2017-07-22T02:36:00Z)

Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition (slides) (2017-07-11T17:13:55Z)
Two independent phases in the process:
- cluster label candidate discovery (based on phrase discovery — phrases are usually good label indicators)
- cluster discovery (based on SVD)
Lingo: description comes first.

Distributed Semantics & Embeddings, Yejin Choi - University of Washington [slides adapted from Dan Jurafsky] (2017-07-10T13:22:28Z)

La sixième extinction de masse des animaux s'accélère (the sixth mass extinction of animals is accelerating) (2017-07-10T22:20:40Z)

Source code for nltk.tag.stanford — NLTK documentation (2017-07-11T16:13:00Z)

How does Textrank work? (slides) (2017-07-12T00:48:39Z)
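For the TextRank entries above, a minimal extractive sketch of my own (using networkx for PageRank; real implementations use better sentence-similarity measures than raw word overlap):

```
# TextRank-style sentence ranking: sentences are graph nodes, edges are
# weighted by word overlap, and PageRank picks the most central sentence.
import networkx as nx

sentences = [
    "TextRank builds a graph whose nodes are sentences.",
    "Edges connect sentences that share words.",
    "PageRank then ranks sentences by centrality in the graph.",
    "The top-ranked sentences form the summary.",
]

def overlap(s1, s2):
    """Jaccard overlap between the word sets of two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

g = nx.Graph()
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        w = overlap(sentences[i], sentences[j])
        if w > 0:
            g.add_edge(i, j, weight=w)

scores = nx.pagerank(g, weight="weight")
best = max(scores, key=scores.get)
print(sentences[best])
```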
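And for the gensim models.phrases entry above, a small usage sketch (toy sentences and a deliberately permissive threshold of my choosing; assumes a gensim version where `Phraser` is the frozen, lightweight form of a trained `Phrases` model):

```
# Detect frequent bigram collocations ("new york" -> "new_york") from a
# stream of tokenized sentences with gensim's Phrases model.
from gensim.models.phrases import Phrases, Phraser

sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "never", "sleeps"],
]

bigram = Phrases(sentences, min_count=1, threshold=0.1)  # toy thresholds
phraser = Phraser(bigram)  # frozen, faster version for pure transformation

print(phraser[["i", "visited", "new", "york"]])
# -> ['i', 'visited', 'new_york']
```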
Gensim tutorial: Similarity Queries (2017-07-19T14:54:26Z)
> "The thing to note here is that documents no. 2 would never be returned by a standard boolean fulltext search, because they do not share any common words with query string"

Guerre du Péloponnèse — Wikipédia (Peloponnesian War) (2017-07-02T13:14:40Z)
> the Lacedaemonians declared that they would not reduce to servitude a Greek city that had rendered a great service to Greece when it was threatened with the greatest dangers (Xenophon)

TreeTagger - a part-of-speech tagger for many languages (2017-07-11T15:44:58Z)

More Fun With Word Vectors - Bag of Words Meets Bags of Popcorn | Kaggle (2017-07-20T14:56:22Z)
> We found that the code above gives about the same (or slightly worse) results compared to the Bag of Words

Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification - ScienceDirect (2017-07-21T01:36:21Z)

Finding Similar Items (2017-07-26T13:41:20Z)
**Jaccard similarity**: similarity of sets, based on the relative size of their intersection -> **finding textually similar documents in a large corpus, near duplicates**. [Collaborative Filtering](/tag/collaborative_filtering) as a similar-sets problem (cf. online purchases, movie ratings).
**Shingling** turns the problem of textual similarity of documents into a problem of similarity of sets. A k-shingle is a substring of length k found within a document (k = 5 works for emails). Hashing shingles. Shingles can also be built from words (a stop word plus the two following words).
Similarity-preserving summaries of sets: shingle sets are large -> compress large sets into small representations ("signatures") that preserve similarity: **[Minhashing](/tag/minhash)** - related to Jaccard similarity (good explanation on [Wikipedia](https://en.wikipedia.org/wiki/MinHash)). A minimal MinHash sketch appears at the end of this section.
It still may be impossible to find the pairs of documents with greatest similarity efficiently -> **[Locality-Sensitive Hashing](/tag/locality_sensitive_hashing)** for documents. Covers distance measures, the theory of locality-sensitive functions, and LSH families for other distance measures. Applications of Locality-Sensitive Hashing:
- entity resolution
- matching fingerprints
- matching newspaper articles
Methods for high degrees of similarity: LSH-based methods are most effective when the degree of similarity we accept is relatively low. When we want to find sets that are almost identical, other methods can be faster.

Effective measures for inter-document similarity (2017-07-21T12:45:10Z)

nltk.tag.stanford module — NLTK documentation (2017-07-11T15:43:03Z)
A module for interfacing with the Stanford taggers.

Stanford Log-linear Part-Of-Speech Tagger (2017-07-11T15:25:58Z)

Hierarchical clustering in Python and beyond (2017-07-11T10:07:47Z)

Intégration de la similarité entre phrases comme critère pour le résumé multi-document (2016) (integrating sentence similarity as a criterion for multi-document summarization) (2017-07-17T00:21:08Z)

Representation learning for very short texts using weighted word embedding aggregation (2017-07-21T01:49:18Z)
Hmm, already [bookmarked on arXiv](https://arxiv.org/abs/1607.00570).
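As promised under "Finding Similar Items": a minimal MinHash sketch of my own (it salts Python's built-in hash with random seeds rather than using a proper hash family, so it is illustrative only). The fraction of matching signature positions estimates the Jaccard similarity of the two shingle sets.

```
# Estimate Jaccard similarity of two documents' k-shingle sets from
# small MinHash signatures instead of from the full sets.
import random

def shingles(text, k=5):
    """All character k-shingles of a document, as a set."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def signature(s, seeds):
    """One min-hash per seed; two sets agree at a given position with
    probability equal to their Jaccard similarity."""
    return [min(hash((seed, sh)) for sh in s) for seed in seeds]

def estimate(sig1, sig2):
    """Fraction of agreeing positions = estimated Jaccard similarity."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

random.seed(0)
seeds = [random.random() for _ in range(256)]

d1 = shingles("the quick brown fox jumps over the lazy dog")
d2 = shingles("the quick brown fox jumped over a lazy dog")

true_jaccard = len(d1 & d2) / len(d1 | d2)
est_jaccard = estimate(signature(d1, seeds), signature(d2, seeds))
print(f"true: {true_jaccard:.2f}  minhash estimate: {est_jaccard:.2f}")
```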