21 documents bookmarked during the last month.
- Silicon Valley's push for universal basic income is — surprise! — totally self-serving - LA Times
- A brief overview of query/sentence similarity functions | searchivarius.org
- Effective measures for inter-document similarity
- Divergence From Randomness (DFR) Framework
- Beyond Cosine Similarity - Algorithms for Big Data
- Ampoules de Lorenzini — Wikipédia
- Representation learning for very short texts using weighted word embedding aggregation
hmm, already [bookmarked on arXiv](https://arxiv.org/abs/1607.00570)
- Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification - ScienceDirect
- Word Embeddings and Their Challenges - AYLIEN
Perhaps the biggest problem with word2vec is the inability to handle unknown or out-of-vocabulary (OOV) words.
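A quick way to see the OOV problem with gensim's `KeyedVectors` (the model file and the misspelled word below are just examples):

```python
from gensim.models import KeyedVectors

# Assumed: a pre-trained word2vec file such as the GoogleNews vectors is available locally.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

for word in ["king", "embeddding"]:     # second word is a deliberate misspelling
    if word in kv:                      # membership test avoids a KeyError on OOV words
        print(word, kv[word][:5])       # first 5 dimensions of the vector
    else:
        print(word, "-> out of vocabulary, word2vec has no vector for it")
```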
- An overview of word embeddings and their connection to distributional semantic models - AYLIEN
While on the surface DSMs and word embedding models use varying algorithms to learn word representations – the former count, the latter predict – both types of model fundamentally act on the same underlying statistics of the data, i.e. the co-occurrence counts between words...
These results are in contrast to the general consensus that word embeddings are superior to traditional methods
- More Fun With Word Vectors - Bag of Words Meets Bags of Popcorn | Kaggle
> We found that the code above gives about the same (or slightly worse) results compared to the Bag of Words
- Can I use word2vec representation to train a weka classifier? - Quora
- Can I use word2vec to train a machine learning classifier? - Quora
- A Primer on Neural Network Models for Natural Language Processing
- Some pre-trained word2vec models for French
- Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models | Blog | Explosion AI
> A four-step strategy for deep learning with text
> Word embeddings let you treat individual words as related units of meaning, rather than entirely distinct IDs. However, most NLP problems require understanding of longer spans of text, not just individual words. There's now a simple and flexible solution that is achieving excellent performance on a wide range of problems. After embedding the text into a sequence of vectors, bidirectional RNNs are used to encode the vectors into a sentence matrix. The rows of this matrix can be understood as token vectors — they are sensitive to the sentential context of the token. The final piece of the puzzle is called an attention mechanism. This lets you reduce the sentence matrix down to a sentence vector, ready for prediction.
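A rough PyTorch sketch of that four-step recipe (layer sizes and the simple learned attention scoring are my own illustrative choices, not the exact architecture from the post):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbedEncodeAttendPredict(nn.Module):
    """Sketch of the recipe: embed -> encode -> attend -> predict."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)                  # 1. embed tokens
        self.encode = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)                         # 2. encode into a sentence matrix
        self.attn_score = nn.Linear(2 * hidden_dim, 1)                    # 3. one attention score per token
        self.predict = nn.Linear(2 * hidden_dim, num_classes)             # 4. predict from the sentence vector

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        vectors = self.embed(token_ids)            # (batch, seq_len, embed_dim)
        matrix, _ = self.encode(vectors)           # rows are context-sensitive token vectors
        weights = F.softmax(self.attn_score(matrix), dim=1)   # (batch, seq_len, 1)
        sentence = (weights * matrix).sum(dim=1)   # reduce the sentence matrix to one vector
        return self.predict(sentence)              # class logits

logits = EmbedEncodeAttendPredict(vocab_size=1000)(torch.randint(0, 1000, (4, 12)))
```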
- Word Meaning and Similarity - Stanford University
thesaurus-based meaning, distributional models of meaning
Term-document matrix vs. term-context matrix: for the term-document matrix, use tf-idf instead of raw term counts; for the term-context matrix, use Positive Pointwise Mutual Information (PPMI: do words x and y co-occur more often than if they were independent?)
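As a reminder to self, PPMI(x, y) = max(0, log2(P(x, y) / (P(x) P(y)))). A tiny numpy sketch on a made-up co-occurrence matrix:

```python
import numpy as np

# Toy term-context co-occurrence counts (rows = target words, columns = context words).
counts = np.array([[2., 1., 0.],
                   [1., 3., 1.],
                   [0., 1., 2.]])

total = counts.sum()
p_xy = counts / total                              # joint probability P(x, y)
p_x = counts.sum(axis=1, keepdims=True) / total    # marginal P(x) over rows
p_y = counts.sum(axis=0, keepdims=True) / total    # marginal P(y) over columns

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_xy / (p_x * p_y))              # PMI: log2 of P(x, y) / (P(x) P(y))

ppmi = np.maximum(pmi, 0)                          # clamp negatives (and -inf from zero counts) to 0
ppmi[np.isnan(ppmi)] = 0                           # guard for all-zero rows/columns
print(ppmi)
```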
- Gensim tutorial: Similarity Queries
> "The thing to note here is that documents no. 2 would never be returned by a standard boolean fulltext search, because they do not share any common words with query string"
- Similarity module | Elasticsearch Reference
- Document Similarity Analysis Using ElasticSearch and Python - Data Science Central
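A sketch of a "find similar documents" query with the `elasticsearch` Python client (index name, fields, and query text are hypothetical; the keyword-argument style assumes a recent 8.x client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# more_like_this finds documents whose terms overlap with the given text.
query = {
    "more_like_this": {
        "fields": ["title", "body"],
        "like": "latent semantic indexing for document similarity",
        "min_term_freq": 1,
        "min_doc_freq": 1,
    }
}

response = es.search(index="articles", query=query, size=5)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```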
- Latent semantic indexing ("Introduction to Information Retrieval" Manning 2008)
VSM: problems with synonymy and polysemy (e.g. synonyms are accorded separate dimensions)
Could we use the co-occurrences of terms to capture the latent semantic associations of terms and alleviate these problems?
- the computational cost of the SVD is significant; this has been the biggest obstacle to the widespread adoption of LSI.
- One approach to this obstacle: build the LSI representation on a randomly sampled subset of the documents, following which the remaining documents are "folded in" (cf. Gensim tutorial "[Random Projection (used as an option to speed up LSI)](https://radimrehurek.com/gensim/models/rpmodel.html)")
- As we reduce k, recall tends to increase, as expected.
- **Most surprisingly**, a value of k in the low hundreds can actually increase precision. **This appears to suggest that for a suitable value of *k*, LSI addresses some of the challenges of synonymy**.
- LSI works best in applications where there is little overlap between queries and documents. (--??)
The experiments also documented some modes where LSI failed to match the effectiveness of more traditional indexes and score computations.
LSI shares two basic drawbacks of vector space retrieval:
- no good way of expressing negations
- no way of enforcing Boolean conditions.
LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.
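To make that soft-clustering reading concrete, a small sklearn sketch (toy documents, arbitrary k = 2):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy documents; the corpus and k = 2 are arbitrary illustration choices.
docs = ["the cat sat on the mat",
        "a dog chased the cat",
        "stock markets fell sharply",
        "investors sold shares as markets dropped"]

tfidf = TfidfVectorizer().fit_transform(docs)       # tf-idf weighted term-document matrix
lsi = TruncatedSVD(n_components=2, random_state=0)  # SVD-based reduction to k = 2 latent dimensions
doc_topics = lsi.fit_transform(tfidf)

# Each column can be read as a "soft cluster"; the value is the document's
# (signed) loading on that latent dimension, i.e. its fractional membership.
for doc, loadings in zip(docs, doc_topics):
    print(f"{loadings.round(2)}  {doc}")
```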