Word Meaning and Similarity (Stanford University): thesaurus-based meaning, distributional models of meaning
Term-context matrix vs. term-document matrix. For the term-document matrix, use tf-idf instead of raw term counts; for the term-context matrix, use Positive Pointwise Mutual Information (PPMI: do words x and y co-occur more often than they would if they were independent?)
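A minimal sketch of the PPMI weighting over a co-occurrence matrix (the counts here are made up for illustration): compute joint and marginal probabilities from raw counts, take log2 of their ratio, and clip negatives to zero.

```python
import numpy as np

# Toy word-word co-occurrence counts (rows/columns: a hypothetical 3-word vocabulary).
counts = np.array([
    [0., 2., 1.],
    [2., 0., 4.],
    [1., 4., 0.],
])

total = counts.sum()
p_xy = counts / total                      # joint probabilities P(x, y)
p_x = p_xy.sum(axis=1, keepdims=True)      # row marginals P(x)
p_y = p_xy.sum(axis=0, keepdims=True)      # column marginals P(y)

with np.errstate(divide="ignore"):
    pmi = np.log2(p_xy / (p_x * p_y))      # PMI; -inf where the count is 0

ppmi = np.maximum(pmi, 0.0)                # PPMI: clip negatives (and -inf) to 0
```

Negative PMI values are clipped because co-occurring *less* than chance is unreliable evidence with sparse counts.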
Gensim tutorial, "Similarity Queries": "The thing to note here is that documents no. 2 would never be returned by a standard boolean fulltext search, because they do not share any common words with query string"
Short Text Similarity with Word Embeddings: "We investigate whether determining short text similarity is possible using only semantic features. [...] A novel feature of our approach is that an arbitrary number of word embedding sets can be incorporated."
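The paper's full model is supervised, but the common baseline it builds on can be sketched simply: represent each short text as the mean of its word vectors, then compare texts by cosine similarity. The embeddings below are made-up toy vectors; in practice they would come from a pretrained model such as word2vec.

```python
import numpy as np

# Toy embeddings (assumption: in real use these come from a pretrained
# embedding model; the numbers here are invented for illustration).
emb = {
    "cat": np.array([1.0, 0.1, 0.0]),
    "dog": np.array([0.9, 0.2, 0.0]),
    "car": np.array([0.0, 0.1, 1.0]),
}

def text_vector(tokens):
    """Represent a short text as the mean of its word vectors."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_related = cosine(text_vector(["cat"]), text_vector(["dog"]))
sim_unrelated = cosine(text_vector(["cat"]), text_vector(["car"]))
```

Out-of-vocabulary words are simply skipped here; real pipelines need a policy for them (ignore, or back off to a subword model).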
Quick review on Text Clustering and Text Similarity Approaches, by Maali Mnasri (PhD @ CEA)
First transform text units to vectors? Not always (e.g. a sentence-similarity task using lexical word alignment), but vectors are efficient to process and benefit from existing clustering algorithms such as k-means.
Sentence level or document level? Sentence clustering can be used to summarise large documents.
Thematic clustering vs Semantic clustering: depends on the similarity measure.
Text similarity measures:
- Cosine similarity of tf-idf vectors (suitable for producing thematic clusters)
- Knowledge-based measures (WordNet), which quantify the semantic relatedness of words
- Word embeddings
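The first measure in the list can be sketched without any library: build a {term: tf-idf} mapping per document and compare documents with cosine similarity (the three-document corpus is a toy example).

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "stock markets fell sharply today".split(),
]

def tfidf_vector(doc, corpus):
    """Map a tokenised document to a sparse {term: tf-idf} dict (toy sketch)."""
    tf = Counter(doc)
    n_docs = len(corpus)
    return {
        t: tf[t] * math.log(n_docs / sum(1 for d in corpus if t in d))
        for t in tf
    }

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if norm(u) and norm(v) else 0.0

vecs = [tfidf_vector(d, corpus) for d in corpus]
```

Note how the two sentences sharing surface words score above zero while the thematically distant third scores zero: tf-idf cosine captures lexical overlap, which is why it produces thematic rather than semantic clusters.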
Examples, sample code:
- using WordNet with NLTK, and the formula to compute sentence similarities from word similarities
- computing similarities between docs using gensim/word2vec
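The word-to-sentence similarity formula mentioned above is commonly realised as: for each word of sentence A, take its best similarity to any word of sentence B, average these maxima, and symmetrise. The sketch below substitutes a hand-made word-similarity table for WordNet so it stays self-contained; real code would score word pairs with NLTK synset similarities (e.g. path similarity).

```python
# Toy word-level similarities standing in for WordNet-based scores
# (assumption: values and pairs here are invented for illustration).
WORD_SIM = {("cat", "dog"): 0.6}

def word_sim(w1, w2):
    if w1 == w2:
        return 1.0
    return WORD_SIM.get((w1, w2), WORD_SIM.get((w2, w1), 0.0))

def sentence_sim(s1, s2):
    """Average best-match word similarity from s1 to s2, symmetrised."""
    def directed(a, b):
        return sum(max(word_sim(w, v) for v in b) for w in a) / len(a)
    return (directed(s1, s2) + directed(s2, s1)) / 2

related = sentence_sim(["cat", "sat"], ["dog", "sat"])
unrelated = sentence_sim(["cat", "sat"], ["stocks", "fell"])
```

Taking the max per word means one good alignment is enough; averaging both directions keeps the measure symmetric even when the sentences differ in length.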
Which clustering algorithm?
- when we have an approximate number of clusters, and the similarity measure is not expensive to compute, partitional algorithms are suitable and fast. Sample code: k-means clustering on tf-idf vectors with scikit-learn
- Hierarchical clustering algorithms
- don't need to give the number of clusters
  - but time-consuming (a full similarity matrix over the sentences must be computed)
- for voluminous data, use an incremental clustering algorithm: sentences are processed one at a time; each new sentence is compared to each of the already-formed clusters.
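The incremental scheme described in the last bullet can be sketched as a single pass over the sentence vectors: each vector joins the most similar existing cluster if its similarity to that cluster's centroid clears a threshold, and otherwise starts a new cluster (the threshold value is an illustrative assumption).

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def incremental_cluster(vectors, threshold=0.8):
    """One-pass clustering: assign each vector to the best cluster above
    `threshold`, or open a new cluster; centroids are running means."""
    clusters = []  # each cluster: {"members": [...], "centroid": [...]}
    for vec in vectors:
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"members": [list(vec)], "centroid": list(vec)})
        else:
            best["members"].append(list(vec))
            m = len(best["members"])
            best["centroid"] = [
                sum(v[i] for v in best["members"]) / m
                for i in range(len(vec))
            ]
    return clusters
```

Unlike k-means, this needs no cluster count up front and touches each sentence once, but the result depends on processing order and on the threshold.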