Finding Similar Items(About) **Jaccard similarity**: similarity of sets, based on the relative size of their intersection -> finding textually similar documents in a large corpus, near duplicates. [Collaborative Filtering](/tag/collaborative_filtering) as a Similar-Sets Problem (cf. online purchases, movie ratings)
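A minimal sketch of Jaccard similarity on Python sets. The function name and the toy purchase histories are made up for illustration, and treating two empty sets as identical is a convention assumed here.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two iterables treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


# Toy "online purchases" for two customers (made-up data).
alice = {"book", "lamp", "pen", "mug"}
bob = {"book", "pen", "chair"}

# 2 shared items out of 5 distinct items overall.
print(jaccard_similarity(alice, bob))  # 0.4
```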
**Shingling** turns the problem of textual similarity of documents into a problem of similarity of sets
k-shingle: substring of length k found within a document. k: 5 for emails. Hashing shingles. Shingles built from words (stop word + 2 following words)
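The three shingle variants mentioned above can be sketched as follows. The function names, the use of CRC32 as the hash, and the tiny stop-word list are assumptions chosen for illustration only.

```python
import zlib


def char_shingles(text, k=5):
    """Set of all length-k substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def hashed_shingles(text, k=5):
    """Replace each shingle by a 32-bit integer (CRC32 is an arbitrary choice here)."""
    return {zlib.crc32(s.encode("utf-8")) for s in char_shingles(text, k)}


# Tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"a", "and", "the", "to", "of", "for", "is", "in"}


def stop_word_shingles(text):
    """Word shingles: a stop word followed by the next two words."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + 3])
        for i in range(len(words) - 2)
        if words[i] in STOP_WORDS
    }


print(sorted(char_shingles("abcdab", k=2)))
print(sorted(stop_word_shingles("The cat sat on the mat and purred")))
```

Hashing shingles trades a small risk of collisions for a compact integer representation, which is what the minhash step operates on.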
Similarity-Preserving Summaries of Sets: shingles sets are large -> compress large sets into small representations (“signatures”) that preserve similarity: **[Minhashing](/tag/minhash)** - related to Jaccard similarity (good explanation in [wikipedia](https://en.wikipedia.org/wiki/MinHash))
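A rough sketch of minhash signatures, assuming the sets contain non-negative integers below `PRIME` (for example hashed shingles reduced to that range). Random linear hash functions stand in for true random permutations, which is a common approximation rather than the exact scheme; all names and parameters are illustrative.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1; items are assumed to be smaller than this


def make_hash_params(n, seed=0):
    """n random (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(items, params):
    """Signature = minimum hash value of the set under each hash function."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimate_jaccard(sig1, sig2):
    """Fraction of positions where the two signatures agree."""
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)


# Two made-up sets of 80 items sharing 40 of them: true Jaccard = 40/120 = 1/3.
rng = random.Random(42)
pool = rng.sample(range(1_000_000), 120)
set_a, set_b = set(pool[:80]), set(pool[40:])

params = make_hash_params(200)
estimate = estimate_jaccard(
    minhash_signature(set_a, params), minhash_signature(set_b, params)
)
print(estimate)  # an estimate that should land near 1/3
```

The probability that two sets share the same minhash value under one random permutation equals their Jaccard similarity, so the agreement rate across many hash functions estimates it.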
It still may be impossible to find the pairs of docs with greatest similarity efficiently -> **[Locality-Sensitive Hashing](/tag/locality_sensitive_hashing)** for Documents
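One standard way to apply LSH to minhash signatures is the banding technique: split each signature into `bands` bands of `rows` values and treat two documents as a candidate pair if they agree on every value of at least one band. The sketch below assumes signatures are plain lists of integers; the function names and toy signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands, rows):
    """signatures: dict doc_id -> list of ints of length bands * rows."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[chunk].append(doc_id)
        # Documents sharing a bucket in this band become candidate pairs.
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates


def candidate_probability(s, bands, rows):
    """Chance that two docs with Jaccard similarity s become candidates."""
    return 1 - (1 - s ** rows) ** bands


# Made-up signatures: "a" and "b" agree on the first band only.
signatures = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 9, 9, 5, 7],
    "c": [8, 8, 8, 8, 8, 8],
}
print(lsh_candidate_pairs(signatures, bands=3, rows=2))  # {('a', 'b')}
```

Only candidate pairs need an exact similarity check afterwards, which avoids comparing every pair of documents. The choice of `bands` and `rows` sets the similarity threshold at which pairs are likely to become candidates.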
Theory of Locality-Sensitive Functions
LSH families for other distance measures
Applications of Locality-Sensitive Hashing:
- entity resolution
- matching fingerprints
- matching newspaper articles
Methods for High Degrees of Similarity: LSH-based methods most effective when the degree of similarity we
accept is relatively low. When we want to find sets that are almost identical, other methods can be faster.
Word Meaning and Similarity - Stanford University(About) thesaurus based meaning, Distributional models of meaning
Term-Context matrix. Term-document matrix: use tf-idf instead of raw term counts. For the term-context matrix, use Positive Pointwise Mutual Information (PPMI: do words x and y co-occur more than if they were independent?)
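As a sketch of the PPMI weighting, `PPMI(x, y) = max(0, log2(P(x, y) / (P(x) P(y))))`, the function below converts a small term-context count matrix into PPMI values. The function name and the toy counts are assumptions for illustration, and zero counts are mapped to 0 by convention.

```python
import math


def ppmi_matrix(counts):
    """counts: list of rows (one per word); columns are contexts."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out = []
        for j, c in enumerate(row):
            if c == 0:
                out.append(0.0)  # never co-occur: PMI is -inf, clipped to 0
                continue
            # P(x, y) / (P(x) * P(y)) simplifies to c * total / (row_sum * col_sum)
            pmi = math.log2((c * total) / (row_sums[i] * col_sums[j]))
            out.append(max(pmi, 0.0))
        result.append(out)
    return result


# Each word occurs with exactly one context: twice as often as independence predicts.
print(ppmi_matrix([[2, 0], [0, 2]]))  # [[1.0, 0.0], [0.0, 1.0]]
```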
Gensim tutorial: Similarity Queries(About) > "The thing to note here is that documents no. 2 would never be returned by a standard boolean fulltext search, because they do not share any common words with query string"
Short Text Similarity with Word Embeddings(About) We investigate whether determining short text similarity is possible
using only semantic features.
A novel feature of our
approach is that an arbitrary number of word embedding sets can be
Quick review on Text Clustering and Text Similarity Approaches(About) Author: Maali Mnasri (PhD @ CEA)
First transform text units to vectors? Not always (e.g. a sentence similarity task using lexical word alignment). But vectors are efficient to process, and benefit from existing clustering algorithms such as k-means.
Sentence level or document level? Sentence clustering to summarise large documents.
Thematic clustering vs Semantic clustering: depends on the similarity measure.
Text similarity measures:
- Cosine similarity of tf-idf (suitable to produce thematic clusters)
- Knowledge-based Measures (wordNet) (quantify semantic relatedness of words),
- Word embeddings
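The first measure in the list, cosine similarity of tf-idf vectors, can be sketched in plain Python as below. The tf-idf variant used (raw term frequency times `log(N / df)`) is one of several common weightings and is an assumption here, as are the function names and toy documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "cats chase mice".split(),
    "cats chase dogs".split(),
    "stocks fell sharply".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shares "cats" and "chase", so above zero
print(cosine(vecs[0], vecs[2]))  # no shared terms
```

Because this measure only sees shared vocabulary, it groups texts by topic, which matches the note that it suits thematic clusters.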
Examples, sample code:
- using wordnet with NLTK, and the formula to compute sentence similarities from word similarities.
- computing similarities between docs using gensim/word2vec
Which clustering algorithm?
- when we have an approximation of the number of clusters, and when the similarity measure is not expensive in terms of computation time, clustering algorithms such as k-means are suitable and fast. Sample code of k-means clustering using tf-idf vectors with scikit-learn
- Hierarchical clustering algorithms
  - don't need to give the number of clusters
  - but time consuming (calculate a similarity matrix for the sentences)
- for voluminous data, use an incremental clustering algorithm: sentences are processed one at a time; each new sentence is compared to each of the already formed clusters.
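The incremental approach in the last bullet can be sketched as a single pass over the sentences. Several choices below are assumptions rather than the author's method: clusters are represented by the union of their members' tokens, similarity is Jaccard on those token sets, and the `threshold` value is arbitrary.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def incremental_clustering(sentences, threshold=0.3):
    """Assign each sentence to its most similar cluster, or start a new one."""
    clusters = []  # each cluster: {"tokens": set of words, "members": list of sentences}
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        best, best_sim = None, 0.0
        # Compare the new sentence to every cluster formed so far.
        for cluster in clusters:
            sim = jaccard(tokens, cluster["tokens"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(sentence)
            best["tokens"] |= tokens
        else:
            clusters.append({"tokens": set(tokens), "members": [sentence]})
    return clusters


sentences = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
]
for cluster in incremental_clustering(sentences):
    print(cluster["members"])
```

Each sentence is compared only to the existing clusters, not to every other sentence, so no full similarity matrix is needed. The result does depend on the order in which sentences arrive.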