Semanlink - Text Similarity

> Transferring the success of word embeddings to Information Retrieval (IR) task is currently an active research topic. While embedding-based retrieval models could tackle the vocabulary mismatch problem by making use of the embedding’s inherent similarity between distinct words, most of them struggle to compete with the prevalent strong baselines such as TF-IDF and BM25.

Considering a practical ad-hoc IR task composed of two steps, matching and scoring, the authors compare the performance of several techniques that use word embeddings in retrieval models to compute the similarity between the query and the documents: word centroid similarity, paragraph vectors, Word Mover’s Distance, and a novel inverse document frequency (IDF) re-weighted word centroid similarity.

> We confirm that word embeddings can be successfully employed in a practical information retrieval setting. The proposed cosine similarity of IDF re-weighted, aggregated word vectors is competitive to the TF-IDF baseline.
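To make the IDF re-weighted word centroid idea concrete, here is a minimal sketch in plain Python. The 2-d vectors in `EMBEDDINGS` are invented toy values, not trained embeddings, and the smoothed IDF formula is one common variant chosen for this sketch; the exact formulation used by the authors may differ. All function names are hypothetical.

```python
import math
from collections import Counter

# Toy 2-d "embeddings", invented for illustration only (not trained vectors).
EMBEDDINGS = {
    "cat": (1.0, 0.1),
    "dog": (0.9, 0.2),
    "car": (0.1, 1.0),
    "the": (0.5, 0.5),
}

def idf_weights(docs):
    """Smoothed IDF per word: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}

def idf_centroid(tokens, idf):
    """IDF-weighted sum of the word vectors of the tokens we have vectors for."""
    dim = len(next(iter(EMBEDDINGS.values())))
    centroid = [0.0] * dim
    for tok in tokens:
        if tok in EMBEDDINGS and tok in idf:
            for i, x in enumerate(EMBEDDINGS[tok]):
                centroid[i] += idf[tok] * x
    return centroid

def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["the", "cat"], ["the", "dog"], ["the", "car"]]
idf = idf_weights(docs)
query = ["cat"]
# Score every document against the query by cosine of IDF-weighted centroids.
scores = [cosine(idf_centroid(query, idf), idf_centroid(d, idf)) for d in docs]
print(scores)
```

Because "the" occurs in every document, its IDF weight is the lowest, so the rarer content words dominate each centroid. The query "cat" therefore ranks the "dog" document above the "car" document even though neither shares a word with the query.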

2018-01-28 About

A Comparison of Vector-based Representations for Semantic Composition (Blacoe and Lapata - 2012)

Tags:

2017-11-12 About

Finding Similar Items

Tags:

**Jaccard similarity**: similarity of sets, based on the relative size of their intersection -> **finding textually similar documents in a large corpus, near duplicates**. [Collaborative Filtering](/tag/collaborative_filtering) as a Similar-Sets Problem (cf. online purchases, movie ratings)
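As a small illustration, Jaccard similarity is the size of the intersection divided by the size of the union. The sketch below uses two hypothetical customers' purchase sets; treating two empty sets as fully similar is a convention chosen here, not something the notes specify.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(a & b) / len(a | b)

# Hypothetical purchase histories of two customers.
alice = {"book", "lamp", "kettle"}
bob = {"book", "kettle", "mug", "desk"}
print(jaccard(alice, bob))  # 2 shared items out of 5 distinct ones: 0.4
```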

**Shingling** turns the problem of textual similarity of documents into a problem of similarity of sets

k-shingle: a substring of length k found within a document (e.g. k = 5 for emails). Shingles can be hashed to shorter values. Shingles can also be built from words (a stop word plus the 2 following words).
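The three shingle variants mentioned above can be sketched as follows. The function names, the use of `zlib.crc32` as the hash, and the tiny `STOP_WORDS` list are all assumptions made for illustration.

```python
import zlib

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is"}

def char_shingles(text, k=5):
    """Set of all length-k substrings (k-shingles) of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def hashed_shingles(text, k=5):
    """Same shingles, each replaced by a 32-bit hash to save space."""
    return {zlib.crc32(s.encode("utf-8")) for s in char_shingles(text, k)}

def stopword_shingles(text):
    """Word shingles: each stop word together with the 2 words that follow it."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + 3])
        for i in range(len(words) - 2)
        if words[i] in STOP_WORDS
    }

print(char_shingles("abcdef", k=5))
print(stopword_shingles("The cat sat on the mat today"))
```

Once documents are sets of shingles, any set similarity (such as Jaccard) can be applied to them directly.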

Similarity-Preserving Summaries of Sets: shingle sets are large -> compress large sets into small representations (“signatures”) that preserve similarity: **[Minhashing](/tag/minhash)** - related to Jaccard similarity (good explanation in [Wikipedia](https://en.wikipedia.org/wiki/MinHash))
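A minimal sketch of minhash signatures, under stated assumptions: random permutations are approximated by linear hash functions `(a * h + b) mod PRIME`, a common practical stand-in, and `zlib.crc32` maps each item to an integer first. The names `make_hash_params`, `minhash_signature` and `estimate_jaccard` are hypothetical. The fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying sets.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime, larger than any 32-bit item hash

def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficients for num_hashes linear hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def minhash_signature(items, params):
    """For each hash function, keep the minimum hash value over the set.

    `items` must be a non-empty collection of strings.
    """
    hashed = [zlib.crc32(x.encode("utf-8")) for x in items]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in params]

def estimate_jaccard(sig1, sig2):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

params = make_hash_params(num_hashes=256)
doc_a = {f"s{i}" for i in range(200)}        # stand-ins for shingles
doc_b = {f"s{i}" for i in range(100, 300)}   # true Jaccard = 100 / 300
sig_a = minhash_signature(doc_a, params)
sig_b = minhash_signature(doc_b, params)
print(estimate_jaccard(sig_a, sig_b))
```

Each set of 200 items is reduced to 256 integers here; with real shingle sets of many thousands of items the saving is much larger, and more hash functions give a tighter estimate.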

It still may be impossible to find the pairs of docs with greatest similarity efficiently -> **[Locality-Sensitive Hashing](/tag/locality_sensitive_hashing)** for Documents
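One standard way to apply LSH to minhash signatures is the banding technique: split each signature into `bands` bands of `rows` values, and treat two documents as a candidate pair if they agree on every value of at least one band. The sketch below assumes signatures are plain lists of integers; the function names and the toy signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands, rows):
    """Return pairs of doc ids whose signatures are identical in at least one band.

    signatures: dict mapping doc id -> list of ints of length bands * rows.
    """
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # Documents with the same values in this band share a bucket.
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for ids in buckets.values():
            for pair in combinations(sorted(ids), 2):
                candidates.add(pair)
    return candidates

def candidate_probability(s, bands, rows):
    """Probability that two sets with Jaccard similarity s become candidates."""
    return 1.0 - (1.0 - s ** rows) ** bands

# Hand-made toy signatures: "a" and "b" agree on the first band only.
signatures = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
print(lsh_candidate_pairs(signatures, bands=2, rows=2))
print(candidate_probability(0.8, bands=20, rows=5))
```

Only candidate pairs need an exact similarity check, which avoids comparing every pair of documents. The choice of `bands` and `rows` sets the similarity level at which pairs start to be caught.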

Distance measures

Theory of Locality-Sensitive Functions

LSH families for other distance measures

Applications of Locality-Sensitive Hashing:

- entity resolution
- matching fingerprints
- matching newspaper articles

Methods for High Degrees of Similarity: LSH-based methods are most effective when the degree of similarity we accept is relatively low. When we want to find sets that are almost identical, other methods can be faster.

2017-07-26 About

A brief overview of query/sentence similarity functions | searchivarius.org

Tags:

Text Similarity

2017-07-21 About

Effective measures for inter-document similarity

Tags:

2017-07-21 About

Word Meaning and Similarity - Stanford University

Tags:

2017-07-20 About

Gensim tutorial: Similarity Queries

Tags:

2017-07-19 About

How to find semantic similarity between two documents? (researchgate)

Tags:

Text Similarity

2017-05-18 About

Quick review on Text Clustering and Text Similarity Approaches

Tags:

Author: Maali Mnasri (PhD @ CEA)

First transform text units to vectors? Not always (e.g. a sentence similarity task using lexical word alignment). But vectors are efficient to process, and benefit from existing clustering algorithms such as k-means.

Sentence level or document level? Sentence clustering to summarise large documents.

Thematic clustering vs Semantic clustering: depends on the similarity measure.

Text similarity measures:

- Cosine similarity of tf-idf (suitable to produce thematic clusters)
- Knowledge-based measures (WordNet) (quantify semantic relatedness of words)
- Word embeddings
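The first measure in the list, cosine similarity of tf-idf vectors, can be sketched in plain Python as below. This is not the article's sample code: the smoothed IDF variant, the sparse-dict representation and the function names are assumptions made for the sketch.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenised documents into sparse tf-idf vectors (dict word -> weight)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    # Smoothed IDF, an assumed variant: log((1 + N) / (1 + df)) + 1.
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
    return [{w: tf * idf[w] for w, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stock markets fell sharply today".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # two sentences on the same theme
print(cosine(vecs[0], vecs[2]))  # no shared words
```

The score depends only on shared words, which is why this measure tends to produce thematic rather than semantic clusters: "mat" and "rug" contribute nothing to the similarity of the first two sentences.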

Examples, sample code:

- using WordNet with NLTK, and the formula to compute sentence similarities from word similarities.
- computing similarities between docs using gensim/word2vec

Which clustering algorithm?

- When we have an approximate number of clusters and the similarity measure is cheap to compute, clustering algorithms such as k-means are suitable and fast. Sample code: k-means clustering using tf-idf vectors with scikit-learn.
- Hierarchical clustering algorithms
    - don't need to be given the number of clusters
    - but are time-consuming (they require computing a similarity matrix for the sentences)
- For voluminous data, use an incremental clustering algorithm: sentences are processed one at a time; each new sentence is compared to each of the already formed clusters.
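The incremental approach in the last bullet can be sketched as follows. The notes do not say which similarity measure or cluster representation to use, so this sketch assumes Jaccard similarity over word sets, represents each cluster by the union of its members' words, and uses an arbitrary threshold of 0.3. All names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def incremental_clusters(sentences, threshold=0.3):
    """Assign each sentence to the most similar existing cluster, or start a new one."""
    clusters = []  # each cluster: {"words": set of words, "members": list of sentences}
    for sentence in sentences:
        words = set(sentence.lower().split())
        best, best_sim = None, 0.0
        # Compare the new sentence to each already formed cluster.
        for cluster in clusters:
            sim = jaccard(words, cluster["words"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(sentence)
            best["words"] |= words
        else:
            clusters.append({"words": set(words), "members": [sentence]})
    return clusters

sentences = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "stock markets fell sharply",
    "stock markets fell again",
]
clusters = incremental_clusters(sentences)
for cluster in clusters:
    print(cluster["members"])
```

Each sentence is compared only with the existing clusters, never with every other sentence, so no full similarity matrix is needed. The result does depend on the order in which sentences arrive.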

2017-05-18 About

Fuzzy-Fingerprints for Text-Based Information Retrieval

Tags:

2013-05-31 About