Semanlink - Locality Sensitive Hashing

> We show how to learn a deep graphical model of the word-count vectors obtained from a
large set of documents. The values of the latent variables in the deepest layer are easy to
infer and give a much better representation of each document than Latent Semantic Analysis.
When the deepest layer is forced to use a small number of binary variables (e.g. 32),
the graphical model performs ‘‘semantic hashing”: Documents are mapped to memory
addresses in such a way that semantically similar documents are located at nearby
addresses. Documents similar to a query document can then be found by simply accessing
all the addresses that differ by only a few bits from the address of the query document. This
way of extending the efficiency of hash-coding to approximate matching is much faster
than locality sensitive hashing, which is the fastest current method. By using semantic
hashing to filter the documents given to TF-IDF, we achieve higher accuracy than applying
TF-IDF to the entire document set.

Indexing is implemented in the following manner: a document is mapped to a word-count vector and then this vector is passed through a [#Restricted Boltzmann Machine](/tag/restricted_boltzmann_machine) autoencoder and encoded to 32-bit address.

2017-11-07 About

Online Generation of Locality Sensitive Hash Signatures

Tags:

2017-07-26 About

Finding Similar Items

Tags:

**Jaccard similarity**: similarity of sets, based on the relative size of their intersection -> **finding textually similar documents in a large corpus, near duplicates**. [Collaborative Filtering](/tag/collaborative_filtering) as a Similar-Sets Problem (cf. online purchases, movie ratings)

**Shingling** turns the problem of textual similarity of documents into a pb of similarity of sets

k-shingle: substring of length k found within a document. k: 5 for emails. Hashing shingles. Shingles built from words (stop word + 2 following words)

Similarity-Preserving Summaries of Sets: shingles sets are large -> compress large sets into small representations (“signatures”) that preserve similarity: **[Minhashing](/tag/minhash)** - related to Jaccard similarity (good explanation in [wikipedia](https://en.wikipedia.org/wiki/MinHash))

It still may be impossible to find the pairs of docs with greatest similarity efficiently -> **[Locality-Sensitive Hashing](/tag/locality_sensitive_hashing)** for Documents

Distance measures

Theory of Locality-Sensitive Functions

LSH famiies for other distance measures

Applications of Locality-Sensitive Hashing:

- entity resolution
- matching fingerprints
- matching newpapers articles

Methods for High Degrees of Similarity: LSH-based methods most effective when the degree of similarity we
accept is relatively low. When we want to find sets that are almost identical, other methods can be faster.

2017-07-26 About

How to spot first stories on Twitter using Storm | Michael Vogiatzis

Tags:

2017-07-26 About

How to understand Locality Sensitive Hashing? - Stack Overflow

Tags:

2017-07-26 About

Locality-sensitive hashing - Elasticsearch - Stack Overflow

Tags:

2017-04-27 About

Fuzzy-Fingerprints for Text-Based Information Retrieval

Tags: