Locality-sensitive hashing (LSH) reduces the dimensionality of high-dimensional data. LSH hashes input items so that similar items map to the same “buckets” with high probability (the number of buckets being much smaller than the number of possible input items). LSH has much in common with data clustering and [#nearest neighbor search](nearest_neighbor_search).
LSH employs random linear projections (followed by random thresholding) to map data points that are close in a Euclidean space to similar codes.
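A minimal sketch of that construction, assuming random hyperplane projections with sign thresholding (the SimHash-style family for angular/cosine similarity); the dimension and bit count are illustrative choices:

```python
import numpy as np

def random_projection_hash(points, n_bits=16, seed=0):
    """Map d-dimensional points to n_bits-bit binary codes.

    Points separated by a small angle agree on most bits
    with high probability.
    """
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    planes = rng.standard_normal((d, n_bits))   # one random hyperplane per bit
    return (points @ planes > 0).astype(np.uint8)  # 1 iff on the positive side

# Two nearby points share most code bits; the opposite point shares almost none.
x = np.array([[1.0, 2.0, 3.0]])
codes = random_projection_hash(np.vstack([x, x + 0.01, -x]))
```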
See Also [#sparse distributed memory](/tag/sparse_distributed_memory), associative memory
Semantic hashing (2008) - Ruslan Salakhutdinov, Geoffrey Hinton (About)
> We show how to learn a deep graphical model of the word-count vectors obtained from a large set of documents. The values of the latent variables in the deepest layer are easy to infer and give a much better representation of each document than Latent Semantic Analysis. When the deepest layer is forced to use a small number of binary variables (e.g. 32), the graphical model performs “semantic hashing”: Documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document. This way of extending the efficiency of hash-coding to approximate matching is much faster than locality sensitive hashing, which is the fastest current method. By using semantic hashing to filter the documents given to TF-IDF, we achieve higher accuracy than applying TF-IDF to the entire document set.
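The address probing the abstract describes can be sketched as enumerating every code within a small Hamming radius of the query's code; the 32-bit width matches the abstract's example, while the radius and the `table` dict are illustrative assumptions:

```python
from itertools import combinations

def hamming_ball(address, n_bits=32, radius=2):
    """Yield every n_bits-wide address within `radius` bit flips of `address`."""
    yield address
    for r in range(1, radius + 1):
        for bits in combinations(range(n_bits), r):
            flipped = address
            for b in bits:
                flipped ^= 1 << b
            yield flipped

# Probing a hash table keyed by binary codes (hypothetical `table`):
# candidates = [doc for a in hamming_ball(code) for doc in table.get(a, [])]
```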
Finding Similar Items (About) **Jaccard similarity**: similarity of sets, based on the relative size of their intersection -> **finding textually similar documents in a large corpus, near duplicates**. [Collaborative Filtering](/tag/collaborative_filtering) as a Similar-Sets Problem (cf. online purchases, movie ratings)
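The definition, pinned down as a one-liner (the convention for two empty sets is an assumption):

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B| — 1.0 for identical sets, 0.0 for disjoint ones."""
    return len(a & b) / len(a | b) if a or b else 1.0

jaccard({1, 2, 3}, {2, 3, 4})  # 2/4 = 0.5
```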
**Shingling** turns the problem of textual similarity of documents into a problem of similarity of sets
k-shingle: a substring of length k found within a document (k = 5 works well for emails). Shingles are usually hashed to smaller values; shingles can also be built from words (a stop word + the 2 following words)
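A minimal sketch of character-level shingling with hashed shingles; k = 5 follows the note above, and the 32-bit truncation is an illustrative choice (Python's built-in `hash` is salted per process, so codes are only comparable within one run):

```python
def shingles(text: str, k: int = 5) -> set:
    """All length-k character substrings of a document, hashed to ints."""
    return {hash(text[i:i + k]) & 0xFFFFFFFF  # truncate to 32 bits
            for i in range(len(text) - k + 1)}

s1 = shingles("the quick brown fox")
s2 = shingles("the quick brown dog")
# jaccard(s1, s2) is high: the two documents share most of their shingles.
```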
Similarity-Preserving Summaries of Sets: shingle sets are large -> compress large sets into small representations (“signatures”) that preserve similarity: **[Minhashing](/tag/minhash)** - related to Jaccard similarity (good explanation in [wikipedia](https://en.wikipedia.org/wiki/MinHash))
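A minimal minhash sketch, assuming universal hash functions of the form (a·s + b) mod p over hashed shingles; the signature length of 100 is an illustrative choice. The fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying sets:

```python
import random

P = (1 << 61) - 1  # a large Mersenne prime

def minhash_signature(shingle_set, n_hashes=100, seed=0):
    """One minimum per hash function; assumes a non-empty shingle set."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(n_hashes)]
    return [min((a * s + b) % P for s in shingle_set) for a, b in coeffs]

def estimate_jaccard(sig1, sig2):
    """Agreement rate between two signatures ≈ Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```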
Even with signatures, it may still be infeasible to efficiently find the pairs of docs with the greatest similarity -> **[Locality-Sensitive Hashing](/tag/locality_sensitive_hashing)** for Documents
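The standard trick here is banding: split each minhash signature into b bands of r rows, bucket documents by the contents of each band, and treat any two documents that collide in at least one band as a candidate pair. Two documents with Jaccard similarity s become candidates with probability 1 − (1 − s^r)^b, an S-curve in s. A sketch, with b = 20 and r = 5 as illustrative choices:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands=20, rows=5):
    """signatures: {doc_id: minhash signature of length bands * rows}.

    Returns pairs of doc ids that agree on at least one full band.
    """
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[band].append(doc_id)
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates
```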
Theory of Locality-Sensitive Functions
LSH families for other distance measures
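For example, for Euclidean distance one well-known family (used in E2LSH) projects points onto a random line and cuts it into buckets of width w, so nearby points tend to land in the same bucket. A sketch, with w chosen arbitrarily:

```python
import numpy as np

def euclidean_lsh(points, w=4.0, seed=0):
    """h(x) = floor((a · x + b) / w) with random a ~ N(0, I), b ~ U[0, w)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(points.shape[1])  # random projection direction
    b = rng.uniform(0, w)                     # random bucket offset
    return np.floor((points @ a + b) / w).astype(int)
```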
Applications of Locality-Sensitive Hashing:
- entity resolution
- matching fingerprints
- matching newspaper articles
Methods for High Degrees of Similarity: LSH-based methods are most effective when the degree of similarity we accept is relatively low; when we want to find sets that are almost identical, other methods can be faster.