Aho-Corasick (Java implementation)(About) Nowadays most free-text searching is based on Lucene-like approaches, where the search text is parsed into its various components. For every keyword a lookup is done to see where it occurs. When looking for a couple of keywords this approach is great. But what if you are looking not for just a couple of keywords but for 100,000 of them, for example when checking against a dictionary?
This is where the Aho-Corasick algorithm shines.
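The algorithm builds a trie of all keywords augmented with failure links, so a single pass over the text finds every occurrence of every keyword. A minimal Python sketch of the idea (the linked implementation is in Java; this is illustrative only):

```python
from collections import deque

def build_automaton(keywords):
    """Trie of keywords plus BFS-computed failure links (Aho-Corasick)."""
    trie = [{"next": {}, "fail": 0, "out": []}]  # node 0 is the root
    for kw in keywords:
        node = 0
        for ch in kw:
            if ch not in trie[node]["next"]:
                trie.append({"next": {}, "fail": 0, "out": []})
                trie[node]["next"][ch] = len(trie) - 1
            node = trie[node]["next"][ch]
        trie[node]["out"].append(kw)
    queue = deque(trie[0]["next"].values())
    while queue:  # BFS so parents are processed before children
        u = queue.popleft()
        for ch, v in trie[u]["next"].items():
            queue.append(v)
            f = trie[u]["fail"]
            while f and ch not in trie[f]["next"]:
                f = trie[f]["fail"]
            trie[v]["fail"] = trie[f]["next"].get(ch, 0)
            trie[v]["out"] += trie[trie[v]["fail"]]["out"]
    return trie

def search(trie, text):
    """Yield (end_index, keyword) for every match in one pass over text."""
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]
        node = trie[node]["next"].get(ch, 0)
        for kw in trie[node]["out"]:
            yield (i, kw)
```

Search time is linear in the text length plus the number of matches, regardless of how many keywords the automaton holds.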
[1703.03129] Learning to Remember Rare Events(About) > a large-scale life-long memory module for use in deep learning. The module exploits fast nearest-neighbor algorithms for efficiency and thus scales to large memory sizes. Except for the nearest-neighbor query, the module is fully differentiable and trained end-to-end with no extra supervision. It operates in a life-long manner, i.e., without the need to reset it during training.
> Our memory module can be easily added to any part of a supervised neural network
Visualising Top Features in Linear SVM with Scikit Learn and Matplotlib(About) > The weights obtained from svm.coef_ represent the vector coordinates which are orthogonal to the hyperplane and their direction indicates the predicted class. The absolute size of the coefficients in relation to each other can then be used to determine feature importance for the data separation task.
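A hedged sketch of that recipe on synthetic data (assumes scikit-learn; the `make_classification` dataset and feature count are arbitrary stand-ins):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# synthetic stand-in for a real dataset
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)
clf = LinearSVC(max_iter=10000).fit(X, y)

# coef_ is the weight vector orthogonal to the separating hyperplane:
# the sign of each weight points toward one class, and the relative
# magnitudes rank the features by importance for the separation.
weights = clf.coef_.ravel()
ranked = np.argsort(np.abs(weights))[::-1]  # most important features first
```

Plotting `weights[ranked]` as a bar chart, as the post does with Matplotlib, makes the top positive and negative features visible at a glance.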
Self-Taught Hashing for Fast Similarity Search (2010)(About) Emphasises the following issue in Semantic Hashing: obtaining the codes for previously unseen documents. Proposes the following approach:
> first find the optimal l-bit binary codes for all documents in the given corpus via unsupervised learning, then train l classifiers via supervised learning to predict the l-bit code for any query document unseen before.
(method summarised [here](https://www.semanticscholar.org/paper/Semantic-hashing-using-tags-and-topic-modeling-Wang-Zhang/1a0f660f70fd179003edc271694736baaa39dec4))
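The two-step recipe can be sketched as follows. The paper uses Laplacian eigenmaps for the unsupervised step and SVM classifiers for the supervised one; this sketch substitutes a median-thresholded SVD for step 1, and the toy data is random:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # toy stand-in for tf-idf document vectors
l = 4                           # number of bits in the hash code

# Step 1 (unsupervised): l-bit codes for the corpus. Thresholding each
# component at its median yields balanced bits (half 0s, half 1s).
U, _, _ = np.linalg.svd(X, full_matrices=False)
codes = (U[:, :l] > np.median(U[:, :l], axis=0)).astype(int)

# Step 2 (supervised): one binary classifier per bit, so unseen query
# documents can be hashed without recomputing the corpus-level embedding.
classifiers = [LinearSVC(max_iter=10000).fit(X, codes[:, b]) for b in range(l)]

def hash_query(x):
    """Predict the l-bit code of a previously unseen document vector."""
    return np.array([clf.predict(x.reshape(1, -1))[0] for clf in classifiers])
```

The point of step 2 is exactly the issue the paper emphasises: a new query never requires re-running the unsupervised embedding over the whole corpus.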
The backpropagation algorithm(About) a proof of the backpropagation algorithm based on a graphical approach in which the algorithm reduces to a graph labeling problem. This method is not only more general than the usual analytical derivations, which handle only the case of special network topologies, but also much easier to follow. It also shows how the algorithm can be efficiently implemented in computing systems in which only local information can be transported through the network.
Finding Similar Items(About) **Jaccard similarity**: similarity of sets, based on the relative size of their intersection -> **finding textually similar documents in a large corpus, near duplicates**. [Collaborative Filtering](/tag/collaborative_filtering) as a Similar-Sets Problem (cf. online purchases, movie ratings)
**Shingling** turns the problem of textual similarity of documents into a problem of similarity of sets
k-shingle: substring of length k found within a document (e.g. k = 5 for emails). Hashing shingles. Shingles can also be built from words (a stop word + the 2 following words)
Similarity-Preserving Summaries of Sets: shingles sets are large -> compress large sets into small representations (“signatures”) that preserve similarity: **[Minhashing](/tag/minhash)** - related to Jaccard similarity (good explanation in [wikipedia](https://en.wikipedia.org/wiki/MinHash))
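Shingling, Jaccard similarity and minhashing fit together in a few lines; a sketch (the stable hash and the linear hash family are standard illustrative choices, not from the source):

```python
import hashlib
import random

P = (1 << 61) - 1  # a Mersenne prime for the universal hash family

def stable_hash(s):
    """Deterministic 64-bit hash of a string (Python's hash() is salted)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

def shingles(text, k=5):
    """Set of character k-shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: relative size of the intersection."""
    return len(a & b) / len(a | b)

def minhash_signature(shingle_set, hash_funcs):
    """One minimum per hash function; for a random hash, the probability
    that two sets share a minimum equals their Jaccard similarity."""
    return [min(h(s) for s in shingle_set) for h in hash_funcs]

random.seed(7)
params = [(random.randrange(1, P), random.randrange(P)) for _ in range(100)]
hash_funcs = [lambda s, a=a, b=b: (a * stable_hash(s) + b) % P for a, b in params]

x = shingles("the quick brown fox jumps over the lazy dog")
y = shingles("the quick brown fox jumped over the lazy dog")
sig_x = minhash_signature(x, hash_funcs)
sig_y = minhash_signature(y, hash_funcs)
# fraction of agreeing signature positions estimates jaccard(x, y)
estimate = sum(u == v for u, v in zip(sig_x, sig_y)) / len(sig_x)
```

The two 39-odd-element shingle sets are compressed into 100-value signatures whose agreement rate approximates the true Jaccard similarity.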
It still may be impossible to find the pairs of docs with greatest similarity efficiently -> **[Locality-Sensitive Hashing](/tag/locality_sensitive_hashing)** for Documents
Theory of Locality-Sensitive Functions
LSH families for other distance measures
Applications of Locality-Sensitive Hashing:
- entity resolution
- matching fingerprints
- matching newspaper articles
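Applications like these all rest on the same banding trick over MinHash signatures; a minimal sketch (the choice of 20 bands of 5 rows assumes 100-value signatures and is illustrative):

```python
from collections import defaultdict

def lsh_candidate_pairs(signatures, bands=20, rows=5):
    """Band the signatures: documents agreeing on every row of any one
    band land in the same bucket and become a candidate pair. With b
    bands of r rows, a pair with Jaccard similarity s is caught with
    probability 1 - (1 - s**r)**b."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for docs in buckets.values():
        docs = sorted(docs)
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                pairs.add((docs[i], docs[j]))
    return pairs
```

Only candidate pairs need their actual similarity checked, which is what makes the all-pairs comparison tractable at corpus scale.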
Methods for High Degrees of Similarity: LSH-based methods are most effective when the degree of similarity we accept is relatively low. When we want to find sets that are almost identical, other methods can be faster.
A Ranking Approach to Keyphrase Extraction - Microsoft Research (2009)(About) Previously, automatic keyphrase extraction was formalized as classification and learning methods for classification were utilized. This paper points out that it is more essential to cast the keyphrase extraction problem as ranking and employ a learning to rank method to perform the task. As example, it employs Ranking SVM, a state-of-art method of learning to rank, in keyphrase extraction
Provable Algorithms for Machine Learning Problems by Rong Ge.(About) from the abstract:
Modern machine learning algorithms can extract useful information from text, images and videos. All these applications involve solving NP-hard problems in the average case using heuristics. What properties of the input allow it to be solved efficiently? Theoretically analyzing the heuristics is very challenging. Few results were known.
This thesis takes a different approach: we identify natural properties of the input, then design new algorithms that provably work assuming the input has these properties. We are able to give new, provable and sometimes practical algorithms for learning tasks related to text corpora, images and social networks.
...In theory, the assumptions in this thesis help us understand why intractable problems in machine learning can often be solved; in practice, the results suggest inherently new approaches for machine learning.