Aho-Corasick (Java implementation)(About) Nowadays most free-text searching is based on Lucene-like approaches, where the search text is parsed into its various components. For every keyword a lookup is done to see where it occurs. When looking for a couple of keywords this approach is great. But what if you are looking not for just a couple of keywords but for 100,000 of them, for example when checking against a dictionary?
This is where the Aho-Corasick algorithm shines.
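The algorithm builds a trie of all keywords augmented with failure links, so a single pass over the text finds every occurrence of every keyword. A minimal Python sketch of the idea (the linked implementation is in Java; this is illustrative only):

```python
from collections import deque

def build_automaton(keywords):
    """Trie of keywords plus BFS-computed failure links (Aho-Corasick)."""
    trie = [{"next": {}, "fail": 0, "out": []}]  # node 0 is the root
    for kw in keywords:
        node = 0
        for ch in kw:
            if ch not in trie[node]["next"]:
                trie.append({"next": {}, "fail": 0, "out": []})
                trie[node]["next"][ch] = len(trie) - 1
            node = trie[node]["next"][ch]
        trie[node]["out"].append(kw)
    queue = deque(trie[0]["next"].values())
    while queue:  # BFS so parents are processed before children
        u = queue.popleft()
        for ch, v in trie[u]["next"].items():
            queue.append(v)
            f = trie[u]["fail"]
            while f and ch not in trie[f]["next"]:
                f = trie[f]["fail"]
            trie[v]["fail"] = trie[f]["next"].get(ch, 0)
            trie[v]["out"] += trie[trie[v]["fail"]]["out"]
    return trie

def search(trie, text):
    """Yield (end_index, keyword) for every match in one pass over text."""
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]
        node = trie[node]["next"].get(ch, 0)
        for kw in trie[node]["out"]:
            yield (i, kw)
```

Search time is linear in the text length plus the number of matches, regardless of how many keywords the automaton holds.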
[1703.03129] Learning to Remember Rare Events(About) > a large-scale life-long memory module for use in deep learning. The module exploits fast nearest-neighbor algorithms for efficiency and thus scales to large memory sizes. Except for the nearest-neighbor query, the module is fully differentiable and trained end-to-end with no extra supervision. It operates in a life-long manner, i.e., without the need to reset it during training.
> Our memory module can be easily added to any part of a supervised neural network
Visualising Top Features in Linear SVM with Scikit Learn and Matplotlib(About) > The weights obtained from svm.coef_ represent the vector coordinates which are orthogonal to the hyperplane and their direction indicates the predicted class. The absolute size of the coefficients in relation to each other can then be used to determine feature importance for the data separation task.
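A hedged sketch of that recipe on synthetic data (assumes scikit-learn; the `make_classification` dataset and feature count are arbitrary stand-ins):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# synthetic stand-in for a real dataset
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)
clf = LinearSVC(max_iter=10000).fit(X, y)

# coef_ is the weight vector orthogonal to the separating hyperplane:
# the sign of each weight points toward one class, and the relative
# magnitudes rank the features by importance for the separation.
weights = clf.coef_.ravel()
ranked = np.argsort(np.abs(weights))[::-1]  # most important features first
```

Plotting `weights[ranked]` as a bar chart, as the post does with Matplotlib, makes the top positive and negative features visible at a glance.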
Self-Taught Hashing for Fast Similarity Search (2010)(About) Emphasises the following issue in Semantic Hashing: obtaining the codes for previously unseen documents. Proposes the following approach:
> first find the optimal l-bit binary codes for all documents in the given corpus via unsupervised learning, then train l classifiers via supervised learning to predict the l-bit code for any query document unseen before.
(method summarised [here](https://www.semanticscholar.org/paper/Semantic-hashing-using-tags-and-topic-modeling-Wang-Zhang/1a0f660f70fd179003edc271694736baaa39dec4))
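The two-step recipe can be sketched as follows. The paper uses Laplacian eigenmaps for the unsupervised step and SVM classifiers for the supervised one; this sketch substitutes a median-thresholded SVD for step 1, and the toy data is random:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # toy stand-in for tf-idf document vectors
l = 4                           # number of bits in the hash code

# Step 1 (unsupervised): l-bit codes for the corpus. Thresholding each
# component at its median yields balanced bits (half 0s, half 1s).
U, _, _ = np.linalg.svd(X, full_matrices=False)
codes = (U[:, :l] > np.median(U[:, :l], axis=0)).astype(int)

# Step 2 (supervised): one binary classifier per bit, so unseen query
# documents can be hashed without recomputing the corpus-level embedding.
classifiers = [LinearSVC(max_iter=10000).fit(X, codes[:, b]) for b in range(l)]

def hash_query(x):
    """Predict the l-bit code of a previously unseen document vector."""
    return np.array([clf.predict(x.reshape(1, -1))[0] for clf in classifiers])
```

The point of step 2 is exactly the issue the paper emphasises: a new query never requires re-running the unsupervised embedding over the whole corpus.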
The backpropagation algorithm(About) a proof of the backpropagation algorithm based on a graphical approach in which the algorithm reduces to a graph labeling problem. This method is not only more general than the usual analytical derivations, which handle only the case of special network topologies, but also much easier to follow. It also shows how the algorithm can be efficiently implemented in computing systems in which only local information can be transported through the network.
Finding Similar Items(About) **Jaccard similarity**: similarity of sets, based on the relative size of their intersection -> **finding textually similar documents in a large corpus, near duplicates**. [Collaborative Filtering](/tag/collaborative_filtering) as a Similar-Sets Problem (cf. online purchases, movie ratings)
**Shingling** turns the problem of textual similarity of documents into a problem of similarity of sets
k-shingle: substring of length k found within a document (e.g. k = 5 for emails). Hashing shingles. Shingles can also be built from words (a stop word + the 2 following words)
Similarity-Preserving Summaries of Sets: shingles sets are large -> compress large sets into small representations (“signatures”) that preserve similarity: **[Minhashing](/tag/minhash)** - related to Jaccard similarity (good explanation in [wikipedia](https://en.wikipedia.org/wiki/MinHash))
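Shingling, Jaccard similarity and minhashing fit together in a few lines; a sketch (the stable hash and the linear hash family are standard illustrative choices, not from the source):

```python
import hashlib
import random

P = (1 << 61) - 1  # a Mersenne prime for the universal hash family

def stable_hash(s):
    """Deterministic 64-bit hash of a string (Python's hash() is salted)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

def shingles(text, k=5):
    """Set of character k-shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: relative size of the intersection."""
    return len(a & b) / len(a | b)

def minhash_signature(shingle_set, hash_funcs):
    """One minimum per hash function; for a random hash, the probability
    that two sets share a minimum equals their Jaccard similarity."""
    return [min(h(s) for s in shingle_set) for h in hash_funcs]

random.seed(7)
params = [(random.randrange(1, P), random.randrange(P)) for _ in range(100)]
hash_funcs = [lambda s, a=a, b=b: (a * stable_hash(s) + b) % P for a, b in params]

x = shingles("the quick brown fox jumps over the lazy dog")
y = shingles("the quick brown fox jumped over the lazy dog")
sig_x = minhash_signature(x, hash_funcs)
sig_y = minhash_signature(y, hash_funcs)
# fraction of agreeing signature positions estimates jaccard(x, y)
estimate = sum(u == v for u, v in zip(sig_x, sig_y)) / len(sig_x)
```

The two 39-odd-element shingle sets are compressed into 100-value signatures whose agreement rate approximates the true Jaccard similarity.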
It still may be impossible to find the pairs of docs with greatest similarity efficiently -> **[Locality-Sensitive Hashing](/tag/locality_sensitive_hashing)** for Documents
Theory of Locality-Sensitive Functions
LSH families for other distance measures
Applications of Locality-Sensitive Hashing:
- entity resolution
- matching fingerprints
- matching newspaper articles
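Applications like these all rest on the same banding trick over MinHash signatures; a minimal sketch (the choice of 20 bands of 5 rows assumes 100-value signatures and is illustrative):

```python
from collections import defaultdict

def lsh_candidate_pairs(signatures, bands=20, rows=5):
    """Band the signatures: documents agreeing on every row of any one
    band land in the same bucket and become a candidate pair. With b
    bands of r rows, a pair with Jaccard similarity s is caught with
    probability 1 - (1 - s**r)**b."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for docs in buckets.values():
        docs = sorted(docs)
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                pairs.add((docs[i], docs[j]))
    return pairs
```

Only candidate pairs need their actual similarity checked, which is what makes the all-pairs comparison tractable at corpus scale.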
Methods for High Degrees of Similarity: LSH-based methods are most effective when the degree of similarity we accept is relatively low. When we want to find sets that are almost identical, other methods can be faster.
A Ranking Approach to Keyphrase Extraction - Microsoft Research (2009)(About) Previously, automatic keyphrase extraction was formalized as classification and learning methods for classification were utilized. This paper points out that it is more essential to cast the keyphrase extraction problem as ranking and employ a learning to rank method to perform the task. As example, it employs Ranking SVM, a state-of-art method of learning to rank, in keyphrase extraction
Provable Algorithms for Machine Learning Problems by Rong Ge.(About) from the abstract:
Modern machine learning algorithms can extract useful information from text, images and videos. All these applications involve solving NP-hard problems in the average case using heuristics. What properties of the input allow it to be solved efficiently? Theoretically analyzing the heuristics is very challenging. Few results were known.
This thesis takes a different approach: we identify natural properties of the input, then design new algorithms that provably work assuming the input has these properties. We are able to give new, provable and sometimes practical algorithms for learning tasks related to text corpora, images and social networks.
...In theory, the assumptions in this thesis help us understand why intractable problems in machine learning can often be solved; in practice, the results suggest inherently new approaches for machine learning.