Gensim tutorial: Similarity Queries
> "The thing to note here is that document no. 2 would never be returned by a standard boolean fulltext search, because it does not share any common words with the query string"
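A minimal sketch of the kind of LSI similarity query the tutorial walks through, assuming gensim is installed; the toy corpus, the query string and `num_topics` below are illustrative, not the tutorial's exact values:

```python
from gensim import corpora, models, similarities

documents = [
    "Human machine interface for lab abc computer applications",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
]
texts = [doc.lower().split() for doc in documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# project the bag-of-words vectors into a 2-dimensional LSI (latent topic) space
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[corpus])

# the query shares few or no words with some documents, yet LSI can still rank
# them as similar because they end up close in the latent topic space
query_bow = dictionary.doc2bow("human computer interaction".split())
sims = index[lsi[query_bow]]
print(sorted(enumerate(sims), key=lambda item: -item[1]))
```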
Intro to Automatic Keyphrase Extraction
- Candidate identification:
  - remove stop words and punctuation, filter for words with certain part-of-speech (POS) patterns, use external knowledge bases like WordNet or Wikipedia as references for good/bad keyphrases
- Keyphrase selection (ranking the candidates):
  - frequency statistics (TF-IDF, BM25): not very good on their own (the best keyphrases aren't necessarily the most frequent within a document)
  - graph-based ranking:
    - the importance of a candidate is determined by its relatedness to other candidates, e.g. via
      - frequency of co-occurrence
      - semantic relatedness
    - a document is represented as a graph (nodes = candidates)
  - topic-based clustering
- Supervised approaches:
  - previously framed as a binary classification problem
  - now framed as a ranking problem (e.g. with a Ranking SVM)
Finally, the article includes some sample code in Python (see the sketch below).
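A minimal TextRank-style sketch of the graph-based ranking idea above; the article's own code differs (it uses POS-pattern candidates, among other things), and the tiny stopword list, window size and example sentence here are all illustrative:

```python
import networkx as nx

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "by", "its", "other"}

def candidates(text):
    # crude candidate identification: lowercase tokens minus stopwords/punctuation
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOPWORDS]

def textrank_keywords(text, window=2, top_n=5):
    words = candidates(text)
    graph = nx.Graph()
    graph.add_nodes_from(set(words))           # nodes = candidate words
    for i, w in enumerate(words):              # edges = co-occurrence within the window
        for other in words[i + 1 : i + 1 + window]:
            if w != other:
                graph.add_edge(w, other)
    scores = nx.pagerank(graph)                # importance = relatedness to other candidates
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(textrank_keywords(
    "Graph based ranking determines the importance of a candidate "
    "by its relatedness to other candidate keyphrases in the document."))
```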
An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec
Types of word embeddings:
- Frequency-based embeddings
  - Count Vector
  - TF-IDF Vector
  - Co-Occurrence Vector
    - co-occurrence matrix (with a fixed context window), of size V × V or V × N (vocabulary size × a chosen subset of V)
    - reduced with PCA or SVD, keeping only the k most important components
- Prediction-based embeddings
  - CBOW (Continuous Bag Of Words): one hidden layer and one output layer; predicts the probability of a word given its context
  - Skip-gram: predicts the probability of the context words given a word
The article ends with sample code using gensim (sketched below).
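A minimal sketch of what training both model types with gensim looks like; the toy sentences and hyperparameters are illustrative (note that `vector_size` is the gensim ≥ 4 name of the parameter, older releases call it `size`):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=0 -> CBOW (predict a word from its context),
# sg=1 -> skip-gram (predict the context from a word)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv.most_similar("cat", topn=3))
print(skipgram.wv["dog"][:5])   # first few dimensions of the learned vector
```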
Text Classification With Word2Vec - DS lore
> Overall, we won’t be throwing away our SVMs any time soon in favor of word2vec but it has its place in text classification.
> 1. SVMs are pretty great at text classification tasks
> 2. Models based on simple averaging of word-vectors can be surprisingly good too (given how much information is lost in taking the average)
> 3. but they only seem to have a clear advantage when there is ridiculously little labeled training data
> Update 2017: actually, the best way to utilise the pretrained embeddings would probably be this [using keras](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html)
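As a side note on that 2017 update, here is a minimal sketch of the keras pattern the linked post describes: an `Embedding` layer initialised from pretrained vectors and kept frozen. The matrix below is a random stand-in for real word2vec/GloVe weights, and the tiny model around it is purely illustrative:

```python
import numpy as np
from tensorflow.keras import layers, models, initializers

vocab_size, embedding_dim = 1000, 50   # illustrative sizes
embedding_matrix = np.random.normal(size=(vocab_size, embedding_dim)).astype("float32")

model = models.Sequential([
    layers.Input(shape=(None,), dtype="int32"),      # sequences of word indices
    layers.Embedding(vocab_size, embedding_dim,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     trainable=False),                # keep the pretrained vectors frozen
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```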
The post includes sample code that benchmarks a few text categorization models to test whether word embeddings like word2vec can improve text classification accuracy.
The code (based on scikit-learn) includes an embedding vectorizer that, given a set of pretrained word vectors, vectorizes a text by taking the mean of the vectors of its individual words (sketched below).
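A minimal sketch of such a mean-embedding vectorizer as a scikit-learn transformer; the class name, the random stand-in vectors and the toy pipeline are illustrative, not the post's exact code:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

class MeanEmbeddingVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, word_vectors):
        # word_vectors: dict mapping word -> 1-D numpy vector
        self.word_vectors = word_vectors
        self.dim = len(next(iter(word_vectors.values())))

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # average the vectors of the known words in each text;
        # fall back to a zero vector when no word is known
        return np.array([
            np.mean([self.word_vectors[w] for w in doc.split()
                     if w in self.word_vectors] or [np.zeros(self.dim)], axis=0)
            for doc in X
        ])

# toy usage: random "embeddings" stand in for real word2vec/GloVe vectors
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["good", "bad", "movie", "film"]}
clf = make_pipeline(MeanEmbeddingVectorizer(vectors), LinearSVC())
clf.fit(["good movie", "bad film"], [1, 0])
print(clf.predict(["good film"]))
```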
Working With Text Data — scikit-learn documentation
A scikit-learn tutorial about analysing a collection of labelled text documents:
- load the file contents and the categories
- extract feature vectors (count, tf, tf-idf)
- train a linear model to perform categorization
- use a grid search strategy (to find a good configuration of both the feature extraction components and the classifier)
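A condensed sketch of those four steps, close in spirit to the tutorial (which uses the 20 newsgroups dataset); the categories and grid values below are illustrative:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# 1. load the file contents and the categories (downloads the data on first use)
train = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])

# 2./3. feature extraction (counts -> tf-idf) and a linear classifier in one pipeline
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier()),
])

# 4. grid search over both feature-extraction and classifier settings
params = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}
gs = GridSearchCV(text_clf, params, cv=3, n_jobs=-1)
gs.fit(train.data, train.target)
print(gs.best_score_, gs.best_params_)
```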