A set of language modeling and feature learning techniques where words from the vocabulary (and possibly phrases thereof) are mapped to vectors of real numbers in a low-dimensional space (relative to the vocabulary size).
~ Context-predicting models
~ Latent feature representations of words
A parameterized function mapping words of some language to vectors (of perhaps 200 to 500 dimensions). Conceptually, it is a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension.
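To make the "one dimension per word" contrast concrete, a minimal sketch (vocabulary, dimensionality and values are toy assumptions):

```python
import numpy as np

vocab = ["cat", "dog", "fish"]                 # |V| = 3
one_hot = np.eye(len(vocab))                   # one dimension per word: sparse, 3-d
E = np.random.default_rng(0).normal(size=(len(vocab), 2))  # embedding matrix: |V| x d, d = 2

idx = vocab.index("dog")
dense = one_hot[idx] @ E                       # the "embedding" is just a row lookup
assert np.allclose(dense, E[idx])
```

The multiplication by a one-hot vector shows why an embedding layer is implemented as a simple table lookup.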
"Plongement lexical" in French
Methods to generate the mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, and explicit representation in terms of the context in which words appear.
In the new generation of models, the vector estimation problem is handled as a supervised task: the weights in a word vector are set so as to maximize the probability of the contexts in which the word is observed in the corpus.
The mapping may be generated by training a neural network on a large corpus to predict a word given its context (Continuous Bag Of Words model) or to predict the context given a word (skip-gram model). The context is a window of surrounding words.
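A sketch of how (word, context) training pairs are read off such a window (sentence and window size are illustrative):

```python
def training_pairs(tokens, window=2):
    """Yield (center, context) pairs as consumed by the skip-gram model;
    CBOW groups the same pairs the other way round (context -> center)."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                     # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = training_pairs(["the", "quick", "brown", "fox"], window=1)
# e.g. ("quick", "the") and ("quick", "brown") are among the pairs
```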
The best-known software for producing word embeddings is Tomas Mikolov's Word2vec. Pre-trained word embeddings are also available on the word2vec code.google.com page.
- ranking documents in search
- boosting performance in NLP tasks such as syntactic parsing and sentiment analysis
Word2Bits - Quantized Word Vectors (2018)(About) > We show that high quality quantized word vectors using 1-2 bits per parameter can be learned by introducing a quantization function into Word2Vec. We furthermore show that training with the quantization function acts as a regularizer.
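A generic sign-based 1-bit quantizer illustrates the idea (a sketch of the general technique, not necessarily the paper's exact quantization function):

```python
import numpy as np

def quantize_1bit(v, scale=1/3):
    # map each parameter to +/- scale based on its sign; Word2Bits applies
    # such a quantization function to the weights during training
    return np.where(v >= 0, scale, -scale)

v = np.array([0.8, -0.2, 0.05, -1.3])
q = quantize_1bit(v)
# each entry of q now carries a single bit of information (its sign)
```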
Improving the Compositionality of Word Embeddings (2017)(About) (MS thesis, a paper accepted at [TheWebConf 2018](https://www2018.thewebconf.org/program/web-content-analysis/))
> This thesis explores a method to find better encodings of meaning a computer can work with. We specifically want to combine encodings of word meanings in such a way that a good encoding of their joint meaning is created. The act of combining multiple representations of meaning into a new representation of meaning is called semantic composition.
Analysis of four word embeddings (Word2Vec, GloVe, fastText and Paragram) in terms of their semantic compositionality. A method to tune these embeddings towards better compositionality, using a simple neural network architecture with definitions and lemmas from WordNet.
> Since dictionary definitions are semantically similar to their associated lemmas, they are the ideal candidate for our tuning method, as well as evaluating for compositionality. Our architecture allows for the embeddings to be composed using simple arithmetic operations, which makes these embeddings specifically suitable for production applications such as web search and data mining. We also explore more elaborate and involved compositional models, such as recurrent composition and convolutional composition.
Word Representations via Gaussian Embedding (2014)(About) > Current work in lexical distributed representations maps each word to a point vector in low-dimensional space. Mapping instead to a density provides many interesting advantages
> Novel word embedding algorithms that embed words directly as Gaussian distributional potential functions in an infinite dimensional function space. This allows us to map word types not only to vectors but to soft regions in space, modeling uncertainty, inclusion, and entailment, as well as providing a rich geometry of the latent space.
Web Content Analysis, Semantics and Knowledge – TheWebConf 2018 - Research Track(About) [CFP](https://www2018.thewebconf.org/call-for-papers/research-tracks-cfp/web-content-analysis/)
> In previous years, ‘content analysis’ and ‘semantic and knowledge’ were in separate tracks. This year, we combined these tracks to emphasize the close relationship between these topics; **the use of content to curate knowledge and the use of knowledge to guide content analysis and intelligent usage**.
Some of the accepted papers:
- A paper by [David Blei](/tag/david_blei): (Dynamic Embeddings for Language Evolution)
- Large-Scale [Hierarchical Text Classification](/tag/nlp_hierarchical_text_classification) with Recursively Regularized Deep Graph-CNN
- Improving Word Embedding Compositionality using Lexicographic Definitions ([Github](https://github.com/tscheepers/CompVec), [Thesis 2017](https://esc.fnwi.uva.nl/thesis/centraal/files/f1554608041.pdf))
- Short-Text Topic Modeling via Non-negative Matrix Factorization Enriched with Local Word-Context Correlations
Improving Distributional Similarity with Lessons Learned from Word Embeddings (O Levy - 2015)(About) > We reveal that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations, rather than the embedding algorithms themselves. Furthermore, we show that these modifications can be transferred to traditional distributional models, yielding similar gains. In contrast to prior reports, we observe mostly local or insignificant performance differences between the methods, with no global advantage to any single approach over the others.
Semantics with Dense Vectors(About) > We will introduce three methods of generating very dense, short vectors:
> 1. using dimensionality reduction methods like SVD,
> 2. using neural nets like the popular skip-gram or CBOW approaches,
> 3. a quite different approach based on neighboring words, called Brown clustering.
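Method 1 can be sketched in a few lines: truncate the SVD of a word-word co-occurrence matrix (counts below are toy values):

```python
import numpy as np

# toy word-word co-occurrence counts (rows/columns = vocabulary)
X = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 3.],
              [1., 0., 0., 1.],
              [0., 3., 1., 0.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                   # keep only the k largest singular values
embeddings = U[:, :k] * s[:k]           # one dense k-dimensional vector per word
assert embeddings.shape == (4, 2)
```

In practice the raw counts are usually reweighted (e.g. with PPMI) before the SVD.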
Dependency-Based Word Embeddings | Omer Levy(About) > While continuous word embeddings are gaining popularity, current models are based solely on linear contexts. In this work, we generalize the skip-gram model with negative sampling introduced by Mikolov et al. to include arbitrary contexts.
> Experiments with dependency-based contexts show that they produce markedly different kinds of similarities.
> In particular, the bag-of-words nature of the contexts in the “original” SKIPGRAM model yield broad topical similarities, while the dependency-based contexts yield more functional similarities of a cohyponym nature.
Word embeddings in 2017: Trends and future directions(About) - Subword-level embeddings: several methods:
> Word embeddings have been augmented with subword-level information for many applications such as named entity recognition, POS, ..., Language Modeling.
> Most of these models employ a CNN or a BiLSTM that takes as input the characters of a word and outputs a character-based word representation.
> For incorporating character information into pre-trained embeddings, however, **character n-grams features** have been shown to be more powerful. [#FastText]
> Subword units based on **byte-pair encoding** have been found to be particularly useful for machine translation where they have replaced words as the standard input units
- Out-of-vocabulary (OOV) words
- Polysemy. Multi-sense embeddings
- [Towards a Seamless Integration of Word Senses into Downstream NLP Applications](http://aclweb.org/anthology/P17-1170)
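The byte-pair encoding mentioned above can be sketched as repeatedly merging the most frequent adjacent symbol pair (toy vocabulary; a real system runs thousands of merges):

```python
from collections import Counter

def get_pair_counts(vocab):
    # vocab maps space-separated symbol sequences to word frequencies
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace every occurrence of the pair with its concatenation
    a, b = pair
    merged = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):                      # apply the 3 most frequent merges
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
```

Frequent character sequences end up as single subword units, which is what makes BPE useful as an input vocabulary for machine translation.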
Learned in translation: contextualized word vectors (Salesforce Research)(About) Models that use pretrained word vectors must learn how to use them. Our work picks up where word vectors left off by looking to improve over randomly initialized methods for contextualizing word vectors through training on an intermediate task -> We teach a neural network how to understand words in context by first teaching it how to translate English to German
A Comparative Study of Word Embeddings for Reading Comprehension(About) abstract:
The focus of past machine learning research for Reading Comprehension tasks has been primarily on the design of novel deep learning architectures. Here we show that seemingly minor choices made on
1. the use of pre-trained word embeddings, and
2. the representation of out-of-vocabulary tokens at test time,
can turn out to have a larger impact than architectural choices on the final performance
An overview of word embeddings and their connection to distributional semantic models - AYLIEN (2016)(About) > While on the surface DSMs and word embedding models use varying algorithms to learn word representations – the former count, the latter predict – both types of model fundamentally act on the same underlying statistics of the data, i.e. the co-occurrence counts between words...
> These results are in contrast to the general consensus that word embeddings are superior to traditional methods. Rather, they indicate that it typically makes no difference whatsoever whether word embeddings or distributional methods are used. What really matters is that your hyperparameters are tuned and that you utilize the appropriate pre-processing and post-processing steps.
Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models | Blog | Explosion AI(About) > A four-step strategy for deep learning with text
> Word embeddings let you treat individual words as related units of meaning, rather than entirely distinct IDs. However, most NLP problems require understanding of longer spans of text, not just individual words. There's now a simple and flexible solution that is achieving excellent performance on a wide range of problems. After embedding the text into a sequence of vectors, bidirectional RNNs are used to encode the vectors into a sentence matrix. The rows of this matrix can be understood as token vectors — they are sensitive to the sentential context of the token. The final piece of the puzzle is called an attention mechanism. This lets you reduce the sentence matrix down to a sentence vector, ready for prediction.
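The "attend" step (reducing the sentence matrix to a sentence vector) can be sketched as a softmax-weighted sum; all values and dimensions below are toy assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# sentence matrix from the "encode" step: one row per token (toy values)
S = np.random.default_rng(0).normal(size=(5, 4))   # 5 tokens, 4-d states
w = np.random.default_rng(1).normal(size=4)        # learned attention query (toy)

scores = S @ w                     # one relevance score per token
alpha = softmax(scores)            # attention weights, summing to 1
sentence_vec = alpha @ S           # weighted sum -> a single sentence vector
assert sentence_vec.shape == (4,)
```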
An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec(About) Types of word embeddings:
- Frequency based Embedding
- Count Vector
- TF-IDF Vector
- Co-Occurrence Vector
- Co-occurrence matrix (with a fixed context window): a V × V matrix, or V × N (vocabulary size × a subset of V)
- PCA or SVD: keep the k largest singular values
- Prediction based Embedding
- CBOW (Continuous Bag Of Words): one hidden layer, one output layer; predicts the probability of a word given its context
- Skip-gram: predicts the probability of the context given a word
Sample code using gensim
Learning Semantic Similarity for Very Short Texts (Arxiv, submitted on 2 Dec 2015)(About) In order to pair short text fragments—as a concatenation of separate words—an adequate distributed sentence representation is needed. Main contribution: a first step towards a hybrid method that combines the strength of dense distributed representations—as opposed to sparse term matching—with the strength of tf-idf based methods. The combination of word embeddings and tf-idf information might lead to a better model for semantic content within very short text fragments.
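One simple way to combine the two signals (a sketch, not the paper's method; toy corpus and stand-in random embeddings): weight each word's embedding by its tf-idf score before averaging:

```python
import math
import numpy as np

docs = [["cheap", "flights"], ["cheap", "hotels"], ["cheap", "flights", "today"]]
vocab = sorted({w for d in docs for w in d})
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=3) for w in vocab}   # stand-in word embeddings

def idf(word):
    df = sum(word in d for d in docs)          # document frequency
    return math.log(len(docs) / df)

def embed_text(tokens):
    # tf-idf-weighted average of word vectors: words occurring in every
    # document (idf ~ 0) contribute little to the fragment representation
    weights = np.array([tokens.count(w) * idf(w) for w in tokens])
    vecs = np.stack([emb[w] for w in tokens])
    return weights @ vecs / max(weights.sum(), 1e-9)

v = embed_text(["cheap", "flights"])
```

Here "cheap" appears in all three documents, so its idf is zero and the fragment vector is dominated by "flights".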
Short Text Similarity with Word Embeddings(About) We investigate whether determining short text similarity is possible using only semantic features. A novel feature of our approach is that an arbitrary number of word embedding sets can be incorporated.
Efficient Estimation of Word Representations in Vector Space(About) We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
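The syntactic and semantic similarity tests in this paper rely on vector arithmetic; a toy illustration with hand-made vectors (not real embeddings):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# the vector-offset analogy test: vec("king") - vec("man") + vec("woman")
# should land closest to vec("queen") (values below are contrived)
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.1, 0.8, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vecs[w]))
```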