A set of language modeling and feature learning techniques where words from the vocabulary (and possibly phrases thereof) are mapped to vectors of real numbers in a low-dimensional space relative to the vocabulary size.
~ Context-predicting models
~ Latent feature representations of words
Parameterized function mapping words in some language to vectors (perhaps 200 to 500 dimensions). Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension.
"Plongement lexical" in French
Methods to generate the mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, and explicit representation in terms of the context in which words appear.
In the new generation of models, the vector estimation problem is handled as a supervised task, where the weights in a word vector are set so as to maximize the probability of the contexts in which the word is observed in the corpus.
The mapping may be generated by training a neural network on a large corpus to predict a word given its context (Continuous Bag-of-Words, CBOW) or to predict the context given a word (skip-gram). The context is a window of surrounding words.
The best-known software for producing word embeddings is Tomas Mikolov's Word2vec. Pre-trained word embeddings are also available on the word2vec code.google page.
- ranking documents in search
- boosting performance in NLP tasks such as syntactic parsing and sentiment analysis
Improving Distributional Similarity with Lessons Learned from Word Embeddings (O Levy - 2015) > We reveal that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations, rather than the embedding algorithms themselves. Furthermore, we show that these modifications can be transferred to traditional distributional models, yielding similar gains. In contrast to prior reports, we observe mostly local or insignificant performance differences between the methods, with no global advantage to any single approach over the others.
Semantics with Dense Vectors > We will introduce three methods of generating very dense, short vectors:
> 1. using dimensionality reduction methods like SVD (a minimal sketch follows this list),
> 2. using neural nets like the popular skip-gram or CBOW approaches, and
> 3. a quite different approach based on neighboring words, called Brown clustering.
Dependency-Based Word Embeddings | Omer Levy > While continuous word embeddings are gaining popularity, current models are based solely on linear contexts. In this work, we generalize the skip-gram model with negative sampling introduced by Mikolov et al. to include arbitrary contexts.
> Experiments with dependency-based contexts show that they produce markedly different kinds of similarities.
> In particular, the bag-of-words nature of the contexts in the “original” SKIPGRAM model yield broad topical similarities, while the dependency-based contexts yield more functional similarities of a cohyponym nature.
Word embeddings in 2017: Trends and future directions - Subword-level embeddings (several methods):
> Word embeddings have been augmented with subword-level information for many applications such as named entity recognition, POS, ..., Language Modeling.
> Most of these models employ a CNN or a BiLSTM that takes as input the characters of a word and outputs a character-based word representation.
> For incorporating character information into pre-trained embeddings, however, **character n-grams features** have been shown to be more powerful. [#FastText]
> Subword units based on **byte-pair encoding** have been found to be particularly useful for machine translation, where they have replaced words as the standard input units.
- Out-of-vocabulary (OOV) words
- Polysemy: multi-sense embeddings
- [Towards a Seamless Integration of Word Senses into Downstream NLP Applications](http://aclweb.org/anthology/P17-1170)
Enriching Word Embeddings Using Knowledge Graph for Semantic Tagging in Conversational Dialog Systems - Microsoft Research > new simple, yet effective approaches to learn domain-specific word embeddings.
> Adapting word embeddings, such as jointly capturing syntactic and semantic information, can further enrich semantic word representations for several tasks, e.g., sentiment analysis (Tang et al. 2014), named entity recognition (Lebret, Legrand, and Collobert 2013), entity-relation extraction (Weston et al. 2013), etc. (Yu and Dredze 2014) introduced a lightly supervised word embedding learning method extending word2vec. They incorporate prior information into the objective function as a regularization term considering synonymy relations between words from WordNet (Fellbaum 1999).
> In this work, we go one step further and investigate if enriching the word2vec word embeddings trained on unstructured/unlabeled text with domain-specific semantic relations obtained from knowledge sources (e.g., knowledge graphs, search query logs, etc.) can help to discover relation-aware word embeddings. Unlike earlier work, **we encode the information about the relations between phrases; thereby, entities and relation mentions are all embedded into a low-dimensional space.**
Vectorland: Brief Notes from Using Text Embeddings for Search > the elegance is in the learning model, but the magic is in the structure of the information we model
> The source-target training pairs dictate **what notion of "relatedness"** will be modeled in the embedding space
> is eminem more similar to rihanna or rap?
Learned in translation: contextualized word vectors (Salesforce Research) > Models that use pretrained word vectors must learn how to use them. Our work picks up where word vectors left off by looking to improve over randomly initialized methods for contextualizing word vectors through training on an intermediate task: we teach a neural network how to understand words in context by first teaching it how to translate English to German.
A Comparative Study of Word Embeddings for Reading Comprehension abstract:
The focus of past machine learning research for Reading Comprehension tasks has been primarily on the design of novel deep learning architectures. Here we show that seemingly minor choices made on
1. the use of pre-trained word embeddings, and
2. the representation of out-of-vocabulary tokens at test time,
can turn out to have a larger impact than architectural choices on the final performance.
An overview of word embeddings and their connection to distributional semantic models - AYLIEN (2016) > While on the surface DSMs and word embedding models use varying algorithms to learn word representations – the former count, the latter predict – both types of model fundamentally act on the same underlying statistics of the data, i.e. the co-occurrence counts between words...
> These results are in contrast to the general consensus that word embeddings are superior to traditional methods. Rather, they indicate that it typically makes no difference whatsoever whether word embeddings or distributional methods are used. What really matters is that your hyperparameters are tuned and that you utilize the appropriate pre-processing and post-processing steps.
Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models | Blog | Explosion AI > A four-step strategy for deep learning with text
> Word embeddings let you treat individual words as related units of meaning, rather than entirely distinct IDs. However, most NLP problems require understanding of longer spans of text, not just individual words. There's now a simple and flexible solution that is achieving excellent performance on a wide range of problems. After embedding the text into a sequence of vectors, bidirectional RNNs are used to encode the vectors into a sentence matrix. The rows of this matrix can be understood as token vectors — they are sensitive to the sentential context of the token. The final piece of the puzzle is called an attention mechanism. This lets you reduce the sentence matrix down to a sentence vector, ready for prediction.
Learning Semantic Similarity for Very Short Texts (arXiv, submitted on 2 Dec 2015) > In order to pair short text fragments—as a concatenation of separate words—an adequate distributed sentence representation is needed. Main contribution: a first step towards a hybrid method that combines the strength of dense distributed representations—as opposed to sparse term matching—with the strength of tf-idf based methods. The combination of word embeddings and tf-idf information might lead to a better model for semantic content within very short text fragments.
Short Text Similarity with Word Embeddings > We investigate whether determining short text similarity is possible using only semantic features. A novel feature of our approach is that an arbitrary number of word embedding sets can be incorporated.
Efficient Estimation of Word Representations in Vector Space > We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.