Contrastive Unsupervised Learning of Semantic Representations: A Theoretical Framework – Off the convex path [paper](/doc/?uri=https%3A%2F%2Farxiv.org%2Fabs%2F1902.09229).
Why do objectives similar to the one used by word2vec succeed in such diverse settings? ("Contrastive Unsupervised Representation Learning")
> In contrastive learning the objective used at test time is very different from the training objective: generalization error is not the right way to think about this.
-> a framework that formalizes the notion of semantic similarity that is implicitly used by these algorithms
> **if the unsupervised loss happens to be small at the end of contrastive learning, then the resulting representations perform well on downstream classification**
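The unsupervised loss in question is a logistic contrastive loss: an anchor x and a similar point x+ (drawn from the same latent class) should score higher under the learned representation f than a dissimilar point x-. A minimal numpy sketch with one negative per anchor (my illustration, not the authors' code):

```python
import numpy as np

def logistic_contrastive_loss(f_x, f_pos, f_neg):
    """f_x, f_pos, f_neg: (batch, dim) representations of an anchor x,
    a similar point x+, and a dissimilar point x-."""
    pos = np.sum(f_x * f_pos, axis=1)    # <f(x), f(x+)>
    neg = np.sum(f_x * f_neg, axis=1)    # <f(x), f(x-)>
    # log(1 + exp(neg - pos)): small when the positive outscores the negative
    return float(np.mean(np.logaddexp(0.0, neg - pos)))
```

It is this quantity being small at the end of training that the guarantee above connects to downstream linear classification.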
Word Embeddings: Explaining their properties – Off the convex path; the second part of [this post](/doc/?uri=http%3A%2F%2Fwww.offconvex.org%2F2015%2F12%2F12%2Fword-embeddings-1%2F)
>- What properties of natural languages cause these low-dimensional embeddings to exist?
>- Why do low-dimensional embeddings work better at analogy solving than high dimensional embeddings?
A Latent Variable Model Approach to PMI-based Word Embeddings (2016) [Related YouTube video](/doc/?uri=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DKR46z_V0BVw)
Based on a generative model (a random walk over words driven by a latent discourse vector), the paper gives a rigorous justification for models such as word2vec and GloVe, including the hyperparameter choices for the latter, and a mathematical explanation for why these word embeddings allow analogies to be solved using linear algebra.
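Concretely, the linear structure means an analogy like man:king :: woman:? can be solved by a nearest-neighbor search around king - man + woman. A sketch of that procedure, assuming a dict `vec` of word vectors (my own illustration):

```python
import numpy as np

def solve_analogy(vec, a, b, c):
    """Return argmax_w cos(vec[w], vec[b] - vec[a] + vec[c]),
    excluding the query words, e.g. a="man", b="king", c="woman"."""
    target = vec[b] - vec[a] + vec[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in vec.items():
        if w in (a, b, c):
            continue
        sim = (v @ target) / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best  # ideally "queen"
```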
[1601.03764] Linear Algebraic Structure of Word Senses, with Applications to Polysemy (2016 - revised 2018)
> Here it is shown that multiple word senses reside in linear superposition within the word embedding and simple sparse coding can recover vectors that approximately capture the senses.
> Each extracted word sense is accompanied by one of about 2000 “discourse atoms” that gives a succinct description of which other words co-occur with that word sense.
> The success of the approach is mathematically explained using a variant of the random walk on discourses model
("random walk": a generative model for language). Under the assumptions of this model, there exists a linear relationship between the vector of a word w and the vectors of the words in its contexts. (The vector of w is not simply the average of its context word vectors, but for a given corpus the matrix of the linear relationship does not depend on w. It can therefore be estimated, and the embedding of a word can then be computed from the contexts it appears in; see the sketches below.)
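A minimal sketch of how that shared matrix could be estimated by least squares, assuming arrays `V` (word vectors) and `C` (each word's averaged context vectors) have already been built from the corpus; function names are my own:

```python
import numpy as np

def fit_context_map(V, C):
    """V: (n_words, d) word vectors; C: (n_words, d) averaged context
    vectors for the same words. Solves min_A ||C @ A.T - V||_F^2, i.e.
    v_w ~ A @ c_w with one matrix A shared by all words."""
    A_T, *_ = np.linalg.lstsq(C, V, rcond=None)
    return A_T.T

def induce_vector(A, context_vectors):
    """Embed a word (or word sense) from the vectors of its context words."""
    return A @ np.mean(context_vectors, axis=0)
```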
[Related blog post](/doc/?uri=https%3A%2F%2Fwww.offconvex.org%2F2016%2F07%2F10%2Fembeddingspolysemy%2F)
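The sparse-coding step quoted above can be sketched as follows: each word vector is approximated as a sparse combination of a few of roughly 2000 shared atoms. Here sklearn's DictionaryLearning is a stand-in solver, not the paper's pipeline:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def discourse_atoms(word_vectors, n_atoms=2000, n_nonzero=5):
    """word_vectors: (n_words, d). Returns (atoms, codes) such that
    word_vectors ~ codes @ atoms, with at most n_nonzero atoms per word.
    (Use toy sizes when experimenting: fitting 2000 atoms this way is slow.)"""
    dl = DictionaryLearning(n_components=n_atoms,
                            transform_algorithm="omp",
                            transform_n_nonzero_coefs=n_nonzero)
    codes = dl.fit_transform(word_vectors)   # sparse coefficients per word
    atoms = dl.components_                   # (n_atoms, d) discourse atoms
    return atoms, codes
```

The nonzero entries in a word's code row then pick out the handful of atoms, i.e. candidate senses, that its vector superposes.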
Sanjeev Arora on "A theoretical approach to semantic representations" - YouTube (2016) Why do low-dimensional word vectors exist?
> a text corpus is imagined as being generated by a random walk in a latent variable space, and the word production is via a loglinear distribution. This model is shown to imply several empirically discovered past methods for word embedding like word2vec, GloVe, PMI etc
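A toy simulation of that generative model (dimensions and step size are made up) makes the two ingredients concrete: a slowly drifting discourse vector c_t, and words emitted log-linearly, P(w | c_t) proportional to exp(<c_t, v_w>):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim, n_steps = 1000, 50, 20
V = rng.normal(size=(n_words, dim)) / np.sqrt(dim)  # word vectors v_w
c = rng.normal(size=dim)
c /= np.linalg.norm(c)                              # discourse vector c_0

corpus = []
for _ in range(n_steps):
    logits = V @ c                       # log-linear production: <c_t, v_w>
    p = np.exp(logits - logits.max())
    p /= p.sum()
    corpus.append(int(rng.choice(n_words, p=p)))  # emit one word id
    c += 0.1 * rng.normal(size=dim)      # small random-walk step
    c /= np.linalg.norm(c)               # keep the discourse on the sphere
```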
A Simple but Tough-to-Beat Baseline for Sentence Embeddings (2017)
> Use word embeddings computed using one of the popular methods on unlabeled corpus like Wikipedia, represent the sentence by a weighted average of the word vectors, and then modify them a bit using PCA/SVD
See also [youtube: Sanjeev Arora on "A theoretical approach to semantic representations"](https://www.youtube.com/watch?v=KR46z_V0BVw)
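A compact sketch of that recipe (the paper's SIF weighting a/(a + p(w)) followed by removal of the first singular vector); `vec` and `word_freq` are assumed lookup tables built from pretrained embeddings and corpus unigram frequencies:

```python
import numpy as np

def sif_embeddings(sentences, vec, word_freq, a=1e-3):
    """sentences: list of token lists; vec: word -> (d,) embedding;
    word_freq: word -> unigram probability p(w); a: smoothing constant."""
    # weighted average: down-weight frequent words by a / (a + p(w))
    X = np.stack([
        np.mean([a / (a + word_freq[w]) * vec[w] for w in sent], axis=0)
        for sent in sentences
    ])
    # "modify them a bit using PCA/SVD": remove the common component,
    # i.e. the projection of each row onto the first singular vector of X
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - np.outer(X @ u, u)
```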