NLP's ImageNet moment has arrived

Pretrained word embeddings have a major limitation: they only incorporate previous knowledge in the first layer of the model; the rest of the network still needs to be trained from scratch.
> The long reign of word vectors as NLP’s core representation technique has seen an exciting new line of challengers emerge: ELMo, ULMFiT, and the OpenAI transformer. These works made headlines by demonstrating that pretrained language models can be used to achieve state-of-the-art results on a wide range of NLP tasks.
> it only seems to be a question of time until pretrained word embeddings will be dethroned and replaced by pretrained language models in the toolbox of every NLP practitioner. This will likely open many new applications for NLP in settings with limited amounts of labeled data.
Word embeddings in 2017: Trends and future directions

- Subword-level embeddings: several methods:
> Word embeddings have been augmented with subword-level information for many applications such as named entity recognition, POS tagging, ..., language modeling.
> Most of these models employ a CNN or a BiLSTM that takes as input the characters of a word and outputs a character-based word representation.
> For incorporating character information into pre-trained embeddings, however, **character n-gram features** have been shown to be more powerful. [#FastText]
> Subword units based on **byte-pair encoding** have been found to be particularly useful for machine translation, where they have replaced words as the standard input units.
- Out-of-vocabulary (OOV) words
- Polysemy. Multi-sense embeddings
- [Towards a Seamless Integration of Word Senses into Downstream NLP Applications](/doc/?uri=https%3A%2F%2Farxiv.org%2Fabs%2F1710.06632)
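
To make the character n-gram idea above concrete, here is a minimal sketch in the spirit of FastText, not its actual implementation. It assumes n-grams of length 3 to 6 with `<` and `>` as word-boundary markers, hashes each n-gram into a fixed number of buckets, and averages the bucket vectors. The names `char_ngrams` and `subword_vector`, the CRC32 hash, and the random untrained table are illustrative assumptions. Because the vector is built from n-grams, an out-of-vocabulary word still gets a representation.

```python
import zlib

import numpy as np


def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, padded with < and > markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def subword_vector(word, table, n_min=3, n_max=6):
    """Average the bucket vectors of the word's n-grams (hashing trick).

    `table` is a (num_buckets, dim) array. CRC32 is an assumption made
    here for determinism; it is not the hash FastText uses.
    """
    grams = char_ngrams(word, n_min, n_max)
    if not grams:
        # Word too short to produce any n-gram: fall back to a zero vector.
        return np.zeros(table.shape[1])
    rows = [zlib.crc32(g.encode("utf-8")) % table.shape[0] for g in grams]
    return table[rows].mean(axis=0)


# Hypothetical, untrained bucket table: 1000 buckets, 8 dimensions.
rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 8))

trigrams = char_ngrams("where", 3, 3)
oov_vector = subword_vector("embeddingology", table)  # unseen word, still has a vector
```

In a trained model the table rows would be learned jointly with the word vectors; here they are random, so only the mechanics are shown.
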
An overview of word embeddings and their connection to distributional semantic models - AYLIEN (2016)

> While on the surface DSMs and word embedding models use varying algorithms to learn word representations – the former count, the latter predict – both types of model fundamentally act on the same underlying statistics of the data, i.e. the co-occurrence counts between words...
> These results are in contrast to the general consensus that word embeddings are superior to traditional methods. Rather, they indicate that it typically makes no difference whatsoever whether word embeddings or distributional methods are used. What really matters is that your hyperparameters are tuned and that you utilize the appropriate pre-processing and post-processing steps.
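
As a rough illustration of the "underlying statistics" mentioned above, the sketch below collects windowed co-occurrence counts and reweights them with positive pointwise mutual information (PPMI), a common choice on the count-based side. The window size, the symmetric context, and the function names are assumptions for the example, not details taken from the article.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count (word, context) pairs within a symmetric window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def ppmi(counts):
    """Turn raw pair counts into PPMI scores: max(0, log p(w,c) / (p(w) p(c)))."""
    total = sum(counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    scores = {}
    for (word, context), n in counts.items():
        pmi = math.log(n * total / (word_totals[word] * context_totals[context]))
        scores[(word, context)] = max(0.0, pmi)
    return scores


sentences = [["the", "cat", "sat"]]
counts = cooccurrence_counts(sentences, window=1)
scores = ppmi(counts)
```

The window size and the choice of reweighting are exactly the kind of hyperparameters and pre-processing steps the quoted passage says matter most.
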