> Deep contextualized word representations
each word is assigned a representation which is a function of the entire sentence to which it belongs. The embeddings are computed from the internal states of a two-layer bidirectional Language Model, hence the name “ELMo”: Embeddings from Language Models.
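A minimal usage sketch, assuming AllenNLP's `Elmo` module and the released pre-trained model (the file paths below are placeholders; the output keys and the 1024-dim vectors refer to the standard pre-trained ELMo and may differ by version):

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: point these at the released ELMo options/weights files.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# num_output_representations=1 returns a single scalar-mixed combination of the biLM layers.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

# Two sentences where "bank" should receive different contextual vectors.
sentences = [["I", "sat", "on", "the", "river", "bank"],
             ["The", "bank", "approved", "the", "loan"]]
character_ids = batch_to_ids(sentences)          # character-level ids, no word vocabulary needed
output = elmo(character_ids)
embeddings = output["elmo_representations"][0]   # (batch, max_len, 1024) for the standard model
```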
[1902.11269] Efficient Contextual Representation Learning Without Softmax Layer (2019)(About) **how to accelerate contextual representation learning**.
> Contextual representation models are difficult to train due to the large parameter sizes and high computational complexity
> We find that the softmax layer (the output layer) causes significant inefficiency due to the large vocabulary size.
> Therefore, we redesign the learning objective.
> Specifically, the proposed approach bypasses the softmax layer by performing language modeling with dimension reduction, and allows the models to leverage pre-trained word embeddings.
> Our framework reduces the time spent on the output layer to a negligible level, eliminates almost all of the trainable parameters of the softmax layer, and performs language modeling without truncating the vocabulary.
> When applied to ELMo, our method achieves a 4 times speedup and eliminates 80% of the trainable parameters while achieving competitive performance on downstream tasks.
**decouples learning contexts and words**
> Instead of using a softmax layer to predict the distribution of the missing word, we utilize and extend the SEMFIT layer (Kumar and Tsvetkov, 2018) to **predict the embedding of the missing word**.
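A minimal sketch of this idea, not necessarily the exact objective from the paper: project the biLM's hidden state into the space of frozen pre-trained word embeddings and regress against the target word's embedding, so no softmax over the vocabulary is ever computed (the cosine loss here is purely illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingPredictionHead(nn.Module):
    """Illustrative SEMFIT-style output layer: predicts the target word's
    pre-trained embedding instead of a distribution over the vocabulary."""

    def __init__(self, hidden_dim, pretrained_embeddings):
        super().__init__()
        # Frozen pre-trained word embeddings, shape (vocab_size, emb_dim); never updated.
        self.target_emb = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=True)
        emb_dim = pretrained_embeddings.size(1)
        # The only trainable output parameters: a small projection,
        # not a hidden_dim x vocab_size softmax matrix.
        self.proj = nn.Linear(hidden_dim, emb_dim)

    def forward(self, hidden_states, target_ids):
        pred = self.proj(hidden_states)        # (batch, seq, emb_dim)
        target = self.target_emb(target_ids)   # (batch, seq, emb_dim)
        # Cosine distance as an illustrative continuous-output loss.
        return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()
```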
[1806.06259] Evaluation of sentence embeddings in downstream and linguistic probing tasks(About) a simple approach using bag-of-words with a recently introduced language model for deep context-dependent word embeddings proved to yield better results in many tasks when compared to sentence encoders trained on entailment datasets.
> We also show, however, that we are still far away from a universal encoder that can perform consistently across several downstream tasks.
ELMo: Deep contextualized word representations (2018)(About) > models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy).
> These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM)
These representations are:
- Contextual: the representation for each word depends on the entire context in which it is used.
- Deep: combine all layers of the deep pre-trained network (a scalar-mix sketch follows this list).
- Character based: built from characters rather than a word vocabulary, so representations can be formed even for out-of-vocabulary tokens.
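The "deep" combination is, in the paper, a task-specific scalar mix of all biLM layers, ELMo_k = γ · Σ_j s_j · h_{k,j}, with softmax-normalized weights s and a scalar γ learned per task. A minimal PyTorch sketch of that mix (layer shapes are illustrative):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighted sum of all biLM layers (ELMo-style combination)."""

    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # s, before softmax
        self.gamma = nn.Parameter(torch.ones(1))               # global scale γ

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, dim) tensors, one per biLM layer
        # (character-CNN token layer plus the two LSTM layers for standard ELMo).
        s = torch.softmax(self.weights, dim=0)
        mixed = sum(w * h for w, h in zip(s, layer_outputs))
        return self.gamma * mixed
```

In the downstream model, only `weights` and `gamma` are trained together with the task; the pre-trained biLM itself stays frozen.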