Replacement of the vectorial representation of words with a matrix representation, where each word’s representation includes information about its context.
Embedding words through a language model
> The key idea underneath is to train a contextual encoder with a language model objective on a large unannotated text corpus. During the training, part of the text is masked and the goal is to encode the remaining context and predict the missing part. ([source](/doc/?uri=https%3A%2F%2Farxiv.org%2Fabs%2F1902.11269))
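A minimal sketch of that objective, assuming a PyTorch-style setup (all names and sizes below are illustrative, not from the paper): mask a fraction of the tokens, encode the remaining context, and train the encoder to recover the masked tokens through a softmax over the vocabulary.

```python
import torch
import torch.nn as nn

# Toy sizes, assumed for illustration only.
VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, MASK_ID = 10_000, 128, 256, 0

class MaskedLM(nn.Module):
    """Contextual encoder trained to recover masked tokens from context."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.encoder = nn.LSTM(EMBED_DIM, HIDDEN_DIM,
                               batch_first=True, bidirectional=True)
        self.softmax_layer = nn.Linear(2 * HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, token_ids):
        hidden, _ = self.encoder(self.embed(token_ids))
        return self.softmax_layer(hidden)   # logits over the vocabulary

tokens = torch.randint(1, VOCAB_SIZE, (8, 20))   # a batch of token ids
masked = tokens.clone()
is_masked = torch.rand(8, 20) < 0.15             # hide ~15% of the tokens
masked[is_masked] = MASK_ID

model = MaskedLM()
logits = model(masked)
# The loss is computed only at masked positions: predict the missing part
# from the remaining context.
loss = nn.functional.cross_entropy(logits[is_masked], tokens[is_masked])
loss.backward()
```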
[1902.11269] Efficient Contextual Representation Learning Without Softmax Layer (2019) (About) **how to accelerate contextual representation learning**.
> Contextual representation models are difficult to train due to the large parameter sizes and high computational complexity.
> We find that the softmax layer (the output layer) causes significant inefficiency due to the large vocabulary size.
> Therefore, we redesign the learning objective.
> Specifically, the proposed approach bypasses the softmax layer by performing language modeling with dimension reduction, and allows the models to leverage pre-trained word embeddings.
> Our framework reduces the time spent on the output layer to a negligible level, eliminates almost all the trainable parameters of the softmax layer, and performs language modeling without truncating the vocabulary.
> When applied to ELMo, our method achieves a 4-times speedup and eliminates 80% of the trainable parameters while achieving competitive performance on downstream tasks.
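Back-of-the-envelope arithmetic makes the saving concrete (the sizes below are assumptions for illustration, not the paper's exact configuration): the softmax layer's parameter count scales with the vocabulary size, while a projection onto a fixed-dimension embedding space does not.

```python
# Illustrative sizes, assumed for this example (not the paper's exact setup).
hidden_dim = 512        # contextual encoder output dimension
vocab_size = 800_000    # open-vocabulary corpora easily reach this scale
embed_dim = 300         # dimension of the pre-trained word embeddings

softmax_params = hidden_dim * vocab_size  # 409,600,000 weights, all trainable
semfit_params = hidden_dim * embed_dim    # 153,600 weights for the projection

print(f"softmax output layer:  {softmax_params:>11,}")
print(f"embedding prediction:  {semfit_params:>11,}")
```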
**decouples learning contexts and words**
> Instead of using a softmax layer to predict the distribution of the missing word, we utilize and extend the SEMFIT layer (Kumar and Tsvetkov, 2018) to **predict the embedding of the missing word**.
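A minimal sketch of that objective (the projection, the cosine distance, and all sizes are assumptions on my part; the paper utilizes and extends the SEMFIT layer): regress the encoder's context vector onto the pre-trained embedding of the missing word, so no vocabulary-sized softmax is ever materialized.

```python
import torch
import torch.nn as nn

# Illustrative sizes, assumed for the sketch.
HIDDEN_DIM, EMBED_DIM, VOCAB_SIZE = 512, 300, 50_000

# Frozen pre-trained word embeddings; they are never updated.
pretrained = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
pretrained.weight.requires_grad_(False)

# The only trainable piece of the output layer: a small projection
# from the encoder's hidden space into the embedding space.
to_embedding = nn.Linear(HIDDEN_DIM, EMBED_DIM)

def embedding_prediction_loss(context_vectors, target_ids):
    """Regress the context vector onto the missing word's embedding
    (cosine distance is one plausible choice of objective) instead of
    computing a softmax over the full vocabulary."""
    predicted = to_embedding(context_vectors)   # (batch, EMBED_DIM)
    target = pretrained(target_ids)             # (batch, EMBED_DIM)
    return (1 - nn.functional.cosine_similarity(predicted, target)).mean()

# context_vectors would come from any contextual encoder (e.g. ELMo's biLSTMs).
context_vectors = torch.randn(32, HIDDEN_DIM)
target_ids = torch.randint(0, VOCAB_SIZE, (32,))
loss = embedding_prediction_loss(context_vectors, target_ids)
loss.backward()
```

Since the small projection is the only trainable part of the output layer, nearly all the softmax parameters disappear, and the vocabulary never has to be truncated or sampled.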
Learned in translation: contextualized word vectors (Salesforce Research) (About) Models that use pretrained word vectors must learn how to use them. Our work picks up where word vectors left off, improving over randomly initialized methods for contextualizing word vectors by training on an intermediate task -> we teach a neural network how to understand words in context by first teaching it how to translate English to German.
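A condensed sketch of that recipe (PyTorch assumed, shapes illustrative): pretrain a biLSTM encoder inside an English-to-German translation model, then freeze it and feed its outputs, concatenated with the original word vectors, to downstream task models.

```python
import torch
import torch.nn as nn

EMBED_DIM, HIDDEN_DIM = 300, 300   # illustrative sizes

# Step 1: the encoder half of an English->German translation model
# (CoVe uses a two-layer biLSTM over GloVe vectors).
mt_encoder = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=2,
                     batch_first=True, bidirectional=True)
# ... train mt_encoder jointly with an attentional decoder on an
# English-German parallel corpus, then keep only the encoder ...

# Step 2: freeze the encoder and reuse its outputs as contextualized vectors.
for param in mt_encoder.parameters():
    param.requires_grad_(False)

glove_vectors = torch.randn(4, 12, EMBED_DIM)   # a batch of embedded sentences
cove_vectors, _ = mt_encoder(glove_vectors)     # (4, 12, 2 * HIDDEN_DIM)

# Downstream task models consume [GloVe; CoVe] for each token.
features = torch.cat([glove_vectors, cove_vectors], dim=-1)
```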