A language model is a probability distribution over sequences of words. Statistical language models try to learn the probability of the next word given the words that precede it.
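In symbols (a standard formulation added here for reference, not a quote from any of the sources below), this means factorizing the joint probability of a sequence $w_1, \dots, w_T$ into next-word conditionals:

$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$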
> Models rely on an auto-regressive factorization of the joint probability of a corpus using different approaches, from n-gram models to RNNs (SOTA as of 2018-01) ([source](https://arxiv.org/abs/1801.06146))
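As a toy instance of the n-gram end of that spectrum, here is a minimal count-based bigram model in Python; the corpus and the add-alpha smoothing are illustrative assumptions, not something taken from the quoted sources.

```python
# Toy count-based bigram language model: estimate P(next | prev) from counts.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

unigram_counts = Counter(corpus)
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def p_next(prev, nxt, alpha=1.0):
    """P(nxt | prev) with add-alpha smoothing over the toy vocabulary."""
    vocab_size = len(unigram_counts)
    return (bigram_counts[prev][nxt] + alpha) / (unigram_counts[prev] + alpha * vocab_size)

print(p_next("the", "cat"))   # seen bigram: higher probability
print(p_next("cat", "rug"))   # unseen bigram: only the smoothing mass
```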
Better Language Models and Their Implications(About) Our model, called GPT-2 (a successor to GPT), was trained simply to predict the next word in 40GB of Internet text. Due to our concerns about malicious applications of the technology, we are not releasing the trained model.
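The training objective is plain next-word prediction. A minimal sketch of probing that objective with a publicly available GPT-2 checkpoint, assuming the Hugging Face `transformers` and `torch` packages (neither is part of the announcement):

```python
# Score candidate next tokens for a prefix with a small GPT-2 checkpoint.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The model was trained simply to predict the next"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits          # (1, seq_len, vocab_size)
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Show the five most likely continuations of the prompt.
top = torch.topk(next_token_probs, k=5)
for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([idx.item()])!r}: {prob.item():.3f}")
```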
NLP's ImageNet moment has arrived(About) Pretrained word embeddings have a major limitation: they only incorporate previous knowledge in the first layer of the model---the rest of the network still needs to be trained from scratch.
> The long reign of word vectors as NLP’s core representation technique has seen an exciting new line of challengers emerge: ELMo, ULMFiT, and the OpenAI transformer. These works made headlines by demonstrating that pretrained language models can be used to achieve state-of-the-art results on a wide range of NLP tasks.
> it only seems to be a question of time until pretrained word embeddings will be dethroned and replaced by pretrained language models in the toolbox of every NLP practitioner. This will likely open many new applications for NLP in settings with limited amounts of labeled data.
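A minimal PyTorch sketch (illustrative, not taken from the article) of the limitation mentioned above: pretrained word vectors only initialize the first, embedding layer, while every layer stacked on top still starts from random weights and must be trained from scratch.

```python
# Pretrained knowledge enters only through the embedding layer; the encoder and
# classifier above it are randomly initialized.
import torch
import torch.nn as nn

VOCAB, EMB_DIM, HIDDEN, NUM_CLASSES = 5000, 300, 128, 2
pretrained_vectors = torch.randn(VOCAB, EMB_DIM)   # stand-in for GloVe/word2vec vectors

embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
encoder = nn.LSTM(EMB_DIM, HIDDEN, batch_first=True)   # trained from scratch
classifier = nn.Linear(HIDDEN, NUM_CLASSES)            # trained from scratch

tokens = torch.randint(0, VOCAB, (4, 12))              # toy batch of token ids
_, (h_n, _) = encoder(embedding(tokens))
logits = classifier(h_n[-1])
print(logits.shape)                                    # torch.Size([4, 2])
```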
Deep learning : background and application to natural language processing(About)
- Neural Nets: Basics
  - Introduction to multi-layered neural networks
  - Optimization via back-propagation
  - Regularization and Dropout
  - The vanishing gradient issue
- Advanced Architectures with NLP applications
  - n-gram language models
  - Neural Machine Translation (overview)
  - Character-based models for sequence tagging
Improving Language Understanding with Unsupervised Learning(About) > can we develop one model, train it in an unsupervised way on a large amount of data, and then fine-tune the model to achieve good performance on many different tasks? Our results indicate that this approach works surprisingly well; the same core model can be fine-tuned for very different tasks with minimal adaptation.
- a scalable, task-agnostic system based on a combination of two existing ideas: transformers and unsupervised pre-training
- unsupervised generative pre-training of language models followed by discriminative fine-tuning
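A compressed sketch of that two-stage recipe with a toy backbone and random data (this is not the OpenAI implementation, which uses a Transformer decoder; only the pretrain-then-fine-tune structure is illustrated):

```python
# Stage 1: unsupervised generative pre-training (next-token prediction).
# Stage 2: discriminative fine-tuning of the same backbone with a small task head.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, NUM_CLASSES = 1000, 64, 2

class Backbone(nn.Module):
    """Shared sequence encoder reused in both stages."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)

    def forward(self, tokens):                 # (B, T) -> (B, T, DIM)
        hidden, _ = self.rnn(self.embed(tokens))
        return hidden

backbone = Backbone()
lm_head = nn.Linear(DIM, VOCAB)                # stage 1 head: next-token logits
clf_head = nn.Linear(DIM, NUM_CLASSES)         # stage 2 head: task label logits

unlabeled = torch.randint(0, VOCAB, (8, 20))   # toy unlabeled corpus batch
labeled = torch.randint(0, VOCAB, (4, 20))     # toy labeled task batch
labels = torch.randint(0, NUM_CLASSES, (4,))

# Stage 1: predict token t+1 from tokens up to t, over unlabeled text.
hidden = backbone(unlabeled)
lm_loss = F.cross_entropy(lm_head(hidden[:, :-1]).reshape(-1, VOCAB),
                          unlabeled[:, 1:].reshape(-1))

# Stage 2: reuse the pretrained backbone, add only a small task head on top
# of the final hidden state, and train discriminatively on labeled examples.
clf_loss = F.cross_entropy(clf_head(backbone(labeled)[:, -1]), labels)
print(lm_loss.item(), clf_loss.item())
```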
SRILM - The SRI Language Modeling Toolkit(About) SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.