The [blog post by D. Britz](/doc/?uri=http%3A%2F%2Fwww.wildml.com%2F2016%2F01%2Fattention-and-memory-in-deep-learning-and-nlp%2F) (WildML) gives a good, simple explanation; Lilian Weng's [Attention? Attention!](/doc/?uri=https%3A%2F%2Flilianweng.github.io%2Flil-log%2F2018%2F06%2F24%2Fattention-attention.html) goes into more detail.
While simple Seq2Seq builds a single context vector from the encoder's last hidden state, attention creates shortcuts between the context vector and the entire source input, so the context vector has access to the whole input sequence.
The decoder can “attend” to different parts of the source sentence at each step of the output generation, and the model learns what to attend to based on the input sentence and what it has produced so far.
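A minimal numpy sketch of one such decoder step, assuming dot-product scoring (the `attend` name and all shapes are illustrative, not from the posts):

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Context vector = attention-weighted sum over ALL encoder states."""
    scores = encoder_states @ decoder_state          # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax -> attention weights
    context = weights @ encoder_states               # weighted sum over the whole input
    return context, weights

encoder_states = np.random.randn(6, 8)   # 6 source tokens, hidden size 8
decoder_state = np.random.randn(8)       # decoder state at the current output step
context, weights = attend(decoder_state, encoder_states)
print(weights.round(2))                  # which source tokens this step attends to
```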
Possible to interpret what the model is doing by looking at the attention weight matrix (sketched below)
Cost: we need to calculate an attention value for each combination of input and output word (-> "attention" is a bit of a misnomer: we look at everything in detail before deciding what to focus on)
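Stacking the scores for every output step gives that weight matrix, one row per output step and one column per input token; a sketch, again assuming dot-product scoring:

```python
import numpy as np

T_in, T_out, d = 6, 4, 8
encoder_states = np.random.randn(T_in, d)
decoder_states = np.random.randn(T_out, d)

scores = decoder_states @ encoder_states.T              # (T_out, T_in): every pair scored
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)           # row-wise softmax

# Each row shows where one output step "looks"; plotting this matrix
# (e.g. plt.imshow(weights)) is the usual way to inspect what the model attends to.
print(weights.shape)   # (4, 6) -> cost grows with len(input) * len(output)
```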
Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models (Explosion AI blog)
> A four-step strategy for deep learning with text
> Word embeddings let you treat individual words as related units of meaning, rather than entirely distinct IDs. However, most NLP problems require understanding of longer spans of text, not just individual words. There's now a simple and flexible solution that is achieving excellent performance on a wide range of problems. After embedding the text into a sequence of vectors, bidirectional RNNs are used to encode the vectors into a sentence matrix. The rows of this matrix can be understood as token vectors — they are sensitive to the sentential context of the token. The final piece of the puzzle is called an attention mechanism. This lets you reduce the sentence matrix down to a sentence vector, ready for prediction.
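A minimal sketch of that four-step pipeline, assuming PyTorch; the class name, sizes, and the single-layer scoring inside `attend` are illustrative assumptions, not from the post:

```python
import torch
import torch.nn as nn

class EmbedEncodeAttendPredict(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=64, n_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)             # embed: ids -> vectors
        self.encode = nn.LSTM(embed_dim, hidden_dim,
                              bidirectional=True, batch_first=True)  # encode: sentence matrix
        self.attend = nn.Linear(2 * hidden_dim, 1)                   # attend: score each token
        self.predict = nn.Linear(2 * hidden_dim, n_classes)          # predict: from sentence vector

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        x = self.embed(token_ids)                 # (batch, seq_len, embed_dim)
        H, _ = self.encode(x)                     # rows = context-sensitive token vectors
        weights = torch.softmax(self.attend(H).squeeze(-1), dim=1)  # (batch, seq_len)
        sentence_vec = (weights.unsqueeze(-1) * H).sum(dim=1)       # matrix -> sentence vector
        return self.predict(sentence_vec)

model = EmbedEncodeAttendPredict()
logits = model(torch.randint(0, 10000, (2, 7)))   # 2 sentences of 7 token ids each
print(logits.shape)                               # torch.Size([2, 5])
```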