cf. visual attention
In standard [#seq2seq](/tag/sequence_to_sequence_learning) NMT, the decoder is supposed to generate a translation solely from the last hidden state of the encoder, which therefore must capture everything about the source sentence (it must act as a sentence embedding). Not good. Hence the attention mechanism.
> we allow the decoder to “attend” to different parts of the source sentence at each step of the output generation. Importantly, we let the model learn what to attend to based on the input sentence and what it has produced so far
> each decoder output word now depends on a weighted combination of all the input states, not just the last state.
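A minimal NumPy sketch of one decoding step, assuming dot-product scoring (one common choice; the original Bahdanau et al. formulation uses a small MLP instead). All dimensions and variable names here are hypothetical:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))   # one hidden state per input word (5 words, dim 4)
decoder_state  = rng.normal(size=(4,))     # current decoder hidden state

scores  = encoder_states @ decoder_state   # one score per input word
weights = softmax(scores)                  # attention weights, sum to 1
context = weights @ encoder_states         # weighted combination of ALL input states
```

The `context` vector (not just the final encoder state) is what feeds into producing the next output word.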
Possible to interpret what the model is doing by looking at the Attention weight matrix
Cost: we need to compute an attention value for each combination of input and output word (so "attention" is a bit of a misnomer: the model looks at everything in detail before deciding what to focus on)
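Computing the full attention weight matrix makes the cost explicit: T_out × T_in scores per sentence pair. The same matrix is what gets plotted as a heatmap when interpreting the model. A hypothetical-shapes sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
T_in, T_out, d = 6, 4, 8                    # hypothetical sentence lengths / hidden size
encoder_states = rng.normal(size=(T_in, d))
decoder_states = rng.normal(size=(T_out, d))

# One score per (output word, input word) pair: T_out * T_in values in total.
scores = decoder_states @ encoder_states.T  # shape (T_out, T_in)
A = softmax(scores, axis=1)                 # each row sums to 1; this is the matrix to visualize
```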
> attention mechanism is simply giving the network access to its internal memory, which is the hidden state of the encoder
> Unlike typical memory, the memory access mechanism here is soft, which means that the network retrieves a weighted combination of all memory locations, not a value from a single discrete location
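The soft vs. discrete distinction in a toy example (illustrative values, not from the source):

```python
import numpy as np

memory  = np.array([[1., 0.], [0., 1.], [2., 2.]])  # encoder hidden states as "memory"
weights = np.array([0.1, 0.2, 0.7])                 # attention weights over locations

hard_read = memory[np.argmax(weights)]   # discrete lookup: a single memory row
soft_read = weights @ memory             # weighted combination of all rows
```

The soft read blends every location, which also keeps the whole operation differentiable and trainable by backprop.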