Approach to machine translation in which a large neural network is trained to maximize translation performance. It is a radical departure from the phrase-based statistical translation approaches, in which a translation system consists of subcomponents that are separately optimized.
The neural network uses a bidirectional recurrent neural network (RNN), known as an encoder, to encode the source sentence for a second RNN, known as a decoder, which predicts the words in the target language.
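A minimal NumPy sketch of the encoder side, assuming toy sizes and illustrative weight names (`E`, `Wf`, `Wb` are not from any specific paper, and a plain tanh recurrence stands in for the gated units used in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 10, 4                   # toy vocabulary and state sizes
E = rng.normal(size=(vocab, hidden))    # embedding table (illustrative)
Wf = rng.normal(size=(hidden, hidden))  # forward-direction recurrence weights
Wb = rng.normal(size=(hidden, hidden))  # backward-direction recurrence weights

def rnn(token_ids, W):
    """Run a simple (non-gated) RNN over a token sequence, return all states."""
    h = np.zeros(hidden)
    states = []
    for t in token_ids:
        h = np.tanh(W @ h + E[t])
        states.append(h)
    return states

src = [1, 5, 2, 7]                      # token ids of a source sentence
fwd = rnn(src, Wf)                      # left-to-right pass
bwd = rnn(src[::-1], Wb)[::-1]          # right-to-left pass, re-aligned
enc = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]  # one bidirectional state per word
# enc[-1] (or a projection of it) would initialize the decoder RNN
```

Each source position thus gets a state that summarizes both its left and right context, which is what the decoder (and, below, the attention mechanism) reads from.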
Attention and Memory in Deep Learning and NLP – WildML; cf. visual attention
In standard [#seq2seq](/tag/sequence_to_sequence_learning) NMT, the decoder is supposed to generate a translation solely from the last hidden state of the encoder, which therefore must capture everything about the source sentence in a single fixed-size vector (a sentence embedding). This is a bottleneck, especially for long sentences. Hence the attention mechanism.
> we allow the decoder to “attend” to different parts of the source sentence at each step of the output generation. Importantly, we let the model learn what to attend to based on the input sentence and what it has produced so far
> each decoder output word now depends on a weighted combination of all the input states, not just the last state.
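That weighted combination can be sketched directly (a toy NumPy sketch; dot-product scoring is an illustrative choice here, the original attention paper scores with a small feed-forward network instead):

```python
import numpy as np

def attend(dec_state, enc_states):
    """Soft attention: weight every encoder state, return the context vector."""
    scores = enc_states @ dec_state        # one score per input position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax -> weights sum to 1
    context = weights @ enc_states         # weighted combination of ALL input states
    return context, weights

enc_states = np.random.default_rng(0).normal(size=(5, 4))  # 5 source positions, dim 4
dec_state = np.zeros(4)                    # current decoder hidden state
context, weights = attend(dec_state, enc_states)
# with an all-zero decoder state every score is 0, so the weights are uniform (0.2 each)
```

The decoder then conditions the next output word on `context` together with its own state, so every output step sees the whole input, not just the final encoder state.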
Possible to interpret what the model is doing by looking at the attention weight matrix (which source words each output word attended to)
Cost: we need to calculate an attention value for each combination of input and output word, so the cost grows with input length × output length (-> attention is a bit of a misnomer: we look at everything in detail before deciding what to focus on)
> attention mechanism is simply giving the network access to its internal memory, which is the hidden state of the encoder
> Unlike typical memory, the memory access mechanism here is soft, which means that the network retrieves a weighted combination of all memory locations, not a value from a single discrete location
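The soft/hard distinction in that quote can be made concrete (a toy sketch, not any particular library's API):

```python
import numpy as np

memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 2.0]])        # 3 "memory locations" (e.g. encoder states)
weights = np.array([0.5, 0.3, 0.2])    # soft address: a distribution over locations

hard_read = memory[np.argmax(weights)] # discrete lookup: a single row, [1.0, 0.0]
soft_read = weights @ memory           # soft lookup: weighted mix of all rows, [0.9, 0.7]
# the soft read is differentiable, so the addressing weights can be learned by backprop
```

Hard addressing would require a discrete, non-differentiable choice; the soft version keeps the whole pipeline trainable end to end.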