The best explanation of the transformer I've seen. Code included.
> To produce output vector 𝐲ᵢ, the self-attention operation simply takes a weighted average over all the input vectors:
>
> 𝐲ᵢ = Σⱼ wᵢⱼ 𝐱ⱼ
> Where the weights sum to one over all j. The weight wᵢⱼ is not a parameter, as in a normal neural net, but is derived from a function over 𝐱ᵢ and 𝐱ⱼ. The simplest option for this function is the dot product.
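The quoted passage can be sketched in a few lines of NumPy. This is a minimal illustration, not the article's own code: it assumes the raw dot-product weights are normalized with a row-wise softmax, which is what makes them sum to one over j as the quote states.

```python
import numpy as np

def self_attention(x):
    """Basic self-attention: x has shape (n, d), one input vector per row."""
    # Raw weights w'_ij = x_i . x_j, the dot product of every pair of inputs.
    w_raw = x @ x.T
    # Row-wise softmax (shifted by the row max for numerical stability),
    # so that the weights w_ij sum to one over j.
    w = np.exp(w_raw - w_raw.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    # y_i = sum_j w_ij x_j: each output is a weighted average of all inputs.
    return w @ x
```

Note that this bare version has no learned parameters at all; in a full transformer the inputs are first projected into queries, keys, and values, but the weighted-average core is exactly the three lines above.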