Transformers from scratch | Peter Bloem
The best explanation of the transformer. Code included.

> To produce output vector $\mathbf{y}_i$, the self attention operation simply takes a weighted average over all the input vectors:
>
> $$\mathbf{y}_i = \sum_j w_{ij}\,\mathbf{x}_j,$$
>
> where the weights sum to one over all $j$. The weight $w_{ij}$ is not a parameter, as in a normal neural net, but it is derived from a function over $\mathbf{x}_i$ and $\mathbf{x}_j$. The simplest option for this function is the dot product.
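The quoted passage can be sketched in a few lines of NumPy. This is a minimal illustration, not Bloem's own code: it assumes the raw dot products are normalized with a row-wise softmax so that the weights sum to one over $j$, and the function name `basic_self_attention` is made up for this sketch.

```python
import numpy as np

def basic_self_attention(x):
    """x has shape (t, k): t input vectors of dimension k."""
    raw = x @ x.T                                  # raw weight w'_ij = x_i . x_j (dot product)
    w = np.exp(raw - raw.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)           # softmax: each row of weights sums to one
    return w @ x                                   # y_i = sum_j w_ij x_j

x = np.random.randn(4, 8)   # a toy sequence of 4 vectors
y = basic_self_attention(x)
```

Note that there are no learned parameters here: the output is determined entirely by the input vectors themselves.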