Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Read: 27 September 2021

arXiv 1706.03762 cs.CL
2017
Note(s): neural network, auto-encoder model, transformer model, attention function, LSTM, BLEU, NLP, google

Introduced the Transformer, a model that replaces recurrent layers entirely with attention, which makes training more parallelizable and more efficient.

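Uses multi-head attention in three places: encoder-decoder attention layers, self-attention layers in the encoder, and masked self-attention layers in the decoder that prevent leftward information flow (each position attends only to itself and earlier positions). Uses sine and cosine functions of varying frequencies to encode position.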
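
As a rough illustration of the two mechanisms above, here is a minimal single-head sketch in plain Python. It follows the paper's formulas, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), and Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The function names are hypothetical, and the learned projections and multiple heads of the real model are omitted. It assumes an even d_model, and it implements the mask by summing only over visible positions, which should be equivalent to setting the masked scores to negative infinity.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal position table; assumes d_model is even (an assumption of this sketch)."""
    table = []
    for pos in range(num_positions):
        row = []
        # two_i plays the role of 2i in the paper's formula.
        for two_i in range(0, d_model, 2):
            angle = pos / (10000 ** (two_i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Scaled dot-product attention where position t sees only positions 0..t."""
    d_k = len(keys[0])
    outputs = []
    for t, query in enumerate(queries):
        # Restricting the range to t + 1 is the "no leftward flow" mask.
        scores = [
            sum(q * k for q, k in zip(query, keys[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return outputs


# Usage: three positions with two-dimensional vectors, used as Q, K and V alike.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_attention(x, x, x)
encodings = positional_encoding(num_positions=3, d_model=4)
```

Because of the mask, changing the vector at the last position leaves the outputs at earlier positions unchanged, which is the property the decoder relies on during training.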

Evaluated on the WMT 2014 English-to-German and English-to-French translation tasks, improving on the previous best English-to-German result by over 2 BLEU points, and on English constituency parsing.


Attention function, Transformer machine learning model