Introduced the Transformer model, replacing the recurrent layers entirely with attention to allow more parallelization and reduce training time.
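Uses multi-head attention in three places: encoder-decoder attention layers, self-attention layers in the encoder, and masked self-attention layers in the decoder that prevent leftward information flow, so each position attends only to itself and earlier positions. Uses sine and cosine functions of varying frequencies to encode position.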
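A minimal sketch of the two mechanisms named above, in plain Python. The function names are hypothetical, and the base constant 10000 comes from the paper's formula rather than from these notes. The mask gives disallowed positions zero weight directly, which should be equivalent to setting their scores to negative infinity before the softmax.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over allowed entries; masked entries get weight 0."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Subtract the largest allowed score for numerical stability.
        top = max(s for s, a in zip(row, allowed) if a)
        exps = [math.exp(s - top) if a else 0.0 for s, a in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With uniform scores, `masked_softmax([[0.0] * 3] * 3, causal_mask(3))` spreads each row's weight evenly over the positions up to and including its own, so no weight reaches later positions.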
Evaluated on English-French and English-German translation tasks, improving on the previous best English-German result by over 2 BLEU points, and on English constituency parsing.