Transformer machine learning model

Notes: neural network, auto-encoder model, attention function, RNN, LSTM, NLP
Papers: vaswani:arxiv:2017

In natural language processing (NLP), transformers replace sequential RNN/LSTM models with more parallelizable models built on attention functions. This addresses a problem seen in LSTMs: information from the first words of a long sentence is often lost by the time the end of the sentence is processed. In LSTMs, this is mitigated by adding an attention mechanism; the idea of transformers is that the attention mechanism alone is enough and the LSTM is not needed.

The attention mechanism weights the importance of every token to every other token. This reduces the number of operations required to relate signals from two arbitrary input or output positions from linear in their distance to constant, which makes it easier to learn dependencies between distant positions.
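
As a rough illustration of the token-to-token weighting described above, here is a minimal sketch of scaled dot-product attention, the attention function used in [vaswani:arxiv:2017]. It assumes plain Python lists in place of tensors, and it feeds the same vectors in as queries, keys and values, with no learned projection matrices. The function names and the toy embeddings are made up for this note.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j. Each row of weights sums to 1.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Every query is compared with every key directly, regardless of
        # how far apart the two positions are in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # The output is the weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Toy self-attention: three tokens with 2-dimensional embeddings (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Because each token's scores against all other tokens are computed independently, the rows can be evaluated in parallel, unlike the step-by-step recurrence of an RNN/LSTM.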

There can be multiple “attention heads”, each learning its own weighting: for example, one that weights nearby words, one that weights the impact of the verb, …
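
The idea of several heads can be sketched as follows. This is a simplified illustration, not the formulation from the paper: real transformers give each head its own learned projection matrices and apply a final output projection, whereas this sketch assumes each head simply sees a fixed slice of the embedding. The names `attend` and `multi_head_self_attention` and the toy embeddings are hypothetical.

```python
import math

def attend(q_vecs, k_vecs, v_vecs):
    """Scaled dot-product attention; returns one output vector per query."""
    scale = math.sqrt(len(k_vecs[0]))
    out = []
    for q in q_vecs:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in k_vecs]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        out.append([sum(wj * v[c] for wj, v in zip(w, v_vecs))
                    for c in range(len(v_vecs[0]))])
    return out

def multi_head_self_attention(tokens, num_heads):
    """Run one attention per slice of the embedding and concatenate the results.

    Assumption: slicing stands in for the learned per-head projections.
    """
    d_model = len(tokens[0])
    if d_model % num_heads != 0:
        raise ValueError("embedding size must be divisible by num_heads")
    d_head = d_model // num_heads
    merged = [[] for _ in tokens]
    for h in range(num_heads):
        # Each head only sees its own slice, so it computes its own weights.
        part = [t[h * d_head:(h + 1) * d_head] for t in tokens]
        for row, head_out in zip(merged, attend(part, part, part)):
            row.extend(head_out)
    return merged

# Three tokens with 4-dimensional embeddings (hypothetical values), two heads.
tokens = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8], [1.0, 1.0, 0.9, 0.1]]
mixed = multi_head_self_attention(tokens, num_heads=2)
```

Each head produces its own attention weights over the sequence, which is what lets different heads specialise in different relations between words.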

Used in pretrained models such as GPT-2, GPT-3, BERT, XLNet and RoBERTa.

Attention function

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity [fedus:arxiv:2021]
GShard: Scaling giant models with conditional computation and automatic sharding [lepikhin:arxiv:2020]
Attention is all you need [vaswani:arxiv:2017]
