Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean
[arXiv] [Google Scholar] [DBLP] [Citeseer]
Read: 27 September 2021

arXiv 1701.06538 cs.LG
Note(s): mixture of experts, machine learning, neural network, gating network, CNN, LSTM, NLP, google

Achieves efficiency (and, hence, scaling) by using a mixture of experts approach with thousands of experts and exploiting sparsity in the gating network: for each input, at most k experts are selected and only those experts are evaluated.
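A minimal sketch of sparse top-k gating, to make the idea concrete. It is a simplification, not the paper's implementation: the paper also adds tunable noise to the gate logits before selecting the top k, which is omitted here. The names `top_k_gates`, `sparse_moe` and `w_gate` are made up for this illustration.

```python
import numpy as np

def top_k_gates(logits, k):
    """Softmax over the k largest logits in each row; all other gates are 0."""
    logits = np.asarray(logits, dtype=float)
    # Indices of the k largest logits per row (their order does not matter).
    top_idx = np.argpartition(logits, -k, axis=-1)[..., -k:]
    masked = np.full_like(logits, -np.inf)
    kept = np.take_along_axis(logits, top_idx, axis=-1)
    np.put_along_axis(masked, top_idx, kept, axis=-1)
    # exp(-inf) == 0, so unselected experts get a gate of exactly zero.
    masked -= masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

def sparse_moe(x, w_gate, experts, k):
    """Gate-weighted sum of expert outputs, running only the selected experts."""
    gates = top_k_gates(x @ w_gate, k)          # shape (batch, n_experts)
    out = None
    for e, expert in enumerate(experts):
        rows = np.nonzero(gates[:, e])[0]       # examples routed to expert e
        if rows.size == 0:
            continue                            # expert is skipped entirely
        y = expert(x[rows])
        if out is None:
            out = np.zeros((x.shape[0], y.shape[1]))
        out[rows] += gates[rows, e][:, None] * y
    return out
```

With k fixed, the cost per example depends on k and the expert size rather than on the total number of experts, which is what lets the parameter count grow without a matching growth in computation.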

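There are tricks in how the work is distributed and implemented to make learning efficient. They combine data and model parallelism and exploit the convolutional way the layer is applied: the same mixture of experts layer runs at every time step of the LSTM, so all time steps can be processed together as one large batch.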
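A minimal sketch of one such trick, as I read the paper: because the same layer is applied at every time step, the time dimension can be folded into the batch dimension so each expert sees a much larger batch. The function name `apply_over_time` and the `(batch, time, features)` layout are assumptions for illustration.

```python
import numpy as np

def apply_over_time(layer, hidden):
    """Apply `layer` to every time step at once by flattening time into batch.

    hidden: array of shape (batch, time, features).
    layer:  callable mapping (n, features) -> (n, out_features).
    """
    batch, time, features = hidden.shape
    flat = hidden.reshape(batch * time, features)   # one big batch
    out = layer(flat)
    return out.reshape(batch, time, -1)             # restore the time axis
```

This only works for a layer with no dependency between time steps, which is the case for a mixture of experts layer placed between recurrent layers.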

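They add two auxiliary loss terms to the training objective. One encourages all experts to have equal importance, where an expert's importance is its total gate weight over a batch. The other encourages load balancing, so that each expert receives roughly the same number of examples.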
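A minimal sketch of the importance term, which the paper defines as a weight times the squared coefficient of variation of the per-expert importance values. The function names, the default weight of 0.1, the `eps` guard and the use of population variance are assumptions made here. The load term has the same form but, in the paper, is computed on a smooth estimate of the number of examples per expert; that estimator is not reproduced.

```python
import numpy as np

def cv_squared(values, eps=1e-10):
    """Squared coefficient of variation: variance / mean^2."""
    values = np.asarray(values, dtype=float)
    return values.var() / (values.mean() ** 2 + eps)

def importance_loss(gates, w_importance=0.1):
    """Penalise unequal total gate weight across experts.

    gates: array of shape (batch, n_experts) holding the gate values.
    """
    importance = np.asarray(gates, dtype=float).sum(axis=0)  # per-expert total
    return w_importance * cv_squared(importance)
```

The loss is zero when every expert receives the same total gate weight and grows as the gating network concentrates on a few experts.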

They evaluate on a “billion word language model” benchmark consisting of shuffled unique sentences from news articles containing 829 million words and a vocabulary of 793,471 words. They train models consisting of two stacked LSTM layers with a mixture of experts layer between them, varying the sizes of the layers and the number of experts from 4 to 4096. The models are matched at around 8 million MACs (multiply-accumulates) per training example per timestep in the forward pass, and 4 experts are active per input. Comparing models by size (number of parameters), this method gives similar perplexity to a straight LSTM approach. Comparing models by computational cost, this method gives 24% lower (better) perplexity for the same computational effort.

They also evaluated on a “100 billion word” news corpus, bilingual machine translation, and multilingual machine translation.

Mixture of experts

  • Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity [fedus:arxiv:2021]
  • GShard: Scaling giant models with conditional computation and automatic sharding [lepikhin:arxiv:2020]