Achieves efficiency (and, hence, scaling) by using a mixture-of-experts approach with thousands of experts and exploiting sparsity in the gating network, which selects at most k experts per input.
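To make the sparsity concrete, here is a minimal sketch of top-k gating, assuming the gating logits for one input have already been computed. The names `top_k_gates` and `moe_forward` are hypothetical, and the sketch omits the trainable noise that the paper adds to the logits before selection. It is an illustration, not the authors' implementation.

```python
import math


def top_k_gates(logits, k):
    """Return {expert_index: weight} for the k largest logits.

    Weights are a softmax restricted to the selected experts, so they
    sum to 1; every unselected expert implicitly has weight 0.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits (ties broken by lower index).
    top = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    m = logits[top[0]]  # largest logit, subtracted for numerical stability
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}


def moe_forward(x, experts, logits, k):
    """Weighted sum of the selected experts' outputs.

    Only the k selected experts are evaluated, which is where the
    computational saving comes from.
    """
    gates = top_k_gates(logits, k)
    return sum(g * experts[i](x) for i, g in gates.items())


# Usage: four toy scalar "experts", two active per input.
toy_experts = [lambda x, scale=s: scale * x for s in (1.0, 2.0, 3.0, 4.0)]
print(moe_forward(2.0, toy_experts, [1.0, 3.0, 2.0, 0.0], k=2))
```

With thousands of experts and a small k, the cost per input depends on k rather than on the total number of experts.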
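There are tricks in how the work is distributed and implemented to make learning efficient. These exploit the fact that the mixture-of-experts layer is applied convolutionally: the same layer is used at every timestep, so the inputs from many timesteps can be processed together as one large batch.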
They use two loss functions. One encourages all experts to have equal importance. The other encourages load balancing across experts.
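As a rough sketch of the first of these losses: the importance of an expert is the sum of its gate values over a batch, and the loss is the squared coefficient of variation of those importances, scaled by a weight. The function names, the default weight of 1.0, and the use of the population variance are assumptions made for illustration. The load-balancing loss has the same squared-coefficient-of-variation form, applied to an estimate of how many examples each expert receives.

```python
def cv_squared(values):
    """Squared coefficient of variation: variance / mean^2.

    Uses the population variance (an assumption of this sketch) and
    returns 0.0 when the mean is zero to avoid dividing by zero.
    """
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0
    variance = sum((v - mean) ** 2 for v in values) / n
    return variance / (mean * mean)


def importance_loss(gate_rows, w_importance=1.0):
    """Penalty that is 0 when all experts have equal total gate weight.

    gate_rows[b][e] is the gate value of expert e for batch example b.
    w_importance is a placeholder scaling factor, not a recommended value.
    """
    n_experts = len(gate_rows[0])
    importance = [sum(row[e] for row in gate_rows) for e in range(n_experts)]
    return w_importance * cv_squared(importance)


# Usage: a balanced batch versus one that always picks expert 0.
balanced = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]
print(importance_loss(balanced), importance_loss(collapsed))
```

The penalty grows as the gating network concentrates its weight on a few experts, which is the failure mode the loss is meant to discourage.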
They evaluate on the “1 billion word” language modeling benchmark, which consists of shuffled unique sentences from news articles containing 829 million words and a vocabulary of 793,471 words. They trained models consisting of two stacked LSTM layers with a mixture-of-experts layer between them, varying the sizes of the layers and the number of experts from 4 to 4096. Each model costs around 8 million multiply-accumulates (MACs) per training example per timestep in the forward pass, and 4 experts are active per input. Comparing models by size (number of parameters), this method gives perplexity similar to that of a plain LSTM approach. Comparing models by computational cost, this method gives up to 24% lower (better) perplexity for the same computational effort.
They also evaluated on a “100 billion word” news corpus, bilingual machine translation, and multilingual machine translation.
Notes related to Outrageously large neural networks: The sparsely-gated mixture-of-experts layer
Papers related to Outrageously large neural networks: The sparsely-gated mixture-of-experts layer
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity [fedus:arxiv:2021]
- GShard: Scaling giant models with conditional computation and automatic sharding [lepikhin:arxiv:2020]