Several tricks in how the work is distributed and implemented exploit the nature of CNNs to make learning efficient.
They use two loss functions. One encourages all experts to have equal importance. The other encourages load balancing across experts.
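As a rough illustration of the two losses, here is a minimal sketch that assumes each is a penalty on the squared coefficient of variation of a per-expert statistic. The function names, the `weight` value, and the use of population variance are illustrative assumptions, not details taken from these notes. The load version below uses hard counts for readability; a trainable version would need a smooth, differentiable estimate instead.

```python
from statistics import fmean, pvariance


def cv_squared(values):
    """Squared coefficient of variation: variance / mean^2 (0 if mean is 0)."""
    mean = fmean(values)
    if mean == 0.0:
        return 0.0
    return pvariance(values, mu=mean) / (mean * mean)


def importance_loss(gates, weight=0.1):
    """gates[b][e] is the gate value of expert e for example b.

    An expert's importance is the sum of its gate values over the batch.
    The loss is zero when all experts are equally important.
    `weight` is a hypothetical hyperparameter.
    """
    num_experts = len(gates[0])
    importance = [sum(row[e] for row in gates) for e in range(num_experts)]
    return weight * cv_squared(importance)


def load_loss_hard_counts(gates, weight=0.1):
    """Same penalty applied to how many examples each expert receives.

    Simplification: counts nonzero gates, which is not differentiable.
    """
    num_experts = len(gates[0])
    counts = [sum(1 for row in gates if row[e] > 0.0) for e in range(num_experts)]
    return weight * cv_squared(counts)


# Two examples, four experts, two active experts per example.
balanced = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
skewed = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]

print(importance_loss(balanced), importance_loss(skewed))
print(load_loss_hard_counts(balanced), load_loss_hard_counts(skewed))
```

Both penalties are zero for the balanced batch and positive for the skewed one, where a single expert receives all of the gate weight and all of the examples.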
They evaluate on the “billion word language model” benchmark, which consists of shuffled unique sentences from news articles containing 829 million words and a vocabulary of 793,471 words. They trained models consisting of two stacked LSTM layers with a mixture of experts layer between them, varying the sizes of the layers and the number of experts from 4 to 4096. Each model costs around 8 million multiply-accumulate operations (MACs) per training example per timestep in the forward pass, and 4 experts are active per input. Comparing models by size (number of parameters), this method gives perplexity similar to a plain LSTM approach. Comparing models by computational cost, this method gives 24% lower (better) perplexity for the same computational effort.
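To make "4 experts are active per input" concrete, the following is a minimal sketch of top-k gating. It assumes the gate keeps the k largest logits and applies a softmax over just those, so every other expert gets a gate of exactly zero and need not be evaluated. The names `top_k_gates` and `moe_output` and the scalar toy experts are hypothetical, and any noise added to the logits during training is omitted.

```python
import math


def top_k_gates(logits, k=4):
    """Softmax over the k largest logits; all other gates are exactly 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    gates = [0.0] * len(logits)
    for i, value in exps.items():
        gates[i] = value / total
    return gates


def moe_output(x, experts, logits, k=4):
    """Gate-weighted sum of expert outputs, skipping experts with a zero gate."""
    gates = top_k_gates(logits, k)
    return sum(g * expert(x) for g, expert in zip(gates, experts) if g > 0.0)


# Eight toy experts that each scale their input by a different factor.
experts = [lambda x, s=s: s * x for s in range(8)]
logits = [0.1 * i for i in range(8)]

gates = top_k_gates(logits, k=4)
print(gates)
print(moe_output(2.0, experts, logits, k=4))
```

Because only k experts run per input, the compute per example stays roughly constant as the total number of experts grows, which is what allows comparing models at equal computational cost.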
They also evaluated on a “100 billion word” news corpus, bilingual machine translation, multilingual machine translation,