A machine learning technique in which a larger model is composed of multiple sub-models (the "experts") plus a gating network (e.g., a neural network) that computes a relative weight for the output of each sub-model. The model's output is the weighted combination of the sub-model outputs.
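
The weighted combination described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the toy `experts`, the hand-written `gate` function, and the use of a softmax to turn gate scores into relative weights are all assumptions made for the example. In practice both the experts and the gate would be trained models.

```python
import math

def softmax(scores):
    """Turn raw gate scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_experts(x, experts, gate):
    """Evaluate every expert on x and combine outputs using the gate's weights."""
    weights = softmax(gate(x))
    outputs = [expert(x) for expert in experts]
    combined = sum(w * y for w, y in zip(weights, outputs))
    return combined, weights

# Hypothetical toy experts and gate (stand-ins for trained sub-models).
experts = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: -x]
gate = lambda x: [x, 0.0, -x]  # one raw score per expert

y, weights = mixture_of_experts(1.5, experts, gate)
```

Because the weights are non-negative and sum to 1, `y` is a convex combination of the individual expert outputs. Note that this dense form evaluates every expert for every input.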

When most of the weights are zero, only the sub-models with nonzero weight need to be evaluated, so this sparsity can be exploited to reduce computation, as in [shazeer:arxiv:2017].
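
One way to obtain mostly-zero weights is to keep only the `k` largest gate scores and normalize over those. The sketch below is a simplified illustration of that idea and omits details such as the noise and load-balancing terms used in [shazeer:arxiv:2017]; the helper names (`top_k_gate`, `sparse_mixture`, `make_expert`) and the toy experts are hypothetical. The `calls` list is only there to show which experts actually ran.

```python
import math

def top_k_gate(scores, k):
    """Return {expert_index: weight} for the k largest scores.

    Experts not in the returned dict have an implicit weight of zero.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def sparse_mixture(x, experts, gate, k):
    """Evaluate only the experts selected by the top-k gate."""
    weights = top_k_gate(gate(x), k)
    combined = sum(w * experts[i](x) for i, w in weights.items())
    return combined, weights

calls = []  # records which experts were evaluated

def make_expert(index, fn):
    """Wrap a function so that each evaluation is logged in `calls`."""
    def expert(x):
        calls.append(index)
        return fn(x)
    return expert

# Hypothetical toy experts and gate (stand-ins for trained sub-models).
experts = [
    make_expert(0, lambda x: 2.0 * x),
    make_expert(1, lambda x: x + 1.0),
    make_expert(2, lambda x: -x),
    make_expert(3, lambda x: x * x),
]
gate = lambda x: [x, 0.0, -x, 0.5 * x]  # one raw score per expert

y, weights = sparse_mixture(1.5, experts, gate, k=2)
```

With `k=2`, only two of the four experts are called for a given input, so the cost per input depends on `k` rather than on the total number of experts.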

## Notes related to Mixture of experts

## Papers related to Mixture of experts

- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity [fedus:arxiv:2021]
- GShard: Scaling giant models with conditional computation and automatic sharding [lepikhin:arxiv:2020]
- Outrageously large neural networks: The sparsely-gated mixture-of-experts layer [shazeer:arxiv:2017]