Describes the motivation and implementation of sparse convolution operations in XNNPACK (implemented with Arm NEON). These can be used with the TensorFlow Lite model pruning library, which learns sparse representations.
At sparsity levels of 85-90%, inference speed of models improves by about 30-50% (for MobileNet and EfficientNet).
Although the weights are sparse, the activations are dense, so the kernels can perform vector loads from the activation matrix.
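A minimal sketch of why dense activations allow vector loads, assuming a 1x1 convolution so that the operation reduces to a sparse-weights times dense-activations matrix product. The function name `sparse_conv1x1`, the CSR weight layout, and the `[in_channels][pixels]` activation layout are illustrative assumptions, not XNNPACK's actual API or storage format. This is plain scalar C rather than NEON intrinsics; the point is only that the inner loop walks a contiguous activation row with unit stride, which is the access pattern a vector load needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of XNNPACK).
 * Computes out[oc][p] = sum over nonzero weights (oc, ic) of w * act[ic][p].
 * Weights are stored in CSR form, one row per output channel:
 *   row_ptr[oc] .. row_ptr[oc + 1] indexes into col_idx / values,
 *   col_idx[k] is the input channel of the k-th nonzero weight.
 * Activations and outputs are dense and row-major, [channels][pixels],
 * so the pixels of one channel are contiguous in memory. */
void sparse_conv1x1(size_t out_channels, size_t pixels,
                    const size_t *row_ptr, const size_t *col_idx,
                    const float *values, const float *act, float *out)
{
    assert(row_ptr != NULL && act != NULL && out != NULL);

    for (size_t oc = 0; oc < out_channels; ++oc) {
        float *out_row = out + oc * pixels;

        /* Start each output channel from zero. */
        for (size_t p = 0; p < pixels; ++p) {
            out_row[p] = 0.0f;
        }

        /* Visit only the nonzero weights of this output channel. */
        for (size_t k = row_ptr[oc]; k < row_ptr[oc + 1]; ++k) {
            const float w = values[k];
            const float *act_row = act + col_idx[k] * pixels;

            /* Unit-stride walk over a dense activation row: the sparsity
             * only decides which rows are read, not how each row is read. */
            for (size_t p = 0; p < pixels; ++p) {
                out_row[p] += w * act_row[p];
            }
        }
    }
}
```

In this sketch the irregular indexing is confined to choosing `act_row`; once a row is chosen, every access is sequential, so a compiler (or hand-written NEON code) can process several pixels per instruction.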
It is possible to keep the working set smaller than the L1 cache. (Is this just a corollary of the ability to use vector loads?)
If you don’t have too many channels, you can prefetch activations. (todo: I thought “channel” was just red-green-blue in images - now I think I have that wrong).