SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training

Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul, Tushar Krishna
Read: 28 September 2021

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)
Pages 58-70
Feb 2020
Note(s): neural network, sparse model, hardware, ReLU, Benes network
Papers: hedge:micro:2019

An accelerator for sparse matrix-matrix multiplies that achieves 10.8 TFLOPS (at multiple levels of sparsity) at 22.33W and 65.10mm^2 on a 28nm process.

Sources of sparsity include ReLU, pooling and dropout. Sparsity varies over the course of a training run, and both weights and activations can be sparse, with sparsity levels of 70-90%; sparsity also tends to increase with model size. Matrices need not be square, and their dimensions need not be a multiple of the SIMD size of the hardware.

Flexible dot-product engines (Flex-DPEs) use a rich interconnect fabric to give fast data loading/collection times. Flexible dot-product units (Flex-DPUs) combine multiple Flex-DPEs.

The main source of their flexibility is a Benes network for distributing values to the MACs, plus a “Forwarding Adder Network” (FAN) that adds extra connections (“forwarding links”) (and FP adders?) to a conventional reduction tree so that it can perform multiple irregular reductions at once. [Or maybe it gives one result per cycle? Not sure.]
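To make "multiple irregular reductions at once" concrete, here is a minimal functional sketch in Python. It assumes each multiplier output carries a label saying which dot product it belongs to, and that outputs of the same dot product sit next to each other. The names `segmented_reduce` and `group_ids` are hypothetical, and the sketch does not model the forwarding links, the adder tree or timing; it only shows the result such a reduction would have to produce.

```python
def segmented_reduce(products, group_ids):
    """Sum each contiguous run of products that shares a group id.

    products  : multiplier outputs, in physical order
    group_ids : hypothetical per-multiplier label naming the dot product
                that output belongs to (runs may have different lengths)
    """
    if len(products) != len(group_ids):
        raise ValueError("products and group_ids must have the same length")

    sums = []
    previous = None
    for index, (value, group) in enumerate(zip(products, group_ids)):
        if index > 0 and group == previous:
            sums[-1] += value      # same dot product: keep accumulating
        else:
            sums.append(value)     # a new dot product starts here
        previous = group
    return sums


# Eight multipliers shared by three dot products of lengths 3, 2 and 3.
print(segmented_reduce([1, 2, 3, 4, 5, 6, 7, 8],
                       [0, 0, 0, 1, 1, 2, 2, 2]))  # [6, 9, 21]
```

A plain reduction tree would return a single sum for all eight inputs; the point of the irregular version is that group boundaries need not line up with power-of-two subtrees.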

They use a bitmap for their sparse encoding. A comparison with other representations (for FP32 data) shows that the bitmap is never terrible and is sometimes the best representation. (They also make the point that this design choice is largely orthogonal to the other choices.)
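A minimal sketch of a bitmap encoding, assuming NumPy and a row-major layout: one bit per element marks whether it is nonzero, and the nonzero values are stored packed in the same order. The function names, the COO comparison and its 16-bit index width are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def bitmap_encode(dense):
    """Return (bitmap, values) for a dense array, in row-major order."""
    dense = np.asarray(dense, dtype=np.float32)
    mask = dense != 0
    bitmap = np.packbits(mask.ravel())   # 1 bit per element, padded to a byte
    values = dense[mask]                 # packed nonzeros
    return bitmap, values


def bitmap_decode(bitmap, values, shape):
    """Rebuild the dense array from the bitmap and the packed nonzeros."""
    count = int(np.prod(shape))
    mask = np.unpackbits(bitmap, count=count).astype(bool).reshape(shape)
    dense = np.zeros(shape, dtype=np.float32)
    dense[mask] = values
    return dense


def storage_bits(shape, nnz, value_bits=32, index_bits=16):
    """Rough storage cost in bits; index_bits for COO is an assumption."""
    total = int(np.prod(shape))
    return {
        "dense": total * value_bits,
        "bitmap": total + nnz * value_bits,
        "coo": nnz * (value_bits + 2 * index_bits),
    }


a = np.array([[0, 1.5, 0, 0], [2, 0, 0, -3]], dtype=np.float32)
bitmap, values = bitmap_encode(a)
assert np.array_equal(bitmap_decode(bitmap, values, a.shape), a)
print(storage_bits(a.shape, nnz=len(values)))
```

The bitmap costs a fixed one bit per element regardless of sparsity, which is one way to see why it is rarely the worst choice: its overhead is small next to 32-bit values, whereas index-based formats pay per nonzero.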

Comparing PPA with a TPU-like systolic engine at the same frequency and number of multipliers (and therefore the same peak TFLOPS), their design uses almost twice the power and 38% more area but (in their evaluation) achieves 6 times higher effective TFLOPS and 3 times higher effective TFLOPS/W.
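As a quick consistency check on these ratios (the power ratio is only stated as "almost twice", so 2 is an approximation on my part):

```latex
\frac{\text{effective TFLOPS/W (SIGMA)}}{\text{effective TFLOPS/W (TPU-like)}}
  = \frac{\text{effective TFLOPS ratio}}{\text{power ratio}}
  \approx \frac{6}{2} = 3
```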

The evaluation compares against 7 other accelerators over 10 KxMxN matrix multiplication sizes, with the K, M and N dimensions ranging from 1 to 500,000 and common values in the range 128-2048. Compared with the TPU (with its 128x128 grid), they win on matrices where one of the dimensions is less than 128. On average, SIGMA is 2x faster than TPUs on dense GEMMs, because of higher utilization and faster reduction, and 6x faster on sparse GEMMs.
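A rough way to see why a fixed 128x128 grid loses when a dimension is below 128 (or is not a multiple of 128) is to estimate how much of the grid holds useful data. The sketch below is a deliberately simplified model, assuming an operand tile is padded up to whole 128x128 blocks; it is not the mapping used by the TPU or in the paper's evaluation, and `array_utilization` is a hypothetical helper.

```python
import math


def array_utilization(rows, cols, array_dim=128):
    """Fraction of PEs doing useful work when a rows x cols operand is
    padded up to whole array_dim x array_dim blocks (simplified model)."""
    if rows <= 0 or cols <= 0 or array_dim <= 0:
        raise ValueError("all dimensions must be positive")
    blocks = math.ceil(rows / array_dim) * math.ceil(cols / array_dim)
    return (rows * cols) / (blocks * array_dim * array_dim)


for shape in [(128, 128), (64, 128), (1, 128), (129, 128)]:
    print(shape, array_utilization(*shape))
```

Under this model a 64x128 operand uses half the grid and a 1x128 operand uses 1/128 of it, while 129x128 needs two blocks and so drops to about 50% utilization.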