ExTensor: An accelerator for sparse tensor algebra

Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, Christopher W. Fletcher
Read: 28 September 2021

Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
Columbus, OH, USA
Association for Computing Machinery
New York, NY, USA
Pages 319-333
Note(s): hardware, neural network, sparse model
Papers: qin:hpca:2020

The key idea in this paper is to exploit hierarchy, based on the observation that “0 * x == 0” holds not just for scalar zeros but for tensor zeros. They can therefore exploit “hyper-sparsity” (sparsity at the level of entire rows, columns, or tiles) to avoid fetching large amounts of data, and can potentially avoid computing data that would only have been multiplied by zero.
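A toy sketch of the idea (my own illustration, not the paper's microarchitecture): store a matrix as a dict of nonzero tiles, so an entirely-zero tile is simply absent and a tiled matmul never fetches or multiplies it.

```python
def dense_mm(a, b):
    """Naive dense matmul on nested lists (tile-sized blocks)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            for j in range(p):
                c[i][j] += a[i][k] * b[k][j]
    return c

def add_into(acc, blk):
    """Accumulate one tile into another, element-wise."""
    for r in range(len(blk)):
        for c in range(len(blk[0])):
            acc[r][c] += blk[r][c]

def tiled_spmm(a_tiles, b_tiles):
    """Multiply two tile-sparse matrices.

    a_tiles / b_tiles map (tile_row, tile_col) -> dense block.
    Any (i, k) x (k, j) pair where either tile is missing is skipped:
    "0 * x == 0" applied at tile granularity.
    """
    c_tiles = {}
    for (i, k), a_blk in a_tiles.items():
        for (k2, j), b_blk in b_tiles.items():
            if k2 != k:
                continue  # inner tile coordinates don't match
            partial = dense_mm(a_blk, b_blk)
            if (i, j) not in c_tiles:
                c_tiles[(i, j)] = partial
            else:
                add_into(c_tiles[(i, j)], partial)
    return c_tiles
```

With 1x1 tiles for brevity, `tiled_spmm({(0, 0): [[2.0]]}, {(0, 0): [[3.0]], (1, 1): [[5.0]]})` touches only the one tile pair whose coordinates intersect.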

To support this, they decouple metadata processing from data processing, allowing metadata processing to run ahead.

They view an N-dimensional tensor as an N-level tree where each level represents one of the dimensions. (Tiling is treated as adding additional dimensions/levels.) Various operations (e.g., dot product) correspond to different ways of intersecting the nodes in the two input trees.
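A minimal sketch of this tree (fibertree-style) view, assuming a 2-D tensor: each level holds sorted (coordinate, payload) pairs, where a payload is either a child level or, at the leaves, a value; tiling would just add more levels.

```python
def to_fibertree(matrix):
    """Convert a dense 2-D list into a two-level (row -> col -> value) tree,
    keeping only nonzero entries at each level."""
    tree = []
    for i, row in enumerate(matrix):
        # Leaf fiber: sorted (column coordinate, value) pairs.
        fiber = [(j, v) for j, v in enumerate(row) if v != 0]
        if fiber:  # an all-zero row produces no node at all
            tree.append((i, fiber))
    return tree
```

For example, `to_fibertree([[0, 2], [0, 0], [3, 0]])` yields `[(0, [(1, 2)]), (2, [(0, 3)])]`: the all-zero middle row vanishes from the top level entirely.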

A lot of the paper is dedicated to efficiently intersecting coordinate streams from the trees of the input operands. The sequencers and scanners required to do this are placed right next to DRAM to maximize lookahead. Content-addressable memories seem to be an important part of some optimizations.
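In software terms, the basic operation resembles a two-pointer merge over sorted coordinate streams; a rough sketch (my own, not the hardware algorithm) for a sparse-sparse dot product:

```python
def intersect(fa, fb):
    """Two-pointer intersection of two sorted (coord, value) streams.
    Coordinates present in only one stream contribute 0 and are skipped
    without touching the other operand's data."""
    out = []
    i = j = 0
    while i < len(fa) and j < len(fb):
        ca, va = fa[i]
        cb, vb = fb[j]
        if ca == cb:
            out.append((ca, va, vb))  # both nonzero: a real multiply
            i += 1
            j += 1
        elif ca < cb:
            i += 1  # coordinate only in A: product is zero, skip
        else:
            j += 1  # coordinate only in B: product is zero, skip
    return out

def sparse_dot(fa, fb):
    """Dot product of two sparse vectors given as coordinate streams."""
    return sum(va * vb for _, va, vb in intersect(fa, fb))
```

E.g. `sparse_dot([(0, 1.0), (3, 2.0), (7, 4.0)], [(3, 5.0), (8, 6.0)])` performs a single multiply, at the one shared coordinate.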

Evaluation compares against a server-class CPU (not a TPU) for different matrix/vector/sparse/dense combinations (GEMM, TTV, TTM, SDDMM). Evaluation uses a partial simulator written in Python that models DRAM burst/rows, NoC, multicast/unicast, …

They get speedups (against a CPU) in the range 1.3x-25x across the different benchmarks and approach the CPU and ExTensor rooflines on many of them. The improvement comes from the higher DRAM utilization achieved.

Further analysis looks at exactly where the benefit comes from by adding/removing parts of the design and sweeping over the design space.

A partial hardware implementation of the novel parts yielded PPA estimates and showed that the novel hardware (the intersection and coordination units) is a small fraction (2.5%) of the area.

  • SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training [qin:hpca:2020]