The key idea in this paper is to exploit hierarchy, based on the observation that “0 * x == 0” holds not just for scalar zeros but for tensor zeros. This lets them exploit “hyper-sparsity” (sparsity at the level of entire rows, columns, or tiles): they avoid fetching large amounts of data, and can potentially avoid computing values that would only have been multiplied by zero.
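A minimal sketch of the idea (my own illustration, not the paper's code): in a tiled matrix multiply, a tile-level zero check lets us skip the fetch and the multiply for any tile pair where either operand tile is all zero. Function and parameter names are made up for illustration.

```python
import numpy as np

def tiled_matmul_skip_zeros(A, B, tile=2):
    """Multiply A @ B, skipping tile pairs where either tile is all zero."""
    n = A.shape[0]
    C = np.zeros((n, n))
    skipped = 0
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                At = A[i:i+tile, k:k+tile]
                Bt = B[k:k+tile, j:j+tile]
                # "0 * x == 0" at tile granularity: no fetch or compute needed
                if not At.any() or not Bt.any():
                    skipped += 1
                    continue
                C[i:i+tile, j:j+tile] += At @ Bt
    return C, skipped
```

In a software model the check itself still touches the tile; the point of the hardware is that tile-level metadata makes the skip decision without fetching the data at all.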
To support this, they decouple metadata processing from data processing, allowing metadata processing to run ahead.
They view an N-dimensional tensor as an N-level tree where each level represents one of the dimensions. (Tiling is treated as adding additional dimensions/levels.) Various operations (e.g., dot product) correspond to different ways of intersecting the nodes in the two input trees.
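A small sketch of this view (my own illustration, not the paper's data structure): a sparse 2-D tensor becomes a 2-level tree of nested maps, one level per dimension, and a sparse dot product is an intersection of the coordinate sets at one level. The helper names are made up.

```python
def to_tree(coords_vals):
    """Build a 2-level tree {row: {col: value}} from ((row, col), value) pairs."""
    tree = {}
    for (r, c), v in coords_vals:
        tree.setdefault(r, {})[c] = v
    return tree

def sparse_dot(x, y):
    """x, y: {coord: value} maps (1-level trees).

    Intersect their coordinate sets and multiply-accumulate the matches."""
    return sum(v * y[c] for c, v in x.items() if c in y)
```

Higher-level operations compose such intersections level by level down the two trees.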
A lot of the paper is dedicated to efficiently intersecting coordinate streams from the trees of the input operands. The sequencers and scanners required to do this are placed right next to DRAM to maximize lookahead. Content-addressable memories seem to be an important part of some optimizations.
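The core operation can be sketched in software as a two-pointer walk over two sorted coordinate streams (an illustrative analogue of a hardware intersection unit, not the paper's design):

```python
def intersect(a, b):
    """a, b: sorted coordinate lists; return the coordinates present in both."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # hardware can skip ahead many coordinates per cycle
        else:
            j += 1
    return out
```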
Evaluation compares against a server-class CPU (not a TPU) for different matrix/vector/sparse/dense combinations (GEMM, TTV, TTM, SDDMM). Evaluation uses a partial simulator written in Python that models DRAM bursts/rows, the NoC, multicast/unicast, …
They get speedups over the CPU in the range 1.3x-25x across the different benchmarks and approach both the CPU and ExTensor rooflines on many benchmarks. The improvement comes from the higher DRAM utilization achieved.
Further analysis looks at exactly where the benefit comes from, by adding/removing parts of the design and sweeping over the design space.
A partial hardware implementation of the novel parts yielded PPA estimates and showed that the novel hardware (the intersection and coordination units) is a small fraction (2.5%) of the area.
Papers related to ExTensor: An accelerator for sparse tensor algebra
- SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training [qin:hpca:2020]