Describes the design and development of Google's first Tensor Processing Unit (TPU), which was designed, built, and deployed in datacenters within 15 months.
The TPU is built around a massive (256 x 256) systolic array of 8-bit multiply-accumulate units. (It can also perform 8x16 and 16x16 multiplications, at reduced throughput.) Most data buses are 256 bytes wide; the accumulators are 1024 bytes wide. On-chip memory is explicitly managed by software and double buffered so that data movement can overlap with computation.
Since the systolic array is weight-stationary, each multiplier holds the weight it multiplies by, and the weights are double buffered (one in use, one being loaded for the next tile). The array can perform either a matrix multiply or a convolution (for CNNs); a minimal sketch of this organization follows.
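To make the weight-stationary, double-buffered idea concrete, here is a small functional sketch in Python. It is not cycle-accurate, and the class and method names are illustrative assumptions, not taken from the paper; it just shows stationary int8 weights, a shadow buffer for the next tile, and 32-bit accumulation as activations stream through.

```python
import numpy as np

class SystolicArray:
    """Functional sketch of a weight-stationary MAC array (not cycle-accurate)."""

    def __init__(self, rows=256, cols=256):
        self.rows, self.cols = rows, cols
        self.active = np.zeros((rows, cols), dtype=np.int8)  # weights in use now
        self.shadow = np.zeros((rows, cols), dtype=np.int8)  # double-buffered next tile

    def load_weights(self, w):
        """Stage the next weight tile into the shadow buffer."""
        assert w.shape == (self.rows, self.cols)
        self.shadow = w.astype(np.int8)

    def swap(self):
        """Switch to the staged weights ('one to use now, one double buffer')."""
        self.active, self.shadow = self.shadow, self.active

    def matmul(self, x):
        """Stream int8 activation rows through the array; each cell multiplies
        its stationary weight by the passing activation and partial sums
        accumulate down the columns into 32-bit accumulators."""
        acc = np.zeros((x.shape[0], self.cols), dtype=np.int32)
        for i, row in enumerate(x.astype(np.int32)):
            acc[i] = row @ self.active.astype(np.int32)  # one activation row per wave
        return acc

# Usage: stage a weight tile, swap it in, push a small batch of activations through.
arr = SystolicArray()
arr.load_weights(np.random.randint(-128, 128, (256, 256)))
arr.swap()
y = arr.matmul(np.random.randint(-128, 128, (8, 256)))
```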
The control logic borrows ideas from the decoupled access/execute philosophy: the matrix unit stalls if its operands are not yet ready.
The chip is half the size of a Haswell die, runs at about 30% of its clock rate and 30% of its power, delivers roughly 35x the performance on 8-bit multiplies, and has about half the memory bandwidth and half the on-chip memory.
The comparison with a Haswell x86 server chip and an Nvidia K80 GPU uses the roofline performance model, which shows that on the TPU all but one of the benchmarks hit the roofline (most are memory-bandwidth limited). The ridge points are at 1350 ops/byte (TPU), 13 ops/byte (Haswell) and 9 ops/byte (K80); a lower ridge point means less operational intensity is needed to reach peak performance. An analysis of MAC utilization shows the array often running at only 10-25%, in part because tall-thin or short-wide tensors map poorly onto the big square multiplier array. For Haswell and the K80, most benchmarks sit far below the roofline. (Note, though, that Haswell and the K80 use 32-bit floating point whereas the TPU uses 8- or 16-bit integer data.) A sketch of the roofline model follows.
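A minimal sketch of the roofline model behind this comparison: attainable throughput is the minimum of the compute roof and the bandwidth-limited slope, and the ridge point is the operational intensity where the two meet. The absolute numbers depend on exactly what is counted as an "op" and a "byte", so the values below are illustrative only.

```python
def attainable(intensity_ops_per_byte, peak_ops_per_s, bytes_per_s):
    """Roofline: min of the compute roof and the memory-bandwidth slope."""
    return min(peak_ops_per_s, intensity_ops_per_byte * bytes_per_s)

def ridge_point(peak_ops_per_s, bytes_per_s):
    """Operational intensity (ops/byte) needed to reach the compute roof."""
    return peak_ops_per_s / bytes_per_s

# Example: a kernel doing 100 ops per byte of memory traffic on a machine
# with a 92 Tera-op/s roof and 34 GB/s of bandwidth is memory bound.
print(attainable(100, 92e12, 34e9))   # 3.4e12 ops/s, far below the 92e12 roof
print(ridge_point(92e12, 34e9))       # ~2700 ops/byte to reach the roof
```

Note that 92e12 ops/s over 34 GB/s gives a ridge point of roughly 2700 ops/byte, not 1350; the paper's 1350 figure appears to count multiply-accumulates (half the op count) per byte of weight traffic, so the exact accounting of ops and bytes matters when comparing these numbers.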
They emphasize that the TPU has few of the features general-purpose chips use to improve the average case because the inference workloads are governed by tail-latency (99th-percentile) response-time limits, which favor deterministic execution time.
The TPU has poor energy proportionality (it uses similar power whether idle or busy) because the short design cycle did not give them time to include energy-saving features.
MLPs and [LSTM]s are memory-bandwidth limited while CNNs are compute bound. The large systolic array seems to be a mixed blessing: making it larger gives more compute, but it also takes longer to load the weights and feed the data through it (see the back-of-envelope sketch below).
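A back-of-envelope sketch of that scaling tension: doubling the array dimension N quadruples the MACs (and the peak compute) but also quadruples the weight bytes per tile and lengthens the fill/drain pipeline, so workloads with little weight reuse (small effective batches, as in the MLP/LSTM cases) see diminishing returns. The model and numbers below are my own rough assumptions for illustration, not measurements from the paper.

```python
def utilization(n, batch, bytes_per_s=34e9, clock_hz=700e6):
    """Rough fraction of peak achieved on one n x n int8 weight tile when only
    `batch` activation rows reuse it before the next tile must be loaded."""
    compute_cycles = batch + 2 * n          # stream the batch plus fill/drain skew
    weight_load_s = (n * n) / bytes_per_s   # bytes of weights fetched per tile
    load_cycles = weight_load_s * clock_hz  # load overlaps compute via double buffering
    busy = batch                            # cycles the whole array does useful work
    return busy / max(compute_cycles, load_cycles)

for n in (128, 256, 512):
    print(n, round(utilization(n, batch=64), 3))  # utilization drops as n grows
```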
Notes related to In-datacenter performance analysis of a tensor processing unit
Papers related to In-datacenter performance analysis of a tensor processing unit
- Motivation for and evaluation of the first tensor processing unit [jouppi:micro:2018]