Today we got our first detailed look at the internals of Google’s Tensor Processing Unit, the hardware inference accelerator announced last year whose details have been kept under wraps until now.
The design team behind the chip, led by Norman Jouppi, has released a paper that will be presented at the International Symposium on Computer Architecture this coming June in Toronto. The reveal isn’t complete (at one point the paper refers the reader to a thicket of patent filings for further details; how’s that for a sense of humor?), but there is enough to get a good sense of the chip.
The basic design philosophy is simple and direct: neural nets don’t need fancy scheduling and inference doesn’t need floating point, so junk all of that and jam enough multipliers and memory onto a chip to handle the biggest inference model you can think of without going back to the CPU. This diagram (shamelessly lifted from the paper) shows what they did to achieve that end.
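To make the “inference doesn’t need floating point” point concrete, here is a minimal sketch of 8-bit quantized inference in NumPy. It is not anything Google has published; the scale factors, layer sizes, and helper names are invented for illustration, but the pattern is the standard one: quantize activations and weights to signed 8-bit integers, multiply-accumulate into 32 bits, then rescale.

```python
import numpy as np

# Hypothetical per-tensor scale factors; in practice these come from
# calibrating the trained model, not from anything in the TPU paper.
x_scale, w_scale = 0.05, 0.02

def quantize(t, scale):
    # Map float values onto signed 8-bit integers.
    return np.clip(np.round(t / scale), -128, 127).astype(np.int8)

def int8_matmul(x_q, w_q):
    # Multiply 8-bit operands, accumulating in 32 bits so that the sums
    # of 256-element dot products cannot overflow.
    return x_q.astype(np.int32) @ w_q.astype(np.int32)

x = np.random.randn(4, 256).astype(np.float32)    # a small batch of activations
w = np.random.randn(256, 256).astype(np.float32)  # one layer's weights

acc = int8_matmul(quantize(x, x_scale), quantize(w, w_scale))
y = acc.astype(np.float32) * (x_scale * w_scale)  # rescale back to real values
```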
The heart of the die is a systolic array of 64K (2¹⁶) 8-bit multiplier-accumulators connected to 4,096 256-element, 32-bit accumulators and a 24 Mbyte memory array. The MAC array is set up to produce 256 results per cycle, with a pipeline delay that depends on the instruction (matmul or conv) being executed. A matrix operation takes a variable-size B×256 input and multiplies it by a 256×256 constant weight input, taking B pipelined cycles to complete. 16×8 (or 8×16) multiplies run at half speed; 16×16 drops to one quarter. Interconnect widths between blocks are 256 bytes in order to keep the array busy and the results flowing back to memory.
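As a rough mental model of what the array is doing (not a cycle-accurate description of Google’s datapath), think of a weight-stationary multiply: the 256×256 weight matrix sits in the array, and each “cycle” one 256-element input row streams in while one 256-element row of 32-bit results streams out, so a B×256 input takes on the order of B cycles plus the pipeline fill. A toy version of that timing model:

```python
import numpy as np

def weight_stationary_matmul(x_q, w_q):
    """Toy model of a weight-stationary array: the 256x256 int8 weights
    stay put while input rows stream through, one row per 'cycle'.
    This captures the B-cycle throughput, not the real hardware."""
    B = x_q.shape[0]
    out = np.zeros((B, 256), dtype=np.int32)
    for cycle in range(B):                       # one input row per cycle
        row = x_q[cycle].astype(np.int32)        # 256 int8 activations
        out[cycle] = row @ w_q.astype(np.int32)  # 256 32-bit results emerge
    return out

x_q = np.random.randint(-128, 128, size=(16, 256), dtype=np.int8)
w_q = np.random.randint(-128, 128, size=(256, 256), dtype=np.int8)
result = weight_stationary_matmul(x_q, w_q)      # 16 cycles' worth of output
```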
The chip performs two basic operations: matrix multiply/convolution and the various flavors of activate; the rest is moving data in and out of memory and keeping the multipliers busy. It acts as a coprocessor, fed work by the CPU (rather than seeking its own work like a GPU), and is programmed using TensorFlow.
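The programming model is therefore closer to an instruction stream the host queues up for the accelerator than to a GPU kernel launch. The sketch below is purely illustrative of that style of flow; the object and method names are invented and are not the TPU’s actual interface or instruction set.

```python
# Illustrative host-side flow for a coprocessor-style accelerator.
# The `tpu` object and its methods are hypothetical stand-ins, not a real API.

def run_layer(tpu, host_activations, host_weights):
    tpu.read_host_memory(host_activations)  # DMA activations into on-chip memory
    tpu.read_weights(host_weights)          # stream weights toward the matrix unit
    tpu.matrix_multiply()                   # the B x 256 by 256 x 256 multiply
    tpu.activate("relu")                    # apply the nonlinearity on-chip
    return tpu.write_host_memory()          # DMA results back to the host
```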
Google doesn’t give a die size, although it does reveal that the chip uses 28 nm technology and is less than half the size of a Haswell E5-2699 v3 die; there is no saying how much less. The claimed performance for the TPU running at 700 MHz is 92 TOPS on 8-bit operands while dissipating 75 W.
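Those numbers are consistent with the MAC count: counting a multiply-accumulate as two operations, the quoted peak falls straight out of the array size and the clock.

```python
macs = 256 * 256        # 65,536 multiply-accumulate units in the array
ops_per_mac = 2         # count a MAC as one multiply plus one add
clock_hz = 700e6
peak_ops = macs * ops_per_mac * clock_hz
print(peak_ops / 1e12)  # ~91.8 TOPS, which rounds to the quoted 92 TOPS
```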
Naturally, the benchmarks quoted show the TPU blowing away both the CPU and GPU for both outright performance and performance per watt.
The paper finishes up with a discussion of where TPU design might go in the future, concluding that higher performance is all about memory bandwidth. Interesting.
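A back-of-the-envelope roofline shows why: dividing the 8-bit peak by the roughly 34 GB/s of DDR3 weight bandwidth the paper reports for the original design (a figure from the paper, not from the numbers above) gives the reuse a layer needs per byte fetched before the multipliers, rather than the memory, become the limit.

```python
peak_ops = 92e12   # quoted 8-bit peak, operations per second
ddr3_bw = 34e9     # approximate DDR3 weight bandwidth, bytes per second
ridge = peak_ops / ddr3_bw
print(ridge)       # ~2,700 ops (about 1,350 MACs) needed per byte fetched
# A layer whose weights are reused fewer times than this per byte read from
# DRAM leaves the multipliers idle, so faster memory, not more MACs, is what
# raises delivered performance.
```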