Hans Graf, Srihari Cadambi, Venkata Jakkula, Murugan Sankaradass, Eric Cosatto, Srimat Chakradhar, Igor Dourdanovic
We present a new, massively parallel architecture for accelerating machine learning algorithms, based on arrays of variable-resolution arithmetic vector processing elements (VPE). Groups of VPEs operate in SIMD (single instruction multiple data) mode, and each group is connected to an independent memory bank. In this way memory bandwidth scales with the number of VPE, and the main data flows are local, keeping power dissipation low. With 256 VPEs, implemented on two FPGA (field programmable gate array) chips, we obtain a sustained speed of 19 GMACS (billion multiply-accumulate per sec.) for SVM training, and 86 GMACS for SVM classification. This performance is more than an order of magnitude higher than that of any FPGA implementation reported so far. The speed on one FPGA is similar to the fastest speeds published on a Graphics Processor for the MNIST problem, despite a clock rate of the FPGA that is six times lower. High performance at low clock rates makes this massively parallel architecture particularly attractive for embedded applications, where low power dissipation is critical. Tests with Convolutional Neural Networks and other learning algorithms are under way now.