
A deep look at accelerating Convolutional Neural Network performance, from Auviz Systems

Xilinx Employee

Convolutional Neural Networks (CNNs) and deep learning are revolutionizing all sorts of recognition applications, from image and speech recognition to big data mining. Baidu’s Dr. Ren Wu, a GPU application pioneer, gave a keynote at last week’s Embedded Vision Summit 2015 announcing worldwide accuracy leadership in analyzing the ImageNet Large Scale Visual Recognition Challenge data set using Baidu’s GPU-based deep-learning CNN. (See “Baidu Leads in Artificial Intelligence Benchmark” and Baidu’s paper.) GPUs are currently the implementation technology of choice for CNN researchers because of their familiar programming model, but their power consumption is prohibitive. Also at the Embedded Vision Summit, Auviz Systems founder and CEO Nagesh Gupta presented results of related work on image-processing CNNs. Auviz Systems has been developing FPGA-based middleware IP for data centers that cuts application power consumption.


One of the previous holders of the world title for ImageNet processing accuracy, before Baidu took the crown last week, was AlexNet from the University of Toronto. AlexNet consists of five convolution layers followed by three dense layers (that’s CNN-speak). Each convolution layer convolves the set of input feature maps with a set of weight filters, producing a set of output feature maps. A convolution layer in AlexNet performs the following steps:


  1. 3D Convolutions
  2. Activation function using ReLU (Rectified Linear Units)
  3. Sub-sampling
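
The three steps above can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not Auviz code: the function names are made up for this example, the convolution uses stride 1 with no padding, and the sub-sampling step is assumed to be 2x2 max pooling, a detail this post does not specify.

```python
def conv3d_valid(feature_maps, filt):
    """Slide one filter (C slices of K x K) over C input feature maps.

    Products are summed over every channel, which is what makes the
    convolution "3D". Stride 1 and no padding are assumptions of this
    sketch. As is usual in CNNs, the filter is not flipped.
    """
    channels = len(feature_maps)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    k = len(filt[0])
    out = []
    for r in range(height - k + 1):
        row = []
        for c in range(width - k + 1):
            acc = 0.0
            for ch in range(channels):
                for i in range(k):
                    for j in range(k):
                        acc += feature_maps[ch][r + i][c + j] * filt[ch][i][j]
            row.append(acc)
        out.append(row)
    return out


def relu(fmap):
    """Rectified Linear Unit: clamp negative values to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Sub-sample by keeping the maximum of each size x size tile (assumed)."""
    rows, cols = len(fmap) // size, len(fmap[0]) // size
    return [
        [
            max(fmap[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def conv_layer(feature_maps, filter_bank):
    """One output feature map per filter: convolve, activate, sub-sample."""
    return [max_pool(relu(conv3d_valid(feature_maps, f))) for f in filter_bank]


# Tiny example: two 3x3 input maps and one filter made of two 2x2 slices.
inputs = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[0, 0, 0], [0, 2, 4], [0, 6, 10]],
]
filter_bank = [[[[1, 0], [0, -1]], [[0, 0], [0, 1]]]]
outputs = conv_layer(inputs, filter_bank)
```

A real layer works on much larger maps and many filters, but the order of operations is the same.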


There’s a lot more math in this complex algorithm than I can begin to handle in a blog, but let’s look at the effort needed to compute just one 3D convolution. An 11x11 weight matrix convolved with an 11x11 input feature map generates one output value, as shown below:



[Figure: 11x11 3D convolution]



This computation involves 121 MAC (multiply/accumulate) operations that can all run in parallel, so it takes quite a while on a CPU that executes instructions serially. Depending on its size, an FPGA can easily compute 512 or more such MAC results in parallel, in just one clock cycle.
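
To make the arithmetic concrete, here is a small Python sketch that counts the MACs for one 11x11 window and estimates clock cycles under a deliberately idealized model. The model assumes one MAC per cycle per MAC unit and ignores memory bandwidth, pipelining, and CPU vector instructions. The function names and the sample weights are invented for illustration.

```python
import math


def mac_dot(weights, window):
    """Return (result, mac_count) for one K x K window."""
    acc = 0
    macs = 0
    for w_row, x_row in zip(weights, window):
        for w, x in zip(w_row, x_row):
            acc += w * x  # one multiply/accumulate
            macs += 1
    return acc, macs


def cycles_needed(mac_count, parallel_macs):
    """Idealized cycle count: each MAC unit finishes one MAC per cycle."""
    return math.ceil(mac_count / parallel_macs)


K = 11
weights = [[1] * K for _ in range(K)]  # placeholder weights
window = [[2] * K for _ in range(K)]   # placeholder input values

result, macs = mac_dot(weights, window)
serial_cycles = cycles_needed(macs, 1)      # one MAC unit, strictly serial
parallel_cycles = cycles_needed(macs, 512)  # 512 MAC units, as cited above

print(macs, serial_cycles, parallel_cycles)
```

Under these assumptions, the 121 MACs of one window fit inside a single cycle of a 512-MAC array, with room left over for more windows.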


AuvizDNN, a library of functions from Auviz Systems, provides all the objects, classes, and functions needed to implement CNNs on FPGAs. Its configurable functions can be used to create any type and configuration of CNN. From a programmer’s perspective, a complete CNN implementation on an FPGA using AuvizDNN looks like a sequence of C/C++ function calls. The AlexNet used in this example is just an illustration; the AuvizDNN library can be used to implement all sorts of CNNs.


So how fast can you go? Here’s a graph taken from real systems implemented by Auviz Systems using Xilinx Kintex-7 and Kintex UltraScale FPGAs:




[Figure: Auviz AlexNet performance]



According to Auviz, FPGAs like the Xilinx Kintex UltraScale can provide better than 14 images/sec/Watt while a high-end GPU can process only 4 images/sec/Watt, based on data published in this recent Microsoft paper. These results strongly suggest that FPGAs make a great choice for implementing fast, power-efficient, data-center applications.
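
As a quick back-of-the-envelope check on those two figures, the Python snippet below computes the efficiency ratio and the power each device would need for a given throughput. The 1,000 images/sec target is a hypothetical workload chosen only for illustration, and the FPGA figure is treated as exactly 14 even though Auviz quotes "better than 14."

```python
FPGA_IMAGES_PER_SEC_PER_WATT = 14.0  # quoted as "better than 14"
GPU_IMAGES_PER_SEC_PER_WATT = 4.0


def efficiency_ratio(a, b):
    """How many times more images per joule device a delivers than b."""
    return a / b


def power_for_throughput(images_per_sec, images_per_sec_per_watt):
    """Watts needed to sustain a throughput at a given efficiency."""
    return images_per_sec / images_per_sec_per_watt


ratio = efficiency_ratio(FPGA_IMAGES_PER_SEC_PER_WATT,
                         GPU_IMAGES_PER_SEC_PER_WATT)

target = 1000.0  # hypothetical workload, images/sec
fpga_watts = power_for_throughput(target, FPGA_IMAGES_PER_SEC_PER_WATT)
gpu_watts = power_for_throughput(target, GPU_IMAGES_PER_SEC_PER_WATT)

print(f"ratio: {ratio:.1f}x, FPGA: {fpga_watts:.0f} W, GPU: {gpu_watts:.0f} W")
```

On these numbers the FPGA delivers at least 3.5 times the throughput per Watt.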


More information about CNN implementation using FPGAs is available from the Auviz Web site in a downloadable White Paper titled “Accelerating Machine Learning in the Cloud: Deep Neural Networks on FPGAs.” (Requires registration.)