Machine Learning in the Cloud: Deep Neural Networks on FPGAs



By Nagesh Gupta, Auviz Systems



Deep-learning techniques use a large amount of known data to find a set of weights and bias values to match the expected results. The process is called training, and it can result in large models. This fact has motivated engineers to move toward specialized hardware such as GPUs for training and classification purposes.


As the amount of data increases even further, machine learning will move to the cloud, where large machine-learning models would be implemented on CPUs. GPUs offer better performance for deep-learning algorithms, but their prohibitive power requirements have limited their use to high-performance computing clusters. There is therefore a pressing need for a processing platform that can accelerate these algorithms without a substantial increase in power consumption. In this context, FPGAs seem to be an ideal choice, because they can run a large number of concurrent processes at a low power profile.


Let’s take a closer look at how to implement a convolutional neural network (CNN) on a Xilinx FPGA. CNNs are a class of deep neural networks that have been very successful for large-scale image-recognition tasks and similar machine-learning problems.


WHAT IS A CONVOLUTIONAL NEURAL NETWORK? Convolutional neural networks are a form of deep neural network (DNN) that engineers have recently begun using for various recognition tasks. Image recognition, speech recognition and natural-language processing are a few popular applications of CNNs.


In 2012, Alex Krizhevsky and others from the University of Toronto proposed a deep architecture based on CNNs that won that year’s ImageNet Large Scale Visual Recognition Challenge. Their model achieved a substantial improvement in recognition accuracy over both its competitors and the models of previous years. Since then, AlexNet has become the benchmark for comparison in image-recognition tasks.


AlexNet consists of five convolution layers followed by three dense layers. In a convolution layer, each output value is the result of a convolution operation at pixel location (x,y) of input feature map n. The activation function used is a rectified linear unit, which performs the function Max(x,0). The activation function introduces nonlinearity in the transfer function of the network. Max pooling is the subsampling technique used in AlexNet. Using this technique, only the maximum value in the local neighborhood of a pixel is selected to propagate to the next layer.
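To make the two operations concrete, here is a minimal C++ sketch of the rectified linear unit and of max pooling over a single feature map. The function names, the row-major layout and the `window` and `stride` parameters are assumptions chosen for illustration; they are not taken from AlexNet's definition or from any library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Rectified linear unit: Max(x, 0).
inline float relu(float x) { return std::max(x, 0.0f); }

// Max pooling over one row-major (height x width) feature map.
// Only windows that fit entirely inside the map are evaluated.
std::vector<float> max_pool(const std::vector<float>& in,
                            std::size_t height, std::size_t width,
                            std::size_t window, std::size_t stride) {
    assert(in.size() == height * width);
    assert(window > 0 && stride > 0 && window <= height && window <= width);

    const std::size_t out_h = (height - window) / stride + 1;
    const std::size_t out_w = (width - window) / stride + 1;
    std::vector<float> out(out_h * out_w);

    for (std::size_t oy = 0; oy < out_h; ++oy) {
        for (std::size_t ox = 0; ox < out_w; ++ox) {
            // Start from the top-left element of the neighborhood.
            float best = in[(oy * stride) * width + ox * stride];
            for (std::size_t wy = 0; wy < window; ++wy) {
                for (std::size_t wx = 0; wx < window; ++wx) {
                    best = std::max(best, in[(oy * stride + wy) * width + (ox * stride + wx)]);
                }
            }
            out[oy * out_w + ox] = best;  // only the maximum propagates
        }
    }
    return out;
}
```

With a 4 x 4 map holding the values 1 through 16 in row order, a 2 x 2 window and a stride of 2, the four outputs are 6, 8, 14 and 16.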


IMPLEMENTING CNN ON AN FPGA. With the advent of newer advanced design environments, it has become easier for software developers to port their designs to Xilinx FPGAs. The software developer can exploit the inherent architectural advantages of an FPGA by calling functions from C/C++ code. Libraries from Auviz Systems, such as AuvizDNN, provide optimized functions for the user to create custom CNNs for a variety of applications. These functions can be called from within design environments such as Xilinx’s SDAccel to launch kernels on an FPGA.


The simplest approach is to implement the convolutions and the vector-matrix operation in a sequential manner. Given the number of computations involved, sequential computations will create significant latency.
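As a rough picture of what "sequential" means here, the following C++ sketch convolves one input feature map with one square weight kernel, one multiply-accumulate at a time. It is a plain software model under assumed conventions (row-major storage, no padding, a single bias value), not the AuvizDNN implementation. As is usual in CNNs, the kernel is applied without flipping.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential 2-D convolution of one feature map with one ksize x ksize kernel.
// Every multiply-accumulate is executed one after another.
std::vector<float> conv2d_sequential(const std::vector<float>& input,
                                     std::size_t height, std::size_t width,
                                     const std::vector<float>& kernel,
                                     std::size_t ksize, std::size_t stride,
                                     float bias) {
    assert(input.size() == height * width);
    assert(kernel.size() == ksize * ksize);
    assert(ksize > 0 && stride > 0 && ksize <= height && ksize <= width);

    const std::size_t out_h = (height - ksize) / stride + 1;
    const std::size_t out_w = (width - ksize) / stride + 1;
    std::vector<float> output(out_h * out_w);

    for (std::size_t oy = 0; oy < out_h; ++oy) {
        for (std::size_t ox = 0; ox < out_w; ++ox) {
            float acc = bias;
            for (std::size_t ky = 0; ky < ksize; ++ky) {
                for (std::size_t kx = 0; kx < ksize; ++kx) {
                    const std::size_t iy = oy * stride + ky;
                    const std::size_t ix = ox * stride + kx;
                    acc += input[iy * width + ix] * kernel[ky * ksize + kx];
                }
            }
            output[oy * out_w + ox] = acc;
        }
    }
    return output;
}
```

Four nested loops for a single pair of maps already hint at the latency problem: a full layer repeats this for every combination of input and output feature map.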


The main reason for the very high latency of a sequential implementation is the sheer number of computations involved in a CNN. The figure below shows the number of computations and the data transfers for each layer in AlexNet to illustrate the complexity.



[Figure: AlexNet Computations and Data Transfers]
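To get a feel for the scale, the multiply-accumulate count of a convolution layer can be estimated with a few lines of C++. The formula is the standard one for a convolution without padding. The example parameters in the note that follows (a 227 x 227 input with 3 channels, 96 kernels of 11 x 11, stride 4) are the commonly quoted figures for AlexNet's first layer and are an assumption here, not values read from the figure.

```cpp
#include <cassert>
#include <cstdint>

// Output width (or height) of a convolution without padding.
std::uint64_t conv_output_size(std::uint64_t in, std::uint64_t k, std::uint64_t stride) {
    assert(k > 0 && stride > 0 && in >= k);
    return (in - k) / stride + 1;
}

// Multiply-accumulate operations needed by one convolution layer:
// every output value of every output map consumes a k x k window
// from each of the input maps.
std::uint64_t conv_layer_macs(std::uint64_t out_h, std::uint64_t out_w,
                              std::uint64_t out_maps, std::uint64_t in_maps,
                              std::uint64_t k) {
    return out_h * out_w * out_maps * in_maps * k * k;
}
```

Under those assumed parameters, `conv_output_size(227, 11, 4)` gives 55, and `conv_layer_macs(55, 55, 96, 3, 11)` gives 105,415,200 multiply-accumulates for a single layer and a single image.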



Therefore, it is essential to compute in parallel. There are many ways to parallelize the implementation. One such example is illustrated in the figure below. Here, an 11 x 11 weight matrix is convolved in parallel with an 11 x 11 input feature map to create one output value. This process involves 121 parallel multiply-accumulate operations. Depending on the FPGA resources available, we could convolve 512 or even 768 values in parallel.



[Figure: FPGA Performance for AlexNet CNN]
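In a C/C++ flow targeting an FPGA, this kind of parallelism is typically expressed by asking the high-level synthesis tool to unroll a loop. The sketch below computes one output value from an 11 x 11 window with 121 multiply-accumulates. The function name and the flattened array layout are assumptions; `#pragma HLS UNROLL` is a Vivado HLS directive that an ordinary C++ compiler simply ignores, and how the tool actually schedules the sum depends on the tool and the device.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kSize = 11;             // kernel is 11 x 11
constexpr std::size_t kTaps = kSize * kSize;  // 121 multiply-accumulates

// One output value of the convolution: dot product of an 11 x 11 input
// window with an 11 x 11 weight matrix, both stored as flat arrays.
float mac_11x11(const float window[kTaps], const float weights[kTaps]) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < kTaps; ++i) {
#pragma HLS UNROLL  // hint: instantiate all 121 multipliers side by side
        acc += window[i] * weights[i];
    }
    return acc;
}
```

On a CPU this is just a loop; on an FPGA the unrolled form lets the 121 products be formed at the same time, provided the device has enough multipliers and the arrays can be read in parallel.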



To further increase the throughput, we can pipeline the implementation. Pipelining enables higher throughput for operations that take more than one cycle to complete, such as floating-point multiply and add. With pipelining, the latency of the first output increases very slightly, but after that we can obtain a new output every cycle.
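The effect is easy to see with a little cycle arithmetic. The C++ sketch below is a simplified model that assumes an operation with a fixed latency of `depth` cycles and ignores the small extra latency that pipeline registers add; both function names are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Without pipelining, each operation must finish before the next one starts.
std::uint64_t cycles_unpipelined(std::uint64_t n_ops, std::uint64_t depth) {
    return n_ops * depth;
}

// With pipelining, a new operation starts every cycle: the first result
// appears after `depth` cycles, and one more result follows each cycle.
std::uint64_t cycles_pipelined(std::uint64_t n_ops, std::uint64_t depth) {
    assert(n_ops > 0 && depth > 0);
    return depth + (n_ops - 1);
}
```

For 1,000 operations with an assumed 8-cycle latency, the model gives 8,000 cycles without pipelining and 1,007 cycles with it.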


A complete implementation of a CNN on the FPGA using AuvizDNN looks like a sequence of function calls from a C/C++ program. After setting up the objects and data containers, function calls are made to create each of the convolution layers, followed by the dense layers and finally the softmax layer, as shown below.



[Figure: CNN Function Calls]
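The AuvizDNN calls themselves are not reproduced here, and nothing below should be read as that library's API. As a library-independent illustration of the last stage in the sequence, this C++ sketch shows what a softmax layer computes: it turns the scores from the final dense layer into probabilities that sum to one. Subtracting the largest score first is a common numerical-stability measure and does not change the result.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over the outputs of the final dense layer.
std::vector<float> softmax(const std::vector<float>& scores) {
    assert(!scores.empty());
    const float max_score = *std::max_element(scores.begin(), scores.end());

    std::vector<float> probs(scores.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        probs[i] = std::exp(scores[i] - max_score);  // largest exponent is 0
        sum += probs[i];
    }
    for (float& p : probs) {
        p /= sum;  // normalize so the probabilities add up to 1
    }
    return probs;
}
```

The class with the highest score keeps the highest probability, so the index of the largest output is the network's predicted label.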


Note: This blog post is excerpted from a much larger and far more detailed article in the special Megatrends issue of Xcell Journal (Issue 92) that has just been published. The full article is available in that issue.