
Removing the Barrier for FPGA-Based OpenCL Data Center Servers



By Devadas Varma and Tom Feist, Xilinx


(Excerpted from the latest issue of Xcell Journal)



Mission-critical enterprise servers often use specialized hardware for application acceleration, such as graphics processing units (GPUs) and digital signal processors (DSPs). Xilinx's new SDAccel development environment removes programming as a barrier to using FPGAs in this role by giving developers a familiar CPU/GPU-like environment.


Public cloud providers and Internet companies such as Amazon Web Services, Google Compute, Microsoft Azure, Facebook and China's Baidu maintain huge repositories of pictures and require very fast image recognition. In one project, Google scientists created one of the largest neural networks for machine learning by connecting 16,000 computer processors into an entity that they turned loose on the Internet to learn on its own. The research is representative of a new generation of computer science that exploits the huge clusters of computers available in giant data centers.


Baidu, China's largest search-engine company, turned to deep-neural-network processing to solve problems in speech recognition, image search and natural-language processing. The company quickly determined that when neural back-propagation algorithms are used in online prediction, FPGA solutions scale far more easily than CPUs and GPUs while also consuming less power. (Note: Baidu's Dr. Ren Wu, a GPU application pioneer, gave a keynote at last week's Embedded Vision Summit 2015. In it he announced that Baidu's GPU-based deep-learning convolutional neural network (CNN) had achieved world-leading accuracy on the ImageNet Large Scale Visual Recognition Challenge data set. See "Baidu Leads in Artificial Intelligence Benchmark" and Baidu's paper.)


The new generation of 28nm and 20nm high-integration FPGA families, such as Xilinx's 7 series and UltraScale devices, is changing the dynamics of integrating FPGAs into host cards and line cards in data center servers. Performance per watt can exceed 20x that of an equivalent CPU or GPU, and some applications see latency improvements of 50x to 75x over traditional CPUs. However, teams with limited or no FPGA hardware-design resources have found the transition to FPGAs challenging, because taking full advantage of these devices requires RTL (VHDL or Verilog) development expertise.
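The performance-per-watt and latency figures above are ratios of two measurements. The next block is a minimal Python sketch of how such ratios are computed. The function names and the sample numbers in the usage lines are hypothetical and are not Xilinx benchmark data.

```python
def perf_per_watt(throughput: float, power_watts: float) -> float:
    """Useful work per watt, e.g. frames per second divided by watts."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return throughput / power_watts


def perf_per_watt_gain(fpga_throughput: float, fpga_watts: float,
                       base_throughput: float, base_watts: float) -> float:
    """How many times higher the FPGA's performance per watt is than a baseline's."""
    return perf_per_watt(fpga_throughput, fpga_watts) / perf_per_watt(base_throughput, base_watts)


def latency_improvement(base_latency: float, fpga_latency: float) -> float:
    """Latency improvement factor: baseline latency divided by FPGA latency (same units)."""
    if fpga_latency <= 0:
        raise ValueError("fpga_latency must be positive")
    return base_latency / fpga_latency


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not measured results.
    # FPGA: 1000 frames/s at 25 W; CPU: 500 frames/s at 250 W.
    print(perf_per_watt_gain(1000, 25, 500, 250))  # 20.0
    # CPU responds in 5000 us, FPGA in 100 us.
    print(latency_improvement(5000, 100))  # 50.0
```

In these hypothetical numbers, the FPGA is only 2x faster than the CPU. Because it draws one-tenth the power, its performance per watt is still 20x higher. A large efficiency gain therefore does not require an equally large raw speedup.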


OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, DSPs, FPGAs and other processors. OpenCL has two parts. A C99-based language is used to write the kernels that run on the target devices. An application programming interface (API) lets the host program control the platform and execute those kernels on the target device.
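The next block shows the split between the host API and the kernel language. It is a serial Python emulation of an OpenCL host launching a kernel over a one-dimensional index space (an NDRange). It is not the OpenCL API. `enqueue_nd_range` is a hypothetical stand-in for the real `clEnqueueNDRangeKernel` host call. The `gid` parameter plays the role of `get_global_id(0)` in OpenCL C. On a real device, the work-items run in parallel rather than in a Python loop.

```python
from typing import Callable, List, Sequence

# The equivalent OpenCL C kernel, shown for reference only (not executed here):
#   __kernel void vadd(__global const float *a,
#                      __global const float *b,
#                      __global float *c) {
#       int i = get_global_id(0);
#       c[i] = a[i] + b[i];
#   }


def vadd(gid: int, a: Sequence[float], b: Sequence[float], c: List[float]) -> None:
    """Kernel body: each work-item adds exactly one pair of elements."""
    c[gid] = a[gid] + b[gid]


def enqueue_nd_range(kernel: Callable[..., None], global_size: int, *args) -> None:
    """Hypothetical host-side launcher: invoke the kernel once per global ID.

    A real OpenCL runtime would dispatch these work-items in parallel on the device.
    """
    for gid in range(global_size):
        kernel(gid, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
c = [0.0] * len(a)
enqueue_nd_range(vadd, len(a), a, b, c)
print(c)  # [11.0, 22.0, 33.0, 44.0]
```

The kernel is written for a single work-item rather than as an explicit loop. This lets an OpenCL compiler decide how to map the same source onto the hardware of each target device.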


The new Xilinx SDAccel environment provides data center application developers with a complete FPGA-based hardware and OpenCL software environment. SDAccel includes a fast, architecturally optimizing compiler that makes efficient use of on-chip FPGA resources. It also offers a familiar software-development flow based on an Eclipse integrated development environment (IDE) for code development, profiling and debugging, giving developers a CPU/GPU-like work environment. Moreover, SDAccel leverages Xilinx's dynamically reconfigurable technology, so that accelerator kernels optimized for different applications can be swapped in and out on the fly.
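The kernel swapping described above can be pictured with a toy Python model of one reconfigurable region that holds a single accelerator kernel at a time. Everything here is hypothetical. `ReconfigurableRegion` and its kernel library are illustrative names, not SDAccel APIs. The "reconfiguration" is only a bookkeeping step that stands in for loading a new bitstream.

```python
from typing import Callable, Dict, List, Optional

Kernel = Callable[[List[int]], List[int]]


class ReconfigurableRegion:
    """Toy model of one FPGA region that holds a single accelerator kernel at a time."""

    def __init__(self, library: Dict[str, Kernel]):
        self.library = library            # kernels available to load, keyed by name
        self.loaded: Optional[str] = None # name of the kernel currently in the region
        self.reconfigurations = 0         # how many times a new kernel was swapped in

    def run(self, kernel_name: str, data: List[int]) -> List[int]:
        if kernel_name not in self.library:
            raise KeyError(f"no kernel named {kernel_name!r}")
        if self.loaded != kernel_name:
            # Stand-in for reconfiguring the region with a different kernel.
            self.loaded = kernel_name
            self.reconfigurations += 1
        return self.library[kernel_name](data)


region = ReconfigurableRegion({
    "scale": lambda xs: [2 * x for x in xs],
    "offset": lambda xs: [x + 1 for x in xs],
})
print(region.run("scale", [1, 2, 3]))  # [2, 4, 6]
print(region.run("scale", [4]))        # [8]  (already loaded, no swap)
print(region.run("offset", [1, 2]))    # [2, 3]
print(region.reconfigurations)         # 2
```

The model reconfigures only when the requested kernel differs from the loaded one. This matters because loading a new configuration onto real hardware takes time. A scheduler that groups requests by kernel would reduce the number of swaps further.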



[Figure: SDAccel development environment diagram]



FPGA accelerators compiled with SDAccel can deliver as much as 10x the performance of high-end CPUs while consuming one-tenth the power of a GPU. They also maintain code compatibility and a traditional software-programming model, which eases application migration and saves cost.
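The power half of that claim can be turned into a performance-per-watt comparison with the GPU. Let T be throughput and P be power, and take P_FPGA = P_GPU / 10 from the claim above. The ratio then depends only on how the two throughputs compare, which the claim does not state:

```latex
\frac{T_{\mathrm{FPGA}}/P_{\mathrm{FPGA}}}{T_{\mathrm{GPU}}/P_{\mathrm{GPU}}}
  = \frac{T_{\mathrm{FPGA}}}{T_{\mathrm{GPU}}}\cdot\frac{P_{\mathrm{GPU}}}{P_{\mathrm{FPGA}}}
  = 10\,\frac{T_{\mathrm{FPGA}}}{T_{\mathrm{GPU}}}
  \qquad\text{when } P_{\mathrm{FPGA}} = \tfrac{1}{10}\,P_{\mathrm{GPU}}
```

If the FPGA only matches the GPU's throughput, its performance per watt is about 10x higher. If it reaches only half the GPU's throughput, the advantage drops to about 5x. These throughput ratios are assumptions chosen for illustration.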


On real-world computation workloads such as video processing with complex nested datapaths, the inherent flexibility of the FPGA fabric offers performance and power advantages over the fixed architectures of CPUs and GPUs. As the benchmarks in the figure below show, the FPGA solution compiled by SDAccel outperforms the CPU implementation of the same code and offers performance competitive with GPU implementations.


[Figure: SDAccel performance benchmarks]


(Benchmarks performed by Auviz using the AuvizCV library.)
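Video-processing kernels like those benchmarked above typically compute each output pixel from a small neighborhood of input pixels. This produces the nested loops in the datapath. The next block is a Python sketch that emulates a two-dimensional NDRange in which each work-item computes one pixel of a 3x3 box (averaging) filter. It is a hypothetical illustration, not AuvizCV code. `box3x3_work_item`, `enqueue_nd_range_2d` and the clamp-to-edge border rule are choices made for this example.

```python
from typing import Callable, List

Image = List[List[int]]


def box3x3_work_item(x: int, y: int, src: Image, dst: Image) -> None:
    """One work-item: average the 3x3 neighborhood around (x, y), clamping at borders."""
    h, w = len(src), len(src[0])
    total = 0
    for dy in (-1, 0, 1):                     # window rows
        for dx in (-1, 0, 1):                 # window columns
            yy = min(max(y + dy, 0), h - 1)   # clamp-to-edge border handling
            xx = min(max(x + dx, 0), w - 1)
            total += src[yy][xx]
    dst[y][x] = total // 9                    # integer average of nine pixels


def enqueue_nd_range_2d(kernel: Callable[..., None], width: int, height: int, *args) -> None:
    """Serial stand-in for a 2D NDRange: one kernel call per (x, y) global ID."""
    for y in range(height):
        for x in range(width):
            kernel(x, y, *args)


src = [[9, 9, 9, 9],
       [9, 9, 9, 9],
       [0, 0, 0, 0]]
dst = [[0] * 4 for _ in range(3)]
enqueue_nd_range_2d(box3x3_work_item, 4, 3, src, dst)
print(dst)  # [[9, 9, 9, 9], [6, 6, 6, 6], [3, 3, 3, 3]]
```

On an FPGA, the two fixed-size window loops are natural candidates for full unrolling. This would let the nine neighbor reads and additions of a work-item proceed in parallel rather than one after another.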


