Hardware accelerators have become commonplace in the data center, and a host of new workloads stand to benefit from FPGA acceleration and its greater computing efficiency. Rapidly growing interest in Machine Learning (ML) is driving the adoption of FPGA accelerators in private, public, and hybrid cloud data centers to accelerate this compute-intensive workload. To help facilitate this IT infrastructure transformation toward heterogeneous computing, we recently collaborated with VMware to test FPGA acceleration on vSphere, VMware's cloud computing virtualization platform. Given the growing adoption of Xilinx FPGAs for ML inference acceleration, we show here how to use Xilinx FPGAs with VMware vSphere to achieve high-throughput, low-latency ML inference performance that is nearly identical between virtual and bare-metal deployments.
Adaptive Computing Advantages
FPGAs are adaptive computing devices that can be reprogrammed to meet the differing processing and functional requirements of each application. This flexibility distinguishes FPGAs from fixed architectures such as GPUs and ASICs; custom ASICs, moreover, come with skyrocketing costs. FPGAs also offer high energy efficiency and low latency compared to other hardware accelerators, which makes them especially suitable for ML inference tasks. Unlike GPUs, which fundamentally rely on a large number of parallel processing cores to achieve high throughput, FPGAs can achieve high throughput and low latency simultaneously for ML inference through customized hardware kernels, data flow pipelining, and interconnects.
Using Xilinx FPGAs on vSphere for ML Inference
VMware used the Xilinx Alveo U250 data center card in its lab for the testing. ML models were quickly provisioned using Docker containers provided in Vitis AI, the Xilinx unified development stack for ML inference on Xilinx hardware platforms from edge to cloud. Vitis AI consists of optimized tools, libraries, models, and examples. It supports mainstream frameworks, including Caffe and TensorFlow, as well as the latest models capable of diverse deep learning tasks. In addition, Vitis AI is open source and can be accessed on GitHub.
Vitis AI software stack
Currently, Xilinx FPGAs can be enabled on vSphere via DirectPath I/O mode (passthrough). In this mode, our FPGAs can be accessed directly by applications running inside a VM, bypassing the hypervisor layer and thereby maximizing performance and minimizing latency. Configuring the FPGA in DirectPath I/O mode is a straightforward two-step process: first, enable the device on ESXi at the host level, and then add the device to the target VM. Detailed instructions can be found in this VMware KB article. Note that if you are running vSphere 7, a host reboot is no longer required.
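Once the device has been added to the VM, a quick sanity check from inside a Linux guest can confirm that the card is visible on the PCI bus. The following is a minimal sketch, not part of the VMware or Xilinx procedure: it assumes a Linux guest that exposes PCI devices through sysfs, and the function name `find_xilinx_devices` is made up for illustration. It relies only on the fact that Xilinx devices report PCI vendor ID `0x10ee`.

```python
from pathlib import Path

XILINX_VENDOR_ID = "0x10ee"  # PCI vendor ID assigned to Xilinx


def find_xilinx_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI addresses of devices whose vendor ID is Xilinx's.

    Hypothetical helper for illustration; sysfs_root is a parameter so the
    function can also be pointed at a test directory.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    found = []
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
        except OSError:
            continue  # entry without a readable vendor attribute
        if vendor == XILINX_VENDOR_ID:
            found.append(dev.name)
    return found


if __name__ == "__main__":
    devices = find_xilinx_devices()
    if devices:
        print("Xilinx PCI device(s) visible in this VM:", ", ".join(devices))
    else:
        print("No Xilinx PCI device found; check the passthrough configuration.")
```

An empty result suggests the passthrough device was not attached to the VM, although it does not by itself say which of the two configuration steps was missed.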
High-Throughput, Low-Latency ML Inference Performance
Together with Xilinx, VMware evaluated the throughput and latency performance of our Alveo U250 accelerator card in DirectPath I/O mode by running inference with four CNN models: Inception_v1, Inception_v2, Resnet50, and VGG16. These models differ in their number of parameters and therefore in processing complexity.
The testing used a Dell PowerEdge R740 server with two 10-core Intel Xeon Silver 4114 CPUs and 192 GB of DDR4 memory. We used the ESXi 7.0 hypervisor, and the end-to-end performance results for each model were compared against bare metal as the baseline. Ubuntu 16.04 (kernel 4.4.0-116) was used as both the guest and native OS. Vitis AI v1.1 and Docker CE 19.03.4 were used throughout the tests. A 50k-image data set derived from ImageNet2012 was used, and to avoid a disk bottleneck when reading images, a RAM disk was created to store the 50k images.
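The harness used to collect the end-to-end numbers is not described here. As a rough illustration only, a throughput and latency measurement over images stored on a RAM disk could be sketched as follows. The `benchmark` function and the `infer` callable are hypothetical stand-ins; `infer` would wrap whatever runtime actually dispatches a batch to the accelerator.

```python
import time
from pathlib import Path


def benchmark(infer, image_paths, batch_size=1):
    """Run `infer` over every image; return (images/s, mean batch latency in ms)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    latencies = []
    n_images = 0
    start = time.perf_counter()
    for i in range(0, len(image_paths), batch_size):
        # Reading from a RAM disk keeps storage out of the critical path.
        batch = [Path(p).read_bytes() for p in image_paths[i:i + batch_size]]
        t0 = time.perf_counter()
        infer(batch)  # hypothetical callable wrapping the accelerator runtime
        latencies.append(time.perf_counter() - t0)
        n_images += len(batch)
    elapsed = time.perf_counter() - start
    if not latencies or elapsed <= 0:
        return 0.0, 0.0
    throughput = n_images / elapsed  # end-to-end, includes file reads
    mean_latency_ms = 1000.0 * sum(latencies) / len(latencies)
    return throughput, mean_latency_ms
```

Running the same function unchanged in the VM and on bare metal keeps the two measurements directly comparable, which is the property the comparison depends on.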
With these settings, the performance comparison between the virtual and bare-metal tests is shown in the following two figures, one for throughput and the other for latency. The y-axis shows the ratio of virtual to bare-metal performance, with y=1.0 meaning the two are identical.
Throughput performance comparison between bare metal and virtual for ML inference on Xilinx Alveo U250 FPGA
Latency performance comparison between bare metal and virtual for ML inference on Xilinx Alveo U250 FPGA
The testing shows that the performance gap between virtual and bare metal is within 2% for both throughput and latency. This indicates that the ML inference performance of the Alveo U250 on vSphere is nearly identical to the bare-metal baseline.
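The ratio plotted in the figures and the 2% gap can be expressed in a few lines. This is a minimal sketch with hypothetical function names; the numbers in the usage lines are placeholders chosen for illustration, not measured results from the testing.

```python
def virtual_to_bare_metal_ratio(virtual, bare_metal):
    """Ratio plotted on the y-axis: 1.0 means identical performance."""
    if bare_metal <= 0:
        raise ValueError("bare-metal baseline must be positive")
    return virtual / bare_metal


def within_gap(ratio, max_gap=0.02):
    """True if the ratio deviates from 1.0 by no more than max_gap (2% by default)."""
    return abs(ratio - 1.0) <= max_gap


# Placeholder values for illustration only (not measurements from the tests).
ratio = virtual_to_bare_metal_ratio(virtual=990.0, bare_metal=1000.0)
print(ratio, within_gap(ratio))
```

The same check applies to both metrics. For throughput, a ratio below 1.0 means the virtual run is slower, whereas for latency a ratio above 1.0 does.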
FPGA Performance in the Cloud
The adoption of FPGA accelerators in the data center is becoming pervasive and will continue to increase to meet the growing demand for heterogeneous computing and higher performance. We’re excited to have partnered with VMware to ensure customers can take full advantage of Xilinx FPGA acceleration on the vSphere platform. The testing of our Alveo U250 accelerator on vSphere for ML inference demonstrates the close-to-native performance that customers can achieve with DirectPath I/O mode.