Last week at the NIPS 2017 conference in Long Beach, California, a Xilinx team demonstrated a live object-detection implementation of a YOLO—“you only look once”—network called Tincy YOLO (pronounced “teensy YOLO”) running on a Xilinx Zynq UltraScale+ MPSoC. Tincy YOLO combines reduced precision, pruning, and FPGA-based hardware acceleration to speed network performance by 160x, resulting in a YOLO network capable of operating on video frames at 16fps while dissipating a mere 6W.
Live demo of Tincy YOLO at NIPS 2017. Photo credit: Dan Isaacs
Here’s a description of that demo:
TincyYOLO: a real-time, low-latency, low-power object detection system running on a Zynq UltraScale+ MPSoC
By Michaela Blott, Principal Engineer, Xilinx
The Tincy YOLO demonstration shows real-time, low-latency, low-power object detection running on a Zynq UltraScale+ MPSoC device. In object detection, the challenge is to identify objects of interest within a scene and to draw bounding boxes around them, as shown in Figure 1. Object detection is useful in many areas, particularly in advanced driver assistance systems (ADAS) and autonomous vehicles where systems need to automatically detect hazards and to take the right course of action. Tincy YOLO leverages the “you only look once” (YOLO) algorithm, which delivers state-of-the-art object detection. Tincy YOLO is based on the Tiny YOLO convolutional network, which is based on the Darknet reference network. Tincy YOLO has been optimized through heavy quantization and modification to fit into the Zynq UltraScale+ MPSoC’s PL (programmable logic) and Arm Cortex-A53 processor cores to produce the final, real-time demo.
Figure 1: YOLO-recognized people with bounding boxes
To appreciate the computational challenge posed by Tiny YOLO, note that it takes 7 billion floating-point operations to process a single frame. Before you can conquer this computational challenge on an embedded platform, you need to pull many levers. Luckily, the all-programmable Zynq UltraScale+ MPSoC platform provides many levers to pull. Figure 2 summarizes the versatile and heterogeneous architectural options of the Zynq platform.
Figure 2: Tincy YOLO Platform Overview
The vanilla Darknet open-source neural network framework is optimized for CUDA acceleration but its generic, single-threaded processing option can target any C-programmable CPU. Compiling Darknet for the embedded Arm processors in the Zynq UltraScale+ MPSoC left us with a sobering performance of one recognized frame every 10 seconds. That’s about two orders of magnitude of performance away from a useful ADAS implementation. It also produces a very limited live-video experience.
To create Tincy YOLO, we leveraged several of the Zynq UltraScale+ MPSoC’s architectural features in steps, as shown in Figure 3. Our first major move was to quantize the computation of the network’s twelve inner (aka. hidden) layers by giving them binary weights and 3-bit activations. We then pruned this network to reduce the total operations to 4.5 GOPs/frame.
Figure 3: Steps used to achieve a 160x speedup of the Tiny YOLO network
We created a reduced-precision accelerator using a variant of the FINN BNN library (https://github.com/Xilinx/BNN-PYNQ) to offload the quantized layers into the Zynq UltraScale+ MPSoC’s PL. These layers account for more than 97% of all the computation within the network. Moving the computations for these layers into hardware bought us a 30x speedup of their specific execution, which translated into an 11x speedup within the overall application context, bringing the network’s performance up to 1.1fps.
We tackled the remaining outer layers by exploiting the NEON SIMD vector capabilities built into the Zynq UltraScale+ MPSoC’s Arm Cortex-A53 processor cores, which gained another 2.2x speedup. Then we cracked down on the complexity of the initial convolution using maxpool elimination for another 2.2x speedup. This work raised the frame rate to 5.5fps. A final re-write of the network inference to parallelize the CPU computations across all four of the Zynq UltraScale+ MPSoC’s Arm Cortex-A53 processor delivered video performance at 16fps.
The result of these changes appears in Figure 4, which demonstrates better recognition accuracy than Tiny YOLO.
Figure 4: Tincy YOLO results
Good machine learning heavily depends on large training-data sets, which are not always available. There’s a solution to this problem called transfer learning, which allows the new neural network to leverage an already trained neural network as a starting point. Kaan Kara at ETH Zurich has published an example of transfer learning as a Jupyter Notebook for the Zynq-and-Python based PYNQ development environment on Github. This demo uses the ZipML-PYNQ overlay and analyzes astronomical images of galaxies and puts the images into one of two classes: one showing images of merging galaxies and one that doesn’t.
The work is discussed further in a paper presented at the IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017. The paper is titled “FPGA-Accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-Off.”
Like the genie in Aladdin, KORTIQ’s FPGA-based AIScale CNN Accelerator takes pre-trained CNNs (convolutional neural networks)—including industry standards such as ResNet, AlexNet, Tiny Yolo, and VGG-16—compresses them, and fits them into Xilinx’s full range of programmable logic fabrics. Devices such as the Zynq SoC and Zynq UltraScale+ MPSoC have multiple on-chip processors that can provide data to the AIScale CNN Accelerator instantiated in the FPGA fabric and accept its classification output, enabling designs such as single-chip, intelligent industrial or surveillance video cameras.
KORTIQ’s AIScale DeepCompressor compresses the trained CNN and outputs a resulting description file that represents the trained CNN. KORTIQ’s TensorFlow2AIScale translator then prepares the compressed CNN for use with KORTIQ’s AIScale RCC (reconfigurable compute core) IP that performs real-time recognition based on the trained CNN. Because the compressed CNN takes the form of a relatively small description, many such description files can be stored in on- or off-chip memory, making fast switching among trained CNNs quite feasible. Currently, KORTIQ is focusing on embedded vision and computer vision applications such as image classification, object recognition, object tracking, and face recognition.
Here’s a conceptual block diagram of the KORTIQ offering:
The hardware portion of this product, the AIScale RCC, is a coarse-grained, scalable, accelerator that can be instantiated in programmable logic—for example in the FPGA fabric of a Zynq Z-7020 SoC for small-footprint instances of the AIScale RCC. Larger All Programmable devices such as larger Zynq SoCs and Zynq UltraScale+ MPSoCs can implement more processing blocks within the accelerator core, which in turn makes the accelerator go even faster. You can use this feature to scale system performance up by picking devices with larger FPGA arrays or reducing power consumption by picking smaller devices.
For more information about the AIScale product family, contact KORTIQ directly.
In the short video below, Xilinx Product Marketing Manager Kamran Khan demonstrates GoogleNet running at 10K images/sec on Amazon’s AWS EC2 F1 using eight Virtex UltraScale+ FPGAs in a 16xlarge configuration. The same video also shows open-source, deep-learning app DeepDetect running in real time, classifying images from a Webcam’s real-time video stream.
For more information about Amazon’s AWS EC2 F1 instance in Xcell Daily, see:
According to an announcement released today:
“Xilinx, Inc. (XLNX) and Huawei Technologies Co., Ltd. today jointly announced the North American debut of the Huawei FPGA Accelerated Cloud Server (FACS) platform at SC17. Powered by Xilinx high performance Virtex UltraScale+ FPGAs, the FACS platform is differentiated in the marketplace today.
“Launched at the Huawei Connect 2017 event, the Huawei Cloud provides FACS FP1 instances as part of its Elastic Compute Service. These instances enable users to develop, deploy, and publish new FPGA-based services and applications through easy-to-use development kits and cloud-based EDA verification services. Both expert hardware developers and high-level language users benefit from FP1 tailored instances suited to each development flow.
"...The FP1 demonstrations feature Xilinx technology which provides a 10-100x speed-up for compute intensive cloud applications such as data analytics, genomics, video processing, and machine learning. Huawei FP1 instances are equipped with up to eight Virtex UltraScale+ VU9P FPGAs and can be configured in a 300G mesh topology optimized for performance at scale."
Huawei’s FP1 FPGA accelerated cloud service is available on the Huawei Public Cloud today. To register for the public beta, click here.
Ryft is one of several companies now offering FPGA-accelerated applications based on Amazon’s AWS EC2 F1 instance. Ryft was at SC17 in Denver this week with a sophisticated, cloud-based data analytics demo based on machine learning and deep learning that classified 50,000 images from one data file using a neural network, merged the classified image files with log data from another file to create a super metadata file, and then provided fast image retrieval using many criteria including image classification, a watch-list match (“look for a gun” or “look for a truck”), or geographic location using the Google Earth database. The entire demo made use of geographically separated servers containing the files used in conjunction with Amazon’s AWS Cloud. The point of this demo was to show Ryft’s ability to provide “FPGAs as a Service” (FaaS) in an easy to use manner using any neural network of your choice, any framework (Caffe, TensorFlow, MXNet), and the popular RESTful API.
This was a complex, live demo and it took Ryft’s VP of Products Bill Dentinger six minutes to walk me through the entire thing, even moving as quickly as possible. Here’s the 6-minute video of Bill giving a very clear explanation of the demo details:
Note: Ryft does a lot of work with US government agencies and as of November 15 (yesterday), Amazon’s AWS EC2 F1 instance based on Xilinx Virtex UltraScale+ FPGAs is available on GovCloud. (See “Amazon’s FPGA-accelerated AWS EC2 F1 instance now available on Amazon’s GovCloud—as of today.”)
This week, if you were in the Xilinx booth at SC17, you would have seen demos of the new Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit (available in actively and passively cooled versions). Both versions are based on Xilinx Virtex UltraScale+ VU9P FPGAs with 64Gbytes of on-board DDR4 SDRAM.
Xilinx Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit, actively cooled version
Xilinx Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit, passively cooled version
Xilinx had several VCU1525 Acceleration Development Kits at SC17 running various applications at SC17. Here’s a short 90-second video from SC17 showing two running applications—edge-to-cloud video analytics and machine learning— narrated by Xilinx Senior Engineering Manager Khang Dao:
Note: For more information about the Xilinx Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit, contact your friendly neighborhood Xilinx or Avnet sales representative.
The new Mellanox Innova-2 Adapter Card teams the company’s ConnectX-5 Ethernet controller with a Xilinx Kintex UltraScale+ KU15P FPGA to accelerate computing, storage, and networking in data centers. According to the announcement, “Innova-2 is based on an efficient combination of the state-of-the-art ConnectX-5 25/40/50/100Gb/s Ethernet and InfiniBand network adapter with Xilinx UltraScale FPGA accelerator.” The adapter card has a PCIe Gen4 host interface.
Mellanox’s Innova-2 PCIe Adapter Card
Key features of the card include:
Innova-2 is available in multiple, pre-programmed configurations for security applications with encryption acceleration such as IPsec or TLS/SSL. Innova-2 boosts performance by 6x for security applications while reducing total cost of ownership by 10X when compared to alternatives.
Innova-2 enables SDN and virtualized acceleration and offloads for Cloud infrastructure. The on-board programmable resources allow deep-learning training and inferencing applications to achieve faster performance and better system utilization by offloading algorithms into the card’s Kintex UltraScale+ FPGA and the ConnectX acceleration engines.
The adapter card is also available as an unprogrammed card, open for customers’ specific applications. Mellanox provides configuration and management tools to support the Innova-2 Adapter Card across Windows, Linux, and VMware distributions.
Please contact Mellanox directly for more information about the Innova-2 Adapter Card.
Programmable logic is proving to be an excellent, flexible implementation medium for neural networks that gets faster and faster as you go from floating-point to fixed-point representation—making it ideal for embedded AI and machine-learning applications—and the latest proof point is a recently published paper written by Yufeng Hao and Steven Quigley in the Department of Electronic, Electrical and Systems Engineering at the University of Birmingham, UK. The paper is titled “The implementation of a Deep Recurrent Neural Network Language Model on a Xilinx FPGA” and it describes a successful implementation and training of a fixed-point Deep Recurrent Neural Network (DRNN) using the Python programming language; the Theano math library and framework for multi-dimensional arrays; the open-source, Python-based PYNQ development environment; the Digilent PYNQ-Z1 dev board; and the Xilinx Zynq Z-7020 SoC on the PYNQ-Z1 board. Using a Python DRNN hardware-acceleration overlay, the two-person team achieved 20GOPS of processing throughput for an NLP (natural language processing) application with this design and outperformed earlier FPGA-based implementation by factors ranging from 2.75x to 70.5x.
Most of the paper discusses NLP and the LM (language model), “which is involved in machine translation, voice search, speech tagging, and speech recognition.” The paper then discusses the implementation of a DRNN LM hardware accelerator using Vivado HLS and Verilog to synthesize a custom overlay for the PYNQ development environment. The resulting accelerator contains five Process Elements (PEs) capable of delivering 20 GOPS in this application. Here’s a block diagram of the design:
DRNN Accelerator Block Diagram
There are plenty of deep technical details embedded in this paper but this one sentence sums up the reason for this blog post about the paper: “More importantly, we showed that a software and hardware joint design and simulation process can be useful in the neural network field.” This statement is doubly true considering that the PYNQ-Z1 dev board sells for $229.
Today, Xilinx announced plans to invest $40M to expand research and development engineering work in Ireland on artificial intelligence and machine learning for strategic markets including cloud computing, embedded vision, IIoT (industrial IoT), and 5G wireless communications. The company already has active development programs in these categories and today’s announcement signals an acceleration of development in these fields. The development was formally announced in Dublin today by The Tánaiste (Deputy Prime Minister of Ireland) and Minister for Business, Enterprise and Innovation, Frances Fitzgerald T.D., and by Kevin Cooney, Senior Vice President, Chief Information Officer and Managing Director EMEA, Xilinx Inc. The new investment is supported by the Irish government through IDA Ireland.
Xilinx first established operations in Dublin in 1995. Today, the company employs 350 people at its EMEA headquarters in Citywest, Dublin, where it operates a research, product development, engineering, and an IT center along with centralized supply, finance, legal, and HR functions. Xilinx also has R&D operations in Cork, which the company established in 2001.
Xilinx’s Ireland Campus
I’ve written several times about Amazon’s AWS EC2 F1 instance, a cloud-based acceleration services based on multiple Xilinx Virtex UltraScale+ VU9P FPGAs. (See “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available.”) The VINEYARD Project, a pan-European effort to significantly increase the performance and energy efficiency of data centers by leveraging the advantages of hardware accelerators, is using Amazon’s EC2 F1 instance to develop Apache Spark accelerators. VINEYARD project coordinator Christoforos Kachris from ICCS/NTU, of gave a presentation on "Hardware Acceleration of Apache Spark on Energy Efficient FPGAs" at SPARK Summit 2017 and a video of his presentation appears below.
Kachris’ presentation details experiments on accelerating machine-learning (ML) applications running on the Apache Spark cluster-computing framework by developing hardware-accelerated IP. The central idea is to create ML libraries that can be seamlessly invoked by programs simply by calling the appropriate library. No other program changes are needed to get the benefit of hardware acceleration. Raw data passes from a Spark Worker through a pipe, a Python API, and a C API to the FPGA acceleration IP and returns to the Spark Worker over a similar, reverse path.
The VINEYARD development team first prototyped their idea by creating a small model of the AWS EC2 F1 could-based system using four Digilent PYNQ-Z1 dev boards networked together via Ethernet and the Python-based, open-source PYNQ software development environment. Digilent’s PYNQ-Z1 dev boards are based on Xilinx Zynq Z-7020 SoCs. Even this small prototype dramatically doubled the performance relative to a Xeon server.
Having proved the concept, the VINEYARD development team scaled up to the AWS EC2 F1 and achieved a 3x to 10x performance improvement (cost normalized against an AWS instance with non-accelerated servers).
Here’s the 26-minute video presentation:
According to Yin Qi, Megvii’s chief exec, his company is developing a “brain” for visual computing. Beijing-based Megvii develops some of the most advanced image-recognition and AI technology in the world. The company’s Face++ facial-recognition algorithms run on the cloud and in edge devices such as the MegEye-C3S security camera, which runs Face++ algorithms locally and can capture more than 100 facial images in each 1080P video frame at 30fps.
MegEye-C3S Facial-Recognition Camera based on Megvii’s Face++ technology
In its early days, Megvii ran its algorithms on GPUs, but quickly discovered the high cost and power disadvantages of GPU acceleration. The company switched to the Xilinx Zynq SoC and is able to run deep convolution on the Zynq SoC’s programmable logic while quantitative analysis runs simultaneously on the Zynq SoC’s Arm Cortex-A9 processors. The heterogeneous processing resources of the Zynq SoC allow Megvii to optimize the performance of its recognition algorithms for lowest cost and minimum power consumption in edge equipment such as the MegEye-C3S camera.
MegEye-C3S Facial-Recognition Camera exploded diagram showing Zynq SoC (on right)
Here’s a 5-minute video where Megvii’s Sam Xie, GM of Branding and Marketing, and Jesson Liu, Megvii’s hardware leader, explain how their company has been able to attract more than 300,000 developers to the Face++ platform and how the Xilinx Zynq SoC has aided the company in developing the most advanced recognition products in the cloud and on the edge:
One of the several products announced at DeePhi’s event held yesterday in Beijing was the DP-S64 ASR (Automatic speech Recognition) Acceleration Solution, a Neural Network (NN) application that runs on Amazon’s FPGA-Accelerated AWS EC2 F1 instance. The AWS EC2 F1 instances’ FPGA acceleration is based on multiple Xilinx Virtex UltraScale+ VU9P FPGAs. (See “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available.”)
For details and to apply for a free trial of DeePhi’s DP-S64, send an email to firstname.lastname@example.org.
For information about the other NN products announced yesterday by DeePhi, see “DeePhi launches vision-processing dev boards based on a Zynq SoC and Zynq UltraScale+ MPSoC, companion Neural Network (NN) dev kit.”
Yesterday, DeePhi Tech announced several new deep-learning products at an event held in Beijing. All of the products are based on DeePhi’s hardware/software co-design technologies for neural network (NN) and AI development and use deep compression and Xilinx All Programmable technology as a foundation. Central to all of these products is DeePhi’s Deep Neural Network Development Kit (DNNDK), an integrated framework that permits NN development using popular tools and libraries such as Caffe, TensorFlow, and MXNet to develop and compile code for DeePhi’s DPUs (Deep Learning Processor Units). DeePhi has developed two FPGA-based DPUs: the Aristotle Architecture for convolutional neural networks (CNNs) and the Descartes Architecture for Recurrent Neural Networks (RNNs).
DeePhi’s DNNDK Design Flow
DeePhi’s Aristotle Architecture
DeePhi’s Descartes Architecture
DeePhi’s approach to NN development using Xilinx All Programmable technology uniquely targets the company’s carefully optimized, hand-coded DPUs instantiated in programmable logic. In the new book “FPGA Frontiers” published the Next Platform Press, DeePhi’s co-founder and CEO Song Yao describes using his company’s DPUs: “The algorithm designer doesn’t need to know anything about the underlying hardware. This generates instruction instead of RTL code, which leads to compilation in 60 seconds.” The benefits are rapid development and the ability to concentrate on NN code development rather than the mechanics of FPGA compilation, synthesis, and placement and routing.
Part of yesterday’s announcement included two PCIe boards oriented towards vision processing that implement DeePhi’s Aristotle Architecture DPU. One board, based on the Xilinx Zynq Z-7020 SoC, handles real-time CNN-based video analysis including facial detection for more than 30 faces simultaneously for 1080p, 18fps video using only 2 to 4 watts. The second board, based on a Xilinx Zynq UltraScale+ ZU9 MPSoC, supports simultaneous, real-time video analysis for 16 channels of 1080p, 18fps video and draws only 30 to 60 watts.
DeePhi PCIe NN board based on a Xilinx Zynq Z-7020 SoC
DeePhi PCIe NN board based on a Xilinx Zynq UltraScale+ ZU9 MPSoC
For more information about these products, please contact DeePhi Tech directly.
RedZone Robotics’ Solo—a camera-equipped, autonomous sewer-inspection robot—gives operators a detailed, illuminated view of the inside of a sewer pipe by crawling the length of the pipe and recording video of the conditions it finds inside. A crew can deploy a Solo robot in less than 15 minutes and then move to another site to launch yet another Solo robot, thus conducting several inspections simultaneously and cutting the cost per inspection. The treaded robot traverses the pipeline autonomously and then returns to the launch point for retrieval. If the robot encounters an obstruction or blockage, it attempts to negotiate the problem three times before aborting the inspection and returning to its entry point. The robot fits into pipes as small as eight inches in diameter and even operates in pipes that contain some residual waste water.
RedZone Robotics Autonomous Sewer-Pipe Inspection Robot
Justin Starr, RedZone’s VP of Technology, says that the Solo inspection robot uses its on-board Spartan FPGA for image processing and for AI. Image-processing algorithms compensate for lens aberrations and also perform a level of sensor fusion for the robot’s multiple sensors. “Crucial” AI routines in the Spartan FPGA help the robot keep track of where it is in the pipeline and tell the robot what to do when it encounters an obstruction.
Starr also says that RedZone is already evaluating Xilinx Zynq devices to extend the robot’s capabilities. “It’s not enough for the Solo to just grab information about what it sees, but let’s actually look at those images. Let’s have the solo go through that inspection data in real time and generate a preliminary report of what it saw. It used to be the stuff of science fiction but now it’s becoming reality.”
Want to see the Solo in action? Here’s a 3-minute video:
BrainChip Holdings has just announced the BrainChip Accelerator, a PCIe server-accelerator card that simultaneously processes 16 channels of video in a variety of video formats using spiking neural networks rather than convolutional neural networks (CNNs). The BrainChip Accelerator card is based on a 6-core implementation BrainChip’s Spiking Neural Network (SNN) processor instantiated in an on-board Xilinx Kintex UltraScale FPGA.
Here’s a photo of the BrainChip Accelerator card:
BrainChip Accelerator card with six SNNs instantiated in a Kintex UltraScale FPGA
Each BrainChip core performs fast, user-defined image scaling, spike generation, and SNN comparison to recognize objects. The SNNs can be trained using low-resolution images as small as 20x20 pixels. According to BrainChip, SNNs as implemented in the BrainChip Accelerator cores excel at recognizing objects in low-light, low-resolution, and noisy environments.
The BrainChip Accelerator card can process 16 channels of video simultaneously with an effective throughput of more than 600 frames per second while dissipating a mere 15W for the entire card. According to BrainChip, that’s a 7x improvement in frames/sec/watt when compared to a GPU-accelerated CNN-based, deep-learning implementation for neural networks like GoogleNet and AlexNet. Here’s a graph from BrainChip illustrating this claim:
SNNs mimic human brain function (synaptic connections, neuron thresholds) more closely than do CNNs and rely on models based on spike timing and intensity. Here’s a graphic from BrainChip comparing a CNN model with the Spiking Neural Network model:
For more information about the BrainChip Accelerator card, please contact BrainChip directly.
A new open-source tool named GUINNESS makes it easy for you to develop binarized (2-valued) neural networks (BNNs) for Zynq SoCs and Zynq UltraScale+ MPSoCs using the SDSoC Development Environment. GUINNESS is a GUI-based tool that uses the Chainer deep-learning framework to train a binarized CNN. In a paper titled “On-Chip Memory Based Binarized Convolutional Deep Neural Network Applying Batch Normalization Free Technique on an FPGA,” presented at the recent 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, authors Haruyoshi Yonekawa and Hiroki Nakahara describe a system they developed to implement a binarized CNN for the VGG-16 benchmark on the Xilinx ZCU102 Eval Kit, which is based on a Zynq UltraScale+ ZU9EG MPSoC. Nakahara presented the GUINNESS tool again this week at FPL2017 in Ghent, Belgium.
According to the IEEE paper, the Zynq-based BNN is 136.8x faster and 44.7x more power efficient than the same CNN running on an ARM Cortex-A57 processor. Compared to the same CNN running on an Nvidia Maxwell GPU, the Zynq-based BNN is 4.9x faster and 3.8x more power efficient.
Xilinx ZCU102 Zynq UltraScale+ MPSoC Eval Kit
Xilinx has announced at HUAWEI CONNECT 2017 that Huawei’s new, accelerated cloud service and its FPGA Accelerated Cloud Server (FACS) is based on Xilinx Virtex UltraScale+ VU9P FPGAs. The Huawei FACS platform allows users to develop, deploy, and publish new FPGA-based services and applications on the Huawei Public Cloud with a 10-50x speed-up for compute-intensive cloud applications such as machine learning, data analytics, and video processing. Huawei has more than 15 years of experience in the development of FPGA systems for telecom and data center markets. "The Huawei FACS is a fully integrated hardware and software platform offering developer-to-deployment support with best-in-class industry tool chains and access to Huawei's significant FPGA engineering expertise," said Steve Langridge, Director, Central Hardware Institute, Huawei Canada Research Center.
The FPGA Accelerated Cloud Server is available on the Huawei Public Cloud today. To register for the public beta, please visit http://www.hwclouds.com/product/fcs.html. For more information on the Huawei Cloud, please visit www.huaweicloud.com.
For more information, see this page.
Xcell Daily covered an announcement by Baidu about its use of Xilinx Kintex UltraScale+ FPGAs for the acceleration of cloud-based applications last October. (See “Baidu Adopts Xilinx Kintex UltraScale FPGAs to Accelerate Machine Learning Applications in the Data Center.”) Today, Baidu discussed more architectural particulars of its FPGA-acceleration efforts at the Hot Chips conference in Cupertino, California—according to Nicole Hemsoth’s article appearing on the NextPlatform.com site (“An Early Look at Baidu’s Custom AI and Analytics Processor”).
“…Baidu has a new processor up its sleeve called the XPU… The architecture they designed is aimed at this diversity with an emphasis on compute-intensive, rule-based workloads while maximizing efficiency, performance and flexibility, says Baidu researcher, Jian Ouyang. He unveiled the XPU today at the Hot Chips conference along with co-presenters from FPGA maker, Xilinx…
“’The FPGA is efficient and can be aimed at specific workloads but lacks programmability,’ Ouyang explains. ‘Traditional CPUs are good for general workloads, especially those that are rule-based and they are very flexible. GPUs aim at massive parallelism and have high performance. The XPU is aimed at diverse workloads that are compute-intensive and rule-based with high efficiency and performance with the flexibility of a CPU,’ Ouyang says. The part that is still lagging, as is always the case when FPGAs are involved, is the programmability aspect. As of now there is no compiler, but he says the team is working to develop one…
“’To support matrix, convolutional, and other big and small kernels we need a massive math array with high bandwidth, low latency memory and with high bandwidth I/O,” Ouyang explains. “The XPU’s DSP units in the FPGA provide parallelism, the off-chip DDR4 and HBM interface push on the data movement side and the on-chip SRAM provide the memory characteristics required.’”
According to Hemsoth’s article, “The XPU has 256 cores clustered with one shared memory for data synchronization… Somehow the all 256 cores are running at 600MHz.”
For more details, see Hemsoth’s article on the NextPlatform.com Web site.
Two new papers, one about hardware and one about software, describe the Snowflake CNN accelerator and accompanying Torch7 compiler developed by several researchers at Purdue U. The papers are titled “Snowflake: A Model Agnostic Accelerator for Deep Convolutional Neural Networks” (the hardware paper) and “Compiling Deep Learning Models for Custom Hardware Accelerators” (the software paper). The authors of both papers are Andre Xian Ming Chang, Aliasger Zaidy, Vinayak Gokhale, and Eugenio Culurciello from Purdue’s School of Electrical and Computer Engineering and the Weldon School of Biomedical Engineering.
In the abstract, the hardware paper states:
“Snowflake, implemented on a Xilinx Zynq XC7Z045 SoC is capable of achieving a peak throughput of 128 G-ops/s and a measured throughput of 100 frames per second and 120 G-ops/s on the AlexNet CNN model, 36 frames per second and 116 Gops/s on the GoogLeNet CNN model and 17 frames per second and 122 G-ops/s on the ResNet-50 CNN model. To the best of our knowledge, Snowflake is the only implemented system capable of achieving over 91% efficiency on modern CNNs and the only implemented system with GoogLeNet and ResNet as part of the benchmark suite.”
The primary goal of the Snowflake accelerator design was computational efficiency. Efficiency and bandwidth are the two primary factors influencing accelerator throughput. The hardware paper says that the Snowflake accelerator achieves 95% computational efficiency and that it can process networks in real time. Because it is implemented on a Xilinx Zynq Z-7045, power consumption is a miserly 5W according to the software paper, well within the power budget of many embedded systems.
The hardware paper also states:
“Snowflake with 256 processing units was synthesized on Xilinx's Zynq XC7Z045 FPGA. At 250MHz, AlexNet achieved in 93:6 frames/s and 1:2GB/s of off-chip memory bandwidth, and 21:4 frames/s and 2:2GB/s for ResNet18.”
Here’s a block diagram of the Snowflake machine architecture from the software paper, from the micro level on the left to the macro level on the right:
There’s room for future performance improvement notes the hardware paper:
“The Zynq XC7Z045 device has 900 MAC units. Scaling Snowflake up by using three compute clusters, we will be able to utilize 768 MAC units. Assuming an accelerator frequency of 250 MHz, Snowflake will be able to achieve a peak performance of 384 G-ops/s. Snowflake can be scaled further on larger FPGAs by increasing the number of clusters.”
This is where I point out that a Zynq Z-7100 SoC has 2020 “MAC units” (actually, DSP48E1 slices)—which is a lot more than you find on the Zynq Z-7045 SoC—and the Zynq UltraScale+ ZU15EG MPSoC has 3528 DSP48E2 slices—which is much, much larger still. If speed and throughput are what you desire in a CNN accelerator, then either of these parts would be worthy of consideration for further development.
Brian Bailey has just posted an excellent tutorial article titled “CCIX Enables Machine Learning” on the Semiconductor Engineering Web site. The article discusses use of the CCIX high-speed, coherent chip-to-chip I/O standard and its use for machine-learning applications. As it states on the CCIX Consortium Web site:
“CCIX was founded to enable a new class of interconnect focused on emerging acceleration applications such as machine learning, network processing, storage off-load, in-memory data base and 4G/5G wireless technology.
“The standard allows processors based on different instruction set architectures to extend the benefits of cache coherent, peer processing to a number of acceleration devices including FPGAs, GPUs, network/storage adapters, intelligent networks and custom ASICs.”
“Today, machine learning is based on tasks that have a very deep pipeline. ‘Everyone talks about the amount of compute required, and that is why GPUs are doing well,’ says [Vice President of architecture and verification at Xilinx and chair of the CCIX consortium Gaurav] Singh. ‘They have a lot of compute engines, but the bigger problem is actually the data movement. You may want to enable a model where the GPU is doing the training and the inference is being done by the FPGA. Now you have a lot of data sharing for all of the weights being generated by the GPU, and those are being transferred over to the FPGA for inference. You also may have backward propagation and forward propagation. Forward propagation could be done by the FPGAs, backward by the GPU, but the key thing is still that data movement. They can all work efficiently together if they can share the same data.’”
For more information about CCIX, see:
Korea-based ATUS (Across The Universe) has developed a working automotive vision sensor that recognizes objects such as cars and pedestrians using a 17.53frames/sec video stream. A CNN (convolutional neural network) performs the object recognition on 20 different object classes and runs in the programmable logic fabric on a Xilinx Zynq Z7045 SoC. The programmable logic clocks at 200MHz and the entire design draws 10.432W. That’s about 10% of the power required by CPUs or GPUs to implement this CNN.
Here’s a block diagram of the recognition engine in the Zynq SoC’s programmable logic fabric:
ATUS’ Object-Recognition CNN runs in the programmable logic fabric of a Zynq Z7045 SoC
Here’s a short video of ATUS’ Automotive Vision Sensor in action, running on a Xilinx ZC106 eval kit:
Please contact ATUS for more information about their Automotive Vision Sensor.
Voice control is hot. Witness Amazon Echo and Google Home. These products work because they’re designed to recognize the spoken word from a distance—far-field speech recognition. It’s a useful capability in a wide range of consumer, medical, and industrial applications and SoundAI now has a kit you can use far-field speech recognition to differentiate your next system design whether it’s a smart speaker; an in-vehicle, speech-based control system; a voice-controlled IoT or IIoT device; or some other never-seen-before device. The SoundAI 60C MicA Development Kit employs FPGA-accelerated machine learning and FPGA-based signal processing to implement advanced audio noise suppression, de-reverberation, echo cancellation, direction-of-arrival detection, and beamforming. The FPGA acceleration is performed by a Xilinx Spartan-6 SLX4 FPGA. (There’s also an available version built into a smart speaker.)
SoundAI 60C MicA Development Kit for Far-Field Speech Recognition
The SoundAI MicA Development Kit’s circular circuit board measures 3.15 inches (80mm) in diameter and incorporates 7 MEMS microphones and 32 LEDs in addition to the Spartan-6 FPGA. According to SoundAI, the kit can capture voice from as far as 5m away, detect commands embedded in the 360-degree ambient sound, localize the voice to within ±10°, and deliver clean audio to the speech-recognition engine (Alexa for English and SoundAI for Chinese).