
 

BrainChip Holdings has just announced the BrainChip Accelerator, a PCIe server-accelerator card that simultaneously processes 16 channels of video in a variety of video formats using spiking neural networks rather than convolutional neural networks (CNNs). The BrainChip Accelerator card is based on a 6-core implementation of BrainChip’s Spiking Neural Network (SNN) processor instantiated in an on-board Xilinx Kintex UltraScale FPGA.

 

Here’s a photo of the BrainChip Accelerator card:

 

 


 

BrainChip Accelerator card with six SNNs instantiated in a Kintex UltraScale FPGA

 

 

 

Each BrainChip core performs fast, user-defined image scaling, spike generation, and SNN comparison to recognize objects. The SNNs can be trained using low-resolution images as small as 20x20 pixels. According to BrainChip, SNNs as implemented in the BrainChip Accelerator cores excel at recognizing objects in low-light, low-resolution, and noisy environments.

 

The BrainChip Accelerator card can process 16 channels of video simultaneously with an effective throughput of more than 600 frames per second while dissipating a mere 15W for the entire card. According to BrainChip, that’s a 7x improvement in frames/sec/watt when compared to a GPU-accelerated, CNN-based deep-learning implementation for neural networks like GoogLeNet and AlexNet. Here’s a graph from BrainChip illustrating this claim:
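As a rough check on the numbers quoted above, the frames/sec/watt figure can be worked out directly. This is a back-of-the-envelope sketch only: it treats "more than 600 frames per second" as exactly 600, and the implied GPU baseline is derived from the 7x claim rather than taken from any BrainChip data.

```python
# Figures quoted in the post.
card_fps = 600.0      # "more than 600 frames per second", used here as a lower bound
card_watts = 15.0     # power dissipation for the entire card
claimed_gain = 7.0    # BrainChip's claimed frames/sec/watt advantage over a GPU


def frames_per_sec_per_watt(fps: float, watts: float) -> float:
    """Throughput efficiency: frames processed per second for each watt."""
    return fps / watts


card_efficiency = frames_per_sec_per_watt(card_fps, card_watts)

# Assumption: the GPU baseline is simply the card's efficiency divided by
# the claimed gain. BrainChip does not state this number in the post.
implied_gpu_efficiency = card_efficiency / claimed_gain

print(f"BrainChip card:       {card_efficiency:.1f} frames/sec/W")
print(f"Implied GPU baseline: {implied_gpu_efficiency:.1f} frames/sec/W")
```

Because the 600 frames/sec figure is a floor, the card's real efficiency could be somewhat higher than this estimate.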

 

 

 

BrainChip Efficiency Chart.jpg 

 

 

 

 

SNNs mimic human brain function (synaptic connections, neuron thresholds) more closely than do CNNs and rely on models based on spike timing and intensity. Here’s a graphic from BrainChip comparing a CNN model with the Spiking Neural Network model:

 

 

 

 

BrainChip Spiking Neural Network comparison.jpg 

 

 

For more information about the BrainChip Accelerator card, please contact BrainChip directly.

 

 

 

A new open-source tool named GUINNESS makes it easy for you to develop binarized (2-valued) neural networks (BNNs) for Zynq SoCs and Zynq UltraScale+ MPSoCs using the SDSoC Development Environment. GUINNESS is a GUI-based tool that uses the Chainer deep-learning framework to train a binarized CNN. In a paper titled “On-Chip Memory Based Binarized Convolutional Deep Neural Network Applying Batch Normalization Free Technique on an FPGA,” presented at the recent 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, authors Haruyoshi Yonekawa and Hiroki Nakahara describe a system they developed to implement a binarized CNN for the VGG-16 benchmark on the Xilinx ZCU102 Eval Kit, which is based on a Zynq UltraScale+ ZU9EG MPSoC. Nakahara presented the GUINNESS tool again this week at FPL2017 in Ghent, Belgium.
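To make "binarized" concrete, here is a minimal sketch of the general idea behind BNN arithmetic: when weights and activations are restricted to +1 and -1, a dot product reduces to an XNOR followed by a bit count. This illustrates the generic technique only. It is not taken from GUINNESS or from the paper, and the function names and the `threshold` parameter are hypothetical.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int; bit i is set when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, w_bits, n):
    """Dot product of two n-element +1/-1 vectors stored as packed bits."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ w_bits) & mask      # a 1 bit wherever the two signs agree
    matches = bin(xnor).count("1")        # popcount
    return 2 * matches - n                # agreements minus disagreements


def binary_neuron(activations, weights, threshold=0):
    """Hypothetical binarized neuron: sign of the dot product against a threshold."""
    n = len(activations)
    s = binary_dot(pack_bits(activations), pack_bits(weights), n)
    return 1 if s >= threshold else -1


a = [1, -1, 1, 1]
w = [1, 1, -1, 1]
print(binary_dot(pack_bits(a), pack_bits(w), len(a)))  # same as sum(x * y for x, y in zip(a, w))
```

In programmable logic the XNOR and bit count map onto lookup tables rather than multipliers, which is one reason binarized networks suit FPGAs.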

 

According to the IEEE paper, the Zynq-based BNN is 136.8x faster and 44.7x more power efficient than the same CNN running on an ARM Cortex-A57 processor. Compared to the same CNN running on an Nvidia Maxwell GPU, the Zynq-based BNN is 4.9x faster and 3.8x more power efficient.

 

GUINNESS is now available on GitHub.

 

 

 


 

 

Xilinx ZCU102 Zynq UltraScale+ MPSoC Eval Kit

 

 

 

 

 

 

 

Xilinx has announced at HUAWEI CONNECT 2017 that Huawei’s new, accelerated cloud service and its FPGA Accelerated Cloud Server (FACS) are based on Xilinx Virtex UltraScale+ VU9P FPGAs. The Huawei FACS platform allows users to develop, deploy, and publish new FPGA-based services and applications on the Huawei Public Cloud with a 10-50x speed-up for compute-intensive cloud applications such as machine learning, data analytics, and video processing. Huawei has more than 15 years of experience in the development of FPGA systems for telecom and data center markets. "The Huawei FACS is a fully integrated hardware and software platform offering developer-to-deployment support with best-in-class industry tool chains and access to Huawei's significant FPGA engineering expertise," said Steve Langridge, Director, Central Hardware Institute, Huawei Canada Research Center.

 

The FPGA Accelerated Cloud Server is available on the Huawei Public Cloud today. To register for the public beta, please visit http://www.hwclouds.com/product/fcs.html. For more information on the Huawei Cloud, please visit www.huaweicloud.com.

 

 

For more information, see this page.

 

 

Baidu details FPGA-based Cloud acceleration with 256-core XPU today at Hot Chips in Cupertino, CA

by Xilinx Employee, 08-22-2017 11:38 AM (edited 08-22-2017 11:40 AM)

 

Last October, Xcell Daily covered Baidu’s announcement that it was using Xilinx Kintex UltraScale FPGAs to accelerate cloud-based applications. (See “Baidu Adopts Xilinx Kintex UltraScale FPGAs to Accelerate Machine Learning Applications in the Data Center.”) Today, Baidu discussed more architectural particulars of its FPGA-acceleration efforts at the Hot Chips conference in Cupertino, California, according to Nicole Hemsoth’s article on the NextPlatform.com site (“An Early Look at Baidu’s Custom AI and Analytics Processor”).

 

Hemsoth writes:

 

“…Baidu has a new processor up its sleeve called the XPU… The architecture they designed is aimed at this diversity with an emphasis on compute-intensive, rule-based workloads while maximizing efficiency, performance and flexibility, says Baidu researcher, Jian Ouyang. He unveiled the XPU today at the Hot Chips conference along with co-presenters from FPGA maker, Xilinx…

 

“‘The FPGA is efficient and can be aimed at specific workloads but lacks programmability,’ Ouyang explains. ‘Traditional CPUs are good for general workloads, especially those that are rule-based and they are very flexible. GPUs aim at massive parallelism and have high performance. The XPU is aimed at diverse workloads that are compute-intensive and rule-based with high efficiency and performance with the flexibility of a CPU,’ Ouyang says. The part that is still lagging, as is always the case when FPGAs are involved, is the programmability aspect. As of now there is no compiler, but he says the team is working to develop one…

 

“‘To support matrix, convolutional, and other big and small kernels we need a massive math array with high bandwidth, low latency memory and with high bandwidth I/O,’ Ouyang explains. ‘The XPU’s DSP units in the FPGA provide parallelism, the off-chip DDR4 and HBM interface push on the data movement side and the on-chip SRAM provide the memory characteristics required.’”

 

According to Hemsoth’s article, “The XPU has 256 cores clustered with one shared memory for data synchronization… Somehow the all 256 cores are running at 600MHz.”

 

For more details, see Hemsoth’s article on the NextPlatform.com Web site.

 

 

Two new papers, one about hardware and one about software, describe the Snowflake CNN accelerator and accompanying Torch7 compiler developed by several researchers at Purdue University. The papers are titled “Snowflake: A Model Agnostic Accelerator for Deep Convolutional Neural Networks” (the hardware paper) and “Compiling Deep Learning Models for Custom Hardware Accelerators” (the software paper). The authors of both papers are Andre Xian Ming Chang, Aliasger Zaidy, Vinayak Gokhale, and Eugenio Culurciello from Purdue’s School of Electrical and Computer Engineering and the Weldon School of Biomedical Engineering.

 

In the abstract, the hardware paper states:

 

 

“Snowflake, implemented on a Xilinx Zynq XC7Z045 SoC is capable of achieving a peak throughput of 128 G-ops/s and a measured throughput of 100 frames per second and 120 G-ops/s on the AlexNet CNN model, 36 frames per second and 116 Gops/s on the GoogLeNet CNN model and 17 frames per second and 122 G-ops/s on the ResNet-50 CNN model. To the best of our knowledge, Snowflake is the only implemented system capable of achieving over 91% efficiency on modern CNNs and the only implemented system with GoogLeNet and ResNet as part of the benchmark suite.”

 

 

The primary goal of the Snowflake accelerator design was computational efficiency. Efficiency and bandwidth are the two primary factors influencing accelerator throughput. The hardware paper says that the Snowflake accelerator achieves 95% computational efficiency and that it can process networks in real time. Because it is implemented on a Xilinx Zynq Z-7045, power consumption is a miserly 5W according to the software paper, well within the power budget of many embedded systems.
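The efficiency figures can be sanity-checked against the throughput numbers in the abstract quoted earlier. The sketch below assumes that computational efficiency means measured G-ops/s divided by peak G-ops/s; the paper may define it differently, and the quoted throughputs are rounded, so the results are approximate.

```python
# Throughput figures from the hardware paper's abstract, as quoted in this post.
PEAK_GOPS = 128.0
measured_gops = {
    "AlexNet": 120.0,
    "GoogLeNet": 116.0,
    "ResNet-50": 122.0,
}


def efficiency(measured: float, peak: float = PEAK_GOPS) -> float:
    """Assumed definition: fraction of peak throughput actually achieved."""
    return measured / peak


for model, gops in measured_gops.items():
    print(f"{model}: {efficiency(gops):.1%} of peak")
```

Under this assumed definition, all three networks land in the low-to-mid 90% range, in line with the efficiency claims the papers make.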

 

The hardware paper also states:

 

 

“Snowflake with 256 processing units was synthesized on Xilinx's Zynq XC7Z045 FPGA. At 250MHz, AlexNet achieved in 93.6 frames/s and 1.2GB/s of off-chip memory bandwidth, and 21.4 frames/s and 2.2GB/s for ResNet18.”

 

 

Here’s a block diagram of the Snowflake machine architecture from the software paper, from the micro level on the left to the macro level on the right:

 

 

Snowflake CNN Accelerator Block Diagram.jpg 

 

 

There’s room for future performance improvement, notes the hardware paper:

 

 

“The Zynq XC7Z045 device has 900 MAC units. Scaling Snowflake up by using three compute clusters, we will be able to utilize 768 MAC units. Assuming an accelerator frequency of 250 MHz, Snowflake will be able to achieve a peak performance of 384 G-ops/s. Snowflake can be scaled further on larger FPGAs by increasing the number of clusters.”

 

 

This is where I point out that a Zynq Z-7100 SoC has 2020 “MAC units” (actually, DSP48E1 slices), which is a lot more than you find on the Zynq Z-7045 SoC, and the Zynq UltraScale+ ZU15EG MPSoC has 3528 DSP48E2 slices, which is far more still. If speed and throughput are what you desire in a CNN accelerator, then either of these parts would be worthy of consideration for further development.
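The peak-performance figures quoted from the paper follow from a simple product: MAC units times operations per MAC times clock frequency. The sketch below assumes two operations per MAC per cycle (one multiply, one accumulate), which reproduces both the 128 G-ops/s and 384 G-ops/s figures. The numbers for the larger devices are hypothetical ceilings that assume every DSP slice serves as a MAC unit at the same 250MHz clock; they are not results from either paper.

```python
OPS_PER_MAC = 2  # assumption: one multiply plus one accumulate per clock cycle


def peak_gops(mac_units: int, clock_mhz: float) -> float:
    """Peak throughput in G-ops/s for a given MAC count and clock rate."""
    return mac_units * OPS_PER_MAC * clock_mhz / 1000.0


# Configurations described in the hardware paper.
print(peak_gops(256, 250))   # current Snowflake, 256 processing units
print(peak_gops(768, 250))   # three compute clusters on the Zynq Z-7045

# Hypothetical upper bounds if every DSP slice were used as a MAC at 250MHz.
print(peak_gops(2020, 250))  # Zynq Z-7100, DSP48E1 slices
print(peak_gops(3528, 250))  # Zynq UltraScale+ ZU15EG, DSP48E2 slices
```

A real design would fall short of these ceilings, because routing, memory bandwidth, and clustering granularity limit how many slices can be kept busy.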

 

Brian Bailey has just posted an excellent tutorial article titled “CCIX Enables Machine Learning” on the Semiconductor Engineering Web site. The article discusses the CCIX high-speed, coherent chip-to-chip I/O standard and its use in machine-learning applications. As the CCIX Consortium Web site states:

 

“CCIX was founded to enable a new class of interconnect focused on emerging acceleration applications such as machine learning, network processing, storage off-load, in-memory data base and 4G/5G wireless technology. 

 

“The standard allows processors based on different instruction set architectures to extend the benefits of cache coherent, peer processing to a number of acceleration devices including FPGAs, GPUs, network/storage adapters, intelligent networks and custom ASICs.”

 

Bailey writes:

 

 

“Today, machine learning is based on tasks that have a very deep pipeline. ‘Everyone talks about the amount of compute required, and that is why GPUs are doing well,’ says [Vice President of architecture and verification at Xilinx and chair of the CCIX consortium Gaurav] Singh. ‘They have a lot of compute engines, but the bigger problem is actually the data movement. You may want to enable a model where the GPU is doing the training and the inference is being done by the FPGA. Now you have a lot of data sharing for all of the weights being generated by the GPU, and those are being transferred over to the FPGA for inference. You also may have backward propagation and forward propagation. Forward propagation could be done by the FPGAs, backward by the GPU, but the key thing is still that data movement. They can all work efficiently together if they can share the same data.’”

 

 

 


 

 

 

 

 

 

 

 

Korea-based ATUS (Across The Universe) has developed a working automotive vision sensor that recognizes objects such as cars and pedestrians using a 17.53 frames/sec video stream. A CNN (convolutional neural network) performs the object recognition on 20 different object classes and runs in the programmable logic fabric on a Xilinx Zynq Z-7045 SoC. The programmable logic clocks at 200MHz and the entire design draws 10.432W. That’s about 10% of the power required by CPUs or GPUs to implement this CNN.
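These figures translate into an energy cost per recognized frame. The sketch below uses only the numbers quoted above; the implied CPU/GPU power is an estimate derived from the "about 10%" statement, not a figure supplied by ATUS.

```python
# Figures quoted in the post.
frame_rate_fps = 17.53   # video stream processed by the CNN
design_watts = 10.432    # power drawn by the entire design
power_fraction = 0.10    # "about 10%" of the CPU/GPU power for the same CNN


def joules_per_frame(watts: float, fps: float) -> float:
    """Energy spent on each frame: power divided by frame rate."""
    return watts / fps


energy = joules_per_frame(design_watts, frame_rate_fps)

# Assumption: the CPU/GPU figure is the design power scaled up by the
# quoted fraction. This is an estimate, not a measured value.
implied_cpu_gpu_watts = design_watts / power_fraction

print(f"Energy per frame:      {energy:.3f} J")
print(f"Implied CPU/GPU power: {implied_cpu_gpu_watts:.0f} W")
```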

 

Here’s a block diagram of the recognition engine in the Zynq SoC’s programmable logic fabric:

 

 

 


 

ATUS’ Object-Recognition CNN runs in the programmable logic fabric of a Zynq Z-7045 SoC

 

 

 

Here’s a short video of ATUS’ Automotive Vision Sensor in action, running on a Xilinx ZC706 eval kit:

 

 

 

 

 

Please contact ATUS for more information about their Automotive Vision Sensor.

 

 

 

SoundAI MicA Development Kit for Far-field Speech-Recognition Systems: Powered by Xilinx Spartan-6 FPGA

by Xilinx Employee, 07-11-2017 09:18 AM (edited 07-12-2017 10:49 AM)

 

Voice control is hot. Witness Amazon Echo and Google Home. These products work because they’re designed to recognize the spoken word from a distance, a capability known as far-field speech recognition. It’s useful in a wide range of consumer, medical, and industrial applications, and SoundAI now has a kit you can use to add far-field speech recognition to your next system design and differentiate it, whether it’s a smart speaker; an in-vehicle, speech-based control system; a voice-controlled IoT or IIoT device; or some other never-seen-before device. The SoundAI 60C MicA Development Kit employs FPGA-accelerated machine learning and FPGA-based signal processing to implement advanced audio noise suppression, de-reverberation, echo cancellation, direction-of-arrival detection, and beamforming. The FPGA acceleration is performed by a Xilinx Spartan-6 SLX4 FPGA. (There’s also an available version built into a smart speaker.)

 

 

 


 

SoundAI 60C MicA Development Kit for Far-Field Speech Recognition

 

 

The SoundAI MicA Development Kit’s circular circuit board measures 3.15 inches (80mm) in diameter and incorporates 7 MEMS microphones and 32 LEDs in addition to the Spartan-6 FPGA. According to SoundAI, the kit can capture voice from as far as 5m away, detect commands embedded in the 360-degree ambient sound, localize the voice to within ±10°, and deliver clean audio to the speech-recognition engine (Alexa for English and SoundAI for Chinese).

 

 

About the Author
  • Steve Leibson is the Director of Strategic Marketing and Business Planning at Xilinx. He started as a system design engineer at HP in the early days of desktop computing, then switched to EDA at Cadnetix, and subsequently became a technical editor for EDN Magazine. He's served as Editor in Chief of EDN Magazine, Embedded Developers Journal, and Microprocessor Report. He has extensive experience in computing, microprocessors, microcontrollers, embedded systems design, design IP, EDA, and programmable logic.