

Two new papers, one about hardware and one about software, describe the Snowflake CNN accelerator and accompanying Torch7 compiler developed by several researchers at Purdue U. The papers are titled “Snowflake: A Model Agnostic Accelerator for Deep Convolutional Neural Networks” (the hardware paper) and “Compiling Deep Learning Models for Custom Hardware Accelerators” (the software paper). The authors of both papers are Andre Xian Ming Chang, Aliasger Zaidy, Vinayak Gokhale, and Eugenio Culurciello from Purdue’s School of Electrical and Computer Engineering and the Weldon School of Biomedical Engineering.


In the abstract, the hardware paper states:



“Snowflake, implemented on a Xilinx Zynq XC7Z045 SoC is capable of achieving a peak throughput of 128 G-ops/s and a measured throughput of 100 frames per second and 120 G-ops/s on the AlexNet CNN model, 36 frames per second and 116 Gops/s on the GoogLeNet CNN model and 17 frames per second and 122 G-ops/s on the ResNet-50 CNN model. To the best of our knowledge, Snowflake is the only implemented system capable of achieving over 91% efficiency on modern CNNs and the only implemented system with GoogLeNet and ResNet as part of the benchmark suite.”



The primary goal of the Snowflake accelerator design was computational efficiency. Efficiency and bandwidth are the two primary factors influencing accelerator throughput. The hardware paper says that the Snowflake accelerator achieves 95% computational efficiency and that it can process networks in real time. Because it is implemented on a Xilinx Zynq Z-7045, power consumption is a miserly 5W according to the software paper, well within the power budget of many embedded systems.
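To make the relationship between peak throughput, measured throughput, and efficiency concrete, here's a quick back-of-the-envelope calculation using only the figures quoted from the hardware paper's abstract. The per-frame op counts are derived (measured G-ops/s divided by frames/s), not taken from the paper directly:

```python
# Back-of-the-envelope throughput model for the Snowflake accelerator.
# All inputs come from the hardware paper's abstract; the per-frame
# op counts are implied by those numbers, not measured independently.

PEAK_GOPS = 128.0  # peak throughput of the Zynq Z-7045 implementation

models = {
    # model: (measured G-ops/s, measured frames/s)
    "AlexNet":   (120.0, 100.0),
    "GoogLeNet": (116.0, 36.0),
    "ResNet-50": (122.0, 17.0),
}

for name, (gops, fps) in models.items():
    efficiency = gops / PEAK_GOPS   # fraction of peak actually sustained
    gops_per_frame = gops / fps     # implied compute per frame
    print(f"{name}: {efficiency:.0%} efficient, ~{gops_per_frame:.1f} G-ops/frame")
```

Running this reproduces the paper's ">91% efficiency" claim for all three networks, which is the point: with efficiency pinned that close to peak, memory bandwidth becomes the remaining lever on throughput.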


The hardware paper also states:



“Snowflake with 256 processing units was synthesized on Xilinx's Zynq XC7Z045 FPGA. At 250MHz, AlexNet achieved 93.6 frames/s and 1.2GB/s of off-chip memory bandwidth, and 21.4 frames/s and 2.2GB/s for ResNet18.”



Here’s a block diagram of the Snowflake machine architecture from the software paper, from the micro level on the left to the macro level on the right:



Snowflake CNN Accelerator Block Diagram



The hardware paper notes that there’s room for future performance improvement:



“The Zynq XC7Z045 device has 900 MAC units. Scaling Snowflake up by using three compute clusters, we will be able to utilize 768 MAC units. Assuming an accelerator frequency of 250 MHz, Snowflake will be able to achieve a peak performance of 384 G-ops/s. Snowflake can be scaled further on larger FPGAs by increasing the number of clusters.”



This is where I point out that a Zynq Z-7100 SoC has 2020 “MAC units” (actually, DSP48E1 slices)—which is a lot more than you find on the Zynq Z-7045 SoC—and the Zynq UltraScale+ ZU15EG MPSoC has 3528 DSP48E2 slices—which is much, much larger still. If speed and throughput are what you desire in a CNN accelerator, then either of these parts would be worthy of consideration for further development.
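The paper's scaling arithmetic is easy to extend to the larger parts mentioned above. The sketch below follows the paper's own formula (each MAC performs two ops, a multiply and an add, per cycle); the assumption that every DSP slice on the bigger devices would map to one Snowflake MAC is mine, for illustration only:

```python
# Peak-throughput scaling estimate, following the hardware paper's
# arithmetic: peak ops/s = MAC units x 2 ops/MAC x clock frequency.
# Mapping one DSP slice to one MAC on the larger parts is an
# illustrative assumption, not a claim from the paper.

FREQ_HZ = 250e6  # accelerator clock frequency assumed in the paper

def peak_gops(mac_units: int, freq_hz: float = FREQ_HZ) -> float:
    """Peak throughput in G-ops/s: MACs x 2 ops x clock."""
    return mac_units * 2 * freq_hz / 1e9

print(peak_gops(768))    # 3 clusters on the Z-7045, per the paper -> 384.0
print(peak_gops(2020))   # Zynq Z-7100 DSP48E1 slices -> 1010.0
print(peak_gops(3528))   # Zynq UltraScale+ ZU15EG DSP48E2 slices -> 1764.0
```

By this (admittedly optimistic) yardstick, the ZU15EG offers more than 4x the peak throughput of the scaled-up Z-7045 design.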


Brian Bailey has just posted an excellent tutorial article titled “CCIX Enables Machine Learning” on the Semiconductor Engineering Web site. The article discusses the CCIX high-speed, coherent chip-to-chip I/O standard and its use for machine-learning applications. As it states on the CCIX Consortium Web site:


“CCIX was founded to enable a new class of interconnect focused on emerging acceleration applications such as machine learning, network processing, storage off-load, in-memory data base and 4G/5G wireless technology. 


“The standard allows processors based on different instruction set architectures to extend the benefits of cache coherent, peer processing to a number of acceleration devices including FPGAs, GPUs, network/storage adapters, intelligent networks and custom ASICs.”


Bailey writes:



“Today, machine learning is based on tasks that have a very deep pipeline. ‘Everyone talks about the amount of compute required, and that is why GPUs are doing well,’ says [Vice President of architecture and verification at Xilinx and chair of the CCIX consortium Gaurav] Singh. ‘They have a lot of compute engines, but the bigger problem is actually the data movement. You may want to enable a model where the GPU is doing the training and the inference is being done by the FPGA. Now you have a lot of data sharing for all of the weights being generated by the GPU, and those are being transferred over to the FPGA for inference. You also may have backward propagation and forward propagation. Forward propagation could be done by the FPGAs, backward by the GPU, but the key thing is still that data movement. They can all work efficiently together if they can share the same data.’”




For more information about CCIX, see the CCIX Consortium Web site.









Korea-based ATUS (Across The Universe) has developed a working automotive vision sensor that recognizes objects such as cars and pedestrians in a 17.53 frames/sec video stream. A CNN (convolutional neural network) performs the object recognition on 20 different object classes and runs in the programmable logic fabric on a Xilinx Zynq Z-7045 SoC. The programmable logic clocks at 200MHz and the entire design draws 10.432W. That’s about 10% of the power required by CPUs or GPUs to implement this CNN.


Here’s a block diagram of the recognition engine in the Zynq SoC’s programmable logic fabric:






ATUS’ Object-Recognition CNN runs in the programmable logic fabric of a Zynq Z7045 SoC




Here’s a short video of ATUS’ Automotive Vision Sensor in action, running on a Xilinx ZC706 eval kit:






Please contact ATUS for more information about their Automotive Vision Sensor.




SoundAI MicA Development Kit for Far-field Speech-Recognition Systems: Powered by Xilinx Spartan-6 FPGA

by Xilinx Employee, 07-11-2017 09:18 AM (edited 07-12-2017 10:49 AM)


Voice control is hot. Witness Amazon Echo and Google Home. These products work because they’re designed to recognize the spoken word from a distance—far-field speech recognition. It’s a useful capability in a wide range of consumer, medical, and industrial applications, and SoundAI now has a kit you can use to add far-field speech recognition that differentiates your next system design, whether it’s a smart speaker; an in-vehicle, speech-based control system; a voice-controlled IoT or IIoT device; or some other never-before-seen device. The SoundAI 60C MicA Development Kit employs FPGA-accelerated machine learning and FPGA-based signal processing to implement advanced audio noise suppression, de-reverberation, echo cancellation, direction-of-arrival detection, and beamforming. The FPGA acceleration is performed by a Xilinx Spartan-6 SLX4 FPGA. (There’s also an available version built into a smart speaker.)






SoundAI 60C MicA Development Kit for Far-Field Speech Recognition



The SoundAI MicA Development Kit’s circular circuit board measures 3.15 inches (80mm) in diameter and incorporates 7 MEMS microphones and 32 LEDs in addition to the Spartan-6 FPGA. According to SoundAI, the kit can capture voice from as far as 5m away, detect commands embedded in the 360-degree ambient sound, localize the voice to within ±10°, and deliver clean audio to the speech-recognition engine (Alexa for English and SoundAI for Chinese).
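To give a feel for the kind of signal processing the kit's FPGA performs, here's a minimal delay-and-sum beamformer for a 7-mic circular array (six mics on a ring plus one at the center, a common layout). The geometry, sample rate, and integer-sample alignment are illustrative assumptions of mine, not SoundAI specifications:

```python
import numpy as np

# Minimal delay-and-sum beamformer sketch: steer a 7-mic circular
# array toward a known direction of arrival. Geometry and sample
# rate are assumptions for illustration, not SoundAI's design.

SPEED_OF_SOUND = 343.0  # m/s
FS = 16_000             # sample rate in Hz (assumed)
RADIUS = 0.04           # mic-ring radius in meters (assumed)

# Mic positions: 6 on a circle plus 1 at the center, shape (7, 2).
angles = np.arange(6) * np.pi / 3
mic_xy = np.vstack([
    np.stack([RADIUS * np.cos(angles), RADIUS * np.sin(angles)], axis=1),
    [[0.0, 0.0]],
])

def delay_and_sum(frames: np.ndarray, doa_rad: float) -> np.ndarray:
    """Beamform toward azimuth doa_rad under a far-field (plane-wave) model.

    frames: (n_mics, n_samples) array of simultaneously sampled mic signals.
    Returns the beamformed mono signal.
    """
    direction = np.array([np.cos(doa_rad), np.sin(doa_rad)])
    # A mic displaced toward the source hears the wavefront early by
    # (position . direction) / c seconds; delay it by that much to align.
    delays = mic_xy @ direction / SPEED_OF_SOUND
    shifts = np.round(delays * FS).astype(int)  # integer-sample alignment
    out = np.zeros(frames.shape[1])
    for sig, s in zip(frames, shifts):
        out += np.roll(sig, s)  # circular shift; fine for a short sketch
    return out / len(frames)
```

Signals arriving from the steered direction add coherently while off-axis sound is attenuated; a production implementation (like the kit's) would use fractional delays and combine this with echo cancellation and de-reverberation.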



About the Author
Steve Leibson is the Director of Strategic Marketing and Business Planning at Xilinx. He started as a system design engineer at HP in the early days of desktop computing, then switched to EDA at Cadnetix, and subsequently became a technical editor for EDN Magazine. He's served as Editor in Chief of EDN Magazine, Embedded Developers Journal, and Microprocessor Report. He has extensive experience in computing, microprocessors, microcontrollers, embedded systems design, design IP, EDA, and programmable logic.