RedZone Robotics’ Solo—a camera-equipped, autonomous sewer-inspection robot—gives operators a detailed, illuminated view of the inside of a sewer pipe by crawling the pipe’s length and recording video of the conditions it finds. A crew can deploy a Solo robot in less than 15 minutes and then move to another site to launch another Solo, conducting several inspections simultaneously and cutting the cost per inspection. The treaded robot traverses the pipeline autonomously and then returns to the launch point for retrieval. If it encounters an obstruction or blockage, it attempts to negotiate the problem three times before aborting the inspection and returning to its entry point. The robot fits into pipes as small as eight inches in diameter and even operates in pipes that contain some residual waste water.
RedZone Robotics Autonomous Sewer-Pipe Inspection Robot
Justin Starr, RedZone’s VP of Technology, says that the Solo inspection robot uses its on-board Spartan FPGA for image processing and for AI. Image-processing algorithms compensate for lens aberrations and also perform a level of sensor fusion for the robot’s multiple sensors. “Crucial” AI routines in the Spartan FPGA help the robot keep track of where it is in the pipeline and tell the robot what to do when it encounters an obstruction.
Starr also says that RedZone is already evaluating Xilinx Zynq devices to extend the robot’s capabilities. “It’s not enough for the Solo to just grab information about what it sees, but let’s actually look at those images. Let’s have the Solo go through that inspection data in real time and generate a preliminary report of what it saw. It used to be the stuff of science fiction, but now it’s becoming reality.”
Want to see the Solo in action? Here’s a 3-minute video:
BrainChip Holdings has just announced the BrainChip Accelerator, a PCIe server-accelerator card that simultaneously processes 16 channels of video in a variety of video formats using spiking neural networks rather than convolutional neural networks (CNNs). The BrainChip Accelerator card is based on a 6-core implementation of BrainChip’s Spiking Neural Network (SNN) processor instantiated in an on-board Xilinx Kintex UltraScale FPGA.
Here’s a photo of the BrainChip Accelerator card:
BrainChip Accelerator card with six SNNs instantiated in a Kintex UltraScale FPGA
Each BrainChip core performs fast, user-defined image scaling, spike generation, and SNN comparison to recognize objects. The SNNs can be trained using low-resolution images as small as 20x20 pixels. According to BrainChip, SNNs as implemented in the BrainChip Accelerator cores excel at recognizing objects in low-light, low-resolution, and noisy environments.
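BrainChip hasn’t published the internals of its spike-generation step, but the general idea behind turning image pixels into spikes (rate coding: brighter pixels fire more often) can be sketched in a few lines of Python. Everything here (the function name, the time-step count, the Bernoulli sampling) is illustrative, not BrainChip’s actual method:

```python
import numpy as np

def rate_code(image, n_steps=32, seed=0):
    """Convert a grayscale image with values in [0, 1] into spike
    trains: each pixel fires on each time step with probability
    equal to its intensity, so brighter pixels spike more often."""
    rng = np.random.default_rng(seed)
    return rng.random((n_steps, *image.shape)) < image

# A 20x20 toy image, matching the low-resolution training size cited above.
img = np.linspace(0.0, 1.0, 400).reshape(20, 20)
spikes = rate_code(img)
print(spikes.shape)  # (32, 20, 20) boolean spike raster
```

A real SNN core would then compare these spike patterns against learned patterns; the point of the sketch is only that generating spikes from low-resolution input is computationally trivial, which is consistent with the low-power positioning of the card.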
The BrainChip Accelerator card can process 16 channels of video simultaneously with an effective throughput of more than 600 frames per second while dissipating a mere 15W for the entire card. According to BrainChip, that’s a 7x improvement in frames/sec/watt over a GPU-accelerated, CNN-based deep-learning implementation running networks like GoogLeNet and AlexNet. Here’s a graph from BrainChip illustrating this claim:
SNNs mimic human brain function (synaptic connections, neuron thresholds) more closely than do CNNs and rely on models based on spike timing and intensity. Here’s a graphic from BrainChip comparing a CNN model with the Spiking Neural Network model:
For more information about the BrainChip Accelerator card, please contact BrainChip directly.
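BrainChip’s frames/sec/watt comparison is easy to sanity-check from the numbers above. The sketch below just does the arithmetic; the GPU-side figure is implied by the 7x claim rather than stated directly by BrainChip:

```python
# Card-level figure of merit from the numbers BrainChip quotes.
fps_total = 600          # effective frames/sec across all 16 video channels
card_power_w = 15        # dissipation for the entire card
snn_fps_per_watt = fps_total / card_power_w   # 40 frames/sec/watt

# The claimed 7x advantage implies the GPU-accelerated CNN baseline
# delivers roughly one seventh of that.
gpu_fps_per_watt = snn_fps_per_watt / 7
print(snn_fps_per_watt, round(gpu_fps_per_watt, 2))
```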
A new open-source tool named GUINNESS makes it easy for you to develop binarized (2-valued) neural networks (BNNs) for Zynq SoCs and Zynq UltraScale+ MPSoCs using the SDSoC Development Environment. GUINNESS is a GUI-based tool that uses the Chainer deep-learning framework to train a binarized CNN. In a paper titled “On-Chip Memory Based Binarized Convolutional Deep Neural Network Applying Batch Normalization Free Technique on an FPGA,” presented at the recent 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, authors Haruyoshi Yonekawa and Hiroki Nakahara describe a system they developed to implement a binarized CNN for the VGG-16 benchmark on the Xilinx ZCU102 Eval Kit, which is based on a Zynq UltraScale+ ZU9EG MPSoC. Nakahara presented the GUINNESS tool again this week at FPL2017 in Ghent, Belgium.
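The appeal of binarizing weights and activations to ±1 is that a multiply-accumulate collapses into XNOR plus popcount, which maps onto FPGA LUTs instead of DSP multipliers. Here’s a minimal Python sketch of that equivalence; it illustrates the principle only and is not code from GUINNESS or the paper:

```python
import numpy as np

def binarize(x):
    """Binarize to +1/-1 with the sign function (0 maps to +1)."""
    return np.where(x >= 0, 1, -1)

def bnn_dot(a, w):
    """Dot product of two +1/-1 vectors computed the BNN way:
    XNOR the sign bits, popcount the matches, then rescale."""
    a_bits = a > 0                  # encode +1 as True, -1 as False
    w_bits = w > 0
    matches = ~(a_bits ^ w_bits)    # XNOR: True where the signs agree
    return 2 * int(matches.sum()) - len(a)

rng = np.random.default_rng(1)
a = binarize(rng.standard_normal(256))   # binarized activations
w = binarize(rng.standard_normal(256))   # binarized weights
print(bnn_dot(a, w), int(a @ w))         # the two results are identical
```

This XNOR/popcount datapath is the kind of structure a binarized CNN maps onto the Zynq programmable logic, which is where the speedups over the embedded ARM cores reported below come from.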
According to the IEEE paper, the Zynq-based BNN is 136.8x faster and 44.7x more power efficient than the same CNN running on an ARM Cortex-A57 processor. Compared to the same CNN running on an Nvidia Maxwell GPU, the Zynq-based BNN is 4.9x faster and 3.8x more power efficient.
Xilinx ZCU102 Zynq UltraScale+ MPSoC Eval Kit
At HUAWEI CONNECT 2017, Xilinx announced that Huawei’s new accelerated cloud service and its FPGA Accelerated Cloud Server (FACS) are based on Xilinx Virtex UltraScale+ VU9P FPGAs. The Huawei FACS platform allows users to develop, deploy, and publish new FPGA-based services and applications on the Huawei Public Cloud with a 10-50x speed-up for compute-intensive cloud applications such as machine learning, data analytics, and video processing. Huawei has more than 15 years of experience in the development of FPGA systems for telecom and data center markets. "The Huawei FACS is a fully integrated hardware and software platform offering developer-to-deployment support with best-in-class industry tool chains and access to Huawei's significant FPGA engineering expertise," said Steve Langridge, Director, Central Hardware Institute, Huawei Canada Research Center.
The FPGA Accelerated Cloud Server is available on the Huawei Public Cloud today. To register for the public beta, please visit http://www.hwclouds.com/product/fcs.html. For more information on the Huawei Cloud, please visit www.huaweicloud.com.
For more information, see this page.
Xcell Daily covered an announcement by Baidu about its use of Xilinx Kintex UltraScale FPGAs for the acceleration of cloud-based applications last October. (See “Baidu Adopts Xilinx Kintex UltraScale FPGAs to Accelerate Machine Learning Applications in the Data Center.”) Today, Baidu discussed more architectural particulars of its FPGA-acceleration efforts at the Hot Chips conference in Cupertino, California—according to Nicole Hemsoth’s article appearing on the NextPlatform.com site (“An Early Look at Baidu’s Custom AI and Analytics Processor”).
“…Baidu has a new processor up its sleeve called the XPU… The architecture they designed is aimed at this diversity with an emphasis on compute-intensive, rule-based workloads while maximizing efficiency, performance and flexibility, says Baidu researcher, Jian Ouyang. He unveiled the XPU today at the Hot Chips conference along with co-presenters from FPGA maker, Xilinx…
“’The FPGA is efficient and can be aimed at specific workloads but lacks programmability,’ Ouyang explains. ‘Traditional CPUs are good for general workloads, especially those that are rule-based and they are very flexible. GPUs aim at massive parallelism and have high performance. The XPU is aimed at diverse workloads that are compute-intensive and rule-based with high efficiency and performance with the flexibility of a CPU,’ Ouyang says. The part that is still lagging, as is always the case when FPGAs are involved, is the programmability aspect. As of now there is no compiler, but he says the team is working to develop one…
“‘To support matrix, convolutional, and other big and small kernels we need a massive math array with high bandwidth, low latency memory and with high bandwidth I/O,’ Ouyang explains. ‘The XPU’s DSP units in the FPGA provide parallelism, the off-chip DDR4 and HBM interface push on the data movement side and the on-chip SRAM provide the memory characteristics required.’”
According to Hemsoth’s article, “The XPU has 256 cores clustered with one shared memory for data synchronization… Somehow, all 256 cores are running at 600MHz.”
For more details, see Hemsoth’s article on the NextPlatform.com Web site.
Two new papers, one about hardware and one about software, describe the Snowflake CNN accelerator and accompanying Torch7 compiler developed by several researchers at Purdue University. The papers are titled “Snowflake: A Model Agnostic Accelerator for Deep Convolutional Neural Networks” (the hardware paper) and “Compiling Deep Learning Models for Custom Hardware Accelerators” (the software paper). The authors of both papers are Andre Xian Ming Chang, Aliasger Zaidy, Vinayak Gokhale, and Eugenio Culurciello from Purdue’s School of Electrical and Computer Engineering and the Weldon School of Biomedical Engineering.
In the abstract, the hardware paper states:
“Snowflake, implemented on a Xilinx Zynq XC7Z045 SoC is capable of achieving a peak throughput of 128 G-ops/s and a measured throughput of 100 frames per second and 120 G-ops/s on the AlexNet CNN model, 36 frames per second and 116 Gops/s on the GoogLeNet CNN model and 17 frames per second and 122 G-ops/s on the ResNet-50 CNN model. To the best of our knowledge, Snowflake is the only implemented system capable of achieving over 91% efficiency on modern CNNs and the only implemented system with GoogLeNet and ResNet as part of the benchmark suite.”
The primary goal of the Snowflake accelerator design was computational efficiency. Efficiency and bandwidth are the two primary factors influencing accelerator throughput. The hardware paper says that the Snowflake accelerator achieves 95% computational efficiency and that it can process networks in real time. Because it is implemented on a Xilinx Zynq Z-7045, power consumption is a miserly 5W according to the software paper, well within the power budget of many embedded systems.
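“Computational efficiency” here is simply measured throughput divided by peak throughput. Checking the numbers in the abstract quoted above against the 128 G-ops/s peak:

```python
peak_gops = 128.0    # peak throughput from the hardware paper

# Measured throughput per model, from the abstract quoted above.
measured_gops = {"AlexNet": 120.0, "GoogLeNet": 116.0, "ResNet-50": 122.0}

for model, gops in measured_gops.items():
    print(f"{model}: {gops / peak_gops:.1%} efficiency")
# ResNet-50 lands at 95.3%, in line with the 95% figure cited above.
```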
The hardware paper also states:
“Snowflake with 256 processing units was synthesized on Xilinx's Zynq XC7Z045 FPGA. At 250MHz, AlexNet achieved 93.6 frames/s and 1.2GB/s of off-chip memory bandwidth, and 21.4 frames/s and 2.2GB/s for ResNet18.”
Here’s a block diagram of the Snowflake machine architecture from the software paper, from the micro level on the left to the macro level on the right:
There’s room for future performance improvement, notes the hardware paper:
“The Zynq XC7Z045 device has 900 MAC units. Scaling Snowflake up by using three compute clusters, we will be able to utilize 768 MAC units. Assuming an accelerator frequency of 250 MHz, Snowflake will be able to achieve a peak performance of 384 G-ops/s. Snowflake can be scaled further on larger FPGAs by increasing the number of clusters.”
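The paper’s peak-throughput figure follows directly from the MAC count and the clock, with each MAC contributing two operations per cycle (a multiply and an add). A quick sketch of that arithmetic:

```python
def peak_gops(n_macs, clock_mhz):
    """Peak throughput in G-ops/s: each MAC does 2 ops per cycle."""
    return n_macs * clock_mhz * 1e6 * 2 / 1e9

print(peak_gops(768, 250))   # 384.0, the scaled three-cluster figure
print(peak_gops(256, 250))   # 128.0, the peak quoted in the abstract
```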
This is where I point out that a Zynq Z-7100 SoC has 2020 “MAC units” (actually, DSP48E1 slices), far more than you’ll find on the Zynq Z-7045 SoC, and that the Zynq UltraScale+ ZU15EG MPSoC has 3528 DSP48E2 slices, more still. If speed and throughput are what you desire in a CNN accelerator, either of these parts would be worthy of consideration for further development.
Brian Bailey has just posted an excellent tutorial article titled “CCIX Enables Machine Learning” on the Semiconductor Engineering Web site. The article discusses the CCIX high-speed, coherent chip-to-chip interconnect standard and its use in machine-learning applications. As it states on the CCIX Consortium Web site:
“CCIX was founded to enable a new class of interconnect focused on emerging acceleration applications such as machine learning, network processing, storage off-load, in-memory database and 4G/5G wireless technology.
“The standard allows processors based on different instruction set architectures to extend the benefits of cache coherent, peer processing to a number of acceleration devices including FPGAs, GPUs, network/storage adapters, intelligent networks and custom ASICs.”
“Today, machine learning is based on tasks that have a very deep pipeline. ‘Everyone talks about the amount of compute required, and that is why GPUs are doing well,’ says [Xilinx Vice President of Architecture and Verification and CCIX Consortium chair Gaurav] Singh. ‘They have a lot of compute engines, but the bigger problem is actually the data movement. You may want to enable a model where the GPU is doing the training and the inference is being done by the FPGA. Now you have a lot of data sharing for all of the weights being generated by the GPU, and those are being transferred over to the FPGA for inference. You also may have backward propagation and forward propagation. Forward propagation could be done by the FPGAs, backward by the GPU, but the key thing is still that data movement. They can all work efficiently together if they can share the same data.’”
For more information about CCIX, see:
Korea-based ATUS (Across The Universe) has developed a working automotive vision sensor that recognizes objects such as cars and pedestrians in a 17.53-frames/sec video stream. A CNN (convolutional neural network) performs the object recognition across 20 different object classes and runs in the programmable-logic fabric of a Xilinx Zynq Z-7045 SoC. The programmable logic clocks at 200MHz and the entire design draws 10.432W. That’s about 10% of the power required by CPUs or GPUs to implement this CNN.
Here’s a block diagram of the recognition engine in the Zynq SoC’s programmable logic fabric:
ATUS’ Object-Recognition CNN runs in the programmable logic fabric of a Zynq Z-7045 SoC
Here’s a short video of ATUS’ Automotive Vision Sensor in action, running on a Xilinx ZC706 eval kit:
Please contact ATUS for more information about their Automotive Vision Sensor.
Voice control is hot. Witness Amazon Echo and Google Home. These products work because they’re designed to recognize the spoken word from a distance—far-field speech recognition. It’s a useful capability in a wide range of consumer, medical, and industrial applications, and SoundAI now has a kit you can use to build far-field speech recognition into your next system design, whether that’s a smart speaker; an in-vehicle, speech-based control system; a voice-controlled IoT or IIoT device; or some never-before-seen device. The SoundAI 60C MicA Development Kit employs FPGA-accelerated machine learning and FPGA-based signal processing to implement advanced audio noise suppression, de-reverberation, echo cancellation, direction-of-arrival detection, and beamforming. The FPGA acceleration is performed by a Xilinx Spartan-6 SLX4 FPGA. (There’s also an available version built into a smart speaker.)
SoundAI 60C MicA Development Kit for Far-Field Speech Recognition
The SoundAI MicA Development Kit’s circular circuit board measures 3.15 inches (80mm) in diameter and incorporates 7 MEMS microphones and 32 LEDs in addition to the Spartan-6 FPGA. According to SoundAI, the kit can capture voice from as far as 5m away, detect commands embedded in the 360-degree ambient sound, localize the voice to within ±10°, and deliver clean audio to the speech-recognition engine (Alexa for English and SoundAI for Chinese).
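SoundAI hasn’t published how its direction-of-arrival detection works, but the underlying principle (estimate the time difference of arrival between microphones, then convert that delay to an angle) is standard. Here’s a toy Python sketch for a single microphone pair; the sample rate, mic spacing, and signals are all illustrative, not SoundAI’s actual parameters:

```python
import numpy as np

fs = 16000                     # sample rate in Hz (assumed)
d = 0.08                       # mic spacing in meters (~the 80mm board diameter)
c = 343.0                      # speed of sound, m/s

rng = np.random.default_rng(2)
source = rng.standard_normal(2048)
true_delay = 2                 # samples by which mic B lags mic A
mic_a = source
mic_b = np.roll(source, true_delay)

# Cross-correlate the two channels and take the lag with the strongest match.
corr = np.correlate(mic_b, mic_a, mode="full")
lag = int(np.argmax(corr)) - (len(mic_a) - 1)

# Map the recovered lag to an arrival angle relative to the mic axis.
theta = np.degrees(np.arcsin(np.clip(c * lag / (fs * d), -1.0, 1.0)))
print(lag, round(theta, 1))
```

A real 7-microphone array combines many such pairwise estimates (or steers beams directly), which is how a board like this can narrow the bearing to within ±10°.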