Last week at the NIPS 2017 conference in Long Beach, California, a Xilinx team demonstrated a live object-detection implementation of a YOLO—“you only look once”—network called Tincy YOLO (pronounced “teensy YOLO”) running on a Xilinx Zynq UltraScale+ MPSoC. Tincy YOLO combines reduced precision, pruning, and FPGA-based hardware acceleration to speed network performance by 160x, resulting in a YOLO network capable of operating on video frames at 16fps while dissipating a mere 6W.
Live demo of Tincy YOLO at NIPS 2017. Photo credit: Dan Isaacs
Here’s a description of that demo:
TincyYOLO: a real-time, low-latency, low-power object detection system running on a
Zynq UltraScale+ MPSoC
By Michaela Blott, Principal Engineer, Xilinx
The Tincy YOLO demonstration shows real-time, low-latency, low-power object detection running on a Zynq UltraScale+ MPSoC device. In object detection, the challenge is to identify objects of interest within a scene and to draw bounding boxes around them, as shown in Figure 1. Object detection is useful in many areas, particularly in advanced driver assistance systems (ADAS) and autonomous vehicles where systems need to automatically detect hazards and to take the right course of action. Tincy YOLO leverages the “you only look once” (YOLO) algorithm, which delivers state-of-the-art object detection. Tincy YOLO is based on the Tiny YOLO convolutional network, which is based on the Darknet reference network. Tincy YOLO has been optimized through heavy quantization and modification to fit into the Zynq UltraScale+ MPSoC’s PL (programmable logic) and Arm Cortex-A53 processor cores to produce the final, real-time demo.
Figure 1: YOLO-recognized people with bounding boxes
To appreciate the computational challenge posed by Tiny YOLO, note that it takes 7 billion floating-point operations to process a single frame. Before you can conquer this computational challenge on an embedded platform, you need to pull many levers. Luckily, the all-programmable Zynq UltraScale+ MPSoC platform provides many levers to pull. Figure 2 summarizes the versatile and heterogeneous architectural options of the Zynq platform.
Figure 2: Tincy YOLO Platform Overview
The vanilla Darknet open-source neural network framework is optimized for CUDA acceleration but its generic, single-threaded processing option can target any C-programmable CPU. Compiling Darknet for the embedded Arm processors in the Zynq UltraScale+ MPSoC left us with a sobering performance of one recognized frame every 10 seconds. That’s about two orders of magnitude of performance away from a useful ADAS implementation. It also produces a very limited live-video experience.
To create Tincy YOLO, we leveraged several of the Zynq UltraScale+ MPSoC’s architectural features in steps, as shown in Figure 3. Our first major move was to quantize the computation of the network’s twelve inner (aka. hidden) layers by giving them binary weights and 3-bit activations. We then pruned this network to reduce the total operations to 4.5 GOPs/frame.
Figure 3: Steps used to achieve a 160x speedup of the Tiny YOLO network
We created a reduced-precision accelerator using a variant of the FINN BNN library (https://github.com/Xilinx/BNN-PYNQ) to offload the quantized layers into the Zynq UltraScale+ MPSoC’s PL. These layers account for more than 97% of all the computation within the network. Moving the computations for these layers into hardware bought us a 30x speedup of their specific execution, which translated into an 11x speedup within the overall application context, bringing the network’s performance up to 1.1fps.
We tackled the remaining outer layers by exploiting the NEON SIMD vector capabilities built into the Zynq UltraScale+ MPSoC’s Arm Cortex-A53 processor cores, which gained another 2.2x speedup. Then we cracked down on the complexity of the initial convolution using maxpool elimination for another 2.2x speedup. This work raised the frame rate to 5.5fps. A final re-write of the network inference to parallelize the CPU computations across all four of the Zynq UltraScale+ MPSoC’s Arm Cortex-A53 processor delivered video performance at 16fps.
The result of these changes appears in Figure 4, which demonstrates better recognition accuracy than Tiny YOLO.
Figure 4: Tincy YOLO results
An article titled “Living on the Edge” by Farhad Fallah, one of Aldec’s Application Engineers, on the New Electronics Web site recently caught my eye because it succinctly describes why FPGAs are so darn useful for many high-performance, edge-computing applications. Here’s an example from the article:
“The benefits of Cloud Computing are many-fold… However, there are a few disadvantages to the cloud too, the biggest of which is that no provider can guarantee 100% availability.”
There’s always going to be some delay when you ship data to the cloud for processing. You will need to wait for the answer. The article continues:
“Edge processing needs to be high-performance and in this respect an FPGA can perform several different tasks in parallel.”
The article then continues to describe a 4-camera ADAS demo based on Aldec’s TySOM-2-7Z100 prototyping board that was shown at this year’s Embedded Vision Summit held in Santa Clara, California. (The TySOM-2-7Z100 proto board is based on the Xilinx Zynq Z-7100 SoC—the largest member of the Zynq SoC family.)
Aldec’s TySOM-2-7Z100 prototyping board
Then the article describes the significant performance boost that the Zynq SoC’s FPGA fabric provides:
“The processing was shared between a dual-core ARM Cortex-A9 processor and FPGA logic (both of which reside within the Zynq device) and began with frame grabbing images from the cameras and applying an edge detection algorithm (‘edge’ here in the sense of physical edges, such as objects, lane markings etc.). This is a computational-intensive task because of the pixel-level computations being applied (i.e. more than 2 million pixels). To perform this task on the ARM CPU a frame rate of only 3 per second could have been realized, whereas in the FPGA 27.5 fps was achieved.”
That’s nearly a 10x performance boost thanks to the on-chip FPGA fabric. Could your application benefit similarly?
Back in August, I wrote about a series of GigE 3D imaging sensors based on Spartan-6 FPGAs from Carnegie Robotics. (See “Carnegie Robotics’ FPGA-based GigE 3D cameras help robots sweep mines from a battlefield, tend corn, and scrub floors.”) That blog post mentioned that Carnegie Robotics had teamed with GPS maker Swift Navigation to work on autonomous robots that would employ the 3D and positioning-system sensors from the two companies. That post also mentioned that the photo of Swift Navigation’s centimeter-accurate Piksi Multi multi-band, multi-constellation GNSS (global navigation satellite system) receiver clearly showed that the receiver is based on a Zynq Z-7020 SoC.
Now, Swift Navigation has just appeared in the latest “Powered by Xilinx” video. In this video, Swift Navigation’s CEO and Founder Timothy Harris describes his company’s use of the Zynq SoC in the Piksi Multi. The Zynq SoC’s programmable logic processes the incoming signals from multiple global-positioning satellite constellations on multiple frequencies and performs measurements on those signals that is normally performed by dedicated hardware. Then the Zynq SoC’s dual-core Arm Cortex-A9 MPCore processor calculates a physical position from those measurements.
The advantages that hardware and software programmability confer on Swift Navigation’s Piksi Multi includes the ability to quickly adapt the GNSS module for specific customer requirements and the ability to update, upgrade, and add features to the module via over-the-air transmissions. These capabilities give Swift Navigation a competitive advantage over competitive designs that employ dedicated hardware.
Here’s the video:
Last month, I described BrainChip’s neuromorphic BrainChip Accelerator PCIe card, a spiking neural network (SNN) based on a Xilinx Kintex UltraScale FPGA. (See “FPGA-based Neuromorphic Accelerator board recognizes objects 7x more efficiently than GPUs on GoogleNet, AlexNet.”) Now, BrainChip Holdings has announced that it has shipped a BrainChip Accelerator to “a major European automobile manufacturer” for evaluation in ADAS (Advanced Driver Assist Systems) and autonomous driving applications. According to BrainChip’s announcement, the company’s BrainChip Accelerator “provides a 7x improvement in images/second/watt, compared to traditional convolutional neural networks accelerated by Graphics Processing Units (GPUs).”
BrainChip Accelerator card with six SNNs instantiated in a Kintex UltraScale FPGA
MathWorks has just published a 4-part mini course that teaches you how to develop vision-processing applications using MATLAB, HDL Coder, and Simulink, then walks you through a practical example targeting a Xilinx Zynq SoC using a lane-detection algorithm in Part 4.
Click here for each of the classes:
Xylon has a new hardware/software development kit for quickly implementing embedded, multi-camera vision systems for ADAS and AD (autonomous driving), machine-vision, AR/VR, guided robotics, drones, and other applications. The new logiVID-ZU Vision Development Kit is based on the Xilinx Zynq UltraScale+ MPSoC and includes four Xylon 1Mpixel video cameras based on the TI FPD (flat-panel display) Link-III interface. The kit supports HDMI video input and display output and comes complete with extensive software deliverables including pre-verified camera-to-display SoC designs built with licensed Xylon logicBRICKS IP cores, reference designs and design examples prepared for the Xilinx SDSoC Development Environment, and complete demo Linux applications.
Xylon’s new logiVID-ZU Vision Development Kit
Please contact Xylon for more information about the new logiVID-ZU Vision Development Kit.
Last week, Aerotenna announced its ready-to-fly Smart Drone Development Platform, based on its OcPoC with Xilinx Zynq Mini Flight Controller. Other components in the $3499 Drone Dev Platform include Aerotenna’s μLanding radar altimeter, three Aerotenna μSharp-Patch collision avoidance radar sensors, one Aerotenna CAN hub, a remote controller, and a pre-assembled quadcopter airframe:
The OcPoC flight controller uses the Zynq SoC’s additional processing power and I/O flexibility—embodied in the on-chip programmable FPGA fabric and programmable I/O—to handle the immense sensor load presented to a drone in flight through sensor fusion and on-board, real-time processing. The Aerotenna OcPoC flight controller handles more than 100 sense inputs.
How well do all of these Aerotenna drone components work together? Well, one indication of how well integrated they are is another announcement last week—made jointly by Aerotenna and UAVenture of Switzerland—to roll out the Precision Spray Autopilot—a “simple plug-and-play solution, allowing quick and easy integration into your existing multirotor spraying drones.” This piece of advanced Ag Tech is designed to create smart drones for agricultural spraying applications.
The Precision Spray Autopilot in Action
The Precision Spray Autopilot’s features include:
What’s even cooler is this 1-minute demo video of the Precision Spray Autopilot in action:
The Xilinx Technology Showcase 2017 will highlight FPGA-acceleration as used in Amazon’s cloud-based AWS EC2 F1 Instance and for high-performance, embedded-vision designs—including vision/video, autonomous driving, Industrial IoT, medical, surveillance, and aerospace/defense applications. The event takes place on Friday, October 6 at the Xilinx Summit Retreat Center in Longmont, Colorado.
You’ll also have a chance to see the latest ways you can use the increasingly popular Python programming language to create Zynq-based designs. The Showcase is a prelude to the 30-hour Xilinx Hackathon starting immediately after the Showcase. (See “Registration is now open for the Colorado PYNQ Hackathon—strictly limited to 35 participants. Apply now!”)
The Xilinx Technology Showcase runs from 3:00 to 5:00 PM.
Click here for more details and for registration info.
Xilinx Colorado, Longmont Facility
For more information about the FPGA-accelerated Amazon AWS EC2 F1 Instance, see:
Pinnacle Imaging Systems’ configurable Denali-MC HDR video and HDR still ISP (Image Signal Processor) IP can support 29 different HDR-capable CMOS image sensors including nine Aptina/ON Semi, six Omnivision, and eleven Sony sensors and twelve different pixel-level gain and frame-set HDR methods using 16-bit processing. The IP can be useful in a wide variety of applications including but certainly not limited to:
Pinnacle’s Denali-MC Image Signal Processor Core Block Diagram
Pinnacle has implemented its Denali-MC IP on a Xilinx Zynq Z-7045 SoC (from the photo on the Denali-MC product page, it appears that Pinnacle used a Xilinx Zynq ZC706 Eval Kit as the implementation vehicle) and it has produced this impressive 3-minute video of the IP in real-time action:
Please contact Pinnacle directly for more information about the Denali-MC ISP IP. The data sheet for the Denali-MC ISP core is here.
Two new papers, one about hardware and one about software, describe the Snowflake CNN accelerator and accompanying Torch7 compiler developed by several researchers at Purdue U. The papers are titled “Snowflake: A Model Agnostic Accelerator for Deep Convolutional Neural Networks” (the hardware paper) and “Compiling Deep Learning Models for Custom Hardware Accelerators” (the software paper). The authors of both papers are Andre Xian Ming Chang, Aliasger Zaidy, Vinayak Gokhale, and Eugenio Culurciello from Purdue’s School of Electrical and Computer Engineering and the Weldon School of Biomedical Engineering.
In the abstract, the hardware paper states:
“Snowflake, implemented on a Xilinx Zynq XC7Z045 SoC is capable of achieving a peak throughput of 128 G-ops/s and a measured throughput of 100 frames per second and 120 G-ops/s on the AlexNet CNN model, 36 frames per second and 116 Gops/s on the GoogLeNet CNN model and 17 frames per second and 122 G-ops/s on the ResNet-50 CNN model. To the best of our knowledge, Snowflake is the only implemented system capable of achieving over 91% efficiency on modern CNNs and the only implemented system with GoogLeNet and ResNet as part of the benchmark suite.”
The primary goal of the Snowflake accelerator design was computational efficiency. Efficiency and bandwidth are the two primary factors influencing accelerator throughput. The hardware paper says that the Snowflake accelerator achieves 95% computational efficiency and that it can process networks in real time. Because it is implemented on a Xilinx Zynq Z-7045, power consumption is a miserly 5W according to the software paper, well within the power budget of many embedded systems.
The hardware paper also states:
“Snowflake with 256 processing units was synthesized on Xilinx's Zynq XC7Z045 FPGA. At 250MHz, AlexNet achieved in 93:6 frames/s and 1:2GB/s of off-chip memory bandwidth, and 21:4 frames/s and 2:2GB/s for ResNet18.”
Here’s a block diagram of the Snowflake machine architecture from the software paper, from the micro level on the left to the macro level on the right:
There’s room for future performance improvement notes the hardware paper:
“The Zynq XC7Z045 device has 900 MAC units. Scaling Snowflake up by using three compute clusters, we will be able to utilize 768 MAC units. Assuming an accelerator frequency of 250 MHz, Snowflake will be able to achieve a peak performance of 384 G-ops/s. Snowflake can be scaled further on larger FPGAs by increasing the number of clusters.”
This is where I point out that a Zynq Z-7100 SoC has 2020 “MAC units” (actually, DSP48E1 slices)—which is a lot more than you find on the Zynq Z-7045 SoC—and the Zynq UltraScale+ ZU15EG MPSoC has 3528 DSP48E2 slices—which is much, much larger still. If speed and throughput are what you desire in a CNN accelerator, then either of these parts would be worthy of consideration for further development.
A recent White Paper published by National Instruments (NI) and titled “ADAS HIL With Sensor Fusion” discusses the challenges presented by systems that have fused many diverse sensors—and automated driver assistance systems (ADAS) present some of the biggest challenges yet. ADAS systems fuse radar, visible and IR cameras, LIDAR, and ultrasound sensors into an environment sensing system of nearly unprecedented complexity, yet they’re expected to work reliably en masse while installed in consumer-grade automobiles. That’s a stiff challenge indeed.
Meeting these challenges requires good design, yes, and it also requies large amounts of testing. Preferably automated testing because there are too many tests to run by hand.
This NI White Paper discusses the way that Altran Italia met these challenges by creating an ADAS HIL Test Environment Suite based on NI instruments and the LabVIEW Development Environment that could test a variety of automotive electronics control units (ECUs) including a Sensor Fusion ECU, a Radar ECU, and a camera ECU. The ADAS HIL test system generates 3D video scenes for the cameras and simulates radar-detected objects using an RF simulator that’s based, in part, on one or more NI 2nd-generation PXIe-5840 VSTs (Vector Signal Transceivers), which in turn are based on a Xilinx Virtex-7 690T FPGA. (See “NI launches 2nd-Gen 6.5GHz Vector Signal Transceiver with 5x the instantaneous bandwidth, FPGA programmability.”)
ADAS HIL Test Environment
Korea-based ATUS (Across The Universe) has developed a working automotive vision sensor that recognizes objects such as cars and pedestrians using a 17.53frames/sec video stream. A CNN (convolutional neural network) performs the object recognition on 20 different object classes and runs in the programmable logic fabric on a Xilinx Zynq Z7045 SoC. The programmable logic clocks at 200MHz and the entire design draws 10.432W. That’s about 10% of the power required by CPUs or GPUs to implement this CNN.
Here’s a block diagram of the recognition engine in the Zynq SoC’s programmable logic fabric:
ATUS’ Object-Recognition CNN runs in the programmable logic fabric of a Zynq Z7045 SoC
Here’s a short video of ATUS’ Automotive Vision Sensor in action, running on a Xilinx ZC106 eval kit:
Please contact ATUS for more information about their Automotive Vision Sensor.
The latest “Powered by Xilinx” video, published today, provides more detail about the Perrone Robotics MAX development platform for developing all types of autonomous robots—including self-driving cars. MAX is a set of software building blocks for handling many types of sensors and controls needed to develop such robotic platforms.
Perrone Robotics has MAX running on the Xilinx Zynq UltraScale+ MPSoC and relies on that heterogeneous All Programmable device to handle the multiple, high-bit-rate data streams from complex sensor arrays that include lidar systems and multiple video cameras.
Perrone is also starting to develop with the new Xilinx reVISION stack and plans to both enhance the performance of existing algorithms and develop new ones for its MAX development platform.
Here’s the 4-minute video:
Last month, I wrote about Perrone Robotic’s Autonomous Driving Platform based on the Zynq UltraScale+ MPSoC. (See “Linc the autonomous Lincoln MKZ running Perrone Robotics' MAX AI takes a drive in Detroit without puny humans’ help” and “Perrone Robotics builds [Self-Driving] Hot Rod Lincoln with its MAX platform, on a Zynq UltraScale+ MPSoC.”) That platform runs on a controller box supplied by iVeia. In the 2-minute video below, iVeia’s CTO Mike Fawcett describes the attributes of the Zynq UltraScale+ MPSoC that make it a superior implementation technology for autonomous driving platforms. The Zynq UltraScale+ MPSoC’s immense, heterogeneous computing power supplied by six ARM processors plus programmable logic and a few more programmable resources flexibly delivers the monumental amount of processing required for vehicular sensor fusion and real-time perception processing while consuming far less power and generating far less heat than competing solutions involving CPUs or GPUs.
Here’s the video:
Cloud computing and application acceleration for a variety of workloads including big-data analytics, machine learning, video and image processing, and genomics are big data-center topics and if you’re one of those people looking for acceleration guidance, read on. If you’re looking to accelerate compute-intensive applications such as automated driving and ADAS or local video processing and sensor fusion, this blog post’s for you to. The basic problem here is that CPUs are too slow and they burn too much power. You may have one or both of these challenges. If so, you may be considering a GPU or an FPGA as an accelerator in your design.
How to choose?
Although GPUs started as graphics accelerators, primarily for gamers, a few architectural tweaks and a ton of software have made them suitable as general-purpose compute accelerators. With the right software tools, it’s not too difficult to recode and recompile a program to run on a GPU instead of a CPU. With some experience, you’ll find that GPUs are not great for every application workload. Certain computations such as sparse matrix math don’t map onto GPUs well. One big issue with GPUs is power consumption. GPUs aimed at server acceleration in a data-center environment may burn hundreds of watts.
With FPGAs, you can build any sort of compute engine you want with excellent performance/power numbers. You can optimize an FPGA-based accelerator for one task, run that task, and then reconfigure the FPGA if needed for an entirely different application. The amount of computing power you can bring to bear on a problem is scary big. A Virtex UltraScale+ VU13P FPGA can deliver 38.3 INT8 TOPS (that’s tera operations per second) and if you can binarize the application, which is possible with some neural networks, you can hit 500TOPS. That’s why you now see big data-center operators like Baidu and Amazon putting Xilinx-based FPGA accelerator cards into their server farms. That’s also why you see Xilinx offering high-level acceleration programming tools like SDAccel to help you develop compute accelerators using Xilinx All Programmable devices.
For more information about the use of Xilinx devices in such applications including a detailed look at operational efficiency, there’s a new 17-page White Paper titled “Xilinx All Programmable Devices: A Superior Platform for Compute-Intensive Systems.”
Linc, Perrone Robotics’ autonomous Lincoln MKZ automobile, took a drive around the Perrone paddock at the TU Automotive autonomous vehicle show in Detroit last week and Dan Isaacs, Xilinx’s Director Connected Systems in Corporate Marketing, was there to shoot photos and video. Perrone’s Linc test vehicle operates autonomously using the company’s MAX (Mobile Autonomous X), a “comprehensive full-stack, modular, real-time capable, customizable, robotics software platform for autonomous (self-driving) vehicles and general purpose robotics.” MAX runs on multiple computing platforms including one based on an Iveia controller, which is based on an Iveia Atlas SOM, which in turn is based on a Xilinx Zynq UltraScale+ MPSoC. The Zynq UltraScale+ MPSoC handles the avalanche of data streaming from the vehicle’s many sensors to ensure that the car travels the appropriate path and avoids hitting things like people, walls and fences, and other vehicles. That’s all pretty important when the car is driving itself in public. (For more information about Perrone Robotics’ MAX, see “Perrone Robotics builds [Self-Driving] Hot Rod Lincoln with its MAX platform, on a Zynq UltraScale+ MPSoC.”)
Here’s a photo of Perrone’s sensored-up Linc autonomous automobile in the Perrone Robotics paddock at TU Automotive in Detroit:
And here’s a photo of the Iveia control box with the Zynq UltraScale+ MPSoC inside, running Perrone’s MAX autonomous-driving software platform. (Note the controller’s small size and lack of a cooling fan):
Opinions about the feasibility of autonomous vehicles are one thing. Seeing the Lincoln MKZ’s 3800 pounds of glass, steel, rubber, and plastic being controlled entirely by a little silver box in the trunk, that’s something entirely different. So here’s the video that shows Perrone Robotics’ Linc in action, driving around the relative safety of the paddock while avoiding the fences, pedestrians, and other vehicles:
When someone asks where Xilinx All Programmable devices are used, I find it a hard question to answer because there’s such a very wide range of applications—as demonstrated by the thousands of Xcell Daily blog posts I’ve written over the past several years.
Now, there’s a 5-minute “Powered by Xilinx” video with clips from several companies using Xilinx devices for applications including:
That’s a huge range covered in just five minutes.
Here’s the video:
With LED automotive lighting now becoming commonplace, newer automobiles have the ability to communicate with each other (V2V communications) and with roadside infrastructure by quickly flashing their lights (LiFi) instead of using radio protocols. Researchers at OKATEM—the Centre of Excellence in Optical Wireless Communication Technologies at Ozyegin University in Turkey—have developed an OFDM-based LiFi demonstrator for V2V (vehicle-to-vehicle) and V2I (vehicle-to-infrastructure) applications that has achieved 50Mbps communications between vehicles as far apart as 70m in a lab atmospheric emulator.
Inside the OKATEM LiFi Atmospheric Emulator
The demo system is based on PXIe equipment from National Instruments (NI) including FlexRIO FPGA modules. (NI’s PXIe FlexRIO modules are based on Xilinx Virtex-5 and Virtex-7 FPGAs.) The FlexRIO modules implement the LiFi OFDM protocols including channel coding, 4-QAM modulation, and an N-IFFT. Here’s a diagram of the setup:
Researchers developed the LiFi system using NI’s LabVIEW and LabVIEW system engineering software. Initial LiFi system performance demonstrated a data rate of 50 Mbps with as much as 70m between two cars, depending on the photodetectors’ location in the car (particularly its height above ground level). Further work will try to improve the total system performance by integrating advanced capabilities such as multiple-input, multiple-output (MIMO) communication and link adaptation on the top of OFDM architecture.
This project was a 2017 NI Engineering Impact Award Winner in the RF and Mobile Communications category last month at NI Week. It is documented in this NI case study.
My Pappy said
Son, you’re gonna
Drive me to drinkin’
If you don’t stop drivin’
That Hot Rod Lincoln” — Commander Cody & His Lost Planet Airmen
In other words, you need an autonomous vehicle.
For the last 14 years, Perrone Robotics has focused on creating platforms that allow vehicle manufacturers to quickly integrate a variety of sensors and control algorithms into a self-driving vehicle. The company’s MAX (Mobile Autonomous X) is “comprehensive full-stack, modular, real-time capable, customizable, robotics software platform for autonomous (self-driving) vehicles and general purpose robotics.”
Sensors for autonomous vehicles include cameras, lidar, radar, ultrasound, and GPS. All of these sensors generate a lot of data—about 1Mbyte/sec for the Perrone test platform. Designers need to break up all of the processing required for these sensors into tasks that can be distributed to multiple processors and then fuse the processed sensor data (sensor fusion) to achieve real-time, deterministic performance. For the most demanding tasks, software-based processing won’t deliver sufficiently quick response.
Self-driving systems must make as many as 100 decisions/sec based on real-time sensor data. You never know what will come at you.
According to Perrone’s Chief Revenue Officer Dave Hofert, the Xilinx Zynq UltraScale+ MPSoC with its multiple ARM Cortex-A53 and -R5 processors and programmable logic can handle all of these critical tasks and provides a “solution that scales,” with enough processing power to bring in machine learning as well.
Here’s a brand new, 3-minute video with more detail and a lot of views showing a Perrone-equipped Lincoln driving very carefully all by itself:
For more detailed information about Perrone Robotics, see this new feature story from an NBC TV affiliate.
Mentor has just announced the DRS360 platform for developing autonomous driving systems based on the Xilinx Zynq UltraScale+ MPSoC. The automotive-grade DRS360 platform is already designed and tested for deployment in ISO 26262 ASIL D-compliant systems.
This platform offers comprehensive sensor-fusion capabilities for multiple cameras, radar, LIDAR, and other sensors while offering “dramatic improvements in latency reduction, sensing accuracy and overall system efficiency required for SAE Level 5 autonomous vehicles.” In particular, the DRS360 platform’s use of the Zynq UltraScale+ MPSoC permits the use of “raw data sensors,” thus avoiding the power, cost, and size penalties of microcontrollers and the added latency of local processing at the sensor nodes.
Eliminating pre-processing microcontrollers from all system sensor nodes brings many advantages to the autonomous-driving system design including improved real-time performance, significant reductions in system cost and complexity, and access to all of the captured sensor data for a maximum-resolution, unfiltered model of the vehicle’s environment and driving conditions.
Rather than try to scale lower levels of ADAS up, Mentor’s DRS360 platform is optimized for Level 5 autonomous driving, and it’s engineered to easily scale down to Levels 4, 3 and even 2. This approach makes it far easier to develop systems at the appropriate level for the system you’re developing because the DRS360 platform is already designed to handle the most complex tasks from the beginning.
I did not go to Embedded World in Nuremberg this week but apparently SemiWiki’s Bernard Murphy was there and he’s published his observations about three Zynq-based reference designs that he saw running in Aldec’s booth on the company’s Zynq-based TySOM embedded dev and prototyping boards.
Aldec TySOM-2 Embedded Prototyping Board
Murphy published this article titled “Aldec Swings for the Fences” on SemiWiki and wrote:
“At the show, Aldec provided insight into using the solution to model the ARM core running in QEMU, together with a MIPI CSI-2 solution running in the FPGA. But Aldec didn’t stop there. They also showed off three reference designs designed using this flow and built on their TySOM boards.
“The first reference design targets multi-camera surround view for ADAS (automotive – advanced driver assistance systems). Camera inputs come from four First Sensor Blue Eagle systems, which must be processed simultaneously in real-time. A lot of this is handled in software running on the Zynq ARM cores but the computationally-intensive work, including edge detection, colorspace conversion and frame-merging, is handled in the FPGA. ADAS is one of the hottest areas in the market and likely to get hotter since Intel just acquired Mobileye.
“The next reference design targets IoT gateways – also hot. Cloud interface, through protocols like MQTT, is handled by the processors. The gateway supports connection to edge devices using wireless and wired protocols including Bluetooth, ZigBee, Wi-Fi and USB.
“Face detection for building security, device access and identifying evil-doers is also growing fast. The third reference design is targeted at this application, using similar capabilities to those on the ADAS board, but here managing real-time streaming video as 1280x720 at 30 frames per second, from an HDR-CMOS image sensor.”
The article contains a photo of the Aldec TySOM-2 Embedded Prototyping Board, which is based on a Xilinx Zynq Z-7045 SoC. According to Murphy, Aldec developed the reference designs using its own and other design tools including the Aldec Riviera-PRO simulator and QEMU. (For more information about the Zynq-specific QEMU processor emulator, see “The Xilinx version of QEMU handles ARM Cortex-A53, Cortex-R5, Cortex-A9, and MicroBlaze.”)
Then Murphy wrote this:
“So yes, Aldec put together a solution combining their simulator with QEMU emulation and perhaps that wouldn’t justify a technical paper in DVCon. But business-wise they look like they are starting on a much bigger path. They’re enabling FPGA-based system prototype and build in some of the hottest areas in systems today and they make these solutions affordable for design teams with much more constrained budgets than are available to the leaders in these fields.”
This week, EETimes’ Junko Yoshida published an article titled “Xilinx AI Engine Steers New Course” that gathers some comments from industry experts and from Xilinx with respect to Monday’s reVISION stack announcement. To recap, the Xilinx reVISION stack is a comprehensive suite of industry-standard resources for developing advanced embedded-vision systems based on machine learning and machine inference.
As Xilinx Senior Vice President of Corporate Strategy Steve Glaser tells Yoshida, “Xilinx designed the stack to ‘enable a much broader set of software and systems engineers, with little or no hardware design expertise to develop, intelligent vision guided systems easier and faster.’”
“While talking to customers who have already begun developing machine-learning technologies, Xilinx identified ‘8 bit and below fixed point precision’ as the key to significantly improve efficiency in machine-learning inference systems.”
Yoshida also interviewed Karl Freund, Senior Analyst for HPC and Deep Learning at Moor Insights & Strategy, who said:
“Artificial Intelligence remains in its infancy, and rapid change is the only constant.” In this circumstance, Xilinx seeks “to ease the programming burden to enable designers to accelerate their applications as they experiment and deploy the best solutions as rapidly as possible in a highly competitive industry.”
She also quotes Loring Wirbel, a Senior Analyst at The Linley group, who said:
“What’s interesting in Xilinx's software offering, [is that] this builds upon the original stack for cloud-based unsupervised inference, Reconfigurable Acceleration Stack, and expands inference capabilities to the network edge and embedded applications. One might say they took a backward approach versus the rest of the industry. But I see machine-learning product developers going a variety of directions in trained and inference subsystems. At this point, there's no right way or wrong way.”
There’s a lot more information in the EETimes article, so you might want to take a look for yourself.
Today, EEJournal’s Kevin Morris has published a review article of the announcement titled “Teaching Machines to See: Xilinx Launches reVISION” following Monday’s announcement of the Xilinx reVISION stack for developing vision-guided applications. (See “Xilinx reVISION stack pushes machine learning for vision-guided applications all the way to the edge.”
“But vision is one of the most challenging computational problems of our era. High-resolution cameras generate massive amounts of data, and processing that information in real time requires enormous computing power. Even the fastest conventional processors are not up to the task, and some kind of hardware acceleration is mandatory at the edge. Hardware acceleration options are limited, however. GPUs require too much power for most edge applications, and custom ASICs or dedicated ASSPs are horrifically expensive to create and don’t have the flexibility to keep up with changing requirements and algorithms.
“That makes hardware acceleration via FPGA fabric just about the only viable option. And it makes SoC devices with embedded FPGA fabric - such as Xilinx Zynq and Altera SoC FPGAs - absolutely the solutions of choice. These devices bring the benefits of single-chip integration, ultra-low latency and high bandwidth between the conventional processors and the FPGA fabric, and low power consumption to the embedded vision space.”
Later on, Morris gets to the fly in the ointment:
“Oh, yeah, There’s still that “almost impossible to program” issue.”
And then he gets to the solution:
“reVISION, announced this week, is a stack - a set of tools, interfaces, and IP - designed to let embedded vision application developers start in their own familiar sandbox (OpenVX for vision acceleration and Caffe for machine learning), smoothly navigate down through algorithm development (OpenCV and NN frameworks such as AlexNet, GoogLeNet, SqueezeNet, SSD, and FCN), targeting Zynq devices without the need to bring in a team of FPGA experts. reVISION takes advantage of Xilinx’s previously-announced SDSoC stack to facilitate the algorithm development part. Xilinx claims enormous gains in productivity for embedded vision development - with customers predicting cuts of as much as 12 months from current schedules for new product and update development.
In many systems employing embedded vision, it’s not just the vision that counts. Increasingly, information from the vision system must be processed in concert with information from other types of sensors such as LiDAR, SONAR, RADAR, and others. FPGA-based SoCs are uniquely agile at handling this sensor fusion problem, with the flexibility to adapt to the particular configuration of sensor systems required by each application. This diversity in application requirements is a significant barrier for typical “cost optimization” strategies such as the creation of specialized ASIC and ASSP solutions.
The performance rewards for system developers who successfully harness the power of these devices are substantial. Xilinx is touting benchmarks showing their devices delivering an advantage of 6x images/sec/watt in machine learning inference with GoogLeNet @batch = 1, 42x frames/sec/watt in computer vision with OpenCV, and ⅕ the latency on real-time applications with GoogLeNet @batch = 1 versus “NVidia Tegra and typical SoCs.” These kinds of advantages in latency, performance, and particularly in energy-efficiency can easily be make-or-break for many embedded vision applications.”
But don’t take my word for it, read Morris’ article yourself.
As part of today’s reVISION announcement of a new, comprehensive development stack for embedded-vision applications, Xilinx has produced a 3-minute video showing you just some of the things made possible by this announcement.
Here it is:
By Adam Taylor
Several times in this series, we have looked at image processing using the Avnet EVK and the ZedBoard. Along with the basics, we have examined object tracking using OpenCV running on the Zynq SoC’s or Zynq UltraScale+ MPSoC’s PS (processing system) and using HLS with its video library to generate image-processing algorithms for the Zynq SoC’s or Zynq UltraScale+ MPSoC’s PL (programmable logic, see blogs 140 to 148 here).
Xilinx’s reVision is an embedded-vision development stack that provides support for a wide range of frameworks and libraries often used for embedded-vision applications. Most exciting, from my point of view, is that the stack includes acceleration-ready OpenCV functions.
The stack itself is split into three layers. Once we select or define our platform, we will be mostly working at the application and algorithm layers. Let’s take a quick look at the layers of the stack:
As I mentioned above one of the most exciting aspects of the reVISION stack is the ability to accelerate a wide range of OpenCV functions using the Zynq SoC’s or Zynq UltraScale+ MPSoC’s PL. We can group the OpenCV functions that can be hardware-accelerated using the PL into four categories:
What is very interesting with these function calls is that we can optimize them for resource usage or performance within the PL. The main optimization method is specifying the number of pixels to be processed during each clock cycle. For most accelerated functions, we can choose to process either one or eight pixels. Processing more pixels per clock cycle reduces latency but increases resource utilization. Processing one pixel per clock minimizes the resource requirements at the cost of increased latency. We control the number of pixels processed per clock in via the function call.
Over the next few blogs, we will look more at the reVision stack and how we can use it. However in the best Blue Peter tradition, the image below shows the result of running a reVision Harris OpenCV acceleration function within the PL when accelerated.
Accelerated Harris Corner Detection in the PL
Code is available on Github as always.
If you want E book or hardback versions of previous MicroZed chronicle blogs, you can get them below.
Today, Xilinx announced a comprehensive suite of industry-standard resources for developing advanced embedded-vision systems based on machine learning and machine inference. It’s called the reVISION stack and it allows design teams without deep hardware expertise to use a software-defined development flow to combine efficient machine-learning and computer-vision algorithms with Xilinx All Programmable devices to create highly responsive systems. (Details here.)
The Xilinx reVISION stack includes a broad range of development resources for platform, algorithm, and application development including support for the most popular neural networks: AlexNet, GoogLeNet, SqueezeNet, SSD, and FCN. Additionally, the stack provides library elements such as pre-defined and optimized implementations for CNN network layers, which are required to build custom neural networks (DNNs and CNNs). The machine-learning elements are complemented by a broad set of acceleration-ready OpenCV functions for computer-vision processing.
For application-level development, Xilinx supports industry-standard frameworks including Caffe for machine learning and OpenVX for computer vision. The reVISION stack also includes development platforms from Xilinx and third parties, which support various sensor types.
The reVISION development flow starts with a familiar, Eclipse-based development environment; the C, C++, and/or OpenCL programming languages; and associated compilers all incorporated into the Xilinx SDSoC development environment. You can now target reVISION hardware platforms within the SDSoC environment, drawing from a pool of acceleration-ready, computer-vision libraries to quickly build your application. Soon, you’ll also be able to use the Khronos Group’s OpenVX framework as well.
For machine learning, you can use popular frameworks including Caffe to train neural networks. Within one Xilinx Zynq SoC or Zynq UltraScale+ MPSoC, you can use Caffe-generated .prototxt files to configure a software scheduler running on one of the device’s ARM processors to drive CNN inference accelerators—pre-optimized for and instantiated in programmable logic. For computer vision and other algorithms, you can profile your code, identify bottlenecks, and then designate specific functions that need to be hardware-accelerated. The Xilinx system-optimizing compiler then creates an accelerated implementation of your code, automatically including the required processor/accelerator interfaces (data movers) and software drivers.
The Xilinx reVISION stack is the latest in an evolutionary line of development tools for creating embedded-vision systems. Xilinx All Programmable devices have long been used to develop such vision-based systems because these devices can interface to any image sensor and connect to any network—which Xilinx calls any-to-any connectivity—and they provide the large amounts of high-performance processing horsepower that vision systems require.
Initially, embedded-vision developers used the existing Xilinx Verilog and VHDL tools to develop these systems. Xilinx introduced the SDSoC development environment for HLL-based design two years ago and, since then, SDSoC has dramatically and successfully shorted development cycles for thousands of design teams. Xilinx’s new reVISION stack now enables an even broader set of software and systems engineers to develop intelligent, highly responsive embedded-vision systems faster and more easily using Xilinx All Programmable devices.
And what about the performance of the resulting embedded-vision systems? How do their performance metrics compare against against systems based on embedded GPUs or the typical SoCs used in these applications? Xilinx-based systems significantly outperform the best of this group, which employ Nvidia devices. Benchmarks of the reVISION flow using Zynq SoC targets against Nvidia Tegra X1 have shown as much as:
There is huge value to having a very rapid and deterministic system-response time and, for many systems, the faster response time of a design that's been accelerated using programmable logic can mean the difference between success and catastrophic failure. For example, the figure below shows the difference in response time between a car’s vision-guided braking system created with the Xilinx reVISION stack running on a Zynq UltraScale+ MPSoC relative to a similar system based on an Nvidia Tegra device. At 65mph, the Xilinx embedded-vision system’s response time stops the vehicle 5 to 33 feet faster depending on how the Nvidia-based system is implemented. Five to 33 feet could easily mean the difference between a safe stop and a collision.
(Note: This example appears in the new Xilinx reVISION backgrounder.)
The last two years have generated more machine-learning technology than all of the advancements over the previous 45 years and that pace isn't slowing down. Many new types of neural networks for vision-guided systems have emerged along with new techniques that make deployment of these neural networks much more efficient. No matter what you develop today or implement tomorrow, the hardware and I/O reconfigurability and software programmability of Xilinx All Programmable devices can “future-proof” your designs whether it’s to permit the implementation of new algorithms in existing hardware; to interface to new, improved sensing technology; or to add an all-new sensor type (like LIDAR or Time-of-Flight sensors, for example) to improve a vision-based system’s safety and reliability through advanced sensor fusion.
Xilinx is pushing even further into vision-guided, machine-learning applications with the new Xilinx reVISION Stack and this announcement complements the recently announced Reconfigurable Acceleration Stack for cloud-based systems. (See “Xilinx Reconfigurable Acceleration Stack speeds programming of machine learning, data analytics, video-streaming apps.”) Together, these new development resources significantly broaden your ability to deploy machine-learning applications using Xilinx technology—from inside the cloud to the very edge.
You might also want to read “Xilinx AI Engines Steers New Course” by Junko Yoshida on the EETimes.com site.
It’s amazing what you can do with a few low-cost video cameras and FPGA-based, high-speed video processing. One example: the Virtual Flying Camera that Xylon has implemented with just four video cameras and a Xilinx Zynq Z-7000 SoC. This setup gives the driver a flying, 360-degree view of a car and its surroundings. It’s also known as a bird’s-eye view, but in this case the bird can fly around the car.
Many such implementations of this sort of video technology use GPUs for the video processing, but Xylon uses the programmable logic in the Zynq SoC using custom hardware designed with Xylon logicBRICKS IP cores. The custom hardware implemented in the Zynq SoC’s programmable logic enables very fast execution of complex video operations including camera lens-distortion corrections, video frame grabbing, video rotation, perspective changes, as well as the seamless stitching of four processed video streams into a single display output—and all this occurs in real time. This design approach assures the lowest possible video processing delay at significantly lower power consumption when compared to GPU-based implementations.
A Xylon logi3D Scalable 3D Graphics Controller soft-IP core—also implemented in the Zynq SoC’s programmable logic—renders a 3D vehicle and the surrounding view on the driver’s information display. The Xylon Surround View system permits real-time 3D image generation even in programmable SoCs without an on-chip GPU, as long as there’s programmable logic available to implement the graphics controller. The current version of the Xylon ADAS Surround View Virtual Flying Camera system runs on the Xylon logiADAK Automotive Driver Assistance Kit that is based on the Xilinx Zynq-7000 All Programmable SoC.
Here’s a 2-minute video of the Xylon Surround View system in action:
If you’re attending the CAR-ELE JAPAN show in Tokyo next week, you can see the Xylon Surround View system operating live in the Xilinx booth.
Next week, the Xilinx booth at the CAR-ELE JAPAN show at Tokyo Big Sight will hold a variety of ADAS (Advanced Driver Assistance Systems) demos based on Xilinx Zynq SoC and Zynq UltraScale+ MPSoC devices from several companies including:
The Zynq UltraScale+ MPSoC and original Zynq SoC offer a unique mix of ARM 32- and 64-bit processors with the heavy-duty processing you get from programmable logic, needed to process and manipulate video and to fuse data from a variety of sensors such as video and still cameras, radar, lidar, and sonar to create maps of the local environment.
If you are developing any sort of sensor-based electronic systems for future automotive products, you might want to come by the Xilinx booth (E35-38) to see what’s already been explored. We’re ready to help you get a jump on your design.
Aldec has posted a new 4-minute video with a demonstration of its TySOM-2 Embedded Development Kit generating a 360° view from four Blue Eagle DC3K-1-LVD video cameras plugged into an FMC-ADAS card that is in turn plugged into the TySOM-2 board. The TySOM-2 board is based on a Xilinx Zynq Z-7045 SoC. The demo uses Aldec’s Multi-Camera Surround View technology.
The four video-camera feeds appear choppy in the demo until the FPGA-based acceleration is turned on. At that point, the four video feeds appear on screen in real time with corner-detection annotation added at the full frame rate, thanks to the FPGA-based video processing.
Here’s Aldec’s new video:
This week, National Instruments (NI) announced a technology demonstration of a test system for 76-81GHz automotive radar, targeting ADAS (Advanced Driver Assistance Systems) applications. The system is based on the company’s mmWave front-end technology and its PXIe-5840 2nd-generation vector signal transceiver (VST), introduced earlier this year, which combines a 6.5GHz RF vector signal generator and a 6.5GHz vector signal analyzer in a 2-slot PXIe module. (See “NI launches 2nd-Gen 6.5GHz Vector Signal Transceiver with 5x the instantaneous bandwidth, FPGA programmability.”) The ADAS Test Solution combines NI’s banded, frequency-specific upconverters and downconverters for the 76–81GHz radar band with the 2nd-generation VST’s 1GHz of real-time bandwidth.
The PXIe-5840 VST gets its real-time signal-analysis capabilities from a Xilinx Virtex-7 690T FPGA.
National Instruments PXIe-5840 2nd-generation vector signal transceiver (VST)