Magicians are very good at creating the illusion of levitating objects, but the Institute for Integrated Systems at Ruhr University Bochum (RUB) has developed a system that does the real thing—quite precisely. The system levitates a steel ball with an electromagnet controlled by an Avnet PicoZed SOM, which in turn is based on a Xilinx Zynq-7000 SoC. An FMCW (frequency-modulated continuous-wave) radar module jointly developed by RUB and the Fraunhofer Institute senses the ball’s position, and that data feeds a PID control loop that sets the pulse-width-modulated current supplied to the electromagnet that levitates the ball.
FMCW radar sensor module jointly developed by RUB and the Fraunhofer Institute
The entire system was developed using the Xilinx SDSoC development environment, with hardware acceleration applied to the critical paths in the control loop, resulting in fast, repeatable, real-time system response. The un-accelerated code runs on the Zynq SoC’s dual-core ARM Cortex-A9 processor, and the code translated into hardware by SDSoC resides in the Zynq SoC’s programmable logic. SDSoC seamlessly manages the interaction between the system’s software and the hardware accelerators, and the Zynq SoC provides a single-chip solution to the sensor-driven-control design problem.
Here’s a 3-minute video that captures the entire demo:
It’s amazing what you can do with a few low-cost video cameras and FPGA-based, high-speed video processing. One example: the Virtual Flying Camera that Xylon has implemented with just four video cameras and a Xilinx Zynq-7000 SoC. This setup gives the driver a flying, 360-degree view of a car and its surroundings. It’s also known as a bird’s-eye view, but in this case the bird can fly around the car.
Many implementations of this sort of video technology use GPUs for the video processing, but Xylon performs the processing in the Zynq SoC’s programmable logic with custom hardware built from Xylon logicBRICKS IP cores. The custom hardware implemented in the Zynq SoC’s programmable logic enables very fast execution of complex video operations including camera lens-distortion correction, video frame grabbing, video rotation, perspective changes, and the seamless stitching of four processed video streams into a single display output—and all of this occurs in real time. This design approach ensures the lowest possible video-processing delay at significantly lower power consumption than GPU-based implementations.
A Xylon logi3D Scalable 3D Graphics Controller soft-IP core—also implemented in the Zynq SoC’s programmable logic—renders a 3D vehicle and the surrounding view on the driver’s information display. The Xylon Surround View system permits real-time 3D image generation even in programmable SoCs without an on-chip GPU, as long as there’s programmable logic available to implement the graphics controller. The current version of the Xylon ADAS Surround View Virtual Flying Camera system runs on the Xylon logiADAK Automotive Driver Assistance Kit that is based on the Xilinx Zynq-7000 All Programmable SoC.
Here’s a 2-minute video of the Xylon Surround View system in action:
If you’re attending the CAR-ELE JAPAN show in Tokyo next week, you can see the Xylon Surround View system operating live in the Xilinx booth.
Jan Gray’s FPGA.org site has just published a blog post detailing the successful test of the GRVI Phalanx massively parallel accelerator framework, with 1680 open-source RISC-V processor cores running simultaneously on one Xilinx Virtex UltraScale+ VU9P. (That’s a mid-sized Virtex UltraScale+ FPGA.) According to the post, this is the first example of a kilocore RISC-V implementation and represents “the most 32-bit RISC cores on a chip in any technology.”
That’s certainly worth a picture (is a picture worth 1000 cores?):
1680 RISC-V processor cores run simultaneously on a Xilinx VCU118 eval kit with a Virtex UltraScale+ VU9P FPGA
The GRVI Phalanx design consists of 210 processing clusters, each comprising eight RISC-V processor cores, 128Kbytes of multiported RAM, and a 300-bit Hoplite NoC router. Here’s a block diagram of one such Phalanx cluster:
GRVI Phalanx Cluster Block Diagram
Note: Jan Gray contacted Xcell Daily after this post first appeared and wanted to clarify that the RISC-V ISA may be open-source and there may be open-source implementations of the RISC-V processor, but the multicore GRVI Phalanx is a commercial design and is not open-source.
Yesterday, National Instruments (NI) along with 15 partners announced the grand opening of the new NI Industrial IoT Lab located at NI’s headquarters in Austin, TX. The lab is a working showcase for Industrial IoT technologies, solutions and systems architectures and will address challenges including interoperability and security in the IIoT space. The partner companies working with NI on the lab include:
NI’s Jamie Smith (on the left), NI’s Business and Technology Director, opens the new NI Industrial IoT Lab in Austin, TX
Next week, the Xilinx booth at the CAR-ELE JAPAN show at Tokyo Big Sight will hold a variety of ADAS (Advanced Driver Assistance Systems) demos based on Xilinx Zynq SoC and Zynq UltraScale+ MPSoC devices from several companies including:
The Zynq UltraScale+ MPSoC and the original Zynq SoC offer a unique mix of 32- and 64-bit ARM processors plus the heavy-duty processing of programmable logic—needed to process and manipulate video and to fuse data from a variety of sensors such as video and still cameras, radar, lidar, and sonar into maps of the local environment.
If you are developing any sort of sensor-based electronic systems for future automotive products, you might want to come by the Xilinx booth (E35-38) to see what’s already been explored. We’re ready to help you get a jump on your design.
Avnet has just announced the 1x1 version of its PicoZed SDR 2x2 SOM that you can use for rapid development of software-defined radio applications. The 62x100mm form factor for the PicoZed SDR 1x1 SOM is the same as that used for the 2x2 version but the PicoZed SDR 1x1 SOM uses the Analog Devices AD9364 RF Agile Transceiver instead of the AD9361 used in the PicoZed SDR 2x2 SOM. Another difference is that the 2x2 version of the PicoZed SDR SOM employs a Xilinx Zynq Z-7035 SoC and the 1x1 SOM uses a Zynq Z-7020 SoC.
Avnet’s Zynq-based PicoZed SDR 1x1 SOM
One final difference: The Avnet PicoZed SDR 1x1 sells for $549 and the PicoZed SDR 2x2 sells for $1095. So if you liked the idea of the original PicoZed SDR SOM but wished for a lower-cost entry point, your wish is granted, with immediate availability.
Work started on CCIX, the cache-coherent interconnect for accelerators, a little over a year ago. The CCIX specification describes an interconnect that makes workload handoff from server CPUs to hardware accelerators as simple as passing a pointer. This capability enables a whole new class of accelerated data center applications.
Xilinx VP of Silicon Architecture Gaurav Singh discussed CCIX at the recent Xilinx Technology Briefing held at SC16 in Salt Lake City. His talk covers many CCIX details, and you can watch him discuss these topics in this 9-minute video from the briefing:
The video below shows Ravi Sunkavalli, the Xilinx Sr. Director of Data Center Solutions, discussing how advanced FPGAs like devices based on the Xilinx UltraScale architecture can aid you in developing high-speed networking and storage equipment as data centers migrate to faster internal networking speeds. Sunkavalli posits that CPUs, which are largely used for networking and storage applications connected with today’s 10G networks, quickly run out of gas at 40G and 100G networking speeds. FPGAs can provide “bump-in-the-wire” acceleration for high-speed networking ports thanks to the large number of fast compute elements and the high-speed transceivers incorporated into devices like the Xilinx UltraScale and UltraScale+ FPGAs.
Examples of networking applications already handled by FPGAs include VNF (Virtual Network Functions) such as VPNs, firewalls, and security. FPGAs are already being used to implement high-speed data center storage functions such as error correction, compression, and security.
The following 8-minute video was recorded during a Xilinx technology briefing at the recent SC16 conference in Salt Lake City:
All Internet-connected video devices produce data streams that are processed somewhere in the cloud, said Xilinx Chief Video Architect Johan Janssen during a talk at November’s SC16 conference in Salt Lake City. FPGAs are well suited to video acceleration and deliver better compute density than cloud servers based on microprocessors. One example Janssen gave during his talk shows a Xilinx Virtex UltraScale VU190 FPGA improving the video-stream encoding rate from 3fps to 60fps while cutting power consumption in half compared to a popular Intel Xeon microprocessor executing the same encoding task. In power-constrained data centers, that’s a 40x efficiency improvement with no increase in electrical or heat load. In other words, it costs a lot less operationally to use FPGAs for video encoding in data centers.
Here’s the 7-minute video of Janssen’s talk at SC16:
Last November at SC16 in Salt Lake City, Xilinx Distinguished Engineer Ashish Sirasao gave a 10-minute talk on deploying deep-learning applications using FPGAs with significant performance/watt benefits. Sirasao started by noting that we’re already knee-deep in machine-learning applications: spam filters; cloud-based and embedded voice-to-text converters; and Amazon’s immensely successful, voice-operated Alexa are all examples of extremely successful machine-learning apps in broad use today. More—many more—will follow. These applications all have steep computing requirements.
There are two phases in any machine-learning application. The first is training and the second is deployment. Training is generally done using floating-point implementations so that application developers need not worry about numeric precision. Training is a one-time event, so energy efficiency isn’t all that critical.
Deployment is another matter, however.
Putting a trained deep-learning application in a small appliance like Amazon’s Alexa calls for attention to factors such as energy efficiency. Fortunately, said Sirasao, the arithmetic precision of the application can change from training to mass deployment and there are significant energy-consumption gains to be had by deploying fixed-point machine-learning applications. According to Sirasao, you can get accurate machine inference using 8- or 16-bit fixed-point implementations while realizing a 10x gain in energy efficiency for the computing hardware and a 4x gain in memory energy efficiency.
The Xilinx DSP48E2 block implemented in the company’s UltraScale and UltraScale+ devices is especially useful for these machine-learning deployments because its DSP architecture can perform two independent 8-bit operations per clock per DSP block. That translates into nearly double the compute performance, which in turn results in much better energy efficiency. There’s a Xilinx White Paper on this topic titled “Deep Learning with INT8 Optimization on Xilinx Devices.”
Further, Xilinx recently announced its Acceleration Stack for machine learning (and other cloud-based applications), which allows you to focus on developing your application rather than on FPGA programming. You can learn about the Xilinx Acceleration Stack here.
Finally, here’s the 10-minute video with Sirasao’s SC16 talk:
Nextera Video is helping the broadcast video industry migrate to video-over-IP as quickly as possible with an FPGA IP core developed for Xilinx UltraScale and other Xilinx FPGAs that compresses 4K video using Sony’s low-latency, noise-free NMI (Network Media Interface) packet protocols to achieve compression ratios of 3:1 to 14:1. The company’s products can transport compressed 4Kp60 video between all sorts of broadcast equipment over standard 10G IP switches, which significantly lowers equipment and operating costs for broadcasters.
Here’s a quick video that describes Nextera’s approach:
By Adam Taylor
As I discussed last week, one method we can use to reduce the power is to put the Zynq SoC in low-power mode when we detect that the system is idle. The steps required to enter and leave the low-power mode appear in the technical reference manual (section 24.4 of UG585). However, it’s always good to see an actual example to understand the power savings we get by entering this mode.
We’ll start with a running system. The current draw (344.9mA) appears on the DMM’s display in the upper left part of this image:
MicroZed with the DMM measuring current
We follow these steps from the TRM to place the Zynq SoC’s ARM Cortex-A9 processor into sleep mode:
Implementing most of these steps requires the standard Xil_Out32() approach to modify the desired register contents, as we have done in many examples throughout this blog series. There are, however, some registers we need to interact with using inline assembly language. We will now look at this in more detail because it is a little different from using the Xil_OutXX() functions.
We need to use assembler for two reasons: first, to interact with the CP15 coprocessor registers, and second, to execute the wait-for-interrupt (WFI) instruction. You will notice that the CP15 registers are not defined within the TRM, so no address-space details are provided. However, we can still access them from within our SDK application.
We’ll use a bare-metal approach to demonstrate how we enter sleep mode. The generated BSP will provide the functions and macros to interact with the Zynq SoC’s CP15 registers. There are three files that we need to use:
We can use two macros contained within xpseudo_asm_gcc.h to perform the accesses we need to make to the CP15 power-control register. These are the macros MFCP, which allows us to read a register, and MTCP, which allows us to write to one. The definition of the register we want to target can be found in the file xreg_cortexa9.h, as shown in the image below:
Actual code within the power-down application
The last element we need is the final WFI instruction to wait for the wake-up source interrupt. Again, we use inline assembler, just as we did previously when we issued the SEV instruction to wake up the second processor as part of the AMP example.
Defining the WFI instruction
When I put all of this together and ran the code on the MicroZed dev board, I noted a 100mA drop in the overall current draw, which equates to a 29% drop in power—from 1.72W to 1.22W. You can see the overall effect in this image. Note the new reading on the DMM.
Resultant current consumption after entering sleep mode
This is a considerable reduction in power, but you may be surprised it is not larger. Remember that elements of the Zynq SoC remain powered up. We can power these elements down as well to achieve even lower power dissipation. For example, we can power down the Zynq SoC’s PL. While powering down the PL results in a longer wake-up time, because the PL must be reconfigured after waking, the resulting power saving is greater. This does require a board-level power design that can switch off specific voltage rails.
Next week we will look at how we can develop our PL application for lower power dissipation in operation.
Code is available on GitHub as always.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
All of Adam Taylor’s MicroZed Chronicles are cataloged here.