Xylon has a new hardware/software development kit for quickly implementing embedded, multi-camera vision systems for ADAS and AD (autonomous driving), machine-vision, AR/VR, guided robotics, drones, and other applications. The new logiVID-ZU Vision Development Kit is based on the Xilinx Zynq UltraScale+ MPSoC and includes four Xylon 1Mpixel video cameras based on the TI FPD (flat-panel display) Link-III interface. The kit supports HDMI video input and display output and comes complete with extensive software deliverables including pre-verified camera-to-display SoC designs built with licensed Xylon logicBRICKS IP cores, reference designs and design examples prepared for the Xilinx SDSoC Development Environment, and complete demo Linux applications.
Xylon’s new logiVID-ZU Vision Development Kit
Please contact Xylon for more information about the new logiVID-ZU Vision Development Kit.
Earlier this week at San Jose State University (SJSU), Jim Hogan, one of Silicon Valley’s most successful venture capitalists, gave a talk on the disruptive effects that cognitive science and AI are already having on society. In a short portion of that talk, Hogan discussed how he and one of his teams developed the world’s most experienced lung-cancer radiologist—an AI app—for $75:
US Center for Disease Control lung imaging database--$25
Cloud storage for the 4Tbyte data base--$25
Time for training the AI using the CDC database using Amazon’s AWS--$25
Hogan’s trained AI radiologist can look at lung images and find possibly cancerous tumors based on thousands of cases in the CDC database. However, said Hogan, the US Veterans Administration has a database with millions of cases. Yes, his team used that database for training too.
Hogan predicted that something like 25 million AI apps like his lung-cancer-specific radiologist will be developed over the next few years. His $75 example is meant to prove the cost feasibility of developing that many useful apps.
SDAccel—Xilinx’s development environment for accelerating cloud-based applications using C, C++, or OpenCL—is now available on Amazon’s AWS EC2 F1 instance. (Formal announcement here.) The Amazon EC2 F1 compute instance allows you to create custom hardware accelerators for your application using cloud-based server hardware that incorporates multiple Xilinx Virtex UltraScale+ VU9P FPGAs. SDAccel automates the acceleration of software applications by building application-specific FPGA kernels for the AWS EC2 F1. You can also use HDLs including Verilog and VHDL to define hardware accelerators in SDAccel. With this release, you can access SDAccel through the AWS FPGA developer AMI.
For more information about Amazon’s AWS EC2 F1 instance, see:
Brandon Treece from National Instruments (NI) has just published an article titled “CPU or FPGA for image processing: Which is best?” on Vision-Systems.com. NI offers a Vision Development Module for LabVIEW, the company’s graphical systems design environment, and can run vision algorithms on CPUs and FPGAs, so the perspective is a knowledgeable one. Abstracting the article, what you get from an FPGA-accelerated imaging pipeline is speed. If you’re performing four 6msec operations on each video frame, a CPU will need 24msec (four times 6msec) to complete the operations while an FPGA offers you parallelism that shortens processing time for each operation and permits overlap among the operations, as illustrated in this figure taken from the article:
In this example, the FPGA needs a total of 6msec to perform the four operations and another 2msec to transfer a video frame back and forth between processor and FPGA. The CPU needs a total of 24msec for all four operations. The FPGA needs 8msec, for a 3x speedup.
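The arithmetic in this example can be sketched in a few lines of Python. This is only an illustrative model, not code from the article: the function names are invented here, and the model assumes the CPU runs the four operations strictly one after another while the FPGA's overlapped pipeline completes all of them in the 6msec quoted above.

```python
def cpu_frame_time_ms(op_times_ms):
    """Sequential execution: the per-operation times simply add up."""
    return sum(op_times_ms)


def fpga_frame_time_ms(pipeline_ms, transfer_ms):
    """Overlapped execution plus the cost of moving the frame to and from the FPGA."""
    return pipeline_ms + transfer_ms


cpu_ms = cpu_frame_time_ms([6, 6, 6, 6])   # four 6msec operations -> 24
fpga_ms = fpga_frame_time_ms(6, 2)         # 6msec pipeline + 2msec transfer -> 8
speedup = cpu_ms / fpga_ms                 # 24 / 8 -> 3.0
```

Note that the transfer time counts against the FPGA. If the transfer took as long as the processing it saves, the advantage would disappear.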
Treece then demonstrates that the acceleration is actually much greater in the real world. He uses the example of a video processing sequence needed for particle counting that includes these three major steps:
Convolution filtering to sharpen the image
Thresholding to produce a binary image
Morphology to remove holes in the binary particles
Here’s an image series that shows you what’s happening at each step:
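For readers who want to experiment with the three steps, here is a minimal pure-Python sketch on a tiny synthetic image. It is not the NI or LabVIEW implementation. The sharpening kernel, the threshold level, and all function names are assumptions chosen for illustration, and the hole removal is done with a 3x3 morphological closing, which only fills small holes.

```python
SHARPEN = [[0, -1, 0],
           [-1, 5, -1],
           [0, -1, 0]]  # assumed kernel; the article does not specify one


def _clamp(value, low, high):
    return max(low, min(value, high))


def filter3x3(img, kernel):
    """Slide a 3x3 kernel over img, replicating edge pixels at the border.
    The kernel above is symmetric, so correlation and convolution coincide."""
    h, w = len(img), len(img[0])
    return [[sum(kernel[ky][kx]
                 * img[_clamp(y + ky - 1, 0, h - 1)][_clamp(x + kx - 1, 0, w - 1)]
                 for ky in range(3) for kx in range(3))
             for x in range(w)] for y in range(h)]


def threshold(img, level):
    """Produce a binary image: 1 where the pixel reaches the level, else 0."""
    return [[1 if p >= level else 0 for p in row] for row in img]


def _morph(img, combine):
    """Apply max (dilation) or min (erosion) over each 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    return [[combine(img[_clamp(y + dy, 0, h - 1)][_clamp(x + dx, 0, w - 1)]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1))
             for x in range(w)] for y in range(h)]


def close_holes(binary):
    """Morphological closing: dilation followed by erosion."""
    return _morph(_morph(binary, max), min)


# Synthetic 7x7 frame: dark background, one bright 3x3 particle with a hole.
image = [[10] * 7 for _ in range(7)]
for y in range(2, 5):
    for x in range(2, 5):
        image[y][x] = 200
image[3][3] = 10  # the hole in the middle of the particle

binary = threshold(filter3x3(image, SHARPEN), 100)  # 8 particle pixels, hole still open
closed = close_holes(binary)                        # hole filled: solid 3x3 particle
```

Each output pixel in every step depends only on a small neighbourhood of input pixels, which is why this kind of pipeline maps well onto parallel FPGA hardware.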
Using the NI Vision Development Module for LabVIEW, he then runs the algorithm on an NI cRIO-9068 CompactRIO controller, which is based on a Xilinx Zynq Z-7020 SoC. Running the algorithm on the Zynq SoC’s ARM Cortex-A9 processor takes 166.7msec per frame. Running the same algorithm but accelerating the video processing using the Zynq SoC’s integral FPGA hardware takes 8msec. Add in another 0.5msec for DMA transfer of the pre- and post-processed video frame back and forth between the Zynq SoC’s CPU and FPGA and you get about a 20x speedup (166.7msec versus 8.5msec).
A key point here is that because the cRIO-9068 controller is based on the Zynq SoC, and because NI’s Vision Development Module for LabVIEW supports FPGA-based algorithm acceleration, this is an easy choice to make. The resources are there for your use. You merely need to click the “Go-Fast” button.
For more information about NI’s Vision Development Module for LabVIEW and cRIO-9068 controller, please contact NI directly.
According to the IEEE paper, the Zynq-based BNN is 136.8x faster and 44.7x more power efficient than the same CNN running on an ARM Cortex-A57 processor. Compared to the same CNN running on an Nvidia Maxwell GPU, the Zynq-based BNN is 4.9x faster and 3.8x more power efficient.
The Xilinx Technology Showcase 2017 will highlight FPGA-acceleration as used in Amazon’s cloud-based AWS EC2 F1 Instance and for high-performance, embedded-vision designs—including vision/video, autonomous driving, Industrial IoT, medical, surveillance, and aerospace/defense applications. The event takes place on Friday, October 6 at the Xilinx Summit Retreat Center in Longmont, Colorado.
The Xilinx Hackathon is a 30-hour marathon event being held at the Xilinx “Retreat” (also known as the Xilinx Colorado facility in Longmont, but see the image below), starting on October 7. The organizers are looking for no more than 35 heroic coders who will receive a Python-programmable, Zynq-based Digilent/Xilinx PYNQ-Z1 board and an assortment of Arduino-compatible shields and sensors. The intent, as Zaphod Beeblebrox might say, is to create something not just amazing but “amazingly amazing.”
Xilinx Colorado, Longmont Facility
Want to compete as a team? No problem. The Xilinx Hackathon rules allow teams as large as four people, but the body count is capped at 35 so if you have a large team you’d better get your name on the invite list early. Better yet, get your name on the list early even if you’re competing solo.
How much does it cost to enter? Zero. Zip, Nada, Nothing.
What are the prizes? We’re offering more than $2000 in cash prizes plus all competitors keep their PYNQ-Z1 boards. Also, winners and other amazingly amazing projects will get incredible, amazingly amazing recognition in the Xcell Daily blog, which will be covering this event. Your fame is assured.
What should you bring: “A laptop, laptop charger, phone, charger(s), headphones, a pillow, toiletries, an extra set of clothes, and a water bottle.” (It’s a 30-hour hackathon, but it’s at the Xilinx retreat (see photo, again)). Xilinx will provide you with three meals per day (good hackers will figure out how many meals they’ll get in 30 hours) as well as snacks, drinks, and caffeine stimulation.
Where do I sign up? Here. A crack Xilinx team will hand-select and invite the lucky 35 participants from this list. (That’s the Final Five times seven.)
In case you’ve not read about it, the PYNQ project is an open-source project from Xilinx that makes it easy to design high-performance embedded systems using Xilinx Zynq Z-7000 SoCs. Here’s what’s on the PYNQ-Z1 board:
Xilinx Zynq Z-7020 SoC with a dual-core ARM Cortex-A9 MPCore processor running at 650MHz
Video Direct Memory Access (VDMA) is one of the key IP blocks used within many image-processing applications. It allows frames to be moved between the Zynq SoC’s and Zynq UltraScale+ MPSoC’s PS and PL with ease. Once the frame is within the PS domain, we have several processing options available. We can implement high-level image processing algorithms using open-source libraries such as OpenCV and acceleration stacks such as the Xilinx reVISION stack if we wish to process images at the edge. Alternatively, we can transmit frames over Gigabit Ethernet, USB3, PCIe, etc. for offline storage or later analysis.
It can be infuriating when our VDMA-based image-processing chain does not work as intended. Therefore, we are going to look at a simple VDMA example and the steps we can take to ensure that it works as desired.
The simple VDMA example shown below contains the basic elements needed to provide VDMA output to a display. The processing chain starts with a VDMA read that obtains the current frame from DDR memory. To correctly size the data stream width, we use an AXIS subset converter to convert 32-bit data read from DDR memory into a 24-bit format that represents each RGB pixel with 8 bits per color. Finally, we output the image with an AXIS-to-video output block that converts the AXIS stream to parallel video with video data and sync signals, using timing provided by the Video Timing Controller (VTC). We can use this parallel video output to drive a VGA, HDMI, or other video display output with an appropriate PHY.
This example outlines a read case from the PS to the PL and corresponding output. This is a more complicated case than performing a frame capture and VDMA write because we need to synchronize video timing to generate an output.
Simple VDMA-Based Image-Processing Pipeline
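As a rough software picture of what the 32-bit to 24-bit width conversion does to each pixel, consider the sketch below. It is an assumption-laden illustration rather than a description of the IP core: it assumes the unused byte is the most-significant one and that the channels are ordered red, green, blue from the top down. The real mapping depends on how the subset converter's data remap is configured, and the function name is invented.

```python
def word32_to_rgb24(word):
    """Keep the low 24 bits of a 32-bit word and split them into 8-bit channels.

    Assumed layout (hypothetical): bits 31..24 unused, then R, G, B.
    """
    pixel = word & 0xFFFFFF          # drop the unused upper byte
    red = (pixel >> 16) & 0xFF
    green = (pixel >> 8) & 0xFF
    blue = pixel & 0xFF
    return pixel, (red, green, blue)


pixel, rgb = word32_to_rgb24(0xFF112233)   # -> 0x112233, (0x11, 0x22, 0x33)
```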
So what steps can we take if the VDMA-based image pipeline does not function as intended? To correct the issue:
Check Reset and Clocks as we would when debugging any application. Ensure that the reset polarity is correct for each module as there will be mixed polarities. Ensure that the pixel clock is correct for the required video timing and that it is supplied to both the VTC and the AXIS-to-Video Out blocks. The clock for the AXIS network, meanwhile, must be able to support the image throughput.
Check that the Clock Enables on both the VTC and AXIS-to-Video-Out blocks are tied to the correct level to enable the clocks.
Check that the VTC is correctly configured, especially if you are using the AXI interface to define the configuration through the application software. When configuring the VTC using AXI, it is important to make sure we have set the source registers to the VTC generator, enabled register updates, and defined the timing parameters required.
Check the connections between the VTC and AXIS-to-Video-Out Blocks. Ensure that the horizontal and vertical blanking signals are also connected along with the horizontal and vertical syncs.
Check the AXIS-to-Video-Out timing mode. If we are using VDMA, the timing mode of the AXIS-to-Video-Out block should be set to master. This enables the AXIS-to-Video-Out block to assert back pressure on the AXIS data stream to halt the frame buffer output. This mechanism permits the AXIS-to-Video-Out block to manage the flow of pixels by enabling synchronization and lock. You may also want to increase the size of the internal buffer from the default.
Check that the AXIS-to-Video-Out VTC_ce signal is not connected to the VTC gen clock enable as is the case when configured for slave operation. Such a connection will prevent the AXIS-to-Video-Out block from being able to lock to the AXIS video stream.
Insert ILAs. Inserting these within the design allows us to observe the detailed workings of the AXI buses. When commissioning a new image processing pipeline, I insert ILA blocks on the VTC output and the VDMA MM-to-AXIS port so that I can observe the generated timing signals and VDMA output stream. When observing the AXI Stream, the tuser signal identifies the start of frame and the tlast signal represents the end of line. You may also want to observe the AXIS-to-Video-Out 32-bit status output, which provides indication of the locked status along with additional debug information.
Ensure that HSize and Stride are set correctly. These are defined by the application software and configure the VDMA with frame-store information. HSize represents the horizontal size of the image and Stride represents the distance in memory between the image lines. Both HSize and Stride are defined in bytes. As such, when working with U32 or U16 types, take care to correctly set these values to reflect the number of bytes used.
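Because HSize and Stride are byte counts rather than pixel counts, a small helper like the one below can guard against the most common mistake. It is a sketch only: the function name and the optional stride-alignment parameter are hypothetical and not part of any Xilinx driver API.

```python
def vdma_line_params(width_px, bytes_per_pixel, stride_alignment=1):
    """Return (hsize, stride) in bytes for one image line.

    stride_alignment is a hypothetical knob for designs that pad each
    line in memory; with the default of 1, stride equals hsize.
    """
    hsize = width_px * bytes_per_pixel
    # Round hsize up to the next multiple of stride_alignment.
    stride = -(-hsize // stride_alignment) * stride_alignment
    return hsize, stride


# 1280-pixel line stored as 32-bit (U32) pixels: 5120 bytes, not 1280.
hsize_u32, stride_u32 = vdma_line_params(1280, 4)        # -> (5120, 5120)

# 1366-pixel line of 3-byte pixels with lines padded to 64-byte boundaries.
hsize_pad, stride_pad = vdma_line_params(1366, 3, 64)    # -> (4098, 4160)
```

Stride must never be smaller than HSize. If it is, successive lines overlap in memory and the displayed image appears sheared.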
Hopefully by the time you have checked these points, the issue with your VDMA-based image processing pipeline will have been identified and you can start developing the higher-level image processing algorithms needed for the application.
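The tuser and tlast framing observed with the ILA can also be checked offline once the samples have been exported. The sketch below is a hypothetical helper, not a Xilinx tool. It assumes one pixel per stream beat and that the capture has already been reduced to a list of (tuser, tlast) pairs covering exactly one frame.

```python
def check_frame(beats, width, height):
    """Return a list of framing problems found in one captured frame."""
    errors = []
    line_lengths = []
    count = 0
    for index, (tuser, tlast) in enumerate(beats):
        if index == 0 and not tuser:
            errors.append("first beat lacks tuser (start of frame)")
        if index != 0 and tuser:
            errors.append(f"unexpected tuser at beat {index}")
        count += 1
        if tlast:                      # tlast marks the end of a line
            line_lengths.append(count)
            count = 0
    if count:
        errors.append(f"{count} trailing beats without tlast")
    for number, length in enumerate(line_lengths):
        if length != width:
            errors.append(f"line {number} has {length} pixels, expected {width}")
    if len(line_lengths) != height:
        errors.append(f"frame has {len(line_lengths)} lines, expected {height}")
    return errors


def make_frame(width, height):
    """Build an ideal capture for testing the checker."""
    return [(x == 0 and y == 0, x == width - 1)
            for y in range(height) for x in range(width)]


good = check_frame(make_frame(4, 3), 4, 3)        # -> []
short = make_frame(4, 3)
del short[1]                                      # drop one pixel from line 0
bad = check_frame(short, 4, 3)                    # reports the short line
```

A line-length mismatch reported by a check like this usually points back to the HSize setting or the data-width conversion rather than to the video timing.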
Why aren’t you getting all of the performance that you expect after moving a task or tasks from the Zynq PS (processing system) to its PL (programmable logic)? If you used SDSoC to develop your embedded design, there’s help available. Here’s some advice from DornerWorks, a Premier Xilinx Alliance Program member. This blog is adapted from a recent post on the DornerWorks Web site titled “Fine Tune Your Heterogeneous Embedded System with Emulation Tools.”
Thanks to Xilinx’s SDSoC Development Environment, offloading portions of your software algorithm to a Zynq SoC’s or Zynq UltraScale+ MPSoC’s PL (programmable logic) to meet system performance requirements is straightforward. Once you have familiarized yourself with SDSoC’s data-transfer options for moving data back and forth between the PS and PL, you can select the appropriate data mover that represents the best choice for your design. SDSoC’s software estimation tool then shows you the expected performance results.
Yet when performing the ultimate test of execution—on real silicon—the performance of your system sometimes fails to match expectations and you need to discover the cause… and the cure. Because you’ve offloaded software tasks to the PL, your existing software debugging/analysis methods do not fully apply because not all of the processing occurs in the PS.
You need to pinpoint the cause of the unexpected performance gap. Perhaps you made a sub-optimal choice of data mover. Perhaps the offloaded code was not a good candidate for offloading to the PL. You cannot cure the performance problem without knowing its cause.
Just how do you investigate and debug system performance on a Zynq-based heterogeneous embedded system with part of the code running in the PS and part in the PL?
If you are new to the world of debugging PL data processing, you may not be familiar with the options you have for viewing PL data flow. Fortunately, if you used SDSoC to accelerate software tasks by offloading them to the PL, there is an easy solution. SDSoC has an emulation capability for viewing the simulated operation of your PL hardware that uses the context of your overall system.
This emulation capability allows you to identify any timing issues with the data flow into or out of the auto-generated IP blocks that accelerate your offloaded software. The same capability can also show you if there is an unexpected slowdown in the offloaded software acceleration itself.
Using this tool can help you find performance bottlenecks. You can investigate these potential bottlenecks by watching your data flow through the hardware via the displayed emulation signal waveforms. Similarly, you can investigate the interface points by watching the data signals transfer data between the PS and the PL. This information provides key insights that help you find and fix your performance issues.
To demonstrate how you can debug/emulate a hardware-accelerated function, we will focus for simplicity on one IP block: the matrix multiplier (Mmult) IP block from the Xilinx Multiply and Add (MMADD) example, shown in Figure 1.
Figure 1: Multiplier IP block with Port A expanded to show its signals
We will look at the waveforms for the signals to and from this Mmult IP block in the emulation. Specifically, we will view the A_PORTA signals as shown in the figure above. These signals represent the data input for matrix A, which corresponds to the software input parameter A of the matrix multiplier function.
To get started with the emulation, enable generation of the “emulation model” configuration for the build in the SDSoC project settings, as shown in Figure 2.
Figure 2: The mmult Project Settings needed to enable emulation
Next, rebuild your project as normal. After building your project with emulation model support enabled in the configuration, run the emulator by selecting “Start/Stop Emulation” under the “Xilinx Tools” menu option. When a window opens, select “Start” to start the emulator. SDSoC will then automatically launch an instance of Xilinx Vivado, which triggers the auto-generated PL project that SDSoC created for you as a subproject within your SDSoC project.
We specifically want to view the A_PORTA signals of the Mmult IP block. These signals must be added to the Wave Window to be viewed during a simulation. The available Mmult signals can be viewed in the Objects pane by selecting the mmult_1 block in the Scopes pane. To add the A_PORTA signals to the Wave Window, select all of the “A_*” signals in the Objects pane, right click, and select “Add to Wave Window” as shown in Figure 3.
Now you can run the emulation and view the signal states in the waveform viewer. Start the emulator by clicking “Run All” from the “Run” drop-down menu as shown in Figure 4.
Figure 4: Start emulation of the PL
Back in SDSoC’s toolchain environment, you can now run a debugging session that connects to this emulation session as it would to your software running on the target. From the “Run” menu option, select “Debug As -> 1 Launch on Emulator (SDSoC Debugger)” to start the debug session as shown in Figure 5.
Figure 5: Connect Debug Session to run the PL emulation
Now you can step or run through your application test code and view the signals of interest in the emulator. Shown below in Figure 6 are the A_PORTA signals we highlighted earlier and their signal values at the end of the PL logic operation using the Mmult and Add example test code.
Figure 6: Emulated mmult_1 signal waveforms
These signals tell us a lot about the performance of the offloaded code now running in the PL and we used familiar emulation tools to obtain this troubleshooting information. This powerful debugging method can help illuminate unexpected behavior in your hardware-accelerated C algorithm by allowing you to peer into the black box of PL processing, thus revealing data-flow behavior that could use some fine-tuning.
How does an engineer already experienced and comfortable with working in the Zynq SoC’s software-based PS (processing system) domain take advantage of the additional flexibility and processing power of the Zynq SoC’s PL (programmable logic)? The traditional method is through education and training to learn to program the PL using an HDL such as Verilog or VHDL. Another way is to learn and use a tool that allows you to take a software-based design written exclusively for the ARM 32-bit processors in the PS and transfer some or most of the tasks to the PL, without writing HDL descriptions.
One such tool is Xilinx’s Vivado High Level Synthesis (HLS). By leveraging the capabilities of HLS, you can prototype a design using the Zynq PS and then move functionality to the PL to boost performance. The advantage of this tool is that it generates IP blocks that can be used in the programmable logic of Xilinx FPGAs as well as Xilinx Zynq SoCs and Zynq UltraScale+ MPSoCs.
Logic optimization occurs when Vivado HLS synthesizes your algorithm’s C model and creates RTL. There are code directives (essentially guidelines for the tools’ optimization process) available that allow you to guide the HLS tool’s synthesis from the C model source to the RTL bitstream programmed into the FPGA. If you are working with an existing algorithm modeled in C, C++, or SystemC and need to implement this algorithm in custom logic for added performance, then HLS is a great tool choice.
However, be aware that the data movers that transfer data between the Zynq PS and the PL must be manually configured for performance when using Vivado HLS. This can become a complicated process when there’s significant data transfer between the domains.
A recent innovation that simplifies data-mover configuration is the development of the Xilinx SDSoC (Software-Defined System on Chip) Development Environment for use with Zynq SoCs and Zynq UltraScale+ MPSoCs. SDSoC builds on Vivado HLS’ capabilities by using HLS to perform the C-to-RTL conversion but with the convenient addition of automatically generated data movers, which greatly simplifies configuring the connection between the software running on the Zynq PS and the accelerated algorithm executing in the Zynq PL. SDSoC also allows you to guide data-mover generation by providing a set of pragmas to make specific data-mover choices. The SDSoC directive pragmas give you control over the automatically generated data movers but still require some minimal manual configuration. Code-directive pragmas for RTL optimization available in Vivado HLS are also available in SDSoC and can be used in tandem with SDSoC pragmas to optimize both the PL algorithm and the automatically generated data movers.
It is possible to disable the SDSoC auto-generated data movers and only use the HLS optimizations. Shown below are two IP block diagrams: one generated with the auto-configured SDSoC data movers and one without them.
The following screen shots are taken from a Xilinx-provided template project demonstrating the acceleration of a software matrix multiplication and addition algorithm, provided with the SDx installation. We used the SDx 2016.4 toolchain and targeted an Avnet Zedboard with a standalone OS configuration for this example.
Here is a screen shot of the same block, but without the SDSoC data movers. (We have disabled the automatic generation of data movers within SDSoC by manually declaring the AXI HLS interface directives for both the mmult and madd accelerated IP blocks.)
To achieve the best algorithm performance, be prepared to familiarize yourself with and use both the SDSoC and Vivado HLS user guides and datasheets. SDSoC provides a superset of Vivado HLS’s capabilities.
If you are developing and accelerating your model from first principles but want to take advantage of the flexibility of testing and proving out a design in software first, and you don’t intend to use a Zynq SoC, then using the Vivado HLS toolset straightaway is the place to start. A design started in HLS is transferable to SDSoC if requirements change. Alternatively, if using a Zynq-based system is possible, it would be worthwhile to start right away with SDSoC.
Amazon Web Services (AWS) is now offering the Xilinx SDAccel Development Environment as a private preview. SDAccel empowers hardware designers to easily deploy their RTL designs in the AWS F1 FPGA instance. It also automates the acceleration of code written in C, C++ or OpenCL by building application-specific accelerators on the F1. This limited time preview is hosted in a private GitHub repo and supported through an AWS SDAccel forum. To request early access, click here.
Last September at the GNU Radio Conference in Boulder, Colorado, Ettus Research announced the RFNoC & Vivado Challenge for SDR (software-defined radio). Ettus’ RFNoC (RF Network on Chip) is designed to allow you to efficiently harness the latest-generation FPGAs for SDR applications without being an expert firmware or FPGA developer. Today, Ettus Research and Xilinx announced the three challenge winners.
Ettus’ GUI-based RFNoC design tool allows you to create FPGA applications as easily as you can create GNU Radio flowgraphs. This includes the ability to seamlessly transfer data between your host PC and an FPGA. It dramatically eases the task of FPGA off-loading in SDR applications. Ettus’ RFNoC is built upon Xilinx’s Vivado HLS.
Here are the three winning teams and their projects:
It will take you five or ten minutes to read the new SDSoC article written by Nick Ni and Adam Taylor titled “Developing all programmable logic using the SDSoC environment” and after you’re done, you’ll have a very good idea of why you might want to try Xilinx’s new SDSoC Development Environment for C, C++, and OpenCL application development. The article quickly guides you through a typical software/hardware development cycle and then gives you the performance results for an AES cryptography application.
The resulting performance chart showing as much as a 75% reduction in clock cycles for the application alone should be enough to pique your interest:
Got a problem getting enough performance out of your processor-based embedded system? You might want to watch a 14-minute video that does a nice job of explaining how you can develop hardware accelerators directly from your C/C++ code using the Xilinx SDK.
How much acceleration do you need? If you don’t know for sure, the video gives an example of an autonomous drone with vision and control tasks that need real-time acceleration.
What are your alternatives? If you need to accelerate your code, you can:
Increase your processor’s clock speed, likely requiring a faster speed grade
Add more processor cores to share the load
Switch to a higher-end, code-compatible processor
Unfortunately, each of these three alternatives increases power consumption. There’s another alternative, however, that can actually cut power consumption. That alternative’s based on the use of Xilinx All Programmable Zynq SoCs and Zynq UltraScale+ MPSoCs. By moving critical code into custom hardware accelerators implemented in the programmable logic incorporated into all Zynq family members, you can relieve the processor of the associated processing burden and actually slow the processor’s clock speed, thus reducing power. It’s quite possible to cut overall power consumption using this approach.
Ah, but implementing these accelerators. Aye, there’s the rub!
It turns out that implementation of these hardware accelerators might not be as difficult as you imagine. The Xilinx SDK is already a C/C++ development environment based on familiar IDE and compiler technology. Under the hood, the SDK serves as a single cockpit for all Zynq-based development work—software and hardware. It also includes SDSoC, the piece of the puzzle you need to convert C/C++ code into acceleration hardware using a 3-step process:
Code profiling to identify time-consuming tasks that are critical to real-time operation
Software/hardware partitioning based on the profiling data
Software/hardware compilation based on the system partitioning
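The first two steps amount to a simple estimate: find where the time goes, then ask what happens if the hottest task moves into hardware. The Python sketch below illustrates that reasoning only. The task names, the per-frame times, and the 10x accelerator factor are made-up numbers, and the model deliberately ignores data-transfer overhead between processor and programmable logic.

```python
def estimate_after_offload(profile_ms, accel_factors):
    """Estimate total time once selected tasks run in hardware.

    profile_ms: task name -> measured software time (msec).
    accel_factors: task name -> assumed hardware speedup for offloaded tasks;
    tasks not listed stay in software with a factor of 1.
    """
    return sum(time / accel_factors.get(task, 1.0)
               for task, time in profile_ms.items())


# Step 1: (hypothetical) profiling data for one processing cycle.
profile_ms = {"vision": 60.0, "control": 25.0, "logging": 15.0}

# Step 2: partition by picking the most time-consuming task for offload.
hottest = max(profile_ms, key=profile_ms.get)                 # -> "vision"

before = sum(profile_ms.values())                             # 100.0
after = estimate_after_offload(profile_ms, {hottest: 10.0})   # 6 + 25 + 15 = 46.0
```

Even a 10x accelerator yields only about a 2.2x overall gain here because the tasks left in software now dominate, which is why the profiling step comes first.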
One development platform, SDK, serves all members of the Zynq SoC and Zynq UltraScale+ MPSoC device families, giving you a huge price/performance range.
The latest “Powered by Xilinx” video, published today, provides more detail about the Perrone Robotics MAX development platform for developing all types of autonomous robots—including self-driving cars. MAX is a set of software building blocks for handling many types of sensors and controls needed to develop such robotic platforms.
Perrone Robotics has MAX running on the Xilinx Zynq UltraScale+ MPSoC and relies on that heterogeneous All Programmable device to handle the multiple, high-bit-rate data streams from complex sensor arrays that include lidar systems and multiple video cameras.
Perrone is also starting to develop with the new Xilinx reVISION stack and plans to both enhance the performance of existing algorithms and develop new ones for its MAX development platform.
In a free Webinar taking place on July 12, Xilinx experts will present a new design approach that unleashes the immense processing power of FPGAs using the Xilinx reVISION stack including hardware-tuned OpenCV libraries, a familiar C/C++ development environment, and readily available hardware-development platforms to develop advanced vision applications based on complex, accelerated vision-processing algorithms such as dense optical flow. Even though the algorithms are advanced, power consumption is held to just a few watts thanks to Xilinx’s All Programmable silicon.
Xilinx announced the addition of the P4_16 network programming language for SDN applications to its SDNet Development Environment for high-speed (1Gbps to 100Gbps) packet processing back in May. (See “The P4 has landed: SDNet 2017.1 gets P4-to-FPGA compilation capability for 100Gbps data-plane packet processing.”) An OFC 2017 panel session in March (presented by Xilinx, Barefoot Networks, Netcope Technologies, and MoSys) discussed the adoption of P4, the emergent high-level language for packet processing, and early implementations of P4 for FPGA and ASIC targets. Here’s a half-hour video of that panel discussion.
Cloud computing and application acceleration for a variety of workloads including big-data analytics, machine learning, video and image processing, and genomics are big data-center topics and if you’re one of those people looking for acceleration guidance, read on. If you’re looking to accelerate compute-intensive applications such as automated driving and ADAS or local video processing and sensor fusion, this blog post’s for you too. The basic problem here is that CPUs are too slow and they burn too much power. You may have one or both of these challenges. If so, you may be considering a GPU or an FPGA as an accelerator in your design.
How to choose?
Although GPUs started as graphics accelerators, primarily for gamers, a few architectural tweaks and a ton of software have made them suitable as general-purpose compute accelerators. With the right software tools, it’s not too difficult to recode and recompile a program to run on a GPU instead of a CPU. With some experience, you’ll find that GPUs are not great for every application workload. Certain computations such as sparse matrix math don’t map onto GPUs well. One big issue with GPUs is power consumption. GPUs aimed at server acceleration in a data-center environment may burn hundreds of watts.
With FPGAs, you can build any sort of compute engine you want with excellent performance/power numbers. You can optimize an FPGA-based accelerator for one task, run that task, and then reconfigure the FPGA if needed for an entirely different application. The amount of computing power you can bring to bear on a problem is scary big. A Virtex UltraScale+ VU13P FPGA can deliver 38.3 INT8 TOPS (that’s tera operations per second) and if you can binarize the application, which is possible with some neural networks, you can hit 500TOPS. That’s why you now see big data-center operators like Baidu and Amazon putting Xilinx-based FPGA accelerator cards into their server farms. That’s also why you see Xilinx offering high-level acceleration programming tools like SDAccel to help you develop compute accelerators using Xilinx All Programmable devices.
There’s considerable 5G experimentation taking place as the radio standards have not yet gelled and researchers are looking to optimize every aspect. SDRs (software-defined radios) are excellent experimental tools for such research—NI’s (National Instruments’) SDR products especially so because, as the Wireless Communication Research Laboratory at Istanbul Technical University discovered:
“NI SDR products helped us achieve our project goals faster and with fewer complexities due to reusability, existing examples, and the mature community. We had access to documentation around the examples, ready-to-run conceptual examples, and courseware and lab materials around the grounding wireless communication topics through the NI ecosystem. We took advantage of the graphical nature of LabVIEW to combine existing blocks of algorithms more easily compared to text-based options.”
Researchers at the Wireless Communication Research Laboratory were experimenting with UFMC (universal filtered multicarrier) modulation, a leading modulation candidate technique for 5G communications. Although current communication standards frequently use OFDM (orthogonal frequency-division multiplexing), it is not considered to be a suitable modulation technique for 5G systems due to its tight synchronization requirements, inefficient spectral properties (such as high spectral side-lobe levels), and cyclic prefix (CP) overhead. UFMC has relatively relaxed synchronization requirements.
Although humans once served as the final inspectors for pcbs, today’s component dimensions and manufacturing volumes mandate the use of camera-based automated optical inspection (AOI) systems. Amfax has developed a 3D AOI system—the a3Di—that uses two lasers to make millions of 3D measurements with better than 3μm accuracy. One of the company’s customers uses an a3Di system to inspect 18,000 assembled pcbs per day.
Up and downstream SMEMA (Surface Mount Equipment Manufacturers Association) conveyor control
Manual operator controls for PCB width
System emergency stop
The system provides height-graded images like this:
3D Image of a3Di’s Measurement Data: Colors represent height, with Z resolution down to less than a micron. The blue section at the top indicates signs of board warp. Laser etched component information appears on some of the ICs.
The a3Di system then compares this image against a stored golden reference image to detect manufacturing defects.
Amfax says that it has found the CompactRIO system to be “dependable, reliable, and cost-effective.” In addition, the company found it could get far better timing resolution with the CompactRIO system than the 1msec resolution usually provided by PLCs.
This project was a 2017 NI Engineering Impact Award Finalist in the Electronics and Semiconductor category last month at NI Week. It is documented in this NI case study.
Hyundai Heavy Industries (HHI) is the world’s foremost shipbuilding company and the company’s Engine and Machinery Division (HHI-EMD) is the world’s largest marine diesel engine builder. HHI’s HiMSEN medium-sized engines are four-stroke diesels with output power ranging from 960kW to 25MW. These engines power electric generators on large ships and serve as the propulsion engine on medium and small ships. HHI-EMD is always developing newer, more fuel-efficient engines because the fuel costs for these large diesels run about $2000/hour. Better fuel efficiency will significantly reduce operating costs and emissions.
For that research, HHI-EMD developed monitoring and diagnostic equipment to better understand engine combustion performance and an HIL (hardware-in-the-loop) system to test new engine controller designs. The test and HIL systems are based on equipment from National Instruments (NI).
Engine instrumentation must be able to monitor 10-cylinder engines running at thousands of RPM while measuring crankshaft angle to 0.1 degree of resolution. From that information, the engine test and monitoring system calculates in-cylinder peak pressure, mean effective pressure, and cycle-to-cycle pressure variation. All this must happen every 10μsec for each cylinder.
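One way to see where a time budget of that order comes from is to work out how long the crankshaft takes to sweep one 0.1-degree step. The sketch below assumes the update rate is tied to crank-angle steps, which the case study summary does not state explicitly. The engine speeds are illustrative values, not HHI-EMD figures.

```python
def seconds_per_step(rpm, resolution_deg=0.1):
    """Time the crankshaft takes to rotate through one angular step."""
    steps_per_rev = 360.0 / resolution_deg
    revs_per_sec = rpm / 60.0
    return 1.0 / (steps_per_rev * revs_per_sec)


# Illustrative engine speeds (assumed)
for rpm in (720, 1000, 1667):
    micros = seconds_per_step(rpm) * 1e6
    print(f"{rpm} RPM -> {micros:.1f} usec per 0.1 degree")
```

Under that assumption, a budget near 10μsec corresponds to a shaft speed in the neighborhood of 1700 RPM. That is far too fast for a software loop with millisecond-class jitter, so the per-step processing lands in the FPGA.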
HHI-EMD elected to use an NI cRIO-9035 Controller, which incorporates a Xilinx Kintex-7 70T FPGA, to serve as the platform for developing its HiCAS test and data-acquisition system. The HiCAS system monitors all aspects of the engine under test including engine speed, in-cylinder pressure, and pressures in the intake and exhaust systems. This data helped HHI-EMD engineers analyze the engine’s overall performance and the performance of key parts using thermodynamic analysis. HiCAS provides real-time analysis of dynamic data including:
In-cylinder peak pressure
Indicated mean effective pressure and cycle-to-cycle variation
Cyclic moving parts fault diagnosis
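As a rough illustration of the second item in that list, indicated mean effective pressure (IMEP) is the net work of one pressure-volume cycle divided by the displaced volume, and cycle-to-cycle variation is commonly reported as the coefficient of variation of IMEP. The Python sketch below uses trapezoidal integration and a population standard deviation. Both choices are assumptions, because the source does not describe how HiCAS computes these values, and the sample numbers are made up for the example.

```python
from statistics import mean, pstdev


def imep(pressure, volume, displaced_volume):
    """Net cycle work (closed-loop integral of p dV) divided by displaced volume.

    pressure and volume are same-length samples taken around one full cycle.
    """
    work = 0.0
    n = len(pressure)
    for i in range(n):
        j = (i + 1) % n   # wrap around to close the cycle
        work += 0.5 * (pressure[i] + pressure[j]) * (volume[j] - volume[i])
    return work / displaced_volume


def cov_percent(values):
    """Coefficient of variation, in percent."""
    return 100.0 * pstdev(values) / mean(values)


# Idealized rectangular p-V loop: expand at 20 bar, compress at 5 bar (made-up data)
p = [20.0, 20.0, 5.0, 5.0]   # bar
v = [1.0, 2.0, 2.0, 1.0]     # liters
print(imep(p, v, displaced_volume=1.0))   # bar, since bar * L / L = bar
print(cov_percent([15.0, 15.3, 14.7]))    # spread of IMEP over three cycles
```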
Using the collected data, the engineering team then developed a model of the diesel engine, resulting in the development of an HIL system used to exercise the engine controllers. This engine model runs in real time on an NI PXI system synchronized with the high-speed signal-sensor simulation software running on the PXI system’s multifunction FPGA-based FlexRIO module. The HIL system transmits signals to the engine controllers, simulating an operating engine and eliminating the operating costs of a large diesel engine during these tests. HHI-EMD credits the FPGAs in these systems for making the calculations run fast enough for real-time simulation. The simulated engine also permits fault testing without the risk of damaging an actual engine. Of course, all of this is programmed using NI’s LabVIEW systems engineering software and LabVIEW FPGA.
HHI-EMD HIL Simulator for Marine Diesel Engines
According to HHI-EMD, development of the HiCAS engine-monitoring system and virtual verification based on the HIL system shortened development time from more than three years to one, significantly accelerating the time-to-market for HHI-EMD’s more eco-friendly marine diesel engines.
This project was a 2017 NI Engineering Impact Award Finalist in the Transportation and Heavy Equipment category last month at NI Week and won the 2017 HPE Edgeline Big Analog Data Award. It is documented in this NI case study.
Chang Guang Satellite Technology, China’s first commercial remote sensing satellite company, develops and operates the JILIN-1 high-resolution remote-sensing satellite series, which has pioneered the application of commercial satellites in China. The company contemplates putting 60 satellites in orbit by 2020 and 138 satellites in orbit by 2030. Achieving that goal is going to take a lot of testing and testing consumes about 70% of the development cycle for space-based systems. So Chang Guang Satellite Technology knew it would need to automate its test systems and turned to National Instruments (NI) for assistance. The resulting automated test system has three core test systems using products from NI:
An S-band ground monitoring system, based mainly on NI’s 1st-generation PXI RF Vector Signal Transceiver (VST) and an NI FlexRIO module
An on-orbit satellite dynamic model hardware-in-the-loop (HIL) system with a sub-1msec closed-loop period, 100x better than a conventional test system design
A GPS simulator using an FPGA-based FlexRIO module to simulate high-dynamic, in-orbit satellite navigation signals
A Chang Guang Satellite Technology test system based on NI’s 1st-generation VST and FlexRIO PXIe modules
Here’s a sample image from the company’s growing satellite imaging portfolio:
Shanghai Disneyland as viewed from space
NI’s VSTs and FlexRIO modules are all based on multiple generations of Xilinx FPGAs. The company’s 2nd-generation VSTs are based on Virtex-7 FPGAs and its latest FlexRIO modules are based on Kintex-7 FPGAs.
This project was a 2017 NI Engineering Impact Award Finalist in the Aerospace and Defense category last month at NI Week. It is documented in this NI case study.
Avnet has formally introduced its MiniZed dev board based on the Xilinx Zynq Z-7000S SoC with the low, low price of just $89. For this, you get a Zynq Z-7007S SoC with one ARM Cortex-A9 processor core, 512Mbytes of DDR3L SDRAM, 128Mbits of QSPI Flash, 8Gbytes of eMMC Flash memory, WiFi 802.11 b/g/n, and Bluetooth 4.1. The MiniZed board incorporates an Arduino-compatible shield interface, two Pmod connectors, and a USB 2.0 host interface for fast peripheral expansion. You’ll also find an ST Microelectronics LIS2DS12 Motion and temperature sensor and an MP34DT05 Digital Microphone on the board. This is a low-cost dev board that packs the punch of a fast ARM Cortex-A9 processor, programmable logic, a dual-wireless communications system, and easy system expandability.
I find the software that accompanies the board equally interesting. According to the MiniZed Product Brief, the $89 price includes a voucher for an SDSoC license so you can program the programmable logic on the Zynq SoC using C or C++ in addition to Verilog or VHDL using Vivado. This is a terrific deal on a Zynq dev board, whether you’re a novice or an experienced Xilinx user.
Avnet’s announcement says that the board will start shipping in early July.
Stefan Rousseau, senior technical marketing engineer for Avnet, said, “Whether customers are developing a Linux-based system or have a simple bare metal implementation, with MiniZed, Zynq-7000 development has never been easier. Designers need only connect to their laptops with a single micro-USB cable and they are up and running. And with Bluetooth or Wi-Fi, users can also connect wirelessly, transforming a mobile phone or tablet into an on-the-go GUI.”
Here’s a photo of the MiniZed Dev board:
Avnet’s $89 MiniZed Dev Board based on a Xilinx Zynq Z-7007S SoC
Many engineers in Canada wear the Iron Ring on their finger, presented to engineering graduates as a symbolic, daily reminder that they have an obligation not to design structures or other artifacts that fail catastrophically. (Legend has it that the iron in the ring comes from the first Quebec Bridge—which collapsed during its construction in 1907—but the legend appears to be untrue.) All engineers, whether wearing the Canadian Iron Ring or not, feel an obligation to develop products that do not fail dangerously. For buildings and other civil engineering works, that usually means designing structures with healthy design margins even for worst-case projected loading. However, many structures encounter worst-case loads infrequently or never. For example, a sports stadium experiences maximum loading for perhaps 20 or 30 days per year, for only a few hours at a time when it fills with sports fans. The rest of the time, the building is empty and the materials used to ensure that the structure can handle those loads are not needed to maintain structural integrity.
The total energy consumed by a structure over its lifetime is a combination of the energy needed to mine and fabricate the building materials and to build the structure (embodied energy) and the energy needed to operate the building (operational energy). The resulting energy curve looks something like this:
For completely passive structures, which describes most structures built over the past several thousand years, embodied energy dominates the total consumed energy because structural members must be designed to bear the full design load at all times. Alternatively, a smart structure with actuators that stiffen the structure only when needed will require more operational energy but the total required embodied energy will be smaller. Looking at the above conceptual graph, a well-designed active-passive system minimizes the total required energy for the structure.
Active control has already been used in structure design, most widely for vibration control. During his doctoral work, Gennaro Senatore formulated a new methodology to design adaptive structures. His research project was a collaboration between University College London and Expedition Engineering. As part of that project, Senatore built a large-scale prototype of an active-passive structure at the University College London structures laboratory. The resulting prototype is a 6m cantilever spatial truss with a 37.5:1 span-to-depth ratio. Here’s a photo of the large-scale prototype truss:
You can see the actuators just beneath the top surface of the truss. When the actuators are not energized, the cantilever truss flexes quite a lot with a load placed at the extreme end. However, this active system detects the load-induced flexion and compensates by energizing the actuators and stiffening the cantilever.
Here’s a photo showing the amount of flex induced by a 100kg load at the end of the cantilever without and with energized actuators:
The top half of the image shows that the truss flexes 170mm under load when the actuators are not energized, but only 2mm when the system senses the load and energizes the linear actuators.
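Those two deflection figures translate directly into an apparent tip stiffness, simply force divided by deflection. The short Python calculation below is only a back-of-the-envelope view. It treats the truss as a linear spring at its tip, which is a simplifying assumption, and the function name is made up for the example.

```python
G = 9.81  # m/s^2, standard gravity (rounded)


def tip_stiffness(load_kg, deflection_mm):
    """Apparent tip stiffness in N/m: applied force divided by tip deflection."""
    return load_kg * G / (deflection_mm / 1000.0)


passive = tip_stiffness(100, 170)   # actuators not energized
active = tip_stiffness(100, 2)      # actuators energized
ratio = active / passive            # same as 170 / 2

print(f"passive: {passive / 1000:.1f} kN/m")
print(f"active:  {active / 1000:.1f} kN/m")
print(f"ratio:   {ratio:.0f}x")
```

Seen this way, energizing the actuators makes the cantilever behave as if it were about 85 times stiffer at the tip, without adding the material a passive structure would need to reach that figure.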
The truss incorporates ten linear electric actuators that stiffen the truss when sensors detect a load-induced deflection. The control system for this active-passive truss consists of a National Instruments (NI) CompactRIO cRIO-9024 controller, 45 strain-gage sensors, 10 actuators, and five driver boards (one for each actuator pair). The NI cRIO-9024 controller pairs with a card cage that accepts I/O modules and incorporates a Virtex-5 FPGA for reconfigurable I/O. (That’s what the “RIO” in cRIO stands for.) In this application, the integral Virtex-5 FPGA also provides in-line processing for acquired and generated signals.
A large structure would require many such subsystems, all communicating through a network. This is clearly one very useful way to employ the IIoT in structures.
This project was a 2017 NI Engineering Impact Award Finalist in the Industrial Machinery and Control category last month at NI Week. It is documented in this NI case study, which includes many more technical details and a short video showing the truss in action as a load is applied.
National Instruments (NI) has just announced a baseband version of its 2nd-Generation PXIe VST (Vector Signal Transceiver), the PXIe-5820, with 1GHz of complex I/Q bandwidth. It’s designed to address the most challenging RF front-end module and transceiver test applications. Of course, you program it with NI’s LabVIEW systems engineering software like all NI instruments. Like its RF sibling the PXIe-5840, the PXIe-5820 baseband VST is based on a Xilinx Virtex-7 690T FPGA, and a chunk of the FPGA’s programmable logic is available to users for creating real-time, application-specific signal processing using LabVIEW FPGA. According to Ruan Lourens, NI’s Chief Architect of RF R&D, “The baseband VST can be tightly synchronized with the PXIe-5840 RF VST to sub-nanosecond accuracy, to offer a complete solution for RF and baseband differential I/Q testing of wireless chipsets.”
NI’s new PXIe-5820 Baseband VST
How might you use this feature? Here’s a very recent, 2-minute video demonstration of a DPD (digital predistortion) measurement application that provides a pretty good example:
MathWorks has just published a 30-minute video titled “FPGA for DSP applications: Fixed Point Made Easy.” The video targets users of the company’s MATLAB and Simulink software tools and covers fixed-point number systems, how these numbers are represented in MATLAB and in FPGAs, quantization and quantization challenges, sources of error and minimizing these errors, how to use MathWorks’ design tools to understand these concepts, implementation of fixed-point DSP algorithms on FPGAs using MathWorks’ tools, and the advantages of the Xilinx DSP48 block—which you’ll find in all Xilinx 28nm series 7, 20nm UltraScale, and 16nm UltraScale+ devices including Zynq SoCs and Zynq UltraScale+ MPSoCs.
The video also shows the development of an FIR filter using MathWorks’ fixed-point tools as an example with some useful utilization feedback that helps you optimize your design. The video also briefly shows how you can use MathWorks’ HDL Coder tool to develop efficient, single-precision, floating-point DSP hardware for Xilinx FPGAs.
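For readers who want to try the fixed-point ideas before watching, here is a minimal Python sketch of quantizing values to a signed fixed-point format and running an FIR filter entirely on integers. It is not taken from the MathWorks video. The Q1.15 format, the function names, and the 3-tap coefficients are all assumptions chosen for illustration.

```python
def to_fixed(x, frac_bits, word_bits=16):
    """Quantize a real number to a signed fixed-point integer, saturating on overflow.

    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    scaled = round(x * (1 << frac_bits))
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, scaled))


def to_float(q, frac_bits):
    """Convert a fixed-point integer back to a real number."""
    return q / (1 << frac_bits)


def fir_fixed(samples_q, coeffs_q, frac_bits):
    """Direct-form FIR on fixed-point integers with one rescale per output."""
    out = []
    for n in range(len(samples_q)):
        acc = 0   # wide accumulator: products carry 2 * frac_bits fractional bits
        for k, c in enumerate(coeffs_q):
            if n - k >= 0:
                acc += c * samples_q[n - k]
        out.append(acc >> frac_bits)   # drop extra fractional bits (rounds toward -inf)
    return out


FRAC = 15   # Q1.15, an assumed format
coeffs = [to_fixed(c, FRAC) for c in (0.25, 0.5, 0.25)]
samples = [to_fixed(0.5, FRAC)] * 4
filtered = fir_fixed(samples, coeffs, FRAC)
print([to_float(y, FRAC) for y in filtered])
print(to_float(to_fixed(0.1, FRAC), FRAC) - 0.1)   # quantization error for 0.1
```

The wide accumulator in the inner loop mirrors the job a DSP48 block's accumulator does in hardware: products are summed at full precision and scaled back only once, which keeps rounding error from compounding tap by tap.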
When someone asks where Xilinx All Programmable devices are used, I find it a hard question to answer because there’s such a very wide range of applications—as demonstrated by the thousands of Xcell Daily blog posts I’ve written over the past several years.
Now, there’s a 5-minute “Powered by Xilinx” video with clips from several companies using Xilinx devices for applications including:
Machine learning for manufacturing
Autonomous cars, drones, and robots
Real-time 4K, UHD, and 8K video and image processing
VR and AR
High-speed networking by RF, LED-based free-air optics, and fiber
With LED automotive lighting now becoming commonplace, newer automobiles have the ability to communicate with each other (V2V communications) and with roadside infrastructure by quickly flashing their lights (LiFi) instead of using radio protocols. Researchers at OKATEM—the Centre of Excellence in Optical Wireless Communication Technologies at Ozyegin University in Turkey—have developed an OFDM-based LiFi demonstrator for V2V (vehicle-to-vehicle) and V2I (vehicle-to-infrastructure) applications that has achieved 50Mbps communications between vehicles as far apart as 70m in a lab atmospheric emulator.
Inside the OKATEM LiFi Atmospheric Emulator
The demo system is based on PXIe equipment from National Instruments (NI) including FlexRIO FPGA modules. (NI’s PXIe FlexRIO modules are based on Xilinx Virtex-5 and Virtex-7 FPGAs.) The FlexRIO modules implement the LiFi OFDM protocols including channel coding, 4-QAM modulation, and an N-IFFT. Here’s a diagram of the setup:
Researchers developed the LiFi system using NI’s LabVIEW systems engineering software. Initial LiFi system performance demonstrated a data rate of 50Mbps with as much as 70m between two cars, depending on the photodetectors’ location in the car (particularly their height above ground level). Further work will try to improve total system performance by integrating advanced capabilities such as multiple-input, multiple-output (MIMO) communication and link adaptation on top of the OFDM architecture.
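The 4-QAM mapping and inverse FFT that the FlexRIO modules implement can be sketched in a few lines of Python. This is a toy model under stated assumptions: the Gray-coded bit-to-symbol table is one common choice, a plain O(N²) inverse DFT stands in for the N-IFFT, and channel coding is omitted. None of these details come from the OKATEM design.

```python
import cmath

# Assumed Gray-coded 4-QAM table: neighboring points differ in exactly one bit
QAM4 = {(0, 0): 1 + 1j, (0, 1): -1 + 1j, (1, 1): -1 - 1j, (1, 0): 1 - 1j}


def qam4_map(bits):
    """Map consecutive bit pairs to 4-QAM constellation points."""
    return [QAM4[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def idft(freq):
    """Inverse DFT: subcarrier symbols to time-domain samples (transmit side)."""
    n = len(freq)
    return [sum(freq[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def dft(time):
    """Forward DFT: time-domain samples back to subcarrier symbols (receive side)."""
    n = len(time)
    return [sum(time[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


bits = [0, 0, 0, 1, 1, 1, 1, 0]
subcarriers = qam4_map(bits)     # one 4-QAM symbol per subcarrier
tx_samples = idft(subcarriers)   # one OFDM symbol in the time domain
recovered = dft(tx_samples)      # an ideal, noiseless receiver

assert all(abs(r - s) < 1e-9 for r, s in zip(recovered, subcarriers))
```

An LED can only be driven with a real, non-negative signal, so optical OFDM links typically add steps such as Hermitian-symmetric subcarrier loading and a DC bias. The source does not say how the OKATEM system handles this, and the sketch leaves it out.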
This project was a 2017 NI Engineering Impact Award Winner in the RF and Mobile Communications category last month at NI Week. It is documented in this NI case study.