Exactly a week ago, Xilinx introduced the Zynq UltraScale+ RFSoC family, which is a new series of Zynq UltraScale+ MPSoCs with RF ADCs and DACs and SD-FECs added. (See “Zynq UltraScale+ RFSoC: All the processing power of 64- and 32-bit ARM cores, programmable logic plus RF ADCs, DACs.”) This past Friday at the Xilinx Showcase held in Longmont, Colorado, Senior Marketing Engineer Lee Hansen demonstrated a Zynq UltraScale+ ZU28DR RFSoC with eight 12-bit, 4Gsamples/sec RF ADCs, eight 14-bit, 6.4Gsamples/sec RF DACs, and eight SD-FECs connected through an appropriate interface to National Instruments’ LabVIEW Systems Engineering Development Environment.
The demo system was generating signals using the RF DACs, receiving the signals using the RF ADCs, and then displaying the resulting signal spectrum using LabVIEW.
Here’s a 3-minute video of the demo:
Over the past weekend, Xilinx held a Showcase and PYNQ Hackathon in its Summit Retreat Center in Longmont, Colorado. About 100 people from tech companies all over Colorado attended the Showcase and twelve teams—about 40 people including students from local universities and engineers from industry—competed in the Hackathon.
Here are a few images from the Xilinx Showcase and PYNQ Hackathon:
Xilinx Summit Retreat Center in Longmont, Colorado
Xilinx CTO Ivo Bolsens welcomes everyone to the Xilinx Showcase
Xilinx Showcase Attendees
More Xilinx Showcase Attendees
Xilinx VP of Interactive Design Tools and Xilinx Longmont Site Manager Dan Gibbons Welcomes the Hackers to the PYNQ Hackathon
Abo’s Pizza (Boulder’s Finest) for Friday Night Dinner
Handing out Digilent PYNQ-Z1 boards
The PYNQ Hackathon Work Begins
Giving a little help to the participants
Focus, Focus, Focus
A Mini Lecture about the PYNQ Logictools Overlay for Hackathon Attendees
Getting a little shuteye
A view of Longs Peak from the Xilinx Longmont Summit Retreat Center
The event was organized by an internal Xilinx team including:
In addition, a team of helpers was on hand over the 30-hour duration of the event to answer questions:
For more information about the PYNQ Hackathon, see “12 PYNQ Hackathon teams competed for 30 hours, inventing remote-controlled robots, image recognizers, and an air keyboard.”
For more information about the Python-based, open-source PYNQ development environment and the Zynq-based Digilent PYNQ-Z1 dev board, see “Python + Zynq = PYNQ, which runs on Digilent’s new $229 pink PYNQ-Z1 Python Productivity Package.”
Twelve student and industry teams competed for 30 straight hours in the Xilinx Hackathon 2017 competition over the weekend at the Summit Retreat Center in the Xilinx corporate facility located in Longmont, Colorado. Each team member received a Digilent PYNQ-Z1 dev board, which is based on a Xilinx Zynq Z-7020 SoC, and then used their fertile imaginations to conceive of and develop working code for an application using the open-source, Python-based PYNQ development environment, which is based on self-documenting Jupyter Notebooks. The online electronics and maker retailer SparkFun, located just down the street from the Xilinx facility in Longmont, supplied boxes of compatible peripheral boards with sensors and motor controllers to spur the team members’ imaginations. Several of the teams came from local universities including the University of Colorado at Boulder and the Colorado School of Mines in Golden, Colorado. At the end of the competition, eleven of the teams presented their results using their Jupyter Notebooks. Then came the prizes.
For the most part, team members had never used the PYNQ-Z1 boards and were not familiar with using programmable logic. In part, that was the intent of the Hackathon—to connect teams of inexperienced developers with appropriate programming tools and see what develops. That’s also the reason that Xilinx developed PYNQ: so that software developers and students could take advantage of the improved embedded performance made possible by the Zynq SoC’s programmable hardware without having to use ASIC-style (HDL) design tools to design hardware (unless they want to do so, of course).
Here are the projects developed by the teams, in the order presented during the final hour of the Hackathon (links go straight to the teams’ Github repositories with their Jupyter notebooks that document the projects with explanations and “working” code):
Team John Cena’s Voice-Controlled Mobile Robot
Team “Joy of Pink” developed an emoji generator based on facial interpretation on Microsoft’s cloud-based Azure Emotion API
Team Caffeine’s Audio Fiend Tone-Based Robotic Controller
After the presentations, the judges deliberated for a few minutes using multiple predefined criteria and then awarded the following prizes:
Congratulations to the winners and to all of the teams who spent 30 hours with each other in a large room in Colorado to experience the joy of hacking code to tackle some tough problems. (A follow-up blog will include a photographic record of the event so that you can see what it was like.)
For more information about the PYNQ development environment and the Digilent PYNQ-Z1 board, see “Python + Zynq = PYNQ, which runs on Digilent’s new $229 pink PYNQ-Z1 Python Productivity Package.”
Late last month, I wrote about an announcement by DNAnexus and Edico Genome that described a huge reduction in the cost and time to analyze genomic information, enabled by Amazon’s FPGA-accelerated AWS EC2 F1 instance. (See “Edico Genome and DNAnexus announce $20, 90-minute genome analysis on Amazon’s FPGA-accelerated AWS EC2 F1 instance.”) The AWS Partner Network blog has just published more details in an article written by Amazon’s Aaron Friedman, titled “How DNAnexus and Edico Genome are Powering Precision Medicine on Amazon Web Services (AWS).”
The details are exciting to say the least. The article begins with this statement:
“Diagnosing the medical mysteries behind acutely ill babies can be a race against time, filled with a barrage of tests and misdiagnoses. During the first few days of life, a few hours can save or seal the fate of patients admitted to the neonatal intensive care units (NICUs) and pediatric intensive care units (PICUs). Accelerating the analysis of the medical assays conducted in these hospitals can improve patient outcomes, and, in some cases, save lives.”
Then, if you read far enough into the post, you find this statement:
“Rady Children’s Institute for Genomic Medicine is one of the global leaders in advancing precision medicine. To date, the institute has sequenced the genomes of more than 3,000 children and their family members to diagnose genetic diseases. 40% of these patients are diagnosed with a genetic disease, and 80% of these receive a change in medical management. This is a remarkable rate of change in care, considering that these are rare diseases and often involve genomic variants that have not been previously observed in other individuals.”
This example is merely a road sign, pointing the way to even more exciting developments in FPGA-accelerated, cloud-based computing to come. Well-known Silicon Valley venture capitalist Jim Hogan directly addressed these developments in a speech at San Jose State University just a couple of weeks ago. (See “Four free training videos (two hours’ worth) on using Xilinx SDAccel to create apps for Amazon AWS EC2 F1 instances.”)
The Amazon AWS EC2 F1 instance is a cloud service that’s based on multiple Xilinx Virtex UltraScale+ VU9P FPGAs installed in Amazon’s Web servers. For more information on the AWS EC2 F1 Instance in Xcell Daily, see:
By Adam Taylor
The Xilinx Zynq UltraScale+ MPSoC is good for many applications, including embedded vision. Its APU with two or four 64-bit ARM Cortex-A53 processors, its Mali GPU, DisplayPort interface, and on-chip programmable logic (PL) give the Zynq UltraScale+ MPSoC plenty of processing power to address exciting applications such as ADAS and vision-guided robotics with relative ease. Further, we can use the device’s PL and its programmable I/O to interface with a range of vision and video standards including MIPI, LVDS, parallel, VoSPI, etc. When it comes to interfacing image sensors, the Zynq UltraScale+ MPSoC can handle just about anything you throw at it.
Once we’ve brought the image into the Zynq UltraScale+ MPSoC’s PL, we can implement an image-processing pipeline using existing IP cores from the Xilinx library or we can develop our own custom IP cores using Vivado HLS (high-level synthesis). However, for many applications we’ll need to move the images into the device’s PS (processing system) domain before we can apply exciting application-level algorithms such as decision making or use the Xilinx reVISION acceleration stack.
I thought I would kick off the fourth year of this blog with a look at how we can use VDMA instantiated in the Zynq UltraScale+ MPSoC’s PL to transfer images from the PL to the PS-attached DDR memory without processor intervention. You often need to make such high-speed background transfers in a variety of applications.
To do this we will use the following IP blocks:
Once configured over its AXI Lite interface, the Test Pattern Generator outputs test patterns which are then transferred into the PS-attached DDR memory. We can demonstrate that this has been successful by examining the memory locations using SDK.
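If you do examine the frame store with SDK, it helps to know how the VDMA lays out pixels in DDR memory: frames are stored row by row, with a stride (bytes per row) that may exceed the active line width. Here’s a quick Python sketch of the addressing; the base address and stride values are hypothetical, chosen purely for illustration:

```python
# A VDMA frame store is a contiguous region of DDR laid out row by
# row. The stride (bytes per row) can be larger than the active
# width to keep rows aligned. Addresses below are illustrative only.
def pixel_address(base, x, y, stride_bytes, bytes_per_pixel):
    """Byte address of pixel (x, y) in a VDMA frame buffer."""
    return base + y * stride_bytes + x * bytes_per_pixel

FRAME_BASE = 0x1000_0000          # hypothetical frame-store base address
STRIDE     = 1280 * 4             # 1280 pixels per row, 4 bytes per pixel

# Address of pixel (10, 2): base + 2 full rows + 10 pixels
addr = pixel_address(FRAME_BASE, 10, 2, STRIDE, 4)
print(hex(addr))
```

With this layout in mind, the pixel values you see at successive memory locations in SDK map directly back to positions in the test pattern.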
Enabling the FPD Master and Slave Interfaces
For this simple example, we’ll clock both the AXI networks at the same frequency, driven by PL_CLK_0 at 100MHz.
For a deployed system, an image sensor would replace the TPG as the image source and we would need to ensure that the VDMA input-channel clocks (Slave-to-Memory-Map and Memory-Map-to-Slave) were fast enough to support the required pixel and frame rate. For example, a sensor with a resolution of 1280 pixels by 1024 lines running at 60 frames per second would require a clock rate of at least 108MHz. We would need to adjust the clock frequency accordingly.
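You can sanity-check that clock-rate requirement with a quick calculation. The 1688×1066 totals below are the standard VESA SXGA timing values, which include the horizontal and vertical blanking intervals; that overhead is why the active-pixel count alone underestimates the required clock:

```python
# Estimate the minimum pixel clock for a video stream, including
# blanking overhead. The SXGA (1280x1024 @ 60 Hz) totals are the
# standard VESA timing values; treat them as illustrative.
def min_pixel_clock(h_total, v_total, fps):
    """Return the required pixel clock in Hz for the given totals."""
    return h_total * v_total * fps

# Active video only: 1280 * 1024 * 60 -- not enough, because the
# stream must also carry horizontal and vertical blanking.
active_only = min_pixel_clock(1280, 1024, 60)

# VESA SXGA timing: 1688 total pixels per line, 1066 total lines.
with_blanking = min_pixel_clock(1688, 1066, 60)

print(f"Active pixels only: {active_only / 1e6:.1f} MHz")
print(f"With blanking:      {with_blanking / 1e6:.1f} MHz")
```

The second figure works out to approximately 108MHz, matching the clock rate quoted above.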
Block Diagram of the completed design
To aid visibility within this example, I have included three ILA modules, which are connected to the outputs of the Test Pattern Generator, AXI VDMA, and the Slave Memory Interconnect. Adding these modules enables the use of Vivado’s hardware manager to verify that the software has correctly configured the TPG and the VDMA to transfer the images.
With the Vivado design complete and built, creating the application software to configure the TPG and VDMA to generate images and move them from the PL to the PS is very straightforward. We use the AXIVDMA, V_TPG, and Video Common APIs available under the BSP lib source directory to aid in creating the application. The software itself performs the following:
The application will then start generating test frames, transferred from the TPG into the PS DDR memory. I disabled the caches for this example to ensure that the DDR memory is updated.
Examining the ILAs, you will see the TPG generating frames and the VDMA transferring the stream into memory mapped format:
TPG output, TUSER indicates start of frame while TLAST indicates end of line
VDMA Memory Mapped Output to the PS
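The TUSER/TLAST framing convention visible in the ILA captures above is easy to model in a few lines of Python, which makes it easy to remember: TUSER pulses on the first pixel of each frame and TLAST on the last pixel of each line. A toy 4×3 frame stands in for real video here:

```python
# Model of AXI4-Stream video signalling: TUSER marks the start of a
# frame (first pixel only) and TLAST marks the end of each line.
def stream_frame(width, height):
    """Return the list of stream beats for one width x height frame."""
    beats = []
    for y in range(height):
        for x in range(width):
            beats.append({
                "tdata": (y, x),            # pixel payload (placeholder)
                "tuser": x == 0 and y == 0, # start of frame
                "tlast": x == width - 1,    # end of line
            })
    return beats

frame = stream_frame(4, 3)
print(sum(b["tuser"] for b in frame))  # one SOF pulse per frame
print(sum(b["tlast"] for b in frame))  # one EOL pulse per line
```

When the waveforms in the hardware manager don’t match this pattern, it’s usually the first clue that the upstream video source is misconfigured.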
Examining the frame store memory location within the PS DDR memory using SDK demonstrates that the pixel values are present.
Test Pattern Pixel Values within the PS DDR Memory
You can use the same approach in Vivado when creating software for a Zynq Z-7000 SoC instead of a Zynq UltraScale+ MPSoC by enabling the AXI GP master for the AXI Lite bus and the AXI HP slave for the VDMA channel.
Should you be experiencing trouble with your VDMA-based image-processing chain, you might want to read this blog.
The project, as always, is on GitHub.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
First Year E-Book here
First Year Hardback here
Second Year E-Book here
Second Year Hardback here
MathWorks has just published a 4-part mini course that teaches you how to develop vision-processing applications using MATLAB, HDL Coder, and Simulink, then walks you through a practical example targeting a Xilinx Zynq SoC using a lane-detection algorithm in Part 4.
Click here for each of the classes:
PDF Solutions provides yield-improvement technologies and services to the IC-manufacturing industry to lower manufacturing costs, improve profitability, and shorten time to market. One of the company’s newest solutions is the eProbe series of e-beam tools used for inline electrical characterization and process control. These tools combine an SEM (scanning electron microscope) and an optical microscope and have the unique ability to provide real-time image analysis of nanometer-scale features. The eProbe development team selected National Instruments’ (NI’s) LabVIEW to control the eProbe system and brought in JKI—a LabVIEW consulting company, Xilinx Alliance Program member, and NI Silver Alliance Partner—to help develop the system.
PDF Solutions eProbe e-beam tool combines an SEM with an optical microscope
In less than four months, JKI helped PDF Solutions attain a 250MHz pixel-acquisition rate from the prototype eProbe using a combination of NI’s FlexRIO module, based on a Xilinx Kintex-7 FPGA, and NI’s LabVIEW FPGA module. According to the PDF Solutions case study published on the JKI Web site, using NI’s LabVIEW allowed the PDF/JKI team to implement the required, real-time FPGA logic and easily integrate third-party FPGA IP in a fraction of the time required by alternative design platforms while still achieving the project’s image-throughput goals.
LabVIEW controls most of the functions within the eProbe that perform the wafer inspection including:
JKI contributed both to the eProbe’s software architecture design and the development of various high-level software components that coordinate and control the low-level hardware functions including data acquisition and image manipulation.
Although the eProbe’s control system runs within NI’s LabVIEW environment, the system’s user interface is based on a C# application from The PEER Group called the Peer Tool Orchestrator (PTO). JKI developed the interface between the eProbe’s front-end user interface and its LabVIEW-based control system using its internally developed tools. (Note: JKI offers several LabVIEW development tools and templates directly on this Web page.)
eProbe user interface screen
Once PDF Solutions started fielding eProbe systems, JKI sent people to work with PDF Solutions’ customers on site in a collaboration that helped generate ideas for future algorithm and tool improvements.
For more information about real-time LabVIEW development using the NI LabVIEW FPGA module and Xilinx-based NI hardware, contact JKI directly.
Xylon has a new hardware/software development kit for quickly implementing embedded, multi-camera vision systems for ADAS and AD (autonomous driving), machine-vision, AR/VR, guided robotics, drones, and other applications. The new logiVID-ZU Vision Development Kit is based on the Xilinx Zynq UltraScale+ MPSoC and includes four Xylon 1Mpixel video cameras based on the TI FPD (flat-panel display) Link-III interface. The kit supports HDMI video input and display output and comes complete with extensive software deliverables including pre-verified camera-to-display SoC designs built with licensed Xylon logicBRICKS IP cores, reference designs and design examples prepared for the Xilinx SDSoC Development Environment, and complete demo Linux applications.
Xylon’s new logiVID-ZU Vision Development Kit
Please contact Xylon for more information about the new logiVID-ZU Vision Development Kit.
Earlier this week at San Jose State University (SJSU), Jim Hogan, one of Silicon Valley’s most successful venture capitalists, gave a talk on the disruptive effects that cognitive science and AI are already having on society. In a short portion of that talk, Hogan discussed how he and one of his teams developed the world’s most experienced lung-cancer radiologist—an AI app—for $75:
Hogan’s trained AI radiologist can look at lung images and find possibly cancerous tumors based on thousands of cases in the CDC database. However, said Hogan, the US Veterans Administration has a database with millions of cases. Yes, his team used that database for training too.
Hogan predicted that something like 25 million AI apps like his lung-cancer-specific radiologist will be developed over the next few years. His $75 example is meant to prove the cost feasibility of developing that many useful apps.
Hogan made me a believer.
In a related connection to AWS app development, Xilinx has just posted four training videos showing you how to develop FPGA-accelerated apps using Xilinx’s SDAccel on Amazon’s AWS EC2 F1 instance. That’s nearly two hours of free training available at your desk. (For more information on the AWS EC2 F1 instance, see “SDAccel for cloud-based application acceleration now available on Amazon’s AWS EC2 F1 instance” and “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available”.)
Here are the videos:
Note: You can watch Jim Hogan’s 90-minute presentation at SJSU by clicking here.
SDAccel—Xilinx’s development environment for accelerating cloud-based applications using C, C++, or OpenCL—is now available on Amazon’s AWS EC2 F1 instance. (Formal announcement here.) The Amazon EC2 F1 compute instance allows you to create custom hardware accelerators for your application using cloud-based server hardware that incorporates multiple Xilinx Virtex UltraScale+ VU9P FPGAs. SDAccel automates the acceleration of software applications by building application-specific FPGA kernels for the AWS EC2 F1. You can also use HDLs including Verilog and VHDL to define hardware accelerators in SDAccel. With this release, you can access SDAccel through the AWS FPGA developer AMI.
For more information about Amazon’s AWS EC2 F1 instance, see:
For more information about SDAccel, see:
Brandon Treece from National Instruments (NI) has just published an article titled “CPU or FPGA for image processing: Which is best?” on Vision-Systems.com. NI offers a Vision Development Module for LabVIEW, the company’s graphical system design environment, that can run vision algorithms on both CPUs and FPGAs, so the perspective is a knowledgeable one. Abstracting the article: what you get from an FPGA-accelerated imaging pipeline is speed. If you’re performing four 6msec operations on each video frame, a CPU will need 24msec (four times 6msec) to complete the operations, while an FPGA offers you parallelism that shortens the processing time for each operation and permits overlap among the operations, as illustrated in this figure taken from the article:
In this example, the FPGA needs a total of 6msec to perform the four operations and another 2msec to transfer a video frame back and forth between processor and FPGA. The CPU needs a total of 24msec for all four operations. The FPGA needs 8msec, for a 3x speedup.
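The back-of-the-envelope arithmetic behind that 3x figure can be sketched in a few lines of Python. The timing values come straight from the article’s example; the fully overlapped pipeline is an idealization:

```python
# Sequential CPU vs. pipelined FPGA timing for N identical
# per-frame operations (values from the article's example).
ops, op_ms, xfer_ms = 4, 6.0, 2.0

cpu_ms = ops * op_ms        # CPU runs the operations back-to-back
fpga_ms = op_ms + xfer_ms   # FPGA overlaps the operations in a
                            # pipeline, plus the frame transfer time
                            # to and from the processor

print(cpu_ms, fpga_ms, cpu_ms / fpga_ms)
```

In practice the per-operation latencies differ and the overlap is imperfect, but the model captures why pipelining pays off as the number of chained operations grows.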
Treece then demonstrates that the acceleration is actually much greater in the real world. He uses the example of a video processing sequence needed for particle counting that includes these three major steps:
Here’s an image series that shows you what’s happening at each step:
Using the NI Vision Development Module for LabVIEW, he then runs the algorithm on an NI cRIO-9068 CompactRIO controller, which is based on a Xilinx Zynq Z-7020 SoC. Running the algorithm on the Zynq SoC’s ARM Cortex-A9 processor takes 166.7msec per frame. Running the same algorithm but accelerating the video processing using the Zynq SoC’s integral FPGA hardware takes 8msec. Add in another 0.5msec for DMA transfer of the pre- and post-processed video frame back and forth between the Zynq SoC’s CPU and FPGA and you get about a 20x speedup.
A key point here is that because the cRIO-9068 controller is based on the Zynq SoC, and because NI’s Vision Development Module for LabVIEW supports FPGA-based algorithm acceleration, this is an easy choice to make. The resources are there for your use. You merely need to click the “Go-Fast” button.
For more information about NI’s Vision Development Module for LabVIEW and cRIO-9068 controller, please contact NI directly.
A new open-source tool named GUINNESS makes it easy for you to develop binarized (2-valued) neural networks (BNNs) for Zynq SoCs and Zynq UltraScale+ MPSoCs using the SDSoC Development Environment. GUINNESS is a GUI-based tool that uses the Chainer deep-learning framework to train a binarized CNN. In a paper titled “On-Chip Memory Based Binarized Convolutional Deep Neural Network Applying Batch Normalization Free Technique on an FPGA,” presented at the recent 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, authors Haruyoshi Yonekawa and Hiroki Nakahara describe a system they developed to implement a binarized CNN for the VGG-16 benchmark on the Xilinx ZCU102 Eval Kit, which is based on a Zynq UltraScale+ ZU9EG MPSoC. Nakahara presented the GUINNESS tool again this week at FPL2017 in Ghent, Belgium.
According to the IEEE paper, the Zynq-based BNN is 136.8x faster and 44.7x more power efficient than the same CNN running on an ARM Cortex-A57 processor. Compared to the same CNN running on an Nvidia Maxwell GPU, the Zynq-based BNN is 4.9x faster and 3.8x more power efficient.
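A quick sketch of why BNNs map so well onto FPGA fabric: with weights and activations constrained to ±1 and packed into bit vectors, the multiply-accumulate at the heart of a CNN collapses into an XNOR followed by a popcount, operations that FPGA LUTs implement very cheaply. Here’s a toy 8-element illustration in Python; the bit encoding (1 for +1, 0 for −1) is a common convention used here for demonstration, not necessarily the paper’s exact scheme:

```python
# Binarized dot product: values in {-1, +1} packed one per bit
# (bit = 1 means +1, bit = 0 means -1). XNOR counts the positions
# where the two vectors agree; each agreement contributes +1 to the
# dot product and each disagreement contributes -1.
def bnn_dot(a_bits, w_bits, n):
    """Dot product of two n-element {-1, +1} vectors packed as bits."""
    matches = bin(~(a_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

a = 0b10110010   # +1 -1 +1 +1 -1 -1 +1 -1  (MSB first)
w = 0b10010011   # +1 -1 -1 +1 -1 -1 +1 +1

print(bnn_dot(a, w, 8))  # 6 matches, 2 mismatches -> 6 - 2 = 4
```

Scaling this up, a 512-element binarized dot product becomes a 512-bit XNOR plus a popcount tree, which is a tiny fraction of the logic that 512 fixed-point multipliers would need.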
Xilinx ZCU102 Zynq UltraScale+ MPSoC Eval Kit
The Xilinx Technology Showcase 2017 will highlight FPGA-acceleration as used in Amazon’s cloud-based AWS EC2 F1 Instance and for high-performance, embedded-vision designs—including vision/video, autonomous driving, Industrial IoT, medical, surveillance, and aerospace/defense applications. The event takes place on Friday, October 6 at the Xilinx Summit Retreat Center in Longmont, Colorado.
You’ll also have a chance to see the latest ways you can use the increasingly popular Python programming language to create Zynq-based designs. The Showcase is a prelude to the 30-hour Xilinx Hackathon starting immediately after the Showcase. (See “Registration is now open for the Colorado PYNQ Hackathon—strictly limited to 35 participants. Apply now!”)
The Xilinx Technology Showcase runs from 3:00 to 5:00 PM.
Click here for more details and for registration info.
Xilinx Colorado, Longmont Facility
For more information about the FPGA-accelerated Amazon AWS EC2 F1 Instance, see:
The Xilinx Hackathon is a 30-hour marathon event being held at the Xilinx “Retreat” (also known as the Xilinx Colorado facility in Longmont, but see the image below), starting on October 7. The organizers are looking for no more than 35 heroic coders who will receive a Python-programmable, Zynq-based Digilent/Xilinx PYNQ-Z1 board and an assortment of Arduino-compatible shields and sensors. The intent, as Zaphod Beeblebrox might say, is to create something not just amazing but “amazingly amazing.”
Xilinx Colorado, Longmont Facility
In case you’ve not read about it, the PYNQ project is an open-source project from Xilinx that makes it easy to design high-performance embedded systems using Xilinx Zynq Z-7000 SoCs. Here’s what’s on the PYNQ-Z1 board:
The PYNQ-Z1 Board
For more information about the PYNQ project, see:
By Adam Taylor
Video Direct Memory Access (VDMA) is one of the key IP blocks used within many image-processing applications. It allows frames to be moved between the Zynq SoC’s and Zynq UltraScale+ MPSoC’s PS and PL with ease. Once the frame is within the PS domain, we have several processing options available. We can implement high-level image processing algorithms using open-source libraries such as OpenCV and acceleration stacks such as the Xilinx reVISION stack if we wish to process images at the edge. Alternatively, we can transmit frames over Gigabit Ethernet, USB3, PCIe, etc. for offline storage or later analysis.
It can be infuriating when our VDMA-based image-processing chain does not work as intended. Therefore, we are going to look at a simple VDMA example and the steps we can take to ensure that it works as desired.
The simple VDMA example shown below contains the basic elements needed to provide VDMA output to a display. The processing chain starts with a VDMA read that obtains the current frame from DDR memory. To correctly size the data-stream width, we use an AXIS subset converter to convert the 32-bit data read from DDR memory into a 24-bit RGB format with 8 bits per color channel. Finally, we output the image with an AXIS-to-video output block that converts the AXIS stream to parallel video with video data and sync signals, using timing provided by the Video Timing Controller (VTC). We can use this parallel video output to drive a VGA, HDMI, or other video display output with an appropriate PHY.
This example outlines a read case from the PS to the PL and corresponding output. This is a more complicated case than performing a frame capture and VDMA write because we need to synchronize video timing to generate an output.
Simple VDMA-Based Image-Processing Pipeline
So what steps can we take if the VDMA-based image pipeline does not function as intended? To correct the issue:
Hopefully, by the time you have checked these points, the issue with your VDMA-based image-processing pipeline will have been identified and you can start developing the higher-level image-processing algorithms needed for the application.
Code is available on Github as always.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
by Anthony Boorsma, DornerWorks
Why aren’t you getting all of the performance that you expect after moving a task or tasks from the Zynq PS (processing system) to its PL (programmable logic)? If you used SDSoC to develop your embedded design, there’s help available. Here’s some advice from DornerWorks, a Premier Xilinx Alliance Program member. This blog is adapted from a recent post on the DornerWorks Web site titled “Fine Tune Your Heterogeneous Embedded System with Emulation Tools.”
Thanks to Xilinx’s SDSoC Development Environment, offloading portions of your software algorithm to a Zynq SoC’s or Zynq UltraScale+ MPSoC’s PL (programmable logic) to meet system performance requirements is straightforward. Once you have familiarized yourself with SDSoC’s data-transfer options for moving data back and forth between the PS and PL, you can select the appropriate data mover that represents the best choice for your design. SDSoC’s software estimation tool then shows you the expected performance results.
Yet when performing the ultimate test of execution—on real silicon—the performance of your system sometimes fails to match expectations and you need to discover the cause… and the cure. Because you’ve offloaded software tasks to the PL, your existing software debugging/analysis methods do not fully apply because not all of the processing occurs in the PS.
You need to pinpoint the cause of the unexpected performance gap. Perhaps you made a sub-optimal choice of data mover. Perhaps the offloaded code was not a good candidate for offloading to the PL. You cannot cure the performance problem without knowing its cause.
Just how do you investigate and debug system performance on a Zynq-based heterogeneous embedded system with part of the code running in the PS and part in the PL?
If you are new to the world of debugging PL data processing, you may not be familiar with the options you have for viewing PL data flow. Fortunately, if you used SDSoC to accelerate software tasks by offloading them to the PL, there is an easy solution. SDSoC has an emulation capability for viewing the simulated operation of your PL hardware that uses the context of your overall system.
This emulation capability allows you to identify any timing issues with the data flow into or out of the auto-generated IP blocks that accelerate your offloaded software. The same capability can also show you if there is an unexpected slowdown in the offloaded software acceleration itself.
Using this tool can help you find performance bottlenecks. You can investigate these potential bottlenecks by watching your data flow through the hardware via the displayed emulation signal waveforms. Similarly, you can investigate the interface points by watching the data signals transfer data between the PS and the PL. This information provides key insights that help you find and fix your performance issues.
We’ll use the Xilinx Multiply and Add (MMADD) example to demonstrate how you can debug/emulate a hardware-accelerated function. For simplicity, we will focus on a single IP block: the matrix multiplier (Mmult), shown in Figure 1.
Figure 1: Multiplier IP block with Port A expanded to show its signals
We will look at the waveforms for the signals to and from this Mmult IP block in the emulation. Specifically we will view the A_PORTA signals as shown in the figure above. These signals represent the data input for matrix A, which corresponds to the software input param A to the matrix multiplier function.
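As a reference point when checking the emulated hardware’s behavior, it helps to have the expected result of the computation on hand. The matrix product that the Mmult block implements can be modeled in a few lines of plain Python; this is a generic reference model, not the Xilinx example code itself:

```python
# Reference model of a matrix multiply, for cross-checking the
# values the accelerated Mmult IP block should produce.
def mmult(a, b):
    """Multiply matrix a (m x n) by matrix b (n x p), returning m x p."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(mmult(A, B))  # -> [[19, 22], [43, 50]]
```

Comparing a known-good result like this against the values observed on the IP block’s output signals makes it much easier to tell a data-flow timing problem from a functional one.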
To get started with the emulation, enable generation of the “emulation model” configuration for the build in SDSoC’s project’s settings, as shown in Figure 2.
Figure 2: The mmult Project Settings needed to enable emulation
Next, rebuild your project as normal. After building your project with emulation model support enabled in the configuration, run the emulator by selecting “Start/Stop Emulation” under the “Xilinx Tools” menu option. When a window opens, select “Start” to start the emulator. SDSoC will then automatically launch an instance of Xilinx Vivado and open the auto-generated PL project that SDSoC created for you as a subproject within your SDSoC project.
We specifically want to view the A_PORTA signals of the Mmult IP block. These signals must be added to the Wave Window to be viewed during a simulation. The available Mmult signals can be viewed in the Objects pane by selecting the mmult_1 block in the Scopes pane. To add the A_PORTA signals to the Wave Window, select all of the “A_*” signals in the Objects pane, right click, and select “Add to Wave Window” as shown in Figure 3.
Figure 3: Behavioral Simulation – mmult_1 signals highlighted
Now you can run the emulation and view the signal states in the waveform viewer. Start the emulator by clicking “Run All” from the “Run” drop-down menu as shown in Figure 4.
Figure 4: Start emulation of the PL
Back in SDSoC’s toolchain environment, you can now run a debugging session that connects to this emulation session just as it would to your software running on the target. From the “Run” menu option, select “Debug As -> 1 Launch on Emulator (SDSoC Debugger)” to start the debug session, as shown in Figure 5.
Figure 5: Connect Debug Session to run the PL emulation
Now you can step or run through your application test code and view the signals of interest in the emulator. Shown below in Figure 6 are the A_PORTA signals we highlighted earlier and their signal values at the end of the PL logic operation using the Mmult and Add example test code.
Figure 6: Emulated mmult_1 signal waveforms
These signals tell us a lot about the performance of the offloaded code now running in the PL and we used familiar emulation tools to obtain this troubleshooting information. This powerful debugging method can help illuminate unexpected behavior in your hardware-accelerated C algorithm by allowing you to peer into the black box of PL processing, thus revealing data-flow behavior that could use some fine-tuning.
A recent Sensorsmag.com article written by Nick Ni and Adam Taylor, titled “Accelerating Sensor Fusion Embedded Vision Applications,” discusses some of the sensor-fusion principles behind, among other things, 3D stereo vision as used in the Carnegie Robotics Multisense stereo cameras discussed in today’s earlier blog titled “Carnegie Robotics’ FPGA-based GigE 3D cameras help robots sweep mines from a battlefield, tend corn, and scrub floors.” We’re putting large numbers of sensors into systems these days, and turning the resulting deluge of raw sensor data into usable information is a tough computational job.
Describing some of that job’s particulars consumes the first half of Ni and Taylor’s article. The second half then discusses some implementation strategies based on the new Xilinx reVISION stack, which is built on top of Xilinx Zynq SoCs and Zynq UltraScale+ MPSoCs.
If there are a lot of sensors in your next design, particularly image sensors, be sure to take a look at this article.
By Anthony Boorsma, DornerWorks
Having some trouble choosing between Vivado HLS and SDSoC? Here’s some advice from DornerWorks, a Premier Xilinx Alliance Program member. This blog is adapted from a recent post on the DornerWorks Web site titled “Algorithm Implementation and Acceleration on Embedded Systems.”
How does an engineer already experienced and comfortable with working in the Zynq SoC’s software-based PS (processing system) domain take advantage of the additional flexibility and processing power of the Zynq SoC’s PL (programmable logic)? The traditional method is through education and training to learn to program the PL using an HDL such as Verilog or VHDL. Another way is to learn and use a tool that allows you to take a software-based design written exclusively for the ARM 32-bit processors in the PS and transfer some or most of the tasks to the PL, without writing HDL descriptions.
One such tool is Xilinx’s Vivado High Level Synthesis (HLS). By leveraging the capabilities of HLS, you can prototype a design using the Zynq PS and then move functionality to the PL to boost performance. The advantage of this tool is that it generates IP blocks that can be used in the programmable logic of Xilinx FPGAs as well as Xilinx Zynq SoCs and Zynq UltraScale+ MPSoCs.
Logic optimization occurs when Vivado HLS synthesizes your algorithm’s C model and creates RTL. Code directives (essentially guidelines for the tool’s optimization process) let you guide the HLS tool’s synthesis from the C-model source to the RTL that is ultimately compiled into the FPGA bitstream. If you are working with an existing algorithm modeled in C, C++, or SystemC and need to implement that algorithm in custom logic for added performance, then HLS is a great tool choice.
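As a concrete (and hedged) illustration of what those directives look like, here is a small matrix multiply of the kind HLS synthesizes to RTL. The function name and directive choices here are ours for illustration, not the template project’s actual code, though PIPELINE and ARRAY_PARTITION are real Vivado HLS pragmas. A standard C compiler simply ignores the pragmas, so the same source still runs on the PS for functional verification:

```c
#include <stdint.h>

#define N 4

/* Small matrix multiply of the kind HLS can synthesize to custom logic.
 * The pragmas are optimization directives: PIPELINE overlaps iterations
 * of the inner loops, and ARRAY_PARTITION splits an array across
 * registers so multiple elements can be read in the same cycle.
 * A standard C compiler ignores them, so the code remains testable
 * in software first. */
void mmult(const int32_t a[N][N], const int32_t b[N][N], int32_t c[N][N])
{
#pragma HLS ARRAY_PARTITION variable=b complete dim=1
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE II=1
            int32_t sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
}
```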
However, be aware that the data movers that transfer data between the Zynq PS and the PL must be manually configured for performance when using Vivado HLS. This can become a complicated process when there’s significant data transfer between the domains.
A recent innovation that simplifies data-mover configuration is the development of the Xilinx SDSoC (Software-Defined System on Chip) Development Environment for use with Zynq SoCs and Zynq UltraScale+ MPSoCs. SDSoC builds on Vivado HLS’ capabilities by using HLS to perform the C-to-RTL conversion but with the convenient addition of automatically generated data movers, which greatly simplifies configuring the connection between the software running on the Zynq PS and the accelerated algorithm executing in the Zynq PL. SDSoC also allows you to guide data-mover generation by providing a set of pragmas to make specific data-mover choices. The SDSoC directive pragmas give you control over the automatically generated data movers but still require some minimal manual configuration. Code-directive pragmas for RTL optimization available in Vivado HLS are also available in SDSoC and can be used in tandem with SDSoC pragmas to optimize both the PL algorithm and the automatically generated data movers.
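A minimal sketch of how those SDSoC data-mover pragmas attach to an accelerated function’s prototype follows. The function itself is a made-up example rather than the template project’s code; the access_pattern and data_mover pragmas (and the AXIDMA_SIMPLE mover) are the kind SDSoC accepts, and a plain C compiler ignores them:

```c
#include <stdint.h>

#define SIZE 1024

/* SDSoC data-mover pragmas precede the prototype of the function
 * marked for hardware acceleration. They guide, rather than replace,
 * SDSoC's automatic data-mover generation: access_pattern tells the
 * tool the data is streamed sequentially, and data_mover requests a
 * specific mover (here a simple AXI DMA). */
#pragma SDS data access_pattern(in_a:SEQUENTIAL, out_b:SEQUENTIAL)
#pragma SDS data data_mover(in_a:AXIDMA_SIMPLE, out_b:AXIDMA_SIMPLE)
void scale_array(const int32_t in_a[SIZE], int32_t out_b[SIZE])
{
    for (int i = 0; i < SIZE; i++)
        out_b[i] = in_a[i] * 2;
}
```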
It is possible to disable SDSoC’s auto-generated data movers and use only the HLS optimizations. Shown below are two IP block diagrams: one generated with the auto-configured SDSoC data movers and one without them.
The following screen shots are taken from a Xilinx-provided template project demonstrating the acceleration of a software matrix multiplication and addition algorithm, provided with the SDx installation. We used the SDx 2016.4 toolchain and targeted an Avnet Zedboard with a standalone OS configuration for this example.
Here is a screen shot of the same block, but without the SDSoC data movers. (We disabled the automatic generation of data movers within SDSoC by manually declaring AXI HLS interface directives for both the mmult and madd accelerated IP blocks.)
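For readers without the screen shots at hand, the sketch below shows what we mean by manually declaring AXI interfaces inside the accelerated function: the block then drives its own AXI master ports instead of relying on SDSoC-generated movers. The port and bundle names are illustrative, the INTERFACE pragmas are standard Vivado HLS directives, and a plain C compiler ignores them:

```c
#include <stdint.h>

#define N 4

/* Manually declaring AXI interfaces for the accelerated function.
 * m_axi gives the generated IP block master AXI ports it drives
 * itself; s_axilite exposes a control register interface. Bundle
 * names (gmem) are our own illustrative choices. */
void madd(const int32_t a[N][N], const int32_t b[N][N], int32_t c[N][N])
{
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=c offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=return
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = a[i][j] + b[i][j];
}
```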
To achieve the best algorithm performance, be prepared to familiarize yourself with, and use, both the SDSoC and Vivado HLS user guides and datasheets. SDSoC provides a superset of Vivado HLS’s capabilities.
If you are developing and accelerating your model from first principles, want the flexibility of testing and proving out a design in software first, and don’t intend to use a Zynq SoC, then the Vivado HLS toolset is the place to start. A design started in HLS is transferable to SDSoC if requirements change. Alternatively, if a Zynq-based system is an option, it is worthwhile to start with SDSoC right away.
Amazon Web Services (AWS) is now offering the Xilinx SDAccel Development Environment as a private preview. SDAccel empowers hardware designers to easily deploy their RTL designs in the AWS F1 FPGA instance. It also automates the acceleration of code written in C, C++ or OpenCL by building application-specific accelerators on the F1. This limited time preview is hosted in a private GitHub repo and supported through an AWS SDAccel forum. To request early access, click here.
Last September at the GNU Radio Conference in Boulder, Colorado, Ettus Research announced the RFNoC & Vivado Challenge for SDR (software-defined radio). Ettus’ RFNoC (RF Network on Chip) is designed to allow you to efficiently harness the latest-generation FPGAs for SDR applications without being an expert firmware or FPGA developer. Today, Ettus Research and Xilinx announced the three challenge winners.
Ettus’ GUI-based RFNoC design tool allows you to create FPGA applications as easily as you can create GNU Radio flowgraphs. This includes the ability to seamlessly transfer data between your host PC and an FPGA. It dramatically eases the task of FPGA off-loading in SDR applications. Ettus’ RFNoC is built upon Xilinx’s Vivado HLS.
Here are the three winning teams and their projects:
Finally, here’s a 5-minute video announcing the winners along with the prizes they have won:
It will take you five or ten minutes to read the new SDSoC article written by Nick Ni and Adam Taylor titled “Developing all programmable logic using the SDSoC environment” and after you’re done, you’ll have a very good idea of why you might want to try Xilinx’s new SDSoC Development Environment for C, C++, and OpenCL application development. The article quickly guides you through a typical software/hardware development cycle and then gives you the performance results for an AES cryptography application.
The resulting performance chart showing as much as a 75% reduction in clock cycles for the application alone should be enough to pique your interest:
Got a problem getting enough performance out of your processor-based embedded system? You might want to watch a 14-minute video that does a nice job of explaining how you can develop hardware accelerators directly from your C/C++ code using the Xilinx SDK.
How much acceleration do you need? If you don’t know for sure, the video gives an example of an autonomous drone with vision and control tasks that need real-time acceleration.
What are your alternatives? If you need to accelerate your code, you can:
Unfortunately, each of these three alternatives increases power consumption. There’s another alternative, however, that can actually cut power consumption: one based on Xilinx All Programmable Zynq SoCs and Zynq UltraScale+ MPSoCs. By moving critical code into custom hardware accelerators implemented in the programmable logic incorporated into all Zynq family members, you can relieve the processor of the associated processing burden and actually slow the processor’s clock speed, thus reducing power. It’s quite possible to cut overall power consumption using this approach.
Ah, but implementing these accelerators. Aye, there’s the rub!
It turns out that implementation of these hardware accelerators might not be as difficult as you imagine. The Xilinx SDK is already a C/C++ development environment based on familiar IDE and compiler technology. Under the hood, the SDK serves as a single cockpit for all Zynq-based development work—software and hardware. It also includes SDSoC, the piece of the puzzle you need to convert C/C++ code into acceleration hardware using a 3-step process:
One development platform, SDK, serves all members of the Zynq SoC and Zynq UltraScale+ MPSoC device families, giving you a huge price/performance range.
Here’s that 14-minute video:
The latest “Powered by Xilinx” video, published today, provides more detail about the Perrone Robotics MAX development platform for developing all types of autonomous robots—including self-driving cars. MAX is a set of software building blocks for handling many types of sensors and controls needed to develop such robotic platforms.
Perrone Robotics has MAX running on the Xilinx Zynq UltraScale+ MPSoC and relies on that heterogeneous All Programmable device to handle the multiple, high-bit-rate data streams from complex sensor arrays that include lidar systems and multiple video cameras.
Perrone is also starting to develop with the new Xilinx reVISION stack and plans to both enhance the performance of existing algorithms and develop new ones for its MAX development platform.
Here’s the 4-minute video:
In a free Webinar taking place on July 12, Xilinx experts will present a new design approach that unleashes the immense processing power of FPGAs using the Xilinx reVISION stack including hardware-tuned OpenCV libraries, a familiar C/C++ development environment, and readily available hardware-development platforms to develop advanced vision applications based on complex, accelerated vision-processing algorithms such as dense optical flow. Even though the algorithms are advanced, power consumption is held to just a few watts thanks to Xilinx’s All Programmable silicon.
Xilinx announced the addition of the P4₁₆ network programming language for SDN applications to its SDNet Development Environment for high-speed (1Gbps to 100Gbps) packet processing back in May. (See “The P4 has landed: SDNet 2017.1 gets P4-to-FPGA compilation capability for 100Gbps data-plane packet processing.”) An OFC 2017 panel session in March—presented by Xilinx, Barefoot Networks, Netcope Technologies, and MoSys—discussed the adoption of P4, the emergent high-level language for packet processing, and early implementations of P4 for FPGA and ASIC targets. Here’s a half-hour video of that panel discussion.
Cloud computing and application acceleration for a variety of workloads including big-data analytics, machine learning, video and image processing, and genomics are big data-center topics, and if you’re one of those people looking for acceleration guidance, read on. If you’re looking to accelerate compute-intensive applications such as automated driving and ADAS or local video processing and sensor fusion, this blog post’s for you too. The basic problem here is that CPUs are too slow and burn too much power. You may have one or both of these challenges. If so, you may be considering a GPU or an FPGA as an accelerator in your design.
How to choose?
Although GPUs started as graphics accelerators, primarily for gamers, a few architectural tweaks and a ton of software have made them suitable as general-purpose compute accelerators. With the right software tools, it’s not too difficult to recode and recompile a program to run on a GPU instead of a CPU. With some experience, you’ll find that GPUs are not great for every application workload. Certain computations such as sparse matrix math don’t map onto GPUs well. One big issue with GPUs is power consumption. GPUs aimed at server acceleration in a data-center environment may burn hundreds of watts.
With FPGAs, you can build any sort of compute engine you want with excellent performance/power numbers. You can optimize an FPGA-based accelerator for one task, run that task, and then reconfigure the FPGA if needed for an entirely different application. The amount of computing power you can bring to bear on a problem is scary big. A Virtex UltraScale+ VU13P FPGA can deliver 38.3 INT8 TOPS (that’s tera operations per second) and if you can binarize the application, which is possible with some neural networks, you can hit 500TOPS. That’s why you now see big data-center operators like Baidu and Amazon putting Xilinx-based FPGA accelerator cards into their server farms. That’s also why you see Xilinx offering high-level acceleration programming tools like SDAccel to help you develop compute accelerators using Xilinx All Programmable devices.
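To see why binarizing a neural network maps so well onto FPGA fabric (and how the headline TOPS numbers become plausible), consider the core operation. With weights and activations constrained to ±1 and packed one value per bit, a 64-element dot product collapses to an XNOR and a population count — operations FPGA logic handles in a single cycle. The sketch below is our own illustration of the technique, not Xilinx’s implementation:

```c
#include <stdint.h>

/* Binary dot product via XNOR-popcount. Each bit encodes +1 (bit set)
 * or -1 (bit clear). XNOR marks positions where the two values match,
 * so:  result = matches - mismatches = 2*popcount(~(a^w)) - 64.
 * __builtin_popcountll is a GCC/Clang builtin. */
static int binary_dot64(uint64_t a, uint64_t w)
{
    uint64_t xnor = ~(a ^ w);               /* 1 where the +/-1 values match */
    int matches = __builtin_popcountll(xnor);
    return 2 * matches - 64;                /* matches - (64 - matches)      */
}
```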
For more information about the use of Xilinx devices in such applications including a detailed look at operational efficiency, there’s a new 17-page White Paper titled “Xilinx All Programmable Devices: A Superior Platform for Compute-Intensive Systems.”
There’s considerable 5G experimentation taking place as the radio standards have not yet gelled and researchers are looking to optimize every aspect. SDRs (software-defined radios) are excellent experimental tools for such research—NI’s (National Instruments’) SDR products especially so because, as the Wireless Communication Research Laboratory at Istanbul Technical University discovered:
“NI SDR products helped us achieve our project goals faster and with fewer complexities due to reusability, existing examples, and the mature community. We had access to documentation around the examples, ready-to-run conceptual examples, and courseware and lab materials around the grounding wireless communication topics through the NI ecosystem. We took advantage of the graphical nature of LabVIEW to combine existing blocks of algorithms more easily compared to text-based options.”
Researchers at the Wireless Communication Research Laboratory were experimenting with UFMC (universal filtered multicarrier) modulation, a leading modulation candidate technique for 5G communications. Although current communication standards frequently use OFDM (orthogonal frequency-division multiplexing), it is not considered to be a suitable modulation technique for 5G systems due to its tight synchronization requirements, inefficient spectral properties (such as high spectral side-lobe levels), and cyclic prefix (CP) overhead. UFMC has relatively relaxed synchronization requirements.
The research team at the Wireless Communication Research Laboratory implemented UFMC modulation using two USRP-2921 SDRs, a PXI-6683H timing module, and a PXIe-5644R VST (Vector Signal Transceiver) module from National Instruments (NI), all programmed with NI’s LabVIEW systems engineering software. Using this equipment, they achieved better spectral results than with OFDM and, by exploiting UFMC’s sub-band filtering approach, they’ve proposed enhanced versions of UFMC. Details are available in the NI case study titled “Using NI Software Defined Radio Solutions as a Testbed of 5G Waveform Research.” This project was a finalist in the 2017 NI Engineering Impact Awards, RF and Mobile Communications category, held last month in Austin as part of NI Week.
5G UFMC Modulation Testbed based on Equipment from National Instruments
Note: NI’s USRP-2921 SDR is based on a Xilinx Spartan-6 FPGA; the NI PXI-6683 timing module is based on a Xilinx Virtex-5 FPGA; and the PXIe-5644R VST is based on a Xilinx Virtex-6 FPGA.
Although humans once served as the final inspectors for PCBs, today’s component dimensions and manufacturing volumes mandate the use of camera-based automated optical inspection (AOI) systems. Amfax has developed a 3D AOI system—the a3Di—that uses two lasers to make millions of 3D measurements with better than 3μm accuracy. One of the company’s customers uses an a3Di system to inspect 18,000 assembled PCBs per day.
The a3Di control system is based on a National Instruments (NI) cRIO-9075 CompactRIO controller—with an integrated Xilinx Virtex-5 LX25 FPGA—programmed with NI’s LabVIEW systems engineering software. The controller manages all aspects of the a3Di AOI system including monitoring and control of:
The system provides height-graded images like this:
3D Image of a3Di’s Measurement Data: Colors represent height, with Z resolution down to less than a micron. The blue section at the top indicates signs of board warp. Laser etched component information appears on some of the ICs.
The a3Di system then compares this image against a stored golden reference image to detect manufacturing defects.
Amfax says that the “CompactRIO system has proven to be dependable, reliable, and cost-effective.” In addition, the company found it could get far better timing resolution with the CompactRIO system than the 1msec resolution usually provided by PLC controllers.
This project was a 2017 NI Engineering Impact Award Finalist in the Electronics and Semiconductor category last month at NI Week. It is documented in this NI case study.
Hyundai Heavy Industries (HHI) is the world’s foremost shipbuilding company and the company’s Engine and Machinery Division (HHI-EMD) is the world’s largest marine diesel engine builder. HHI’s HiMSEN medium-sized engines are four-stroke diesels with output power ranging from 960kW to 25MW. These engines power electric generators on large ships and serve as the propulsion engine on medium and small ships. HHI-EMD is always developing newer, more fuel-efficient engines because the fuel costs for these large diesels run about $2000/hour. Better fuel efficiency will significantly reduce operating costs and emissions.
For that research, HHI-EMD developed monitoring and diagnostic equipment to better understand engine combustion performance and an HIL system to test new engine controller designs. The test and HIL systems are based on equipment from National Instruments (NI).
Engine instrumentation must be able to monitor 10-cylinder engines running at thousands of RPM while measuring crankshaft angle to 0.1-degree resolution. From that information, the engine test and monitoring system calculates in-cylinder peak pressure, mean effective pressure, and cycle-to-cycle pressure variation. All this must happen every 10μsec for each cylinder.
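For reference, the mean-effective-pressure figure mentioned above is the standard textbook quantity: the net work of one cycle (the loop integral of pressure over volume) divided by the displaced volume. The sketch below is our own illustration of that calculation over crank-angle-resolved samples, not HHI-EMD’s actual code:

```c
#include <stddef.h>

/* Indicated mean effective pressure (IMEP) from one engine cycle:
 *   IMEP = (closed-loop integral of p dV) / (Vmax - Vmin)
 * p[i] is cylinder pressure (Pa) and v[i] cylinder volume (m^3),
 * sampled at successive crank angles over one full cycle. A
 * trapezoidal sum approximates the loop integral, wrapping the
 * last sample back to the first to close the cycle. */
double imep(const double *p, const double *v, size_t n)
{
    double work = 0.0, vmin = v[0], vmax = v[0];
    for (size_t i = 0; i < n; i++) {
        size_t j = (i + 1) % n;                     /* wrap to close cycle */
        work += 0.5 * (p[i] + p[j]) * (v[j] - v[i]);
        if (v[i] < vmin) vmin = v[i];
        if (v[i] > vmax) vmax = v[i];
    }
    return work / (vmax - vmin);
}
```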
HHI-EMD elected to use an NI cRIO-9035 Controller, which incorporates a Xilinx Kintex-7 70T FPGA, to serve as the platform for developing its HiCAS test and data-acquisition system. The HiCAS system monitors all aspects of the engine under test including engine speed, in-cylinder pressure, and pressures in the intake and exhaust systems. This data helped HHI-EMD engineers analyze the engine’s overall performance and the performance of key parts using thermodynamic analysis. HiCAS provides real-time analysis of dynamic data including:
Using the collected data, the engineering team then developed a model of the diesel engine, resulting in the development of an HIL system used to exercise the engine controllers. This engine model runs in real time on an NI PXI system synchronized with the high-speed signal-sensor simulation software running on the PXI system’s multifunction FPGA-based FlexRIO module. The HIL system transmits signals to the engine controllers, simulating an operating engine and eliminating the operating costs of a large diesel engine during these tests. HHI-EMD credits the FPGAs in these systems for making the calculations run fast enough for real-time simulation. The simulated engine also permits fault testing without the risk of damaging an actual engine. Of course, all of this is programmed using NI’s LabVIEW systems engineering software and LabVIEW FPGA.
HHI-EMD HIL Simulator for Marine Diesel Engines
According to HHI-EMD, development of the HiCAS engine-monitoring system and virtual verification based on the HIL system shortened development time from more than three years to one, significantly accelerating the time-to-market for HHI-EMD’s more eco-friendly marine diesel engines.
This project was a 2017 NI Engineering Impact Award Finalist in the Transportation and Heavy Equipment category last month at NI Week and won the 2017 HPE Edgeline Big Analog Data Award. It is documented in this NI case study.