Network World reports that the IEEE has ratified IEEE P802.3bz, the standard that defines 2.5GBASE-T and 5GBASE-T Ethernet. That’s a very big deal because these new standards can increase existing 1000BASE-T network line speeds by 5x without the need to upgrade in-place cabling. The new standard’s development is being supported through the collaborative efforts of the Ethernet Alliance and NBASE-T Alliance. (See “Ethernet and NBASE-T Alliances to Host Joint 2.5/5Gbps Plugfest in October.”) Xilinx is a founding member of the NBASE-T Alliance.
For more information about NBASE-T and the NBASE-T alliance, see “NBASE-T aims to boost data center bandwidth and throughput by 5x with existing Cat 5e/6 cable infrastructure” and “12 more companies join NBASE-T alliance for 2.5 and 5Gbps Ethernet standards.”
For additional information about the PHY technology behind NBASE-T, see “Boost data center bandwidth by 5x over Cat 5e and 6 cabling. Ask your doctor if Aquantia’s AQrate is right for you” and “Teeny, tiny, 2nd-generation 1- and 4-port PHYs do 5 and 2.5GBASE-T Ethernet over 100m (and why that’s important).”
“Programming FPGAs: Getting Started with Verilog” by Simon Monk is hot off the press and must have come from Amazon Prime’s Future Division because the publication date says 2017 but my colleague Aaron Behman just received his copy. This short, $20 book introduces you to Verilog HDL (hardware description language), used by many engineers to develop FPGA-based designs. (ASIC designs too!)
The entire book is based on the Xilinx ISE Design Suite, which just got a new lease on life with yesterday’s announcement that Xilinx will be supporting ISE on Windows 10 and CentOS Linux by the end of this year. (See “Good news for Spartan-6 and ISE users: Windows 10 and CentOS Linux support for ISE 14.7.”)
Like the book itself, the three hardware targets for the projects in the book are low-cost:
The boards’ use of these three older FPGA devices explains the book’s use of the Xilinx ISE Design Suite.
If you’re looking for a gentle, low-cost entry into the FPGA world, this might be the door you’re seeking.
Aaron also recommends that you take a look at O’Reilly’s “Make: FPGAs” book by David Romano.
PYNQ is an open-source project that makes it easy for you to design embedded systems based on the Xilinx Zynq-7000 SoC using the Python language, associated libraries, and the Jupyter Notebook, which is a pretty nice, collaborative learning and development environment for many programming languages including Python. PYNQ allows you to exploit the benefits of programmable logic used together with microprocessors to build more capable embedded systems with superior performance when performing embedded tasks such as:
Nearly every embedded system needs to run one or more such tasks. The programmable hardware on the Zynq SoC just makes this job a lot easier.
The PYNQ-Z1 based on the Xilinx Zynq Z-7020 SoC is the first dev board to support PYNQ and it just showed up on the Digilent Web site, listing for $229. (Digilent’s academic price for the PYNQ-Z1 is only $65!)
Digilent PYNQ-Z1 Dev Board
Here’s what’s on the PYNQ-Z1 board:
That’s a lot of board for $229—and it’s pink!
Here’s what PYNQ is, really: It’s for software developers and students who want to take advantage of the improved embedded performance made possible by the Zynq SoC’s programmable hardware without having to use ASIC-style (HDL) design tools to design hardware.
For even better performance, you can also program the PYNQ-Z1 in C or C++ with PYNQ, using the Xilinx SDK software development environment, which is available in the no-cost Xilinx Vivado HL Design Suite WebPACK.
The PYNQ-Z1 and the PYNQ project make it possible to create a lot of really interesting systems, so just what are you waiting for?
Please contact Digilent directly for more information about the PYNQ-Z1 dev board.
With all the hoopla about the three new 28nm one-ARM Zynq Z-7000S SoCs and the six new 28nm Spartan-7 FPGA family members announced today, you might be feeling a wee bit left out if you’re still designing with those trusty Xilinx Spartan-6 devices. Cheer up. I’ve just been told that Xilinx will support ISE 14.7 running on Windows 10 and CentOS Linux by the end of this year.
For today’s other two Xilinx announcements, see:
Xilinx announced six members in the new Spartan-7 FPGA family today. These devices are the lowest-cost devices in the 28nm Xilinx 7 series and they’re optimized for low, low cost per I/O while delivering terrific performance/watt. Compared to Xilinx Spartan-6 FPGAs, Spartan-7 FPGAs consume half the power (for comparable designs) and run at a 30% higher operating frequency.
Let’s jump right to the good stuff. Here’s the Spartan-7 FPGA device table so you can see what’s inside each of the six devices:
Spartan-7 FPGA Family Table
These numbers tell part of the Spartan-7 family story. If you’re comparing the 28nm Spartan-7 family with the older-but-still-popular 45nm Spartan-6 FPGA family, you’ll see that you can get a lot more FPGA in a Spartan-7 device. The smallest Spartan-7 FPGA, the XC7S6, has 6000 logic cells while the smallest Spartan-6 device, the XC6SLX4, has 3840 logic cells. That’s roughly 56% more logic.
However, logical size is not the entire story. There’s also news on the physical front. Spartan-7 device packaging has been engineered specifically for low cost and small size. You can get the new Spartan-7 XC7S6 and XC7S15 in minuscule 8x8mm packages! That’s a tiny 64mm² of PCB real estate but you still get 86 I/O pins. (Ball pitch is 0.5mm.) Think of what you can do with that sort of I/O capability and where you might tuck one of these puppies into your design. Who knows? You might even want to put a second one in somewhere, just in case.
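The size comparisons above are easy to check with a few lines of Python. This is just a back-of-the-envelope sketch using the figures quoted in this post; the variable and function names are made up for illustration.

```python
# Figures quoted in the text above.
XC7S6_LOGIC_CELLS = 6000      # smallest Spartan-7 device
XC6SLX4_LOGIC_CELLS = 3840    # smallest Spartan-6 device

PACKAGE_SIDE_MM = 8           # 8x8mm package
PACKAGE_IO_PINS = 86          # I/O pins available in that package


def percent_more(new: int, old: int) -> float:
    """Return how much larger `new` is than `old`, as a percentage."""
    return (new - old) / old * 100.0


logic_gain_pct = percent_more(XC7S6_LOGIC_CELLS, XC6SLX4_LOGIC_CELLS)
package_area_mm2 = PACKAGE_SIDE_MM ** 2
io_per_mm2 = PACKAGE_IO_PINS / package_area_mm2

print(f"Logic-cell gain over the XC6SLX4: {logic_gain_pct:.2f}%")
print(f"Package footprint: {package_area_mm2} mm^2")
print(f"I/O density: {io_per_mm2:.2f} pins per mm^2")
```

The I/O density number is simply pins divided by package footprint, so it ignores the board area you need for escape routing.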
And, as they say on TV, that’s not all. The Spartan-7 story is also a Vivado Design Suite story, and that’s a pretty big part of this story, because you can use the Vivado Design Suite tools including Vivado HLS to develop FPGA designs for Spartan-7 devices starting with the 2016.3 release. That includes the no-cost Vivado Design Suite WebPACK Edition. (Vivado HLS allows you to create FPGA designs using C, C++, or SystemC.)
You’ll also want to compare the new Spartan-7 FPGA family members with the existing 28nm Artix-7 devices. Here the biggest differences are that you can get more programmable resources in the larger Artix-7 FPGAs and the Artix-7 devices incorporate two to sixteen 6.6Gbps GTP high-speed serial transceivers. The Spartan-7 devices do not incorporate any serial transceivers. (You can sort of compare the Spartan-7/Artix-7 device relationship to the Spartan-6LX/Spartan-6LXT relationship, if that helps.)
This handy table from the new 7 Series FPGAs Overview product specification further differentiates between the two 28nm families:
Spartan-7 and Artix-7 FPGA comparison
Spartan-7 FPGAs will begin sampling in the first quarter of 2017.
Today, Xilinx welcomed a new family and three new devices into the growing line of Zynq-7000 SoCs and Zynq UltraScale+ MPSoCs. The new family is called the Zynq Z-7000S family and the three new devices are the Zynq Z-7007S, Z-7012S, and Z-7014S. The three devices in the Zynq Z-7000S family target smaller embedded designs and are therefore smaller and slower than other members of the Zynq Z-7000 SoC family, but only in a relative sense. These devices still offer the performance-boosting goodness of on-chip programmable logic just like their larger siblings.
The new Zynq Z-7000S family has two key features that differentiate it from the rest of the Zynq SoC device families. First, the members of the family have one ARM Cortex-A9 processor core (as opposed to dual-core ARM Cortex-A9 MPCore processors in the other Zynq Z-7000 SoC family members). The microprocessor in the Zynq Z-7000S family members also has a maximum clock rate of 766MHz instead of the 866MHz or 1GHz upper bound for the other Zynq Z-7000 SoC family members. Another significant difference is that the three new Zynq Z-7000S family members have fewer on-chip programmable-logic resources than the other Zynq Z-7000 SoC family members.
Here’s a table from the Xilinx Zynq Z-7000 SoC selection guide that compares all ten Zynq Z-7000 SoC devices:
Zynq Z-7000 and Z-7000S SoC Device Table
(Not shown in this table are any of the high-speed serial transceivers incorporated into several of the Zynq Z-7000 devices including the four 6.25Gbps GTP transceivers in the Zynq Z-7012S device.)
According to the 2015 UBM Electronics Embedded Markets Study, more than 50% of embedded system designs use just one microprocessor and that’s been true since 2011. (The percentage was likely higher back when embedded multi-core processors were far less common.) The three new Xilinx Zynq Z-7000S devices now offer designers of these smaller embedded systems alternatives that might well suit their needs when a microprocessor alone just can’t provide the requisite processing oomph needed for a project.
The three Xilinx Zynq Z-7000S family members offer a lower-cost entry point into the Zynq SoC family and the devices’ on-chip programmable logic resources act as universal, programmable I/O ports for any-to-any connectivity and as processor enhancers (application superchargers, if you will) that you can harness to provide precisely focused hardware acceleration for specialized embedded tasks such as sensor fusion or video processing. You just cannot do this sort of processing on a microprocessor without blowing your power and cost budgets. Programmable logic gives you far more performance/watt. Further, you’ll be able to create these accelerators in C, C++, or SystemC using Vivado HLS. The Xilinx Vivado Design Suite will support all Zynq Z-7000S family members starting with the Vivado 2016.3 release.
The dual-core Zynq-7000 devices are very handy when you want to pair an operating system running a GUI with an RTOS in an AMP (asymmetric multiprocessing) configuration. If you need even more performance, I’ll just briefly remind you that Xilinx already has that need covered as well with the 21 devices in the three Zynq UltraScale+ MPSoC families including:
Finally, because you’ll certainly want to know, Zynq-7000S production devices will begin shipping in the first quarter of 2017, according to today’s announcement.
The Xilinx Zynq UltraScale+ MPSoC contains a lot of goodies for system designers including four or six ARM processor cores (two or four 64-bit ARM Cortex-A53 application processor cores and two 32-bit ARM Cortex-R5 real-time processor cores) and plenty of advanced UltraScale+ programmable logic. One of the many unsung goodies under the Zynq UltraScale+ MPSoC’s hood is extensive, fine-grained power control.
There are three major power domains inside of the device’s processing system and one for the programmable logic, as shown below:
Zynq UltraScale+ MPSoC Power Domains
Each of the Zynq UltraScale+ MPSoC power domains draws power from separate power pins that you can connect to different external power regulators for independent power control. (Note: If your design does not require individual power-domain control, these power rails can share power supplies.)
The Zynq UltraScale+ MPSoC processing system’s three power domains include:
The battery-power domain contains battery-backed RAM, used for storing an encryption key, and a real-time clock with external crystal oscillator to sustain timekeeping even when the rest of the device is powered off. This domain is designed to be powered by an external battery. Power consumption for the battery-power mode ranges from 180nW when just powering the battery-backed RAM to 3μW when the real-time clock is enabled.
The low-power domain consists of a real-time processor unit with the two ARM Cortex-R5 processors, static on-chip memory, the platform management unit, the configuration and security unit, and low-speed peripherals. Power consumption for the low-power mode associated with the low-power domain ranges from about 20mW to about 400mW.
The full-power domain consists of the application processor unit, based on four ARM Cortex-A53 processors, the GPU, the DDR memory controller, and high-performance peripherals including PCIe, USB 3.0, DisplayPort, and SATA. Power consumption for the full-power mode depends on processor activity and how many processor cores are enabled. Power consumption in this mode can range up to a couple of watts.
The programmable-logic power domain includes logic cells, block RAMs, DSP blocks, the XADC, I/O ports, and high-speed serial interfaces. Other devices in the programmable-logic power domain include the video codec, PCIe Gen4 controller, UltraRAM, the 100G Ethernet MAC, and Interlaken I/O. Power consumption for the programmable logic depends entirely on what and how much you put into the programmable logic and how fast you clock it.
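If you like to keep this sort of information in a form you can compute with, here’s a minimal Python sketch that tabulates the four domains described above and adds up the quoted power ranges. The class, dictionary, and function names are hypothetical, and the sketch deliberately records no number for the full-power and programmable-logic domains because the text gives no hard range for either.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class PowerDomain:
    name: str
    contents: str
    # Quoted consumption range in watts; None where the text gives no range.
    quoted_range_w: Optional[Tuple[float, float]]


DOMAINS = {
    d.name: d
    for d in (
        PowerDomain("battery", "battery-backed RAM, real-time clock",
                    (180e-9, 3e-6)),
        PowerDomain("low-power", "two Cortex-R5 cores, on-chip memory, "
                    "platform management, low-speed peripherals",
                    (20e-3, 400e-3)),
        PowerDomain("full-power", "four Cortex-A53 cores, GPU, DDR controller, "
                    "high-performance peripherals", None),
        PowerDomain("programmable-logic", "logic cells, block RAM, DSP blocks, "
                    "I/O, serial interfaces", None),
    )
}


def quoted_budget_w(names: Iterable[str]) -> Tuple[float, float]:
    """Sum the quoted (min, max) power in watts for the named domains."""
    low = high = 0.0
    for name in names:
        quoted = DOMAINS[name].quoted_range_w
        if quoted is None:
            raise ValueError(f"no power range quoted for the {name} domain")
        low += quoted[0]
        high += quoted[1]
    return low, high


low, high = quoted_budget_w(["battery", "low-power"])
print(f"battery + low-power: {low * 1e3:.3f} mW to {high * 1e3:.3f} mW")
```

Raising an error for the unquoted domains is a design choice: it stops a budget estimate from silently treating "unknown" as zero.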
Each power domain in the Zynq UltraScale+ MPSoC’s processing system contains multiple independent power islands that you can gate individually for fine-grained power management. The power islands within the low-power domain include:
The power islands within the full-power domain include:
You’ll probably want a lot more information about the Zynq UltraScale+ MPSoC’s power-management features, so I recommend downloading a copy of the new White Paper titled “Managing Power and Performance with the Zynq UltraScale+ MPSoC” by Glenn Steiner and Brian Philofsky.
By Adam Taylor
Having examined how we can quickly and easily develop image-processing cores using Vivado HLS, I thought it would be a good idea to examine how we can interface actual image sensors to the Zynq SoC so that we can obtain an image that we can process.
At a high level, we can break down the interface into one of two different categories:
This is where the flexibility of the Zynq SoC really comes in handy. The ability to use the embedded peripheral cores in the Zynq SoC’s PS and the programmable I/O and logic in the Zynq SoC’s PL allows you to interface your design to any camera or any sensor and to create a tightly integrated system. The programmable nature of these interfaces means that you can use the Zynq SoC to create a vision platform for many varied camera and image-processing designs, and several commercial camera vendors have done exactly that.
If we are interfacing to a camera with a USB or GigE Vision video interface, we can use the I/O peripherals in the Zynq SoC’s PS. Images captured over these interfaces can then be routed via the central interconnect of the PS directly into attached DDR memory. Once the image is stored in memory, we can transfer the image from the SDRAM to the Zynq SoC’s PL for processing using VDMA over a high-performance AXI port.
Should the interface to the camera or the device use a lower-level I/O protocol, we can implement the required interface in the Zynq SoC’s PL. These lower-level interfaces typically provide frame and line valid signals along with pixel data. The way these signals are encoded varies, which adds some complexity to the design.
The simplest of these interfaces is a parallel CMOS interface, which provides frame-valid and line-valid signals along with the pixel values in a parallel form, as shown below:
Simple Parallel Video Interface
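To make the frame-valid/line-valid handshake concrete, here is a small behavioral model in Python. It is only a sketch of the idea, not HDL and not the code from this series: each sample is a hypothetical `(frame_valid, line_valid, pixel)` tuple taken on one pixel clock, and the function name is invented for this example.

```python
from typing import Iterable, List, Tuple

# One sample per pixel clock: (frame_valid, line_valid, pixel_value).
Sample = Tuple[int, int, int]


def capture_frame(samples: Iterable[Sample]) -> List[List[int]]:
    """Collect the first complete frame as a list of lines of pixels."""
    frame: List[List[int]] = []
    line: List[int] = []
    in_frame = False
    for frame_valid, line_valid, pixel in samples:
        if frame_valid and not in_frame:
            # Frame-valid went high: start a new frame.
            in_frame = True
            frame, line = [], []
        if in_frame and not frame_valid:
            # Frame-valid dropped: the frame is complete.
            if line:
                frame.append(line)
            return frame
        if in_frame:
            if line_valid:
                line.append(pixel)       # pixel belongs to the current line
            elif line:
                frame.append(line)       # line-valid dropped: close the line
                line = []
    if line:
        frame.append(line)               # stream ended mid-line
    return frame


stream = [
    (0, 0, 0),
    (1, 0, 0),
    (1, 1, 10), (1, 1, 11),
    (1, 0, 0),
    (1, 1, 20), (1, 1, 21),
    (1, 0, 0),
    (0, 0, 0),
    (1, 1, 99),   # start of the next frame, ignored here
]
print(capture_frame(stream))
```

The model assumes the stream starts between frames; a hardware implementation would normally wait for a clean rising edge on frame-valid before capturing anything.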
However, as we increase the frame rate of the image sensor, the use of a parallel CMOS output becomes challenging due to increased signal rates and we usually must use a serialized approach to I/O like Camera Link or LVDS.
Using either Camera Link or serialized LVDS requires that we de-serialize the channels to extract the required information. That involves recreating, inside the FPGA, the same parallel structure of pixel values and frame- and line-valid signals that we would get directly from a sensor with a parallel interface.
Camera Link comes in three different configurations (Base, Medium, and Full) providing 2.04, 4.08, and 5.44 Gbps respectively. The Base configuration employs four serialized LVDS data channels and an LVDS clock running at 85MHz. This interface transfers 24 bits of pixel data and 4 framing bits. The Medium and Full versions of the Camera Link interface each introduce another four LVDS links so that the Full version has 12 LVDS links.
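The quoted data rates follow directly from the 85MHz clock. The short Python sketch below reproduces them; the 24-bit figure for Base comes from the text, while the 48- and 64-bit payload widths for Medium and Full are inferred here from the quoted 4.08 and 5.44 Gbps rates. All names are illustrative.

```python
PIXEL_CLOCK_HZ = 85_000_000

# Payload bits transferred per pixel clock in each configuration.
# Base is stated in the text; Medium and Full are inferred from the rates.
PAYLOAD_BITS = {"Base": 24, "Medium": 48, "Full": 64}

BASE_DATA_LANES = 4
SERIALIZATION_FACTOR = 7
BASE_FRAMING_BITS = 4


def payload_bits_per_second(configuration: str) -> int:
    """Pixel-data throughput in bits per second for a configuration."""
    return PAYLOAD_BITS[configuration] * PIXEL_CLOCK_HZ


for name in PAYLOAD_BITS:
    print(f"{name}: {payload_bits_per_second(name) / 1e9:.2f} Gbps")

# Four lanes at 7:1 carry 28 bits per clock: 24 pixel bits + 4 framing bits.
bits_per_clock = BASE_DATA_LANES * SERIALIZATION_FACTOR
print(f"Base bits per clock: {bits_per_clock}")
```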
Camera Link achieves high data rates by serializing data at a rate of 7:1 and transmitting it over 4 LVDS links. The final LVDS link provides the clock, as shown below:
LVDS Camera Link Serialization
When we receive data over a Camera Link interface, we need to de-serialize the five LVDS lines and extract the pixel data in the correct order.
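As a rough illustration of what 7:1 serialization over four lanes means, the Python sketch below splits a 28-bit word across four lanes and reassembles it. The bit-to-lane mapping used here (contiguous 7-bit slices, most significant bit first) is an assumption chosen for clarity; it is not the bit assignment defined by the Camera Link standard.

```python
from typing import List

LANES = 4
BITS_PER_LANE = 7            # 7:1 serialization
WORD_BITS = LANES * BITS_PER_LANE


def serialize(word: int) -> List[List[int]]:
    """Split a 28-bit word into four 7-bit lanes, each sent MSB first."""
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("word must fit in 28 bits")
    lanes = []
    for lane in range(LANES):
        shift = (LANES - 1 - lane) * BITS_PER_LANE
        chunk = (word >> shift) & ((1 << BITS_PER_LANE) - 1)
        bits = [(chunk >> (BITS_PER_LANE - 1 - i)) & 1
                for i in range(BITS_PER_LANE)]
        lanes.append(bits)
    return lanes


def deserialize(lanes: List[List[int]]) -> int:
    """Reassemble the 28-bit word from the four received lanes."""
    word = 0
    for bits in lanes:
        for bit in bits:
            word = (word << 1) | bit
    return word


original = 0xABCDEF5
lanes = serialize(original)
assert deserialize(lanes) == original
print([("".join(map(str, bits))) for bits in lanes])
```

In hardware, all four lanes shift their bits out simultaneously during one pixel-clock period, which is why the receiver needs a clock running at 7x the pixel clock.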
We know the serialization is 7:1 so we can use one of the MMCMs (mixed-mode clock managers) provided by the Zynq SoC to generate a clock running at 7x the Camera Link clock frequency. However, we still need a framing reference to properly align the received data. Luckily, in the case of Camera Link, we can use the Camera Link clock as the framing reference.
To convert the four LVDS data channels from serial to parallel, we can use the ISERDESE2 blocks provided in the Zynq SoC’s I/O structure. We supply each ISERDESE2 with the parallel clock and the higher-speed serial clock, and it generates a parallel output of as many as 8 bits. (If necessary, we can chain ISERDESE2 blocks together for larger parallel outputs.) We need seven outputs for the Camera Link interface, as the serialization is 7:1, so we can de-serialize the interface using only one ISERDESE2 for each of the LVDS data channels and one for the clock.
We use the ISERDESE2 on the clock channel to provide the framing signal. When we have the correct relationship between the received Camera Link clock signal and the generated clock running at 7x the input frequency, the output from this ISERDESE2 block will be the pattern “1100011”.
We can use a simple state machine to look for this pattern while incrementing or decrementing the phase of the high-speed clock until the correct pattern is detected. Once this pattern is detected, we can then extract the line, frame, and pixel value data from the remaining four ISERDESE2 blocks.
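The alignment search can be modeled in a few lines of Python. This is a simplified behavioral sketch, not the HDL from this series: it treats misalignment as a whole-bit offset and steps it one bit at a time, whereas a real design adjusts the clock phase in much finer increments. The function names are invented for the example.

```python
FRAMING_PATTERN = "1100011"
WORD_BITS = len(FRAMING_PATTERN)


def received_word(offset: int) -> str:
    """Word seen on the clock lane when sampling starts `offset` bits late."""
    stream = FRAMING_PATTERN * 2          # the pattern repeats every clock
    return stream[offset:offset + WORD_BITS]


def align(initial_offset: int, max_steps: int = WORD_BITS):
    """Step the sampling offset until the framing pattern appears.

    Returns (final_offset, steps_taken).
    """
    offset = initial_offset % WORD_BITS
    for steps in range(max_steps):
        if received_word(offset) == FRAMING_PATTERN:
            return offset, steps          # locked
        offset = (offset + 1) % WORD_BITS  # nudge the phase by one bit
    raise RuntimeError("failed to find the framing pattern")


for start in range(WORD_BITS):
    final_offset, steps = align(start)
    print(f"start offset {start}: locked after {steps} step(s)")
```

The search always terminates because every rotation of the seven-bit pattern is distinct, so only one offset produces a match.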
Here’s a block diagram of the Base Camera Link receiver design based on the above discussion:
Example Base Camera Link Receiver
We can also take a similar approach to transmitting Camera Link data using an MMCM and OSERDESE2 blocks to perform the parallel-to-serial conversion.
While this example uses a Camera Link interface, the same general approach can be used for many serialized I/O applications that provide a signal we can use as a framing reference. Next week we will look at applications that do not provide a framing reference but instead provide a training pattern.
Code is available on Github as always.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
All of Adam Taylor’s MicroZed Chronicles are cataloged here.
Xcell Daily has covered several Photonfocus industrial video cameras and all of them have been based on a Xilinx Spartan-6 FPGA vision platform developed by Photonfocus that serves as a base for numerous industrial video cameras. One of the advantages that the Spartan-6 FPGA provides is the ability to adapt to nearly any sort of imaging sensor—monochrome, color, or hyperspectral—through reprogramming of the interface I/O and the on-board video processing. Another advantage the FPGA provides is the ability to go fast in video-processing applications.
Photonfocus used this latter capability to develop the DR1 family of double-rate GigE video cameras a while ago (the company recently announced quad-rate QR1 GigE industrial cameras based on the same FPGA platform) and a new article written by Andrew Wilson in Vision Systems Design Magazine details the use of Photonfocus DR1 double-rate cameras for a high-speed, visual-inspection system developed by M3C Industrial Automation and Vision. This system can inspect and sort 25,000 cork stoppers per hour using the Photonfocus DR1 camera’s extreme 1800 frames/sec capability.
The camera employed in this application is the Photonfocus DR1-D2048x1088C-192-G2-8 high-speed color camera. Used in line mode, the camera captures an 896x100-pixel region of interest (ROI) at an astounding 1800 frames/sec. The corks pass through two imaging stations that employ structured light generated by an Effilux LED 3D projector. The LED projector’s bright white line appears across the cork stopper within the camera’s ROI and the bright illumination ensures color fidelity. Color is an important sorting criterion.
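A little arithmetic shows how much data this camera setup produces. The Python sketch below uses the ROI, frame rate, and sorting rate quoted in this post; the 8 bits per pixel is an assumption made for illustration (the article does not state the pixel depth), and the frames-per-cork figure assumes corks arrive evenly spaced.

```python
ROI_WIDTH = 896
ROI_HEIGHT = 100
FRAMES_PER_SECOND = 1800
CORKS_PER_HOUR = 25_000
ASSUMED_BITS_PER_PIXEL = 8    # assumption: not stated in the article

pixels_per_second = ROI_WIDTH * ROI_HEIGHT * FRAMES_PER_SECOND
raw_bits_per_second = pixels_per_second * ASSUMED_BITS_PER_PIXEL

# Frames captured per cork at one imaging station, for evenly spaced corks.
frames_per_cork = FRAMES_PER_SECOND * 3600 / CORKS_PER_HOUR

print(f"Pixel rate: {pixels_per_second / 1e6:.2f} Mpixels/sec")
print(f"Raw data rate at 8 bits/pixel: {raw_bits_per_second / 1e9:.2f} Gbps")
print(f"Frames per cork per station: {frames_per_cork:.1f}")
```

Under that pixel-depth assumption, the raw stream is larger than a single 1Gbps Ethernet link can carry, which gives some feel for why a faster-than-standard GigE camera is useful here.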
The Photonfocus DR1 double-rate GigE camera family is based on the Xilinx Spartan-6 FPGA
Corks pass through two such inspection stations. The first imaging station performs a 3D analysis of the cork stoppers using an attached PC running software written in C++ that assembles 3D images from the captured frames as the corks rapidly pass by on a conveyer. Then the corks are flipped before passing through a second imaging station that inspects the stoppers’ reverse side. The system looks for defects in the stoppers including unacceptably large holes, superficial deformations, incorrect size, and color imperfections. The system uses this information to grade and sort the good stoppers and to discard rejects.
Numerous cork manufacturers across Europe have already adopted this high-speed inspection system.
Want to see the M3C Industrial Automation and Vision cork-inspection system in action? Thought so. Here’s Andrew Wilson’s video:
The only way this system could be better would be for it to be inspecting chocolate-chip cookies, and I get the rejects.
Other Xcell Daily blog posts about Photonfocus video cameras based on the Spartan-6 FPGA include:
Although the Xilinx Spartan-6 FPGA family is now more than half a decade old, it continues to demonstrate real value as a cost-effective foundation for many new video and vision platforms.
The video in this post on the Lightreading.com Web site shows Napatech’s Dan Joe Barry discussing the acceleration provided by his company’s NFV NIC. Briefly, the Napatech NFV NIC reduces CPU loading for NFV applications by a factor of more than 7x relative to conventional Ethernet NICs. That allows one CPU to do the work of nearly eight CPUs, resulting in far lower power consumption for the NFV functions. In addition, the Napatech NFV NIC bumps NIC throughput in NFV applications from the 8Mpackets/sec attainable with conventional Ethernet NICs to the full theoretical throughput of 60Mpackets/sec. The dual-port NFV NIC is designed to support multiple data rates including 8x1Gbps, 4x10Gbps, 8x25Gbps, 2x40Gbps, 2x50Gbps, and 2x100Gbps. All that’s required to upgrade the data rate is downloading a new FPGA image with the correct data rate to the NFV NIC. This allows the same NIC to be used in multiple locations in the network, reducing the variety of products and easing maintenance and operations.
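The post doesn’t say which port configuration the 60Mpackets/sec figure refers to. One plausible reading, and it is only an assumption, is 40Gbps of aggregate traffic made up of minimum-size Ethernet frames. The Python sketch below shows the standard line-rate calculation: a 64-byte frame occupies 84 bytes on the wire once the 8-byte preamble and 12-byte inter-frame gap are included.

```python
PREAMBLE_BYTES = 8            # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12
MIN_FRAME_BYTES = 64


def max_packets_per_second(link_bps: float,
                           frame_bytes: int = MIN_FRAME_BYTES) -> float:
    """Theoretical packet rate for back-to-back frames of a given size."""
    wire_bits = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits


for gbps in (10, 40):
    mpps = max_packets_per_second(gbps * 1e9) / 1e6
    print(f"{gbps}Gbps of 64-byte frames: {mpps:.2f} Mpackets/sec")
```

Larger frames lower the packet rate for the same link speed, so minimum-size frames are the worst case for per-packet processing.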
These are substantial benefits in an application where performance/watt is really critical. Further, the Napatech NFV NIC can “extend the lifetime of the NFV NIC and server hardware by allowing capacity, features and capabilities to be extended in line with data growth and new industry solution standards and demands.” The NFV functions implemented by the Napatech NFV NIC can be altered on the fly. Bottom line: the Napatech NFV NIC improves data-center performance and can actually help data-center operators postpone forklift upgrades, which saves even more money and reduces TCO (total cost of ownership).
Napatech NFV NIC
A quick look at the data sheet for the Napatech NFV NIC on the Napatech Web site confirmed my suspicions about where a lot of this goodness comes from: the card is based on a Xilinx UltraScale FPGA and “can be programmed and re-configured on-the-fly to support specific acceleration functionality. Specific acceleration solutions are delivered as FPGA images that can be downloaded to the NFV NIC to support the given application.”
Oh, by the way, the Napatech Web site says that 40x performance improvements are possible.
Xilinx has joined the non-profit Open Networking Lab (ON.Lab) as a collaborating member of the CORD Project—Central Office Re-architected as a Datacenter, “which combines NFV, SDN, and the elasticity of commodity clouds to bring datacenter economics and cloud agility to the Telco Central Office”—along with a rather long list of major telecom CORD partners including:
CORD aims to produce reference implementations for the industry built using commodity servers, white-box switches, disaggregated access technologies (including vOLT, vBBU, vDOCSIS), and open-source software (including OpenStack, ONOS, XOS) for the residential, enterprise, and mobile markets (R-CORD, E-CORD, and M-CORD).
Xilinx joined with the intent of becoming actively engaged in the CORD Project and has contributed a proposal for FPGA-based Acceleration-as-a-Service for cloud servers and virtualized RAN servers in the M-CORD activity focused on the mobile market. The CORD Technical Steering Team has already reviewed and approved this proposal.
The Xilinx proposal for FPGA-based Acceleration-as-a-Service is based on the company’s UltraScale and UltraScale+ All Programmable devices used, for example, to implement flexible SmartNICs (network interface cards) and employs the partial-reconfiguration capabilities of these devices to allow SDN and NFV operating systems to discover and dynamically allocate FPGA resources to accelerate various functions and services on demand. This proposal will allow SDN and NFV equipment to exploit the superior performance/watt capabilities of hardware-programmable devices in myriad application-processing scenarios.
Yesterday, PLDA announced several products for the PCIe 4.0 standard including a controller IP core, PHY IP, a platform development kit, and training. As PLDA points out in the video below, you have several options when developing with PCIe 4.0, but only one is attractive at the moment. You can design and manufacture a PCIe 4.0 test chip, but that’s risky and expensive. You can wait for a motherboard based on PCIe 4.0, but you will lose valuable development time. Or, you can develop designs now using an FPGA, just as PLDA has done.
PLDA PCIe 4.0 Platform Development Kit based on a Xilinx Virtex UltraScale VU065 FPGA
In PLDA’s case, the company has implemented its PCIe 4.0 IP in a pair of boards based on Xilinx Virtex UltraScale VU065 FPGAs. The boards communicate with a motherboard using PCIe 3.0 but they communicate with each other at 16Gtransfers/sec using the PCIe 4.0 protocol over Samtec Firefly Micro Flyover optical connections, as shown in the video below:
For more information about PLDA’s FPGA-based implementation of the PCIe 4.0 standard, see “PLDA shows working PCIe 4.0 Platform Development Kit operating @ 16G transfers/sec at today’s PCI-SIG Developer’s Conference.”
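To translate 16Gtransfers/sec into usable bandwidth, you need to account for line coding. The Python sketch below assumes the 128b/130b encoding used by PCIe 3.0 and 4.0 and ignores packet and protocol overhead, so the results are upper bounds. The 8Gtransfers/sec figure for PCIe 3.0 comes from the PCIe specifications rather than from this post, and the lane counts are illustrative because the post doesn’t give the link width PLDA used.

```python
ENCODING_EFFICIENCY = 128 / 130   # 128b/130b line coding


def lane_bandwidth_gbytes(gtransfers_per_sec: float) -> float:
    """Payload bandwidth of one lane, one direction, in GBytes/sec."""
    return gtransfers_per_sec * ENCODING_EFFICIENCY / 8


def link_bandwidth_gbytes(gtransfers_per_sec: float, lanes: int) -> float:
    """Payload bandwidth of a multi-lane link, one direction."""
    return lane_bandwidth_gbytes(gtransfers_per_sec) * lanes


for name, rate in (("PCIe 3.0", 8), ("PCIe 4.0", 16)):
    per_lane = lane_bandwidth_gbytes(rate)
    x16 = link_bandwidth_gbytes(rate, 16)
    print(f"{name}: {per_lane:.3f} GBytes/sec per lane, "
          f"{x16:.1f} GBytes/sec for x16")
```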
Dave Embedded Systems has published a detailed technical White Paper titled “Real-timeness, system integrity and TrustZone technology on AMP configuration” about the company’s use of ARM’s TrustZone technology for real-time embedded systems, implemented in Xilinx Zynq-7000 SoCs on the company’s BORA and BORA Xpress cards. The White Paper discusses the implementation of an AMP (asymmetric multi-processing) design that runs Linux on one of the processor cores and FreeRTOS (within a secure, trusted environment) on the other processor core within the Zynq SoC’s dual-core ARM Cortex-A9 MPCore processor system.
Dave Embedded Systems BORA Xpress Card
The White Paper discusses system memory partitioning, the AMP boot process, communications between the trusted and the non-trusted worlds, cache management, and the use of on-chip memory to improve system performance.
One thing of particular note that you might otherwise miss: there are 15 references at the end of the White Paper to extend your knowledge with respect to the use of trusted worlds in real-time embedded systems.
SS-OCT (swept-source optical coherence tomography) employs a short-cavity, swept laser to produce high-resolution scans of the inside of a human eye using multiple light wavelengths. This relatively new medical diagnostic tool is useful for identifying a variety of problems related to eye diseases such as glaucoma. SP Devices, an e2v company, has launched the ADQ14OCT 14-bit, 8Gsamples/sec digitizer specifically for SS-OCT systems that offers “unrivalled noise and distortion performance.” Part of the digitizer’s performance derives from the use of a Xilinx Kintex-7 K325T FPGA to perform the required high-speed signal processing on the digitized input waveforms from the SS-OCT systems’ CMOS cameras, including FFTs with as many as 32k points. The FPGA also enables relatively easy customization of the ADQ14OCT digitizer’s signal-processing functions according to application needs.
Two versions of the digitizer are available: as a USB 3.0 module and as a PCIe card.
ADQ14OCT 14-bit, 8Gsamples/sec digitizer, USB 3.0 version
ADQ14OCT 14-bit, 8Gsamples/sec digitizer, PCIe version
Twenty-five years ago, Sundance Multiprocessor Technology’s Managing Director Flemming Christensen saw the benefit of combining software-driven processors and programmable logic for embedded system design. He’s been developing and marketing products that combine these two processing technologies ever since, getting closer and closer to his goal of allowing software developers to take better advantage of the significant performance boost available from FPGA technology. With the combination of Sundance’s new EMC²-Z7015 PCIe/104 OneBank SBC based on the Xilinx Zynq Z-7015 SoC and Xilinx’s SDSoC Development Environment, Christensen feels his vision is now closer to reality than ever.
Sundance Multiprocessor Technology’s EMC²-Z7015 PCIe/104 OneBank SBC
In a recent article on the New Electronics Web site about the European-funded EMC² project (Embedded Multi-Core systems for Mixed Criticality applications in dynamic and changeable real-time environments) and Sundance’s EMC²-Z7015 PCIe/104 OneBank SBC, author Graham Pitcher writes:
“Because the [EMC²-Z7015] platform which Sundance has developed is based on a Xilinx Zynq device, the concept of C to VHDL can be broadened to encompass C to FPGA, bringing the potential of reconfigurable hardware. And a recent Xilinx development – SDSoC – has made this task easier.”
The Sundance EMC²-Z7015 SBC represents the current pinnacle of Christensen’s vision of an SBC that easily combines software-driven processing with programmable logic for embedded applications. His first design, the HARP-2 from the early 1990s, combined a 32-bit Inmos Transputer with a Xilinx XC3195A, a member of the early-generation Xilinx XC3100A FPGA family. You could program the Transputer using Handel-C but you still needed to understand FPGAs and HDLs to make full use of the on-board FPGA.
Flemming Christensen's original HARP-2 Transputer-FPGA Board with a Xilinx XC3195A FPGA
SDSoC vastly simplifies the processor/FPGA programming task by allowing you to program the Zynq SoC’s dual-core ARM Cortex-A9 MPCore processor and its on-chip programmable logic using C, C++, or SystemC. SDSoC can target any board with an appropriate BSP (Board Support Package) and there is indeed an SDSoC BSP available for the Sundance EMC²-Z7015 PCIe/104 OneBank SBC. (More information about the software support for the Sundance EMC²-Z7015 is available here.)
Here’s a very short video of a Sundance EMC²-Z7015 SBC running a Sobel filter algorithm in the Zynq SoC’s programmable logic, developed using SDSoC:
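If you haven’t met the Sobel filter before, here’s a minimal reference model in plain Python. It is not the SDSoC code from the demo (SDSoC designs are written in C or C++); it only illustrates the algorithm, using the common |Gx| + |Gy| approximation of gradient magnitude, which is an implementation choice made for this sketch.

```python
from typing import List

Image = List[List[int]]

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
KERNEL_X = [[-1, 0, 1],
            [-2, 0, 2],
            [-1, 0, 1]]
KERNEL_Y = [[-1, -2, -1],
            [0, 0, 0],
            [1, 2, 1]]


def sobel(image: Image) -> Image:
    """Return the edge-strength image; border pixels are left at zero."""
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for j in range(3):
                for i in range(3):
                    pixel = image[y + j - 1][x + i - 1]
                    gx += KERNEL_X[j][i] * pixel
                    gy += KERNEL_Y[j][i] * pixel
            # Approximate the gradient magnitude and clamp to 8 bits.
            output[y][x] = min(255, abs(gx) + abs(gy))
    return output


# A 4x4 test image with a vertical edge between columns 1 and 2.
edge_image = [[0, 0, 255, 255] for _ in range(4)]
for row in sobel(edge_image):
    print(row)
```

The nested multiply-accumulate loops over a small fixed window are exactly the kind of regular, parallel work that maps well onto programmable logic.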
By K R Ranjith and Deepak Shankar, Mirabilis Design
The Zynq SoC’s unique combination of a dual-core ARM Cortex-A9 MPCore processor, many embedded peripherals and I/O controllers, and programmable logic makes application task partitioning between the software-driven processor cores and the programmable logic a major challenge. System-architecture decisions that are difficult to make without dynamic exploration include the selection of tasks requiring hardware acceleration, use of the Zynq SoC’s local memory as cache or exclusively for use by the programmable logic, memory bandwidth allocation, and the design of communications between the ARM processor and the programmable logic. We at Mirabilis Design have developed a virtual platform for the Zynq-7000 Programmable SoC to help you answer these complex questions.
This platform, called VisualSim, is a heterogeneous modeling, simulation and analysis environment that uses a combination of cycle- and timing-accurate models for simulating the performance and power consumption of the Zynq SoC’s internal components. You can use the VisualSim platform for early architecture exploration. VisualSim uses state-based dynamic power measurement during the simulation, which allows you to architect applications with power in mind.
The following examples of an HD video-processing application use a mix of software and programmable logic. We used the following methodology to develop the HD video application:
The proposed HD video platform must process at least 10K macro blocks in 20 msec. System power should not exceed 3W. The system architecture must be defined to meet these requirements. We simulated two different system designs:
We used the pre-configured and customizable VisualSim Xilinx Zynq-7000 All Programmable SoC platform, shown in Figure 1 below, for these experiments. This platform consists of hardware architectural elements such as the dual-core ARM Cortex-A9 processor; SDRAM and Flash memory controllers; DMA controllers; peripherals and I/O controllers including CAN, Ethernet, and USB; hardware timers; and a generous chunk of Xilinx 7-series FPGA. The VisualSim Zynq-7000 Simulation Environment generates performance analysis reports including application latency, device throughput, processor performance, and system-level power consumption (average power, instant power and battery consumption).
Figure 1: VisualSim Zynq 7000 Programmable SoC Template
In the above figure, we defined the HD video application as a task-flow diagram. Figure 2 shows the “Behavior Flows” hierarchical block in detail. In addition to the HD Video application itself, the system design includes additional housekeeping tasks defined as background processes. Figure 2 also shows a parameters block with the attributes of the Video Post-Processing task.
Figure 2: VisualSim Zynq 7000 User Application Behavior Flow
Each blue behavior block in Figure 2 represents a specific application task. Each behavior block includes a parameters list (data size, priority, destination processing resource, and task name). VisualSim allows you to easily modify task mapping during architecture exploration. You can map tasks to either the Zynq SoC’s processor or to programmable logic to achieve the desired performance and power consumption. VisualSim executes a task’s actual instruction sequence or a synthetic generated trace to accurately emulate task execution on the ARM Cortex-A9 MPCore processor.
You use the VisualSim Power Modeling Toolkit to model system power consumption. The toolkit lets designers and architects capture the dynamic power of the entire system in a model, so you can trade off performance and power using a single architectural model. Each standard device or component model can have as many as four power states: standby (leakage), idle, active, and wait, plus a transition cycle time. As many as twelve power states can be defined for the Zynq SoC’s programmable logic. When a device’s operating state changes, its power level moves to the level defined for the new state. The change takes effect after a delay, called a transition cycle, which improves simulation accuracy.
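The state-based power accounting described above can be sketched in a few lines of C++. This is only an illustration of the idea, not VisualSim code: the class name, the per-state power values you pass in, and the choice to keep charging the old state's power during the transition delay are all assumptions made for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// The four power states named in the text. Any power levels used with them
// are illustrative placeholders, not VisualSim or Zynq data.
enum class PowerState : std::size_t { Standby = 0, Idle, Active, Wait };

class PowerModel {
public:
    // wattsPerState: power drawn in each state (W), indexed by PowerState.
    // transitionTime: delay (s) before a requested state change takes effect.
    PowerModel(const std::array<double, 4>& wattsPerState, double transitionTime)
        : watts_(wattsPerState), transition_(transitionTime) {
        assert(transitionTime >= 0.0);
    }

    // Request a state change at time `now` (s). Assumption: the device keeps
    // drawing the old state's power until the transition delay has elapsed.
    void requestState(PowerState next, double now) {
        assert(now >= lastTime_);
        const double switchTime = now + transition_;
        energy_ += watts_[index(state_)] * (switchTime - lastTime_);
        lastTime_ = switchTime;
        state_ = next;
    }

    // Average power (W) from time 0 up to `now`.
    double averagePower(double now) const {
        assert(now >= lastTime_ && now > 0.0);
        const double total = energy_ + watts_[index(state_)] * (now - lastTime_);
        return total / now;
    }

private:
    static std::size_t index(PowerState s) { return static_cast<std::size_t>(s); }

    std::array<double, 4> watts_;
    double transition_;
    PowerState state_ = PowerState::Standby;
    double lastTime_ = 0.0;
    double energy_ = 0.0;  // joules accumulated so far
};
```

Each call to requestState closes out the energy spent in the previous state, so averagePower can be read at any later time without replaying the event history.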
We constructed the HD video application’s behavior task model and hardware mapping in 10 man-hours and then ran the simulation on a 2.6GHz Microsoft Windows 10 platform with 4GBytes of RAM. VisualSim simulated 800μsec of system time using 46 seconds of wall-clock time.
System analysis focused on three aspects: system attribute settings, mapping of tasks to processor and programmable logic, and power management. We considered two use cases to explore system performance and power consumption:
For the initial design exploration, we mapped all of the image-processing tasks onto the Zynq SoC’s dual-core ARM Cortex-A9 MPCore processor as shown in Figure 3.
Figure 3: Mapping of the HD Video behavior onto the dual-core ARM Cortex-A9 MPCore processor
We set the target destination of the behavior task blocks in Figure 2 to “any_core,” instructing VisualSim to map tasks to either Core 1 or Core 2. The system dispatcher, which is part of the operating system, then makes the final allocation decision at run time.
Simulation results generated include average power consumption of the complete system, the number of processed macro blocks, individual resource power consumption, and task latency. A plot of the average power consumption appears in Figure 4 and a text display with task latency information appears in Figure 5.
Figure 4: System Average Power Consumption and Number of Macro Block Processing
Figure 5: Task Latency Reports
Simulation results show that the power consumption is less than 3W when all applications are mapped onto the Zynq SoC’s dual-core ARM Cortex-A9 MPCore processor, which meets our power-consumption goal. However, the performance is inadequate. The design goal is to process 10K macro blocks but only 1643 macro blocks were processed in the allotted 20msec. From the hardware platform statistics, we can see that the “Rotate_Frame” task consumes the most CPU cycles, so accelerating this task using programmable logic is a likely step towards achieving the performance goal.
In this case, we moved the “Rotate_Frame” task to a hardware accelerator constructed from the Zynq SoC’s programmable logic. We use basic VisualSim modeling libraries to model the accelerator, as shown in Figure 6.
Figure 6: Hardware accelerator using the Zynq SoC’s programmable logic
First, we modeled the hardware accelerator using the VisualSim Zynq 7000 Platform as shown in Figure 6. Then we simply changed the mapping parameters of the “Rotate_Frame” task to use the hardware accelerator by modifying the “Select_Partitioning” parameter from “SW” to “HW” as shown in Figure 7.
Figure 7: Rotate_Frame task mapping set to programmable logic
The reports of average system power consumption and number of macro blocks processed for Case 2 appear in Figure 8.
Figure 8: Average Power Consumption and Number of Macro Block Processing
The average system power consumption plot shows that the power consumption now slightly exceeds the 3W goal but performance has improved significantly, to 11700 macro blocks, which exceeds our 10K goal. Looking at the instantaneous power consumption plot generated for the Rotate_Frame hardware accelerator, shown in Figure 9, we see that the hardware accelerator logic is always active. We can lower the accelerator’s power consumption by gating the power using a finite state machine.
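As a rough illustration of gating an accelerator's power with a finite state machine, here is a minimal C++ sketch. It is not the VisualSim model used in this study: the three states, the wake-up delay, and the idle limit are hypothetical choices made only to show the mechanism.

```cpp
#include <cassert>

// Hypothetical power-gating controller, stepped once per clock cycle.
// State names, wake-up delay, and idle limit are illustrative assumptions.
class PowerGateFsm {
public:
    enum class State { Gated, Waking, Active };

    // wakeCycles: cycles needed to restore power after work arrives.
    // idleLimit: consecutive idle cycles tolerated before gating again.
    PowerGateFsm(int wakeCycles, int idleLimit)
        : wakeCycles_(wakeCycles), idleLimit_(idleLimit) {
        assert(wakeCycles >= 1 && idleLimit >= 1);
    }

    // Advance one cycle; returns true when the accelerator does useful work.
    bool step(bool workPending) {
        switch (state_) {
        case State::Gated:
            if (workPending) {          // work arrived: start restoring power
                state_ = State::Waking;
                counter_ = wakeCycles_;
            }
            return false;
        case State::Waking:
            if (--counter_ == 0) {      // power is stable again
                state_ = State::Active;
            }
            return false;
        case State::Active:
            if (workPending) {
                counter_ = 0;           // reset the idle counter
                return true;
            }
            if (++counter_ >= idleLimit_) {
                state_ = State::Gated;  // idle long enough: gate the power
                counter_ = 0;
            }
            return false;
        }
        return false;  // not reached; keeps compilers quiet
    }

    State state() const { return state_; }
    bool powered() const { return state_ != State::Gated; }

private:
    int wakeCycles_;
    int idleLimit_;
    int counter_ = 0;  // wake countdown or idle count, depending on state
    State state_ = State::Gated;
};
```

The cycles spent in the Waking state do no useful work, which is why a gating scheme of this kind trades some throughput for lower power.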
Figure 9: Instantaneous Power Consumption of Image Rotate Function
The results after introducing gated power appear in Figure 10:
Figure 10: Average system power consumption and instantaneous power of Rotate_Frame Function
Introducing power gating reduces performance by 6% but still meets our performance requirement of processing 10K macro blocks in the allotted time while the total system power consumption is now 2.6W, well below our 3W power goal.
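The pass/fail reasoning above is simple arithmetic, and a small C++ helper makes it explicit. The function and type names are hypothetical, and the sketch assumes that the 6% performance reduction applies directly to the 11700 macro-block count reported for the accelerated design.

```cpp
#include <cassert>

// Requirements quoted in the text: at least 10K macro blocks per 20 msec
// window and no more than 3 W of system power.
constexpr double kRequiredMacroBlocks = 10000.0;
constexpr double kMaxPowerWatts = 3.0;

struct DesignPoint {
    double macroBlocks;    // macro blocks processed in the 20 msec window
    double avgPowerWatts;  // average system power
};

// Throughput left after losing a fraction of performance (0.06 means 6%).
double throughputAfterLoss(double macroBlocks, double lossFraction) {
    assert(lossFraction >= 0.0 && lossFraction <= 1.0);
    return macroBlocks * (1.0 - lossFraction);
}

// True when a design point satisfies both the throughput and power goals.
bool meetsRequirements(const DesignPoint& d) {
    assert(d.macroBlocks >= 0.0 && d.avgPowerWatts >= 0.0);
    return d.macroBlocks >= kRequiredMacroBlocks &&
           d.avgPowerWatts <= kMaxPowerWatts;
}
```

Under that assumption, throughputAfterLoss(11700, 0.06) yields roughly 10998 macro blocks, which together with 2.6W passes meetsRequirements. The software-only result of 1643 macro blocks fails on throughput whatever its power.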
In our studies, we found that stall-time, cache-hit-ratio, and task-latency reports helped us determine the behavior tasks best targeted for hardware acceleration. Also, the block-level power information provided visibility into management algorithms that can be deployed to reduce total power consumption. None of this analysis would have been possible with a prototyping board but it is easily handled by the VisualSim platform.
By Adam Taylor
With the core written, we want to be able to test it, initially using a C simulation to prove that it does what we desire and then again using co-simulation against the synthesised HDL to verify that the HDL functions as required. We can use the same test bench for both the C and HDL testing using the HLS environment. Within the test bench, we need to perform the following steps:
This approach allows us to take in an image file, apply a Gaussian filter to it, and then save it to a file for later examination. We convert from RGB to YUV because it is common to process images in this color space and because the example we will eventually build from this preparatory work using the EVK employs this color space, so the input image must use it too.
The color space we will be using is YUV 4:2:2, which means that the pixel value can be represented in 2 bytes. If we were to use the OpenCV cvtColor function, it would return a YUV 4:4:4 image. Therefore we need a specific routine to subsample the image, converting it from 4:4:4 to 4:2:2.
Once we have generated both the function and the test bench file, the next step is to use C simulation to ensure that the function performs as desired. If it does not, we can quickly and easily modify the test bench or the source code and re-simulate until we obtain the desired results.
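A subsampling routine of the kind described above might look like the following minimal sketch. It is not the routine from the accompanying Github code. It assumes an interleaved Y, U, V input buffer, a packed Y0 U Y1 V output order, and chroma averaged over each horizontal pixel pair. Other 4:2:2 layouts and chroma filters are equally valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert interleaved YUV 4:4:4 (3 bytes per pixel) to packed YUV 4:2:2
// (2 bytes per pixel) in Y0 U Y1 V order. The byte order and the chroma
// averaging are assumptions of this sketch.
std::vector<std::uint8_t> yuv444ToYuv422(const std::vector<std::uint8_t>& src,
                                         std::size_t width, std::size_t height) {
    assert(width % 2 == 0);                    // pixels are processed in pairs
    assert(src.size() == width * height * 3);

    std::vector<std::uint8_t> dst;
    dst.reserve(width * height * 2);

    // Rows are contiguous and the width is even, so stepping through the
    // buffer 6 bytes (2 pixels) at a time never pairs pixels across rows.
    for (std::size_t i = 0; i < src.size(); i += 6) {
        const unsigned u = (src[i + 1] + src[i + 4] + 1u) / 2u;  // rounded average
        const unsigned v = (src[i + 2] + src[i + 5] + 1u) / 2u;
        dst.push_back(src[i]);                        // Y0
        dst.push_back(static_cast<std::uint8_t>(u));  // shared U
        dst.push_back(src[i + 3]);                    // Y1
        dst.push_back(static_cast<std::uint8_t>(v));  // shared V
    }
    return dst;
}
```

Each pair of pixels keeps both luma samples but shares one U and one V sample, which is what brings the average down to 2 bytes per pixel.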
Once assured that we’ve properly defined the desired function at the C level, we then synthesise the function using Vivado HLS to produce HDL. As part of the synthesis process, Vivado HLS estimates the resource utilization. For this example, where we are targeting the Zynq Z-7020 SoC, the estimate showed that the following resources would be required:
We now wish to perform co-simulation with the synthesized output. This step allows us to use the C test bench to stimulate the generated HDL using an HDL simulation tool. The images below show the original input image and the resultant output image.
Co-Simulation Gaussian Blur 3x3
Once the co-simulation is complete, we can examine the resultant output image and, if we wish, the simulation waveforms. There is also a status report on the co-simulation, which not only reports the pass/fail status but also the function latency. We can compare this latency from co-simulation against the synthesis report, which also contains the expected latency. However, the expected latency shown in the synthesis report is based on handling the maximum row and column sizes while the latency estimate in the simulation results is based on the actual image size passed to it.
Once we are happy with the co-simulation and know that the function works as intended, the final step is to export the module into our Vivado IP library and insert the new IP module into our image-processing chain. We can achieve this very easily by using the Export RTL option and completing the configuration options as desired, as shown below:
This will package the IP. You will find the packaged IP core and a Zip file of the IP core within your project’s solutions directory.
We can then open our Vivado design and import the IP Core we just created from the IP Catalog. However, to do this we first need to create an IP Repository within our project using the project settings dialog on the IP tab:
Creation of the IP repository with this example highlighted
After we create the IP Repository, it will not contain any IP Cores. We need to add cores to it using the IP Catalog. Within the IP Catalog, you should see the Repository that we just created. Right-click on the Repository and then select the Add IP to Repository option.
This will open a dialog box so that we can select the IP Core we wish to add to the repository. We can select either the component.xml or the zipped archive. When this is complete, you will see the IP core located within the Repository, ready for use in a block diagram.
Having now shown how we can quickly and easily get image processing functions up and running, I will now start to look at how we can get image data into and out of our system in the next blog post.
Incidentally, I am attending the Embedded Systems Conference in Minneapolis next week and giving several talks including one on High Level Synthesis. If you are attending, please come by and say hello.
Code is available on Github as always.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
All of Adam Taylor’s MicroZed Chronicles are cataloged here.
Xilinx had a table in Maker’s Alley at the 8th Annual Sparkfun Autonomous Vehicle Competition (AVC), held today in Niwot, Colorado near Boulder. AEs and software engineers from the nearby Xilinx Longmont facility staffed the table along with Aaron Behman and myself. We answered many questions and demonstrated an optical-flow algorithm running on a Zynq-based ZC706 Eval Kit. The demo accepted HDMI video from a camcorder, converted the live HD video stream to greyscale, extracted motion information on a frame-by-frame basis, and displayed the motion on a video monitor using color-coding to express the direction and magnitude of the motion, all in real time. We also gave out 50 Xilinx SDSoC licenses and awarded five Zynq-based ZYBO kits to lucky winners. Digilent supplied the kits. (See “About those Zynq-based Zybo boards we're giving away at Sparkfun’s Autonomous Vehicle Competition: They’re kits now!”)
The Xilinx table in Maker’s Alley at Sparkfun AVC 2016
In case you are not familiar with the Sparkfun AVC, it’s an autonomous vehicle competition and this year there were two classes of autonomous vehicle: Classic and Power Racing. Classic-class vehicles were about the size of an R/C car and raced on an appropriately sized track with hazards including the Discombobulator (a gasoline-powered turntable), a ball pit, hairpin turns, and an optional dirt-track shortcut. The Power Racing class was based on kids’ Power Wheels vehicles, which are sized to be driven by young kids but in this race were required to carry adults. There were races for both autonomous and human-driven Power Racers.
Here’s a video of one of the Sparkfun AVC Classic races getting off to a particularly rocky start:
Here’s a short video of an Autonomous Power Racing race, getting off to an equally disastrous start:
And here’s a long video of an entire, 30-lap, human-driven Power Racing race:
Analog Devices (ADI) introduced the AD9371 Integrated, Dual Wideband RF Transceiver back in May as part of its “RadioVerse.” You use the AD9371 for building extremely flexible, digital radios with operating frequencies of 300MHz to 6GHz, which covers most of the licensed and unlicensed cellular bands. The IC supports receiver bandwidths to 100MHz. It also supports observation receiver and transmit synthesis bandwidths to 250MHz, which you can use to implement digital correction algorithms.
Last week, the company started shipping FMC eval cards based on the AD9371: the ADRV9371-N/PCBZ and ADRV9371-W/PCBZ.
ADRV9371-N Eval Board for the Analog Devices AD9371 Integrated Wideband RF Transceiver
ADI was showing one of these new AD9371 Eval Boards in operation this week at the GNU Radio Conference held in Boulder, Colorado. The board was plugged into the FMC connector on a Xilinx ZC706 Eval Kit, which is based on a Xilinx Zynq Z7045 SoC. The Xilinx Zynq SoC and the AD9371 make an extremely powerful design combination for developing all sorts of SDRs (software-defined radios).
The VITA49 Radio Transport standard defines digitized data formats and metadata formats to create an interoperability framework for SDRs (software-defined radios) from different manufacturers. Epiq Solutions’ 4-channel Quadratiq RF receiver supports a unidirectional VITA49 UDP data stream with its four receiver paths and dual 10GbE interface ports.
Epiq Solutions 4-channel Quadratiq VITA49 RF receiver
The Quadratiq receiver is based on a Xilinx Zynq Z-7030 SoC (or an optional Zynq Z-7045 SoC). Here’s a block diagram:
As you can see from the block diagram, the digital part of the Quadratiq’s design fits entirely into the Zynq SoC, with companion RAM and an SD card to store processor code and FPGA configuration. The Zynq SoC provides the processors, implements the proprietary digital IP, and implements the system’s digital I/O. This sort of system design is increasingly common when using the Zynq SoC in embedded applications like the Quadratiq RF receiver. Add an RF card, a precise clock, and a power supply and you’re good to go. The entire system consumes a mere 18W.
There are all sorts of really remote applications needing direct satellite communications including maritime comms, SCADA systems, UAVs, M2M, and IoT. The AHA Products Group in Moscow, Idaho previewed its tiny CM1 compact SatCom modem yesterday at the GNU Radio Conference in Boulder, Colorado. How tiny? It measures 55x100mm and here’s a photo of the board with a US 25-cent piece for size comparison:
AHA’s CM1 DVB-S2X SatCom Modem based on a Xilinx Zynq SoC
In case you’ve not heard of them (I hadn’t), AHA Products Group develops IP cores, boards, and ASICs specifically for communications systems applications. AHA’s specialties are FEC (forward error correction) and lossless data compression. The company had developed this DVB-S2X modem IP for a specific customer and hosted its IP on an Ettus Research USRP X310 SDR (software-defined radio), which is based on a Xilinx Kintex-7 410T FPGA. The next obvious step was to reduce the cost of the modem and its size, weight, and power consumption for volume-production applications by designing a purpose-built board. AHA was able to take the developed IP and drop it into an appropriate Xilinx Zynq-7000 SoC, which soaks up the IP and provides the Gigabit Ethernet and USB ports as well. The unmarked device in the middle of the board in the above photo is a Zynq SoC.
The AHA CM1 board clearly illustrates how well the Zynq SoC family suits high-performance embedded-processing applications. Add some DRAM and EPROM and you’ve got a compact embedded system with high-performance ARM Cortex-A9 MPCore processors and programmable logic that delivers processing speeds not attainable with software-driven processors alone. In this case, AHA needs that programmable logic to implement the 200Mbits/sec modem IP.
The CM1 SatCom modem board is in development and AHA plans to introduce it early in 2017.
Not to be outdone by DARPA (see “DARPA wants you to win $2M in its Grand Spectrum Collaboration Challenge. Yes, lots of FPGAs are involved”), Matt Ettus—founder of Ettus Research and now a Distinguished Engineer at National Instruments, which purchased Ettus Research and now runs it as a separate division—announced a contest of his own at yesterday’s GNU Radio Conference in Boulder, CO. Ettus Research has developed a product called RFNoC (RF Network on Chip), which “is designed to allow you to efficiently harness the full power of the latest generations of FPGAs [for software-defined radio (SDR) applications] without being an expert firmware developer.” Already popular in the SDR community, the GUI-based RFNoC design tool allows you to “create FPGA applications as easily as you can create GNU Radio flowgraphs.” This includes the ability to seamlessly transfer data between your host PC and an FPGA. It dramatically eases the task of FPGA off-loading in SDR applications.
Here is an example of an RFNoC flowgraph built using the GNU Radio Companion. With four blocks, data is being generated on the host, off-loaded to the FPGA for filtering, and then brought back to the host for plotting:
Ettus’ challenge is called the “RFNoC & Vivado HLS Challenge.” How did Xilinx Vivado HLS get into the contest title? You can use Vivado HLS to develop function blocks for Ettus’ RFNoC in C, C++, or SystemC because Ettus Research bases its USRP SDR products on Xilinx All Programmable devices. At Tuesday’s morning session of the GNU Radio conference, Matt Ettus said that he’s using Vivado HLS himself and considers it a very powerful tool for developing SDR function blocks. He sees this new competition as a great way to rapidly add functional blocks to the RFNoC function library. I’d say he’s right, on both counts.
Here’s how the RFNoC & Vivado HLS Challenge works:
“The competition will take place during the proceedings of the 2017 Virginia Tech Symposium. On the day of the competition, accepted teams will give a presentation and show a demo to a panel of judges made up of representatives from Ettus Research and Xilinx. All teams will be required to send at least one representative to the competition for the presentation. The winners will be announced during the symposium on the conclusion of judging.”
Here are the prizes:
Although the prizes for this competition are somewhat more modest than the $2M first prize in the DARPA competition, the bar’s a whole lot lower.
Contest proposals are due December 31, 2016. More details here.
On the first technical day of the GNU Radio Conference in the Glenn Miller Ballroom on the CU Boulder Campus, DARPA Program Manager Paul Tilghman laid out the latest of the DARPA Grand Challenges: the Spectrum Collaboration Challenge (SC2). DARPA’s SC2 is “an open competition to develop radio networks that can thrive in the existing spectrum without allocations and learn how to adapt across multiple degrees of freedom, collaboratively optimizing the total spectrum capacity moment-to-moment.” DARPA is dangling a $2M top prize ($1M for the runner-up, $750K for third place) to the team that does the best job of meeting this challenge over the next three years. You have about six weeks to sign up for Phase 1 of this DARPA competition.
DARPA created SC2 to dig us (that’s the collective, worldwide “us”) out of the deep, deep, deep radio spectrum hole we’ve been digging for more than 100 years. For the past century, the demand for radio spectrum has grown monotonically and in the past few years, it’s grown at 50% per year.
When Marconi invented the spark-gap transmitter in 1899, one transmitter consumed the entire radio spectrum. That proved to be a problem as soon as there was more than one radio transmitter in the world, so we started using frequency selection to share spectrum the following year. Today, said Tilghman, we’re in the “era of isolation.” We’ve allocated frequencies by use and geography, whether or not that frequency is being used at the moment in that location and we currently use simple rules or blind sharing to share some of the available spectrum. Here’s what the allocated RF spectrum map looks like today:
This century-old solution to RF spectrum management is no longer adequate. By the year 2030, said Tilghman, the radio spectrum will need to carry a zettabyte of data every month. Things cannot continue as they have. We cannot manufacture more spectrum because it’s an inherent property of the space/time fabric, so we must get smarter at using the spectrum we have.
That’s the Grand Challenge.
DARPA seeks to open a new era of RF collaboration through SC2, which seeks to develop autonomous, intelligent, collaborative radio networks built on the following five elements:
Only the first element is a hardware component.
The RF networking design that best carries data and shares the spectrum in this 3-year challenge will win a $2M prize from DARPA. Second place wins $1M and third place wins $750,000.
The rules of the SC2 competition are quite interesting. DARPA wants to further the development of autonomous, collaborative RF networking systems and has standardized the hardware for this challenge using existing SDR (software-defined radio) designs and equipment. DARPA is building a physical “arena” for the competition using FPGA-based hardware from National Instruments (NI) and Ettus Research (an NI subsidiary). The arena resides in a virtual “colosseum” [sic], the world’s largest RF networking testbed. Here’s a diagram:
Sixteen interconnected NI ATCA-3671 FPGA modules create the core colosseum network. Each ATCA-3671 card incorporates four Xilinx Virtex-7 690T FPGAs in a cross-linked configuration resulting in an aggregate data bandwidth of 160Gbytes/sec in and out of each card. These cards create an FPGA-based mesh that permits efficient data movement and aggregation.
Attached to each ATCA-3671 card are eight Ettus Research USRP X310 high-performance SDR modules, which are based on Xilinx Kintex-7 410T FPGAs. The aggregate is 128 2x2 MIMO radio nodes attached to the colosseum network. That’s a 256x256 MIMO network if you’re playing at home.
The DARPA colosseum provides access control, a scheduling infrastructure, a scoring infrastructure, and automated match initialization. This network can be subdivided and, during competition, five teams will compete simultaneously on this shared system. In addition, DARPA has seeded the colosseum with incumbent RF networks that must be protected from any additional interference and with jammer nodes that intentionally disrupt spectrum. The winning team will be the one that creates a network that best carries traffic while cooperating with the other competing networks, protects incumbent systems, and overcomes the jamming.
The deadline for initial registration in Phase 1 (the first year) of the 3-year SC2 Grand Challenge is November 1, 2016. That gives you a matter of weeks to assemble your team. Competition and registration details are here.
By Adam Taylor
Following on our examination of how we can use the HLS video libraries, our next step is to understand how we store an image and the subtle difference between OpenCV and the HLS Video Libraries.
Different types of edge detection (Original, Laplacian of Gaussian, Canny and Sobel)
The most basic of OpenCV elements is the cv::Mat class, which defines the image size in X and Y and the pixel information: the number of bits within a pixel, whether the pixel data is signed or unsigned, and how many channels make up a pixel. This class creates the basis for how we store and manipulate images when we use OpenCV.
Within the HLS library there is a similar construct: the hls::Mat. The library also provides a number of functions that enable conversion of the hls::Mat class to and from HLS streaming. This is the standard interface we use when creating image-processing pipelines. One major difference between the cv::Mat and the hls::Mat classes is that the hls::Mat class is defined as a stream of pixels as opposed to the cv::Mat definition, which is a block of memory. This difference means that we do not have random access to pixels using hls::Mat.
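The practical effect of losing random access can be shown with plain C++ stand-ins for the two storage models. The types below are hypothetical and use only the standard library (a vector for the memory-block model and a queue for the pixel-stream model). They are not the OpenCV or HLS classes. The example mirrors an image left to right, which is trivial with random access but needs a one-row buffer when pixels arrive as a stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Memory-block image: any pixel can be read at any time.
struct BlockImage {
    std::size_t rows;
    std::size_t cols;
    std::vector<std::uint8_t> data;  // row-major pixel storage

    std::uint8_t at(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return data[r * cols + c];
    }
};

// Pixel stream: pixels arrive in raster order and each is read exactly once.
using PixelStream = std::queue<std::uint8_t>;

// Mirror with random access: read each source pixel directly where needed.
BlockImage mirrorBlock(const BlockImage& src) {
    BlockImage dst{src.rows, src.cols,
                   std::vector<std::uint8_t>(src.data.size())};
    for (std::size_t r = 0; r < src.rows; ++r)
        for (std::size_t c = 0; c < src.cols; ++c)
            dst.data[r * src.cols + c] = src.at(r, src.cols - 1 - c);
    return dst;
}

// Mirror on a stream: a whole row must be buffered before any of it
// can be written out in reverse order.
PixelStream mirrorStream(PixelStream& in, std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols);
    PixelStream out;
    std::vector<std::uint8_t> line(cols);  // one-row line buffer
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            line[c] = in.front();
            in.pop();
        }
        for (std::size_t c = cols; c-- > 0;) out.push(line[c]);
    }
    return out;
}
```

The line buffer in mirrorStream is the same kind of structure that streaming image filters rely on when they need neighbouring pixels from earlier rows.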
A simple example that demonstrates how we can use these libraries is to perform a simple Gaussian Blur of an image. The filter will use AXI Streaming interfaces to input and output the image data stream.
Gaussian blurring is typically applied to an image to reduce noise before running edge-detection embedded-vision algorithms such as Sobel or Canny.
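To make the operation concrete, here is a minimal software sketch of a 3x3 Gaussian blur on a single-channel 8-bit image. It is not the HLS library implementation. The integer kernel (1 2 1 / 2 4 2 / 1 2 1, divided by 16) and the replicate-edge border handling are common choices assumed here for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 3x3 Gaussian blur on a row-major, single-channel 8-bit image.
// Kernel weights and border handling are assumptions of this sketch.
std::vector<std::uint8_t> gaussianBlur3x3(const std::vector<std::uint8_t>& src,
                                          std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0 && src.size() == rows * cols);
    static const int kernel[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};  // sums to 16

    const long maxR = static_cast<long>(rows) - 1;
    const long maxC = static_cast<long>(cols) - 1;
    std::vector<std::uint8_t> dst(src.size());

    for (long r = 0; r <= maxR; ++r) {
        for (long c = 0; c <= maxC; ++c) {
            int sum = 0;
            for (int kr = -1; kr <= 1; ++kr) {
                for (int kc = -1; kc <= 1; ++kc) {
                    // Clamp coordinates so border pixels reuse the nearest edge pixel.
                    const long rr = std::min(std::max(r + kr, 0L), maxR);
                    const long cc = std::min(std::max(c + kc, 0L), maxC);
                    const std::size_t idx =
                        static_cast<std::size_t>(rr) * cols + static_cast<std::size_t>(cc);
                    sum += kernel[kr + 1][kc + 1] * src[idx];
                }
            }
            const std::size_t out =
                static_cast<std::size_t>(r) * cols + static_cast<std::size_t>(c);
            dst[out] = static_cast<std::uint8_t>((sum + 8) / 16);  // rounded divide
        }
    }
    return dst;
}
```

Because the weights sum to 16, a uniform image passes through unchanged while an isolated bright pixel is spread over its neighbours, which is the noise-smoothing behaviour edge detectors benefit from.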
The first step is to create the HLS structures we need within a header file so that both the module to be synthesised and the test bench can use them. These type definitions are:
typedef hls::stream<ap_axiu<16,1,1,1> > AXI_STREAM;
typedef hls::Mat<MAX_HEIGHT, MAX_WIDTH, HLS_8UC2> YUV_IMAGE;
typedef hls::Mat<MAX_HEIGHT, MAX_WIDTH, HLS_8UC3> RGB_IMAGE;
With the basics defined we are then in a position to generate the module we wish to synthesise and the test bench to check it is functioning.
Starting with the module we wish to synthesise, the video input and output from the module will use the previously defined AXI_STREAM type definition, while the size of the image in rows and columns will be supplied over an AXI-Lite interface. We can also use this AXI-Lite interface if we want to provide the ability to enable or disable the filter.
Implementing the function we want is very simple. We need to convert the input video from an AXI Stream into an hls::Mat, apply our filter, and then convert the output hls::Mat back to an AXI Stream.
HLS Function to perform the Gaussian Blur
Having written the code we wish to synthesise and implement in the Zynq SoC, the next thing we need to do is create a test bench so that we can check the functionality using both C and Co-Simulation before we include the core within our Vivado design.
We will look at this next week, and we’ll also see how we can combine OpenCV and the HLS Libraries in our test bench.
Code is available on Github as always.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
All of Adam Taylor’s MicroZed Chronicles are cataloged here.
RISC-V (pronounced “risk five”) is an open, 32/64-bit RISC microprocessor architecture first developed at the Computer Science Division of the EECS Department at the U. of California, Berkeley. Now it’s managed by the RISC-V Foundation. If you are an aficionado of processor architectures and you’re looking to get your feet wet with the RISC-V architecture, SiFive has released three Freedom FPGA Platforms based on Xilinx All Programmable devices that allow you to start working with the RISC-V ISA immediately. The three Xilinx-based SiFive Freedom Platform kits are:
Xilinx VC707 Eval Kit
$99 Digilent ARTY Dev Board
SiFive is a new fabless semiconductor company and one of the intents of the entire RISC-V exercise is to create an open-source chip platform so that you can develop your own SoCs. As usually happens, the first step is to become familiar with the processor architecture. FPGAs—as usual—serve as excellent implementation vehicles for the RISC-V HDL code until there are ASICs available.
Not planning on developing your own SoC anytime soon? Take a look at the Xilinx Zynq-7000 SoC and the Zynq UltraScale+ MPSoC. With their monolithic combinations of two to six ARM 32- and 64-bit processors and Xilinx programmable logic, they are both excellent ways to go beyond prototyping and to start shipping systems as early as tomorrow. (Today if you’re really aggressive.) And, if you really, really want a RISC-V processor, you can instantiate the processor’s HDL code in the Zynq SoC’s or Zynq UltraScale+ MPSoC’s on-chip programmable logic.
Someone at Digilent didn’t get the message. Or maybe they did. Late last month, I explained that Xilinx is giving away fifty SDSoC vouchers worth $995 at the September 17 Sparkfun Autonomous Vehicle Competition (AVC) in beautiful Niwot, Colorado. First come, first served. (See “$100K Xilinx and Digilent giveaway at Sparkfun Autonomous Vehicle Competition, September 17, Niwot Colorado.”) In addition, Digilent has agreed to come into the tent (that’s the Xilinx Maker’s Alley tent at AVC) and will be giving away five Zybo Trainer boards based on the Xilinx Zynq-7000 All Programmable SoC.
Except Digilent didn’t send us five boards for the event. They sent five kits. Five really, really nice kits:
Digilent Zybo Trainer Board Kit
Each kit includes everything you need to get up and running with the Zynq SoC:
These kits are definitely not going out “first come, first served.” How can you get one? You’ll just have to come to the Xilinx Maker’s Alley table at Sparkfun AVC to see how you can win one of these five kits. We’ll be somewhere in the Sparkfun parking lot.
See you then.
A few months ago, Xcell Daily discussed the new Photonfocus QuadRate QR1 Video Camera based on a Xilinx Spartan-6 FPGA. (See “Photonfocus FPGA-powered QuadRate technology pumps video cameras’ high frame rate (400Mbytes/sec) over GigE.”) Now, ClearView Imaging Ltd has posted a YouTube video showing the camera in action and that video includes additional information about the proprietary, patent-pending wavelet-based compression codec required to cram that much video into a GigE Vision connection including this block diagram:
The Photonfocus QR1 camera delivers real-time 4:1 video compression without dropped frames while maintaining 100% compatibility with the GigEVision and GenICam standards. The frame rates achieved by the QR1 cameras are:
Resolution      Frame rate [fps]
2040 x 1088     169
1024 x 1024     358
800 x 600       606
640 x 480       754
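A quick calculation shows why compression is needed at these frame rates. The helper below is a hypothetical sketch. It assumes one byte per pixel (the text does not state the pixel depth) and compares against Gigabit Ethernet's raw line rate of 1Gbit/sec (125Mbytes/sec), ignoring protocol overhead.

```cpp
#include <cassert>

// Gigabit Ethernet raw line rate in bytes/sec, before protocol overhead.
constexpr double kGigEBytesPerSecond = 1e9 / 8.0;

// Uncompressed video data rate in bytes/sec. The bytes-per-pixel value is
// supplied by the caller; one byte per pixel is an assumption, not a
// figure from the camera's documentation.
double rawBytesPerSecond(int width, int height, int fps, int bytesPerPixel) {
    assert(width > 0 && height > 0 && fps > 0 && bytesPerPixel > 0);
    return static_cast<double>(width) * height * fps * bytesPerPixel;
}

// True when the stream fits the raw GigE line rate after compression.
bool fitsGigE(double rawBytesPerSec, double compressionRatio) {
    assert(compressionRatio > 0.0);
    return rawBytesPerSec / compressionRatio <= kGigEBytesPerSecond;
}
```

At one byte per pixel, the 2040 x 1088 mode at 169 frames/sec produces 375,098,880 bytes/sec, about three times the raw GigE line rate. With 4:1 compression it drops to roughly 94Mbytes/sec, which fits.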
Photonfocus has leveraged the flexibility of FPGA-based hardware platforms to develop multiple product lines including its three QR1 quad-rate camera models (for monochrome, NIR (near IR), and color imaging) and its previous double-rate video cameras.
The Xilinx Spartan-6 FPGA family is now more than half a decade old yet it continues to demonstrate real value as a foundation—a cost-effective foundation—for many, many new video and vision platforms.
Here’s the ClearView Imaging video:
If you are in the surveillance industry, you know that the number of cameras installed in the field is already skyrocketing. We cannot put that many dumb cameras in the field without swamping centralized image-processing and -analysis equipment. The silver lining in this particular cloud is our increasing ability to move video processing to the edge. Processing algorithms instantiated in programmable hardware, either in an FPGA or a Zynq All Programmable SoC, go a long way toward enabling that edge-based processing.
If you parsed the previous paragraph with ease, there’s a free 1-hour Webinar you’ll likely want to attend. It’s part of the Xilinx “Video with Precision” Webinar series done in conjunction with the IEEE Spectrum Tech Insiders program and it’s on Wednesday, September 21.
After the Xilinx 28nm 7 series of FPGAs and Zynq SoCs was introduced, Xilinx developed major silicon, architectural, and software enhancements to create the 20nm UltraScale and 16nm UltraScale+ device families (Virtex UltraScale, Virtex UltraScale+, Kintex UltraScale, Kintex UltraScale+, and Zynq UltraScale+ MPSoC). The result is anywhere from a 2x to 5x improvement in performance/W for comparable system designs. Although I’ve written several blogs about this topic and Xilinx has published several backgrounders and White Papers with explanations, perhaps you’re looking for some additional clarity in an easy-to-digest form.
Gotcha covered. Here’s a 5-minute video with Product Marketing Manager Darren Zacher giving a concise, technical explanation of the UltraScale architectural and software changes that underlie this significant improvement in performance/W you can achieve using Xilinx UltraScale and UltraScale+ devices.
Oki IDS has just posted a complete, free eval demo of hardware-accelerated Harris Corner Detection running on the Zynq-based Avnet Smart Vision Development Kit. The demo detects feature points (edges and corners) in every video frame, and adds graphic marker overlays identifying these features to each frame, in real time. It then transfers this video using GigE Vision protocol over Gbit Ethernet to workstations for display. The demo was created using the Xilinx SDSoC Development Environment. Here’s a diagram of the demo:
The demo analyzes 720p (1280x720) monochrome video and it runs on an Avnet PicoZed board based on a Xilinx Zynq Z-7015 SoC. The only reason this demo works in real time is that the algorithm has been implemented in programmable logic on the Zynq SoC. A software implementation would be much too slow. You’ll find more information about this demo on this page from the Hackaday.io Web site, posted by Aaron Behman, which includes instructions and this block diagram of what’s happening inside of the Zynq SoC:
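For readers who haven’t met Harris Corner Detection before, here is a minimal pure-Python sketch of the textbook algorithm. It is not the Oki IDS implementation, which runs in programmable logic. The function name, the 3x3 summing window, the central-difference gradients, and the constant k = 0.04 are all illustrative assumptions. The score R = det(M) - k * trace(M)^2 comes out positive at corners, negative along edges, and near zero in flat regions.

```python
def harris_response(img, k=0.04):
    """Textbook Harris response for a 2-D list of gray levels.

    ASSUMPTIONS (illustrative, not from the demo): central-difference
    gradients, an unweighted 3x3 window, and k = 0.04.
    """
    h, w = len(img), len(img[0])

    # Image gradients; border pixels are left at zero.
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0

    # Sum gradient products over a 3x3 window to form the structure
    # matrix M = [[sxx, sxy], [sxy, syy]], then score each pixel.
    resp = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            sxx = sxy = syy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx = ix[y + dy][x + dx]
                    gy = iy[y + dy][x + dx]
                    sxx += gx * gx
                    sxy += gx * gy
                    syy += gy * gy
            det = sxx * syy - sxy * sxy
            trace = sxx + syy
            resp[y][x] = det - k * trace * trace
    return resp


# Tiny synthetic test image: a bright square filling the lower-right
# part of a 12x12 frame, so its top-left corner sits at (row 6, col 6).
SIZE = 12
image = [[1.0 if (y >= 6 and x >= 6) else 0.0 for x in range(SIZE)]
         for y in range(SIZE)]
response = harris_response(image)

print("corner (6, 6):", response[6][6] > 0)   # corner: positive score
print("edge   (9, 6):", response[9][6] < 0)   # edge: negative score
print("flat   (2, 2):", response[2][2] == 0)  # flat: zero score
```

Even this small version touches every pixel several times per frame, with nine multiply-accumulate steps for each of three sums, which hints at why a per-pixel pipeline in programmable logic suits 720p video better than a sequential software loop.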