Earlier this week at the Embedded Vision Summit in Santa Clara, QuickPlay and Auviz Systems demonstrated an FPGA-accelerated visual color-detection application built with a rapid development flow that combines the previously announced QuickPlay development environment (see “A Novel Approach to Software-Defined FPGA Computing”) with Auviz middleware IP. The demo ran on a ReFLEX CES XpressKUS PCIe board based on a Xilinx Kintex UltraScale KU060 FPGA.
The two companies have announced that they are teaming up to provide a next-generation, software-defined development environment for FPGA-accelerated vision applications.
The Auviz Video Content Analysis Platform, AuvizVCA, runs on an FPGA and employs semantic segmentation to deliver accurate image detection and classification at better than 30 frames/sec for as many as 21 object classes. However, there’s no need to understand the underlying FPGA programming to get the benefit of the fast FPGA hardware. AuvizVCA is implemented as an FPGA-optimized OpenCL kernel that’s invoked through high-level language calls on a host processor. During execution, AuvizVCA invokes AuvizDNN, an optimized library of Deep Neural Network functions. (See “Machine Learning in the Cloud: Deep Neural Networks on FPGAs” by Auviz’ Nagesh Gupta.)
Currently, AuvizVCA is running on an Alpha Data ADM-PCIE-7V3 PCIe board based on a Xilinx Virtex-7 690T FPGA. (See “Alpha Data showcases PCIe accelerator card for HPC based on Kintex UltraScale FPGA at SC14”.) Configurations are generated using the Xilinx SDAccel Development Environment. Planned future releases of AuvizVCA will support the newer Xilinx UltraScale All Programmable device families.
Thanks to Dave Jones’ EEVblog teardown of NI’s (National Instruments’) greatly enhanced VB-8034 VirtualBench All-in-One instrument (DSO, logic analyzer, 5½-digit DMM, arbitrary waveform generator, programmable digital I/O, and power supply in one box), we know a lot more about this high-quality instrument’s internal design. The VB-8034’s DSO has four 350MHz channels in contrast to its predecessor’s two 100MHz channels. Back when NI announced the VB-8034, all I knew was that it was based on a Xilinx Zynq All Programmable SoC like its predecessor, the VB-8012. After watching Dave’s 40-minute teardown video, we now know that the waveform capture and digital processing are performed by a pair of Xilinx devices: a Zynq Z7020 SoC and a Kintex-7 160T FPGA.
National Instruments VB-8034 enhanced VirtualBench All-in-One instrument
Here’s the EEVblog teardown video for the NI VB-8034:
One of Dave’s high-resolution photos shows the two Xilinx devices on the VB-8034’s main capture and processing board:
Once Dave wipes the last traces of heat-sink compound from the device appearing in the center of the image, we see it’s a Kintex-7 FPGA, which is obviously handling the DSO waveform capture, stowing digitized samples from the two flanking National Semiconductor dual 1.5Gsamples/sec 8-bit ADCs into four nearby 1Gbit DDR3 SDRAMs (two on the top and two on the bottom of the board) in real time. The device in the lower left of the image is a Zynq Z7020 SoC, which is handling the overall instrument control and the instrument’s GUI and USB/Ethernet/WiFi I/O.
We can again see the benefit of a Xilinx-based platform design in the design of the NI VB-8034. The original VB-8012 instrument was based on the Zynq Z7020 but the VB-8034 has a significantly enhanced DSO (see the specs for these as well as other enhancements). So NI was able to leverage the existing Zynq-based VB-8012 design for a lot of the new instrument but then added the Kintex-7 FPGA for the much beefier DSO (3.5x more bandwidth, 2x the channels).
For more information about NI’s VirtualBench All-in-One instruments, see:
By Jeremy Banks, Product Manager, Curtiss-Wright and Jim Everett, Xilinx
A new mezzanine card standard called FMC+, an important development for embedded computing designs using FPGAs and high-speed I/O, will extend the total number of gigabit transceivers (GTs) in a card from 10 to 32 and increase the maximum data rate from 10 to 28 Gbits per second while maintaining backward compatibility with the current FMC standard.
These capabilities mesh nicely with new devices such as those using the JESD204B serial interface standard, as well as 10G and 40G fiber optics and high-speed serial memory. FMC+ addresses the most challenging I/O requirements, offering developers the best of two worlds: the flexibility of a mezzanine card with the I/O density of a monolithic design.
The FMC+ specification has been developed and refined over the last year. The VITA 57.4 working group has approved the spec and will present it for ANSI balloting in early 2016. Let’s take a closer look at this important new standard to see its implications for advanced embedded design.
The Mezzanine Card Advantage
Mezzanine cards are an effective and widely used way to add specialized functions to an embedded system. Because they attach to a base or carrier card, rather than plugging directly into a backplane, mezzanine cards can be easily changed. For system designers, this means both configuration flexibility and an easier path to technology upgrades.
However, this flexibility usually comes at the cost of functionality due to either connectivity issues or the extra real estate needed to fit on the board. For FPGAs, the primary open standard is ANSI/VITA 57.1, otherwise known as the FPGA Mezzanine Card (FMC) specification. A new version dubbed FMC+ (or, more formally, VITA 57.4) extends the capabilities of the current FMC standard with a major enhancement to gigabit serial interface functionality.
FMC+ addresses many of the drawbacks of mezzanine-based I/O, compared with monolithic solutions, simultaneously delivering both flexibility and performance. At the same time, the FMC+ standard stays true to the FMC history and its installed base by supporting backward compatibility.
The FMC standard defines a small-format mezzanine card, similar in width and height to the long-established XMCs or PMCs, but about half the length. This means FMCs have less component real estate than open-standard formats. However, FMCs do not need bus interfaces, such as PCI-X, which often take a considerable amount of board real estate. Instead, FMCs have direct I/O to the host FPGA, with simplified power supply requirements. This means that despite their size, FMCs could actually have more I/O capacity than their XMC counterparts. As with the PMC and XMC specification, FMC and FMC+ define options for both air and conduction cooling, thereby serving both benign and rugged applications in commercial and defense markets.
The anatomy of the FMC specification is simple. The standard allows for up to 160 single-ended or 80 differential parallel I/O signals for high-pin-count (HPC) designs or half that number for low-pin-count (LPC) variants. Up to 10 full-duplex GT connections are specified. The GTs are useful for fiber optics or other serial interfaces. In addition, the FMC specification defines key clock signals. All of this I/O is optional, though most hosts now support the full connectivity.
The FMC specification also defines a mix of power inputs, though the primary power supply, whose voltage is defined by the mezzanine, is supplied by the host. The host partially powers up the mezzanine and interrogates the FMC, which responds with a voltage range for VADJ. Assuming the host can supply a voltage in this range, all should be well. Not having the primary regulation on the mezzanine saves space and reduces mezzanine power dissipation.
FMCs for Analog I/O
Designers can use FMCs for any function that you might want to connect to an FPGA, such as digital I/O, fiber optics, control interfaces, memory or additional processing. But analog I/O is the most common use for FMC technology. The FMC specification affords a great deal of scope for fast, high-resolution I/O, but there are still trade-offs—especially with high-speed parts using parallel interfaces.
For example, Texas Instruments’ ADC12D2000RF dual-channel, 2Gsps 12-bit ADCs use a 1:4 multiplexed bus interface, so the bus speed is not too fast for the host FPGA. The digital data interface alone requires 96 signals (48 LVDS pairs). For a device of this class, FMC can support only one of these ADCs, even if there is sufficient space for more, because it is limited to 160 signals. Lower-resolution devices, even at higher speeds, such as those with 8-bit data paths, may allow more channels even with the increased requirements of the front-end analog coupling of the baluns or amplifiers, clocking and the like.
The FMC specification starts to run out of steam with analog interfaces delivering more than 8 bits of resolution at around 5 or 6Gsps (throughputs of > 50Gbps) using parallel interfaces. From a market perspective, leading FMCs based on channel density, speed and resolution are in the 25 to 50Gbps throughput range. This functionality results from a trade-off between physical package sizes and available connectivity to the host FPGA.
In addition to the parallel connections, the FMC specification supports up to 10 full-duplex high-speed serial (GT) links. These interfaces are useful for such functionality as fiber-optic I/O, Ethernet, emerging technologies like Hybrid Memory Cube (HMC) and the Bandwidth Engine, and newer-generation analog I/O devices that use the JESD204B interface.
Although the JESD204 serial-interface standard, currently at revision “B,” has been around for a while, only recently has it gained wider market penetration and become the serial interface of choice for newer generations of high-sampling data converters. This wide adoption has been stoked by the telecommunications industry’s thirst for ever-smaller, lower-power and lower-cost devices.
As mentioned earlier, a dual-channel 2-Gsps, 12-bit ADC with a parallel interface requires a large number of I/O signals. This requirement directly impacts the package size, in this case mandating a 292-pin package measuring roughly 27 x 27 mm (though newer-generation pin geometry could shrink the package size to something less than 20 x 20 mm).
A JESD204B-connected equivalent device can be provided in a 68-pin, 10 x 10-mm package—with reduced power. This dramatic reduction in package size marries well with evolving FPGAs, which are providing ever more GT links at higher and higher speeds. Figure 1 illustrates an example of package size and FMC/FMC+ board size.
Typical high-speed ADCs and DACs using the JESD204B interface have between one and eight GT links operating at 3 to 12Gbps each, depending on the data throughput required based on sample rate, resolution and number of analog I/O channels.
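As a rough illustration of that relationship, here is a back-of-the-envelope lane-rate calculation. It ignores JESD204B control and tail bits, and the example converter (dual-channel 2Gsps, samples framed to 16 bits, eight lanes) is a hypothetical device, not a specific part:

```python
def jesd204b_lane_rate(sample_rate, bits_per_sample, channels, lanes):
    # Payload bits per second spread across the lanes, times 10/8 for
    # the 8b/10b line coding used by JESD204B (control bits ignored).
    payload = sample_rate * bits_per_sample * channels
    return payload / lanes * 10 / 8

# Hypothetical dual-channel 2Gsps converter, samples framed to 16 bits,
# on eight lanes: 10Gbps per lane, within the 3-to-12Gbps range cited above
rate = jesd204b_lane_rate(2e9, 16, 2, 8)
```

Trading lane count against lane rate this way is exactly the flexibility that makes the larger GT budget of FMC+ attractive.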
The FMC specification defines a relatively small mezzanine card, but with the emergence of JESD204B devices there is room to fit more parts onto the available real estate. The maximum of 10 GT links defined by the FMC specification is a useful quantity; even this limited number of GT links provide 80Gbps or more of throughput while using a fraction of the pins otherwise required for parallel I/O.
The emergence of serially connected I/O devices, not just those using JESD204B, does have drawbacks for some application segments in electronic warfare, such as digital radio frequency memory (DRFM). Serial interfaces invariably introduce additional latency due to longer data pipelines. For DRFM applications, latency from data-in to data-out is a fundamental performance parameter. Although latency is likely to vary widely between serially connected devices, new generations of devices will push data through the pipelines faster and faster, with some promising the ability to tune the depth of the pipeline. It remains to be seen how much improvement will be realized.
Some standard ADC devices sampling at >1Gsps today have latency below 100 nanoseconds. Other applications can tolerate this latency, or do not care about it, including software-defined radio (SDR), radar warning receivers and other SIGINT segments. These applications gain large advantages by using a new generation of RF ADCs and DACs, a technology driven by the mass-market telecommunications infrastructure.
Outside of the FPGA community, newer DSP devices are also starting to adopt JESD204B. However, FPGAs are likely to remain the best fit for taking full advantage of wideband analog I/O devices, because they can process vast data volumes with far greater parallelism.
The Evolution of FMC+
To move FMC to the next level, the VITA 57.4 working group has created a specification with an increased number of GT links operating at increased speed. FMC+ maintains full FMC backward compatibility by adding outer columns to the FMC connector for the additional signals, without changing any of the board profiles or mechanics.
The additional rows will be part of an enhanced connector that will minimize any impact on available real estate. The FMC+ specification increases the maximum number of available GT links from 10 to 24, with the option of adding another eight links, for a total of 32 full duplex. The additional links use a separate connector, referred to as an HSPCe (HSPC being the main connector). Table 1 summarizes FMC and FMC+ connectivity.
Multiple independent signal integrity teams characterized and validated the higher 28Gbps data rate. The maximum full-duplex throughput can now exceed 900Gbps in each direction, when the parallel interface is included. See Figure 2 for an outline of the net throughputs that can be expected for digitizer solutions supporting the different capabilities of FMC and FMC+.
Designers can use the increased throughput enabled by FMC+ to take advantage of new devices that offer huge I/O bandwidth. There will still be trade-offs, such as how many devices can fit on the mezzanine’s real estate budget, but for a moderate number of channels the realizable throughput is a huge leap over today’s FMC specification.
In the next few years, it is reasonable to expect high-resolution ADCs and DACs to break through the 10Gsps barrier to support very wideband communications with direct RF sampling for L-, S-, and even C-band frequencies. Below 10Gsps, converters are emerging with 12-, 14-, and even 16-bit resolutions, with some supporting multiple channels. The majority of these devices will use JESD204B (or a newer revision) signaling with 12Gbps channels until newer generations inevitably boost this speed even further. These fast-moving advances are fueled by the telecommunications industry, but the military community can take advantage of them to meet SWAP-C requirements.
Other Advantages and Uses of FMC+
Although FMC+, like FMC, is likely to be dominated by ADC, DAC and transceiver products, the increased GT density provided by FPGAs makes it useful for other functions. Two functions of note are fiber optics and new serial memories.
As with JESD204B, there are requirements for faster, denser fiber optics; parts based on fiber-optic ribbon cables are the smallest. Because the FMC+ footprint readily supports 24 full-duplex fiber-optic links, this application is likely where the higher speeds supported by FMC+ will first be realized. At 28Gbps per fiber, throughput on a single mezzanine quickly moves past 100G toward 400G speeds. Optical throughput of 100G is emerging today on the current FMC format.
Another emerging area suitable for FMC+ is serial memory such as Hybrid Memory Cube and MoSys’ Bandwidth Engine. These novel devices represent an entirely new category of high-performance memory, delivering unprecedented system performance and bandwidth by utilizing GT connectivity. (Xcell Journal issue 88 examines these new memory types.)
A new generation of the FMC specification has been introduced and is adapting to new technology driven by serially connected devices. Key players in the FMC industry have already begun adopting this specification. Figure 3 shows the first Xilinx demonstration board featuring FMC+, the KCU114 based on a Xilinx Kintex UltraScale FPGA. The FMC standard, through its new incarnation FMC+, is alive and kicking and is prepared for the next generation of high-performance, FPGA-driven applications.
Note: This blog post originally appeared in Xcell Journal, Issue 94; see that issue for the full article.
Last month at the NAB 2016 show in Las Vegas, Omnitek announced the Ultra XR Advanced 4K/UHD Waveform Analyzer, designed for colorists, post-production editors, and other content creatives preparing material for 4K/UHD distribution. Like the company’s Ultra 4K Tool Box, the Ultra XR Advanced 4K/UHD Waveform Analyzer is based on a Xilinx Zynq Z7045 All Programmable SoC. Key features of the Ultra XR Advanced 4K/UHD Waveform Analyzer include:
Omnitek Ultra XR Advanced 4K/UHD Waveform Analyzer
Here's a 2-minute video with a short demo of the new product:
Though I know I’m repeating myself, the Omnitek Ultra XR Advanced 4K/UHD Waveform Analyzer is yet another example of a Xilinx All Programmable device serving as a flexible design platform for a range of products or even multiple product lines. In addition, this sort of hardware/software programmable platform allows you to add features at will with no change in your BOM or BOM cost.
By William D. Richard, Associate Professor, Washington University in St. Louis
Using the low-voltage differential signaling (LVDS) inputs on a modern Xilinx FPGA, it is possible to digitize an analog input signal with nothing but one resistor and one capacitor. Since hundreds of LVDS inputs reside on a current-generation Xilinx device, it is theoretically possible to digitize hundreds of analog signals with a single FPGA.
Our team recently explored one corner of the possible design space by digitizing a band-limited input signal with a 3.75MHz center frequency with 5 bits of resolution while investigating options for digitizing the signals from a 128-element linear ultrasound array transducer. Let’s take a look at the details of that demonstration project.
In 2009, Xilinx introduced a LogiCORE soft IP core that, along with an external comparator, one resistor and one capacitor, implements an analog-to-digital converter (ADC) capable of digitizing inputs with frequencies up to 1.205 kHz. Using an FPGA’s LVDS inputs instead of an external comparator, in conjunction with a delta modulator ADC architecture, it is possible to digitize much higher-frequency analog input signals with just one resistor and one capacitor.
ADC Topology and Experimental Platform
The block diagram of a one-channel delta modulator ADC implemented using the LVDS inputs on a Xilinx FPGA is shown in Figure 1. Here, the analog input drives the noninverting LVDS_33 buffer input, and the input signal range is essentially 0 to 3.3 volts. The output of the LVDS_33 buffer is sampled at a clock frequency much higher than the input analog signal frequency and fed back through an LVCMOS33 output buffer and an external, first-order RC filter to the inverting LVDS_33 buffer input. With just this circuitry, the feedback signal, given an appropriate selection of clock frequency (F), resistance (R) and capacitance (C), will track the input analog signal.
As an example, Figure 2 shows an input signal in yellow (channel 1) and the feedback signal in blue (channel 2) for F = 240MHz, R = 2K and C = 47 pF. The input signal shown was produced by an Agilent 33250A function generator using its 200MHz, 12-bit, arbitrary output function capability. The Fourier transform of the input signal as computed by the Tektronix DPO 3054 oscilloscope we used is shown in red (channel M). At these frequencies, the input capacitance of the oscilloscope probe (as well as grounding issues) did degrade the integrity of the feedback signal shown in the oscilloscope trace, but Figure 2 does illustrate operation of the circuit.
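The tracking behavior can be sketched with a small behavioral model in Python, using the article’s operating point (F = 240MHz, R = 2K, C = 47 pF); the comparator and RC network are idealized here, so this is an illustration of the principle rather than a circuit-accurate simulation:

```python
import math

def delta_modulator(signal, fs, r, c):
    # Per-clock fraction by which the RC network steps toward the drive level
    alpha = 1.0 - math.exp(-1.0 / (fs * r * c))
    fb = 0.0       # feedback voltage on the inverting LVDS_33 input
    bits = []
    for x in signal:
        bit = 1 if x > fb else 0        # idealized LVDS comparator decision
        fb += alpha * (3.3 * bit - fb)  # RC filter driven by the LVCMOS33 output
        bits.append(bit)
    return bits

# 3.75MHz sine centered in the 0-to-3.3V input range, sampled at 240MHz
fs = 240e6
sig = [1.65 + 0.5 * math.sin(2 * math.pi * 3.75e6 * n / fs) for n in range(2000)]
bits = delta_modulator(sig, fs, r=2e3, c=47e-12)
```

The local density of 1s in the output bitstream tracks the normalized input voltage, which is why a bandpass filter can later recover the signal.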
We defined the band-limited input signal shown in Figure 2 by applying a Blackman-Nuttall window to a 1Vpp 3.75MHz sine wave. While the noise floor associated with the theoretical windowed signal is almost 100 dB below the amplitude associated with the center frequency, the 200MHz sample frequency and 12-bit resolution of the Agilent 33250A function generator result in a far-less-ideal demonstration signal. The output signals produced by many ultrasound transducers with center frequencies near 3.75MHz are naturally band-limited, due to the mechanical properties of the transducers, and are therefore ideal signal sources for use with this approach.
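For reference, the Blackman-Nuttall window used to shape the test signal is easy to generate from its standard published coefficients; the sketch below assumes a length of 4,096 samples, which is an arbitrary choice for illustration:

```python
import math

def blackman_nuttall(N):
    # Standard four-term Blackman-Nuttall coefficients
    a = (0.3635819, 0.4891775, 0.1365995, 0.0106411)
    return [a[0]
            - a[1] * math.cos(2 * math.pi * n / (N - 1))
            + a[2] * math.cos(4 * math.pi * n / (N - 1))
            - a[3] * math.cos(6 * math.pi * n / (N - 1))
            for n in range(N)]

# 1Vpp (0.5V amplitude) 3.75MHz sine at a 240MHz sample rate, windowed
fs, f0, N = 240e6, 3.75e6, 4096
w = blackman_nuttall(N)
burst = [0.5 * math.sin(2 * math.pi * f0 * n / fs) * w[n] for n in range(N)]
```

The window tapers smoothly to nearly zero at both ends, which is what confines the sine wave's energy to a narrow band around 3.75MHz.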
We obtained the plot shown in Figure 2 using a Digilent Cmod S6 development module with a Xilinx Spartan-6 XC6SLX4 FPGA mounted on a small, custom printed-circuit board with eight R/C networks and input connectors, allowing the prototype system to digitize up to eight signals simultaneously.
Each channel was parallel-terminated with 50 ohms to ground to properly terminate the coaxial cable from the signal generator. It is important to note that to achieve this performance, we set the drive strength of the LVCMOS33 buffers to 24 mA and the slew rate to FAST, as documented in the example VHDL source in Figure 5.
The custom prototype board also supported the use of an FTDI FT2232H USB 2.0 Mini-Module that we used to transfer packetized serial bitstreams to a host PC for analysis. Figure 3 shows the magnitude of the Fourier transform of the bitstream the prototype board produced when fed the analog signal of Figure 2. Peaks associated with subharmonics of the 240MHz sampling frequency are clearly visible, along with a peak at 3.75MHz associated with the input signal.
Large Number of Taps
By applying a bandpass finite impulse response (FIR) filter to the bitstream, it is possible to produce an N-bit binary representation of the analog input signal: the ADC output. Since the digital bitstream is at a much higher frequency than the analog input signal, you need to use FIR filters with a large number of taps. The data being filtered, however, has only values of zero (0) and one (1), so multipliers are not needed; only adders are required to sum the FIR filter coefficients.
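A minimal sketch of such a multiplier-free filter (plain illustrative Python, not a hardware description):

```python
def fir_on_bitstream(bits, coeffs):
    # Because every input sample is 0 or 1, each output is simply the sum
    # of the coefficients that line up with 1-bits in the current window --
    # no multipliers required.
    n = len(coeffs)
    out = []
    for i in range(n - 1, len(bits)):
        out.append(sum(c for j, c in enumerate(coeffs) if bits[i - j]))
    return out
```

In hardware, the same idea maps each tap to a conditional add, which is why long filters remain feasible on the FPGA fabric.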
The ADC output shown in Figure 4 was produced on the host PC using an 801-tap bandpass filter centered at 3.75MHz that we designed using the free, online TFilter FIR filter design tool. This filter had 36 dB or more of attenuation outside the 2.5MHz to 5MHz passband and 0.58 dB of ripple between 3 and 4.5MHz.
The ADC output signal shown in Figure 4 has a resolution of approximately 5 bits. This is ultimately a function of the oversampling rate, and you can achieve higher resolution with designs optimized for lower input frequencies.
The ADC output signal shown in Figure 4 is also severely oversampled at 240MHz and can be decimated to reduce the ADC output bandwidth. In a hardware implementation of the bandpass filter and decimation blocks, it would be possible to only compute every 16th filter output value when decimating by a factor of 16 down to an effective sample rate of 15MHz (three times faster than the highest frequency in the band-limited input signal), reducing the hardware requirements.
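Computing only every 16th filter output can be sketched like this (again illustrative Python rather than the hardware implementation):

```python
def fir_decimate_bitstream(bits, coeffs, factor=16):
    # Slide the filter window forward `factor` positions at a time,
    # computing only the outputs that survive decimation.
    n = len(coeffs)
    out = []
    for i in range(n - 1, len(bits), factor):
        out.append(sum(c for j, c in enumerate(coeffs) if bits[i - j]))
    return out
```

Skipping the discarded outputs cuts the adder workload by the decimation factor, which is the hardware saving the text describes.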
Figure 5 shows the VHDL source used with the Digilent Cmod S6 development module to produce the feedback signal shown in Figure 2, along with the bitstream data associated with the Fourier transform of Figure 3. An LVDS_33 input buffer is instantiated directly and connected to the analog input and feedback signals, sigin_p and sigin_n, respectively. The internal signal sig is driven by the output of the LVDS_33 buffer and sampled by the implied flip-flop to produce sigout. The signal sigout is the serial bitstream that is filtered to produce the N-bit ADC output. We used the free Xilinx ISE Webpack tools to implement the project.
Figure 5 shows the VHDL code and the portion of the UCF file associated with the circuitry of Figure 1.
Low Component Count
The ADC architecture we have described has been inaccurately referred to in several recent articles as a delta-sigma architecture. But while true delta-sigma ADCs have advantages, the simplicity of this approach and its low component count make it attractive for some applications. And since the LVDS_33 input buffer has a relatively high input impedance, in many applications the sensor output can be connected directly to the FPGA input without a preamplifier or buffer, which can be a significant advantage.
Another advantage of our approach is that superposition makes it possible to “mix” several serial bitstreams and apply a single filter to recover the output signal. In array-based ultrasound systems, for example, the serial bitstreams can be time-delayed to implement a focus algorithm, and then added in vector fashion, and a single filter used to recover the digitized, focused ultrasound vector.
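A toy model of that mix-then-filter idea, with hypothetical per-element focus delays expressed in sample periods:

```python
def mix_bitstreams(bitstreams, delays):
    # Delay each element's serial bitstream by its focus delay (in samples),
    # then add the streams sample-wise; a single FIR filter applied to the
    # mixed stream then recovers the focused, digitized vector.
    length = max(len(b) + d for b, d in zip(bitstreams, delays))
    mixed = [0] * length
    for b, d in zip(bitstreams, delays):
        for i, bit in enumerate(b):
            mixed[d + i] += bit
    return mixed
```

Because filtering is linear, filtering the sum is equivalent to summing the individually filtered channels, so one filter serves the whole array.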
Using an FIR filter to produce the ADC output is a straightforward, brute-force approach used here primarily for illustrative purposes. In most implementations, the ADC output will be produced using the traditional integrator/lowpass filter demodulator topology.
Note: This blog post originally appeared in Xcell Journal, Issue 94; see that issue for the full article.
AI now completely dominates image recognition because CNNs (convolutional neural networks) outperform competing machine implementations. They even outperform human image recognition at this point. The basic CNN algorithm requires a lot of computation and data reuse, well-matched to FPGA implementations. Last month, Ralph Wittig (a Distinguished Engineer in the Xilinx CTO Office) gave a 20-minute presentation at the OpenPOWER Summit 2016 conference and discussed the current state of the art for CNNs along with some research results from various universities including Tsinghua University in China.
Several interesting conclusions relating to power consumption of CNN algorithm implementations arise from this research:
Here’s a video capturing Wittig’s presentation at the OpenPOWER Summit:
In this video, Wittig also notes the use of two CNN-related products previously covered in Xcell Daily:
By Adam Taylor
Last week, we looked at Out-Of-Context Synthesis and the time it can save you. This week, I will explore the use of incremental compilation in the Xilinx Vivado HLx Design Suite’s implementation stage to shorten place-and-route times.
Incremental compilation allows us to use a design check point (DCP) from a reference to speed the overall design flow. A DCP reduces implementation time by using the previous placement and routing as a guide. This reuse preserves the previous quality of results within constraints. We’ll explore this later.
The theory behind incremental compilation is this: having implemented the design, we identify a need to change its behavior, either through testing or because we want additional features, and we’d prefer to save some time in the next design iteration. In a conventional design flow, we would re-synthesize the design and place-and-route it all over again, which takes a significant amount of time. Incremental compilation saves time by synthesizing, placing, and routing only the design modifications.
To use incremental compilation, we must first have a reference design, typically a previously implemented version of the design; we can copy the existing project directory to create it. If the DCP does not come from a previous implementation, it must target the same device and speed grade as the implementation target. A separate reference DCP is required because re-starting a run in the current project cleans that run before the new run begins, which is why a DCP cannot be reused within the same run of the same project.
Incremental Compilation Flow
(Reference: Advanced FPGA Design Methodologies with Xilinx Vivado, Alexander Jäger, Computer Architecture Group, Heidelberg University, Germany)
We find the DCP we are looking for in the implementation folder, which we’ll find within the <project_name>.runs folder. Within the implementation folder—in this example called imp_1—you will find three Vivado Design Checkpoint Files:
The content of these files is fairly straightforward. The first file contains the post-optimization DCP. The second file contains the post-placement DCP. The final file contains the DCP after the design has been routed. You’ll get the best results from the routed DCP.
Using this DCP file is very simple. Within Vivado, we open the implementation settings and point the incremental compile option box to the routed DCP file in our reference design.
Configuring the DCP to be used
Once we have configured the design to use the reference design’s DCP, we run the implementation and Vivado implements the design using the reference design as a starting point. I say starting point because the scope and impact of the design changes will affect the implementation and will determine whether we obtain the maximum benefit from incremental compilation, or whether it can be used at all.
The changes we implement in the updated design will be functional, netlist-related, or both. If 95% of the cells and routing remain the same between reference and current implementation, we will gain maximum advantage from this approach. We’ll reduce implementation time and we’ll see the same quality of results that we saw with the reference. Should the reference and new design share between 75% and 95% in common, we will still gain some benefit from the incremental implementation. Below 75% commonality, Vivado will implement the new design using the default settings and will not use the reference design. Vivado will place a warning in the log should this occur.
We can determine the levels of similarity by using Vivado’s report_incremental_reuse command while the implementation is open, as shown below:
Vivado Incremental Placement Report
Vivado Incremental Routing Report
You can learn more about incremental compilation in Xilinx User Guide UG904, Vivado Design Suite User Guide: Implementation, which provides detailed information on implementation and implementation strategies.
The code is available on Github as always.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
You also can find links to all the previous MicroZed Chronicles blogs on my own Web site, here.
By Glenn Steiner, Xilinx
Early embedded processing systems typically consisted of a microprocessor with a few peripherals. These systems acquired a small amount of data, processed the data, made decisions, and then output information based on those decisions. In some cases simple human-machine interfaces read keypads and displayed results. Processing requirements, while demanding at the time, seem trivial by today’s standards. Modern embedded processing systems deal with gigabytes of data and the corresponding analysis of massive datasets. Frequently, there are additional requirements for both deterministic and low-latency operation. Many applications also demand that the system operate reliably and safely while meeting relevant industry standards.
Today it is not possible to purchase a single processor that can simultaneously process high-bandwidth data, perform system application functions, respond to real-time requirements, and meet industry safety standards. However, it is possible to buy a multicore heterogeneous chip that can accomplish these functions. Such a device contains multiple processing elements, each capable of meeting one or more of these requirements. We call such a device a heterogeneous multiprocessing system.
What is Heterogeneous Multiprocessing?
A Heterogeneous Multiprocessing System consists of multiple single and multicore processors of differing types. The simplest form of a heterogeneous multicore processing system would be the combination of a multicore processor with a GPU. However, today’s technologies enable heterogeneous multiprocessing systems on a chip containing:
When we refer to a Heterogeneous Multiprocessing System in this article, it will contain many of the above elements. One advantage of using an FPGA’s programmable logic to implement multiple processors is the ability to create custom application specific processors that enable parallel data processing in two dimensions—through parallel pipes and through multiple pipeline stages—enabling a massive number of computations in one clock cycle.
A multicore processor can be designed to enable general-purpose computing, or it can be designed for application-specific computing. Application-specific computing enables data-optimized processing with decreased silicon footprint, increased throughput per clock cycle, and typically at lower power when compared to a general-purpose processor performing the same function.
Evolution of Heterogeneous Processing Systems with Programmable Logic
In 2002, Xilinx introduced an FPGA containing an application processor: the PowerPC 405. Xilinx then introduced additional generations of FPGA devices with higher-performance PowerPC processors and as many as two PowerPC processors on a chip. Unlike today's devices, where the processing system is an integrated ASSP (with processor, interconnect, memory controller, and peripherals), these early generations required significant FPGA resources to tie the design together into an ASSP-like solution.
In 2011, Xilinx introduced the Zynq-7000 family of fully integrated processing devices with a dual-core ARM Cortex-A9 MPCore processor, interconnect, memory controller, and peripherals, combined with programmable logic based on Xilinx series 7 FPGAs. One might suggest the Zynq-7000 family was a first-generation heterogeneous multiprocessor system because the on-chip programmable logic enabled additional dedicated processing elements to be created and used.
Latest Generation of Heterogeneous Processing Systems with Programmable Logic
In 2015, Xilinx introduced and began shipping the Zynq UltraScale+ MPSoC, a new generation of heterogeneous multiprocessing device. Where past-generation devices consisted of one or more application processor cores combined with programmable logic, the Zynq UltraScale+ MPSoC device family integrates:
Zynq UltraScale+ MPSoC Block Diagram
Multicore Application Processors are the traditional workhorse processors for general-purpose computing. These processors typically operate in symmetric multiprocessing (SMP) mode, running an operating system such as Linux or Android. They also support hypervisors, upon which multiple operating systems can run.
After a floating-point unit, a Graphics Processing Unit is the most popular coprocessor. GPUs offload graphics processing from an application processor and enable sophisticated user interfaces and complex graphics rendering. This capability is necessary for graphical operating systems such as Android or Windows Embedded Compact. General Purpose GPUs (GPGPUs) can perform general-purpose computing on data arrays in addition to graphics processing.
Real-time processors typically enable low-latency response to events and frequently are more deterministic than application processors. Most often they run real-time operating systems that also support low-latency interrupt handling and deterministic response. In functional-safety applications, real-time processors are commonly run in dual lockstep to enable detection of an error that can occur in one of the two processors.
A Platform Management Unit is responsible for managing critical system functions and housekeeping. Such functions can include system error handling, power management, and functional safety tasks. Being the “heart” of a system, failure of this unit is highly undesirable. Thus, a triple redundant processor with voting logic enables this sub-system to continue operation in the event of an error in one of the processor cores.
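The voting scheme behind such a triple-redundant unit can be sketched in a few lines. Here is a hypothetical bitwise two-out-of-three majority voter, an illustration of the concept rather than Xilinx's actual implementation:

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three majority: each output bit agrees with at
    least two of the three redundant inputs, masking any single fault."""
    return (a & b) | (a & c) | (b & c)

# A single corrupted copy is outvoted by the two good ones.
good = 0b10110010
corrupted = good ^ 0b01000000   # one processor flips a bit
assert majority_vote(good, corrupted, good) == good
```

Because the vote is taken bit by bit, any single failed processor core can disagree arbitrarily and the voted result still matches the two healthy cores.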
A Configuration and Security Unit is responsible for system configuration, which includes loading the processor's first-stage boot loader and the programmable-logic configuration bitstream, with optional authentication and decryption of the loaded program code and bitstream. It is also responsible for ongoing monitoring of system security: it watches for tampering attempts indicated by conditions such as under- or over-voltage, under- or over-temperature, and attempts to extract system information.
On-chip programmable logic brings the ultimate in flexibility when it comes to heterogeneous processing. Additional off-the-shelf soft processing cores can be added to handle application-specific computing tasks and custom processing cores can be added with multiple pipelines and multiple pipeline stages, enabling massive parallel processing of streaming data.
A Heterogeneous Multiprocessing System Example
One common industrial vision and control application is robotic pick-and-place assembly. Such applications typically have the following requirements:
Note that each of the above functions potentially requires unique processing capabilities. For example, tasking a general-purpose processor with real-time HD image processing can easily overtax the processor.
Let’s consider one possible solution using a heterogeneous multiprocessing system. To help better visualize the problem we will use the example of a robotic system playing a solitaire game on a tablet computer. Portions of this system have been implemented and demonstrated at the Xilinx booth during Embedded World 2016. (See “3D Delta Printer plays Robotic Solitaire on a Touchpad under control of a Xilinx Zynq UltraScale+ MPSoC.”)
Video Acquisition and Processing
A 1080p60 video stream requires a data rate of 3Gbps, or 373Mbytes/sec. The video pipeline to process this data can include adjustments for brightness, contrast, and white balance; distortion correction; and dead-pixel elimination. Such bit-level processing is done efficiently in programmable logic; processors are far less suited to it.
In our example system an HD camera views the tablet computer and a video pipeline implemented with programmable logic performs the image processing.
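The bandwidth figure above is easy to verify with a back-of-the-envelope check (assuming 24-bit RGB pixels and counting active video only):

```python
# Back-of-the-envelope bandwidth for a 1080p60 RGB video stream.
# Assumes 24 bits/pixel and no blanking overhead: active video only.
width, height, fps, bits_per_pixel = 1920, 1080, 60, 24

bits_per_second = width * height * fps * bits_per_pixel
gbps = bits_per_second / 1e9                   # ~2.99 Gbps
mbytes_per_second = bits_per_second / 8 / 1e6  # ~373 Mbytes/sec

print(f"{gbps:.2f} Gbps, {mbytes_per_second:.0f} Mbytes/sec")
```

The result, roughly 3Gbps or 373Mbytes/sec, matches the figure quoted above.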
Object Detection and Recognition
Initial object detection typically requires a scan of the entire image looking for key characteristics such as a particular object shape. This function is typically implemented with programmable logic. Once an object is identified as one of potential interest, additional potentially complex algorithms can be run to make further determinations about the object. An application processor can often perform this next level of identification on a smaller data set but with a more complex algorithm.
In the case of the robotic solitaire player, the programmable logic can scan the entire image, identifying card boundaries and locating each playing card, its rank, and its suit. With the data set now significantly reduced, the rank and suit images can be passed to the application processor for identification via image-recognition algorithms.
Algorithmic Decision Making
Algorithmic decision making is typically a complex process best handled by a general-purpose application processor. In our example, a newly exposed card triggers a new set of potential decisions regarding card play or movement; the application processors make those decisions.
Motion Path Selection
While the shortest path between two points is generally considered to be a straight line, such a path can result in hitting an object between the two endpoints. The motion path can end up being multi-segmented and will need to be translated from a traditional Cartesian system to one in which the robot operates. In our multicore heterogeneous example this can be done either with the application processor or with the real-time processor.
With the robotic solitaire player, the problem is somewhat simplified because there are no potential obstructions along the path above the tablet computer. For our example application, we selected a delta robot. Delta robots are typically built with three arms connected to a universal joint, to which the effector (the device designed to interact with the environment) is attached. Thus the motion of the effector in 3D Cartesian space must be translated into movements of each of the three independent motors. In this application, the desired x,y,z coordinates are passed to the real-time processors, which compute the motion path for each of the three arms.
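That Cartesian-to-arm-angle translation is the classic delta-robot inverse-kinematics problem. Below is a sketch using the standard textbook derivation; the geometry constants are invented placeholders, not the demo robot's actual dimensions:

```python
import math

# Inverse kinematics for a delta robot: given the effector position
# (x, y, z), solve for the three shoulder joint angles.
# Geometry constants below are illustrative placeholders only.
F = 120.0   # side length of the fixed (base) triangle, mm
E = 40.0    # side length of the effector triangle, mm
RF = 100.0  # upper-arm length, mm
RE = 300.0  # forearm (parallelogram) length, mm

def arm_angle(x0, y0, z0):
    """Joint angle (radians) for the arm lying in the YZ plane.
    z0 must be nonzero (effector below the base plane)."""
    t = math.tan(math.radians(30.0))
    y1 = -0.5 * t * F        # shoulder joint position on the base
    y0 = y0 - 0.5 * t * E    # shift effector center to its arm joint
    # Intersect the upper-arm circle with the forearm sphere.
    a = (x0**2 + y0**2 + z0**2 + RF**2 - RE**2 - y1**2) / (2.0 * z0)
    b = (y1 - y0) / z0
    d = -(a + b * y1)**2 + RF * (b**2 * RF + RF)
    if d < 0:
        raise ValueError("point is outside the workspace")
    yj = (y1 - a * b - math.sqrt(d)) / (b**2 + 1.0)
    zj = a + b * yj
    return math.atan2(-zj, y1 - yj)

def inverse_kinematics(x, y, z):
    """Angles for all three arms; arm frames are 120 degrees apart."""
    angles = []
    for phi in (0.0, 120.0, 240.0):
        c, s = math.cos(math.radians(phi)), math.sin(math.radians(phi))
        # Rotate the target point into this arm's reference frame.
        angles.append(arm_angle(x * c + y * s, y * c - x * s, z))
    return angles
```

A quick sanity check: with the effector directly below the base center (x = y = 0), symmetry demands that all three computed arm angles come out identical.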
Motor Drive Control
Motor-control algorithms manage acceleration, running, and deceleration, typically optimizing for minimal movement time within mechanical constraints, ensuring that acceleration and deceleration do not damage the part being moved, and optimizing for reduced energy consumption. These computations, along with the motor-drive functions, must run in real time and are best handled by a real-time processor. In our example, the real-time processor performs this control and operates in lockstep for increased reliability.
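One common way to meet those acceleration and deceleration constraints is a trapezoidal velocity profile. A minimal sketch follows; the function and its limits are illustrative, not any particular drive's firmware:

```python
def trapezoidal_profile(distance, v_max, a_max):
    """Return (t_accel, t_cruise, total_time) for a trapezoidal move.

    Accelerate at a_max up to v_max, cruise, then decelerate at a_max.
    Falls back to a triangular profile when the move is too short to
    ever reach v_max.
    """
    t_accel = v_max / a_max
    d_accel = 0.5 * a_max * t_accel**2   # distance covered while ramping
    if 2 * d_accel >= distance:
        # Triangular profile: peak velocity is limited by the distance.
        t_accel = (distance / a_max) ** 0.5
        return t_accel, 0.0, 2 * t_accel
    t_cruise = (distance - 2 * d_accel) / v_max
    return t_accel, t_cruise, 2 * t_accel + t_cruise
```

For example, a 1 m move with a 1 m/s velocity limit and a 2 m/s² acceleration limit yields 0.5 s of ramping at each end plus 0.5 s of cruise, for 1.5 s total.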
Safety Event Detection and Shutdown
A safety event can be a human entering the robotic cell with the risk of the robot injuring the human. It is critical for the system to recognize this event and quickly respond in a manner to protect the human.
For our robotic solitaire player, one might create infrared walls of light beams around the robot. When a beam is interrupted, power to the robot can be removed, resulting in an immediate stop of the system. In this example, the triple-redundant platform management unit can be used. This highly reliable processing element can receive the input from the infrared light walls and, upon event detection, shut the robot down.
Graphical User Interface
Graphical User Interfaces (GUIs) typically run on top of operating systems such as Linux. Linux supports numerous graphical environments starting with basic windows managers and extending to full desktop environments.
The solitaire robotic system requires the display of the solitaire playing surface, a real-time view of the HD camera image, a window displaying card rank and suit detection, and a play-status window. The Ubuntu Desktop environment provides a good platform on which these elements can be displayed and from which the play can be controlled. The multicore application processor is ideal for running Linux and the Ubuntu desktop. The integrated multicore GPU handles the combined 2D, 3D, and video display.
Configuration and Security
Processing systems require booting along with OS and application loading. Programmable logic requires configuration. Developers increasingly wish to protect their code and IP from competitors and hackers. Encryption of code and configuration data and authentication to confirm that the correct code is being loaded is therefore critical. Once operating, the system needs to be protected from external influence.
The configuration and security unit in this example authenticates and decrypts both the code and configuration data for the solitaire player prior to execution. E-fuses can be blown to prevent configuration and data read-back via interfaces such as JTAG.
System attacks can cause information leakage or improper operation. Such attacks can include under/over voltage or under/over temperature. These attacks can be detected and, if required, the system can be locked down.
Early embedded systems typically consisted of one or possibly a few microprocessors tackling a wide variety of functions within a system including user interface, data acquisition, data processing, external control, and application processing. Subsequent generations brought higher-performance processors, multicore processors, function-specific processors, and real-time processors. FPGAs started as glue-logic devices and, as they became larger, were used to implement additional peripherals, state machines, and massively parallel data processing. The latest-generation Xilinx Zynq UltraScale+ MPSoCs now enable single-chip, heterogeneous multiprocessing systems consisting of multicore applications processors, multicore graphics processors, multicore real-time processors, a platform management unit, a configuration and security unit, and multiple processing elements implemented with programmable logic. Such devices enable both hardware and software customization that produces application-specific functionality that meets targeted embedded application requirements efficiently.
Note: This article is based on an Embedded World 2016 paper and presentation.
Photonfocus has just introduced three compact GigE video cameras capable of pumping as much as 400Mbytes/sec of video over a standard GigE interface using real-time 4:1 video compression. The three Quad Rate QR1-D2048x1088(I/C)-G2 cameras are all based on the CMOSIS CMV2000 CMOS image sensor, which is optimized for low light conditions. The three camera models image in black and white, NIR (near IR), and color respectively.
Photonfocus QuadRate Video Camera based on a Xilinx Spartan-6 FPGA
The cameras have a variable frame resolution and can achieve the following frame rates at the stated frame resolutions:
Resolution Frame rate [fps]
2040 x 1088 169
1024 x 1024 358
800 x 600 606
640 x 480 754
Applications such as motion analysis (bionics, sports, and biomechanical analysis), process failure analysis (burst or fracture of tools, failure of handling systems, breakdown of packaging systems), and machine vision all require high frame rates. Today, the preferred camera interface for such applications is GigE because of its standardization and long transmission distances. In addition, multiple-camera systems are easily set up using GigE networking. However, there's ultimately a bandwidth problem when networking multiple high-speed video cameras over a GigE connection.
Initially, Photonfocus developed DoubleRate video-compression technology to transmit more video frames through standard GigE pipes. The technology proved popular and the company’s customers asked for even higher compression rates at high image quality to accommodate even faster frame rates. The result: Photonfocus’ QuadRate compression technology delivers real-time 4:1 video compression without dropped frames while maintaining 100% compatibility with the GigEVision and GenICam standards.
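The bandwidth arithmetic makes the case for 4:1 compression concrete. A rough check, assuming 8 bits/pixel and ignoring transport overhead:

```python
# Sensor output at the camera's full resolution and top frame rate,
# assuming 8 bits/pixel and ignoring readout/protocol overhead.
pixels_per_frame = 2040 * 1088
fps = 169
raw_mbytes_per_sec = pixels_per_frame * fps / 1e6   # ~375 Mbytes/sec

# Raw GigE line rate is 1 Gbps; usable payload is somewhat less.
gige_mbytes_per_sec = 1e9 / 8 / 1e6                 # 125 Mbytes/sec ceiling

ratio_needed = raw_mbytes_per_sec / gige_mbytes_per_sec
print(f"raw: {raw_mbytes_per_sec:.0f} MB/s, "
      f"compression needed: {ratio_needed:.1f}:1")
```

Roughly 3:1 compression is needed just to squeeze the full-rate stream under the GigE line rate, so QuadRate's 4:1 compression leaves headroom for GigE Vision protocol overhead.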
The electronics that implements this sort of compression must fit within the camera, and you cannot get this sort of compression performance using a processor that will fit in the power and volume envelope of a small device like one of the Photonfocus cameras, which measure 55x55x55.7mm (not including lens mount or lens). Consequently, these cameras, like many previous camera models from Photonfocus, rely on an integrated Xilinx Spartan-6 FPGA to implement a variety of functions including the QuadRate 4:1 real-time video compression. Photonfocus has leveraged the flexibility of FPGA-based hardware platforms to develop multiple product lines.
There are again three key system-design lessons embedded in this latest product announcement by Photonfocus:
For additional information about Photonfocus video cameras, see:
By Arnaud Darmont, Founder, CEO and CTO, Aphesa
A customer asked our team at Aphesa to design a high-temperature camera that will operate inside oil wells. The device required a rather large FPGA and had temperature requirements up to at least 125 degrees Celsius—the system’s operating temperature. As a consultancy that develops custom cameras and custom electronics including FPGA code and embedded software, we have experience with high-temperature operation. But for this project, we had to go the extra mile.
The product is a down-hole, dual, color camera designed for use in oil well inspection. It performs embedded image processing, color reconstruction and communication. The system has memory, LED drivers and a high-dynamic-range (HDR) imaging capability. For this project we chose to use the Xilinx Spartan-6 XA6SLX45 device (Spartan-6 LX45 automotive) because of its wide temperature range, robustness, small package, large embedded memory and large cell count.
In most designs where cooling is required, designers use either passive cooling (a heat sink that helps dissipate heat into the air by increasing the surface area in contact with air) or active cooling. Active-cooling solutions typically force an airflow in order to help renew the cold air that absorbs heat above the device. The capability of air to absorb heat depends on the temperature difference between the air and the device, as well as on the pressure of the air. Other solutions include liquid cooling, where the liquid, usually water, replaces air for more efficiency. The capability of a mass of air or fluid to absorb heat is given by the heat absorption equation:
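The relation being referred to is the standard sensible-heat equation, reproduced here from basic thermodynamics, where Q is the heat absorbed, m the mass of the air or fluid, c its specific heat capacity, and ΔT the temperature change:

```latex
Q = m \, c \, \Delta T
```

A larger coolant mass or a higher specific heat capacity lets the medium absorb more heat for a given temperature rise.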
The final approach that designers often use is thermoelectric cooling, where the Peltier effect—a temperature difference created by applying a voltage between two electrodes connected to a sample of semiconductor material—is used to cool one side of a cooling plate while heating the other. Although this phenomenon helps to move heat away from the device to be cooled, Peltier cooling has one big disadvantage: It requires significant external power.
In our case, airflow was not a solution because the quantity of air in the enclosure is limited and the air quickly equalizes in temperature. Water cooling was not possible either, because of the long distance between a water source and the tool. So for us, the Peltier effect was the only cooling option. Since the ambient temperature is effectively fixed (the surrounding fluid mass is far too large for the electronics to heat it appreciably), the thermoelectric cooler will actually reduce the temperature of the electronics. Unfortunately, because the cooling devices require a high current and a very long conductor connects the surface to the tool, only limited current is actually available for cooling, so only a small temperature drop is achieved.
Moreover, our device is a camera, and image quality degrades exponentially with temperature. Therefore, we had to optimize our cooling strategy to cool the image sensors as much as possible rather than the other devices such as FPGAs, memories, LED drivers, or power-supply circuits.
Since the limited Peltier cooling budget was devoted to the image sensors alone, cooling the FPGA was almost impossible; our only option was to reduce the peak temperature inside the FPGA.
To overcome these varied challenges in our design for an oil well camera, we implemented several solutions. One of the most important decisions was choosing the right size device. Larger devices have more static power consumption but allow for more spread of heat into the device, avoiding hot spots.
Devices qualified for automotive use have a long lifetime at elevated temperature and it is acceptable to have a shorter lifetime in industrial applications. We have evaluated the code in LX25 and LX45 devices of the XA (automotive) series and measured the total power consumption and the temperature of the device’s body. Sometimes it is acceptable to have a higher average device temperature if the peak temperature is less. We also evaluated the lifetime in accelerated aging tests.
Our next design choice was to place a limit on the device usage. In order to reduce the heat dissipated by the device, we avoided using all possible logic cells and memory. The unused parts of the device consume static power but not dynamic power.
We also performed clock gating. As dynamic power depends on the clock rate, we can use clock gating to cancel the dynamic power consumption of the blocks that are not in use. If the clock tree does not toggle, the power consumption in that part of the device is reduced. We also kept the number of I/Os we were using to a minimum. This in turn lowered the power consumption of the I/O banks.
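The savings from clock gating follow from the familiar CMOS dynamic-power relation P = α·C·V²·f: gating the clock drives the switching activity α of a block to zero. A toy illustration, with all constants invented for the example:

```python
def dynamic_power(alpha, c_eff, vdd, freq):
    """CMOS dynamic power: P = alpha * C * V^2 * f, where alpha is the
    switching-activity factor of the clocked block."""
    return alpha * c_eff * vdd**2 * freq

# Invented illustrative numbers: 1 nF effective switched capacitance,
# 1.2 V core supply, 100 MHz clock, 10% average switching activity.
p_clocked = dynamic_power(0.10, 1e-9, 1.2, 100e6)  # block's clock running
p_gated = dynamic_power(0.0, 1e-9, 1.2, 100e6)     # clock gated: alpha -> 0

print(f"clocked: {p_clocked * 1e3:.1f} mW, gated: {p_gated * 1e3:.1f} mW")
```

Static (leakage) power is unaffected by gating, which is why the article also limits device usage and spreads the design across two FPGAs.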
Then, by using some I/Os as virtual grounds, we reduced the distance traveled by the current inside the device and therefore reduced the Joule effect in power routing. The virtual grounds also help with conducting heat into the ground plane.
Since we did not want to use all I/Os and all logic cells, we chose to spread the design over two FPGAs. This means that the heat is dissipated at two separate locations.
We also used multiple ground planes. This technique helps conduct the heat from the warmer areas to the cooler areas and also provides additional heat capacitance. It’s important to design the thermal planes for board reliability against delamination during temperature cycles.
Another important step was to optimize our code to reduce the clock rate. Reducing the clock rate reduces the power consumption but also allows the device to run at a higher temperature. As an example, we evaluated the tradeoff between slow parallel design and fast pipelined design.
Combining all of the above techniques, we came away with a camera that works at an ambient temperature of 125 degrees Celsius while performing SDRAM management, bus communication, and image processing, even though the device is limited by specification to a junction temperature of 125 degrees. Moreover, we managed to reach 125 degrees without thermoelectric cooling.
Note: This blog post was abstracted from an article that appeared in the latest issue of Xcell Journal, Issue 94, where you will find the full article.
The new video below shows you a pair of operating 100G Ethernet systems—one using 100GBase-KR4 and the other using 100GBase-CR4 electrical standards for backplane and direct attach copper interconnect. Both systems are based on Xilinx Virtex UltraScale All Programmable devices and both employ the Virtex UltraScale 30Gbps GTY SerDes ports operating at 25.78Gbps per lane over four lanes, delivering an aggregate 100G Ethernet bandwidth, without the need for retimers.
The systems drive the high-speed backplane and the 5m of direct attach copper cabling directly from the outputs of the Virtex UltraScale FPGA. As Martin Gilpatric (Transceiver Technical Marketing Manager at Xilinx) says in the video, this is game-changing technology for data center equipment designs. You can make your next-generation designs both less expensive and more power efficient using this technology. Gilpatric says that this achievement highlights two key features of the Virtex UltraScale 30G GTY SerDes transceivers: superior high-speed clocking performance and full adaptive auto-equalization.
As I often do when discussing Xilinx high-performance SerDes transceivers, I’ll put it more directly: bulletproof.
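The lane arithmetic behind that aggregate figure is worth spelling out: with 64b/66b line coding, four lanes at the standard rate of 25.78125 Gbps carry 100 Gbps of Ethernet payload.

```python
# Payload bandwidth of a 4-lane 100G Ethernet link.
lanes = 4
line_rate_gbps = 25.78125    # exact per-lane serial rate for KR4/CR4
coding_efficiency = 64 / 66  # 64b/66b line-code overhead

payload_gbps = lanes * line_rate_gbps * coding_efficiency  # ~100 Gbps
print(f"{payload_gbps:.1f} Gbps")
```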
Here’s the 4.5-minute video:
Here’s another Zynq-based product demo from the recent NAB 2016 show in Las Vegas: Pathpartner Technology's HEVC decoder IP. The 1-minute video below shows Pathpartner’s HEVC decoder IP running on a Xilinx Zynq Z7045 SoC and decoding 4Kp30 video in real time. The same IP can handle 4Kp60 video when implemented on Xilinx UltraScale All Programmable devices.
For more information about this HEVC Decoder IP, please contact Pathpartner directly.
Keysight designed a Xilinx Virtex-6 FPGA into its U5303A PCIe 12-bit High-Speed Digitizer with On-Board Signal Processing module and now the company has further leveraged that on-board FPGA processing power to improve the card’s on-board FFT capabilities. The module has a dc to 2GHz bandwidth, which allows conversion of even very low frequencies not observable with ac-coupled digitizers. Keysight’s new FFT option for the U5303A module can compute an FFT on 32,768 samples in 10.24 μsec and can select single- or dual-channel operation on the fly.
Keysight U5303A PCIe 12-bit High-Speed Digitizer with On-Board Signal Processing module
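That FFT timing implies substantial sustained throughput: 32,768 samples every 10.24 µs works out to 3.2 Gsamples/sec. The arithmetic:

```python
# Sustained throughput implied by the quoted FFT benchmark.
samples = 32768          # FFT length
fft_time_s = 10.24e-6    # time to compute one FFT

throughput_gsps = samples / fft_time_s / 1e9  # ~3.2 Gsamples/sec
print(f"{throughput_gsps:.1f} Gsamples/sec")
```

In other words, the on-board FFT can keep pace with a multi-Gsample/sec input stream in real time rather than post-processing captured buffers.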
Keysight lets you make considerable use of the U5303A's on-board FPGA through the optional U5340A FPGA Development Kit, which allows you to develop custom, high-speed, hardware-based signal processing and integrate it directly into the digitizer card's programmable hardware. The kit lets you interface the card's two high-speed ADCs to hardware processing, on-board DDR3 SDRAM, and the card's PCIe host interface. Here's an illustration of the available resources:
Diagram showing the components of the Keysight U5340A FPGA Development Kit for High-Speed Digitizers
These Keysight products illustrate the advantage of adding uncommitted hardware resources to a product as a form of future-proofing through configurable platforms based on Xilinx All Programmable devices. The company announced the U5303A High-Speed Digitizer module on June 21, 2013 (back then, Keysight was part of Agilent) and has regularly introduced upgrade options that employ the module's on-board FPGA to enhance the product's capabilities. The recent announcement of the enhanced FFT option is merely the latest in a long series of improvements made over the past three years, and Keysight has not needed to alter the digitizer module's board design to accomplish them. All of the enhancements take the form of additional software and on-board FPGA configuration changes.
Note that the U5303A High-Speed Digitizing module is not the only product in the Keysight high-speed digitizer line to take this approach. The company announced the M9203A PXIe 12-bit High-Speed Digitizer/Wideband Digital Receiver based on a very similar hardware architecture (including the Xilinx Virtex-6 FPGA) with the custom programming capability from the same U5340A FPGA Development Kit option.
You’ll find Keysight’s announcement of this new FFT option for the U5303A card here.
For more information about the U5303A and U5340A, please contact Keysight directly.
For more information about the U5340A FPGA Development Kit, see “Keysight ups PCIe and AXIe game—lets you add signal processing to the FPGAs in its high-speed digitizers.”
As of today, you can watch a short, 2.5-minute video showing the Xilinx UltraScale+ PCIe Gen3 x16 integrated block for PCI Express in action, working quite successfully with an Intel Skylake processor. The hardened, integrated Xilinx UltraScale+ PCIe Gen3 x16 interface passed PCI-SIG compliance testing this month and it's the industry's first Gen3 x16 PCIe solution built into a programmable device.
The video below shows the PCIe Gen3 interface operating at 12.65Gbytes/sec (100Gbps+) over real hardware. You’ll find this hardened, integrated PCIe Gen3 interface core in Virtex UltraScale+, Kintex UltraScale+, and Zynq UltraScale+ MPSoC family members. Xilinx Vivado HLx tools already support this advanced feature.
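For context, the theoretical ceiling of a Gen3 x16 link after 128b/130b encoding is about 15.75 Gbytes/sec, so 12.65 Gbytes/sec corresponds to roughly 80% link efficiency, a plausible figure once TLP header and flow-control overhead are counted. A quick check:

```python
# Theoretical PCIe Gen3 x16 bandwidth vs. the demonstrated throughput.
lanes = 16
gen3_gtps = 8e9     # 8 GT/s per lane
coding = 128 / 130  # 128b/130b encoding efficiency

ceiling_gbytes = lanes * gen3_gtps * coding / 8 / 1e9   # ~15.75 GB/s
measured_gbytes = 12.65                                 # from the demo
efficiency = measured_gbytes / ceiling_gbytes

print(f"ceiling: {ceiling_gbytes:.2f} GB/s, efficiency: {efficiency:.0%}")
```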
Here’s the video:
Opal Kelly has just announced the XEM7360 USB 3.0 integration module that melds the All Programmable capabilities of a Xilinx Kintex-7 FPGA (a Kintex-7 160T or 410T) with 2Gbytes of DDR3 SDRAM and a USB 3.0 port. This module is designed to serve as a ready-to-use, turnkey solution for hardware prototyping or production of systems that require multiple Gbps transceivers (12.5Gbps max data rate), a lot of fast I/O, and high-performance programmable hardware all supplied by the on-board Kintex-7 FPGA. Applications include video and image capture and processing, high-speed data acquisition including JESD204B interfacing capability, digital communications, cryptography, security, and high-speed coprocessing.
XEM7360 USB 3.0 integration module
Here’s a block diagram of the XEM7360 module:
XEM7360 Block Diagram
The XEM7360 module is bundled with the company’s FrontPanel SDK, which makes it easy for you to interface this module to a PC, Mac, or Linux host using an API that supplies a proven method for communicating with and configuring the module.
For more information about the XEM7360, please contact Opal Kelly directly.
NGCodec used the recent NAB 2016 show in Las Vegas to roll out a demo of its new real-time HEVC/H.265 hardware encoder running on a Xilinx Kintex UltraScale KU060 FPGA. The HEVC/H.265 hardware encoder reduces the video bit rate by approximately 100:1 (reduced to 3Mbps!!!) relative to a non-encoded bit stream while producing a picture that’s virtually indistinguishable from existing 1080p HD video delivered over HDMI. In addition, NGCodec’s HEVC/H.265 hardware encoder has super low latency—about one video frame in this demo.
Here’s a block diagram of the NGCodec demo:
And here’s a very short video of NGCodec’s HEVC/H.265 hardware encoder demo shot at NAB 2016:
NGCodec has already mapped its HEVC/H.265 hardware encoder to Kintex UltraScale, Zynq UltraScale+ MPSoC, and Virtex UltraScale+ devices, allowing you to migrate your design over time to Xilinx's increasingly advanced FPGA families to reduce power, lower BOM costs, and increase functionality. As NGCodec's Web page says: "Over time those new FPGAs will outperform the current ASICs."
Please contact NGCodec directly for more information about its HEVC/H.265 hardware encoder IP.
By Adam Taylor
This is another blog that I should have written much sooner. While working with the Avnet EVK (Embedded Vision Kit), it has become apparent that compilation times can sometimes become excessively long. This is especially frustrating when we only change one little area of a design. So I am going to explain what Out-Of-Context (OOC) compilation is and how it saves time in our build process, which is important because it helps us become even more productive.
Vivado provides us with three options when it comes to synthesis:
There are a number of ways to select the synthesis option we want. The most obvious is when we generate the output products for a block diagram: Vivado pops up a box very similar to the one below, allowing you to easily select your choice.
Pop-Up window with Synthesis Option Selection
Alternatively, if you are designing with a pure HDL design you can also right click on the HDL design file and select “Set as Out-Of-Context For Synthesis” from the available options, which then provides a pop-up box where you can configure the settings and provide a path to the desired OOC XDC files needed.
Configuring the HDL OOC
Within the sources window of Vivado HLx you can see if a module is configured for OOC compilation. It will have a yellow box by its name.
Identifying OOC configured blocks and files
One interesting thing that the OOC-Block option allows you to do is create a block diagram with the core of your design, synthesize it using the OOC Block design, and then add another top-level file that contains both the OOC Block and other modules that add functionality either through HDL code or block diagrams. This approach to synthesis saves considerable compilation time, as shown in the diagram above. All you need to do is create your own top-level HDL file that ties everything together.
There are a couple of potential areas where you need to be careful when using the OOC-IP flow. The first arises when you package your own IP modules for use within an OOC-IP flow: you need to ensure these modules are correctly defined not to use the OOC-IP flow themselves. If you get error Project 1-486 during implementation, you have likely made this mistake.
The second arises because the OOC-IP flow is bottom-up: if generics are customized at a higher level, Vivado HLx will use the default settings instead, so the synthesized behavior may not be what you want.
OOC can save considerable time in our implementation flow. For further reading check out the following user guides:
The code is available on GitHub as always.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
You also can find links to all the previous MicroZed Chronicles blogs on my own Web site, here.
By Zongbo Wang, Bruno Camps, Yan Li, Joshua Fraser, Aerotenna; Lianying Ji and Jie Zhang, Muniu Technology
The unmanned aerial vehicle (UAV) and drone industry is quickly growing and reaching new commercial and consumer markets. The horizon of what is possible with UAVs continues to push forward into creative new applications like 3D modeling, military aid and delivery services.
The problem is that the applications are becoming increasingly complex, requiring more and more processing power and I/O (input/output) interfaces, while the available UAV platforms are not improving at the same pace. The limits on the capabilities of most UAV platforms are being reached as the software and hardware needed for flight continue to advance.
Our team at Aerotenna has successfully flown a UAV with a board we built based on the Xilinx Zynq-7000 All Programmable SoC. This flight marks the beginning of our plans to release microwave sensing products that are computationally heavy. We turned to the Zynq SoC because the needed processing power was unavailable with other solutions of its class. With this new platform (Figure 1), we plan to improve the unmanned flying experience by deploying our microwave-based collision avoidance systems.
Limits of Today’s UAV Technology
The main push of the UAV industry has been to make flying as affordable as possible, simplifying and stripping all unnecessary capabilities. This is a good thing if you are just looking to buy a product that does only what you want in the simplest way. But for developers like us, who seek to explore new, complex applications, it was necessary to branch out and build our own UAV platform capable of providing the processing speeds to power our ideas.
Another big limitation of today’s standard UAV platforms is the lack of input/output connections to the processor. Thus, the flight control system easily reaches the maximum capacity of the processor and the I/O capabilities, leaving little room for new sensors and new applications.
Most of the I/Os included in the standard processor boards are already used up by the various components needed for flight. These functions include the inertial measurement sensors for quantifying aircraft orientation, the barometer and altimeter for determining the altitude, and an RC receiver for decoding the user’s input. Any I/O that’s left over for adding extra features does not offer much in the way of options, generally confined to the most popular demands such as a camera or GPS for navigation. A single platform that is compatible with a very wide range of sensors and external interfaces currently does not exist on the market.
Here at Aerotenna, we believed the way to overcome these limitations was to create a new board design from scratch. We have been working to perfect a new UAV platform that will excel in the areas where the other platforms fail. We used the Zynq SoC device provided by Xilinx to achieve this goal. Its superior design will offer the greatly increased processor speed and I/O capabilities needed for the next generation of UAVs.
Why the Zynq SoC?
We chose the All Programmable Zynq SoC as the foundation on which to build our powerful platform. The dual-core ARM Cortex-A9 APU within the Zynq SoC chip allows for unparalleled processor speed. Nothing on the market for affordable UAV platform solutions compares with the Zynq SoC in chip structure, multiprocessor capabilities and I/O access speeds. Thus, the Zynq SoC is the perfect candidate for the next-generation platform.
Most of the flight control platforms currently in the market are based on a microcontroller unit (MCU). That architecture limits the potential for sensor fusion due to the limited processing power and I/O extension capabilities.
The advantage of the Zynq SoC is clear in both processing power and I/O capability: the combination of dual ARM cores plus FPGA logic enables a hardware/software co-design approach that places some of the timing-critical processing tasks in the programmable logic. The I/O peripherals and memory interfaces are more versatile than the ones provided by MCU-based platforms.
Another reason we chose the Zynq SoC is that it easily handles the complexities of flight control programs, which can be enormous and require very fast CPUs. And the device has plenty of power left over, which leaves a lot of room for expansion in the flight control programs.
There are many types of flight software programs, and they all differ in behavior and complexity. The flight control software we have decided to use is called ArduPilot, provided by APM (ArduPilot Mega) on Dronecode. It is more complex than most, but provides a lot of functionality not included in simpler programs, such as waypoint navigation and multiple flight modes to cater to the user’s specific application.
What is ArduPilot?
ArduPilot is an open-source autopilot software program built for UAVs. It is kept up to date and improved upon by a large community of developers and enthusiasts. The APM project, which started out as much simpler software built for the open-source Arduino microcontroller board, has grown much larger and more complex and is now compatible with many UAV platforms. Currently, the program contains more than 700,000 lines of code and is a beautifully intricate flight control system.
The code is divided into two main parts: the high-level layer and the hardware abstraction layer (HAL). The high-level layer is responsible for scheduling tasks and making decisions based on incoming data. The HAL is the low-level code that accesses the hardware’s memory. This separation of code structure allows the whole system to be ported to other platforms by changing only the HAL for the platform-specific memory access. And the upper-level code simply retrieves the data from the HAL the same way across all platforms.
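The high-level/HAL split described above can be sketched in a few lines of C. Note that the names and structure here are purely illustrative, not ArduPilot's actual API; the point is only that the flight logic sees a table of function pointers and never touches platform-specific memory access directly.

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative HAL: the high-level code sees only this table of
 * function pointers. Porting to a new board means supplying a new
 * table, not changing the flight logic. */
typedef struct {
    float (*read_baro_pa)(void);                  /* barometric pressure, Pa */
    void  (*write_motor_us)(int ch, uint16_t us); /* motor PWM pulse width   */
} hal_t;

/* Platform-specific back-end: here, stubs that model a fixed reading
 * and record the last motor command. */
uint16_t last_pulse[8];
static float stub_read_baro_pa(void) { return 101325.0f; }
static void stub_write_motor_us(int ch, uint16_t us) { last_pulse[ch] = us; }

const hal_t hal = { stub_read_baro_pa, stub_write_motor_us };

/* High-level code is identical on every platform: it calls through
 * the HAL table, using a crude barometric formula near sea level. */
float altitude_m(const hal_t *h) {
    return (101325.0f - h->read_baro_pa()) / 12.0f;
}
```

With this structure, the port to the Zynq SoC amounts to writing a new back-end table; `altitude_m()` and its callers stay unchanged.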
The ArduPilot flight control system continues to grow in complexity as the open-source community adds more and more to the APM project. As a result, the industry is reaching the hardware’s limit, waiting for the next platform on which to continue growing.
The initial efforts of porting ArduPilot to the Zynq SoC (led by John Williams in the drones-discuss Google group) in 2014 paved the way for our port of the APM to the same Xilinx platform. Dr. Williams saw in the Zynq SoC’s potential for custom I/O and real-time image processing the beginning of an amazing new world for UAVs. In an interesting twist, Williams was the founder of PetaLogix, which created the original PetaLinux tools. Xilinx acquired the company in 2012.
The Aerotenna team continued these design efforts in both hardware and firmware, and accomplished the first Zynq SoC-powered ArduPilot flight in October of 2015. Our custom board runs the ArduPilot flight control software on the PetaLinux operating system. This impressive feat marks a drastic improvement in UAV technology and capability.
The dual ARM cores within the Zynq SoC put our flight control solution far ahead of any other UAV solution of its class in processing power and I/O capabilities. This leap forward will open the door to many new UAV applications that require greater computing power. We want to make sure to provide something with plenty of hardware interfaces built for the enthusiast as well as for the developer. Atop a Linux operating system, the UAV platform has much more flexibility to accommodate a very wide range of applications because of Linux’s programmability and versatility. As one of the most powerful user-programmable operating systems, Linux allows our team to customize the system exactly to our needs.
We accomplished the flight test on the commercial off-the-shelf DJI F550 airframe and plan to test our Zynq SoC-based flight controller on more airframes. We will soon release this platform as part of the Octagonal Pilot On Chip (OcPoC) platform.
To start completely from scratch to make a customized flight control platform is an ambitious endeavor that takes the perfect team of engineers, and a lot of learning, to accomplish. Starting from nothing, there are a lot of decisions that need to be made about the system. In order to run a flight control program, an operating system must be used. A real-time operating system (RTOS) processes data right as it comes in, resulting in a negligible buffering delay. As a result, an RTOS is great for running time-sensitive tasks like flight control. The disadvantage is that it is difficult to interface this kind of system with ArduPilot because some of the data processing tasks would need to be reimplemented in the operating system itself.
That’s why we opted instead for a Linux operating system, which is not real-time, but is much easier to implement in a hardware/software co-design, maximizing the versatility of the system. Xilinx provides a powerful embedded Linux operating system called PetaLinux that is compatible with the Zynq SoC and other Xilinx devices.
The road map to getting this system up and running seemed complicated, and we had to overcome many difficult challenges. The process began with developing the system design in the FPGA development software and writing new intellectual property (IP) cores for the driver interfaces. These IP cores are hardware-level processing blocks that handle data at very high speeds. Then, a PetaLinux operating system must be deployed using the custom system design. Finally, we compiled and modified the ArduPilot system to run on PetaLinux and the new platform.
We tackled the problem in stages to achieve a proof of concept. Our team began working to first receive and detect an RC signal, and then to power a single motor. After finally demonstrating this proof of concept, we proceeded to expand the interface between the OcPoC and sensors that ArduPilot relies on to function. Writing our own device drivers from scratch was a major challenge. The most critical sensors for achieving a successful flight include accelerometers, gyroscopes and a barometer. Since multirotor vehicles are naturally unstable, measuring the frame’s inertia and altitude is critical for stability. All these had to be configured in our FPGA hardware design with the correct communication protocol, and eventually included into our PetaLinux operating system.
The ArduPilot code base consists of more than 700,000 lines of code, so one big task was to get the system operational on a brand-new platform. With no easy interface to calibrate the inertial sensors, motors and RC controller (normally done inside a nice graphical user interface for other platforms), we had to manually calibrate the whole system by tweaking the hundreds of stored parameter values. Calibration is necessary, since each hardware component is slightly different and will produce slightly different outputs. So you must define the maximum and minimum values produced by each component. The process finally ended in a smooth and sustained flight for the Aerotenna team.
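As a sketch of what that calibration buys you, here is the kind of mapping those stored minimum and maximum values make possible. The values and names are hypothetical, not actual ArduPilot parameters; the idea is simply that each component's raw extremes map its output onto a common scale.

```c
#include <assert.h>

/* Hypothetical stored calibration for one component: the raw extremes
 * measured during calibration, e.g. an RC channel's pulse widths. */
typedef struct { int raw_min, raw_max; } cal_t;

/* Map a raw reading onto a common 0.0 .. 1.0 scale using the stored
 * calibration, clamping out-of-range readings. */
float normalize(cal_t c, int raw) {
    if (raw < c.raw_min) raw = c.raw_min;
    if (raw > c.raw_max) raw = c.raw_max;
    return (float)(raw - c.raw_min) / (float)(c.raw_max - c.raw_min);
}
```

Because every sensor, motor, and RC channel produces slightly different raw extremes, each needs its own `(min, max)` pair, which is why tweaking hundreds of stored parameters by hand was such a chore.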
Introducing the OcPoC
The OcPoC project (Figure 2) is Aerotenna’s UAV flight control platform. With it we plan to meet the needs of the drone community with greatly enhanced processing capability, I/O expansion and much more flexibility in programming than other solutions. Though using the Zynq SoC to power our system may seem to be overkill for running the current ArduPilot release, we foresee this industry continuing to expand and want to provide the potential that will soon be utilized.
This architecture paves the way for developers to create and design with all the processing power they need. With this new platform, we plan to introduce new applications of microwave technology in imaging, mapping and proximity detection compatible with the OcPoC. Our system will also be able to perform onboard data capture and analysis through the processing capabilities of the Zynq SoC chip.
Our platform will provide integrated IMU data acquisition, which will create a ready-to-fly “box” without any additional sensor setup (Figure 3). It will also provide integrated navigation interfaces for any type of wireless navigation control. What takes our platform a step ahead is the ability for any external sensor data to be directed through the Zynq SoC to perform high-speed data processing simultaneous to the ArduPilot program. This is not possible with MCU-based platforms.
The Zynq SoC’s extra processing power can also handle more complicated flight control systems to more finely tune the performance of the UAV. This includes expanding the I/O capabilities to a much wider range of external interfaces and sensor options, like real-time video streaming, microwave proximity sensors and Bluetooth.
Our hope is that by making our platform easy to test and develop new ideas on, many other companies and individuals will contribute novel creations to the drone industry, unhindered by the processing limitations of today’s available hardware.
Note: This article appeared in the latest issue of Xcell Journal, Issue 94.
Admit it, you’re more than curious about the technical specs of the new Xilinx UltraScale+ devices including the new Virtex UltraScale+ family. After all, they’re the first Xilinx devices to be based on the TSMC 16nm FinFET process and you can expect to double the system-performance level per watt versus the extremely successful 28nm Virtex-7 device family—or perhaps even better. Satisfy your curiosity immediately by downloading the just-released Advance Product Specification for the Virtex UltraScale+ family.
Yesterday, National Instruments (NI) unveiled a new mmWave Transceiver System, which serves as a modular, reconfigurable SDR platform for 5G R&D projects. This prototyping platform offers 2GHz of real-time bandwidth for evaluating transmission systems designs in the mmWave E band, which is 71-76GHz for NI’s modular transmit and receive radio heads. You can prototype unidirectional and bidirectional single-antenna and MIMO systems using one or more pairs of these radio heads in conjunction with the transceiver system’s modular PXIe processing chassis.
National Instruments mmWave Transceiver System
The block diagram for this mmWave transceiver system shows that it relies heavily on FPGAs—specifically Xilinx All Programmable devices—to perform the required real-time processing in both the transmitter and receiver chains. Here’s the system block diagram:
National Instruments mmWave Transceiver System Block Diagram
The key to this system’s modularity is NI’s 18-slot PXIe-1085 chassis, which accepts a long list of NI processing modules as well as ADC, DAC, and RF transceiver modules. For the NI mmWave Transceiver System, critical processing modules include the NI PXIe-7976R FlexRIO FPGA module—based on a Xilinx Kintex-7 410T FPGA—and the NI PXIe-7902 FPGA module—based on a Xilinx Virtex-7 485T.
NI PXIe-7976R FlexRIO FPGA module based on a Xilinx Kintex-7 410T FPGA
NI PXIe-7902 FPGA module based on a Xilinx Virtex-7 485T
The NI mmWave Transceiver System maps the different mmWave processing tasks to multiple FPGAs, depending on the particular configuration, in a software-configurable manner using the company’s LabVIEW System Design Software, which provides deep hardware control even down into the FPGAs distributed in the system’s various PXIe processing modules. NI’s LabVIEW relies on the Xilinx Vivado Design Suite for compiling the FPGA configurations. The FPGAs distributed in the NI mmWave Transceiver System provide the flexible, high-performance, low-latency processing required to quickly build and evaluate prototype 5G radio transceiver systems in the mmWave band.
NI has posted a 2-minute video of 5G mmWave proof-of-concept work it’s done with Nokia Networks over the past year using early versions of the mmWave Transceiver System. Using this system, NI and Nokia Networks developed one of the first mmWave communication links capable of streaming data at 10Gbps. The quick-prototyping nature of NI’s transceiver prototyping system along with the graphical LabVIEW development environment saved Nokia Networks a year’s development time (!!!), according to the estimate in this video:
Samtec recently coupled a pair of its high-speed FireFly optical fiber Micro Engines to a Xilinx Virtex UltraScale VU095 FPGA and communicated at an aggregate rate of 100Gbps over 100m of OM3 optical fiber, error free. The test setup included a Xilinx Virtex UltraScale FPGA VCU1287 Characterization Kit, two Samtec FireFly Test Kits, and two Samtec FireFly Active Optical Cable Micro Flyover Assemblies. In the exercise, Samtec configured four of the Virtex UltraScale VU095 FPGA’s GTY SerDes ports for 25.78Gbps operation for an aggregate throughput of 100Gbps. Here’s a block diagram of the setup:
And here’s a photo of the Samtec setup:
Samtec employed the IBERT features associated with the Virtex UltraScale FPGA’s GTY SerDes transceivers and the latest Vivado Design Suite HLx Editions. You might be interested in the resulting eye diagram from this demo setup, so here it is:
The dramatic thing about this eye is that there is nothing dramatic to see. It’s a nice, open eye. Exactly what you want to see.
What does one of these Samtec FireFly optical modules look like? Here’s a closeup photo:
Samtec developed the FireFly system to help designers route high-speed serial data across a board or within a multi-board system without weaving these signals through the pc board. As you can see, the high-speed Virtex UltraScale GTY SerDes transceivers pair extremely well with this Samtec system.
For more details, see this Samtec blog post.
Please contact Samtec for more information about the company’s FireFly optical system.
For additional coverage of the Samtec FireFly system, see:
Concurrent with the NAB 2016 show in Las Vegas, Image Matters has launched the Origami Ecosystem for developing advanced 4K/8K, HFR (high frame rate), and HDR (high dynamic range) video designs using the company’s Origami B20 module, which in turn is based on a Xilinx Kintex UltraScale KU060 FPGA. The ecosystem joins companies determined to boost innovation in advanced video hardware development. The Origami Ecosystem connects audiovisual distributors, IP core vendors, hardware providers and development tool vendors under a single banner for both small and large video design projects. Pre-validated hardware and IP cores (including SDI, Ethernet and PCIe interfaces as well as image codecs, signal wrappers and image-processing blocks) together with the Origami modular architecture and development flow permit cost-effective development for new applications. Initial announced members of the Origami Ecosystem include Adeas, Embrionix, Fidus, Image Matters, inrevium, IntoPIX, Omnitek, PLDA, QuickPlay, Samtec, SOC Technologies, Tokyo Electron Device, Village Island, and Xilinx. Image Matters says more member companies are pending approval.
Image Matters Origami B20 Module
For more information about Image Matters’ Origami module, see:
Kevin Morris has just seen the Xilinx PAM4 SerDes demo that I wrote about previously and he blogged it in an EEJournal article titled “This One Goes to 58! Xilinx Ups the SerDes Ante.” In the article, Kevin writes:
“The last node sported 28Gbps SerDes transceivers, so, clearly, this time we should be dealing with 56. I started to ask about the discrepancy, but somehow I knew the answer already:
“Yeah, but this one goes to 58!”
Later in the article, he writes:
“The basic strategies for achieving success with PAM-4 at these rates are similar to what we’ve seen before: transmitter pre-emphasis and receiver equalization. For 56 Gbps PAM-4, though, it’s the receiver that has the heaviest burden to carry. The 9dB or so of vertical space you lose by switching from NRZ to PAM-4 is hard to recover, and the raw data before equalization over most channels is sobering, to say the least. Nothing that really resembles an “eye” can be seen. But the DSP in the automatic equalization strategies of the Xilinx receivers is remarkable, and the recovered signal is clean as a whistle.”
You might want to read this one.
For previous Xcell Daily coverage of the Xilinx PAM4 SerDes technology, see:
Xilinx announced SmartConnect interconnect automation for optimizing interconnects in complex systems built using Xilinx All Programmable devices based on the UltraScale architecture more than a year ago. Today, Xilinx announced that the recently released 2016.1 release of the Vivado Design Suite HLx Editions now incorporates extensions to the SmartConnect technology, including new AXI SmartConnect IP, that give you an unprecedented performance boost for system designs that use 16nm UltraScale+ devices—2x better than systems based on devices built with 28nm process technology. (Note: For the Vivado 2016.1 announcement, see “Vivado Design Suite—HLx Editions version 2016.1 now online, ready for download.”) Last year when I wrote about SmartConnect technology (see “SmartConnect: Interconnect design automation for UltraScale+ that cuts system area and power by 20% to 30%”), I could only write in general terms because the specifics were not public. With today’s announcement and the associated White Paper, I can now give you a lot more technical detail about this performance-boosting technology.
With this latest Vivado HLx 2016.1 release that includes AXI SmartConnect IP, Xilinx has extended SmartConnect technology with optimization techniques including useful skew optimization, time borrowing, retiming, and pipeline analysis that identify and mitigate system-performance bottlenecks without requiring heavy manual optimizations, extra latency insertion, or costly architecture redesign. Ideally, you’d like to have a highly automated way of optimizing all of this, including interconnect structures.
For devices based on the UltraScale architecture, Xilinx calls that automated interconnect optimization technique “SmartConnect.” SmartConnect technology boosts performance per watt of AXI interconnect by optimizing interconnect networks for performance and area, within the specific interconnectivity requirements inherent to the overall design.
Consider clock skew. The way we go fast in logic design is through the age-old technique called pipelining. Seymour Cray used this logic design technique during the 1960s to build what were then the world’s fastest mainframe computers. We put controlled amounts of logic between registers to divide and pipeline the work to be done. If we get things just right, there’s exactly the same amount of logic—and exactly the same amount of delay—between each pair of pipeline registers so that the entire pipeline runs at some maximum frequency. Only rarely do we get things just right, and increasingly, wire delay plays a large role in the overall delay of each pipeline stage, so delays are never truly equal. There’s always one slowest delay in a pipeline that limits the overall pipeline clock frequency.
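The argument that the single slowest stage sets the pipeline clock can be captured in a toy C model. The stage delays below are purely illustrative numbers, not measurements from any device:

```c
#include <assert.h>

/* Toy model of the pipeline-frequency argument: the clock period of a
 * pipeline is set by its single slowest register-to-register stage, so
 * f_max = 1 / max(stage_delay). Delays are in nanoseconds. */
double fmax_mhz(const double *stage_delay_ns, int n) {
    double worst = 0.0;
    for (int i = 0; i < n; i++)
        if (stage_delay_ns[i] > worst)
            worst = stage_delay_ns[i];
    return 1000.0 / worst;   /* period in ns -> frequency in MHz */
}
```

Balancing the other stages to 2.0ns does nothing if one stage sits at 2.5ns; the whole pipeline is stuck at 400MHz until that one worst delay is reduced, which is exactly the bottleneck the techniques below attack.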
The sledgehammer approach to fixing this problem is to add more pipeline registers and to divide the logic between registers ever more finely to produce ever shorter logic delays and to reduce wire delays. Although this technique works, it adds physical registers and pipeline latency. Adding registers increases power and energy consumption. If you really wanted to brute-force this approach, you’d sprinkle pipeline registers all across your FPGA just in case you might need them. This approach adds die area, degrades pipeline latency, and increases static and dynamic power consumption, which explains why Xilinx did not take this approach with SmartConnect technology.
Instead, Xilinx designed several features into UltraScale+ devices including programmable delays in the leaf-clock buffers so that the Vivado design tools can adjust clock skew on a leaf-by-leaf basis to fully exploit useful clock skew in system designs. These leaf-clock buffers each have five discrete delay-tap settings that allow the Vivado router to automatically optimize clock delays. This feature is one aspect of the “ASIC-like clocking” available in All Programmable devices based on the Xilinx UltraScale architecture.
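A toy C model conveys the idea of that tap selection: the router simply picks, per leaf-clock buffer, the discrete delay tap that best matches the useful skew it wants at that leaf. The five tap-delay values here are illustrative placeholders, not the silicon's actual delays:

```c
#include <assert.h>
#include <math.h>

/* Illustrative model of leaf-clock buffer tap selection: each buffer
 * offers five discrete delay taps; choose the tap whose delay is
 * closest to the useful skew the router wants at this leaf. */
int best_tap(const double taps_ns[5], double wanted_ns) {
    int best = 0;
    double err = fabs(taps_ns[0] - wanted_ns);
    for (int i = 1; i < 5; i++) {
        double e = fabs(taps_ns[i] - wanted_ns);
        if (e < err) { err = e; best = i; }
    }
    return best;
}
```

In the real flow this choice is driven by timing analysis across the whole design rather than a single target value per leaf, which is why it belongs in the router and not in the designer's hands.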
Here’s a diagram of this innovation from the SmartConnect White Paper:
These programmable leaf-clock buffers allow the Vivado Design Suite router to automatically fix setup and hold violations without designer intervention. The router employs timing analysis to determine the exact tap setting for each leaf-clock buffer, which helps achieve timing closure at high clock rates. You do not want to manually deal with all of these skew-delay problems at the leaf level, and with SmartConnect technology, you won’t.
The leaf-clock buffers in UltraScale+ devices and the ability of the latest Vivado Design Suite tools to exploit the benefits of these buffers are what Xilinx means when it says that UltraScale+ devices and the Vivado Design Suite are “co-optimized.” You can easily see the benefits of such co-optimization for pipelined function blocks and for interconnect. The White Paper discusses other related co-optimizations.
In addition to the SmartConnect tool optimizations, Xilinx has now introduced AXI SmartConnect IP to really automate the optimal design of large, IP-based systems. Here’s a diagram from the White Paper that illustrates the use of this IP:
As you can see from the diagram, the entire AXI SmartConnect IP appears as one IP block. It’s a simple exercise to draw the 14 wires needed to connect the IP blocks comprising a very complex system based on the Xilinx Zynq UltraScale+ MPSoC heterogeneous processor complex with a large number of DMA and memory controllers. SmartConnect optimizations are baked into the AXI SmartConnect IP block.
There’s a lot more technical detail in the White Paper “Breakthrough UltraScale+ Device Performance with SmartConnect Technology” so I recommend that you download and read it.
By Adam Taylor
The Avnet EVK (Embedded Vision Kit) API provided with the example code allows us to communicate with the camera receiver module within the Zynq and with the Python 1300C imaging device on the camera module. This API provides a number of functions.
You use the functions with “_CAM_” in the name to communicate with the camera receiver module. You use the functions with “_SPI_” in their name to communicate with the Python Camera Module:
void onsemi_vita_spi_reg_write( onsemi_vita_t *pContext, Xuint32 uRegOffset, Xuint32 uData )
void onsemi_vita_cam_reg_write( onsemi_vita_t *pContext, Xuint32 uRegOffset, Xuint32 uData )
As developers, we need to be careful to use the correct function for our intended task; they are pretty easy to confuse.
The first function we’ll look at in depth is the function that controls the camera receiver module. The camera receiver module has fewer registers and is not as well documented as the camera module itself, which is well documented via its datasheet.
Examining the VHDL for the camera receiver IP and adding a little code to our example application to read and display the contents of the registers helped me to create the following memory map for the camera receiver module. Zynq SoC addressing is 32-bit addressing over the AXI interface, so the register addresses increment by four for each address, as you would expect.
Example from the Camera Module Register Display
The complete memory map for the camera receiver can be seen in the pdf document attached at the end of this blog post.
Running the example code, which extracts the contents of all of the registers, should show alignment with the status we read last week, because both are generated by the same module. It is also worth noting at this point that the camera receiver is designed to interface to different kinds of sensors, so not all registers may be populated or used.
We need to use the SPI interface to control the Python 1300C module. Through this interface, we can configure and control the settings to make the device exhibit the desired behavior. The SPI interface uses 16 bits of data plus address and command. Using the SPI read command to read the device ID from the Python 1300C sensor module results in the following response–which rather helpfully ties to the datasheet:
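As a sketch of how such a device-ID read might look in application code, here is a self-contained model. The context struct, the register contents, and the read function are stand-ins patterned on the write prototypes shown earlier (onsemi_vita_spi_reg_read is assumed as the read counterpart), not the actual Avnet EVK API:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Xuint32;

/* Stand-in for the Avnet driver context so this sketch compiles on its
 * own; the real onsemi_vita_t and SPI access functions come from the
 * EVK API. */
typedef struct { Xuint32 regs[512]; } onsemi_vita_t;

static void onsemi_vita_spi_reg_read(onsemi_vita_t *p, Xuint32 uRegOffset,
                                     Xuint32 *pData) {
    *pData = p->regs[uRegOffset];
}

/* Reading SPI register 0 of the Python sensor returns the device ID,
 * which should tie up with the value given in the datasheet. */
Xuint32 read_device_id(onsemi_vita_t *p) {
    Xuint32 id = 0;
    onsemi_vita_spi_reg_read(p, 0, &id);
    return id;
}
```

The same pattern, with the offsets from the datasheet, applies to the other 278 registers once you move past the device ID.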
There are 279 registers within the Python 1300C device on the camera module. The register addresses increment by one and are fully detailed within the datasheet for the device.
It is over the SPI interface that we gain real control of the Python device. We can use this interface to configure the regions of interest/windows, change the digital and analog gain settings, and set all of the other device configurations.
Of course, before jumping in and changing the settings of the Python 1300C device, it is always a good idea to examine the current register contents.
In the next blog, we will look a little closer at how we can use some of these advanced features now that we understand how to communicate with the Python 1300C imager.
This week at the NAB Show in Las Vegas, Blackmagic introduced an Arduino shield with 3G-SDI interfaces that you can use to build custom camera control units (CCUs) and other broadcast-automation controls. According to Blackmagic’s announcement: “Using the Blackmagic 3G-SDI Arduino Shield is easy, as all you need to do is get an Arduino, download the IDE from arduino.cc and then load the sample code. The sample code shows you how the Arduino communicates to the Blackmagic 3G-SDI Arduino Shield and the shield sends the commands to the camera via the SDI connection.” You can get more information on the Blackmagic 3G-SDI Arduino Shield and watch a 3-minute video here.
Meanwhile, here’s an image taken from that Blackmagic Web page:
Blackmagic 3G-SDI Arduino Shield
You can clearly see a Xilinx Spartan-6 LX25T FPGA used on the board to implement the 3Gbps 3G-SDI interface and the board’s logic functions.
Please contact Blackmagic for more information about the 3G-SDI Arduino Shield.
If you remember the Hitachi SuperH line of 32-bit RISC microprocessors then it’s possible that the open-source J Core processor cores might interest you, primarily as a learning tool. J Cores are clean-room versions of the early SuperH processor architectures, developed by the Open Processor Foundation (OPF), written in VHDL, and available royalty- and patent-free under a BSD license. The only reason these open-source cores can exist is that the patents on the SuperH processors are starting to expire. According to Wikipedia, “The last of the SH-2 patents expired in 2014.” The OPF’s cores are called J Cores because the SuperH trademarks have not expired. Renesas now markets devices based on the actual SuperH cores.
Because it’s written in VHDL, you can easily instantiate a J Core processor on a small FPGA, as in fact the OPF developers intend. This Web page tells you how to flash a bitstream file into an FPGA board's onboard SPI flash, which configures the FPGA to act like a j2 processor that’s compatible with the SH-2 processor’s instruction set. The j2 processor has no MMU so it consumes only 60% of a Xilinx Spartan-6 LX9 FPGA, which is the target FPGA architecture for this project. The Spartan-6 LX9 FPGA is the second smallest device in the low-end Spartan-6 device family. According to the J Core site, the least expensive FPGA development board that the j2 build system currently targets is Numato Lab’s $49.95 Mimas v2 (also available on Amazon for $64.95 with free shipping).
I was alerted to this interesting project by an article written by Jim Turley on EEJournal.com titled “Patents Expired! Create Your Own Processor! Make a 32-bit SuperH CPU in Your Spare Time”. As Turley writes:
“…interesting toy or real-life processor resource? J Core can be either. It’s certainly easy enough to get started, and you can follow the progress of future generations of the processor here. (Basically, as the remaining patents expire, their salient features will appear in subsequent J Core generations.) The FPGA implementation comes with a complete walkthrough for first-timers.
“There are plenty of processor architectures in the world. There are even free 32-bit RISC architectures in the wild (OpenRISC, LEON, RISC-V, et al). But most of those are “synthetic” processors designed either as teaching aids or as belligerently open-sourced projects, not as production processors designed for commercial use. SuperH – sorry, J Core – is among the first “real” processors to be reverse-engineered as a free replacement for the once-commercial product. As such, it comes with an established tool chain, a proven track record, and 20+ years of production history. You can’t ask much more of a free processor.”
Turley also says:
“As modern processors go, J-Core isn’t quite all there. There’s no MMU, for example, nor any support for floating-point arithmetic, threads, or multicore implementations. Housed in a Spartan-6 FPGA, it’s not very fast, either. But it is free and it boots Linux out of the box.”
If you just want a SuperH processor, this is probably not the right route to take, for several reasons. But it's an interesting learning tool, and I find it fascinating that older microprocessors can now be implemented entirely in a very small FPGA. Just think about what today's bigger FPGAs can do.
By Lei Guan, Member of Technical Staff, Bell Laboratories, Nokia
Massive-MIMO wireless systems have risen to the forefront as the preferred foundation architecture for 5G wireless networks. A low-latency precoding implementation is critical to reaping the benefits of the multi-antenna transmission inherent in the multiple-input, multiple-output (MIMO) approach. Our team built a high-speed, low-latency precoding core with Xilinx System Generator and the Vivado Design Suite that is simple and scalable.
Due to their intrinsic multiuser spatial-multiplexing transmission capability, massive-MIMO systems significantly increase the signal-to-interference-and-noise ratio at both the legacy single-antenna user equipment and the evolved multi-antenna user terminals. The result is more network capacity, higher data throughput and more efficient spectral utilization.
But massive-MIMO technology does have its challenges. To use it, telecom engineers need to build multiple RF transceivers and multiple antennas based on a radiating phased array. They also have to utilize digital horsepower to perform the so-called precoding function.
Our solution was to build a low-latency, scalable, frequency-dependent precoding block of intellectual property (IP) that can be used, Lego fashion, in both centralized and distributed massive-MIMO architectures.
Key to this DSP R&D project were high-performance Xilinx 7 series FPGAs, along with Xilinx’s Vivado Design Suite 2015.1 with System Generator and MATLAB/Simulink.
Precoding in Generalized Systems
In a cellular network, user data streams radiated from generalized MIMO transmitters are “shaped” in the air by the so-called channel response between each transmitter and receiver at a particular frequency. In other words, different data streams take different paths to reach the receiver on the far side of the air interface. Even the same data stream will behave differently at different times because of its changing “experience” in the frequency domain.
This inherent wireless-transmission phenomenon is equivalent to applying a finite impulse response (FIR) filter with a particular frequency response to each data stream, degrading system performance through the frequency “distortion” that the wireless channels introduce. If we treat the wireless channel as a big black box, only its inputs (transmitter outputs) and outputs (receiver inputs) are visible at the system level. We can therefore add a pre-equalization black box with the inverse channel response at the MIMO transmitter side to precompensate for the channel black-box effects; the cascaded system then delivers reasonably “corrected” data streams to the receiver equipment.
We call this pre-equalization approach precoding; it basically means applying a group of “reshaping” coefficients in the transmitter chain. For example, to transmit NRX independent data streams over NTX antennas (NTX being the number of transmitters), we must perform pre-equalization precoding at a cost of NRX × NTX complex linear convolution operations, plus the corresponding combining operations, before radiating the NTX RF signals into the air.
A straightforward low-latency implementation of complex linear convolution is a FIR-type complex discrete digital filter in the time domain.
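As a behavioral sketch of the idea (not the authors' System Generator implementation), the precoding operation above can be modeled as a bank of NRX × NTX complex time-domain FIR filters whose per-antenna outputs are summed. NumPy is assumed here purely for illustration:

```python
import numpy as np

def precode(streams, coeffs):
    """Precode N_RX data streams onto N_TX antenna branches.

    streams: complex array, shape (N_RX, n_samples)
    coeffs:  complex FIR taps, shape (N_RX, N_TX, n_taps),
             the inverse-channel "reshaping" coefficients
    Returns the N_TX antenna signals, shape (N_TX, n_samples).
    """
    n_rx, n_samples = streams.shape
    _, n_tx, _ = coeffs.shape
    out = np.zeros((n_tx, n_samples), dtype=complex)
    for r in range(n_rx):
        for t in range(n_tx):
            # One complex linear convolution per (stream, antenna) pair:
            # N_RX x N_TX convolutions in total, combined per antenna.
            out[t] += np.convolve(streams[r], coeffs[r, t])[:n_samples]
    return out
```

Each `np.convolve` call is the complex linear convolution that the hardware realizes as a FIR-type complex discrete digital filter.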
System Functional Requirements
In creating a low-latency precoding IP block, my team faced a number of essential requirements.
1. We had to precode one data stream into multiple-branch parallel data streams with different sets of coefficients.
2. We needed a complex asymmetric FIR function of more than 100 taps on each branch to provide reasonable precoding performance.
3. The precoding coefficients needed to be updated frequently.
4. The designed core must be easily updated and expanded to support different scalable system architectures.
5. Precoding latency should be as low as possible with given resource constraints.
Moreover, besides meeting the functional requirements of the particular design, we had to be mindful of hardware resource constraints. In other words, a resource-friendly algorithm implementation would conserve scarce hardware resources such as DSP48 slices, the dedicated hardware multipliers on Xilinx FPGAs.
High-Speed, Low-Latency Precoding (HLP) Core Design
Essentially, scalability is a key feature that must be addressed before you begin a design of this nature. A scalable design will enable a sustainable infrastructure evolution in the long term and lead to an optimal, cost-effective deployment strategy in the short term. Scalability comes from modularity. Following this philosophy, we created a modularized generic complex FIR filter evaluation platform in Simulink with Xilinx System Generator.
Figure 1 illustrates the top-level system architecture. Simulink_HLP_core describes multibranch complex FIR filters with discrete digital filter blocks in Simulink, while FPGA_HLP_core realizes multibranch complex FIR filters with Xilinx resource blocks in System Generator, as shown in Figure 2.
Different FIR implementation architectures lead to different FPGA resource utilizations. Table 1 compares the complex multipliers (CM) used in a 128-tap complex asymmetric FIR filter in different implementation architectures. We assume the IQ data rate is 30.72 Msamples/second (20MHz bandwidth LTE-Advanced signal).
The full parallel implementation architecture is quite straightforward, mapping directly to the direct-form I FIR architecture, but it uses a lot of CM resources. A full serial implementation architecture uses the fewest CM resources by sharing a single CM unit across 128 operations in a time-division multiplexing (TDM) manner, but it would have to run at a clock rate that is impossible for even state-of-the-art FPGAs.
A practical solution is to choose a partially parallel implementation architecture, which splits the sequential long filter chain into several segmental parallel stages. Two examples are shown in Table 1. We went for plan A due to its minimal CM utilization and reasonable clock rate. We can actually determine the final architecture by manipulating the data rate, clock rate and number of sequential stages thus:
FCLK = FDATA×NTAP÷NSS
where NTAP and NSS represent the length of the filters and number of sequential stages.
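The trade-off in Table 1 follows directly from this relation. As a sanity check (the plan-A stage count of 8 is inferred from the 16 = 128/8 coefficients per cRAM and the 491.52MHz clock mentioned later, not stated in Table 1 itself):

```python
def fclk_mhz(f_data_mhz, n_tap, n_ss):
    """Required processing clock for a partially parallel FIR:
    FCLK = FDATA x NTAP / NSS, where each of the n_ss parallel
    segments time-shares one complex multiplier over
    n_tap / n_ss taps per input sample."""
    return f_data_mhz * n_tap / n_ss

# 128-tap filter at the 30.72-Msample/s LTE-Advanced IQ rate:
#   full parallel (NSS = 128): 30.72 MHz, but 128 CMs
#   plan A       (NSS = 8):    491.52 MHz, 8 CMs
#   full serial  (NSS = 1):    3932.16 MHz -- infeasible clock
```

The 491.52MHz figure matches the internal processing clock rate reported for the implemented design.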
Then we created three main modules:
Branch 1 includes four subprocessing stages isolated by registers for better timing: a FIR coefficients RAM (cRAM) sequential-write and parallel-read stage; a complex multiplication stage; a complex addition stage; and a segmental accumulation-and-downsample stage.
In order to minimize the I/O count for the core, our first stage uses a sequential write operation to load the coefficients from storage into the FIR cRAM in TDM fashion (each cRAM holds 16 = 128/8 IQ coefficients). We then designed a parallel read operation that feeds the FIR coefficients to all the CM cores simultaneously.
In the complex multiplication stage, in order to minimize the DSP48 utilization, we chose the efficient, fully pipelined three-multiplier architecture to perform complex multiplication at a cost of six time cycles of latency.
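The three-multiplier scheme referred to here is, presumably, the standard Gauss trick that trades one real multiplication for three extra additions, which is why it saves DSP48s at the cost of extra pipeline latency. A minimal sketch:

```python
def cmul3(ar, ai, br, bi):
    """Complex multiply (ar + j*ai) * (br + j*bi) using three real
    multiplications instead of four (Gauss's trick), the kind of
    scheme behind a three-multiplier complex-multiplier architecture.
    """
    k1 = br * (ar + ai)   # product shared by both output terms
    k2 = ar * (bi - br)
    k3 = ai * (br + bi)
    # real part: ar*br - ai*bi, imag part: ar*bi + ai*br
    return k1 - k3, k1 + k2
```

In hardware, each of the three products maps to one DSP48, and the pre-adders and post-adders account for the six-cycle pipeline latency the article cites.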
Next, the complex addition stage aggregates the outputs of the CMs into a single stream. Finally, the segmental accumulation-and-downsample stage accumulates the temporary substreams for 16 time cycles to derive the corresponding linear convolution results of a 128-tap FIR filter, and to downsample the high-speed streams back to match the data-sampling rate of the system—here, 30.72MHz.
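Behaviorally, the final stage is an integrate-and-dump: a sketch of that behavior under the stated 16-cycle accumulation (again illustrative Python, not the hardware description):

```python
def accumulate_downsample(substream, n_acc=16):
    """Behavioral model of the segmental accumulation-and-downsample
    stage: accumulate the high-rate partial sums for n_acc clock
    cycles, then emit one output sample, bringing the 491.52MHz
    internal stream back down to the 30.72MHz data rate."""
    out, acc = [], 0
    for i, v in enumerate(substream, 1):
        acc += v
        if i % n_acc == 0:     # one full 128-tap result is ready
            out.append(acc)
            acc = 0            # dump and restart the accumulator
    return out
```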
We performed the IP verification in two steps. First, we compared the outputs of the FPGA_HLP_core with the referenced double-precision multibranch FIR core in Simulink. We found we had achieved a relative amplitude error of less than 0.04 percent for a 16-bit-resolution version. A wider data width will provide better performance at the cost of more resources.
After verifying the function, it was time to validate silicon performance. So our second step was to synthesize and implement the created IP in the Vivado Design Suite 2015.1, targeting the FPGA fabric of a Zynq-7000 All Programmable SoC (equivalent to a Kintex xc7k325tffg900-2). With full-hierarchy synthesis and default implementation settings, it was easy to meet timing at a 491.52MHz internal processing clock rate, since we had created a fully pipelined design with clear registered hierarchies.
The HLP IP we designed can be easily used to create a larger massive-MIMO precoding core. Table 2 presents selected application scenarios, with key resource utilizations. You will need an extra aggregation stage to deliver the final precoding results.
For example, as shown in Figure 4, it’s easy to build a 4 x 4 precoding core by plugging in four HLP cores and one extra pipelined data aggregation stage.
Efficient and Scalable
We have illustrated how to quickly build an efficient and scalable DSP linear convolution application, in the form of a massive-MIMO precoding core, with Xilinx System Generator and the Vivado design tools. You could expand this core to support longer-tap FIR applications either by using more sequential stages in the partially parallel architecture or by increasing the processing clock rate within reason. In the latter case, it helps to identify the bottleneck and critical path on the target device for the chosen implementation architecture.
Then, co-optimization of hardware and algorithms would be a good approach to tune the system performance, such as development of a more compact precoding algorithm regarding hardware utilization. Initially, we focused on a precoding solution with the lowest latency. For our next step, we are going to explore an alternative solution for better resource utilization and power consumption.
For more information, please contact the author by e-mail: firstname.lastname@example.org.
Note: This article appeared in the latest issue of Xcell Journal, Issue 94.
By Stefan Petko and Duncan Cockburn, Xilinx, Inc
Wireless network operators face a major challenge in maintaining the bottom line while increasing the capacity and density of their networks. A compression scheme for wireless interfaces can help by reducing the required fronthaul network infrastructure investment. We used the Vivado Design Suite’s high-level synthesis (HLS) tool to evaluate an Open Radio equipment Interface (ORI) standard compression scheme for E-UTRA I/Q data, estimating its impact on signal fidelity, the latency it introduces and its implementation cost. We found that Xilinx’s Vivado HLS offered an efficient platform for evaluating and implementing the selected compression algorithm.
The ever-increasing demand for wireless bandwidth drives the need for new network capabilities such as higher-order MIMO (multiple-input, multiple-output) configurations and carrier aggregation. The resulting increase in network complexity leads operators to architectural changes such as the centralization of baseband processing to optimize network resource utilization. While it reduces baseband processing costs, the sharing of baseband processing resources increases the complexity of the fronthaul network.
These fronthaul networks, transporting the modulated antenna carrier signals between the baseband units (BBU) and remote radio heads (RRH), are most frequently implemented using the Common Public Radio Interface (CPRI) protocol over optical fiber. The CPRI protocol requires a constant bit rate and its specification has over the years increased the maximum data rate to match the increasing bandwidth demands.
Network operators are now looking at technologies that will allow them to achieve a significant hike in data rate without increasing the number of optical fibers in use, thus maintaining current capex and opex overheads associated with a cell site. In an effort to provide a long-term solution, the network operators are looking at alternative network arrangements including rearchitecting the interface between the baseband-processing and radio units in order to reduce the fronthaul bandwidth. However, functional rearrangements can make it more difficult to meet the stringent performance requirements for some wireless interface specifications.
An alternative way to reduce bandwidth is to implement a compression/decompression (codec) scheme for wireless interfaces that are nearing or exceeding the available throughput. The achievable compression ratios depend on the specific wireless-signal characteristics such as noise levels, dynamic range and oversampling rates.
The ORI standard is a refinement of the CPRI specification aiming to enable an open BBU/RRH interface. In its latest release, ORI specifies a lossy time-domain E-UTRA data compression technique for channel bandwidths of 10, 15 or 20MHz. The combination of a fixed 3/4 rate resampling and nonlinear quantization of 15-bit IQ samples achieves a 50 percent reduction in bandwidth requirements, facilitating an 8 x 8 MIMO configuration covering two sectors over a single 9.8-Gbps CPRI link, for example.
We compared the ORI IQ compression performance to a Mu-Law compression algorithm implementation as specified by ITU-T Recommendation G.711. Also a nonlinear quantization technique, Mu-Law uses a logarithmic function to redistribute the quantized values across the available number range. Unlike CDF (cumulative distribution function)-based quantization, which considers the statistical distribution of the input samples, the Mu-Law quantized output is a function only of the corresponding input sample value and the specified companding value.
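For reference, the continuous form of the Mu-Law companding curve looks like this (G.711 proper uses a piecewise-linear 8-bit approximation of the same curve; this sketch, with the conventional mu = 255, is for illustration only):

```python
import math

MU = 255  # companding value commonly used with G.711-style Mu-Law

def mulaw_compress(x):
    """Continuous-form Mu-Law companding of a sample x in [-1, 1]:
    small amplitudes are expanded, large ones compressed, so more
    quantization levels land where most sample values occur."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y):
    """Inverse companding: recover x in [-1, 1] from y in [-1, 1]."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
```

Quantizing the companded value `y` to a fixed number of uniform levels, then expanding, yields the lossy codec that was benchmarked against the ORI scheme.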
For our prototype configuration, we aimed to scale up the compression algorithm to fully utilize a 9.8304Gbps CPRI link (line bit rate option 7). The ORI-compressed E-UTRA sample specification allows us to transport 16 compressed IQ channels (32 I and Q channels compressed independently) over a single 9.8G CPRI link. A target throughput of three compressed samples per CPRI clock is sufficient to fully pack the 32-bit Xilinx LogiCORE IP CPRI IQ interface, giving us the required 737.28-Msample/sec compression IP output.
We tested the implemented codec algorithm using a 20MHz LTE E-UTRA FDD channel stimulus generated with MATLAB’s LTE System Toolbox. We then used Keysight VSA to demodulate the captured IQ data and quantify the signal distortion due to compression and decompression stages by measuring the output waveform error vector magnitude (EVM). We compared the reported output EVM measurements—which represent the difference between the ideal and the measured signal—to the reference input signal EVM.
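The RMS EVM figure of merit used here can be stated compactly; the following NumPy sketch (an assumption about the exact normalization, which VSA tools make configurable) shows the error-vector RMS normalized by the reference-signal RMS, expressed in percent:

```python
import numpy as np

def evm_rms_percent(ideal, measured):
    """RMS error vector magnitude: the RMS of the error vectors
    (measured minus ideal complex IQ samples), normalized by the
    RMS of the ideal signal, in percent."""
    err = measured - ideal
    return 100.0 * np.sqrt(np.mean(np.abs(err) ** 2)
                           / np.mean(np.abs(ideal) ** 2))
```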
We utilized the Vivado HLS FIR IP to prototype the resampling filter. To meet the high-throughput requirements of our design, we implemented parallel single-rate FIR filters and used a loop-based filter output decimation.
The benefits of a fast C-level simulation become even more evident when the verification data sets are large. This is very much the case when evaluating an IQ compression algorithm since at minimum, a full radio frame of data (307,200 IQ samples per channel) is required to make use of the VSA tools for EVM measurements. We observed a simulation speed-up of two orders of magnitude for C simulation compared with C/RTL co-simulation, translating to a nine-hour co-simulation run vs. a five-minute C simulation for our compression IP test run.
Another significant advantage of the HLS testbench was the ease of supplying input data and capturing output data via files in conjunction with HLS streams. This provided an interface for analyzing the data with the VSA tools or comparing it directly against the Octave model output in the C++ testbench.
The Keysight VSA measurements reported an averaged EVM of 0.29 percent for a codec configuration with 144 FIR coefficients. Compared with the original input data with EVM RMS of 0.18 percent, the additional EVM attributable to the compression-decompression processing chain is 0.23 percent. By comparison, the Mu-Law compression algorithm operating on the equivalent input data set results in an average EVM of 1.07 percent.
Vivado high-level synthesis confirmed the required throughput reported in terms of the initiation interval—the number of clock cycles before the top-level task is ready to accept new input data. We also verified that the exported Vivado IP Integrator cores met the timing requirements for the target Kintex UltraScale platform.
From the design tool perspective, Vivado HLS provides a viable hardware prototyping path. A high-level testbench fits well within a design framework that requires flow of data among a number of design and verification tools. The main advantage of such a testbench is the ability to perform fast C-level simulations of a hardware system model. For IQ compression and similar applications, simulation runs involve frequent higher-level parameter or input data set changes, making a fast feedback essential.
Note: This article was abstracted from the latest issue of Xcell Journal, Issue 94.