AMD, ARM, Huawei, IBM, Mellanox, Qualcomm, and Xilinx have joined forces to bring an open, high-performance acceleration framework to data centers. Through the new CCIX Consortium, the companies are collaborating on a specification for the new Cache Coherent Interconnect for Accelerators (CCIX). For the first time in the industry, a single interconnect technology specification will ensure that processors using different instruction set architectures (ISA) can coherently share data with accelerators and enable efficient heterogeneous computing—significantly improving compute efficiency for servers running data-center workloads.
Power and space constraints within data centers have made application acceleration using methods with superior performance/watt relative to CPUs a high-priority requirement. Applications including big data analytics, search, machine learning, NFV, wireless 4G/5G, in-memory database processing, video analytics, and network processing benefit from acceleration, and Xilinx All Programmable devices already deliver this acceleration with superior performance/watt. CCIX will allow application accelerators to access and process data irrespective of where it resides and without the need for continuous CPU/server oversight or complex programming, making acceleration that much more efficient.
Simultaneous with the announcement of the CCIX Consortium, Xilinx also announced that CCIX capabilities would be incorporated into acceleration-enhanced versions of the company’s 16nm UltraScale+ All Programmable devices along with HBM DRAM. (See today’s companion blog post: “Xilinx UltraScale+ All Programmable Device Memory Bandwidth Takes Xpress Lane, Jumps 10x with 3D-on-3D HBM.”)
Xcell Daily will provide more technical details about these technologies as they emerge.
Today, Xilinx announced a new memory-bandwidth express lane on its UltraScale+ All Programmable device roadmap. This new high-speed route places 3D HBM (high-bandwidth memory) DRAM with its massively parallel, high-bandwidth interfaces along with Xilinx’s most advanced 16nm UltraScale+ All Programmable silicon on a 3D CoWoS silicon interposer developed jointly by TSMC and Xilinx (3D on 3D). HBM-enabled UltraScale+ FPGAs employing the high-density, high-performance interconnect of TSMC’s CoWoS silicon interposer will deliver multi-Tbps memory bandwidth—10x the memory bandwidth achievable with individually packaged FPGAs and SDRAMs. Acceleration-enhanced Xilinx 16nm All Programmable devices with this 10x memory-bandwidth boost will serve the processing and memory-bandwidth requirements of data-center applications—including cloud computing—especially well. As today’s announcement says: “Xilinx is already collaborating with leading hyperscale data center customers to create optimized configurations and products.”
There are no additional technical details about this roadmap expansion in today’s Xilinx press release, so Xcell Daily cannot provide additional information about this latest Xilinx announcement at this time, but there’s already plenty of available technical information to review about HBM and the TSMC/Xilinx silicon interposer.
According to Wikipedia, the original version of HBM—a 3D-stacked array of DRAM devices—was developed by AMD and SK Hynix and became a JEDEC standard in 2013. HBM2—which doubles the per-pin memory transfer rate—became a JEDEC standard in January, 2016. Samsung announced early production of HBM2 devices just days later and SK Hynix demonstrated HBM2 devices in March, 2016. So HBM is already quite real.
Here’s a slide from the SK Hynix HBM2 announcement comparing HBM1 and HBM2:
SK Hynix comparison of HBM1 and HBM2
An HBM 3D memory stack consists of multiple memory die plus an optional base logic die stitched together with TSVs (through-silicon vias). One HBM2 stack has a reported memory bandwidth in excess of 1Tbps and multiple stacks can be incorporated into a device.
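As a rough sanity check on that per-stack figure, peak bandwidth is simply interface width multiplied by per-pin data rate. The sketch below assumes the commonly quoted figures of a 1024-bit interface per stack, 1Gbps per pin for the original HBM, and 2Gbps per pin for HBM2. Those numbers are not taken from this post, so treat them as assumptions.

```python
# Peak bandwidth of one HBM stack = interface width (bits) x per-pin rate (Gbps).
# Assumed figures (not from this post): 1024-bit interface per stack,
# 1 Gbps/pin for first-generation HBM and 2 Gbps/pin for HBM2.
INTERFACE_WIDTH_BITS = 1024


def stack_bandwidth_gbps(pin_rate_gbps, width_bits=INTERFACE_WIDTH_BITS):
    """Return the peak bandwidth of one stack in Gbps."""
    return width_bits * pin_rate_gbps


for name, pin_rate in (("HBM1", 1.0), ("HBM2", 2.0)):
    gbps = stack_bandwidth_gbps(pin_rate)
    # Divide by 1000 for Tbps and by 8 for GBytes/sec.
    print(f"{name}: {gbps / 1000:.3f} Tbps = {gbps / 8:.0f} GBytes/sec per stack")
```

Under these assumptions, a single HBM2 stack lands at roughly 2Tbps, which is consistent with the "in excess of 1Tbps" figure quoted above.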
Here’s a slide comparing the original version of HBM to DDR3 SDRAM taken from an SK Hynix paper presented at Hot Chips 2014:
HBM vs DDR3 comparison from “HBM: Memory Solution for Bandwidth-Hungry Processors” presented by Joonyoung Kim and Younsu Kim at Hot Chips 2014
As you can see, even this original version of HBM jumps memory bandwidth by more than 10x relative to a DDR3 SDRAM memory bank.
The 3D silicon interposer technology developed by TSMC and Xilinx, now formally called CoWoS, won one of two 2013 SEMI Awards for North America. SEMI is “the global industry association serving the manufacturing supply chain for the micro- and nano-electronics industries.” (See “Xilinx wins SEMI award for 3D silicon interposer technology, which decreases power consumption and boosts bandwidth.”)
16nm Xilinx devices based on CoWoS interposers constitute a 3rd generation of 3D devices. First-generation Xilinx devices include a series of Virtex-7 FPGAs based on 28nm technology:
The second generation of Xilinx 3D devices includes the Virtex UltraScale VU440 3D FPGA based on 20nm TSMC silicon with 4.4M logic cells.
Today’s announcement says that Xilinx will be extending its 3D technology into the 16nm UltraScale+ device family. As you might imagine—because it’s 3rd-generation technology and built on two generations of multiple, shipping, production devices—CoWoS is extremely solid and well understood.
Stay tuned to Xcell Daily for further technical details about this 3rd-generation 3D-on-3D technology as they become available.
For more information on Xilinx and 3D technology, see “Want the real skinny on commercial 3D ICs? Xilinx’s Vincent Tong forecasts the future on 3DInCites.”
Nutaq just posted a rather complex video showing its 2nd-generation PicoSDR 8x8 connected to a 16-element, 2D antenna array and processing a real-world eNB downlink. The hardware system is being controlled by MATLAB’s LTE System Toolbox. Nutaq’s PicoSDR 8x8-E relies on one 0-6GHz radio, built on AD9361 RFICs from Analog Devices and controlled by an on-board Xilinx Virtex-6 FPGA as shown in the following block diagram:
Here’s the LTE demo on video:
Please contact Nutaq for information about the PicoSDR 8x8.
When the Commodore Amiga appeared in the mid 1980s, it was a color-graphics wonder. When held up and compared against the IBM PC’s CGA graphics (ugh!) or the Apple Macintosh’s monochrome screen, it was well ahead of its time and Commodore sold millions of the 68000-based computers. A network of Amiga computers created the special effects for the pilot episode of the successful “Babylon 5” TV series. (Now there’s a piece of trivia for you.) Three decades later, the Amiga is another interesting historical artifact—to most people. But there are still fanatics keeping it alive. Lukas F Hartmann is one such fanatic. After reviving his slumbering interest in the Amiga, he decided it needed more graphics oomph. Time has indeed passed the orphaned Amiga by.
No problem, said Hartmann. I’ll just buy an upgraded graphics card, he thought. Oops. “Nowadays, these totally outdated cards are rare and sold for ridiculous prices. My A2000 was stuck with 640x256 PAL resolution, 64 colors (ignoring HAM) or headache-inducing interlaced modes,” he writes on his GitHub page.
Hartmann’s solution: “I’ll just make my own graphics card. How hard can it be?”
Knowing he was unlikely to create an ASIC for this application, Hartmann turned to FPGAs and made the extremely reasonable choice of a Xilinx Spartan-6 FPGA. If you want all of the grisly hardware and software design details, click over to his GitHub page (with more info on this Hackaday page) but I’m going to give you the bottom line here.
Hartmann’s final hardware design adapts a Scarab Hardware miniSpartan6+ dev board to the Amiga bus with a custom carrier card. (For more information on the miniSpartan6+ board, see “The miniSpartan6+ low-cost FPGA dev board arrives. Cowabunga!”) The video card uses one of the miniSpartan6+ dev board’s two HDMI ports to drive modern displays at resolutions up to 1280x720p, unimagined when the Amiga first appeared. Here’s a photo of Hartmann’s prototype board.
MNT VA2000 Amiga Graphics Card
The miniSpartan6+ dev board started life as a Kickstarter project and now Hartmann’s project, formally named the MNT VA2000 Amiga Graphics Card, is also a crowdfunded project. Hartmann is selling 50 units at €149.00 each. You’ll find it here.
Hartmann’s project is a perfect example that demonstrates just how handy FPGAs can be. An FPGA is exactly the right device for developing hardware like a GPU. It allows for plenty of design experimentation with no incremental hardware or NRE cost and relatively minor time penalties for changing your mind while delivering hardware-driven performance that far exceeds what you might get from a processor stepping through code.
By Adam Taylor
Within the embedded system space, it is very common to find an FPGA at the heart of the system. This is due to the FPGA’s ability to perform several functions in parallel and its deterministic response. Many embedded systems also contain a processor to handle communication, housekeeping, scheduling, and other tasks traditionally performed in software.
The combination of FPGA and processor can add to what is often called the SWAP-C of the system. SWAP-C relates to Size, Weight, Power and Cost of the solution. Obviously, using both a processor and an FPGA increases not just the BOM cost but the non-recurring engineering costs as well. In addition, design and verification become more complicated. The two devices will also require more board space, which increases the solution’s size and weight. The power architecture will also be more complicated than if just one device were used, further impacting the SWAP-C.
While it is difficult to implement functions typically performed by an FPGA with a software-driven processor, a design can often benefit by implementing a processor within the FPGA.
We have several choices when it comes to implementing a processor in a Xilinx FPGA:
In this article, we’ll take a closer look at implementing a Xilinx MicroBlaze processor within our FPGA design to reduce our SWAP-C.
What is MicroBlaze?
MicroBlaze is a 32-bit softcore processor. That means it is soft IP that you can customize and then synthesize, followed by place and route into the logic resources of the target FPGA. Each MicroBlaze processor instantiation is customized and can include advanced features such as FPUs (floating-point units), MMUs (memory-management units), and instruction and data caches.
You can run a number of operating systems on the MicroBlaze processor including FreeRTOS, Micrium µC/OS-III, and Linux. You can also run bare-metal code. The soft nature of the instantiation ensures that we do not run into obsolescence problems. In short, the MicroBlaze processor is a very powerful tool to have in our embedded system development tool box.
Creating a MicroBlaze System
The ability to implement MicroBlaze processors within our design is a standard feature of the Xilinx Vivado HL WebPACK edition.
The first thing to do is to create a new project in Vivado and add a new block diagram. We can then add in the MicroBlaze processor core from the IP Catalog. Once we have placed the MicroBlaze processor in the block diagram, we need to customize it for the performance we require. Opening the MicroBlaze processor for customization will present the first of five processor customization pages. On the first page, we can select the desired performance from the core as shown in Figure 1. For this example, we will develop a high-performance MicroBlaze processor.
Figure 1: Selecting the configuration of the MicroBlaze
To create the basics of our system we are going to need the following IP cores:
Our connection architecture for these blocks is shown in Figure 2 below:
Figure 2: High Level Block Diagram
We are going to use a 100MHz clock that will be the input to a Clocking Wizard, which will use an MMCM (mixed-mode clock manager) to generate clocks at 100, 166.667, and 200MHz. The MicroBlaze processor will run from the MMCM’s 100MHz output while the other clocks will be used for the Memory Interface Generator. Tables 1 and 2 below show the configuration of the AXI Interconnects:
Table 1: Peripheral AXI Interconnect, Clock and Reset Configuration
Table 2: Memory AXI Interconnect, Clock and Reset Configuration
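The clock plan described above (a 100MHz input producing 100, 166.667, and 200MHz) can be checked with a few lines of arithmetic. An MMCM multiplies the input clock up to a VCO frequency and then divides it down once per output, so each output is f_in x M / (D x O). The multiplier, divider, and VCO limits in this sketch are illustrative assumptions. The Clocking Wizard chooses the real values for you, and the legal VCO range depends on the device and speed grade, so check the data sheet.

```python
from fractions import Fraction

F_IN_MHZ = Fraction(100)              # board input clock from the text
VCO_MIN_MHZ, VCO_MAX_MHZ = 600, 1200  # assumed VCO limits; device dependent


def mmcm_outputs(f_in, mult, div, out_divs):
    """Return (vco, [outputs]) in MHz for f_out = f_in * mult / (div * out_div)."""
    vco = f_in * mult / div
    if not VCO_MIN_MHZ <= vco <= VCO_MAX_MHZ:
        raise ValueError(f"VCO at {float(vco):.1f} MHz is outside the assumed range")
    return vco, [vco / o for o in out_divs]


# Hypothetical settings: M = 10, D = 1 gives a 1000 MHz VCO.
OUT_DIVS = [10, 6, 5]
vco, outs = mmcm_outputs(F_IN_MHZ, 10, 1, OUT_DIVS)
print(f"VCO: {float(vco):.3f} MHz")
for out_div, f_out in zip(OUT_DIVS, outs):
    print(f"  divide by {out_div}: {float(f_out):.3f} MHz")
```

Using exact fractions makes it clear that the 166.667MHz clock is really 500/3 MHz, which is why it shows up as a rounded value in the tools.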
Configuring the DDR Memory
We are developing a high-performance MicroBlaze system so we want to be able to execute our program from DDR SDRAM and also use more exotic capabilities including DMA so that the MicroBlaze processor can process captured data. DDR memory interfaces can be difficult to implement due to the complex driving requirements but the Xilinx Memory Interface Generator (MIG) can automatically generate the DDR interface between the AXI bus and the DDR SDRAM.
The MIG is available from the IP Catalog and customizing it allows us to select the desired clock frequency, target memory device, memory options, termination schemes, and pin allocation. Figure 3 below shows the selection of the target DDR device. While it may initially look complicated, it is very easy to work with and to get up and running quickly.
Figure 3: MIG selecting target device
Once we have customized the interface as required for our application, we need to provide two clocks: a 200MHz reference clock and a 166.667MHz system clock in this example.
With all of the modules within the design customized to our needs, we can create an RTL wrapper and re-generate the outputs, allowing us to build the system and develop our first application.
Developing the Software
Once the project implementation completes and we have a bit file, we can open the implemented design and export the HDF and the bit file to SDK. Now we’re ready to create our software application.
If this is the first time you have opened SDK, you will be asked for the workspace you wish to use. The workspace is the area where your projects and associated software project files such as BSPs (Board Support Packages) and the hardware definition will be stored.
To get this up and running within SDK, we need to do the following:
The first step is to import the hardware definition. To do this, select file -> new -> other from the SDK menu. This will open a dialog box as shown in Figure 4. Select “Hardware Platform Specification” beneath the Xilinx folder.
Figure 4: Selecting the Hardware Platform Specification
Enter your project name in the next dialog box. One good practice to get into is always to call it project_HW, to label it clearly. Browse to the directory within your Vivado project that contains the HDF file. Note this is within the .sdk folder under your Vivado project.
This will create the hardware specification that will appear in the project explorer on the left side of SDK. Within this project, you should be able to open the HDF file and see the addresses of all memory-mapped peripherals.
With the hardware platform created, we are now ready to create a BSP. This will contain the drivers and APIs that allow us to drive and control the hardware. We can create the BSP by selecting file -> new -> board support package. This will open a dialog box and we can step through the pages.
Enter a project name. Notice how it has picked up the hardware platform we just created. For this example, we will use the standalone operating system. This will open a settings pop-up for the BSP. There are no changes we need to make here, but this is where we can add options if needed, such as the lightweight IP (lwIP) stack.
On the standalone page, we can also select the stdin and stdout for the compiler. Make sure this is set to AXI UART.
At this point, we can then create our application. For this example, I am going to use the simple “hello world” template. We can create the application project by selecting file -> new -> application project within SDK. This will open a dialog box where we can select the BSP we previously created as well as the hardware definition and the processor we are targeting. (In this case there is only one.)
These steps will create a simple application that outputs the character string “hello world” over the UART. Selecting “build all” will then build the BSP and the application project, producing an ELF file, which you can download and run on the hardware.
Running on the Hardware
We need to create a debug environment that will download the ELF when we click on it. To do this, right-click on your application project and select Debug As -> Debug Configurations. This will open a dialog box, as shown in Figure 5, where you can create a new debug environment. We wish to create a new GDB debug application.
Figure 5: Creating the debug configuration
Provide a name and, if not selected, select “Reset Processor” from the drop-down menu close to the bottom. We also need to click on the Debugger applications tab and uncheck the option “stop at main() when debugging” to ensure that the application will run automatically on download. Finally, click on apply, not debug, and then close.
The first thing to do is program the FPGA. We do this under Xilinx Tools -> Program FPGA. Once you see that the FPGA is programmed, you are ready to download your ELF file. Click on the bug icon on the top menu and this will use the debug configuration we just created.
Once downloaded you should see the software run and the message “hello world” appear in your chosen terminal program.
Building your own MicroBlaze system is very simple and straightforward to implement, as is developing the software to run on it. If you are looking to reduce your system’s SWAP-C, a MicroBlaze processor can help.
Available starting this month, XIMEA’s recently introduced xiSpec Hyperspectral Multi-Linescan USB3 Vision Camera covers the visible and NIR (near-infrared) spectrum. This camera does not use a hyperspectral filter mosaic. Instead it uses a sensor that incorporates a line-wise arrangement of 150 HSI (hyperspectral imaging) bands spanning wavelengths between 470 and 900nm. The camera is USB3 Vision compliant and includes drivers for Windows, Linux, and MacOS and an SDK.
xiSpec Hyperspectral Multi-Linescan USB3 Vision Camera
Because it’s a multi-linescan camera, it can capture crisp images of objects passing by the camera, in conveyor or drone applications for example. The camera’s visual and near-infrared spectral imaging capabilities are especially relevant in agriculture applications such as produce sorting and grading.
For an excellent introduction to hyperspectral imaging from XIMEA, click here.
Note: The XIMEA xiSpec Hyperspectral Multi-Linescan USB3 Vision Camera is based on a Xilinx Spartan-6 FPGA. XIMEA has used Xilinx FPGAs for its previous camera designs as well, including a Kintex-7 FPGA in the CB200 5K digital video camera (see “XIMEA CB200 5K digital video camera pumps 1.7Gbytes/sec down a 300M optical cable using FPGA-based PCIe”) and the CB120 4K video camera (see “XIMEA 4K video camera sends 130frames/sec over 300m of fiber using company’s FPGA-based camera platform.”)
By Adam Taylor
Engineers never lose sight of the need to deliver projects that hit the quality, schedule and budget targets. You can apply the lessons learned by the community of embedded system developers over the years to ensure that your next embedded system project achieves those goals. Let’s explore some important lessons that have led to best practices for embedded development.
Systems engineering is a broad discipline covering development of everything from aircraft carriers and satellites, for example, to the embedded systems that enable their performance. We can apply a systems engineering approach to manage the embedded systems engineering life cycle from concept to end-of-life disposal.
The first stage in a systems engineering approach is not, as one might think, to establish the system requirements, but to create a systems engineering management plan. This plan defines the engineering life cycle for the system and the design reviews that the development team will perform, along with expected inputs and outputs from those reviews. The plan sets a clear definition for the project management, engineering and customer communities as to the sequence of engineering events and the prerequisites at each stage. In short, it lays out the expectations and deliverables.
With a clear understanding of the engineering life cycle, the next step of thinking systematically is to establish the requirements for the embedded system under development. A good requirement set will address three areas. Functional requirements define how the embedded system performs. Nonfunctional requirements define such aspects as regulatory compliance and reliability. Environmental requirements define such aspects as the operational temperature and shock and vibration requirements, along with the electrical environment (for example, EMI and EMC).
Within a larger development effort, those requirements will be flowed down and traceable from a higher-level specification, such as a system or subsystem specification (Figure 1). If there is no higher-level specification, we must engage with stakeholders in the development to establish a clear set of stakeholder requirements and then use those to establish the embedded system requirements.
Generating a good requirement set requires that we put considerable thought into each requirement to ensure that it meets these standards:
It is also common to use specific language when defining requirements to demonstrate intention. Typically, we use SHALL for a mandatory requirement and SHOULD for a nonmandatory requirement. Nonmandatory requirements let us express desired system attributes.
After we have established our requirements baseline, best practice is to create a compliance matrix, stating compliance for each requirement. We can also start establishing our verification strategy by assigning a verification method for each requirement. These methods are generally Test, Analysis, Inspection, Demonstration, and Read Across. Creating the requirements along with the compliance and verification matrices enables us to:
Every engineering project encompasses a number of budgets, which we should allocate to solutions identified within the architecture. Budget allocation ensures that the project achieves the overall requirement and that the design lead for each module understands the module’s allocation in order to create an appropriate solution. Typical areas for which we allocate budgets are the total mass for the function; the total power consumption for the function; reliability, defined as either mean time between failures or probability of success; and the allowable crosstalk between signal types within a design (generally a common set of rules applicable across a number of functions). One of the most important aspects of establishing the engineering budgets is to ensure that we have a sufficient contingency allocation. We must defeat the desire to pile contingency upon contingency, however, as this becomes a significant technical driver that will affect schedule and cost.
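The budget allocation described above can be sketched in a few lines. This is a minimal illustration, assuming a single system-level contingency held back before the remainder is split by weight. The module names, weights, and the 20W and 10% figures are hypothetical and not taken from the article.

```python
def allocate_budget(total, weights, contingency_fraction):
    """Reserve one system-level contingency, then split the rest by weight.

    Returns (reserve, {module: allocation}) in the same units as `total`.
    """
    if not 0 <= contingency_fraction < 1:
        raise ValueError("contingency_fraction must be in [0, 1)")
    reserve = total * contingency_fraction
    allocatable = total - reserve
    weight_sum = sum(weights.values())
    allocations = {name: allocatable * w / weight_sum for name, w in weights.items()}
    return reserve, allocations


def usable_fraction(levels, contingency_fraction):
    """Fraction of the budget left if every level holds back its own contingency."""
    return (1 - contingency_fraction) ** levels


# Hypothetical 20 W power budget with a 10% contingency held at system level.
WEIGHTS = {"power conditioning": 2, "processing": 5, "analog I/O": 3}
reserve, alloc = allocate_budget(20.0, WEIGHTS, 0.10)
print(f"contingency reserve: {reserve:.2f} W")
for module, watts in alloc.items():
    print(f"  {module}: {watts:.2f} W")

# Stacking a 10% contingency at three levels leaves well under 80% usable.
print(f"usable after 3 stacked contingencies: {usable_fraction(3, 0.10):.1%}")
```

The last line shows why piling contingency upon contingency becomes a technical driver: three stacked 10% reserves leave only about 73% of the original budget for the actual design.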
From the generation of the compliance matrix and the engineering budgets, we should be able to identify the technically challenging requirements. Each of these at-risk requirements should have a clear mitigation plan that demonstrates how we will achieve the requirement. One of the best ways to demonstrate this is to use technology readiness levels (TRLs). There are nine TRLs, describing the progression of the maturity of the design from basic principles observed (TRL 1) to full function and field deployment (TRL 9).
Assigning a TRL to each of the technologies used in our architecture, in conjunction with the compliance matrix, lets us determine where the technical risks reside. We can then effect a TRL development plan to ensure that as the project proceeds, the low TRL areas increase to the desired TRL. The plan could involve ensuring that we implement and test the correct functionality as the project progresses, or performing functional or environmental/dynamic testing during the project’s progression.
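One simple way to keep track of this is a small table of technologies with their current and target TRLs, sorted so the largest gaps surface first. The sketch below is only an illustration of that bookkeeping. The technology names and TRL values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Technology:
    name: str
    trl: int         # current technology readiness level, 1..9
    target_trl: int  # level the development plan must reach

    def __post_init__(self):
        for level in (self.trl, self.target_trl):
            if not 1 <= level <= 9:
                raise ValueError(f"TRL must be between 1 and 9, got {level}")


def trl_gaps(technologies):
    """Return (name, gap) pairs for items below target, largest gap first."""
    gaps = [(t.name, t.target_trl - t.trl) for t in technologies if t.trl < t.target_trl]
    return sorted(gaps, key=lambda item: item[1], reverse=True)


# Hypothetical architecture elements and readiness levels.
technologies = [
    Technology("DDR3 memory interface", trl=6, target_trl=6),
    Technology("custom ADC front end", trl=3, target_trl=6),
    Technology("isolating DC/DC converter", trl=5, target_trl=6),
]

for name, gap in trl_gaps(technologies):
    print(f"{name}: {gap} level(s) below target")
```

Items that already meet their target drop out of the list, so what remains is the set of technologies the TRL development plan needs to address.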
Once we understand the required behavior of the embedded system, we need to create an architecture for the solution. The architecture will comprise the requirements grouped into functional blocks. For instance, if the embedded system must process an analog input or output, then the architecture would contain an analog I/O block. Other blocks may be more obvious, such as power conditioning, clocks and reset generation.
The architecture should not be limited to the hardware (electrical) solution, but should include the architecture of the FPGA/SoC and associated software. Of course, the key to modular design is good documentation of the interfaces to the module and the functional behavior.
One key aspect of the architecture is to show how the system is to be created at a high level so that the engineering teams can easily understand how it will be implemented. This step is also key for supporting the system during its operational lifetime.
When determining our architecture, we need to consider a modular approach that not only allows reuse on the current project but also enables reuse in future projects. Modularity requires that we consider potential reuse from day one and that we document each module as a standalone unit. In the case of internal FPGA/SoC modules, a common interface standard such as the ARM AMBA Advanced Extensible Interface (AXI) facilitates reuse.
An important benefit of modular design is the potential ability to use commercial off-the-shelf modules for some requirements. COTS modules let us develop systems faster, as we can focus our efforts on those aspects of the project that can best benefit from the added value of our expertise.
The system power architecture is one area that can require considerable thought. Many embedded systems will require an isolating AC/DC or DC/DC converter to ensure that failure of the embedded system cannot propagate. Figure 2 provides an example of a power architecture. The output rails from this module will require subregulation to provide voltages for the processing core and conversion devices. We must take care to guard against significant switching losses and degraded efficiency in these stages. As we decrease efficiency, we increase the system thermal dissipation, which can affect the unit reliability if not correctly addressed.
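To see how stage efficiency turns into heat, consider a simplified chain in which an isolating converter feeds one subregulator that supplies the whole load. Efficiencies multiply along the chain, and everything that is not delivered to the load is dissipated in the converters. The load power and efficiency values below are hypothetical, and a real design with several rails would need one such calculation per rail.

```python
def chain_efficiency(stage_efficiencies):
    """Overall efficiency of converters connected in series."""
    overall = 1.0
    for eff in stage_efficiencies:
        if not 0 < eff <= 1:
            raise ValueError("each efficiency must be in (0, 1]")
        overall *= eff
    return overall


def dissipation_w(load_w, stage_efficiencies):
    """Heat dissipated in the converter chain for a given load power."""
    input_w = load_w / chain_efficiency(stage_efficiencies)
    return input_w - load_w


LOAD_W = 10.0  # hypothetical load power
for stages in ([0.85, 0.90], [0.80, 0.85]):
    overall = chain_efficiency(stages)
    heat = dissipation_w(LOAD_W, stages)
    print(f"stages {stages}: overall {overall:.1%}, dissipation {heat:.2f} W")
```

With these illustrative numbers, dropping from 85%/90% to 80%/85% raises the converter dissipation for the same 10W load from roughly 3.1W to roughly 4.7W, which is the thermal penalty the paragraph above warns about.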
We must also take care to understand the behavior of the linear regulators used and the requirements for further filtering on the power lines. This need arises as devices such as FPGAs and processors switch at far higher frequencies than a linear regulator’s control loop can address. As the noise increases in frequency, the noise rejection of the linear regulator decreases, resulting in the need for additional filtering and decoupling. Failure to understand this relationship has caused issues in mixed-signal equipment.
Another important consideration is the clock and reset architecture, especially if there are several boards that require synchronization. At the architectural level, we must consider the clock distribution network: Are we fanning out a single oscillator across multiple boards or using multiple oscillators of the same frequency? To ensure the clock distribution is robust, we must consider:
We must also pay attention to the reset architecture, ensuring that we only apply the reset where it is actually required. SRAM-based FPGAs, for example, typically do not need a reset. If we are using an asynchronous assertion of the reset, we need to ensure that its removal cannot result in a metastability issue.
Formal documentation of both internal and external interfaces provides clear definition of the interfaces at the mechanical, physical and electrical levels, along with protocol and control flows. These formal documents are often called interface control documents (ICDs). Of course, it is best practice to use standard communication interfaces wherever possible.
One of the most important areas of interface definition is the “connectorization” of the external interfaces. This process takes into account the pinout of the required connector, the power rating of the connector pins and the number of mating cycles required, along with any requirements for shielding.
As we consider connector types for our system, we should ensure that there cannot be inadvertent cross connection due to the use of the same connector type within the subsystem. We can avoid the possibility of cross connection by using different connector types or by employing different connector keying, if supported.
Connectorization is one of the first areas in which we begin to use aspects of the previously developed budgets. In particular, we can use the crosstalk budget to guide us in defining the pinout.
The example in Figure 3 illustrates the importance of this process. Rearranging the pinout to place the ground reference voltage (GND) pin between Signal 1 and Signal 2 would reduce the mutual inductance and hence the crosstalk.
The ICD must also define the grounding of the system, particularly when the project requires external EMC. In this case, we must take care not to radiate the noisy signal ground.
Engineers and project managers have a number of strategies at their disposal to ensure they deliver embedded systems that meet the quality, cost and schedule requirements. When a project encounters difficulties, however, we can be assured that, unless something significant changes on the project, its past performance will be a good indicator of its future performance.
Note: This article originally appeared in Xcell Software Journal, Issue 3.
Avnet has just rolled out its second FMC Carrier Card for the Avnet PicoZed SOM, which is based on a Xilinx Zynq-7000 SoC (a Z-7010, Z-7015, Z-7020, or Z-7030). The $349 PicoZed FMC Carrier Card V2 greatly expands the I/O capabilities of the PicoZed SOM with connector interfaces for the on-module Gigabit Ethernet PHY and USB PHY. The carrier card also has a micro SD card and USB-UART. The majority of the Zynq SoC’s programmable-logic I/O pins are brought out to an LPC FMC connector. In addition, the card provides an HDMI output port, a real-time clock, a high-performance clock synthesizer, two MAC ID EEPROMs, and several Digilent-compatible Pmod connectors. The four serial transceivers on the 7015 and 7030 SOMs are allocated to a PCIe Gen2 x1 card edge interface, the previously mentioned FMC connector, an SFP+ cage for high-speed optical networking, and general-purpose SMA connectors.
Here’s a block diagram of the PicoZed FMC Carrier Card V2:
$349 Avnet PicoZed FMC Carrier Card V2 Block Diagram
The carrier card comes bundled with Wind River Pulsar Linux to speed embedded development, including the development of IoT—particularly industrial IoT (IIoT)—devices. With the FMC expansion connector, you can rapidly develop and prototype many types of embedded systems for vision and video, motor-control, and SDR applications.
Here’s a 5-minute Avnet video to explain things:
Even if you’re not especially interested in a PicoZed carrier card at the moment, it’s worth watching this video for some design tips embedded in it that you’ll find particularly interesting for your own Zynq-based hardware designs. At the 3-minute mark in the video, Avnet Project Engineer Dan Rozwood discusses some interesting specifics of a cost-reduced clocking system designed for this new carrier card based on an IDT programmable-clock IC. You don’t get design tips like this dropped on you every day, so take five minutes to grab this one.
For more information about the Avnet PicoZed SOM, see:
By Scott McNutt, Senior Software Engineer, DesignLinx Hardware Solutions, LLC
Embedded systems usually fall into one of two categories: those that require hard real-time performance and those that don’t. In the past, we had to pick our poison—the performance of our “go to” real-time operating system or the rich feature set of our favorite Linux distribution—and then struggle with its shortcomings.
Today, embedded developers no longer need to choose between the two. Asymmetric multiprocessing (AMP) offers the best of both worlds.
Several modern system-on-chip (SoC) product offerings integrate multiple CPUs, a broad variety of standard I/O peripherals and programmable logic. The Xilinx Zynq-7000 All Programmable SoC family, for example, includes a dual-core ARM Cortex-A9, standard peripherals (such as Gigabit Ethernet MACs, USB, DMA, SD/MMC, SPI and CAN) and a large programmable logic array. We can use these SoC products as the basis of a Linux/RTOS AMP system that provides considerable flexibility.
In many ways, the typical AMP configuration is similar to a PCI-based system, with the Linux domain functioning as the host, the RTOS domain functioning as an adapter, and one or more shared memory regions used for communication between the two domains. Unlike PCI, however, an AMP configuration can more conveniently—and dynamically—assign resources (both the standard peripherals and custom logic) to one domain or the other. In addition, a Linux/RTOS AMP system can dynamically reconfigure programmable logic based on runtime requirements, such as the presence or absence of various external devices.
This level of flexibility is often coupled with concerns about complexity and the degree of difficulty involved in bringing up an AMP system. Rest assured that the Linux development community has introduced many features into the kernel that greatly simplify AMP configuration and use.
With respect to multiprocessing, the Linux kernel comes in two flavors: the uniprocessor (UP) kernel and the symmetric multiprocessor (SMP) kernel. The UP kernel can only run on a single core, regardless of the number of available cores. AMP systems can incorporate two or more instances of the uniprocessor kernel.
The SMP kernel, however, can run on one core or simultaneously on multiple cores (Figure 1). An optional kernel command line parameter controls the number of cores that the SMP kernel uses following system initialization. Once the kernel is running, various command line utilities control the number of cores assigned to the kernel. The ability to dynamically control the number of cores used by the kernel is a primary reason AMP developers prefer the SMP kernel over the UP kernel.
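The article does not name the utilities it has in mind. As a minimal sketch, assuming the standard Linux CPU-hotplug sysfs files (`/sys/devices/system/cpu/cpuN/online`), a core can be taken away from the SMP kernel or handed back by writing `0` or `1`. The helper names are hypothetical, and the `root` parameter exists only so the code can be exercised against an ordinary directory.

```python
from pathlib import Path

# Assumption: the standard CPU-hotplug sysfs layout on a Linux target.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def cpu_online_path(core: int, root: Path = SYSFS_CPU_ROOT) -> Path:
    """Return the hotplug control file for one core."""
    return root / f"cpu{core}" / "online"


def set_core_online(core: int, online: bool, root: Path = SYSFS_CPU_ROOT) -> None:
    """Give a core to the SMP kernel (True) or release it (False)."""
    cpu_online_path(core, root).write_text("1" if online else "0")


def is_core_online(core: int, root: Path = SYSFS_CPU_ROOT) -> bool:
    """Report whether the kernel is currently using the core."""
    return cpu_online_path(core, root).read_text().strip() == "1"
```

On real hardware this needs root privileges, and the boot-time limit would typically come from a kernel command line parameter such as `maxcpus=`.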
The Remote Processor (remoteproc) Framework is the Linux component that is responsible for starting and stopping individual cores (remote processors), as well as for loading a core’s software in an AMP system. For example, we can dynamically reconfigure the SMP system shown in Figure 1 into the AMP system shown in Figure 2, and then back again to SMP, using the capabilities of remoteproc.
We can fully control reconfiguration via a userspace application or system initialization script. Reconfiguration control allows user applications to stop, reload and run a variety of RTOS applications based on the dynamic needs of the system.
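A stop, reload, and run cycle of this kind can be sketched as follows. This assumes the remoteproc sysfs interface found in newer mainline kernels (`/sys/class/remoteproc/remoteprocN/firmware` and `.../state`); kernels from the time of this article may expose a different control mechanism, and the function name and firmware file name are hypothetical.

```python
from pathlib import Path

# Assumption: mainline remoteproc sysfs layout.
REMOTEPROC_ROOT = Path("/sys/class/remoteproc")


def reload_firmware(firmware_name: str, index: int = 0,
                    root: Path = REMOTEPROC_ROOT) -> None:
    """Stop a remote processor if needed, point it at a new ELF, restart it."""
    rproc = root / f"remoteproc{index}"
    state = rproc / "state"

    # A running core must be stopped before its firmware can be replaced.
    if state.read_text().strip() == "running":
        state.write_text("stop")

    # The name is resolved relative to the kernel firmware search path.
    (rproc / "firmware").write_text(firmware_name)
    state.write_text("start")
```

A system initialization script could call `reload_firmware("rtos_app.elf")` once at boot, while a userspace application could call it again whenever a different RTOS image is needed.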
The core’s software (in our example, the RTOS and user application) is loaded from a standard Executable and Linkable Format (ELF) file that contains a special section known as the resource table. The resource table is analogous to the PCI configuration space in that it describes the resources that the RTOS requires. Among those resources is the memory needed for the RTOS code and data.
Trace buffers are regions of memory that automatically appear as files in a Linux file system. As their name suggests, trace buffers provide basic tracing capabilities to the remote processor. A remote processor writes trace, debug and status messages to the buffers, where the messages are available for inspection via the Linux command line or by custom applications.
One or more trace buffers may be requested via entries in the resource table. Although they typically contain plain text, trace buffers may also contain binary data such as application state information or alarm indications.
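To make the resource table less abstract, here is a minimal sketch that builds and parses one in memory. The field layout follows my reading of the kernel header `include/linux/remoteproc.h` (a version word, an entry count, two reserved words, an offset array, then typed entries) and should be checked against the kernel in use. The addresses, sizes, and names in the usage note are made up, and only carveout (memory) and trace entries are handled.

```python
import struct

# Resource type codes as defined by the kernel's remoteproc header.
RSC_CARVEOUT, RSC_TRACE = 0, 2


def carveout(da: int, pa: int, length: int, name: str) -> bytes:
    """Pack a memory request: type, da, pa, len, flags, reserved, name[32]."""
    return struct.pack("<IIIIII32s", RSC_CARVEOUT, da, pa, length, 0, 0,
                       name.encode())


def trace(da: int, length: int, name: str) -> bytes:
    """Pack a trace-buffer request: type, da, len, reserved, name[32]."""
    return struct.pack("<IIII32s", RSC_TRACE, da, length, 0, name.encode())


def build_table(entries: list) -> bytes:
    """Prefix packed entries with the table header and offset array."""
    header_size = 16 + 4 * len(entries)
    offsets, body = [], b""
    for entry in entries:
        offsets.append(header_size + len(body))
        body += entry
    header = struct.pack("<IIII", 1, len(entries), 0, 0)
    return header + struct.pack(f"<{len(entries)}I", *offsets) + body


def parse_table(blob: bytes):
    """Return (version, [(kind, name, device_address, length), ...])."""
    ver, num, _, _ = struct.unpack_from("<IIII", blob, 0)
    offsets = struct.unpack_from(f"<{num}I", blob, 16)
    resources = []
    for off in offsets:
        (rtype,) = struct.unpack_from("<I", blob, off)
        if rtype == RSC_CARVEOUT:
            da, _pa, length, _flags, _rsvd, name = struct.unpack_from(
                "<IIIII32s", blob, off + 4)
            kind = "carveout"
        elif rtype == RSC_TRACE:
            da, length, _rsvd, name = struct.unpack_from("<III32s", blob, off + 4)
            kind = "trace"
        else:
            resources.append(("unknown", "", 0, 0))
            continue
        resources.append((kind, name.rstrip(b"\0").decode(), da, length))
    return ver, resources
```

For example, `parse_table(build_table([carveout(0x3E000000, 0x3E000000, 0x100000, "rtos_code"), trace(0x3E100000, 0x4000, "trace0")]))` lists one memory region for the RTOS and one trace buffer. In a real system the table lives in its own section of the RTOS ELF file and is read by the kernel, not by Python.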
We can also use the resource table to define virtual input/output devices (VDEVs), which are basically pairs of shared memory queues that support message transfer between the Linux kernel and the remote processor. The VDEV definition includes fields that negotiate the size of the queues as well as the interrupts used to signal between the processors.
The Linux kernel handles initialization of the virtual I/O queues. The software running on a remote processor need only include a VDEV description in its resource table and then use the queues once it begins execution; the kernel handles the rest.
The Remote Processor Messaging (rpmsg) Framework is a software messaging bus based on the Linux kernel’s virtual I/O system. The messaging bus is similar to a local area subnetwork in which individual processors can create addressable endpoints and exchange messages, all via shared memory.
The kernel’s rpmsg framework acts as a switch, routing messages to the appropriate endpoint based on the destination address contained in the message. Because the message header includes a source address, ad hoc connections can be established between various processors.
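The switching behavior can be illustrated with a small model. The 16-byte header layout (source, destination, reserved, length, flags) reflects my understanding of the kernel's rpmsg header and should be verified against the kernel source; `RpmsgSwitch` and its methods are hypothetical names for this sketch, and real traffic travels through shared-memory queues, not Python calls.

```python
import struct

# Header fields: src (u32), dst (u32), reserved (u32), len (u16), flags (u16).
RPMSG_HDR = struct.Struct("<IIIHH")


def pack_message(src: int, dst: int, payload: bytes) -> bytes:
    """Prefix a payload with a header carrying both addresses."""
    return RPMSG_HDR.pack(src, dst, 0, len(payload), 0) + payload


class RpmsgSwitch:
    """Toy model of the bus: endpoints register by address, messages are routed."""

    def __init__(self):
        self.endpoints = {}

    def create_endpoint(self, addr: int, callback) -> None:
        """Register callback(src, payload) for messages sent to addr."""
        self.endpoints[addr] = callback

    def route(self, raw: bytes) -> bool:
        """Deliver one message by destination address; False if nobody listens."""
        src, dst, _rsvd, length, _flags = RPMSG_HDR.unpack_from(raw)
        payload = raw[RPMSG_HDR.size:RPMSG_HDR.size + length]
        handler = self.endpoints.get(dst)
        if handler is None:
            return False
        handler(src, payload)
        return True
```

Because every handler receives the sender's address, an endpoint can answer a peer it has never seen before by routing a reply back to `src`, which is the ad hoc connection the paragraph above describes.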
Processors can dynamically announce a particular service by sending a message to the rpmsg framework’s naming service. By itself, the naming service feature is only marginally useful. The rpmsg framework, however, allows service names to be bound to device drivers to support the automatic loading and initialization of specific drivers. For example, if a remote processor announces the service dlinx-h323-v1.0, the kernel can search for, load and initialize the driver bound to that name. This greatly simplifies driver management in systems where services are dynamically installed on remote processors.
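The binding between a service name and a driver can be sketched like this. The announcement layout (a 32-byte name, an address, and a create/destroy flag) follows my reading of the kernel's rpmsg name-service message; the `NameService` class and the stand-in probe function in the usage note are hypothetical, and in the kernel the lookup is done through a driver's ID table.

```python
import struct

# Announcement fields: name[32], addr (u32), flags (u32).
NS_MSG = struct.Struct("<32sII")
RPMSG_NS_CREATE, RPMSG_NS_DESTROY = 0, 1


def announce(name: str, addr: int, flags: int = RPMSG_NS_CREATE) -> bytes:
    """Build the message a remote processor sends to announce a service."""
    return NS_MSG.pack(name.encode(), addr, flags)


class NameService:
    """Maps announced service names to driver probe functions."""

    def __init__(self, driver_table: dict):
        self.driver_table = driver_table  # service name -> probe(addr)
        self.channels = {}                # service name -> driver instance

    def handle(self, raw: bytes):
        """Process one announcement; return the affected driver or None."""
        name_bytes, addr, flags = NS_MSG.unpack(raw)
        name = name_bytes.split(b"\0", 1)[0].decode()
        if flags == RPMSG_NS_DESTROY:
            return self.channels.pop(name, None)
        probe = self.driver_table.get(name)
        if probe is None:
            return None  # no driver is bound to this name
        self.channels[name] = probe(addr)
        return self.channels[name]
```

With a table such as `{"dlinx-h323-v1.0": probe_h323}`, where `probe_h323` is a placeholder, announcing that name creates a channel and initializes the matching driver without any manual configuration.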
Interrupt management can be a little tricky, especially when starting and stopping cores. Ultimately, the system needs to redirect specific interrupts dynamically to the remote processor domain when the remote processor is started, then reclaim those interrupts when the remote processor is stopped. In addition, the system must protect the interrupts from inadvertent allocation by potentially misconfigured drivers. In short, interrupts must be managed systemwide.
For the Linux SMP kernel, this is a routine matter—and a further reason that the SMP kernel is preferred in AMP configurations. The remote processor framework conveniently manages interrupts with only minimal support from the device driver.
Device driver development is always a concern because it requires a skill set that may not be readily available. Fortunately, the Linux kernel’s remoteproc and rpmsg frameworks do most of the heavy lifting; drivers need only implement a handful of standard driver routines. A fully functional driver may only require a few hundred lines of code. The kernel source tree includes sample drivers that embedded developers can adapt to their requirements.
Generic open-source device drivers are also available from vendors. DesignLinx Hardware Solutions provides generic rpmsg drivers for both Linux and FreeRTOS. Since the generic driver makes no assumptions about the format of the messages that are exchanged, embedded developers can use it for a variety of AMP applications without any modifications.
The kernel’s multiprocessing support is not limited to homogeneous multiprocessing systems (systems using only the same kind of processor). All of the features described above can also be used in heterogeneous systems (systems with different kinds of processors). These multiprocessing features are especially useful when migrating existing designs “inside the pins.”
Modern SoC products let designers conveniently move various hardware designs from a printed-circuit board to a system-on-chip (Figure 3). What was once implemented as a collection of discrete processors and components on a PCB can be implemented entirely inside the pins of an SoC.
For example, we can implement the original PCB hardware architecture of Figure 3 with a Xilinx Zynq-7000 family SoC using one of the ARM processors as the control CPU and soft processors (such as Xilinx MicroBlaze processors) in the programmable logic to replace the discrete microprocessors. We can use the remaining ARM processor to run the Linux SMP kernel (Figure 4).
The addition of Linux to the original design provides all of the standard multiprocessing features described above for both the ARM cores and the soft core processors (such as start, stop, reload, trace buffers and remote messaging). But it also brings the broad Linux feature set, which supports a variety of network interfaces (Ethernet, Wi-Fi, Bluetooth), networking services (Web servers, FTP, SSH, SNMP), file systems (DOS, NFS, cramfs, flash memory) and other interfaces (PCIe, SPI, USB, MMC, video), to name just a few. These features offer a convenient pathway to new capabilities without significantly altering tried-and-true architectures.
The past several years have seen an increase in multicore SoC offerings that target the embedded market and are well suited for AMP configurations. The Xilinx UltraScale+ MPSoC architecture, for example, includes a 64-bit quad-core ARM Cortex-A53, a 32-bit dual-core ARM Cortex-R5, a graphics processing unit (GPU) and a host of other peripherals—and, of course, a healthy helping of programmable logic. This is fertile ground for designers who understand how to harness the performance of real-time operating systems coupled with the rich feature set of the Linux kernel.
For more information on designing a Linux/RTOS AMP system, contact DesignLinx Hardware Solutions. A premier member of the Xilinx Alliance Program, DesignLinx specializes in FPGA design and support, including systems design, schematic capture and electronic packaging/mechanical engineering design, and signal integrity.
Note: This article originally appeared in Xcell Software Journal, Issue 3.
Those crafty folks at Red Pitaya in Slovenia want to get your ideas for the most exciting Red Pitaya project they could develop in five years. To entice you, they’re offering to award a Red Pitaya ALU casing to the winner of a “Can You Tell The Future?” contest. Here’s what the prize looks like:
Note: You get the case as a prize, not the Red Pitaya board.
If you somehow missed the numerous Xcell Daily blog posts about the Red Pitaya, it’s an Open Instrumentation Platform based on a Xilinx Zynq-7000 SoC that started life as a wildly successful Kickstarter project and is now in worldwide distribution. Here’s a photo of the Red Pitaya board, which is the best way to list its features:
Red Pitaya is an open-source-software measurement and control tool that consists of easy-to-use visual programming software and free of charge, ready-to-use open-source, web-based test and measurement instruments running on a powerful, credit card-sized board. With a single click, the board can transform into a web-based oscilloscope, spectrum analyzer, signal generator, LCR meter, Bode analyzer, or one of many other applications. Red Pitaya can be controlled by using Matlab, LabVIEW, Python, and Scilab.
The Red Pitaya design team originally included a bunch of instrumentation engineers who developed equipment for particle accelerators. The Red Pitaya represents entry into a broader market.
For more Xcell Daily blog posts about the Red Pitaya, see:
By Adam Taylor
Once we have installed the software for NI’s LabVIEW RIO Evaluation Kit, the next step is to turn on and configure the board for first use. We also need to set up the software environment so that we can play with the hardware and run the first example.
The first thing we have to do is connect the RIO hardware to our PC. Then the software can detect the RIO board and configure it correctly so that we can communicate with it. We have two options: we can use either an Ethernet or a USB connection. To make it nice and simple, I decided to use the supplied USB point-to-point connection cable. The only real difference when using USB is the static IP address of 172.22.11.2 (see below).
With the USB cable connected and power applied, we can complete the RIO Evaluation Kit setup and simply step through the stages to detection. Configure the IP settings and then transfer the software files required. Once this is completed, you should see the LCD screen on top of the RIO Eval board display the classical hello world message.
RIO board has been found
Configuring the setup - Click Apply Static IP
Final Configuration Step – Check the LCD
With the board detected, the next things we need to do are:
Configuring the RIO using NI Max
Full and detailed instructions of how to achieve this are contained within the NI Build Your Own Embedded System.
Once we have done all of this, we are ready to download the first example, which uses a pre-generated FPGA bit file for the PL (programmable logic) side of the on-board Zynq-7000 SoC. Once we have understood the example design’s basic framework, we can customize the PL configuration.
For those unfamiliar with LabVIEW, each design element consists of a user interface and a block diagram below it where the design is actually implemented. These design elements can be used hierarchically within a LabVIEW project.
Opening LabVIEW example 1, we will see the LED Intensity Picker user interface design. We open this example and click on the run arrow. The application will be downloaded to the RIO board and we can then control the LEDs.
User Interface for the LED control application
Being able to run the application is one thing. However, we want to understand how the different elements of the design come together so that we can create our own designs.
From within the LED Picker example, we can open the block diagram below it by using the window -> show block diagram option:
Block Diagram of the top level design for the first example
The design above uses a pre-compiled FPGA bit file for the PWM control, which is contained within the read/write control and comes with its own VI (LabVIEW Virtual Instrument) design.
Opening this up will show you a simple UI that contains two registers. These registers are the interfaces to the Zynq SoC’s PL for this design. These are the actual register interfaces used in the FPGA configuration. In other words, they’re real hardware registers, not virtual ones. (They are not values held in RAM somewhere.)
Zynq SoC’s PL register definition for the LED Picker Example
We can then open the FPGA design which also contains a further design for the PWM light-intensity modulator. Again, this comes with a UI that defines the interfaces to control the PWM in the block diagram.
FPGA design of the PWM Light-intensity modulator for the LED Picker example
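The LabVIEW FPGA design itself is graphical and is not reproduced here. As a rough illustration of what a PWM light-intensity modulator does, the sketch below assumes a common counter-compare scheme: a free-running counter is compared with a threshold derived from the requested duty cycle, and the LED is on while the counter is below it. The function names and the 100-tick period are assumptions, not details from the NI example.

```python
def pwm_waveform(duty_percent: int, period_ticks: int = 100, periods: int = 1) -> list:
    """Return the LED drive signal (1 = on, 0 = off) for each clock tick."""
    if not 0 <= duty_percent <= 100:
        raise ValueError("duty_percent must be between 0 and 100")
    # The output is high while the counter is below this threshold.
    threshold = duty_percent * period_ticks // 100
    return [1 if tick % period_ticks < threshold else 0
            for tick in range(period_ticks * periods)]


def average_intensity(waveform: list) -> float:
    """Fraction of time the LED is on, which the eye perceives as brightness."""
    return sum(waveform) / len(waveform)
```

In the real design, the two registers shown earlier would carry values like `duty_percent` from the processor side into the programmable logic, where the comparison runs in hardware on every clock cycle.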
We’ve simply explored the organization and use of this example by downloading it to the RIO board within the LabVIEW framework. However, there is another way we can do this, which we will look at next time.
The code is available on Github as always.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
You also can find links to all the previous MicroZed Chronicles blogs on my own Web site, here.
Corsa Technologies’ DP2000 series, a new product family of open, programmable SDN switching and routing platforms announced just yesterday morning, redefines how network operators deliver line-rate, subscriber-level networking; on-demand services; and real-time network tuning. Bruce Gregory, CEO of Corsa Technologies, says: “The DP2000 series is the first switching and routing platform to offer open programmable networking on a platform where you can dynamically slice a single physical switch into virtualized line-rate switching and routing instances that can support 100G of throughput. What compute has been doing for years, we now make possible at the core of the communications network. We built this solution with WAN and metro applications in mind, because it is here that network architects are struggling to handle exploding volume and diversity of traffic and where our multi-context hardware virtualization has a huge impact.”
Corsa’s DP2000 series of SDN platforms
The design goals for Corsa’s DP2000 series of networking platforms were SDN flexibility with extremely high performance. The resulting DP2000 design provides:
Here’s a new video from Corsa with simple explanations of the DP2000 series features:
There are currently three models in Corsa’s DP2000 series:
According to Corsa’s VP of Product Management Carolyn Raab, the DP2000 series employs arrays of Xilinx UltraScale FPGAs that help deliver these important product features for the various models within the DP2000 series product line. She also says that the FPGA-based architecture has drawn compliments from customers, media, and analysts when they learn about the DP2000 platform.
SDN flexibility and extremely high performance. FPGA-based SDN system design provides both in the Corsa DP2000 series.
Please contact Corsa Technologies directly for more information about the DP2000 SDN switching and routing platform.
Six Xilinx authors just published a lengthy, 15-page article in the March/April issue of IEEE Micro magazine that provides in-depth technical information about the Xilinx 16nm Zynq UltraScale+ MPSoC. You can read the article online using the link above but I’m going to abstract and highlight some of the deepest technical information here.
Zynq UltraScale+ MPSoCs are 2nd-generation devices with a vastly upgraded multiprocessing SoC subsystem; a sophisticated multi-domain, multi-island power-management system; high-density, on-chip UltraRAM static memories; multi-Gbps transceivers with signaling rates to 32Gbps; integrated I/O controllers for 100 GbE, PCIe Gen4, and 150Gbps Interlaken; and high-performance UltraScale programmable logic. Security, safety, and power management are significantly enhanced relative to the original Zynq-7000 SoC family. A block diagram of the Zynq UltraScale+ MPSoC taken from the Technical Reference Manual appears below.
Zynq UltraScale+ MPSoC Block Diagram
Like the original Zynq-7000 SoC, the Zynq UltraScale+ MPSoC initially boots as a processor system with support for RSA authentication and AES decryption. The device can then load the PL (programmable logic) configuration after the initial boot to ensure secure operation of the entire device. You can use the Zynq UltraScale+ MPSoC’s on-chip PL as a peripheral device, as an application accelerator, or as a heterogeneous processing element. The Zynq UltraScale+ MPSoC’s subsystems and PL can be powered off completely or power-gated for dynamic power management. Many of the Zynq UltraScale+ MPSoC’s processor cores in its PS (processor system) can be independently power-gated.
The Zynq UltraScale+ MPSoC’s PS incorporates the following major features:
The Zynq UltraScale+ MPSoC’s PS consists of two subsystems: the dual-core ARM Cortex-R5F real-time subsystem that includes a lockstep RPU (real-time processing unit) in the Zynq UltraScale+ MPSoC’s low-power domain and an application subsystem that includes an APU (application processing unit) based on a quad-core, 64-bit ARM Cortex-A53 processor, powered by the full-power domain. In addition, the Zynq UltraScale+ MPSoC uses a separate power domain for the PL and a battery-powered domain for powering security keys and the real-time clock.
The RPU based on the dual-core ARM Cortex-R5F processor has lockstep and split (independent) operation modes. Lockstep mode is for safety-critical applications. In lockstep mode, the slave processor’s inputs are delayed by two cycles to provide temporal diversity. The two ARM Cortex-R5F processors’ layouts are physically different to provide physical diversity and the lockstep checker logic is redundant. The RPU has a separate low-latency interface to the PL, which can be accessed even when the full-power domain (including the APU) is powered off. The RPU has low-latency, deterministic access to on-chip memory, which can be used to store safety-critical real-time services. The low-power subsystem (LPS) including the RPU supports up to ASIL-C and SIL3. The full-power subsystem (FPS) including the APU supports up to ASIL-B and SIL2.
The APU, RPU, and PL subsystems share the memory subsystem. The Zynq UltraScale+ MPSoC’s SMMU provides memory protection and partitioning for the APU, RPU, and PL subsystems at boot time. The on-chip DDR SDRAM controller provides a six-ported interface that’s shared by various on-chip masters. This memory controller supports three types of traffic: low-latency (LL), best effort (BE), and real-time (RT). LL traffic has the highest priority, subject to RT not exceeding its latency guarantee. RT requests are time-stamped and tracked to ensure the given latency guarantee. If RT traffic latency stays below the guarantee, it’s treated as BE traffic. However, when the RT timestamp exceeds the latency guarantee, the memory controller increases the RT priority to the highest level.
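The priority rules in this paragraph can be captured in a few lines. This is only a behavioral sketch of the stated policy, not the controller's actual arbitration logic: the `Request` type, the abstract time units, and the oldest-first tie-break within a priority level are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    traffic_class: str          # "LL", "BE", or "RT"
    issued_at: int              # timestamp in arbitrary ticks
    latency_guarantee: int = 0  # only meaningful for RT requests


def effective_priority(req: Request, now: int) -> int:
    """Lower number means served sooner."""
    if req.traffic_class == "RT":
        if now - req.issued_at > req.latency_guarantee:
            return 0  # overdue RT is raised to the highest level
        return 2      # RT within its guarantee is treated as BE
    if req.traffic_class == "LL":
        return 1      # LL wins unless an RT request is overdue
    return 2          # BE


def pick_next(queue: list, now: int) -> Request:
    """Choose the next request; ties go to the oldest (an assumption)."""
    return min(queue, key=lambda r: (effective_priority(r, now), r.issued_at))
```

With a BE request issued at tick 0, an RT request issued at tick 1 with a guarantee of 10 ticks, and an LL request issued at tick 5, the LL request is chosen at tick 6, but the RT request is chosen at tick 20 because it has exceeded its guarantee.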
The multiport PS–PL interconnect supports as much as 1Tbps of bandwidth with 85Gbps per port. Each port implements the AMBA AXI4 interface standard with 128-, 64-, or 32-bit data-width options. Coherent ports implement the AMBA ACE-based cache-coherent protocol with one- or two-way coherency.
The Zynq UltraScale+ MPSoC can boot from QSPI, ONFI NAND, SD, or eMMC Flash memories. Boot images and bitstreams are authenticated using a 4,096-bit RSA key with a 384-bit SHA-3 hash. There is on-chip storage for multiple RSA public keys to support key revocation. Secure boot supports 256-bit AES encryption. The AES keys can be stored in e-fuses or in battery-backed RAM. To mitigate differential power analysis (DPA) attacks, decryption is performed only after authentication succeeds. Boot-image (or bitstream) encryption supports key rolling to further mitigate DPA attacks. The tamper-detection mechanism monitors the power supply, on-chip temperature, clock frequencies, and critical external and internal interfaces. If a tamper event is detected, the security subsystem clears and locks down the system, which can then only be cold rebooted.
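The ordering matters more than the individual algorithms, so here is a sketch of authenticate-then-decrypt in isolation. It computes a SHA3-384 digest of the still-encrypted image with Python's `hashlib`, but it swaps the RSA signature check for a comparison against a trusted digest and leaves decryption to a caller-supplied function, because neither RSA nor AES is in the standard library. Everything here is a simplified stand-in, not the device's boot flow.

```python
import hashlib
import hmac


class AuthenticationError(Exception):
    """Raised when an image does not match its trusted digest."""


def load_boot_image(encrypted_image: bytes, trusted_digest: bytes, decrypt) -> bytes:
    """Authenticate the encrypted image first; decrypt only if that succeeds."""
    digest = hashlib.sha3_384(encrypted_image).digest()

    # Stand-in for RSA signature verification: constant-time digest comparison.
    if not hmac.compare_digest(digest, trusted_digest):
        raise AuthenticationError("image rejected; key material never touched")

    # Only authenticated data ever reaches the decryption engine.
    return decrypt(encrypted_image)
```

Because a forged image is rejected before `decrypt` runs, an attacker cannot feed chosen ciphertext to the decryption hardware and measure its power draw, which is the point the paragraph above makes about DPA.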
There are multiple power domains on the Zynq UltraScale+ MPSoC, which are further split into multiple power islands (on-die power-gated domains). Each of the APU processor cores is independently power-gated, so it can be powered off on its own while the full-power domain (FPD) remains powered on; the RPU processors are power-gated as a pair. APU L2 and RPU tightly-coupled memory are independently power-gated as well. Each of the larger peripherals is independently power-gated. Standard power-management APIs allow the PMU (power-management unit) to provide power-management services to the APU and RPU.
The Zynq UltraScale+ MPSoC’s PS supports four power modes:
For more technical details, please refer to the article in IEEE Micro and to the Zynq UltraScale+ MPSoC’s Technical Reference Manual.
With the Xilinx Zynq-7000 All Programmable SoC, you get a software/hardware system-implementation engine with ample on-chip resources capable of handling complex designs. These resources include a dual-core ARM Cortex-A9 MPCore processor, SDRAM and Flash memory controllers, DMA controllers, plenty of standard peripheral and I/O controllers including Ethernet and USB, timers, and a generous chunk of Xilinx 7 series FPGA. Your design will likely take advantage of most if not all of these resources, so you have many choices to make with respect to the initial system design. You need some analytical help to make the best choices with respect to system performance, timing, and energy consumption; you need insight into how the hardware/software balance you create will affect the workload on the Zynq SoC’s software- and hardware-programmable components.
Mirabilis Design has a tool that can help you acquire that insight. It’s called VisualSim Architect. Using Mirabilis’ VisualSim Architect’s graphical model editor, you can assemble models of your system with pre-defined, pre-compiled, parameterized building blocks. Where do you get these models? Mirabilis has already created a parameterized model that accurately implements all of the Zynq-7000 SoC’s on-chip resources. This resource model incorporates timing, energy consumption values, and functional definitions.
Mirabilis Design’s VisualSim Architect
Next, you need to capture your application’s behavior so that you can trial various system-level design approaches. Mirabilis’ VisualSim Architect captures the behavior of the application as a separate flow, where each element of the flow maps to one or more resources in the Zynq-7000 SoC’s architecture. The input to your system-level design experiments then consists of a list of concurrent applications, multiple workloads at the interfaces, different memory capacities, and different hardware/software partitioning strategies.
Note: No RTL, C code or other programming resources are required so modeling at this level is a relatively quick process.
If you would like to see how this works, you’re in luck. Mirabilis is hosting a Webinar next Thursday (May 19) and will be demonstrating VisualSim Architect using the Xilinx Zynq-7000 SoC as a target example. The Webinar will focus on analyzing the performance and energy consumption of a video application on the Zynq-7000 SoC using various partitioning and implementation strategies to compare and contrast the alternatives at the system level.
For example, you can implement video processing on the ARM Cortex-A9 MPCore processor or you can move some of the processing into the Zynq-7000 SoC’s FPGA to boost performance. In either case, video frames must be written to and read from external SDRAM while network and image-sensor data arrives over the USB, Ethernet, or other peripheral interfaces.
You need to understand the answers to all of these questions clearly before detailed implementation to ensure efficient development. That’s why Mirabilis developed VisualSim Architect.
Does that sound interesting enough to invest 45 minutes of your time? Yes? Then sign up for the Mirabilis Design Webinar here. The Webinar will be presented by Deepak Shankar, Mirabilis Design’s CEO.
Do you have some SDN-based, Enterprise-class WiFi design challenges facing you? Meet these challenges head on with a little help from a new, free video-on-demand tutorial by Xilinx, offered through IEEE ComSoc (the IEEE’s Communications Society). In this tutorial, you’ll learn about the features in the Xilinx Zynq-7000 SoC and Zynq UltraScale+ MPSoC that help you meet the performance goals for SDN-based, Enterprise-class WiFi routers and access points.
By Robin Getz, Analog Devices and Luc Langlois, Avnet Electronics Marketing
By integrating the critical RF signal path and high-speed programmable logic in a fully verified system-on-module (SOM), Avnet’s PicoZed SDR SOM delivers the flexibility of software-defined radio in a device the size of a deck of cards, enabling frequency-agile, wideband 2x2 receive and transmit paths in the 70-MHz to 6.0-GHz range for diverse fixed and mobile SDR applications.
PicoZed SDR combines the Analog Devices AD9361 integrated RF Agile Transceiver with the Xilinx Z-7035 Zynq-7000 All Programmable SoC. The architecture is ideal for mixed software-hardware implementations of complex applications, such as digital receivers, in which the digital front end (physical layer) is implemented in programmable logic, while the upper protocol layers run in software on a dual-core ARM Cortex-A9 MPCore processor. Let’s look at the software-related features of the PicoZed SDR throughout the development process.
Leveraging the full potential of PicoZed SDR calls for a robust, multidomain simulation environment to model the entire signal chain, from the RF analog electronics to the baseband digital algorithms. This is the inherent value of Model-Based Design, a methodology from MathWorks that places the system model at the center of the development process, spanning from requirements definition through design, code generation, implementation and testing. Avnet worked with Analog Devices and MathWorks to develop a support infrastructure for PicoZed SDR in each facet of the design process, starting at the initial prototyping phase.
Using a MATLAB software construct called System objects, MathWorks created a support package for Xilinx Zynq-Based Radio that enables PicoZed SDR as an RF front end to prototype SDR designs right out of the box. Optimized for iterative computations that process large streams of data, System objects automate streaming data between PicoZed SDR and the MATLAB and Simulink environments in a configuration known as radio-in-the-loop, as shown in Figure 1.
Akin to concepts of object-oriented programming, System objects are created by a constructor call to a class name, either in MATLAB code or as a Simulink block. Once a System object is instantiated, you can invoke various methods to stream data through the System object during simulation. The Communications System Toolbox Support Package for Xilinx Zynq-Based Radio from MathWorks contains predefined classes for the PicoZed SDR receiver and transmitter, each with tunable configuration attributes for the AD9361, such as RF center frequency and sampling rate. The code example in Figure 2 creates a PicoZed SDR receiver System object to receive data on a single channel, with the AD9361 local oscillator frequency set to 2.5 GHz and a baseband sampling rate of 1 megasample/second (Msps). The captured data is saved using a log.
Analog Devices has developed the Libiio library to ease the development of software interfacing to Linux Industrial I/O (IIO) devices, such as the AD9361 on the PicoZed SDR SOM. The open-source (GNU Lesser General Public License V2.1) library abstracts the low-level details of the hardware and provides a simple yet complete programming interface that can be used for advanced projects. The library consists of a high-level application programming interface and a set of back ends, as shown in Figure 3.
As shown in Figure 4, the hardware-software co-design workflow in HDL Coder from MathWorks lets you explore the optimal partition of your design between software and hardware targeting the Zynq SoC. The part destined for programmable logic can be automatically packaged as an IP core, including hardware interface components such as ARM AMBA AXI4 or AXI4-Lite interface-accessible registers, AXI4 or AXI4-Lite interfaces, AXI4-Stream video interfaces, and external ports. The MathWorks HDL Workflow Advisor IP core generation workflow lets you insert your generated IP core into a predefined embedded system project in the Xilinx Vivado HLx Design Suite. HDL Workflow Advisor contains all the elements Vivado IDE needs to deploy your design to the SoC platform, except for the custom IP core and embedded software that you generate.
If you have a MathWorks Embedded Coder license, you can automatically generate the software interface model, generate embedded C/C++ code from it, and build and run the executable on the Linux kernel on the ARM processor within the Zynq SoC. The generated embedded software includes AXI driver code, generated from the AXI driver blocks, that controls the HDL IP core. Alternatively, you can write the embedded software and manually build it for the ARM processor.
Note: This article was abstracted from a much longer article that appeared in Xcell Software Journal, Issue 3.
Today, Avnet launched the $299 MicroZed Industrial IoT (IIoT) Starter Kit based on the Avnet MicroZed SoM with a Xilinx Zynq Z-7010 SoC. The kit also includes pluggable sensor solutions from Maxim Integrated and STMicroelectronics. The MicroZed IIoT Starter Kit integrates IBM’s Watson IoT agent on top of a custom-configured, certified image of Wind River’s Pulsar Linux operating system. Avnet’s MicroZed IIoT Starter Kit also includes a design example that uses the standard MQTT messaging protocol to communicate with Watson IoT, which provides a registered, secure connection to additional cloud services and applications including the IBM Bluemix portfolio, a rich set of composable services that can rapidly add cognitive capabilities to IoT designs.
Avnet MicroZed Industrial IoT Starter Kit based on the Avnet MicroZed SoM
The pluggable sensor option from STMicroelectronics consists of six MEMS I2C sensors on an Arduino-compatible shield that includes the following sensors:
The pluggable SPI-based shield from Maxim Integrated is a thermocouple-to-digital module capable of measuring temperatures from a frosty -270°C to 1800°C (more than twice as hot as the surface of the planet Mercury). This sensor module also includes a K-type thermocouple.
But wait, as they say…there’s more. Avnet’s cutting a 50% off deal for the first 100 orders if you fill out this registration form and sign up for an IBM Bluemix account through Avnet.
But wait, that’s not all. If you already have an Avnet MicroZed board, you can get the MicroZed IIoT Upgrade Kit for a mere $129!
What are you waiting for? An engraved invitation?
Dick Selwood just published a quick writeup on EEJournal.com of the QuickPlay Open Development Platform developed by Xilinx Alliance member PLDA and now operating as an independent initiative. The article is titled “FPGAs for the Masses?” and Selwood does a pretty good job of capturing the tool’s description in a paragraph:
“It was, for me, a little difficult to get a grasp on exactly what QuickPlay is. Let's start with what it is not. It is not a complete tool chain for FPGAs – it relies on the synthesis and place and route tools from the device manufacturer. It is not an all-purpose tool; it presumes that you will be developing a system based on one of a (wide) range of boards. But it is a way of developing systems around an FPGA that is accessible to people who are not hardware experts.”
Followed a bit later by this:
“The underlying model of the QuickPlay approach is based on how a software developer thinks - that a design is a number of functions that communicate with one another and also with the outside world. For the hardware implementation, these designs are considered kernels.”
QuickPlay sits atop the Xilinx Vivado HLx Design Suite and provides an alternative way of developing complete system designs targeting specific Xilinx-based target boards.
For more information about QuickPlay, see “A Novel Approach to Software-Defined FPGA Computing.”
By Zach Pfeffer, Edgar Iglesias, Alistair Francis, Nathalie Chan King Choy, and Rob Armstrong Jr, Xilinx
When the System Software team at Xilinx and DornerWorks brought up the Xen Project hypervisor on Xilinx’s Zynq UltraScale+ MPSoC, we found that we could run the popular 1993 videogame Doom to demonstrate the system and test it. The visually striking game allowed the team to visit Xen engineering topics with the aim of passing on knowledge and experience to future hypervisor users.
Our team used an emulation model of the Zynq UltraScale+ MPSoC available for QEMU (the open-source Quick Emulator) to prepare the software for the Doom demonstration, enabling us to bring it up in hours, not days, when silicon arrived.
A hypervisor is a computer program that virtualizes processors. Applications and operating systems running on the virtualized processors appear to own the system completely, but in fact the hypervisor manages the virtual processors’ access to the physical machine resources, such as memory and processing cores. Hypervisors are popular because they provide design compartmentalization and isolation between the independent software elements running on the system.
As described in “Zynq MPSoC Gets Xen Hypervisor Support” (Xcell Journal, Issue 93), a Type 1 hypervisor runs natively on the hardware, whereas a Type 2 hypervisor is not the lowest layer of software and gets hosted on an OS. Xen is a Type 1 hypervisor. Earlier, we mentioned virtual processors (also known as virtual machines). In Xen, these are referred to as domains. The most privileged domain is called Dom0; the unprivileged guest domains are DomU domains.
Dom0 is the initial domain that the Xen hypervisor creates upon booting. It is privileged and drives the devices on the platform. Xen virtualizes CPUs, memory, interrupts and timers, providing virtual machines with one or more virtual CPUs, a portion of the memory of the system, a virtual interrupt controller and a virtual timer. Unless configured otherwise, Dom0 will get direct access to all devices and drive them. Dom0 also runs a set of drivers called paravirtualized (PV) back ends to give the unprivileged virtual machines access to disk, network and so on. Xen provides all the tools for discovery and initial communication setup. The OS running as DomU gets access to a set of generic virtual devices by running the corresponding PV front-end drivers.
A single back end can service multiple front ends, depending on how many DomUs there are. A pair of PV drivers exists for all of the most common device classes (disk, network, console, frame buffer, mouse, keyboard, etc.). The PV drivers usually live in the OS kernel, e.g., Linux. A few PV back ends can also run in user space, usually in QEMU. The front ends connect to the back ends using a simple ring protocol over a shared page in memory.
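To make the ring idea above more concrete, here is a toy model in Python. It is not Xen's actual ring implementation: the names `SharedRing`, `push_request`, and `pop_request` are hypothetical, and a Python list stands in for the shared memory page. It only illustrates how a front end and a back end can exchange requests through a fixed-size buffer using free-running producer and consumer indices.

```python
class SharedRing:
    """Toy single-producer/single-consumer ring, standing in for a shared page."""

    def __init__(self, size=8):
        # A power-of-two size lets free-running indices wrap with a simple mask.
        assert size > 0 and size & (size - 1) == 0
        self.size = size
        self.slots = [None] * size
        self.prod = 0  # advanced only by the front end
        self.cons = 0  # advanced only by the back end

    def push_request(self, req):
        """Front end: queue a request; return False if the ring is full."""
        if self.prod - self.cons == self.size:
            return False
        self.slots[self.prod & (self.size - 1)] = req
        self.prod += 1
        return True

    def pop_request(self):
        """Back end: take the oldest request, or None if the ring is empty."""
        if self.cons == self.prod:
            return None
        req = self.slots[self.cons & (self.size - 1)]
        self.cons += 1
        return req


# Hypothetical usage: a 4-slot ring accepts four requests and rejects the fifth.
ring = SharedRing(size=4)
accepted = [ring.push_request(("write_fb", n)) for n in range(5)]
served = [ring.pop_request() for _ in range(5)]
```

Because each index is written by only one side, the two sides never update the same counter, which is part of what keeps this style of protocol simple.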
The processing contexts of the Doom-on-Zynq UltraScale+ MPSoC are like an onion, with many layers. In the Cortex-A53 cluster are the four ARMv8 cores. On each core, the hypervisor runs in EL2, and the guests (Dom0 or DomU) run in EL0/EL1. Each DomU guest runs Linux; Doom (PrBoom) runs in the user space. Doom uses the Simple Direct Media Layer (SDL), which talks to a frame buffer frontend driver via the SVC instruction (eventually). The frame buffer front end writes the buffer into a shared memory area set up by Dom0. The front-end driver communicates with virtualization code running on Dom0 via a protocol such as Xen Bus or VirtIO using the HVC instruction (eventually). The virtualization code running on Dom0 provides a back end for display which then is encoded by the virtualization code’s VNC server and sent over a network to a VNC client.
This information and the demo should provide a good foundation for further hypervisor study and experimentation. After you are able to run the demo in emulation on QEMU, you can use PetaLinux Tools to run it on Zynq UltraScale+ MPSoC silicon. For more great developer resources, visit Xilinx’s Software Developer Zone.
Note: This article was abstracted from a much longer article that appeared in Xcell Software Journal, Issue 3.
In addition to the Zynq SoC, the 56 x 50mm SOM board carries:
Here’s a photo of the board:
Please contact Axonim Devices for more information about this SOM.
By Adam Taylor
We are leaving Embedded Vision for a while (we will come back to it, as it is a wide topic) and will now look at another way to use the Zynq SoC’s PS (Processor System) and PL (Programmable Logic) sections by employing National Instruments’ (NI’s) LabVIEW and high-level synthesis. To do this, we will be using NI’s LabVIEW RIO Evaluation Kit, which is based on a Zynq Z7020 SoC. NI’s RIO is supported by NI’s LabVIEW Real Time application and LabVIEW FPGA. Over the next few weeks we will be creating designs using this framework.
At this point I should add that developing for the Zynq this way is new to me, so it will be interesting for me to learn how to develop designs using this approach along with you over the next few blogs.
You can choose from one of three possible development methods for NI’s RIO Evaluation kit as shown below:
Available Development Frameworks for NI’s RIO Evaluation Kit
These three different methods allow us to develop our system totally within the LabVIEW framework, entirely in C/C++, or using a combination of LabVIEW and C/C++. This flexibility allows us to pick the best approach for the particular application requirements of each new project. It’s really nice to have that flexibility.
I find the RIO evaluation kit architecture interesting. The kit is based on NI’s sbRIO-9637, where the “sb” stands for “single board.” The sbRIO-9637 provides USB and SD Card interfaces and it uses the Zynq SoC’s EMIO extension into the PL to provide a number of hardware interfaces including CAN, RS232, RS485, and GigE. (The image below shows the board architecture.) The board also uses the Zynq SoC’s XADC to provide a number of analog inputs as well as four analog outputs and 28 digital IO lines from the Zynq SoC’s PL.
sbRIO-9637 Board Architecture
The RIO Development Kit combines the sbRIO-9637 board with a demo board that contains the following:
Top of the NI RIO Evaluation Kit board showing LCD, etc.
NI RIO Evaluation Kit Board arrangement – Boards are connected via MIO and DIO connectors
These I/O resources should allow us to develop some pretty interesting applications that will familiarize us with the development framework.
Because the LabVIEW RIO development framework differs significantly from what we have used before, the first thing we need to do to get this kit up and running is install the software. The kit includes two DVDs: the first is the evaluation kit software and needs to be installed before we can develop designs for the board; the second is the LabVIEW FPGA Xilinx tools DVD.
Now here is where it gets really cool. We can either install the second DVD or we can use the cloud-based FPGA compile service (which is required when using Windows 8 or 10). My internet connection is pretty slow, so I will install the second DVD and run all of the software locally, although I promise to try the cloud compile at least once, if possible, to see how it works.
Installing the first disk is pretty simple. We need to ensure the evaluation kit is connected to our router so it can be validated and so that we obtain the IP address, which we will need for future developments.
Once the software is installed, we’ll look at creating our first application—which I will address next time.
The code is available on Github as always.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
You also can find links to all the previous MicroZed Chronicles blogs on my own Web site, here.
This week at the FCCM 2016 Symposium (the 24th IEEE International Symposium on Field-Programmable Custom Computing Machines) in Washington DC, BittWare and Atomic Rules demonstrated the Atomic Rules UDP Offload Engine (UOE) IP core running on BittWare’s XUSP3S PCIe board operating at 25GbE data rates with no packet loss. The BittWare XUSP3S PCIe board is based on Xilinx’s Virtex UltraScale FPGA. The Atomic Rules UOE IP core operates at 10/25 GbE data rates and, according to the press release, is upgradable to 50/100 GbE on newer FPGA platforms.
Here’s a short video with a demo of the core made by Atomic Rules. (Be certain to note the short, unsolicited testimonial to the bulletproof nature of the Xilinx GTY 30.5Gbps SerDes transceivers.)
Atomic Rules’ UOE IP core implements the UDP standard RFC 768 including checksum, segmentation, and reassembly hardware offload—which moves much of the work described in RFC 768 from software to hardware to accommodate 25, 50, and 100 GbE line rates. The UOE IP core supports concurrent transmission and reception of application-level UDP datagrams over a LAN or across a network. An integral IGMPv2 multicast pre-selector removes unwanted traffic.
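For readers who want to see what the checksum part of that offload involves, the following is a software reference sketch of the RFC 768 checksum over IPv4. It says nothing about how the Atomic Rules core implements it in hardware; the function names and the example addresses and ports are hypothetical.

```python
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum; odd-length input is zero-padded."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def udp_checksum(src_ip: bytes, dst_ip: bytes, src_port: int,
                 dst_port: int, payload: bytes) -> int:
    """RFC 768 checksum over the IPv4 pseudo-header, UDP header, and data."""
    length = 8 + len(payload)                      # UDP header is 8 bytes
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, length)  # 17 = UDP
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    csum = ~ones_complement_sum(pseudo + header + payload) & 0xFFFF
    return csum or 0xFFFF  # a computed zero is transmitted as all ones


# Hypothetical example datagram.
example_checksum = udp_checksum(bytes([192, 168, 0, 1]), bytes([192, 168, 0, 2]),
                                5000, 6000, b"hello")
```

A receiver can verify a datagram by summing the same fields with the received checksum in place; a valid datagram sums to 0xFFFF.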
Atomic Rules UOE IP Core
BittWare announced the XUSP3R at the same event: a 3/4-length PCIe board based on a Xilinx Virtex UltraScale VU190 FPGA with as many as four Gen3 x8 PCIe interfaces and four front-panel QSFP28 cages supporting a variety of Ethernet combinations up to 400Gbps. Four on-board DIMM sockets support large memory configurations. Each DIMM slot accommodates 64GBytes of 72-bit DDR4 SDRAM with ECC, 144Mbits of QDR-IV SRAM, or 576Mbits of QDR-II+ SRAM. According to the press release, the board can also accommodate an optional and independent Hybrid Memory Cube (HMC) module with capacities to 4GBytes.
BittWare XUSP3R PCIe Networking Card based on a Xilinx Virtex UltraScale VU190 FPGA
For more information on BittWare’s XUSP3S Networking board, see “BittWare’s XUSP3S PCIe Network Card uses UltraScale FPGAs to support four 100 GbE or sixteen 25/10 GbE ports. More on the way.”
Earlier this week at the Embedded Vision Summit in Santa Clara, QuickPlay and Auviz Systems demonstrated an FPGA-accelerated visual color-detection application developed using a rapid development flow that combines the previously announced QuickPlay development environment (see “A Novel Approach to Software-Defined FPGA Computing”) with Auviz middleware IP. This demo ran on an XpressKUS PCIe board from ReFLEX CES, which is based on a Xilinx Kintex UltraScale KU060 FPGA.
The two companies have announced that they are teaming up to provide a next-generation, software-defined development environment for FPGA-accelerated vision applications.
The Auviz Video Content Analysis Platform, AuvizVCA, runs on an FPGA and employs semantic segmentation to deliver accurate image detection and classification at better than 30 frames/sec for as many as 21 object classes. However, there’s no need to understand the underlying FPGA programming to get the benefit of the fast FPGA hardware. AuvizVCA is implemented as an FPGA-optimized OpenCL kernel that’s invoked through high-level language calls on a host processor. During execution, AuvizVCA invokes AuvizDNN, an optimized library of Deep Neural Network functions. (See “Machine Learning in the Cloud: Deep Neural Networks on FPGAs” by Auviz’ Nagesh Gupta.)
Currently, AuvizVCA is running on an Alpha Data ADM-PCIE-7V3 PCIe board based on a Xilinx Virtex-7 690T FPGA. (See “Alpha Data showcases PCIe accelerator card for HPC based on Kintex UltraScale FPGA at SC14”.) Configurations are generated using the Xilinx SDAccel Development Environment. Planned future releases of AuvizVCA will support the newer Xilinx UltraScale All Programmable device families.
Thanks to Dave Jones’ EEVBlog teardown of NI’s (National Instruments’) greatly enhanced VB-8034 VirtualBench All-in-One instrument (DSO, logic analyzer, 5½-digit DMM, arbitrary waveform generator, programmable digital I/O, and power supply in one box), we know a lot more about the high-quality instrument’s internal design. The VB-8034’s DSO has four 350MHz channels in contrast to its predecessor’s two 100MHz channels. Back when NI announced the VB-8034, all I knew was that it was based on a Xilinx Zynq All Programmable SoC like its predecessor, the VB-8012. After watching Dave’s 40-minute teardown video, we now know that the waveform capture and digital processing are performed by a pair of Xilinx devices: A Zynq Z7020 SoC and a Kintex-7 160T FPGA.
National Instruments VB-8034 enhanced VirtualBench All-in-One instrument
Here’s the EEVblog teardown video for the NI VB-8034:
One of Dave’s high-resolution photos shows the two Xilinx devices on the VB-8034’s main capture and processing board:
Once Dave wipes the last traces of heat-sink compound from the device appearing in the center of the image, we see it’s a Kintex-7 FPGA, which is obviously handling the DSO waveform capture, stowing digitized samples from the two flanking National Semiconductor dual 1.5Gsamples/sec 8-bit ADCs into four nearby 1Gbit DDR3 SDRAMs (two on the top and two on the bottom of the board) in real time. The device in the lower left of the image is a Zynq Z7020 SoC, which is handling the overall instrument control and the instrument’s GUI and USB/Ethernet/WiFi I/O.
We can again see the benefit of a Xilinx-based platform design in the design of the NI VB-8034. The original VB-8012 instrument was based on the Zynq Z7020 but the VB-8034 has a significantly enhanced DSO (see the specs for these as well as other enhancements). So NI was able to leverage the existing Zynq-based VB-8012 design for a lot of the new instrument but then added the Kintex-7 FPGA for the much beefier DSO (3.5x more bandwidth, 2x the channels).
For more information about NI’s VirtualBench All-in-One instruments, see:
By Jeremy Banks, Product Manager, Curtiss-Wright and Jim Everett, Xilinx
A new mezzanine card standard called FMC+, an important development for embedded computing designs using FPGAs and high-speed I/O, will extend the total number of gigabit transceivers (GTs) in a card from 10 to 32 and increase the maximum data rate from 10 to 28 Gbits per second while maintaining backward compatibility with the current FMC standard.
These capabilities mesh nicely with new devices such as those using the JESD204B serial interface standard, as well as 10G and 40G fiber optics and high-speed serial memory. FMC+ addresses the most challenging I/O requirements, offering developers the best of two worlds: the flexibility of a mezzanine card with the I/O density of a monolithic design.
The FMC+ specification has been developed and refined over the last year. The VITA 57.4 working group has approved the spec and will present it for ANSI balloting in early 2016. Let’s take a closer look at this important new standard to see its implications for advanced embedded design.
The Mezzanine Card Advantage
Mezzanine cards are an effective and widely used way to add specialized functions to an embedded system. Because they attach to a base or carrier card, rather than plugging directly into a backplane, mezzanine cards can be easily changed. For system designers, this means both configuration flexibility and an easier path to technology upgrades.
However, this flexibility usually comes at the cost of functionality due to either connectivity issues or the extra real estate needed to fit on the board. For FPGAs, the primary open standard is ANSI/VITA 57.1, otherwise known as the FPGA Mezzanine Card (FMC) specification. A new version dubbed FMC+ (or, more formally, VITA 57.4) extends the capabilities of the current FMC standard with a major enhancement to gigabit serial interface functionality.
FMC+ addresses many of the drawbacks of mezzanine-based I/O, compared with monolithic solutions, simultaneously delivering both flexibility and performance. At the same time, the FMC+ standard stays true to the FMC history and its installed base by supporting backward compatibility.
The FMC standard defines a small-format mezzanine card, similar in width and height to the long-established XMCs or PMCs, but about half the length. This means FMCs have less component real estate than open-standard formats. However, FMCs do not need bus interfaces, such as PCI-X, which often take a considerable amount of board real estate. Instead, FMCs have direct I/O to the host FPGA, with simplified power supply requirements. This means that despite their size, FMCs could actually have more I/O capacity than their XMC counterparts. As with the PMC and XMC specification, FMC and FMC+ define options for both air and conduction cooling, thereby serving both benign and rugged applications in commercial and defense markets.
The anatomy of the FMC specification is simple. The standard allows for up to 160 single-ended or 80 differential parallel I/O signals for high-pin-count (HPC) designs or half that number for low-pin-count (LPC) variants. Up to 10 full-duplex GT connections are specified. The GTs are useful for fiber optics or other serial interfaces. In addition, the FMC specification defines key clock signals. All of this I/O is optional, though most hosts now support the full connectivity.
The FMC specification also defines a mix of power inputs, though the primary power supply, defined by the mezzanine, is supplied by the host. This approach works by partially powering up the mezzanine such that the host can interrogate the FMC, which responds by defining a voltage range for the VADJ. Assuming the host can provide this range, then all should be well. Not having the primary regulation on the mezzanine saves space and reduces mezzanine power dissipation.
FMCs for Analog I/O
Designers can use FMCs for any function that they might want to connect to an FPGA, such as digital I/O, fiber optics, control interfaces, memory or additional processing. But analog I/O is the most common use for FMC technology. The FMC specification affords a great deal of scope for fast, high-resolution I/O, but there are still trade-offs, especially with high-speed parts using parallel interfaces.
For example, Texas Instruments’ ADC12D2000RF dual-channel, 2Gsps 12-bit ADCs use a 1:4 multiplexed bus interface, so the bus speed is not too fast for the host FPGA. The digital data interface alone requires 96 signals (48 LVDS pairs). For a device of this class, FMC can support only one of these ADCs, even if there is sufficient space for more, because it is limited to 160 signals. Lower-resolution devices, even at higher speeds, such as those with 8-bit data paths, may allow more channels even with the increased requirements of the front-end analog coupling of the baluns or amplifiers, clocking and the like.
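The pin budget behind that conclusion is simple arithmetic, sketched below using only the numbers quoted above (160 parallel signals, 48 LVDS pairs per ADC). The variable names are just for illustration, and clock or control signals are ignored.

```python
# Figures taken from the text above.
FMC_HPC_SIGNALS = 160        # single-ended parallel I/O signals on an HPC FMC
ADC_DATA_LVDS_PAIRS = 48     # data interface of one dual-channel ADC

# Each LVDS pair occupies two single-ended signals.
signals_per_adc = ADC_DATA_LVDS_PAIRS * 2

# How many complete ADC data interfaces fit, and what is left over.
adcs_that_fit = FMC_HPC_SIGNALS // signals_per_adc
signals_left = FMC_HPC_SIGNALS - adcs_that_fit * signals_per_adc

# Raw sample data rate of one such device: 2 channels x 2 Gsps x 12 bits.
adc_throughput_gbps = 2 * 2 * 12

print(signals_per_adc, adcs_that_fit, signals_left, adc_throughput_gbps)
```

With these inputs the data interface takes 96 signals, one ADC fits, 64 signals remain (not enough for a second 96-signal interface), and the device produces 48Gbps of sample data.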
The FMC specification starts to run out of steam with analog interfaces delivering more than 8 bits of resolution at around 5 or 6Gsps (throughputs of > 50Gbps) using parallel interfaces. From a market perspective, leading FMCs based on channel density, speed and resolution are in the 25 to 50Gbps throughput range. This functionality results from a trade-off between physical package sizes and available connectivity to the host FPGA.
In addition to the parallel connections, the FMC specification supports up to 10 full-duplex high-speed serial (GT) links. These interfaces are useful for such functionality as fiber-optic I/O, Ethernet, emerging technologies like Hybrid Memory Cube (HMC) and the Bandwidth Engine, and newer-generation analog I/O devices that use the JESD204B interface.
Although the JESD204 serial-interface standard, currently at revision “B,” has been around for a while, only recently has it gained wider market penetration and become the serial interface of choice for newer generations of high-sampling data converters. This wide adoption has been stoked by the telecommunications industry’s thirst for ever-smaller, lower-power and lower-cost devices.
As mentioned earlier, a dual-channel 2-Gsps, 12-bit ADC with a parallel interface requires a large number of I/O signals. This requirement directly impacts the package size, in this case mandating a 292-pin package measuring roughly 27 x 27 mm (though newer-generation pin geometry could shrink the package size to something less than 20 x 20 mm).
A JESD204B-connected equivalent device can be provided in a 68-pin, 10 x 10-mm package—with reduced power. This dramatic reduction in package size marries well with evolving FPGAs, which are providing ever more GT links at higher and higher speeds. Figure 1 illustrates an example of package size and FMC/FMC+ board size.
Typical high-speed ADCs and DACs using the JESD204B interface have between one and eight GT links operating at 3 to 12Gbps each, depending on the data throughput required based on sample rate, resolution and number of analog I/O channels.
The FMC specification defines a relatively small mezzanine card, but with the emergence of JESD204B devices there is room to fit more parts onto the available real estate. The maximum of 10 GT links defined by the FMC specification is a useful quantity; even this limited number of GT links provide 80Gbps or more of throughput while using a fraction of the pins otherwise required for parallel I/O.
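One way to arrive at a figure like 80Gbps is to assume 10Gbps per GT link and the 8b/10b line coding that JESD204B uses, which leaves 8 payload bits for every 10 bits on the wire. That reading is an assumption rather than something the text spells out. The sketch below applies it, and also estimates how many lanes a converter would need; the helper names are hypothetical, and JESD204B framing details (such as padding samples to 16-bit words) are ignored unless you pass the padded width as `bits`.

```python
import math
from fractions import Fraction

# 8b/10b coding: 8 payload bits per 10 line bits.
CODING_EFFICIENCY = Fraction(8, 10)


def payload_gbps(line_rate_gbps: int, lanes: int) -> Fraction:
    """Aggregate payload throughput of `lanes` links after 8b/10b overhead."""
    return line_rate_gbps * lanes * CODING_EFFICIENCY


def lanes_needed(sample_rate_gsps: int, bits: int, channels: int,
                 line_rate_gbps: int) -> int:
    """Minimum lane count to carry the raw sample data of a converter."""
    required = sample_rate_gsps * bits * channels
    return math.ceil(required / (line_rate_gbps * CODING_EFFICIENCY))


fmc_payload = payload_gbps(line_rate_gbps=10, lanes=10)
dual_adc_lanes = lanes_needed(sample_rate_gsps=2, bits=12, channels=2,
                              line_rate_gbps=12)
print(float(fmc_payload), dual_adc_lanes)
```

Under these assumptions, ten 10Gbps links carry 80Gbps of payload, and the dual-channel 2Gsps, 12-bit example from earlier would need five 12Gbps lanes for its raw sample data.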
The emergence of serially connected I/O devices, not just those using JESD204B, does have drawbacks for some application segments in electronic warfare, such as digital radio frequency memory (DRFM). Serial interfaces invariably introduce additional latency due to longer data pipelines. For DRFM applications, latency for data-in to data-out is a fundamental performance parameter. Although latency is likely to vary widely between serially connected devices, new generations of devices will push data through the pipelines faster and faster, with some promising the ability to tune the depth of the pipeline. It remains to be seen how much of an improvement is to be realized.
Some standard ADC devices sampling at >1Gsps today have latency below 100 nanoseconds. Other applications can tolerate this latency, or do not care about it, including software-defined radio (SDR), radar warning receivers and other SIGINT segments. These applications gain large advantages by using a new generation of RF ADCs and DACs, a technology driven by the mass-market telecommunications infrastructure.
Outside of the FPGA community, newer DSP devices are also starting to adopt JESD204B. However, FPGAs are likely to remain the stronghold in taking full advantage of the capabilities of wideband analog I/O devices. That’s because FPGAs can deal with vast data volumes with better parallelization.
The Evolution of FMC+
To move FMC to the next level, the VITA 57.4 working group has created a specification with an increased number of GT links operating at increased speed. FMC+ maintains full FMC backward compatibility by adding to the FMC connector’s outer columns for the additional signals and not changing any of the board profiles or mechanics.
The additional rows will be part of an enhanced connector that will minimize any impact on available real estate. The FMC+ specification increases the maximum number of available GT links from 10 to 24, with the option of adding another eight links, for a total of 32 full duplex. The additional links use a separate connector, referred to as an HSPCe (HSPC being the main connector). Table 1 summarizes FMC and FMC+ connectivity.
Multiple independent signal integrity teams characterized and validated the higher 28Gbps data rate. The maximum full-duplex throughput can now exceed 900Gbps in each direction, when the parallel interface is included. See Figure 2 for an outline of the net throughputs that can be expected for digitizer solutions supporting the different capabilities of FMC and FMC+.
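A quick back-of-the-envelope check shows where a number near 900Gbps comes from. The sketch below multiplies link counts by the maximum line rates quoted in this article; these are raw line rates per direction, before any encoding overhead, and the dictionary layout is just for illustration.

```python
# (links, max line rate in Gbps), as quoted in this article.
CONFIGS = {
    "FMC": (10, 10),
    "FMC+ main connector": (24, 28),
    "FMC+ with HSPCe": (32, 28),
}

# Raw serial throughput per direction for each configuration.
raw_gbps = {name: links * rate for name, (links, rate) in CONFIGS.items()}

# Contribution the parallel interface must add to pass 900Gbps.
parallel_needed = 900 - raw_gbps["FMC+ with HSPCe"]

for name, gbps in raw_gbps.items():
    print(f"{name}: {gbps} Gbps")
```

The 32-link configuration reaches 896Gbps on the serial links alone, so only about 4Gbps from the parallel interface is needed to pass the 900Gbps mark.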
Designers can use the increased throughput enabled by FMC+ to take advantage of new devices that offer huge I/O bandwidth. There will still be trade-offs, such as how many devices can fit on the mezzanine’s real estate budget, but for a moderate number of channels the realizable throughput is a huge leap over today’s FMC specification.
In the next few years, it is reasonable to expect high-resolution ADCs and DACs to break through the 10Gsps barrier to support very wideband communications with direct RF sampling for L-, S-, and even C-band frequencies. Below 10Gsps, converters are emerging with 12-, 14-, and even 16-bit resolutions, with some supporting multiple channels. The majority of these devices will be using JESD204B (or a newer revision) signaling with 12Gbps channels until newer generations inevitably boost this speed even further. These fast-moving advances are fueled by the telecommunications industry, but the military community can take advantage of them to meet SWAP-C requirements.
Other Advantages and Uses of FMC+
Although FMC+, like FMC, is likely to be dominated by ADC, DAC and transceiver products, the increased GT density provided by FPGAs makes it useful for other functions. Two functions of note are fiber optics and new serial memories.
As with JESD204B, there are requirements for faster, denser fiber optics. Those based on fiber-optic ribbon cables offer the smallest parts. Because the FMC+ footprint readily supports 24 full-duplex fiber-optic links, this application is likely where the higher speeds supported by FMC+ will first be realized. Bandwidths of 28Gbps per fiber will take the throughputs quickly past 100G and 400G speeds on a single mezzanine. Optical throughput of 100G is emerging today on the current FMC format.
Another emerging area suitable for FMC+ is serial memory such as Hybrid Memory Cube and MoSys’ Bandwidth Engine. These novel devices represent an entirely new category of high-performance memory, delivering unprecedented system performance and bandwidth by utilizing GT connectivity. (Xcell Journal issue 88 examines these new memory types.)
A new generation of the FMC specification has been introduced and is adapting to new technology driven by serial connected devices. Key players in the FMC industry have already begun adopting this specification. Figure 3 shows the first Xilinx demonstration board featuring FMC+, the KCU114 based on a Xilinx Kintex UltraScale FPGA. The FMC standard, through its new incarnation FMC+, is alive and kicking and is prepared for the next generation of high-performance, FPGA-driven applications.
Note: This blog post originally appeared in the latest issue of Xcell Journal, Issue 94. For the full article, see the full issue of Xcell Journal, Issue 94.
Last month at the NAB 2016 show in Las Vegas, Omnitek announced the Ultra XR Advanced 4K/UHD Waveform Analyzer, designed for colorists, post-production editors, and other content creatives preparing material for 4K/UHD distribution. Like the company’s Ultra 4K Tool Box, the Ultra XR Advanced 4K/UHD Waveform Analyzer is based on a Xilinx Zynq Z7045 All Programmable SoC. Key features of the Ultra XR Advanced 4K/UHD Waveform Analyzer include:
Omnitek Ultra XR Advanced 4K/UHD Waveform Analyzer
Here's a 2-minute video with a short demo of the new product:
Though I know I’m repeating myself, the Omnitek Ultra XR Advanced 4K/UHD Waveform Analyzer is yet another example of a Xilinx All Programmable device serving as a flexible design platform for a range of products or even multiple product lines. In addition, this sort of hardware/software programmable platform allows you to add features at will with no change in your BOM or BOM cost.
By William D. Richard, Associate Professor, Washington University in St. Louis
Using the low-voltage differential signaling (LVDS) inputs on a modern Xilinx FPGA, it is possible to digitize an analog input signal with nothing but one resistor and one capacitor. Since hundreds of LVDS inputs reside on a current-generation Xilinx device, it is theoretically possible to digitize hundreds of analog signals with a single FPGA.
Our team recently explored one corner of the possible design space by digitizing a band-limited input signal with a 3.75MHz center frequency with 5 bits of resolution while investigating options for digitizing the signals from a 128-element linear ultrasound array transducer. Let’s take a look at the details of that demonstration project.
In 2009, Xilinx introduced a LogiCORE soft IP core that, along with an external comparator, one resistor and one capacitor, implements an analog-to-digital converter (ADC) capable of digitizing inputs with frequencies up to 1.205 kHz. Using an FPGA’s LVDS inputs instead of an external comparator, in conjunction with a delta modulator ADC architecture, it is possible to digitize much higher-frequency analog input signals with just one resistor and one capacitor.
ADC Topology and Experimental Platform
The block diagram of a one-channel delta modulator ADC implemented using the LVDS inputs on a Xilinx FPGA is shown in Figure 1. Here, the analog input drives the noninverting LVDS_33 buffer input, and the input signal range is essentially 0 to 3.3 volts. The output of the LVDS_33 buffer is sampled at a clock frequency much higher than the input analog signal frequency and fed back through an LVCMOS33 output buffer and an external, first-order RC filter to the inverting LVDS_33 buffer input. With just this circuitry, the feedback signal, given an appropriate selection of clock frequency (F), resistance (R) and capacitance (C), will track the input analog signal.
As an example, Figure 2 shows an input signal in yellow (channel 1) and the feedback signal in blue (channel 2) for F = 240MHz, R = 2K and C = 47 pF. The input signal shown was produced by an Agilent 33250A function generator using its 200MHz, 12-bit, arbitrary output function capability. The Fourier transform of the input signal as computed by the Tektronix DPO 3054 oscilloscope we used is shown in red (channel M). At these frequencies, the input capacitance of the oscilloscope probe (as well as grounding issues) did degrade the integrity of the feedback signal shown in the oscilloscope trace, but Figure 2 does illustrate operation of the circuit.
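The tracking behavior can be explored with a small idealized simulation using the F, R and C values above. This is a sketch, not the authors' model: it treats the LVDS buffer as a perfect comparator, treats the LVCMOS33 output as an ideal 0V/3.3V source, and assumes the 1Vpp input is biased at mid-rail (1.65V), which the text does not state.

```python
import math

F_CLK = 240e6     # sampling clock (Hz)
R = 2e3           # feedback resistor (ohms)
C = 47e-12        # feedback capacitor (farads)
VCC = 3.3         # LVCMOS33 output swing (volts)
F_IN = 3.75e6     # input sine frequency (Hz)
V_BIAS = 1.65     # ASSUMPTION: input biased at mid-rail


def delta_modulate(samples, vfb=V_BIAS):
    """Return the 1-bit stream and the RC-filtered feedback voltage."""
    # Fraction of the remaining gap the RC filter closes in one clock period.
    alpha = 1.0 - math.exp(-1.0 / (F_CLK * R * C))
    bits, feedback = [], []
    for vin in samples:
        bit = 1 if vin > vfb else 0        # ideal comparator, sampled at F_CLK
        vfb += (VCC * bit - vfb) * alpha   # first-order RC step toward 0V or 3.3V
        bits.append(bit)
        feedback.append(vfb)
    return bits, feedback


# Twenty cycles of a 1Vpp sine: 240MHz / 3.75MHz = 64 samples per cycle.
n_samples = 64 * 20
vin = [V_BIAS + 0.5 * math.sin(2 * math.pi * F_IN * n / F_CLK)
       for n in range(n_samples)]
bits, feedback = delta_modulate(vin)
worst_error = max(abs(a - b) for a, b in zip(vin, feedback))
```

In this idealized model the RC time constant is 94ns (about 22.6 clock periods), and the feedback voltage stays within a fraction of a volt of the input, which is the tracking behavior the scope trace illustrates.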
We defined the band-limited input signal shown in Figure 2 by applying a Blackman-Nuttall window to a 1Vpp 3.75MHz sine wave. While the noise floor associated with the theoretical windowed signal is almost 100 dB below the amplitude associated with the center frequency, the 200MHz sample frequency and 12-bit resolution of the Agilent 33250A function generator result in a far-less-ideal demonstration signal. The output signals produced by many ultrasound transducers with center frequencies near 3.75MHz are naturally band-limited, due to the mechanical properties of the transducers, and are therefore ideal signal sources for use with this approach.
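A signal of this kind can be sketched in a few lines of Python. The four window coefficients below are the standard Blackman-Nuttall values; the burst length of 2,001 points and the helper names are assumptions for illustration only, since the length of the waveform loaded into the function generator is not stated here.

```python
import math

# Standard Blackman-Nuttall window coefficients.
A = (0.3635819, 0.4891775, 0.1365995, 0.0106411)

def blackman_nuttall(n_points):
    """Return an n_points-long Blackman-Nuttall window (n_points >= 2)."""
    window = []
    for n in range(n_points):
        x = 2 * math.pi * n / (n_points - 1)
        window.append(A[0] - A[1] * math.cos(x)
                      + A[2] * math.cos(2 * x) - A[3] * math.cos(3 * x))
    return window

def windowed_tone(n_points=2001, f_tone=3.75e6, f_sample=200e6, v_pp=1.0):
    """Sine burst shaped by the window; n_points is an assumed length."""
    w = blackman_nuttall(n_points)
    return [0.5 * v_pp * w[n] * math.sin(2 * math.pi * f_tone * n / f_sample)
            for n in range(n_points)]

w = blackman_nuttall(2001)
burst = windowed_tone()
print(f"window at edge: {w[0]:.6f}, at center: {w[1000]:.6f}")
print(f"burst peak magnitude: {max(abs(v) for v in burst):.3f} V")
```

The window tapers the burst to nearly zero at both ends, which is what confines its spectrum to a narrow band around 3.75MHz.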
We obtained the plot shown in Figure 2 using a Digilent Cmod S6 development module with a Xilinx Spartan-6 XC6SLX4 FPGA mounted on a small, custom printed-circuit board with eight R/C networks and input connectors, allowing the prototype system to digitize up to eight signals simultaneously.
Each channel was parallel-terminated with 50 ohms to ground to properly terminate the coaxial cable from the signal generator. It is important to note that to achieve this performance, we set the drive strength of the LVCMOS33 buffers to 24 mA and the slew rate to FAST, as documented in the example VHDL source in Figure 5.
The custom prototype board also supported the use of an FTDI FT2232H USB 2.0 Mini-Module that we used to transfer packetized serial bitstreams to a host PC for analysis. Figure 3 shows the magnitude of the Fourier transform of the bitstream the prototype board produced when fed the analog signal of Figure 2. Peaks associated with subharmonics of the 240MHz sampling frequency are clearly visible, along with a peak at 3.75MHz associated with the input signal.
Large Number of Taps
By applying a bandpass finite impulse response (FIR) filter to the bitstream, it is possible to produce an N-bit binary representation of the analog input signal: the ADC output. Because the digital bitstream runs at a much higher frequency than the analog input signal, the FIR filter needs a large number of taps. The data being filtered only has values of zero (0) and one (1), however, so multipliers are not needed (only adders to sum the FIR filter coefficients).
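The multiplier-free property can be sketched in Python as follows. Each output is simply the sum of the coefficients whose corresponding bit is 1. The function name and the three-tap coefficient list are hypothetical and serve only to show the arithmetic; they are not the coefficients of the filter used in the project.

```python
def fir_one_bit(bits, coeffs):
    """FIR-filter a 0/1 bitstream using additions only.

    Computes y[n] = sum_k coeffs[k] * bits[n - k] for every n at which
    the full filter window fits inside the bitstream.
    """
    n_taps = len(coeffs)
    out = []
    for n in range(n_taps - 1, len(bits)):
        acc = 0
        for k in range(n_taps):
            if bits[n - k]:        # a 1 bit selects the coefficient...
                acc += coeffs[k]   # ...which is added, never multiplied
        out.append(acc)
    return out

# Hypothetical toy data for illustration.
print(fir_one_bit([1, 0, 1, 1, 0], [1, 2, 3]))  # [4, 3, 5]
```

In hardware, the inner loop corresponds to an adder tree whose inputs are gated by the stored bits.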
The ADC output shown in Figure 4 was produced on the host PC using an 801-tap bandpass filter centered at 3.75MHz that we designed using the free, online TFilter FIR filter design tool. This filter had 36 dB or more of attenuation outside the 2.5MHz to 5MHz passband and 0.58 dB of ripple between 3 and 4.5MHz.
The ADC output signal shown in Figure 4 has a resolution of approximately 5 bits. This is ultimately a function of the oversampling rate, and you can achieve higher resolution with designs optimized for lower input frequencies.
The ADC output signal shown in Figure 4 is also severely oversampled at 240MHz and can be decimated to reduce the ADC output bandwidth. In a hardware implementation of the bandpass filter and decimation blocks, it would be possible to compute only every 16th filter output value when decimating by a factor of 16 down to an effective sample rate of 15MHz (three times the highest frequency in the band-limited input signal), reducing the hardware requirements.
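The saving comes from skipping the discarded outputs entirely rather than filtering at the full rate and throwing samples away. A minimal Python sketch, with a hypothetical function name and a worst-case addition count that assumes every bit in the window is 1:

```python
def fir_one_bit_decimated(bits, coeffs, factor=16):
    """Compute only every `factor`-th output of a 1-bit FIR filter."""
    n_taps = len(coeffs)
    out = []
    # Step the output index by `factor`; skipped outputs are never computed.
    for n in range(n_taps - 1, len(bits), factor):
        acc = 0
        for k in range(n_taps):
            if bits[n - k]:
                acc += coeffs[k]
        out.append(acc)
    return out

F_CLK = 240e6   # bitstream rate from the article
N_TAPS = 801    # filter length from the article
FACTOR = 16     # decimation factor from the article

rate_out = F_CLK / FACTOR
print(f"output rate: {rate_out / 1e6:.0f} MHz")
print(f"worst-case additions/s at full rate: {N_TAPS * F_CLK:.3e}")
print(f"worst-case additions/s decimated:    {N_TAPS * rate_out:.3e}")
```

The decimated filter produces exactly the samples that a full-rate filter followed by a keep-one-in-16 stage would, at one-sixteenth of the arithmetic.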
Figure 5 shows the VHDL source used with the Digilent Cmod S6 development module to produce the feedback signal shown in Figure 2, along with the bitstream data associated with the Fourier transform of Figure 3. An LVDS_33 input buffer is instantiated directly and connected to the analog input and feedback signals, sigin_p and sigin_n, respectively. The internal signal sig is driven by the output of the LVDS_33 buffer and sampled by the implied flip-flop to produce sigout. The signal sigout is the serial bitstream that is filtered to produce the N-bit ADC output. We used the free Xilinx ISE Webpack tools to implement the project.
Figure 5 shows the VHDL code and the portion of the UCF file associated with the circuitry of Figure 1.
Low Component Count
The ADC architecture we have described has been inaccurately referred to in several recent articles as a delta-sigma architecture. But while true delta-sigma ADCs have advantages, the simplicity of this approach and low component count make it attractive for some applications. And since the LVDS_33 input buffer has a relatively high input impedance, in many applications the sensor output can be directly connected to the FPGA input without the need for a preamplifier or buffer. This can be very advantageous in many systems.
Another advantage of our approach is that superposition makes it possible to “mix” several serial bitstreams and apply a single filter to recover the output signal. In array-based ultrasound systems, for example, the serial bitstreams can be time-delayed to implement a focus algorithm, and then added in vector fashion, and a single filter used to recover the digitized, focused ultrasound vector.
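Because the FIR filter is linear, filtering the delayed-and-summed bitstreams gives the same result as filtering each channel separately and then adding. The Python sketch below checks this on toy data. The streams, delays, coefficients and function names are hypothetical, and padding the delayed streams with zeros is an assumption of this sketch. Note that the summed stream is no longer 1-bit: its samples range from 0 to the number of channels.

```python
def delay(stream, d):
    """Delay a stream by d samples, padding the start with zeros."""
    return [0] * d + list(stream[:len(stream) - d])

def delay_and_sum(streams, delays):
    """Apply per-channel delays, then add the streams sample by sample."""
    delayed = [delay(s, d) for s, d in zip(streams, delays)]
    return [sum(column) for column in zip(*delayed)]

def fir(samples, coeffs):
    """Plain FIR filter: y[n] = sum_k coeffs[k] * samples[n - k]."""
    n_taps = len(coeffs)
    return [sum(coeffs[k] * samples[n - k] for k in range(n_taps))
            for n in range(n_taps - 1, len(samples))]

# Hypothetical three-channel example.
streams = [[1, 0, 1, 1, 0, 1, 0, 0],
           [0, 1, 1, 0, 1, 0, 1, 1],
           [1, 1, 0, 0, 1, 1, 0, 1]]
delays = [0, 1, 2]
coeffs = [1, -2, 3]

# One filter applied to the combined stream...
one_filter = fir(delay_and_sum(streams, delays), coeffs)
# ...versus one filter per channel, summed afterwards.
per_channel = [fir(delay(s, d), coeffs) for s, d in zip(streams, delays)]
many_filters = [sum(column) for column in zip(*per_channel)]

print(one_filter == many_filters)  # True
```

For a 128-element array, this means one large filter per focused vector instead of one per element.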
Using an FIR filter to produce the ADC output is a straightforward, brute-force approach used here primarily for illustrative purposes. In most implementations, the ADC output will be produced using the traditional integrator/lowpass filter demodulator topology.
Note: This blog post originally appeared in the latest issue of Xcell Journal, Issue 94. For the full article, see the full issue of Xcell Journal, Issue 94.
AI now completely dominates image recognition because CNNs (convolutional neural networks) outperform competing machine implementations. They even outperform human image recognition at this point. The basic CNN algorithm requires a lot of computation and data reuse, well-matched to FPGA implementations. Last month, Ralph Wittig (a Distinguished Engineer in the Xilinx CTO Office) gave a 20-minute presentation at the OpenPOWER Summit 2016 conference and discussed the current state of the art for CNNs along with some research results from various universities including Tsinghua University in China.
Several interesting conclusions relating to power consumption of CNN algorithm implementations arise from this research:
Here’s a video capturing Wittig’s presentation at the OpenPOWER Summit:
In this video, Wittig also notes the use of two CNN-related products previously covered in Xcell Daily: