Green Hills Software has announced that it has been selected by a US supplier of guidance and navigation equipment for commercial and military aircraft to provide its DO-178B Level A-compliant real-time multicore operating system for the supplier's next generation of equipment based on the Xilinx Zynq UltraScale+ MPSoC. The Zynq UltraScale+ MPSoC’s four 64-bit ARM Cortex-A53 processor cores will run Green Hills Software's INTEGRITY-178 Time-Variant Unified Multi Processing (tuMP) safety-critical operating system. The Green Hills INTEGRITY-178 tuMP RTOS has been shipping to aerospace and defense customers since 2010. INTEGRITY-178 tuMP supports the ARINC-653 Part 1 Supplement 4 standard (including section 2.2.1 – SMP operation) as well as the Part 2 optional features including Sampling Port Data Structures, Sampling Port Extensions, Memory Blocks, Multiple Module Schedules, and File System. It also offers advanced options such as a DO-178B Level A-compliant network stack.
Linux provides a number of mechanisms that allow you to interact with FPGA bitstreams without using complex kernel device drivers. This feature allows you to develop and test your programmable hardware using simple Linux user-space applications. This free training Webinar by Doulos will review your options and examine their pros and cons.
The concepts will be explored in the context of Xilinx Zynq SoCs and Zynq UltraScale+ MPSoCs.
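As a rough illustration of one such user-space mechanism (not necessarily one the Webinar covers), Linux lets an application memory-map a device node such as /dev/uio0 or /dev/mem and then read and write hardware registers as ordinary memory. The sketch below assumes a 4Kbyte register window and a made-up register at offset 0x10; so that it runs without hardware, the demo maps a temporary file as a stand-in for the device node.

```python
import mmap
import os
import struct
import tempfile

MAP_SIZE = 4096  # assumed size of the register window (one page)


def open_register_window(path, size=MAP_SIZE):
    """Memory-map `size` bytes of `path` for read/write access."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, size, access=mmap.ACCESS_WRITE)
    finally:
        os.close(fd)  # the mapping remains valid after the descriptor closes


def write_reg(window, offset, value):
    """Write a 32-bit little-endian value at a byte offset."""
    struct.pack_into("<I", window, offset, value & 0xFFFFFFFF)


def read_reg(window, offset):
    """Read a 32-bit little-endian value from a byte offset."""
    return struct.unpack_from("<I", window, offset)[0]


def demo():
    # A temporary file stands in for a device node such as /dev/uio0.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\x00" * MAP_SIZE)
        path = f.name
    window = open_register_window(path)
    try:
        write_reg(window, 0x10, 0xCAFEF00D)  # hypothetical register offset
        return read_reg(window, 0x10)
    finally:
        window.close()
        os.remove(path)
```

On real hardware you would pass the device node's path to open_register_window() instead of a temporary file, and the offsets would come from your design's address map.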
Doulos’ Senior Member of Technical Staff Simon Goda will present this webinar on August 4 and will moderate live Q&A throughout the broadcast. There are two Webinar broadcasts to accommodate different time zones.
Earlier this year, the University of New Hampshire’s InterOperability Laboratory (UNH-IOL) held a 25G and 50G Plugfest and everybody came to the party to test the compatibility of their implementations with each other. The long list of partiers included:
“The 25 Gigabit Ethernet Consortium is an open organization to all third parties who wish to participate as members to enable the transmission of Ethernet frames at 25 or 50 Gigabit per second (Gbps) and to promote the standardization and improvement of the interfaces for applicable products.”
From the Consortium’s press release about the plugfest:
“The testing demonstrated a high degree of multi-vendor interoperability and specification conformance.”
For its part, Xilinx tested its 10/25G High-Speed Ethernet LogiCORE IP and 40/50G High-Speed Ethernet LogiCORE Subsystem IP using the Xilinx VCU108 Eval Kit based on a Virtex UltraScale XCVU095-2FFVA2104E FPGA over copper using different cable lengths. Consortium rules do not permit me to tell you which companies interoperated with each other, but I can say that Xilinx tested against every company on the above list. I’m told that the Xilinx 25G/50G receiver “did well.”
Xilinx Virtex UltraScale VCU108 Eval Kit
Last month, I wrote about Perrone Robotic’s Autonomous Driving Platform based on the Zynq UltraScale+ MPSoC. (See “Linc the autonomous Lincoln MKZ running Perrone Robotics' MAX AI takes a drive in Detroit without puny humans’ help” and “Perrone Robotics builds [Self-Driving] Hot Rod Lincoln with its MAX platform, on a Zynq UltraScale+ MPSoC.”) That platform runs on a controller box supplied by iVeia. In the 2-minute video below, iVeia’s CTO Mike Fawcett describes the attributes of the Zynq UltraScale+ MPSoC that make it a superior implementation technology for autonomous driving platforms. The Zynq UltraScale+ MPSoC’s immense, heterogeneous computing power supplied by six ARM processors plus programmable logic and a few more programmable resources flexibly delivers the monumental amount of processing required for vehicular sensor fusion and real-time perception processing while consuming far less power and generating far less heat than competing solutions involving CPUs or GPUs.
Here’s the video:
Today, Mentor announced that it is making the Android 6.0 (Marshmallow) OS available for the Xilinx Zynq UltraScale+ MPSoC along with pre-compiled binaries for the ZCU102 Eval Kit (currently on sale for half off, or $2495). This Android implementation includes the Mentor Android 6.0 board support package (BSP) built on the Android Open Source Project. The Android software is available for immediate, no-charge download directly from the Mentor Embedded Systems Division.
You need to file a download request with Mentor to get access.
Maybe you thought that VadaTech’s AMC597 300MHz-to-6GHz Octal Versatile Wideband Transceiver, which connects four AD9371 chips over JESD204B high-speed serial interfaces with a Xilinx Kintex UltraScale KU115 FPGA (the UltraScale DSP monster with 5520 DSP48E2 slices) and three banks of DDR4 SDRAM (two 8Gbyte banks and one 4Gbyte bank for a total of 20Gbytes), was cool but you’re not developing radios. Well, VadaTech now has another way for you to get a Kintex UltraScale KU115 FPGA on an AMC module. It’s called the AMC583 FPGA Dual FMC+ Carrier and it teams the UltraScale DSP monster with an NXP (formerly Freescale) QorIQ P2040 quad-core PowerPC processor and 8Gbytes of DDR4 SDRAM in two separate banks. The QorIQ processor and the UltraScale FPGA communicate over a high-speed 4-lane PCIe interface as well as the processor’s local bus. Two on-board FMC+ sites connect to the Kintex UltraScale FPGA and permit easy expansion.
Here’s a block diagram of VadaTech’s AMC583:
VadaTech AMC583 Block Diagram
If you need high-speed analog I/O capabilities, VadaTech has also just announced the FMC250, an FMC mezzanine module with two 12-bit 2.6Gsamples/sec ADCs and one 16-bit, 12Gsamples/sec DAC.
CCIX (the “cache-coherent interconnect for accelerators,” pronounced “see-six”) is a new, high-speed, chip-to-chip I/O protocol being developed by the CCIX Consortium. It’s based on the ubiquitous PCIe protocol, which means it can leverage PCIe’s existing, low-cost hardware infrastructure, but it can go faster, a lot faster. While PCIe 4.0 (just starting to roll out) operates at a maximum rate of 16GTransfers/sec (that’s about 64Gbytes/sec bidirectionally on a 16-lane link), CCIX takes the signaling to 25GTransfers/sec, which approaches 100Gbytes/sec bidirectionally over the same 16 lanes. For compatibility, CCIX connections initialize as PCIe connections, thus maintaining PCIe protocol compatibility, but then permit a bootstrap mechanism where two connected CCIX devices can agree to stomp on the I/O accelerator pedal for a 56% speed boost using the same hardware.
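If you'd like to check those bandwidth numbers yourself, here's a short back-of-the-envelope sketch. It assumes the 128b/130b line encoding that PCIe 4.0 uses and, as a simplifying assumption on my part, applies the same encoding at the 25GTransfers/sec rate. The function name is made up for illustration.

```python
def link_bandwidth_gbytes(gtransfers_per_sec, lanes=16, encoding=128 / 130):
    """Return (per-direction, bidirectional) payload bandwidth in Gbytes/sec.

    Each transfer carries one bit per lane; the encoding factor removes
    line-coding overhead (128b/130b is assumed for both rates).
    """
    per_direction = gtransfers_per_sec * lanes * encoding / 8
    return per_direction, 2 * per_direction


pcie4 = link_bandwidth_gbytes(16)   # roughly 31.5 and 63 Gbytes/sec
ccix = link_bandwidth_gbytes(25)    # roughly 49.2 and 98.5 Gbytes/sec
speedup_percent = (25 / 16 - 1) * 100  # 56.25, the "56% speed boost"
```

The results line up with the rounded figures above: about 64Gbytes/sec bidirectionally for a 16-lane PCIe 4.0 link and just under 100Gbytes/sec for the same 16 lanes at 25GTransfers/sec.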
All of this and more is explained in a new, easy-to-read technical bulletin posted by Synopsys titled “An Introduction to CCIX.”
Synopsys is a CCIX Contributor and Xilinx is a CCIX Promoter—both members of the CCIX Consortium at different membership levels. Xilinx is intensely interested in I/O protocols like CCIX to permit ever-faster communications between fast processor arrays and even faster FPGA-based accelerators and is looking forward to the first products with CCIX interconnect sampling later this year.
For more information about CCIX, see:
I’ve written about SDRs (software-defined radios) built with Analog Devices’ AD9371 dual RF transceivers and Xilinx All Programmable devices before but never on the scale of VadaTech’s AMC597 300MHz-to-6GHz Octal Versatile Wideband Transceiver, which connects four AD9371 chips over JESD204B high-speed serial interfaces with a Xilinx Kintex UltraScale KU115 FPGA (the UltraScale DSP monster with 5520 DSP48 slices) and three banks of DDR4 SDRAM (two 8Gbyte banks and one 4Gbyte bank for a total of 20Gbytes). The whole system fits into an AMC form factor. Here’s a photo:
VadaTech AMC597 300MHz-to-6GHz Octal Versatile Wideband Transceiver
It’s essentially a solid block of raw SDR capability jammed into a compact, 55W (typ) package. This programmable powerhouse has the RF and processing capabilities you need to develop large, advanced digital radio systems using development tools from VadaTech, Analog Devices, and Xilinx. The AMC597 is compatible with Analog Devices’ design tools for AD9371; you can develop your own FPGA-based processing configuration with Xilinx’s Vivado Design Suite and System Generator for DSP; and VadaTech supplies reference designs with VHDL source code, documentation, and configuration binary files.
On July 18 (that’s one week from today), Xilinx’s Video Systems Architect Alex Luccisano will be presenting a free 1-hour Webinar on streaming media titled “Any Media Over Any Network: Streaming and Recording Solution.” He’ll be discussing key factors such as audio/video codecs, bit rates, formats, and resolutions in the development of OTT (over-the-top) and VOD (video-on-demand) boxes and live-streaming equipment. Alex will also be discussing the Xilinx Zynq UltraScale+ MPSoC EV device family, which incorporates a hardened, multi-stream AVC/HEVC simultaneous encode/decode block that supports UHD-4Kp60. That’s the kind of integration you need to develop highly differentiated pro AV and broadcast products (and any other streaming-media or recording products) that stand well above the competition.
Baidu’s FPGA Cloud Compute Server, a new high-performance computing service in Baidu’s Cloud, caps the company’s nine years of research into FPGA-accelerated computing, resulting in this announcement of widespread deployment. “FPGAs have the capability to deliver significant performance for deep learning inference, security, and other high growth data center applications,” said Liu Yang, Head of Baidu Technical Infrastructure, Co-General Manager of Baidu Cloud. “Years of research and FPGA engineering expertise at Baidu has culminated in our delivery of proven acceleration infrastructure for industry and academia.”
The Baidu FPGA Cloud Server provides a complete FPGA-based, hardware and software development environment and includes numerous design examples to help you achieve rapid development and migration while reducing development costs. Each FPGA instance in the Baidu FPGA Cloud Compute Server is a dedicated acceleration platform. FPGA resources are never shared between instances or users. The design examples cover deep learning acceleration and encryption/decryption, among others. In addition, the Baidu FPGA Cloud Server includes real-time monitoring of hardware resources, with statistics for the average length of the queue and the hardware temperature, to help users understand the acceleration hardware’s use and allow the handling of unexpected situations to reduce development risk.
You can quickly purchase one or more FPGA instances using the Baidu Cloud console in just a few minutes.
To provide this new service, Baidu developed its own FPGA accelerator card based on the Xilinx Kintex UltraScale KU115 FPGA. (That’s the DSP monster of the 20nm UltraScale FPGA family with 5520 DSP48 slices and 1.451M system logic cells.) According to Baidu, its FPGA Cloud Server can increase application speed by as much as 100x relative to CPU-based implementations.
Note: For more information about Baidu’s development of FPGA-based cloud acceleration, see “Baidu Adopts Xilinx Kintex UltraScale FPGAs to Accelerate Machine Learning Applications in the Data Center.”
BittWare’s XUPP3R PCIe card based on the Xilinx Virtex UltraScale+ VU9P FPGA has become really popular with customers. (See “BittWare’s UltraScale+ XUPP3R board and Atomic Rules IP run Intel’s DPDK over PCIe Gen3 x16 @ 150Gbps.”) That popularity has led to the inevitable question from BittWare’s customers: How about a bigger FPGA? Although physically, it’s easy to stick a bigger device on a big PCIe card, there’s an issue with heat: getting rid of it. To tackle this engineering problem, BittWare has developed an entirely new platform called “Viper” that employs computer-based thermal modeling, heat pipes, channeled airflow, and the new Xilinx “lidless” D2104 package to get heat out of the FPGA and into the cooling airstream of the PCIe card cage more efficiently. (For more information about the Xilinx lidless D2104 package, see “Mechanical and Thermal Design Guidelines for the UltraScale+ FPGA D2104 Lidless Flip-Chip Packages.”)
The first card to use the Viper platform is the BittWare XUPVV4.
BittWare’s XUPVV4 PCIe Card employs the company’s new Viper Platform with heat-pipe cooling for lidless FPGAs
Here are the specs for the BittWare XUPVV4:
You should be able to build pretty much whatever you want with this board. So, if someone comes to you and says, “you’re gonna’ need a bigger FPGA,” take a look at the BittWare XUPVV4. Plug it into a server and accelerate something today.
Last year at Embedded World 2016, a vision-guided robot based on a Xilinx Zynq UltraScale+ ZU9 MPSoC incorporated into a ZCU102 eval kit autonomously played solitaire on an Android tablet in the Xilinx booth. (See “3D Delta Printer plays Robotic Solitaire on a Touchpad under control of a Xilinx Zynq UltraScale+ MPSoC.”) This year at Embedded World 2017, an upgraded and improved version of the robot again appeared in the Xilinx booth, still playing solitaire.
In the original implementation, an HD video camera monitored the Android tablet’s screen to image the solitaire playing cards. Acceleration hardware implemented in the Zynq MPSoC’s PL (programmable logic) performed real-time preprocessing of the HD video stream including Sobel edge detection. Software running on the Zynq MPSoC’s ARM Cortex-A53 APU (Application Processing Unit) recognized the playing cards from the processed video supplied by the Zynq MPSoC’s PL and planned the solitaire game moves for the robot. The Zynq MPSoC’s dual-core ARM Cortex-R5 RPU (Real-Time Processing Unit) operating in lockstep—useful for safety-critical applications such as robotic control—operated the robotic stylus positioner, fashioned from a 3D Delta printer. The other processing sections of the Zynq UltraScale+ ZU9 MPSoC were also gainfully employed in this demo.
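For readers who haven't met Sobel edge detection, here's a minimal software sketch of the operator. It is not the demo's PL implementation, which isn't described here. It convolves a grayscale image with the two standard 3x3 Sobel kernels and, as an assumption chosen for simplicity, uses |Gx| + |Gy| as the edge strength instead of the square root of the sum of squares. Border pixels are simply left at zero.

```python
def sobel_magnitude(image):
    """Return |Gx| + |Gy| for each interior pixel of a 2-D grayscale image.

    `image` is a list of equal-length rows of integers; border pixels of the
    result are left at 0 because the 3x3 window does not fit there.
    """
    gx_kernel = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))   # horizontal gradient
    gy_kernel = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))   # vertical gradient
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0
            for dr in range(3):
                for dc in range(3):
                    pixel = image[r + dr - 1][c + dc - 1]
                    gx += gx_kernel[dr][dc] * pixel
                    gy += gy_kernel[dr][dc] * pixel
            out[r][c] = abs(gx) + abs(gy)
    return out


# A vertical black-to-white edge gives a strong response; a flat image gives none.
edge_image = [[0, 0, 255, 255] for _ in range(4)]
flat_image = [[128] * 4 for _ in range(4)]
edge_response = sobel_magnitude(edge_image)
flat_response = sobel_magnitude(flat_image)
```

Every output pixel depends only on a fixed 3x3 neighborhood and a handful of small integer multiplies, which is why this kind of preprocessing is such a natural fit for a streaming pipeline in programmable logic.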
This year a trained, 3-layer Convolutional BNN (Binary Neural Network) with 256 neurons/layer executed the playing-card recognition algorithm. The tangible results: improved accuracy and a performance boost of 11,320x! (Not to mention the offloading of the recognition task from the Zynq MPSoC’s APU.)
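Why do BNNs map so well onto programmable logic? When weights and activations are restricted to +1 and -1, a dot product reduces to an XNOR followed by a bit count. The sketch below shows that standard trick in software; it illustrates the general technique only and makes no claim about how the demo's BNN is actually built. The helper names are made up.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int, one bit per value (1 means +1)."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two n-element +1/-1 vectors given their packed bits.

    XNOR marks the positions where the signs agree; each agreement adds +1
    and each disagreement adds -1, so the result is 2 * matches - n.
    """
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * matches - n


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
reference = sum(x * y for x, y in zip(a, b))
fast = binary_dot(pack_bits(a), pack_bits(b), len(a))
```

In an FPGA, the XNOR and the bit count become a few LUTs per neuron input, with no multipliers needed.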
Here’s a new, 2-minute video explaining the new autonomous solitaire-playing demo system:
Note: For more information about BNNs and programmable logic, see:
Metamako decided that it needed more than one Xilinx UltraScale FPGA to deliver the low latency and high performance it wanted from its newest networking platform. The resulting design is a 1RU or 2RU box that houses one, two, or three Kintex UltraScale or Virtex UltraScale+ FPGAs, connected by “near-zero” latency links. The small armada of FPGAs means that the platform can run multiple networking applications in parallel—very quickly. This new networking platform allows Metamako to expand far beyond its traditional market—financial transaction networking—into other realms such as medical imaging, SDR (software-defined radio), industrial control, and telecom. The FPGAs are certainly capable of implementing tasks in all of these applications with extremely high performance.
Metamako’s Triple-FPGA Networking Platform
The Metamako platform offers an extensive range of standard networking features including data fan-out, scalable broadcast, connection monitoring, patching, tapping, time-stamping, and a deterministic port-to-FPGA latency of just 3nsec. Metamako also provides a developer’s kit with the platform with features that include:
This latest networking platform from Metamako demonstrates a key attribute of Xilinx All Programmable technology: the ability to fully differentiate a product by exploiting the any-to-any connectivity and high-speed processing capabilities of Xilinx silicon using Xilinx’s development tools. No other chip technology could provide Metamako with a comparable mix of extreme connectivity, speed, and design flexibility.
You can now download the Vivado Design Suite 2017.2 HLx editions, which include many new UltraScale+ devices:
In addition, the low-cost Spartan-7 XC7S50 FPGA has been added to the WebPack edition.
Download the latest releases of the Vivado Design Suite HLx editions here.
Think you don’t need HBM (high-bandwidth memory) in your FPGA-based designs? There was probably a time, not that long ago, when you thought you didn’t need a smartphone. Still think so? With its 460Gbytes/sec bandwidth, HBM doesn’t crash through the memory wall, it vaults you over the wall. And who needs to get over the memory wall? Anyone working with high-speed Ethernet, high-res video, and most high-performance DSP applications. Pretty much anything you’d use a Xilinx UltraScale+ All Programmable device for. Here’s a chart illustrating the problem:
Allow me to translate this chart for you: “You’re not going to get there with DDR SDRAM.”
Fortunately, there’s no longer a need for me to convince you that you need HBM. There’s an 11-page White Paper to do that job. It’s titled “Virtex UltraScale+ HBM FPGA: A Revolutionary Increase in Memory Performance.”
And, if you weren’t aware that Xilinx was adding HBM to its FPGAs, read this blog from last November: “Xilinx Virtex UltraScale+ FPGAs incorporate 32 or 64Gbits of HBM, delivers 20x more memory bandwidth than DDR.”
Anthony Collins, Harpinder Matharu, and Ehab Mohsen of Xilinx have just published an application article about the 16nm Xilinx RFSoC in Microwave Journal titled “RFSoC Integrates RF Sampling Data Converters for 5G New Radio.” Xilinx announced the RFSoC, which is based on the 16nm Xilinx Zynq UltraScale+ MPSoC, back in February (see “Xilinx announces RFSoC with 4Gsamples/sec ADCs and 6.4Gsamples/sec DACs for 5G, other apps. When we say “All Programmable,” we mean it!”). The Xcell Daily blog with that announcement has been very popular. Last week, another blog gave more details (see “Ready for a few more details about the Xilinx All Programmable RFSoC? Here you go”), and now there’s this article in Microwave Journal.
This new article gets into many specifics with respect to designing the RFSoC into systems with block diagrams and performance numbers. In particular, there’s a table showing MIMO radio designs based on the RFSoC with 37% to 51% power reductions and significant PCB real-estate savings due to the RFSoC’s integrated, multi-Gbps ADCs and DACs.
If you’re looking to glean a few more technical details about the RFSoC, this article is the latest place to go.
Cloud computing and application acceleration for a variety of workloads including big-data analytics, machine learning, video and image processing, and genomics are big data-center topics and if you’re one of those people looking for acceleration guidance, read on. If you’re looking to accelerate compute-intensive applications such as automated driving and ADAS or local video processing and sensor fusion, this blog post’s for you too. The basic problem here is that CPUs are too slow and they burn too much power. You may have one or both of these challenges. If so, you may be considering a GPU or an FPGA as an accelerator in your design.
How to choose?
Although GPUs started as graphics accelerators, primarily for gamers, a few architectural tweaks and a ton of software have made them suitable as general-purpose compute accelerators. With the right software tools, it’s not too difficult to recode and recompile a program to run on a GPU instead of a CPU. With some experience, you’ll find that GPUs are not great for every application workload. Certain computations such as sparse matrix math don’t map onto GPUs well. One big issue with GPUs is power consumption. GPUs aimed at server acceleration in a data-center environment may burn hundreds of watts.
With FPGAs, you can build any sort of compute engine you want with excellent performance/power numbers. You can optimize an FPGA-based accelerator for one task, run that task, and then reconfigure the FPGA if needed for an entirely different application. The amount of computing power you can bring to bear on a problem is scary big. A Virtex UltraScale+ VU13P FPGA can deliver 38.3 INT8 TOPS (that’s tera operations per second) and if you can binarize the application, which is possible with some neural networks, you can hit 500TOPS. That’s why you now see big data-center operators like Baidu and Amazon putting Xilinx-based FPGA accelerator cards into their server farms. That’s also why you see Xilinx offering high-level acceleration programming tools like SDAccel to help you develop compute accelerators using Xilinx All Programmable devices.
For more information about the use of Xilinx devices in such applications including a detailed look at operational efficiency, there’s a new 17-page White Paper titled “Xilinx All Programmable Devices: A Superior Platform for Compute-Intensive Systems.”
MathWorks has just published a 30-minute video titled “FPGA for DSP applications: Fixed Point Made Easy.” The video targets users of the company’s MATLAB and Simulink software tools and covers fixed-point number systems, how these numbers are represented in MATLAB and in FPGAs, quantization and quantization challenges, sources of error and minimizing these errors, how to use MathWorks’ design tools to understand these concepts, implementation of fixed-point DSP algorithms on FPGAs using MathWorks’ tools, and the advantages of the Xilinx DSP48 block—which you’ll find in all Xilinx 28nm series 7, 20nm UltraScale, and 16nm UltraScale+ devices including Zynq SoCs and Zynq UltraScale+ MPSoCs.
The video also shows the development of an FIR filter using MathWorks’ fixed-point tools as an example with some useful utilization feedback that helps you optimize your design. The video also briefly shows how you can use MathWorks’ HDL Coder tool to develop efficient, single-precision, floating-point DSP hardware for Xilinx FPGAs.
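To make the fixed-point ideas concrete, here's a small software sketch (not taken from the MathWorks video) that quantizes values to a signed 16-bit format with 15 fractional bits and runs an FIR filter using integer arithmetic only. The format, the helper names, and the 4-tap moving-average coefficients are all assumptions chosen for illustration. The products are summed in a wide accumulator and scaled back once at the end, loosely mirroring the multiply-accumulate pattern that a DSP slice implements.

```python
def to_fixed(x, frac_bits=15, word_bits=16):
    """Quantize a float to a signed fixed-point integer, saturating on overflow."""
    scaled = round(x * (1 << frac_bits))
    lowest = -(1 << (word_bits - 1))
    highest = (1 << (word_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def to_float(q, frac_bits=15):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)


def fir_fixed(samples_q, coeffs_q, frac_bits=15):
    """Integer FIR filter: accumulate full-width products, then rescale."""
    out = []
    for n in range(len(samples_q)):
        acc = 0  # Python ints do not overflow; hardware needs a wide accumulator
        for k, c in enumerate(coeffs_q):
            if n - k >= 0:
                acc += c * samples_q[n - k]
        out.append(acc >> frac_bits)  # drop the extra fractional bits
    return out


coeffs_q = [to_fixed(0.25)] * 4          # 4-tap moving average
samples_q = [to_fixed(0.5)] * 8          # a step input of 0.5
filtered = [to_float(y) for y in fir_fixed(samples_q, coeffs_q)]
quantization_error = abs(to_float(to_fixed(0.1)) - 0.1)
```

The step response ramps up to 0.5 after four samples, and the quantization error for 0.1 stays below half an LSB (2^-16), which is the bound you'd expect from rounding to 15 fractional bits.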
By Adam Taylor
We can create very responsive design solutions using Xilinx Zynq SoC or Zynq UltraScale+ MPSoC devices, which enable us to architect systems that exploit the advantages provided by both the PS (processor system) and the PL (programmable logic) in these devices. When we work with logic designs in the PL, we can optimize performance using design techniques like pipelining and other UltraFast design methods. We can see the results of our optimization techniques using simulation and Vivado implementation results.
When it comes to optimizing the software, which runs on acceleration cores instantiated in the PS, things may appear a little more opaque. However, things are not what they might appear. We can gather statistics on our accelerated code with ease using the performance analysis capabilities built into XSDK. Using performance analysis, we can examine the performance of the software we have running on the acceleration cores and we can monitor AXI performance within the PL to ensure that the software design is optimized for the application at hand.
Using performance analysis, we can examine several aspects of our running code:
For those who may not be familiar with the concept, a stall occurs when the cache does not contain the requested data, which must then be fetched from main memory. While the data is fetched, the core can continue to process different instructions using out-of-order (OOO) execution; however, the processor will eventually run out of independent instructions and will have to wait for the information it needs. This is called a stall.
We can gather these stall statistics thanks to the Performance Monitor Unit (PMU) contained within each of the Zynq UltraScale+ MPSoC’s CPUs. The PMU provides six profile counters, which are configured by and post-processed by XSDK to generate the statistics above.
If we want to use the performance monitor within XSDK, we need to work with a debug build and then open the Performance Monitor Perspective within XSDK. If we have not done so before, we can open the perspective as shown below:
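To see why access patterns matter so much for stalls, consider this toy model of a direct-mapped cache. It is purely illustrative: the 64-line, 32-byte-per-line geometry is an assumption, not the geometry of any Zynq cache, and real caches are set-associative. The model simply counts how many accesses miss, and each miss is a potential stall.

```python
def count_misses(addresses, num_lines=64, line_bytes=32):
    """Count cache misses for a byte-address trace in a direct-mapped cache."""
    resident = [None] * num_lines  # block number currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes
        index = block % num_lines
        if resident[index] != block:
            resident[index] = block  # fetch the block from main memory
            misses += 1
    return misses


# 1024 accesses each: consecutive 4-byte words versus a 2048-byte stride.
sequential = [4 * i for i in range(1024)]
strided = [2048 * i for i in range(1024)]

sequential_misses = count_misses(sequential)
strided_misses = count_misses(strided)
```

With sequential accesses, only the first touch of each 32-byte line misses. With a stride equal to the cache size, every access lands on the same line and evicts the previous block, so every access misses.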
Opening the Performance Analysis Perspective
With the performance analysis perspective open, we can debug the application as normal. However, before we click on the run icon (the debugger should be set to stop at main, as default), we need to start the performance monitor. To do that, right click on the “System Debugger on Local” symbol within the performance monitor window and click start.
Starting the Performance Analysis
Then, once we execute the program, the statistics will be gathered and we can analyse them within XSDK to determine the best optimizations for our code.
To demonstrate how we can use this technique to deliver a more optimized system, I have created a design that runs on the ZedBoard and performs AES256 Encryption on 1024 packets of information. When this code was run on the ZedBoard, the following execution statistics were collected:
So far, these performance statistics only look at code executing on the PS itself. Next time, we will look at how we can use the AXI Performance Monitor with XSDK. If we wish to do this, we need to first instrument the design in Vivado.
Code is available on Github as always.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
Linc, Perrone Robotics’ autonomous Lincoln MKZ automobile, took a drive around the Perrone paddock at the TU Automotive autonomous vehicle show in Detroit last week and Dan Isaacs, Xilinx’s Director Connected Systems in Corporate Marketing, was there to shoot photos and video. Perrone’s Linc test vehicle operates autonomously using the company’s MAX (Mobile Autonomous X), a “comprehensive full-stack, modular, real-time capable, customizable, robotics software platform for autonomous (self-driving) vehicles and general purpose robotics.” MAX runs on multiple computing platforms including one based on an iVeia controller, which is based on an iVeia Atlas SOM, which in turn is based on a Xilinx Zynq UltraScale+ MPSoC. The Zynq UltraScale+ MPSoC handles the avalanche of data streaming from the vehicle’s many sensors to ensure that the car travels the appropriate path and avoids hitting things like people, walls and fences, and other vehicles. That’s all pretty important when the car is driving itself in public. (For more information about Perrone Robotics’ MAX, see “Perrone Robotics builds [Self-Driving] Hot Rod Lincoln with its MAX platform, on a Zynq UltraScale+ MPSoC.”)
Here’s a photo of Perrone’s sensored-up Linc autonomous automobile in the Perrone Robotics paddock at TU Automotive in Detroit:
And here’s a photo of the iVeia control box with the Zynq UltraScale+ MPSoC inside, running Perrone’s MAX autonomous-driving software platform. (Note the controller’s small size and lack of a cooling fan):
Opinions about the feasibility of autonomous vehicles are one thing. Seeing the Lincoln MKZ’s 3800 pounds of glass, steel, rubber, and plastic being controlled entirely by a little silver box in the trunk, that’s something entirely different. So here’s the video that shows Perrone Robotics’ Linc in action, driving around the relative safety of the paddock while avoiding the fences, pedestrians, and other vehicles:
If you’re designing next-generation avionics systems, you may be facing some challenges:
Do these sound like your challenges? Want some help? Check out this June 20 Webinar.
When someone asks where Xilinx All Programmable devices are used, I find it a hard question to answer because there’s such a very wide range of applications—as demonstrated by the thousands of Xcell Daily blog posts I’ve written over the past several years.
Now, there’s a 5-minute “Powered by Xilinx” video with clips from several companies using Xilinx devices for applications including:
That’s a huge range covered in just five minutes.
Here’s the video:
Signal Integrity Journal just published a new article titled “Addressing the 5G Challenge with Highly Integrated RFSoC,” written by four Xilinx authors. The article discusses some potential uses for Xilinx RFSoC technology, announced in February. (See “Xilinx announces RFSoC with 4Gsamples/sec ADCs and 6.4Gsamples/sec DACs for 5G, other apps. When we say “All Programmable,” we mean it!”)
Cutting to the chase of this 2600-word article, the Xilinx RFSoC is going to save you a ton of power and make it easier for you to achieve your performance goals for 5G and many other advanced, mixed-signal system designs.
If you’re involved in the design of a system like that, you really should read the article.
Light Reading’s International Group Editor Ray Le Maistre recently interviewed David Levi, CEO of Ethernity Networks, who discusses the company’s FPGA-based All Programmable ACE-NIC, a Network Interface Controller with 40Gbps throughput. The carrier-grade ACE-NIC accelerates vEPC (virtual Evolved Packet Core, a framework for virtualizing the functions required to converge voice and data on 4G LTE networks) and vCPE (virtual Customer Premise Equipment, a way to deliver routing, firewall security and virtual private network connectivity services using software rather than dedicated hardware) applications by 50x, dramatically reducing end-to-end latency associated with NFV platforms. Ethernity’s ACE-NIC is based on a Xilinx Kintex-7 FPGA.
“The world is crazy about our solution—it’s amazing,” says Levi in the Light Reading video interview.
Ethernity Networks All Programmable ACE-NIC
Because Ethernity implements its NIC IP in a Kintex-7 FPGA, it was natural for Le Maistre to ask Levi when his company would migrate to an ASIC. Levi’s answer surprised him:
“We offer a game changer... We invested in technology—which is covered by patents—that consumes 80% less logic than competitors. So essentially, a solution that you may want to deliver without our patents will cost five times more on FPGA… With this kind of solution, we succeed over the years in competing with off-the-shelf components… with the all-programmable NIC, operators enjoy the full programmability and flexibility at an affordable price, which is comparable to a rigid, non-programmable ASIC solution.”
In other words, Ethernity plans to stay with All Programmable devices for its products. In fact, Ethernity Networks announced last year that it had successfully synthesized its carrier-grade switch/router IP for the Xilinx Zynq UltraScale+ MPSoC and that the throughput performance increases to 60Gbps per IP core with the 16nm device, and to 120Gbps with two instances of that core. “We are going to use this solution for novel SDN/NFV market products, including embedded SR-IOV (single-root input/output virtualization), and for high density port solutions,” said Levi.
Towards the end of the video interview, Levi looks even further into the future when he discusses Amazon Web Services’ (AWS’) recent support of FPGA acceleration. (That’s the Amazon EC2 F1 compute instance based on Xilinx Virtex UltraScale+ FPGAs rolled out earlier this year.) Because it’s already based on Xilinx All Programmable devices, Ethernity’s networking IP runs on the Amazon EC2 F1 instance. “It’s an amazing opportunity for the company [Ethernity],” said Levi. (Try doing that in an ASIC.)
Here’s the Light Reading video interview:
When discussed in Xcell Daily two years ago, Exablaze’s 48-port ExaLINK Fusion Ultra Low Latency Switch and Application Platform with the company’s FastMUX option was performing fast Ethernet port aggregation on as many as 15 Ethernet ports with blazingly fast 100nsec latency. (See “World’s fastest Layer 2 Ethernet switch achieves 110nsec switching using 20nm Xilinx UltraScale FPGAs.”) With its new FastMUX upgrade, also available free to existing customers with a current support contract as a field-installable firmware upgrade, Exablaze has now cut that number in half, to an industry-leading 49nsec (actually, between 48.79nsec and 58.79nsec). The FastMUX option aggregates 15 server connections into a single upstream port. All 48 ExaLINK Fusion ports including the FastMUX ports are cross-point enabled so that they can support layer 1 features such as tapping for logging, patching for failover, and packet counters and signal quality statistics for monitoring.
The ExaLINK Fusion platform is based on a Xilinx 20nm UltraScale FPGA, which gave Exablaze the ability to initially create the fast switching and fast aggregation hardware and massive 48-port connectivity and then to improve the product’s design by taking advantage of the FPGA’s reprogrammability, which simply requires a firmware upgrade that can be performed in the field.
Perhaps you think DPDK (Data Plane Development Kit) is a high-speed data-movement standard that’s strictly for networking applications. Perhaps you think DPDK is an Intel-specific specification. Perhaps you think DPDK is restricted to the world of host CPUs and ASICs. Perhaps you’ve never heard of DPDK—given its history, that’s certainly possible. If any of those statements is correct, keep reading this post.
Originally, DPDK was a set of data-plane libraries and NIC (network interface controller) drivers developed by Intel for fast packet processing on Intel x86 microprocessors. That is the DPDK origin story. Last April, DPDK became a Linux Foundation Project. It lives at DPDK.org and is now processor agnostic.
DPDK consists of several main libraries that you can use to:
So far, DPDK certainly sounds like a networking-specific development kit but, as Atomic Rules’ CTO Shep Siegel says, “If you can make your data-movement problem look like a packet-movement problem,” then DPDK might be a helpful shortcut in your development process.
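To make Siegel’s point concrete, here is a minimal Python sketch of what “making a data-movement problem look like a packet-movement problem” can mean: an arbitrary byte buffer is cut into sequence-numbered packets and later reassembled. This is an illustration only and does not use the DPDK API. The header layout, `MAX_PAYLOAD`, and the function names `packetize` and `reassemble` are all hypothetical.

```python
import struct

# Hypothetical 6-byte header: 32-bit sequence number, 16-bit payload length.
HEADER = struct.Struct("<IH")
MAX_PAYLOAD = 1024  # hypothetical per-packet payload size in bytes


def packetize(data: bytes, max_payload: int = MAX_PAYLOAD) -> list:
    """Split a byte buffer into header-prefixed packets."""
    packets = []
    for seq, offset in enumerate(range(0, len(data), max_payload)):
        payload = data[offset:offset + max_payload]
        packets.append(HEADER.pack(seq, len(payload)) + payload)
    return packets


def reassemble(packets: list) -> bytes:
    """Rebuild the original buffer, tolerating out-of-order arrival."""
    chunks = {}
    for pkt in packets:
        seq, length = HEADER.unpack_from(pkt)
        chunks[seq] = pkt[HEADER.size:HEADER.size + length]
    return b"".join(chunks[seq] for seq in sorted(chunks))


if __name__ == "__main__":
    blob = bytes(range(256)) * 10          # 2560 bytes of sample data
    pkts = packetize(blob)
    print(len(pkts), "packets;", "round trip ok:", reassemble(pkts) == blob)
```

Once the data is framed this way, any packet-oriented transport can carry it, which is the sense in which a packet-movement toolkit might apply to a non-networking workload.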
Siegel knows more than a bit about DPDK because his company has just released Arkville, a DPDK-aware FPGA/GPP data-mover IP block and DPDK PMD (Poll Mode Driver) that allow Linux DPDK applications to offload server cycles to FPGA gates in tandem with the Linux Foundation’s 17.05 release of the open-source DPDK libraries. Atomic Rules’ Arkville release is compatible with Xilinx Vivado 2017.1 (the latest version of the Vivado Design Suite), which was released in April. Currently, Atomic Rules provides two sample designs:
(Atomic Rules’ example designs for Arkville were compiled with Vivado 2017.1 as well.)
These examples are data movers; Arkville is a packet conduit. This conduit presents a DPDK interface on the CPU side and AXI interfaces on the FPGA side. There’s a convenient spot in the Arkville conduit where you can add your own hardware for processing those packets. That’s where the CPU offloading magic happens.
Atomic Rules’ Arkville IP works well with all Xilinx UltraScale devices but it works especially well with Xilinx UltraScale+ All Programmable devices that provide two integrated PCIe Gen3 x16 controllers. (That includes devices in the Kintex UltraScale+ and Virtex UltraScale+ FPGA families and the Zynq UltraScale+ MPSoC device families.)
The second controller matters because, as BittWare’s VP of Network Products Craig Lund says, “100G Ethernet is hard. It’s not clear that you can use PCIe to get [that bit rate] into a server [using one PCIe Gen3 x16 interface]. From the PCIe specs, it looks like it should be easy, but it isn’t.” If you are handling minimum-size packets, says Lund, there are lots of them: more than 14 million per second. If you’re handling big packets, then you need a lot of bandwidth. Either use case presents a throughput challenge to a single PCIe Root Complex. In practice, you really need two.
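A quick back-of-the-envelope calculation shows why the specs make this look easy. The Python sketch below uses only published Ethernet and PCIe Gen3 constants (64-byte minimum frame, 20 bytes of preamble and inter-frame gap, 8GT/s per lane, 128b/130b encoding). The function names are hypothetical, and the PCIe figure is raw link bandwidth before any packet-header, descriptor, or flow-control overhead, so treat it as an upper bound rather than a measurement.

```python
ETH_LINE_RATE_BPS = 100e9        # 100G Ethernet line rate
MIN_FRAME_BYTES = 64             # minimum Ethernet frame, including FCS
WIRE_OVERHEAD_BYTES = 20         # 7B preamble + 1B SFD + 12B inter-frame gap

PCIE_GEN3_TRANSFERS_PER_LANE = 8e9   # 8 GT/s per lane
PCIE_GEN3_ENCODING = 128 / 130       # 128b/130b line encoding


def min_size_packets_per_second(line_rate_bps: float) -> float:
    """Packets per second when every frame is the Ethernet minimum size."""
    bits_on_wire = (MIN_FRAME_BYTES + WIRE_OVERHEAD_BYTES) * 8
    return line_rate_bps / bits_on_wire


def pcie_gen3_raw_bps(lanes: int) -> float:
    """Raw per-direction link bandwidth, before protocol overhead."""
    return PCIE_GEN3_TRANSFERS_PER_LANE * PCIE_GEN3_ENCODING * lanes


if __name__ == "__main__":
    pps = min_size_packets_per_second(ETH_LINE_RATE_BPS)
    raw = pcie_gen3_raw_bps(16)
    print(f"min-size packets/sec at 100G: {pps / 1e6:.1f} million")
    print(f"PCIe Gen3 x16 raw bandwidth:  {raw / 1e9:.1f} Gbps")
    print(f"raw headroom over 100G:       {(raw / ETH_LINE_RATE_BPS - 1) * 100:.0f}%")
```

On paper, one Gen3 x16 link has roughly a quarter more raw bandwidth than 100G Ethernet needs. That margin has to absorb all of the PCIe protocol overhead and per-packet bookkeeping, which is consistent with Lund’s remark that it only looks easy from the specs.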
BittWare has implemented products using the Atomic Rules Arkville IP, based on its XUPP3R PCIe card, which incorporates a Xilinx Virtex UltraScale+ VU13P FPGA. One of the many unique features of this BittWare board is that it has two PCIe Gen3 x16 ports: one available on an edge connector and the other available on an optional serial expansion port. This second PCIe Gen3 x16 port can be connected to a second PCIe slot for added bandwidth.
However, even that’s not enough, says Lund. You don’t just need two PCIe Gen3 x16 slots; you need two PCIe Gen3 Root Complexes and that means you need a 2-socket motherboard with two physical CPUs to handle the traffic. Here’s a simplified block diagram that illustrates Lund’s point:
BittWare’s XUPP3R PCIe Card has two PCIe Gen3 x16 ports: one on an edge connector and the other on an optional serial expansion port for added bandwidth
BittWare has used its XUPP3R PCIe card and the Arkville IP to develop two additional products:
Note: For more information about Atomic Rules’ IP and BittWare’s XUPP3R PCIe card, see “BittWare’s UltraScale+ XUPP3R board and Atomic Rules IP run Intel’s DPDK over PCIe Gen3 x16 @ 150Gbps.”
Arkville is a product offered by Atomic Rules. The XUPP3R PCIe card is a product offered by BittWare. Please contact these vendors directly for more information about these products.
By Adam Taylor
So far, our examination of the Zynq UltraScale+ MPSoC has focused mainly upon the PS (processing system) side of the device. However, to fully utilize the device’s capabilities we also need to examine the PL (programmable logic) side. So in this blog, we will look at the different AXI interfaces between the PS and the PL.
Zynq MPSoC Interconnect Structure
These different AXI interfaces provide a mixture of master and slave ports from the PS perspective and they can be coherent or not. The PS is the master for the following interfaces:
For the remaining interfaces the PL is the master:
The ACE and ACP interfaces have a fixed data width; all of the other interfaces have a selectable data width of 32, 64, or 128 bits.
To support the different power domains within the Zynq MPSoC, each of the master interfaces within the PS is provided with an AXI isolation block that isolates the interface should a power domain be powered down. To protect the APU and RPU from hanging while performing an AXI access, each PS master interface also has an AXI timeout block to recover from any incorrect AXI interactions, for example, if the PL is not powered or configured.
We can use these interfaces simply within our Vivado design, where we can enable, disable, and configure the desired interface.
Once you have enabled and configured the desired interfaces, you can connect them into your design in the PL. Within the simple example in this blog post, we are going to transfer data to and from a BRAM located within the PL.
This example uses the AXI master connected to the low-power domain (LPD). Both the APU and the RPU can address the BRAM via this interface thanks to the SMMU, the Central Switch, and the Low Power Switch. In addition, using the LPD AXI interconnect allows the RPU to access the PL even if the FPD (full-power domain) is powered down. Of course, it does increase complexity when using the APU.
This simple example performs the following steps:
Program Starting to read addresses for part 1
Data written to the first 256 BRAM addresses
Data read back to confirm the write
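The write-then-read-back check can be sketched in a few lines. The Python below is a minimal illustration and not the author’s code: it uses a `bytearray` as a stand-in for the memory-mapped BRAM window so that it runs anywhere, and the pattern written (each word holds its own index) and the names `write_pattern` and `verify_pattern` are assumptions made for the example. On real hardware the buffer would instead be the mapped AXI address range of the BRAM controller.

```python
import struct

BRAM_WORDS = 256   # number of 32-bit locations exercised, as in the example
WORD_BYTES = 4


def write_pattern(mem, words: int = BRAM_WORDS) -> None:
    """Write a known value (the word index) to each 32-bit location."""
    for i in range(words):
        struct.pack_into("<I", mem, i * WORD_BYTES, i)


def verify_pattern(mem, words: int = BRAM_WORDS) -> list:
    """Read each word back; return the indices whose contents differ."""
    return [
        i for i in range(words)
        if struct.unpack_from("<I", mem, i * WORD_BYTES)[0] != i
    ]


if __name__ == "__main__":
    # Stand-in for the BRAM: a plain buffer instead of a hardware mapping.
    bram = bytearray(BRAM_WORDS * WORD_BYTES)
    write_pattern(bram)
    mismatches = verify_pattern(bram)
    print("read-back ok" if not mismatches else f"mismatches at {mismatches}")
```

Returning the list of mismatching indices, rather than a simple pass/fail flag, makes it easier to spot address-decoding or data-width problems in the PL design.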
The key element in our designs is selecting the correct AXI interface for the application and data transfers at hand and ensuring that we are getting the best possible performance from the interconnect. Next time we will look at the quality of service and the AXI performance monitor.
Code is available on GitHub as always.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
“My Pappy said
Son, you’re gonna
Drive me to drinkin’
If you don’t stop drivin’
That Hot Rod Lincoln” — Commander Cody & His Lost Planet Airmen
In other words, you need an autonomous vehicle.
For the last 14 years, Perrone Robotics has focused on creating platforms that allow vehicle manufacturers to quickly integrate a variety of sensors and control algorithms into a self-driving vehicle. The company’s MAX (Mobile Autonomous X) platform is a “comprehensive full-stack, modular, real-time capable, customizable, robotics software platform for autonomous (self-driving) vehicles and general purpose robotics.”
Sensors for autonomous vehicles include cameras, lidar, radar, ultrasound, and GPS. All of these sensors generate a lot of data—about 1Mbyte/sec for the Perrone test platform. Designers need to break up all of the processing required for these sensors into tasks that can be distributed to multiple processors and then fuse the processed sensor data (sensor fusion) to achieve real-time, deterministic performance. For the most demanding tasks, software-based processing won’t deliver sufficiently quick response.
Self-driving systems must make as many as 100 decisions/sec based on real-time sensor data. You never know what will come at you.
According to Perrone’s Chief Revenue Officer Dave Hofert, the Xilinx Zynq UltraScale+ MPSoC with its multiple ARM Cortex-A53 and -R5 processors and programmable logic can handle all of these critical tasks and provides a “solution that scales,” with enough processing power to bring in machine learning as well.
Here’s a brand new, 3-minute video with more detail and a lot of views showing a Perrone-equipped Lincoln driving very carefully all by itself:
For more detailed information about Perrone Robotics, see this new feature story from an NBC TV affiliate.
Last week, I wrote about National Instruments’ new PXIe-7915 FlexRIO PXIe module, based on three Xilinx Kintex UltraScale FPGAs. (See “NI’s new FlexRIO module based on Kintex UltraScale FPGAs serves as platform for new modular instruments.”) The PXIe-7915 was featured on the NI Week Keynote stage in a demo where it analyzed the output of a 2nd-generation NI PXIe-5840 Vector Signal Transceiver (a VST 2, powered by a Xilinx Virtex-7 690T FPGA). The real-time analysis requires the computation of 2.4M FFTs/sec, performed by the DSP slices in the Kintex UltraScale FPGA.
Here’s a short, 63-second video of Rob Bauer, a Product Manager at National Instruments, demonstrating the new PXIe-7915 FlexRIO module in action: