Dave Jones regularly tears down equipment on his EEVblog videos and discusses how these products have been engineered as he disassembles them. His latest, just-posted teardown focuses on the beautifully engineered Rigol DSG815 RF Signal Generator, on loan from Emona Instruments in Australia, and even though Dave is only looking for “RF goodness,” his teardown unearths a Xilinx Spartan-6 LX25 FPGA buried in the RF heart of the instrument (pointed out at 13:09 in the video and discussed later):
Mellanox announced its ConnectX-4 Lx EN Programmable Adapter Card—a PCIe card that integrates Mellanox’s best-in-class Programmable network controller ASIC with a Xilinx All Programmable Kintex UltraScale FPGA—at SC15 in Austin last week. The addition of the Xilinx FPGA gives the card additional, heavy-duty, on-board processing capabilities beyond those already built into the Mellanox ASIC. The on-board FPGA (a Kintex UltraScale KU040 or KU060) serves as a “bump-on-the-wire” programmable processing engine for developing a wide range of networking applications including high-frequency trading, deep-packet inspection, compression/decompression, storage, and security. High-speed Ethernet communications go through the on-board Mellanox ASIC, which handles the network communications protocols and includes native hardware support for low-latency RDMA over Converged Ethernet (RoCE), Ethernet stateless offload engines, PeerDirect communication acceleration, and Mellanox’s Multi-Host Technology. The Mellanox ASIC communicates with the board’s single QSFP optical-port cage and the Kintex UltraScale FPGA over 40GbE links.
Mellanox ConnectX-4 Lx EN 10/40GbE Programmable Adapter Card with one QSFP port and a Kintex UltraScale FPGA for local, on-board processing
The ConnectX-4 Lx EN Programmable adapter card includes FPGA Board Support Packages (BSP), providing programmers with the necessary infrastructure to implement their own applications and reduce time to market. Moving applications from the host CPU to the Xilinx FPGA on the ConnectX-4 Lx EN Programmable Adapter Card while taking advantage of the Mellanox network controller ASIC’s hardware-based application acceleration drops CPU utilization and makes the CPU available for additional high-level functions.
The following demo by Micron’s Ryan Laity, HMC Applications Engineering Manager, at SC15 in the Micron booth shows a Sidewinder Probe monitoring real-time traffic into and out of a Micron HMC (Hybrid Memory Cube) mounted on a Xilinx Virtex UltraScale VCU110 Dev Kit. The HMC is an extremely fast DRAM built using 3D assembly techniques.
If you’re unfamiliar with the Micron HMC, it’s a completely new DRAM memory architecture that exposes the parallelism inherent in DRAM chips. Each HMC memory die is divided into 16 slices with 16 independent I/O ports on each DRAM die. A stack of these HMC DRAM die communicate vertically through the stack to a logic device that serves as the base for the stack, the DRAM controller, and the communications manager for the HMC 3D assembly. A 4-DRAM chip stack gives the underlying logic chip parallel, simultaneous access to 64 DRAM slices, which creates a ton of raw memory bandwidth. The logic chip communicates with the host system using multiple high-speed serial links. HMCs can have either four or eight 16-lane links that operate as fast as 15Gbps per lane. A 4-link connection, which would have a total of 64 lanes, has a maximum aggregate communications bandwidth of 160Gbytes/sec.
As a result, an HMC can deliver 15x the bandwidth of a DDR3 SDRAM module while consuming 70% less energy per transmitted bit.
The first challenge to address when developing an HMC-based system is physically communicating with the HMC at rated speed. All of those 15Gbps serial connections sort of cry out for connection to a bunch of bullet-proof SerDes ports on a Xilinx All Programmable device. In the case of this demo, the Xilinx UltraScale VCU110 Dev Kit sports a Virtex UltraScale VU190 FPGA, which has sixty 16.3Gbps GTH and sixty 30.5Gbps GTY SerDes transceivers, so it’s more than capable of meeting the HMC’s serial-bandwidth needs.
Beyond the actual high-speed physical connection between the FPGA and the HMC, you also need an appropriate HMC controller instantiated into the FPGA. In the Micron demo at SC15, the Xilinx UltraScale VU190 FPGA contains a Xilinx-designed HMC controller IP block. (Contact your friendly neighborhood Xilinx salesperson or FAE for more information about this IP.)
In addition, the demo video shows a soon-to-be-released Micron Sidewinder, in the little blue box. The Sidewinder shows packet-data throughput on the several internal and external HMC data links in real time. It looks to be a really useful tool when developing applications in conjunction with HMC memory technology.
This was inevitable. Someone was bound to offer cloud-based access to FPGA hardware acceleration.
Bitfusion develops software, designs hardware, and creates data centers dedicated to accelerating cloud-based computing (in partnership with Rackspace). As the concept applies to FPGA-based hardware acceleration, Bitfusion is developing hardware accelerators based on Xilinx Kintex UltraScale devices and will provide cloud-based development tools based on Xilinx tools including SDAccel that allow customers to access hardware-based acceleration as a service for HPC (high-performance computing).
I spoke with Subbu Rama last week in his company’s booth at SC15 in Austin. Rama is CEO and a founder of Bitfusion. His vision is to allow anyone with a laptop to solve HPC-class problems by providing easy, on-demand access to a variety of integrated cloud-based tools including FPGA-based hardware acceleration. Bitfusion charges you for the number of kernel function calls and the kernels’ execution time, and not development time.
Your preferred language should be in there somewhere.
Bitfusion has also developed several example applications including:
Here’s a TechCrunch Disrupt video from May with a more detailed explanation of Bitfusion’s idea:
ArrayFire specializes in OpenCL libraries and has just started to dip its toe into the world of FPGA-based hardware acceleration. The company demonstrated real-time video feature detection running on a Xilinx FPGA and programmed with Xilinx’s SDAccel development environment last week at SC15 in Austin, Texas.
Here’s a still image from the demo:
ArrayFire FPGA-based Video Feature Detection Demo at SC15
On the left, the split-screen image shows a frame from a video stream. The same frame with detected corners highlighted appears on the right. Corner detection was taking place in real time.
For more information, see the November 17 blog post on the ArrayFire Web site.
One ClusterTech Lightning Image Processor implemented on a Xilinx Kintex-7 325T FPGA replaces 10 CPU servers in on-the-fly JPEG image processing applications. The FPGA consumes 15W, about 10% of the CPU power consumption according to ClusterTech. The company was demonstrating its Lightning Image Processor at SC15 in Austin last week and I shot a photo of the processor in action.
ClusterTech Lightning Image Processor in action at SC15
The image shows a split-screen display. On the left, you see 140 scaled images. These images were reduced to thumbnails in real time by ClusterTech’s Lightning Image Processor. On the right, you see that a CPU running the same image-scaling algorithm has completed 16 image thumbnails in the same amount of time. That’s roughly one-ninth the throughput, and by ClusterTech’s figures the CPU draws about 10x the power.
This is a great demonstration of the kinds of performance improvements and power reductions you can achieve in data-center applications using FPGA-based hardware acceleration.
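The ClusterTech numbers quoted above can be combined into a rough throughput-per-watt comparison. The sketch below is only a back-of-the-envelope check: it assumes that the "about 10%" power figure applies to the same CPU shown in the split-screen demo, which the post does not state explicitly, and the function name is invented for illustration.

```python
# Figures quoted in the post for the ClusterTech demo.
FPGA_THUMBNAILS = 140       # thumbnails finished by the FPGA
CPU_THUMBNAILS = 16         # thumbnails finished by the CPU in the same time
FPGA_POWER_W = 15.0         # FPGA power quoted by ClusterTech
CPU_POWER_MULTIPLE = 10.0   # assumption: "about 10%" means the CPU draws ~10x


def demo_ratios():
    """Return (throughput ratio, implied CPU watts, throughput-per-watt ratio)."""
    throughput_ratio = FPGA_THUMBNAILS / CPU_THUMBNAILS
    cpu_power_w = FPGA_POWER_W * CPU_POWER_MULTIPLE
    # Per-watt advantage is the throughput advantage times the power advantage.
    per_watt_ratio = throughput_ratio * CPU_POWER_MULTIPLE
    return throughput_ratio, cpu_power_w, per_watt_ratio


if __name__ == "__main__":
    speed, cpu_w, per_watt = demo_ratios()
    print(f"FPGA throughput advantage: {speed:.2f}x")
    print(f"Implied CPU power: {cpu_w:.0f} W")
    print(f"FPGA throughput-per-watt advantage: {per_watt:.1f}x")
```

Under these assumptions the throughput advantage and the power advantage multiply, which is why the per-watt figure is so much larger than either one alone.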
ClusterTech also had data sheets for two more FPGA-based hardware accelerators: an Accelerated-RAID Erasure Code Generator (erasure coding is a high-speed replacement for other, older RAID data-protection algorithms) and an Accelerated Data Compressor that delivers the compression throughput of “40 x86 CPU cores.” Both of these ClusterTech accelerators are implemented with Xilinx Kintex UltraScale FPGAs—a Kintex UltraScale KU060 for the Accelerated-RAID Erasure Code Generator and a Kintex UltraScale KU115 for the Accelerated Data Compressor. Contact ClusterTech for more info about these hardware accelerators.
My prowl for groundbreaking cloud-computing apps on the SC15 show floor last week brought me to Ryft’s booth where the company’s new Ryft ONE FPGA-accelerated data analytics platform was prominently on display. The Ryft ONE is a 1U box designed to slide right into a server rack. It melds a familiar x86 front end with 48Tbytes of SSD storage and a high-speed parallel processing array—dubbed the Ryft Analytics Cortex—consisting of 11 FPGAs ready to be loaded with Ryft Algorithm Primitives that implement tasks including search, fuzzy search, and term frequency. Ryft ONE can execute searches and fuzzy searches at the rate of 10Gbytes/sec using standard cloud-friendly APIs.
The secret to the Ryft ONE’s analytic processing speed is that the box doesn’t index the data. Data streams into the Ryft ONE at 10Gbytes/sec, feeds into the SSDs, and then passes into the Ryft Analytics Cortex—which treats the data (all of the data) as a big, searchable bit string. Consequently, data type and data organization do not matter. It’s all just one long string of searchable bits to the Ryft ONE.
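To make the index-less idea concrete, here is a minimal software sketch of a streaming fuzzy search. It is not Ryft's implementation or API: the function names are invented, the data is treated as bytes rather than bits, and "fuzzy" is assumed to mean a bounded Hamming distance. The point is simply that every window of the incoming stream is compared against the pattern and no index is ever built.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of positions at which two equal-length byte strings differ."""
    return sum(x != y for x, y in zip(a, b))


def fuzzy_scan(stream, pattern: bytes, max_distance: int):
    """Yield (offset, match) for each window within max_distance of pattern.

    `stream` is any iterable of bytes chunks. No index is built or consulted;
    every window of the data is examined exactly once.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    n = len(pattern)
    window = b""   # unprocessed tail carried over between chunks
    base = 0       # absolute stream offset of window[0]
    for chunk in stream:
        window += chunk
        last_start = len(window) - n
        for i in range(last_start + 1):
            candidate = window[i:i + n]
            if hamming(candidate, pattern) <= max_distance:
                yield base + i, candidate
        # Keep only the tail that could still begin a match in the next chunk.
        keep_from = max(last_start + 1, 0)
        base += keep_from
        window = window[keep_from:]


if __name__ == "__main__":
    chunks = [b"the quick br", b"own fox jumps over the brawn dog"]
    for offset, match in fuzzy_scan(chunks, b"brown", max_distance=1):
        print(offset, match)
```

In software this brute-force scan costs time proportional to the data size multiplied by the pattern length. That kind of regular, data-parallel comparison is the sort of work that maps well onto an array of FPGAs.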
As much as I don’t like using the overused phrase, index-less search is truly a paradigm shift for data analytics in the data center. It’s a fast paradigm shift too. A Ryft ONE delivers a 100x performance advantage over Apache Spark for specific streaming analytic jobs.
Oh, and those FPGAs in the Ryft ONE? They’re from Xilinx of course. But then you probably guessed that already.
Note: Ryft has a detailed White Paper about the Ryft ONE written by The Bloor Group. It’s titled “A Ryft in the Market.” Contact Ryft for more info. These guys know what they’re doing. They’re located just outside of Washington, DC. Get my dryft?
By Adam Taylor
When we last looked at creating our own SDSoC platform, we’d created a platform, built a simple software program, and checked that the Zynq SoC’s XADC was present as expected in the resultant utilization report. We did not attempt to accelerate any functions using the Zynq SoC’s PL (programmable logic) and we did not try to use the XADC within a software application. In this blog post, we will now show that our platform is capable of doing both by combining the previous AES encryption example with the XADC platform.
My aim is to read from the XADC, encrypt the XADC data using the AES function, and then accelerate the AES function in the PL. As this is a pipe-cleaning example, we will only encrypt 16 samples from the XADC and we will scale the 12-bit XADC value to 8 bits for the AES encryption.
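The data preparation described above can be modeled in a few lines. This is a host-side sketch in Python, not the Zynq application code: the function names are hypothetical, each input is assumed to be a plain 12-bit value (0 to 4095), and scaling is assumed to mean dropping the four least-significant bits. One convenient property is that sixteen 8-bit samples fill exactly one 128-bit AES block.

```python
ADC_BITS = 12          # XADC resolution stated in the post
AES_BLOCK_BYTES = 16   # AES operates on 128-bit (16-byte) blocks


def scale_12_to_8(sample: int) -> int:
    """Scale a 12-bit sample to 8 bits by keeping the 8 most-significant bits."""
    if not 0 <= sample < (1 << ADC_BITS):
        raise ValueError("sample is not a 12-bit value")
    return sample >> (ADC_BITS - 8)


def pack_block(samples) -> bytes:
    """Pack sixteen 12-bit samples into one 16-byte block ready for AES."""
    samples = list(samples)
    if len(samples) != AES_BLOCK_BYTES:
        raise ValueError("expected exactly 16 samples")
    return bytes(scale_12_to_8(s) for s in samples)


if __name__ == "__main__":
    # Hypothetical readings; real values would come from the XADC driver.
    readings = range(0, 4096, 256)
    print(pack_block(readings).hex())
```

Truncation is the simplest scaling choice. Rounding to the nearest 8-bit code would be an equally reasonable alternative if the extra accuracy mattered.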
First, we must map in the correct driver files to drive the XADC within the hardware; this is very simple. We use SDSoC as we would for a normal SDK application and create a hardware platform using the HDF file within the <project_name>.sdk folder. Then we create a new BSP using that hardware platform (see blog 2; the process is the same for SDSoC). This step provides us with a driver header file for the XADC. We’ll need to use the xsysmon.h file and we’ll also use the xparameters.h file generated by this step.
We need to do all of this to obtain a BSP: until we build the project in SDSoC, no BSP is available, and without the driver header files the software will not compile correctly.
Once we have created these files, we need to ensure that SDSoC can see them. There are a number of ways we can do this:
Option 2 is the preferred option and is very simple to do. We just create a directory under our new definition and copy in the header files from the newly created BSP. Then we update the software PFM file as below to define the BSP configuration and the location of the new header files:
The next step is to ensure that we can accelerate the AES function. Previously, I wrote the software that I wanted to pipe-clean this stage. (You can probably tell that my day job involves developing systems for space.) When I first attempted to do this, I very quickly ran into an error because I had previously created the hardware platform (see blog 108) and I had only included the private processor interrupt needed for the XADC. What I had not done was think ahead to SDSoC’s needs.
Obviously, to create efficient solutions, SDSoC requires the use of the Zynq PL-to-PS (programmable logic to processor system) interrupts for data transfer etc. As such, I had to go back and recreate the platform with the Shared Peripheral Interrupts from the PL to the PS enabled.
To use these correctly in SDSoC, we need to connect a concatenation block to the interrupts, as below:
Note: Ignore the critical warning on the floating input to the concatenation block when you validate and generate the output products.
With all of these tasks completed, I could finally create an application to take the XADC data and encrypt it using AES. When I ran this program using just the Zynq PS running a bare-metal system, the software task consumed 28350 clock cycles, similar to what we achieved previously (blog 102).
Setting the AES encryption to run within the Zynq SoC’s PL side reduces the execution time to 12467 clock cycles, which is slower than the previous example because I only provided a 100MHz clock for use in the PL with the SDSoC hardware platform. This result teaches us another lesson about generating your own hardware: make sure you have included all of the clocks possible at frequencies you have considered.
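As a quick check on the two cycle counts quoted above, the following sketch computes the speedup from moving AES into the PL. It assumes both counts were measured with the same counter and clock, which the post does not spell out, and the function name is made up for this example.

```python
SOFTWARE_CYCLES = 28350   # AES running on the Zynq PS only
HARDWARE_CYCLES = 12467   # AES accelerated in the Zynq PL


def speedup(before: int, after: int) -> float:
    """Return how many times faster the 'after' measurement is."""
    if after <= 0:
        raise ValueError("cycle count must be positive")
    return before / after


if __name__ == "__main__":
    ratio = speedup(SOFTWARE_CYCLES, HARDWARE_CYCLES)
    saved = 1 - HARDWARE_CYCLES / SOFTWARE_CYCLES
    print(f"Speedup: {ratio:.2f}x ({saved:.0%} fewer cycles)")
```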
We can now use the Zynq XADC within SDSoC and we can accelerate things using the Zynq SoC’s PL side. Now that we have a new SDSoC Platform, we can finally begin to look at signal processing.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
You also can find links to all the previous MicroZed Chronicles blogs on my own Web site, here.
Waiting for the holiday sales to get started with Xilinx Virtex UltraScale FPGAs for your high-speed networking and other high-performance system designs? Get a head start with this sale on the VCU108 Eval Kit with a Xilinx UltraScale VU095 FPGA. I’ve just heard that Xilinx has knocked $500 off the price, making it $5495. But don't wait too long. I don't know how long this sale will last.
Here’s what the included board looks like:
Xilinx Virtex UltraScale VCU108 Eval Kit
The Xilinx VCU108 Eval Kit gives you fast access to a 20nm Virtex UltraScale XCVU095 FPGA (941K logic cells, 60.8Mbits of Block RAM, 768 DSP slices, 32 GTH 16Gbps SerDes ports, 32 GTY 30.5Gbps SerDes ports, four PCIe hard embedded blocks, six Interlaken blocks, and four 100G Ethernet hard MACs) and provides a great platform for prototyping systems that require massive data flow and packet processing such as 400+ Gbps systems, large-scale emulation, and high performance computing.
For more information on this kit, see “Xilinx VCU108 Eval Kit: Easy access to 20nm Virtex UltraScale FPGA for high-bandwidth and high-speed networking designs.”
Earlier this week, BittWare announced the XUSP3S PCIe high-speed networking card, which is based on either a Xilinx Virtex UltraScale or Kintex UltraScale FPGA. (The card leverages pin compatibility among six different Xilinx UltraScale FPGAs to provide BittWare’s customers with a large number of system-level design and manufacturing choices.) These networking cards all sport four front-panel QSFP28 optical port cages, each capable of operating as a 100Gbps Ethernet link or as four 25Gbps Ethernet links. In addition, the card sports two 72-bit-wide, on-board DDR4 SDRAM banks (max capacity 16Gbytes per bank) and two 260-pin SODIMM memory sockets, each accommodating as much as 16Gbytes of 72-bit-wide DDR4 SDRAM. That's a total board SDRAM capacity of 64Gbytes. (Note: the Kintex UltraScale version of the board supports only one SODIMM socket.)
In addition to the high-speed networking ports, the on-board Xilinx UltraScale FPGA provides significant, user-programmable computational capability, which BittWare’s customers can use to develop a wide range of on-board, network-oriented tasks (including all forms of network processing, security applications, hardware acceleration, storage, broadcast, and signal intelligence) using the company’s own BittWorks II Toolkit and associated FPGA development kit. The BittWorks II tools provide a variety of features to allow developers to take full advantage of the FPGA capabilities on BittWare’s boards.
Here’s a block diagram of BittWare’s XUSP3S board:
BittWare XUSP3S PCIe High-Speed Networking Card Block Diagram
The BittWare press release announcing the XUSP3S Networking card quotes company President and CEO Jeffry Milrod as saying, “BittWare has been exclusively focused on high-end FPGA boards for over a decade, and until now we hadn’t had a single Xilinx-based offering. Xilinx UltraScale FPGAs were so compelling that we recognized that we had to change that.” OK, that’s stunningly frank in a corporate press release (thanks so much!), but the press release goes even further with the following specific and unprompted testimonial from Milrod:
“We have been amazed by the ease of bring-up, the stability, and the performance of these FPGAs. The transceivers are rock solid at 25Gbps, external memory interfaces easily achieved max speeds, power is as promised, and we are in full production. With surprising adoption in pre-production, the XUSP3S has already been a huge success and is a wonderful way to kick off our new Xilinx-based product line.”
Now I’ve been writing about the bulletproof nature of Xilinx’s high-speed SerDes ports for more than two years, ever since Xcell Daily first appeared on the Web. However, it’s one thing for me, the Xilinx corporate blogger, to write it (even if it is true). It’s an entirely different thing to have a customer say it.
Now stuff like this is absolute catnip to a blogger so I headed straight to BittWare’s booth this week at SC15 in Austin, Texas to get more information. The first thing I did was grab the board and take a photo:
BittWare XUSP3S PCIe High-Speed Networking Card
Yes, it’s real. Yes, it’s shipping right now. No, the bits in the FPGA are not falling out just because the device is mounted upside down on the board.
I spoke to BittWare’s VP of Systems & Solutions Ron Huizen and SVP of Sales and Marketing Darren Taylor, who happened to be at BittWare’s booth. Most of what they told me…well, I simply can’t repeat it here. Suffice it to say that multiple reasons drove BittWare’s decision to base this new, high-end, high-speed, high-performance networking card on Xilinx UltraScale All Programmable devices. It wasn’t just the bulletproof SerDes ports, although those certainly helped a lot.
They absolutely backed up Jeffry Milrod’s description of BittWare’s experience with Xilinx UltraScale FPGAs. In addition, they told me that the XUSP3S high-speed networking card is BittWare’s first Xilinx-based board to be introduced in a very, very long time (I heard something about Virtex-II). I also heard that if you look carefully on BittWare’s Web site, you’ll find descriptions of two more high-performance BittWare networking boards, also based on Xilinx UltraScale devices, that are already in the works: the XUSPL4 Low-Profile PCIe Board with Dual QSFP and DDR4 SDRAM, again based on either a Xilinx Virtex UltraScale or Kintex UltraScale FPGA, and the XUSP3R 3/4-Length PCIe Board with 128 GBytes DDR4 SDRAM, based on a Xilinx Virtex UltraScale FPGA.
Xilinx has just announced the next generation of Spartan devices, dubbed Spartan-7. Xilinx introduced the first Spartan FPGA family in 1998 and over the family’s nearly two-decade lifespan, Spartan devices have been designed into the most cost-sensitive applications within the automotive, consumer, industrial IoT, data center, wired and wireless communications, and portable medical end-equipment markets. Spartan-7 devices continue the history of this extremely successful product line by delivering a lot of I/O pins in small packages, which is a critical factor for many cost-sensitive designs. In addition, according to yesterday’s press release “The family will provide up to 4X price-performance-per-watt improvement over previous generations.”
The Xilinx Spartan-7 device family is based on the same TSMC 28nm HPL process technology used to implement the other three very successful Xilinx 7 series device families—Virtex-7, Kintex-7, and Artix-7—and leverages all of the knowledge Xilinx has gained about this extremely versatile IC process technology over the past several years. Like all 7 series devices, the members of the Xilinx Spartan-7 family will be supported by the Vivado Design Suite of tools including the no-cost WebPACK edition.
More Spartan-7 details to follow at a later date.
Alpha Data is getting really crafty at using the various high-speed SerDes ports in advanced Xilinx devices. Case in point is this week’s announcement of the ADM-PCIE-8V3 Accelerator Board for 100G/25G network processing, which is based on a Xilinx Virtex UltraScale VU095 FPGA. The board is on display at the SC15 conference taking place in Austin, TX. Here’s a photo of the board that I shot at SC15:
Alpha Data ADM-PCIE-8V3 Accelerator Board for 100G/25G network processing
This board bristles with high-speed I/O ports, as befits an accelerator that's used to implement high-speed networks with embedded configurability and processing capabilities. The Virtex UltraScale VU095 FPGA has 32 GTY SerDes ports capable of 30.5Gbps operation and another 32 GTH SerDes ports capable of 16.3Gbps operation and the Alpha Data ADM-PCIE-8V3 Accelerator Board uses some of the GTY ports to provide I/O to its two 100G QSFP28 optical port cages and two 28Gbps-capable Samtec FireFly optical ports, which you can see mounted above the board on the card’s rear bracket in the photo above. (Note: Xcell Daily covered the Samtec FireFly high-speed optical I/O flyover system more than a year ago and it’s exciting to see this unique optical I/O system implemented in a board-level product like the Alpha Data ADM-PCIE-8V3 Accelerator Board. See “Samtec FireFly multi-Gbps interconnect system lets you choose between photons (100M) and electrons (28Gbps/channel)”.)
In addition, the Alpha Data ADM-PCIE-8V3 Accelerator Board implements a 16-lane PCIe Gen3 interface to the host server system.
All of this I/O serves the significant, high-speed processing power embedded in the Xilinx Virtex UltraScale VU095 FPGA, which is flanked by two on-board banks of 8Gbyte, 72-bit-wide DDR4-2400 ECC memory (for a total of 16Gbytes of SDRAM). This memory capacity and processing power can be harnessed for a variety of on-board network-processing applications.
Along with the board, Alpha Data supplies a BSP (Board Support Package) and a corresponding API.
Warning: The following blog post contains graphic references to math in connection with big data analytics. If you are sensitive to or easily put off by algorithmic imagery—equations for example—please skip this one. On the other hand, if you want to see how DRC Computer Corp gets a 100x speed boost in big data analytics, then please read on.
DRC Computer Corp is demonstrating its FPGA-boosted implementation of the Dijkstra and Betweenness Centrality algorithms this week in the Xilinx booth at SC15 in Austin, Texas. (See, I warned you about the math.) To cut to the performance chase, the company is using a 20nm Xilinx Virtex UltraScale VU190 FPGA in conjunction with an IBM POWER8 server and this system is getting a 100x speed boost in algorithmic execution versus a CPU-only implementation while the FPGA consumes only 25W or so. That’s two orders of magnitude speed improvement for very little power and energy consumption.
Just what are the Dijkstra and Betweenness Centrality algorithms? They’re used for graph networking. What’s a graph network? I could try to tell you, but we’re just digging deeper and deeper into the math pit. (See, I really did warn you about the math.) Suffice it to say that graph networking can rapidly identify relationships between people, events, locations, and objects. In other words, it’s an increasingly important big-data-analytics application for many commercial and government organizations, where extremely fast execution can be mission-critical for financial or other reasons.
DRC’s graph networking implementation uses complex analytics to discover relationships between entities, and it runs orders of magnitude faster than what can be achieved with conventional computer architectures. It’s running on Xilinx Virtex UltraScale FPGAs this week at SC15.
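For readers who do want a peek into the math pit, here is a minimal software sketch of Dijkstra's shortest-path algorithm, the building block on which weighted betweenness-centrality calculations are typically based. It is a plain CPU reference written for illustration, not DRC's FPGA implementation, and the small example graph is invented.

```python
import heapq


def dijkstra(graph, source):
    """Return the shortest distance from source to every reachable node.

    `graph` maps each node to a dict of {neighbor: non-negative edge weight}.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


if __name__ == "__main__":
    # Hypothetical four-node graph used only to exercise the function.
    graph = {
        "A": {"B": 1, "C": 4},
        "B": {"C": 2, "D": 5},
        "C": {"D": 1},
        "D": {},
    }
    print(dijkstra(graph, "A"))
```

Betweenness centrality then scores each node by how many of these shortest paths pass through it, which means running a search like this from every node in the graph. That all-sources workload is what makes the problem so expensive on large data sets.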
Here’s a photo of the system from the Xilinx booth at SC15:
DRC Computer Corp Graph Networking Demo at SC15
Nearly one year ago, The Dini Group announced the DNPCIe_40G_KU_LL FPGA board based on one of several Xilinx Kintex UltraScale All Programmable devices. (See “DINI Group Announces Immediate Availability of Kintex UltraScale FPGA Board.”) I ran across Mike Dini in his booth earlier this week at the SC15 conference in Austin and found out that The Dini Group has now moved its low-latency TCP Offload Engine IP to this board and the Xilinx Kintex UltraScale architecture and the result is even lower latency. Part of the reason is the faster performance of the 20nm Kintex UltraScale FPGAs but another part, according to Mike Dini, is lower-latency Ethernet PHY and MAC IP from Xilinx that’s been optimized for Xilinx UltraScale devices. These boards are used for high-frequency trading and in these applications every microsecond’s worth of latency reduction is worth a lot of money.
Here’s a photo of The Dini Group’s DNPCIe_40G_KU_LL FPGA board:
Dini Group’s DNPCIe_40G_KU_LL FPGA board based on Xilinx Kintex UltraScale FPGAs
And here’s a block diagram of the board:
Block Diagram of Dini Group’s DNPCIe_40G_KU_LL FPGA board based on Xilinx Kintex UltraScale FPGAs
Now I would never normally resort to a common stereotype like a leprechaun in an Xcell Daily blog post about the Irish Centre for High-End Computing (ICHEC) but in all honesty, the center’s Senior Software Architect Gilles Civario was dressed as one when I visited ICHEC’s booth at SC15 today (as you can see from the image to the right). I was visiting the booth to learn more about the center’s experiments with FPGA-based hardware acceleration for HPC (High-Performance Computing).
According to an ICHEC handout at SC15:
“Sustainable Exascale computing will only become possible with significant improvements in performance per watt on next generation systems. ICHEC has recently developed a SEMA (System for Energy Management on Accelerators) to better meet this goal… SEMA consists of a single workstation with a custom cooling configuration that can support up to three distinct HPC-relevant many-core accelerators in the form of PCIe cards. At the heart of SEMA are multiple Texas Instruments INA226 current shunt monitors that measure the device power rails with their built-in delta-sigma ADCs that can sample at 500KHz.”
The result is a server that can measure PCIe card power consumption with 99.9% accuracy and 1msec time resolution.
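To illustrate how readings like these turn into power and energy figures, here is a small sketch. It is not ICHEC's SEMA software: the function names and the shunt resistor value are hypothetical. It relies only on Ohm's law for a shunt monitor (current equals shunt voltage divided by shunt resistance, and power equals bus voltage times current) and on trapezoidal integration of evenly spaced power samples at the 1msec resolution mentioned above.

```python
def power_from_shunt(shunt_v: float, bus_v: float, shunt_ohms: float) -> float:
    """Rail power in watts from a shunt-voltage and bus-voltage reading."""
    if shunt_ohms <= 0:
        raise ValueError("shunt resistance must be positive")
    current_a = shunt_v / shunt_ohms   # Ohm's law across the shunt
    return bus_v * current_a


def energy_joules(power_w, dt_s: float = 1e-3) -> float:
    """Trapezoidal integral of evenly spaced power samples (default 1 ms apart)."""
    samples = list(power_w)
    if len(samples) < 2:
        return 0.0
    area = sum((a + b) / 2 for a, b in zip(samples, samples[1:]))
    return area * dt_s


if __name__ == "__main__":
    # Hypothetical 12 V rail measured through a 2 milliohm shunt.
    trace = [power_from_shunt(v, 12.0, 0.002) for v in (0.010, 0.012, 0.011)]
    print(f"Power samples (W): {trace}")
    print(f"Energy over the trace: {energy_joules(trace):.4f} J")
```

Energy, rather than peak power, is the quantity that matters when comparing devices, and it is only comparable when the devices run the same workload over the same interval.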
Researchers at ICHEC plugged three different PCIe processing cards into this instrumented server. The first was based on an Intel Xeon Phi 7120 CPU. The second was based on a Tesla K40 GPGPU. The third was an Alpha Data ADM7V3 FPGA card, which is based on a Xilinx Virtex-7 FPGA.
The selected benchmark program for ICHEC’s SEMA power-consumption testing is the SHOC benchmark written in OpenCL. SHOC consists of vectorized multiplication-addition (MADD) operations. Researchers made minor modifications to instrument and to adapt the code to each of the three processing elements (CPU, GPGPU, and FPGA).
The graphed results for all three types of PCIe cards appear below:
The ICHEC handout states: “The results show the various events that the devices go through while executing the OpenCL kernel. The initial peaks in all of the devices reflect the transfer of data into the device memory followed by the execution of the kernel. We can also observe the various power state changes the devices go through from execution to idle.”
These three graphs provide a lot of information but don’t be fooled, as I was initially. According to ICHEC’s SC15 handout on this project, “The [above] figures represent different workloads of the kernel and different observed intervals, therefore cannot be used to directly compare the energy efficiency of the devices.”
Even so, I make three key observations from these three graphs:
I believe this data shows that FPGAs have a real power and energy advantage over both CPUs and GPGPUs when running appropriate computational loads, but the above data is merely a quantitative confirmation of something you can see qualitatively by looking at the fans and heatsinks used to cool high-end CPUs and GPGPUs. You should draw your own conclusions.
From my exploration of the dozen or so Xilinx Alliance Members at SC15 this week, I got the clear impression that FPGA-based acceleration can boost code execution performance by anywhere from 5x to 100x depending on the nature of the code, the size of the data sets, and the amount of code optimization performed.
According to the handout, ICHEC is considering offering this power-measurement platform to interested researchers as a cloud-based HaaS (hardware as a service) system. Send an email to firstname.lastname@example.org for more information. In Irish lore, if you catch a leprechaun he owes you three wishes in exchange for his freedom. Take another look at Gilles above. Go catch him. I figure he still owes you one more wish besides low power and fast execution times.
The SC15 (Supercomputing 2015) exhibit floor opened up this evening in Austin, Texas and one of the first demos of FPGA-based acceleration I saw on the show floor was in the IBM booth. Gavin Stone, VP of Marketing for Edico Genome, was showing a 50-60x speedup in exome/genome analysis using his company’s Dragen accelerator card, which is based on a Xilinx FPGA. Here’s a photo of the Dragen board:
Edico Genome Dragen Accelerator Card for Exome/Genome Analysis, based on a Xilinx 28nm FPGA
The Xilinx FPGA is in the center of the board, wearing the Dragen decal.
The Edico Genome Dragen card is plugged into an IBM OpenPOWER server and acts as an accelerator for the server’s POWER processor. Assembling an exome or genome from the fragmented data supplied by a DNA sequencer is like assembling a jigsaw puzzle consisting of many millions of overlapping and duplicate puzzle pieces. It’s quite an exercise in template matching and mapping within a very large database.
Below is a 3-minute video demo of the Edico Genome system in action. As Stone explains in this video, the FPGA-based Dragen card cuts exome processing time to about six minutes. Software-only exome processing requires about six hours, so the FPGA-based hardware acceleration is providing a 50x to 60x speedup:
Currently, the Edico Genome Dragen board is based on a 28nm Xilinx device. Stone told me the company is already considering a Xilinx UltraScale device for its next-generation board and hoping to get another 5x performance improvement. I wouldn’t bet against it.
Well, it’s time to ‘fess up. TSMC’s 16nm FF+ FinFET process technology is turning out to be even better than expected, causing Xilinx to change the clock-rate specs for the Zynq UltraScale+ MPSoC’s APU (Application Processor Unit) and the GPU (Graphics Processing Unit)—in the good direction. Effective immediately, the quad-core ARM Cortex-A53 APU clock rates have jumped 15% to 20% with the fastest clock speed now rated as 1.5GHz for a -3 speed grade and the ARM Mali GPU clock rate has jumped 50% with the fastest clock speed now rated as 667MHz for a -3 speed grade. These improved specs are already reflected in the latest Zynq UltraScale+ MPSoC product guide now online.
Today, IBM and Xilinx announced a formal, joint strategic collaboration to develop higher performance and energy-efficient data center applications using Xilinx-based, FPGA-enabled workload acceleration on IBM POWER-based systems. As the press release states, this collaboration will “develop open acceleration infrastructures, software and middleware to address emerging applications such as machine learning, network functions virtualization (NFV), genomics, high performance computing (HPC) and big data analytics.” As part of the IBM and Xilinx strategic collaboration, IBM Systems Group developers will create solution stacks for POWER-based servers, storage and middleware systems with Xilinx FPGA accelerators for data center architectures such as OpenStack, Docker, and Spark. IBM will also develop and qualify Xilinx accelerator boards into IBM Power Systems servers.
“The combination of IBM and Xilinx provides our clients not only with a new level of accelerated computing made possible by the tight integration between IBM POWER processors and Xilinx FPGAs, but also gives them the ability to benefit directly from the constant stream of innovation being delivered by the rapidly expanding OpenPOWER ecosystem,” said Ken King, General Manager, OpenPOWER, IBM.
For its part, Xilinx is developing and will release POWER-based versions of its leading software defined SDAccel Development Environment and libraries for the OpenPOWER developer community. In addition, Xilinx has deepened its investment in the OpenPOWER Foundation, has raised its foundation membership to Platinum level, and has been approved for a Board Director position.
I see this agreement as a huge extension of the memcached acceleration work covered in last week’s Xcell Daily blog titled “Memcached KVS implementation services requests in 3 to 5 microseconds instead of hundreds or thousands using FPGAs and CAPI.” This technique melds the OpenPOWER Foundation’s coherent CAPI interface, high-speed PCIe connections, multiple PCIe-centric DMA controllers, and FPGA- and hardware-based application accelerators to both speed up applications by more than an order of magnitude and to cut power consumption in the data center. Both of these benefits are huge wins for the HPC (high-performance computing), NFV, and cloud-computing communities and take POWER8-based systems to computing spaces where the promise of Moore's Law falls short.
Boards that can take advantage of OpenPOWER’s CAPI protocol are already on the market. For example, Xcell Daily covered the announcement of Alpha Data’s CAPI Acceleration Development Kit earlier this year. (See “CAPI Acceleration Development Kit brings coherent FPGA acceleration to IBM POWER8 servers.”)
Note: For more Xcell Daily coverage of CAPI, see “Low-Power Coherent Accelerator Board boosts performance of IBM Power8 Servers through CAPI.”
By Adam Taylor
While SDSoC comes with a number of predefined hardware platforms, it is often necessary to create our own customized platform that configures the PS and the PL how we desire it. Over the previous few blogs, we created a system that incorporates the Zynq SoC’s XADC. We will now pull that hardware definition into SDSoC so that we can use it as the base for future developments. It’s always useful to have an ADC available.
A Xilinx development team has been investigating KVS (key-value store) implementations such as memcached since 2013. As many as 30% of the servers in data centers implement KVS functions, so accelerating these functions can significantly improve data center efficiency. Initial work with Xilinx FPGAs was able to demonstrate a 35x performance/power improvement relative to the fastest x86 implementations at the time. However, there were memory capacity and performance constraints that limited the results.
New work by the Xilinx team using PCIe DMA engines and the OpenPOWER CAPI protocol has yielded significant results without the limitations. Typical x86 installations service memcached requests within a range of hundreds to thousands of microseconds. These new OpenPOWER CAPI installations pairing IBM POWER8 processors with appropriately configured Xilinx FPGAs can service the same requests in 3 to 5 microseconds. Zounds!
Memcached using CAPI and FPGAs
If this work interests you, there’s a new blog posted on the OpenPOWER Foundation’s Web site, written by Michaela Blott, Principal Engineer, Xilinx Research, with more details. Click here.
Note: Xilinx has a booth at SC15 in Austin next week.
The great thing about attending shows like this week’s ARM TechCon in Silicon Valley is finding things you’d never guess existed. Case in point is a double surprise at the Micrium booth, where I saw μC/OS for the Xilinx Zynq UltraScale+ MPSoC operating and learned about a free μC/OS for Makers that encompasses Xilinx Zynq SoC designs.
The demo of μC/OS-III for the Xilinx Zynq UltraScale+ MPSoC was running on one of the Zynq MPSoC boards I wrote about earlier in the Xcell Daily blog. (See “Lift-off! 16nm Zynq UltraScale+ MPSoC ships to customers. From tapeout to “Hello World” in 2.5 months.”) The Xilinx Zynq UltraScale+ MPSoC combines several programmable processors including a quad-core 64-bit ARM Cortex-A53 APU (Application Processing Unit), a dual-core 32-bit ARM Cortex-R5 RPU (Real-Time Processing Unit), and the possibility of creating multiple Xilinx MicroBlaze instances within the Zynq MPSoC’s programmable logic. Micrium supports all of the processors in the Zynq MPSoC platform with the µC/OS-II and µC/OS-III operating systems and additional components including µC/TCP-IP, µC/USB, and µC/FS (among others).
Here’s a photo of Micrium’s OS software running a software-defined radio app on the Zynq UltraScale+ MPSoC board at ARM TechCon with the various software components shown on the LCD:
At the same time, I learned about the new μC/OS for Makers package, which gives qualified Makers a full-featured version of μC/OS that supports the Xilinx Zynq SoC and other processor architectures.
How do you know if you qualify? It looks like there are three different qualifying determinants:
A lot of people will qualify and if they grow out of the qualification thanks to a successful design using a Xilinx Zynq with μC/OS, that’s a good thing, right?
Here’s a chart of Micrium licensing options scanned from a handout I picked up at the Micrium booth:
There are a lot of awards in our industry and I do not normally blog about them. However, I do make exceptions and the annual Thomson Reuters Top 100 Global Innovators award is one of those exceptions. For the fourth year in a row, Thomson Reuters has named Xilinx in its Top 100 Global Innovators report. Xilinx innovations are directly aimed at helping customers integrate the highest levels of software-based intelligence with hardware optimization and any-to-any connectivity in all applications including those associated with six key Megatrends (5G Wireless, SDN/NFV, Video/Vision, ADAS, Industrial IoT, and Cloud Computing) shaping the world’s industries today.
According to SVP David Brown, Thomson Reuters uses a scientific approach to analyzing metrics including patent volume, application-to-grant success, globalization and citation influence. Consequently, this award is based on objective criteria and is not a popularity contest, which is why I consider it bloggable. That, and Xilinx’s presence on the Top 100 list this year, and in 2012, 2013, and 2014. (Note: The top 100 innovators are not ranked. You’re either on the list—or you’re not. Xilinx is.)
Brown writes in a preface to the report:
“…we’ve developed an objective formula that identifies the companies around the world that are discovering new inventions, protecting them from infringers and commercializing them. This is what we call the “Lifecycle of Innovation:” discovery, protection and commercialization. Our philosophy is that a great idea absent patent protection and commercialization is nothing more than, a great idea.”
“…for five consecutive years the Thomson Reuters Top 100 companies have consistently outperformed other indices in terms of revenue and R&D spend. This year, our Top 100 innovators outperform the MSCI World Index in revenue by 6.01 percentage points and in employment by 4.09 percentage points. We also outperform the MSCI World Index in market-cap-weighted R&D spend by 1.86 percentage points. The conclusion: investment in R&D and innovation results in higher revenue and company success.”
Here’s a video showing Thomson Reuters Senior IP Analyst Bob Stembridge describing the methodology for determining the world’s most innovative companies for this report:
For more information about this fascinating study and report, use the link above and download the report PDF.
PFP Cybersecurity’s eMonitor employs technology based on taking fine-grained measurements of a processor’s power consumption and performing anomaly detection using base references from trusted software sources, machine learning, and data analytics. The overall PFP solution employs distributed sensors in IoT devices, in the cloud, or in on-premises machines to make these power measurements, which it then compares to a template envelope of expected power behavior. PFP Cybersecurity is demonstrating a Zynq-based version of its technology at this week’s ARM TechCon here in Silicon Valley. The technology is intended as a design add-on for production hardware.
Here’s a photo from the Xilinx booth at ARM TechCon showing the Zynq-based version of the company’s external demonstration vehicle—the eMonitor (in the blue box)—which is testing known good and known bad boards that have supposedly identical designs:
Zynq-based version of PFP Cybersecurity’s eMonitor (in the blue box)
Both test boards are based on Xilinx Spartan-3E FPGAs. The eMonitor’s only connection to the boards is the dc power cable. The good board has a known FPGA configuration, which produces this power fingerprint:
Note that all readings but one are green and below a nominal threshold line. As shown in this photo, the one red reading is not sufficient to invalidate the good board.
The second board is physically identical to the good board and has the same FPGA configuration but a free-running binary counter has been added to the FPGA configuration as an example of a “Trojan” or “virus.” This “bad” board produces this power fingerprint:
As you can see, it’s extremely easy to tell the difference between these nearly identical boards using PFP Cybersecurity's technology.
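To make the pass/fail idea concrete, here is a deliberately simplified Python sketch of threshold-based checking in which a single out-of-range reading does not fail a board. It is not PFP Cybersecurity's algorithm, which the company describes as using trusted baselines, machine learning, and data analytics. The function name, the threshold, the outlier allowance, and all of the readings are hypothetical values chosen only to mirror the behavior seen in the demo.

```python
from typing import Sequence


def board_passes(readings: Sequence[float],
                 threshold: float,
                 max_outliers: int = 1) -> bool:
    """Accept a board if at most max_outliers readings exceed the threshold."""
    outliers = sum(1 for r in readings if r > threshold)
    return outliers <= max_outliers


# Hypothetical anomaly scores, NOT measurements from the demo.
THRESHOLD = 1.0
good_board = [0.20, 0.30, 0.25, 1.10, 0.28, 0.22]  # one "red" reading
bad_board = [1.40, 1.60, 1.50, 1.70, 1.30, 1.80]   # consistently high

print("good board passes:", board_passes(good_board, THRESHOLD))
print("bad board passes: ", board_passes(bad_board, THRESHOLD))
```

Tolerating a small number of outliers is one simple way to avoid rejecting a known-good board on a single noisy measurement, which matches what the good board's fingerprint shows.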
Yesterday, AGGIOS announced its Seed Energy Manager, which provides software-defined power management for the Xilinx Zynq UltraScale+ MPSoC. I saw this new power-management tool in action yesterday at ARM TechCon in Silicon Valley. In conjunction with the company’s EnergyLab energy management synthesis tool, Seed Energy Manager gives you remarkably simple control over the power consumption of complex, multi-processor systems based on the Zynq MPSoC. “The Xilinx Zynq MPSoC is an ideal target for our software-defined power management solutions because of the levels of multiprocessing complexity it handles and the critical importance of optimizing power for the whole application," said Dr. Vojin Zivojnovic, CEO of AGGIOS.
The EnergyLab tool allows you to define the independent blocks in your system so that you can compute the energy savings when you turn them off. EnergyLab then creates an abstracted system description in UHAL, a Unified Hardware Abstraction Layer, that Seed Energy Manager uses to develop power-management strategies based on the system’s actual resource usage. The Seed Energy Manager:
Davorin Mista, co-founder and VP of Engineering of AGGIOS, gave me an impressive demo in the AGGIOS booth at ARM TechCon. The company is using one of the Zynq MPSoC boards I wrote about earlier in the Xcell Daily blog. (See “Lift-off! 16nm Zynq UltraScale+ MPSoC ships to customers. From tapeout to “Hello World” in 2.5 months.”)
AGGIOS Software-Defined Radio Power Demo using Zynq UltraScale+ MPSoC at ARM TechCon
AGGIOS is clearly taking advantage of many of the power-management features inside of the Zynq MPSoC. Equally clearly, AGGIOS has gotten a lot of early information about the Zynq MPSoC’s power management system from Xilinx and is making excellent use of that information.
I was particularly impressed with Seed Energy Manager’s ability to entirely shut down and power down the Zynq MPSoC’s FPGA fabric when not needed and then automatically repower and re-configure the FPGA in a matter of 20msec or so when needed. This is exactly the sort of ability you want in power-constrained applications—and today, what applications are not power constrained?
The Xilinx Zynq UltraScale+ MPSoC is a complex device with its four ARM Cortex-A53 application processors, two ARM Cortex-R5 real-time processors, and various other specialized processors, not to mention the attached FPGA. You are going to need a tool like the AGGIOS Seed Energy Manager, so why not take a look?
Cloud infrastructure must dramatically transform to meet the increasing demands of applications including image and speech recognition, video processing, and personalized medicine. Next week at SC15 in Austin, 14 companies will showcase cloud-based applications that benefit from the performance boost provided by Xilinx FPGAs. These companies include:
In addition, Xilinx will be demonstrating eight such applications in its SC15 booth.
For more information, click here.
FPGA usage has evolved from its early use as glue logic, as reflected in the six Megatrends now making significant use of Xilinx All Programmable devices: 5G Wireless, SDN/NFV, Video/Vision, ADAS, Industrial IoT, and Cloud Computing. Today, you’re just as likely to use one Xilinx All Programmable device to implement a single-chip system because that’s the fastest way to get from concept to working, production systems. Consequently, system-level testing of Xilinx devices has similarly evolved to track these more advanced uses for the company’s products.
If you’d like more information about this new level of testing, a good place to look is page 11 of the just-published 2015 Annual Quality Report from Xilinx. (You just might want to take a look at all of the report’s pages while you’re at it.)
Adam Taylor, the author of the popular MicroZed Chronicles series, is starting a new article series covering the Xilinx SDSoC development environment over on the Embedded.com site. His first article states:
“Today HLS has broken through into the main stream with FPGA vendors and EDA companies offering tools which convert C, C++, System C and Matlab into FPGA bit streams. Another inflection point has been reached with the advent of System on Chip (SoC) devices that tightly couple both processors and programmable within the same die… what if we could design both the processor and the programmable logic not only with the same high level language but also with the same tool and move design elements from one side to the other with ease to ensure we could meet the performance requirements of the system?”
Taylor’s new article series will go on to explain just exactly how you do that.
Based on his MicroZed series, this new one is highly recommended.
By Adam Taylor
Now that we have determined the latency from interrupt assertion to ISR execution on the Xilinx Zynq SoC, we can define the hardware we want to pull into SDSoC from Vivado. Knowing and designing for the latency means we can be sure that we never miss an XADC sample. If we had been unable to determine the latency, we would have been forced to develop a more complicated acquisition architecture in Vivado. Luckily, we have significant margin.
Recall that in blog 104 in this series, we examined XADC use in the real world. You will remember that we can sample at 961.5385 ksamples/sec using a 100MHz AXI clock frequency. That means a new sample and corresponding interrupt will occur every 1.04 microseconds. It is therefore simple to show that we have sufficient margin given the measured ISR latency. (Between you and me, I was pretty confident we’d be OK but actually measuring the timing helps when we look at more complex designs.)
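The timing budget can be checked with a few lines of Python. This is only a sketch of the arithmetic: the sample rate and AXI clock frequency come from the text, while the ISR latency passed to `latency_margin_us` is a hypothetical placeholder, because the measured figure is not quoted in this excerpt. Substitute your own measurement.

```python
SAMPLE_RATE_HZ = 961_538.5      # 961.5385 ksamples/sec, from the text
AXI_CLOCK_HZ = 100e6            # 100MHz AXI clock, from the text


def sample_period_us(sample_rate_hz: float) -> float:
    """Time between successive XADC samples (and interrupts), in microseconds."""
    return 1e6 / sample_rate_hz


def latency_margin_us(sample_rate_hz: float, isr_latency_us: float) -> float:
    """Time left over in each sample period once the ISR has started running."""
    return sample_period_us(sample_rate_hz) - isr_latency_us


period_us = sample_period_us(SAMPLE_RATE_HZ)
clocks_per_sample = AXI_CLOCK_HZ / SAMPLE_RATE_HZ

print(f"Sample period: {period_us:.2f} us")
print(f"AXI clocks per sample: {clocks_per_sample:.0f}")

# Hypothetical ISR latency, for illustration only; use your measured value.
EXAMPLE_ISR_LATENCY_US = 0.5
margin_us = latency_margin_us(SAMPLE_RATE_HZ, EXAMPLE_ISR_LATENCY_US)
print(f"Margin with a {EXAMPLE_ISR_LATENCY_US} us latency: {margin_us:.2f} us")
```

A positive margin means the ISR begins executing before the next sample arrives. A negative one would call for the more complicated acquisition architecture mentioned above.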
When, oh when, will Xilinx deign to part with more technical specs on the Xilinx Zynq UltraScale+ MPSoC? After all, the first devices shipped to customers a whole five weeks ago. (See “Lift-off! 16nm Zynq UltraScale+ MPSoC ships to customers. From tapeout to “Hello World” in 2.5 months.”) How long must we wait?
Not another minute more.
Hot off the press is the initial release of the 15-page White Paper WP470, officially titled “Unleash the Unparalleled Power and Flexibility of Zynq UltraScale+ MPSoCs.” From it, you’ll get the following block diagram:
You’ll also find these vital Zynq UltraScale+ MPSoC Processing System stats:
You’ll want to know how much of that new UltraScale architectural programmability is packed into the Zynq UltraScale+ MPSoC device family. WP470 has you covered there too:
That should be more than enough to whet your appetite for more info. Go, get White Paper WP470 and find out more—a lot more—about the Xilinx Zynq UltraScale+ MPSoC.
Xcell Daily has already covered the crowd-funded, Zynq-based, wireless Snickerdoodle Dev Board more than a few times but krtkl’s pre-ordering and funding campaign on CrowdSupply just racked up its thousand-and-first pledge and the total funds raised is now $58,291. That’s 105% of goal, which is pretty darn impressive for a Zynq dev board with the low, low base price of only $55 (plus $5 shipping). Certainly worth another post in this blog.
krtkl’s Wireless Snickerdoodle Dev Board based on a Xilinx Zynq SoC
If you are interested in working with the Xilinx Zynq SoC at a rock-bottom price, check out the Snickerdoodle. There are still two weeks left in the funding campaign, but this one’s already a “go.”
For more information about the Snickerdoodle Dev Board, see “$55 Zynq-based Wireless Snickerdoodle single-board computer with WiFi, Bluetooth launched today on CrowdSupply.”