Every device family in the Xilinx UltraScale+ family of devices (Virtex UltraScale+ FPGAs, Kintex UltraScale+ FPGAs, and Zynq UltraScale+ MPSoCs) have members with 28Gbps-capable GTY transceivers. That’s likely to be important to you as the number and forms of small, 28Gbps interconnect grow. You have many such choices in such interconnect these days including:
The following 5.5-minute video demonstrates all of these interfaces operating with 25.78Gbps lanes on Xilinx VCU118 and KCU116 Eval Kits, as concisely explained (as usual) by Xilinx’s “Transceiver Marketing Guy” Martin Gilpatric. Martin also discusses some of the design challenges associated with these high-speed interfaces.
But first, as a teaser, I could not resist showing you the wide-open IBERT eye on the 25.78Gbps Samtec FireFly AOC:
Now that’s a desirable eye.
Here’s the new video:
Amazon Web Services (AWS) is now offering the Xilinx SDAccel Development Environment as a private preview. SDAccel empowers hardware designers to easily deploy their RTL designs in the AWS F1 FPGA instance. It also automates the acceleration of code written in C, C++ or OpenCL by building application-specific accelerators on the F1. This limited time preview is hosted in a private GitHub repo and supported through an AWS SDAccel forum. To request early access, click here.
The HTG-910 Low-Profile PCIe Development Platform from Hitech Global teams a Virtex UltraScale+ (VU9P, VU13P) or Virtex UltraScale VU190 FPGA with two QSFP28 (4x15G) optical cages, two Samtec FireFly Micro Flyover ports (each capable of 100Gbps operation), and 34Gbytes of DDR4 SDRAM in three independent banks. There’s also a Z-Ray interposer capable of carrying 16 32.75Gbps GTY SerDes transceiver ports from the FPGA to a high-speed mezzanine card.
Here’s a block diagram of the card:
And here’s a photo:
This is one big, fast PCIe card that should be capable of implementing just about anything you can think up.
Last week, stealth startup Burlywood in Longmont, Colorado unstealthed and announced TrueFlash, the industry’s first modular NAND Flash memory controller for SSDs. The controller is designed to manage media (like NAND Flash memory) that can exhibit high defects and error rates. The controller is designed to scale to 100Tbytes and beyond, accommodates 3D TLC and QLC Flash devices from multiple sources in the same controller, and can be tuned to specific environments and workloads. According to the company’s brief explanation on its Web site (they are just unstealthing), the Burlywood SSD controller IP “allows for rapid integration of customer specified requirements across interfaces, protocols, FTL, QoS, capacity, flash types, and form-factor.” No doubt all of that flexibility comes from the implementation technology: a Xilinx UltraScale+ FPGA.
For more information about the Burlywood TrueFlash SSD controller, please contact the company directly.
Two new papers, one about hardware and one about software, describe the Snowflake CNN accelerator and accompanying Torch7 compiler developed by several researchers at Purdue U. The papers are titled “Snowflake: A Model Agnostic Accelerator for Deep Convolutional Neural Networks” (the hardware paper) and “Compiling Deep Learning Models for Custom Hardware Accelerators” (the software paper). The authors of both papers are Andre Xian Ming Chang, Aliasger Zaidy, Vinayak Gokhale, and Eugenio Culurciello from Purdue’s School of Electrical and Computer Engineering and the Weldon School of Biomedical Engineering.
In the abstract, the hardware paper states:
“Snowflake, implemented on a Xilinx Zynq XC7Z045 SoC is capable of achieving a peak throughput of 128 G-ops/s and a measured throughput of 100 frames per second and 120 G-ops/s on the AlexNet CNN model, 36 frames per second and 116 Gops/s on the GoogLeNet CNN model and 17 frames per second and 122 G-ops/s on the ResNet-50 CNN model. To the best of our knowledge, Snowflake is the only implemented system capable of achieving over 91% efficiency on modern CNNs and the only implemented system with GoogLeNet and ResNet as part of the benchmark suite.”
The primary goal of the Snowflake accelerator design was computational efficiency. Efficiency and bandwidth are the two primary factors influencing accelerator throughput. The hardware paper says that the Snowflake accelerator achieves 95% computational efficiency and that it can process networks in real time. Because it is implemented on a Xilinx Zynq Z-7045, power consumption is a miserly 5W according to the software paper, well within the power budget of many embedded systems.
The hardware paper also states:
“Snowflake with 256 processing units was synthesized on Xilinx's Zynq XC7Z045 FPGA. At 250MHz, AlexNet achieved in 93:6 frames/s and 1:2GB/s of off-chip memory bandwidth, and 21:4 frames/s and 2:2GB/s for ResNet18.”
Here’s a block diagram of the Snowflake machine architecture from the software paper, from the micro level on the left to the macro level on the right:
There’s room for future performance improvement notes the hardware paper:
“The Zynq XC7Z045 device has 900 MAC units. Scaling Snowflake up by using three compute clusters, we will be able to utilize 768 MAC units. Assuming an accelerator frequency of 250 MHz, Snowflake will be able to achieve a peak performance of 384 G-ops/s. Snowflake can be scaled further on larger FPGAs by increasing the number of clusters.”
This is where I point out that a Zynq Z-7100 SoC has 2020 “MAC units” (actually, DSP48E1 slices)—which is a lot more than you find on the Zynq Z-7045 SoC—and the Zynq UltraScale+ ZU15EG MPSoC has 3528 DSP48E2 slices—which is much, much larger still. If speed and throughput are what you desire in a CNN accelerator, then either of these parts would be worthy of consideration for further development.
This week, Everspin launched its line of MRAM-based nvNITRO NVMe Storage Accelerator cards with an incredible performance spec: up to 1.46 million IOPS for random 4Kbyte mixed 70/30 read/write operations. In the world of IOPS, that’s very fast. In fact it’s roughly 3x faster than an Intel P4800X Optane SSD card, which is spec’ed at up to 500K IOPS for random 4Kbyte mixed 70/30 read/write operations. Multiple factors contribute to the nvNITRO Storage Accelerator’s speed including Everspin’s new 1Gbit Spin Torque Magnetorestrictive RAM (ST-MRAM) with high-speed, DDR4, SDRAM-compatible I/O; a high-performance, MRAM-specific memory controller IP block compatible with NVMe 1.1+; and the Xilinx Kintex UltraScale KU060 FPGA that implements the MRAM controller and the board’s PCIe Gen3 x8 host interface. Everspin’s nvNITRO NVMe cards will ship in Q4 of 2017 and will be available in 1 and 2Gbyte capacities.
Everspin’s nvNITRO NVMe card
Nonvolatile MRAM delivers several significant advantages over other memory technologies used to implement NVMe cards. It’s non-volatile, so no backup power is needed. In addition, ST-MRAM has very high endurance (see chart below), so the nvNITRO card accommodates unlimited drive writes per day, eliminates the need for wear-leveling algorithms that steal memory cycles in NAND-Flash storage, and exhibits no degradation in read/write performance over time.
Everspin’s ST-MRAM has low write times and high write endurance
So much for the chart’s Y axis. You can see from the X axis that Everspin’s ST-MRAM has a very fast write speed—it’s about as fast as DRAM—which is one reason that the nvNITRO Storage Accelerator has such fast read/write performance.
There’s one more line in the Everspin nvNITRO NVMe Storage Accelerator’s data sheet that’s worth mentioning:
“Customer-defined features using own RTL with programmable FPGA”
There appears to be room for your own custom code in that Kintex UltraScale KU060 FPGA that implements the PCIe interface and ST-MRAM controller on the nvNITRO Storage Accelerator card. You can add your own special sauce to the design with no incremental BOM cost. Try doing that with an ASSP!
Semiwiki’s Bernard Murphy recently published a blog titled “Cloud-Based Emulation” that describes the cloud-based version of Aldec’s HES FPGA-based hardware-emulation and -prototyping platform and HES-DVM, an automated and scalable hybrid verification environment for SoC, ASIC, and FPGA-based designs. You access this service through Amazon Web Services’ Elastic Compute cloud (AWS EC2). Murphy’s blog is based on a blog written by Krzysztof Szczur, the Hardware Verification Products Manager at Aldec that’s titled “Emulation on the Cloud.”
Here’s a conceptual picture of Aldec’s cloud-based offering:
At the bottom is the actual Aldec server hardware, which resides at Aldec. It’s based on the company’s HES7XV1380BP Prototyping and Emulation Board. That board incorporates two Xilinx Virtex-7 690T FPGAs. Like amazon’s own FPGA offering, the AWS EC2 F1 instance, Aldec’s HES Cloud is an interesting new way to provide instant access to FPGA capabilities via the cloud. I expect to see more application like this in the future.
Netcope Technologies’ NFB-100G2Q NIC broke industry records for 100GbE performance earlier this year by achieving 148.8M packets/sec throughput on DPDK (the Data Plane Development Kit) for 64-byte packets—which is 100GbE’s theoretical maximum. That’s good news if you’re deploying NFV applications. Going faster is the name of the game, after all. That performance—tracking the theoretical maximum as defined by line rate and packet size—continues as the frame size gets larger. You can see that from this performance graph from the Netcope Web site:
It’s possible to go even faster, you just need a faster line rate. That’s what the just-announced Netcope NFB-200G2QL 200G Programmable Smart NIC is for: sending packets to your application at 200Gbps over two 100GbE connections. The Netcope NFB-100G2Q NIC is based on a Xilinx Virtex-7 580T FPGA. The NFB-200G2QL Smart NIC (with NACA/NASA-style air scoop) is based on a Xilinx Virtex UltraScale+ FPGA.
The Netcope NFB-200G2QL 200G Programmable Smart NIC is based on a Xilinx Virtex UltraScale+ FPGA
For more information about Netcope’s DPDK performance, see the company’s White Paper titled “Improving DPDK Performance.”
For more information about the Netcope NFB-100G2 NIC in Xcell Daily, see “Brief demo at SC15 by NetCOPE shows the company’s 100G Ethernet boards in action.”
Amazon’s AWS EC2 F1 instance, which delivers FPGA as a service and is based on Xilinx Virtex UltraScale+ VU9P FPGAs, has been available for a while but you may not have tried it yet. Here’s another reason to consider trying it out: a new 17-minute video that demonstrates how to debug your FPGA design in the cloud. The video covers:
Here’s the video:
For more information about the AWS EC2 F1 instance, see:
Earlier this year, the University of New Hampshire’s InterOperability Laboratory (UNH-IOL) gave a 25G and 50G Plugfest and everybody came to the party to test compatibility of their implementations with each other. The long list of partiers included:
“The 25 Gigabit Ethernet Consortium is an open organization to all third parties who wish to participate as members to enable the transmission of Ethernet frames at 25 or 50 Gigabit per second (Gbps) and to promote the standardization and improvement of the interfaces for applicable products.”
From the Consortium’s press release about the plugfest:
“The testing demonstrated a high degree of multi-vendor interoperability and specification conformance.”
For its part, Xilinx tested its 10/25G High-Speed Ethernet LogiCORE IP and 40/50G High-Speed Ethernet LogiCORE Subsystem IP using the Xilinx VCU108 Eval Kit based on a Virtex UltraScale XCVU095-2FFVA2104E FPGA over copper using different cable lengths. Consortium rules do not permit me to tell you which companies interoperated with each other, but I can say that Xilinx tested against every company on the above list. I’m told that the Xilinx 25G/50G receiver “did well.”
Xilinx Virtex UltraScale VCU108 Eval Kit
To paraphrase Douglas Adams’ Hitchhikers Guide to the Galaxy: “400GE is fast. Really fast. You just won't believe how vastly, hugely, mind-bogglingly fast it is.”
Xilinx, Microsoft/Azure Networking, and Juniper held a 400GE panel at OFC 2017 that explored the realities of the 400GE ecosystems, deployment models and why the time for 400GE has arrived. The half-hour video below is from OFC 2017. Xilinx’s Mark Gustlin discusses the history of Ethernet from 10Mbps in the 1980s to today’s 400GE, including an explanation lower-speed variants and why they exist. It also provides technical explanations for why the 400GE IEEE technical specs look the way they do and what 400GE optical modules will look like as they evolve. Microsoft/Azure Networking’s Brad Booth describes what he expects Azure’s multi-campus, data-center networking architecture to look like in 2019 and how he expects 400GE to fit into that architecture. Finally, Juniper’s David Ofelt discusses how the 400GE development model has flipped: the hyperscale developers and system vendors are now driving the evolution and the carriers are following their lead. He also touches on the technical issues that have held up 400GE development and what happens when we max out on optical module density (we’re almost there).
For more information about 400GE in Xcell Daily, see:
Baidu’s FPGA Cloud Compute Server, a new high-performance computing service in Baidu’s Cloud, caps the company’s nine years of research into FPGA-accelerated computing research—resulting in this announcement of widespread deployment. “FPGAs have the capability to deliver significant performance for deep learning inference, security, and other high growth data center applications,” said Liu Yang, Head of Baidu Technical Infrastructure, Co-General Manager of Baidu Cloud. “Years of research and FPGA engineering expertise at Baidu has culminated in our delivery of proven acceleration infrastructure for industry and academia.”
The Baidu FPGA Cloud Server provides a complete FPGA-based, hardware and software development environment and includes numerous design examples to help you achieve rapid development and migration while reducing development costs. Each FPGA instance in the Baidu FPGA Cloud Compute Server is a dedicated acceleration platform. FPGA resources are never shared between instances or users. The design examples cover deep learning acceleration and encryption/decryption, among others. In addition, the Baidu FPGA Cloud Server includes real-time monitoring of hardware resources, with statistics for the average length of the queue and the hardware temperature, to help users understand the acceleration hardware’s use and allow the handling of unexpected situations to reduce development risk.
You can quickly purchase one or more FPGA instances using the Baidu Cloud console in just a few minutes.
To provide this new service, Baidu developed its own FPGA accelerator card based on the Xilinx Kintex UltraScale KU115 FPGA. (That’s the DSP monster of the 20nm UltraScale FPGA family with 5520 DSP48 slices and 1.451M system logic cells.) According to Baidu, its FPGA Cloud Server can increase application speed by as much as 100x relative to CPU-based implementations.
Note: For more information about Baidu’s development of FPGA-based cloud acceleration, see “Baidu Adopts Xilinx Kintex UltraScale FPGAs to Accelerate Machine Learning Applications in the Data Center.”
Bittware’s XUPP3R PCIe card based on the Xilinx Virtex UltraScale+ VU9P FPGA has become really popular with customers. (See “BittWare’s UltraScale+ XUPP3R board and Atomic Rules IP run Intel’s DPDK over PCIe Gen3 x16 @ 150Gbps.”) That popularity has led to the inevitable question from BittWare’s customers: How about a bigger FPGA? Although physically, it’s easy to stick a bigger device on a big PCIe card, there’s an issue with heat—getting rid of it. To tackle this engineering problem, BittWare has developed an entirely new platform called “Viper” that employs computer-based thermal modeling, heat pipes, channeled airflow, and the new Xilinx “lidless” D2104 package to get heat out of the FPGA and into the cooling airstream of the PCIe card cage more efficiently. (For more information about the Xilinx lidless D2104 package, see “Mechanical and Thermal Design Guidelines for the UltraScale+ FPGA D2104 Lidless Flip-Chip Packages.”)
The first card to use the Viper platform is the BittWare XUPVV4.
BittWare’s XUPVV4 PCIe Card employs the company’s new Viper Platform with heat-pipe cooling for lidless FPGAs
Here are the specs for the BittWare XUPVV4:
You should be able to build pretty much whatever you want with this board. So, if someone comes to you and says, “you’re gonna’ need a bigger FPGA,” take a look at the BittWare XUPVV4. Plug it into a server and accelerate something today.
Cloud computing and application acceleration for a variety of workloads including big-data analytics, machine learning, video and image processing, and genomics are big data-center topics and if you’re one of those people looking for acceleration guidance, read on. If you’re looking to accelerate compute-intensive applications such as automated driving and ADAS or local video processing and sensor fusion, this blog post’s for you to. The basic problem here is that CPUs are too slow and they burn too much power. You may have one or both of these challenges. If so, you may be considering a GPU or an FPGA as an accelerator in your design.
How to choose?
Although GPUs started as graphics accelerators, primarily for gamers, a few architectural tweaks and a ton of software have made them suitable as general-purpose compute accelerators. With the right software tools, it’s not too difficult to recode and recompile a program to run on a GPU instead of a CPU. With some experience, you’ll find that GPUs are not great for every application workload. Certain computations such as sparse matrix math don’t map onto GPUs well. One big issue with GPUs is power consumption. GPUs aimed at server acceleration in a data-center environment may burn hundreds of watts.
With FPGAs, you can build any sort of compute engine you want with excellent performance/power numbers. You can optimize an FPGA-based accelerator for one task, run that task, and then reconfigure the FPGA if needed for an entirely different application. The amount of computing power you can bring to bear on a problem is scary big. A Virtex UltraScale+ VU13P FPGA can deliver 38.3 INT8 TOPS (that’s tera operations per second) and if you can binarize the application, which is possible with some neural networks, you can hit 500TOPS. That’s why you now see big data-center operators like Baidu and Amazon putting Xilinx-based FPGA accelerator cards into their server farms. That’s also why you see Xilinx offering high-level acceleration programming tools like SDAccel to help you develop compute accelerators using Xilinx All Programmable devices.
For more information about the use of Xilinx devices in such applications including a detailed look at operational efficiency, there’s a new 17-page White Paper titled “Xilinx All Programmable Devices: A Superior Platform for Compute-Intensive Systems.”
When someone asks where Xilinx All Programmable devices are used, I find it a hard question to answer because there’s such a very wide range of applications—as demonstrated by the thousands of Xcell Daily blog posts I’ve written over the past several years.
Now, there’s a 5-minute “Powered by Xilinx” video with clips from several companies using Xilinx devices for applications including:
That’s a huge range covered in just five minutes.
Here’s the video:
Light Reading’s International Group Editor Ray Le Maistre recently interviewed David Levi, CEO of Ethernity Networks, who discusses the company’s FPGA-based All Programmable ACE-NIC, a Network Interface Controller with 40Gbps throughput. The carrier-grade ACE-NIC accelerates vEPC (virtual Evolved Packet Core, a framework for virtualizing the functions required to converge voice and data on 4G LTE networks) and vCPE (virtual Customer Premise Equipment, a way to deliver routing, firewall security and virtual private network connectivity services using software rather than dedicated hardware) applications by 50x, dramatically reducing end-to-end latency associated with NFV platforms. Ethernity’s ACE-NIC is based on a Xilinx Kintex-7 FPGA.
“The world is crazy about our solution—it’s amazing,” says Levi in the Light Reading video interview.
Ethernity Networks All Programmable ACE-NIC
Because Ethernity implements its NIC IP in a Kintex-7 FPGA, it was natural for Le Maistre to ask Levi when his company would migrate to an ASIC. Levi’s answer surprised him:
“We offer a game changer... We invested in technology—which is covered by patents—that consumes 80% less logic than competitors. So essentially, a solution that you may want to deliver without our patents will cost five times more on FPGA… With this kind of solution, we succeed over the years in competing with off-the-shelf components… with the all-programmable NIC, operators enjoy the full programmability and flexibility at an affordable price, which is comparable to a rigid, non-programmable ASIC solution.”
In other words, Ethernity plans to stay with All Programmable devices for its products. In fact, Ethernity Networks announced last year that it had successfully synthesized its carrier-grade switch/router IP for the Xilinx Zynq UltraScale+ MPSoC and that the throughput performance increases to 60Gbps per IP core with the 16nm device—and 120Gbps with two instances of that core. “We are going to use this solution for novel SDN/NFV market products, including embedded SR-IOV (single-root input/output virtualization), and for high density port solutions,” – said Levi.
Towards the end of the video interview, Levi looks even further into the future when he discusses Amazon Web Services’ (AWS’) recent support of FPGA acceleration. (That’s the Amazon EC2 F1 compute instance based on Xilinx Virtex UltraScale+ FPGAs rolled out earlier this year.) Because it’s already based on Xilinx All Programmable devices, Ethernity’s networking IP runs on the Amazon EC2 F1 instance. “It’s an amazing opportunity for the company [Ethernity],” said Levi. (Try doing that in an ASIC.)
Here’s the Light Reading video interview:
When discussed in Xcell Daily two years ago, Exablaze’s 48-port ExaLINK Fusion Ultra Low Latency Switch and Application Platform with the company’s FastMUX option was performing fast Ethernet port aggregation on as many as 15 Ethernet ports with blazingly fast 100nsec latency. (See “World’s fastest Layer 2 Ethernet switch achieves 110nsec switching using 20nm Xilinx UltraScale FPGAs.”) With its new FastMUX upgrade, also available free to existing customers with a current support contract as a field-installable firmware upgrade, Exablaze has now cut that number in half, to an industry-leading 49nsec (actually, between 48.79nsec and 58.79nsec). The FastMUX option aggregates 15 server connections into a single upstream port. All 48 ExaLINK Fusion ports including the FastMux ports are cross-point enabled so that they can support layer 1 features such as tapping for logging, patching for failover, and packet counters and signal quality statistics for monitoring.
The ExaLINK Fusion platform is based on a Xilinx 20nm UltraScale FPGA, which initially gave Exablaze the ability to initially create the fast switching and fast aggregation hardware and massive 48-port connectivity and then to improve the product’s design by taking advantage of the FPGA’s reprogrammability, which simply requires a firmware upgrade that can be performed in the field.
Perhaps you think DPDK (Data Plane Development Kit) is a high-speed data-movement standard that’s strictly for networking applications. Perhaps you think DPDK is an Intel-specific specification. Perhaps you think DPDK is restricted to the world of host CPUs and ASICs. Perhaps you’ve never heard of DPDK—given its history, that’s certainly possible. If any of those statements is correct, keep reading this post.
Originally, DPDK was a set of data-plane libraries and NIC (network interface controller) drivers developed by Intel for fast packet processing on Intel x86 microprocessors. That is the DPDK origin story. Last April, DPDK became a Linux Foundation Project. It lives at DPDK.org and is now processor agnostic.
DPDK consists of several main libraries that you can use to:
So far, DPDK certainly sounds like a networking-specific development kit but, as Atomic Rules’ CTO Shep Siegel says, “If you can make your data-movement problem look like a packet-movement problem,” then DPDK might be a helpful shortcut in your development process.
Siegel knows more than a bit about DPDK because his company has just released Arkville, a DPDK-aware FPGA/GPP data-mover IP block and DPDK PMD (Poll Mode Driver) that allow Linux DPDK applications to offload server cycles to FPGA gates in tandem with the Linux Foundation’s 17.05 release of the open-source DPDK libraries. Atomic Rules’ Arkville release is compatible with Xilinx Vivado 2017.1 (the latest version of the Vivado Design Suite), which was released in April. Currently, Atomic rules provides two sample designs:
(Atomic Rules’ example designs for Arkville were compiled with Vivado 2017.1 as well.)
These examples are data movers; Arkville is a packet conduit. This conduit presents a DPDK interface on the CPU side and AXI interfaces on the FPGA side. There’s a convenient spot in the Arkville conduit where you can add your own hardware for processing those packets. That’s where the CPU offloading magic happens.
Atomic Rules’ Arkville IP works well with all Xilinx UltraScale devices but it works especially well with Xilinx UltraScale+ All Programmable devices that provide two integrated PCIe Gen3 x16 controllers. (That includes devices in the Kintex UltraScale+ and Virtex UltraScale+ FPGA families and the Zynq UltraScale+ MPSoC device families.)
Because, as BittWare’s VP of Network Products Craig Lund says, “100G Ethernet is hard. It’s not clear that you can use PCIe to get [that bit rate] into a server [using one PCIe Gen3 x16 interface]. From the PCIe specs, it looks like it should be easy, but it isn’t.” If you are handling minimum-size packets, says Lund, there are lots of them—more than 14 million per second. If you’re handling big packets, then you need a lot of bandwidth. Either use case presents a throughput challenge to a single PCIe Root Complex. In practice, you really need two.
BittWare has implemented products using the Atomic Rules Arkville IP, based on its XUPP3R PCIe card, which incorporates a Xilinx Virtex UltraScale+ VU13P FPGA. One of the many unique features of this BittWare board is that it has two PCIe Gen3 x16 ports: one available on an edge connector and the other available on an optional serial expansion port. This second PCIe Gen3 x16 port can be connected to a second PCIe slot for added bandwidth.
However, even that’s not enough says Lund. You don’t just need two PCIe Gen3 x16 slots; you need two PCIe Gen2 Root Complexes and that means you need a 2-socket motherboard with two physical CPUs to handle the traffic. Here’s a simplified block diagram that illustrates Lund’s point:
BittWare’s XUPP3R PCIe Card has two PCIe Gen3 x16 ports: one on an edge connector and the other on an optional serial expansion port for added bandwidth
BittWare has used its XUPP3R PCIe card and the Arkville IP to develop two additional products:
Note: For more information about Atomic Rules’ IP and BittWare’s XUPP3R PCIe card, see “BittWare’s UltraScale+ XUPP3R board and Atomic Rules IP run Intel’s DPDK over PCIe Gen3 x16 @ 150Gbps.”
Arkville is a product offered by Atomic Rules. The XUPP3R PCIe card is a product offered by BittWare. Please contact these vendors directly for more information about these products.
By: Gaurav Singh
CCIX was just announced last year and already things are getting interesting.
The approach of CCIX as an acceleration interconnect is to work within the existing volume server infrastructure while delivering improvements in performance and cost.
We’ve reached a major milestone. CCIX members Xilinx and Amphenol FCI have recently revealed the first public CCIX technology demo and what it means for the future of data center system design is exciting to consider.
In the demo video below, you’ll see the transferring of a data pattern at 25 Gbps between two Xilinx FPGAs, across a channel comprised of an Amphenol/FCI PCI Express CEM connector and a trace card. The two devices contain Xilinx Transceivers electrically compliant with CCIX. By using the PCI Express infrastructure found in every data center today, we can achieve this 25 Gig performance milestone. The total insertion loss in the demo is greater than 35dB, die pad to die pad, which allows flexibility in system design. We’re seeing excellent margin, and a BER of less than 1E-12.
At 25 Gig, this is the fastest data transfer between accelerators over PCI Express connections ever achieved. It’s three times faster than the top transfer speed of PCI Express Gen3 solutions available today. The application benefits of communicating three times faster between accelerators is significant in data centers, and CCIX is designed to excel in multi-accelerator configurations.
CCIX will enable seamless system integration between processors such as X86, POWER and ARM and all accelerator types, including FPGAs, GPUs, network accelerators and storage adaptors. Even custom ASICs can be incorporated into a CCIX topology. And CCIX gives system designers the flexibility to choose the right combination of heterogeneous components from many different vendors to deliver optimized configurations for the data center.
We’re looking forward to the first products with CCIX sampling later this year.
You’ve probably heard that “time equals money.” That’s especially true with high-frequency trading (HFT), which seeks high profits based on super-short portfolio holding periods driven by quant (quantitative) modeling. Microseconds make the difference in the HFT arena. As a result, a lot of high-frequency trading companies use FPGA-based hardware to make decisions and place trades and a lot of those companies use Xilinx FPGAs. No doubt that’s why Aldec is showing its HES-HPC-DSP-KU115 FPGA accelerator board at the Trading Show 2017 being held in Chicago, starting today.
Aldec HES-HPC-DSP-KU115 FPGA accelerator board
This board is based on two Xilinx All Programmable devices: the Kintex UltraScale KU115 FPGA and the Zynq Z-7100 SoC (the largest member of the Zynq SoC family). This board has been optimized for High Performance Computing (HPC) applications and prototyping of DSP algorithms thanks to the Kintex UltraScale KU115 FPGA’s 5520 DSP blocks. This board partners the Kintex UltraScale FPGA with six simultaneously accessible external memories—two DDR4 SODIMMs and four low-latency RLDRAMs—providing immense FPGA-to-memory bandwidth.
The Zynq Z-7100 SoC can operate as an embedded Linux host CPU and it can implement a PCIe host interface and multiple Gigabit Ethenert ports.
In addition, the Aldec HES-HPC-DSP-KU115 FPGA accelerator board has two QSFP+ optical-module sockets for 40Gbps network connections.
Amazon Web Services (AWS) has just posted a 35-minute deep-dive video discussing the Amazon EC2 F1 Instance, a programmable cloud accelerator based on Xilinx Virtex UltraScale+ FPGAs. (See “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available.”) This fresh, new video talks about the development process and the AWS SDK.
Rather than have me filter this interesting video, here it it:
Today, IBM and Xilinx announced PCIe Gen4 16/Gtransfer/sec/lane interoperation between an IBM Power9 processor and a Xilinx UltraScale+ All Programmable device. (FYI: That’s double the performance of a PCIe Gen3 connection.) IBM expects this sort of interface to be particularly important in the data center for high-speed, processor-to-accelerator communications, but the history of PCIe evolution clearly suggests that PCIe Gen4 is destined for wide industry adoption across many markets—just like PCIe generations 1 through 3. The thirst for bit rate exists everywhere, in every high-performance design.
All Xilinx Virtex UltraScale+ FPGAs, many Zynq UltraScale+ MPSoCs, and some Kintex UltraScale+ FPGAs incorporate one or more PCIe Gen3/4 hardened, integrated blocks, which can operate as PCIe Gen4 x8 or Gen3 x16 Endpoints or Roots. In addition, all UltraScale+ MGT transceivers (except the PS-GTR transceivers in Zynq UltraScale+ MPSoCs) support the data rates required for PCIe Gen3 and Gen4 interfaces. (See “DS890: UltraScale Architecture and Product Data Sheet: Overview” and “WP458: Leveraging UltraScale Architecture Transceivers for High-Speed Serial I/O Connectivity” for more information.)
The new PALTEK DS-VU 3 P-PCIE Data Brick places a Xilinx Virtex UltraScale+ VU3P FPGA along with 8Gbytes of DDR4-2400 SDRAM, two VITA57.1 FMC connectors, and four Samtec FireFly Micro Flyover ports on one high-bandwidth, PCIe Gen3 with a x16 host connector. The card aims to provide FPGA-based hardware acceleration for applications including 2K/4K video processing, machine learning, big data analysis, financial analysis, and high-performance computing.
PALTEK Data Brick packs Virtex UltraScale+ VU3P FPGA onto a PCIe card
The Samtec Micro Flyover ports accept both ECUE copper twinax and ECUO optical cables. The ECUE twinax cables are for short-reach applications and have a throughput of 28Gbps per channel. The ECUO optical cables operate at a maximum data rate of 14Gbps per channel and are available with as many as 12 simplex or duplex channels (with 28Gbps optical channels in development at Samtec).
For broadcast video applications, PALTEK also offers companion 12G-SDI Rx and 12G-SDI-Tx cards that can break out eight 12G-SDI video channels from one FireFly connection.
Please contact PALTEK directly for more information about these products.
For more information about the Samtec FireFly system, see:
On May 16, David Pellerin, Business Development Manager at AWS (Amazon Web Services) will be presenting two 1-hour Webinars with a deep dive into Amazon’s EC2 F1 Instance. (The two times are to accommodate different time zones worldwide.) The Amazon EC2 F1 Instance allows you to create custom hardware accelerators for your application using cloud-based server hardware that incorporates multiple Xilinx Virtex UltraScale+ VU9P FPGAs. Each Amazon EC2 F1 Instance can include as many as eight FPGAs, so you can develop extremely large and capable, custom compute engines with this technology. Applications in diverse fields such as genomics research, financial analysis, video processing, security/cryptography, and machine learning are already using the FPGA-accelerated EC2 F1 Instance to improve application performance by as much as 30x over general-purpose CPUs.
Register for Amazon’s Webinar here.
The 1-minute video appearing below shows two 56Gbps, PAM-4 demos from the recent OFC 2017 conference. The first demo shows a CEI-56G-MR (medium-reach, 50cm, chip-to-chip and low-loss backplane) connection between a Xilinx 56Gbps PAM-4 test chip communicating through a QSFP module over a cable to a Credo device. A second PAM-4 demo using CEI-56G-LR (long-reach, 100cm, backplane-style) interconnect shows a Xilinx 56Gbps PAM-4 test chip communicating over a Molex backplane to a Credo device, which is then communicating with a GlobalFoundries device over an FCI backplane, which is then communicating over a TE backplane back to the Xilinx device. This second demo illustrates the growing, multi-company ecosystem supporting PAM-4.
For more information about the Xilinx PAM-4 test chip, see “3 Eyes are Better than One for 56Gbps PAM4 Communications: Xilinx silicon goes 56Gbps for future Ethernet,” and “Got 90 seconds to see a 56Gbps demo with an instant 2x upgrade from 28G to 56G backplane? Good!”
Looking for a relatively painless overview of the current state of the art for high-speed Ethernet used in data centers and for telecom? You should take a look at this just-posted, 30-minute video of a panel discussion at OFC2017 titled “400GE from Hype to Reality.” The panel members included:
Gustlin starts by discussing the history of 400GbE’s development, starting with a study group organized in 2013. Today, the 400GbE spec is at draft 3.1 and the plan is to produce a final standard by December 2017.
Booth answers a very simple question in his talk: “”Yes, we ill” use 400GbE in the data center. He then proceeds to give a fairly detailed description of the data centers and networking used to create Microsoft’s Azure cloud-computing platform.
Ofelt describes the genesis of the 400GbE standard. Prior to 400G, says Ofelt, system vendors worked with end users (primarily telecom companies) to develop faster Ethernet standards. Once a standard appeared, ther would be a deployment ramp. Although 400GbE development started that way, the people building hyperscale data centers sort of took over and they want to deploy 400GbE at scale, ASAP.
Don’t be fooled by the title of this panel. There’s plenty of discussion about 25GbE through 100GbE and 200GbE as well, so if you’re needing a quick update on high-speed Ethernet’s status, this 30-minute video is for you.
As of today, Amazon Web Services (AWS) has made the FPGA-accelerated Amazon EC2 F1 compute instance generally available to all AWS customers. (See the new AWS video below and this Amazon blog post.) The Amazon EC2 F1 compute instance allows you to create custom hardware accelerators for your application using cloud-based server hardware that incorporates multiple Xilinx Virtex UltraScale+ VU9P FPGAs. Each Amazon EC2 F1 compute instance can include as many as eight FPGAs, so you can develop extremely large and capable, custom compute engines with this technology. According to the Amazon video, use of the FPGA-accelerated F1 instance can accelerate applications in diverse fields such as genomics research, financial analysis, video processing (in addition to security/cryptography and machine learning) by as much as 30x over general-purpose CPUs.
Access through Amazon’s FPGA Developer AMI (an Amazon Machine Image within the Amazon Elastic Compute Cloud (EC2)) and the AWS Hardware Developer Kit (HDK) on Github. Once your FPGA-accelerated design is complete, you can register it as an Amazon FPGA Image (AFI), and deploy it to your F1 instance in just a few clicks. You can reuse and deploy your AFIs as many times, and across as many F1 instances as you like and you can list it in the AWS Marketplace.
The Amazon EC2 F1 compute instance reduces the time a cost needed to develop secure, FPGA-accelerated applications in the cloud and has now made access quite easy through general availability.
Here’s the new AWS video with the general-availability announcement:
The Amazon blog post announcing general availability lists several companies already using the Amazon EC2 F1 instance including:
You are never going to get past a certain performance barrier by compiling C for a software-programmable processor. At some point, you need hardware acceleration.
As an analogy: You can soup up a car all you want; it’ll never be an airplane.
Sure, you can bump the processor clock rate. You can add processor cores and distribute the tasks. Both of these approaches increase power consumption, so you’ll need a bigger and more expensive power supply; they increase heat generation, which means you will need better cooling and probably a bigger heat sink or a fan (or another fan); and all of these things increase BOM costs.
Are you sure you want to take that path? Really?
OK, you say. This blog’s from an FPGA company (actually, Xilinx is an “All Programmable” company), so you’ll no doubt counsel me to use an FPGA to accelerate these tasks and I don’t want to code in Verilog or VHDL, thank you very much.
Not a problem. You don’t need to.
You can get the benefit of hardware acceleration while coding in C or C++ using the Xilinx SDSoC development environment. SDSoC produces compiled software automatically coupled to hardware accelerators and all generated directly from your high-level C or C++ code.
That’s the subject of a new Chalk Talk video just posted on the eejournal.com Web site. Here’s one image from the talk:
This image shows three complex embedded tasks and the improvements achieved with hardware acceleration:
A beefier software processor or multiple processor cores will not get you 1000x more performance—or even 30x—no matter how you tweak your HLL code, and software coders will sweat bullets just to get a few percentage points of improvement. For such big performance leaps, you need hardware.
Here’s the 14-minute Chalk Talk video:
Samtec recorded a demo of its FireFly FQSFP twinax cable assembly carrying four 28Gbps lanes from a Xilinx Virtex UltraScale+ VU9P FPGA on a VCU118 eval board to a QSFP optical cage at the recent OFC 2017 conference in Los Angeles. (The Virtex UltraScale+ VU9P FPGA has 120 GTY transceivers capable of 32.75Gbps operation and the VCU118 eval kit includes the Samtec FireFly daughtercard with cable assembly.) Samtec’s FQSFP assembly plugs mid-board into a FireFly connector on the VCU118 board. The 28Gbps signals then “fly over” the board through to the QSFP cage and loop back over the same path, where they are received back into the FPGA. The demonstration shows 28Gbps performance on all four links with zero bit errors.
As explained in the video, the advantage to using the Samtec FireFly flyover system is that it takes the high-speed 28Gbps signals out of the pcb-design equation, making the pcb easier to design and less expensive to manufacture. Significant savings in pcb manufacturing cost can result for large board designs, which no longer need to deal with signal-integrity issues and controlled-impedance traces for such high-speed routes.
Samtec has now posted the 2-minute video from OFC 2017 on YouTube and here it is:
Note: Martin Rowe recently published a related technical article about the Samtec FireFly system titled "High-speed signals jump over PCB traces" on the EDN.com Web site.
Here’s a 90-second video showing a 56Gbps Xilinx test chip with a 56Gbps PAM4 SerDes transceiver operating with plenty of SI margin and better than 10-12 error rate over a backplane originally designed for 28Gbps operation.
Note: This working demo employs a Xilinx test chip. The 56Gbps PAM4 SerDes is not yet incorporated into a product. Not yet.
For more information about this test chip, see “3 Eyes are Better than One for 56Gbps PAM4 Communications: Xilinx silicon goes 56Gbps for future Ethernet.”