Aldec recently published a detailed example design for a high-performance, re-programmable network router/switch based on the company’s TySOM-2A-7Z030 embedded development board and an FMC-NET daughter card. The TySOM-2A-7Z030 incorporates a Xilinx Zynq Z-7030 SoC. In this design, the Zynq SoC’s dual-core Arm Cortex-A9 MPCore processor runs OpenWrt, a Linux distribution for embedded devices that is a favorite among network-switch developers. The design employs the programmable logic (PL) in the Zynq SoC to create four 1G/2.5G Ethernet MACs that connect to the FMC-NET card’s four Ethernet PHYs and a 10G Ethernet subsystem that connects to the FMC-NET card’s QSFP+ card cage.
There is no PCIe Gen5—yet—but there’s a 32Gbps/lane future out there, and TE Connectivity demonstrated that future at this week’s DesignCon 2018. The demo’s real purpose was to show the capabilities of TE Connectivity’s Sliver connector system, which includes card-edge and cabled connectors. In the demo at DesignCon, four channels carry 32Gbps data streams through surface-mount and right-angle connectors to create a mockup of a future removable-storage device. Those 32Gbps data streams are generated, transmitted, and received by bulletproof Xilinx UltraScale+ GTY transceivers operating reliably at PCIe Gen5’s anticipated 32Gbps/lane data rate despite 35dB of loss through the demo system.
In the following short video, Samtec’s Ralph Page describes the demo and points out the clean eyes and clear data levels seen on the Xilinx demo software screen positioned above the demo boards. He also mentions the BER: 5.29×10⁻⁸. That’s the error rate before adding the error-reducing capabilities of FEC, which can cut the error rate by perhaps another ten orders of magnitude or more.
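Those two numbers—line rate and raw BER—make for an easy back-of-the-envelope check. Here is a short sketch, assuming the quoted 32Gbps rate and treating the “ten orders of magnitude” FEC improvement as exact:

```python
# Back-of-the-envelope error arithmetic for a 32Gbps lane with the quoted raw BER.
line_rate_bps = 32e9      # 32Gbps per lane, as in the demo
raw_ber = 5.29e-8         # quoted pre-FEC bit error rate

errors_per_second = line_rate_bps * raw_ber
print(f"raw errors/sec: {errors_per_second:.0f}")   # ~1693

# Assume FEC improves BER by ten orders of magnitude (the article's estimate).
fec_ber = raw_ber * 1e-10
seconds_between_errors = 1 / (line_rate_bps * fec_ber)
print(f"post-FEC seconds between errors: {seconds_between_errors:.2e}")  # ~5.9e6 s, roughly 68 days
```

In other words, a raw link at this BER sees on the order of 1700 bit errors per second, while the FEC-corrected link would run for weeks between errors.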
Samtec’s demo points to a foreseeable future where you will be able to develop large backplanes with screamingly fast performance using PAM4 SerDes transceivers.
The 2-minute video below shows you an operational Xilinx Virtex UltraScale+ XCVU37P FPGA, which is enhanced with co-packaged HBM (high-bandwidth memory) DRAM using Xilinx’s well-proven, 3rd-generation 3D manufacturing process. (Xilinx started shipping 3D FPGAs way back in 2011, starting with the Virtex-7 2000T, and we’ve been shipping these types of devices ever since.)
This video was made on the very first day of silicon bringup for the device and it is already operating at full speed (460Gbytes/sec), error-free, over 32 channels. The Virtex UltraScale+ XCVU37P is one big All Programmable device with:
2852K System Logic Cells
9Mbits of BRAM
270Mbits of UltraRAM
9024 DSP48E2 slices
8Gbytes of integrated HBM DRAM
96 32.75Gbps GTY SerDes transceivers
Whatever your requirements, whatever your application, chances are this extremely powerful FPGA will handle all of the heavy lifting (processing, memory, and I/O) that you need.
Although Deutsche Börse Group, one of the world’s largest stock and securities exchanges, already had a packet-capture and time-stamping solution in place, a major upgrade and redesign of its co-location network in 2017 added 60 Metamako K-Series networking devices—including the company’s MetaApp 32 Network Application Platform—to its data center in Frankfurt, Germany. This upgrade was a response to increasing customer demand for market fairness and precision in network-based trading. The upgrade significantly enhances and strengthens network-monitoring capabilities and gives full visibility into network transactions by capturing every packet entering and exiting Deutsche Börse Group’s network. (Metamako has published a case study of this application. Click here for more information.)
Metamako’s MetaApp 32 is an adaptable network application platform that brings intelligence to the network edge for some of the most demanding, latency-critical networks including high-frequency trading and analytics. It is based on Xilinx Virtex-7 FPGAs, which means that the company’s Network Application Platform can run multiple networking applications in parallel—very quickly.
A quick look at the latest product table for the Xilinx Zynq UltraScale+ RFSoC will tell you that the sample rate for the devices’ RF-class, 14-bit DACs has jumped to 6.554Gsamples/sec, up from 6.4Gsamples/sec. I asked Senior Product Line Manager Wouter Suverkropp about the change and he told me that the increase supports “…an extra level of oversampling for DOCSIS3.1 [designers]. The extra oversampling gives them 3dB processing gain and therefore simplifies the external circuits even further.”
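The 3dB figure follows from the standard oversampling processing-gain relationship: each doubling of the oversampling ratio buys 10·log₁₀(2) ≈ 3dB. A quick sanity check—note that reading 6.554Gsamples/sec as 32 × 204.8MHz is my assumption, not something stated in the article:

```python
import math

# Each doubling of the oversampling ratio spreads quantization noise over
# twice the bandwidth, dropping the in-band noise floor by 10*log10(2) dB.
gain_per_doubling_db = 10 * math.log10(2)
print(f"{gain_per_doubling_db:.2f} dB")   # 3.01 dB

# Assumption: 6.554Gsamples/sec is really 6.5536Gsamples/sec, i.e. an exact
# power-of-two multiple of the 204.8MHz DOCSIS base sample rate—something
# the previous 6.4Gsamples/sec rate is not.
print(6.5536e9 / 204.8e6)   # 32.0
```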
Zynq UltraScale+ RFSoC Conceptual Diagram
For more information about the Zynq UltraScale+ RFSoC, see:
Closeup view of the QSFP28 ports on Accolade’s ANIC-200Kq Flow Classification and Filtering Adapter
The new ANIC-200Kq adapter differs from the older ANIC-200Ku adapter in its optical I/O ports: the ANIC-200Kq incorporates two QSFP28 optical cages while the ANIC-200Ku incorporates two CFP2 cages. Both the QSFP28 and CFP2 interfaces accept SR4 and LR4 modules. The QSFP28 optical cages put Accolade’s ANIC-200Kq adapter squarely in the 25, 40, and 100GbE arenas, providing data-center architects with additional architectural flexibility when designing their optical networks; QSFP28 is fast becoming the universal form factor for new data-center installations.
For more information in Xcell Daily about Accolade’s fast Flow Classification and Filtering Adapters, see:
Netcope Technologies’ NFB-200G2QL 200G Ethernet Smart NIC based on a Virtex UltraScale+ FPGA
One trick to doing this: using two PCIe Gen3 x16 slots to get packets to/from the server CPU(s). Why two slots? Because Netcope found that its 200G Smart NIC PCIe card could transfer about 110Gbps worth of packets over one PCIe Gen3 x16 slot, against a theoretical maximum of 128Gbps per slot. That means 200Gbps will not pass through the eye of this one-slot needle—hence two PCIe slots, which carry the 200Gbps worth of packets with comfortable margin. Where does the second PCIe Gen3 interface come from? Over a cable attached to the Smart NIC board, implemented in the board’s very same Xilinx Virtex UltraScale+ VU7P FPGA, of course. The company has written a white paper describing this technique, titled “Overcoming the Bandwidth Limitations of PCI Express.”
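The bandwidth arithmetic behind the two-slot decision is worth spelling out. (The 128b/130b encoding step below is my addition; the article quotes only the 128Gbps raw figure and the ~110Gbps measured throughput.)

```python
# PCIe Gen3 x16 capacity vs. a 200Gbps packet stream.
transfer_rate = 8e9                  # PCIe Gen3 signals at 8GT/s per lane
lanes = 16
raw_bps = transfer_rate * lanes      # 128Gbps raw, as quoted
payload_bps = raw_bps * 128 / 130    # after 128b/130b line encoding, before protocol overhead

print(raw_bps / 1e9)                 # 128.0
print(round(payload_bps / 1e9, 1))   # 126.0

# Netcope measured ~110Gbps of usable packet throughput per slot,
# so a 200Gbps NIC clearly needs two x16 slots:
print(200 <= 2 * 110)                # True
```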
And yes, there’s a short video showing this Netcope sorcery as well:
Last week at SC17 in Denver, BittWare announced its TeraBox 1432D 1U FPGA server box, a modified Dell PowerEdge C4130 with a new front panel that exposes 32 100GbE QSFP ports from as many as four of the company’s FPGA accelerator cards. (That’s a total front-panel I/O bandwidth of 3.2Tbps!) The new 1U box doubles the I/O rack density with respect to the company’s previous 4U offering.
BittWare’s TeraBox 1432D 1U FPGA Server Box exposes 32 100GbE QSFP ports on its front panel
The TeraBox 1432D server box can be outfitted with four of the company’s XUPP3R boards, which are based on Xilinx Virtex UltraScale+ FPGAs (VU7P, VU9P, or VU11P) and can each be fitted with eight QSFPs: four QSFP cages on the board and four more QSFPs on a daughter card connected to the XUPP3R board via a cable to an FMC connector. This configuration underscores the extreme I/O density and capability of Virtex UltraScale+ FPGAs.
BittWare TeraBox 1432D interior detail
The new BittWare TeraBox 1432D will be available in Q1 2018 with the XUPP3R FPGA accelerator board. According to the announcement, BittWare will also release the Xilinx Virtex UltraScale+ VU13P-based XUPVV4 in 2018. This new board will also fit in the TeraBox 1432D.
Here’s a 3-minute video from SC17 with a walkthrough of the TeraBox 1432D 1U FPGA server box by BittWare's GM and VP of Network Products Craig Lund:
One of the several demos in the Xilinx booth during this week’s SC17 conference in Denver was a working demo of the CCIX (Cache Coherent Interconnection for Accelerators) protocol, which simplifies the design of offload accelerators for hyperscale data centers by providing low-latency, high-bandwidth, fully coherent access to server memory. The demo shows L2 switching acceleration using an FPGA to offload a host processor. The CCIX protocol manages a hardware cache in the FPGA, which is coherently linked to the host processor’s memory. Cache updates take place in the background without software intervention through the CCIX protocol. If cache entries are invalidated in the host memory, the CCIX protocol automatically invalidates the corresponding cache entries in the FPGA’s memory.
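The hardware-managed invalidation described above can be mimicked with a toy software model. (Class and field names here are mine and purely illustrative—real CCIX performs this bookkeeping in hardware with no software in the loop.)

```python
# Toy model of write-invalidate coherence between host memory and an
# FPGA-side cache, mimicking the behavior CCIX automates in hardware.
class CoherentSystem:
    def __init__(self):
        self.host_mem = {}     # address -> value
        self.fpga_cache = {}   # address -> cached copy

    def fpga_read(self, addr):
        # A miss fetches from host memory and caches the line.
        if addr not in self.fpga_cache:
            self.fpga_cache[addr] = self.host_mem[addr]
        return self.fpga_cache[addr]

    def host_write(self, addr, value):
        self.host_mem[addr] = value
        # The coherence protocol invalidates any stale FPGA-side copy.
        self.fpga_cache.pop(addr, None)

sys_model = CoherentSystem()
sys_model.host_mem[0x1000] = "flow-table-v1"
print(sys_model.fpga_read(0x1000))    # cached: flow-table-v1
sys_model.host_write(0x1000, "flow-table-v2")
print(sys_model.fpga_read(0x1000))    # stale entry invalidated, re-fetched: flow-table-v2
```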
Senior Staff Design Engineer Sunita Jain gave Xcell Daily a 3-minute explanation of the demo, which shows a 4.5x improvement in packet transfers using CCIX versus software-controlled transfers:
There’s one thing to note about this demo: although the CCIX standard calls for using the PCIe protocol as a transport layer at 25Gbps/lane—faster than PCIe Gen4—this demo exercises only the CCIX protocol itself and uses the significantly slower PCIe Gen1 as the transport layer.
For more information about the CCIX protocol as discussed in Xcell Daily, see:
The target application of interest at SC17 was high-frequency trading, where every microsecond you can shave off system response times directly adds dollars to the bottom line, so the ROI on a product like the nvNITRO NVMe Accelerator Card, which cuts transaction times, is easy to calculate.
Everspin MRAM-based nvNITRO NVMe Accelerator Card
It turns out that a common thread—and one of the bottlenecks—in high-frequency trading applications is the use of the Apache Log4j event-logging utility. Incoming packets arrive at a variable rate—the traffic is bursty—and the Log4j logging utility needs to keep up with the highest possible burst rates to ensure that every event is logged. Piping these events directly into SSD storage places a low ceiling on the burst rate a system can handle. Inserting an nvNITRO NVMe Accelerator Card as a circular buffer in series with the incoming event stream, as shown below, boosts Log4j performance by 9x.
Proof of efficacy appears in the chart below, which shows the much lower latency and much better determinism provided by the nvNITRO card:
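The circular-buffer idea itself is simple to sketch in software—a minimal analogue of what the nvNITRO card does at hardware speed in persistent MRAM; the names and sizes here are illustrative:

```python
from collections import deque

# A bounded circular buffer absorbing bursty log events so a slower
# storage writer only has to keep up with the *average* event rate.
class EventRing:
    def __init__(self, capacity):
        self.ring = deque(maxlen=capacity)
        self.dropped = 0

    def push(self, event):
        if len(self.ring) == self.ring.maxlen:
            self.dropped += 1          # buffer was sized too small for the burst
        self.ring.append(event)        # deque evicts the oldest entry when full

    def drain(self, batch=128):
        # The consumer pulls batches at its own pace.
        out = []
        while self.ring and len(out) < batch:
            out.append(self.ring.popleft())
        return out

ring = EventRing(capacity=4096)
for i in range(1000):                  # a burst of 1000 events arrives at once
    ring.push(f"event-{i}")
drained = ring.drain(batch=256)
print(len(drained))                    # 256 events drained in one batch
print(ring.dropped)                    # 0 -- the burst fit in the buffer
```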
The new Mellanox Innova-2 Adapter Card teams the company’s ConnectX-5 Ethernet controller with a Xilinx Kintex UltraScale+ KU15P FPGA to accelerate computing, storage, and networking in data centers. According to the announcement, “Innova-2 is based on an efficient combination of the state-of-the-art ConnectX-5 25/40/50/100Gb/s Ethernet and InfiniBand network adapter with Xilinx UltraScale FPGA accelerator.” The adapter card has a PCIe Gen4 host interface.
Mellanox’s Innova-2 PCIe Adapter Card
Key features of the card include:
Dual-port 25Gbps Ethernet via SFP cages
TLS/SSL, IPsec crypto offloads
Mellanox ConnectX-5 Ethernet controller and Xilinx Kintex UltraScale+ FPGA for either “bump-on-the-wire” or “look-aside” acceleration
Low-latency RDMA and RDMA over Converged Ethernet (RoCE)
OVS and Erasure Coding offloads
Mellanox PeerDirect communication acceleration
End-to-end QoS and congestion control
Hardware-based I/O virtualization
Innova-2 is available in multiple pre-programmed configurations for security applications with encryption acceleration such as IPsec or TLS/SSL. Innova-2 boosts performance by 6x for security applications while reducing total cost of ownership by 10x compared to alternatives.
Innova-2 enables SDN and virtualized acceleration and offloads for Cloud infrastructure. The on-board programmable resources allow deep-learning training and inferencing applications to achieve faster performance and better system utilization by offloading algorithms into the card’s Kintex UltraScale+ FPGA and the ConnectX acceleration engines.
The adapter card is also available as an unprogrammed card, open for customers’ specific applications. Mellanox provides configuration and management tools to support the Innova-2 Adapter Card across Windows, Linux, and VMware distributions.
Please contact Mellanox directly for more information about the Innova-2 Adapter Card.
Accolade’s new Flow-Shunting feature for its FPGA-based ANIC network adapters lets you more efficiently drive packet traffic through existing 10/40/100GbE data center networks by offloading host servers. It does this by eliminating the processing and/or storage of unwanted traffic flows, as identified by the properly configured Xilinx UltraScale FPGA on the ANIC adapter. By offloading servers and reducing storage requirements, flow shunting can deliver operational cost savings throughout the data center.
The new Flow Shunting feature is a subset of the existing Flow Classification capabilities built into the FPGA-based Advanced Packet Processor in the company’s ANIC network adapters. (The company has written a technology brief explaining the capability.) Here’s a simplified diagram of what’s happening inside of the ANIC adapter:
The Advanced Packet Processor in each ANIC adapter performs a series of packet-processing functions including flow classification (outlined in red). The flow classifier inspects each packet, determines whether the packet belongs to a new flow or an existing flow, and then updates the associated lookup table (LUT)—which resides in a DRAM bank—with the flow classification. The LUT has room to store as many as 32 million unique IP flow entries. Each flow entry includes standard packet-header information (source/destination IP, protocol, etc.) along with flow metadata including total packet count, byte count, and the time a packet was last seen. The same flow entry tracks information about both flow directions to maintain a bi-directional context. With this information, the ANIC adapter can take specific actions on an individual flow, such as forwarding, dropping, or re-directing the packets in that flow.
These operations form the basis for flow shunting, which permits each application to decide from which flow(s) it does and does not want to receive data traffic. Intelligent, classification-based flow shunting allows an application to greatly reduce the amount of data it must analyze or handle, which frees up server CPU resources for more pressing tasks.
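A software sketch may make the flow-classification bookkeeping concrete. (Field names and the key-normalization scheme below are illustrative; the real LUT lives in DRAM attached to the FPGA.)

```python
import time

# Minimal flow classifier: normalize the 5-tuple so both directions of a
# conversation map to the same flow entry, then track per-flow metadata.
def flow_key(src_ip, dst_ip, src_port, dst_port, proto):
    # Sort the two endpoints so A->B and B->A share one key (bi-directional context).
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (a, b, proto)

flows = {}   # software stand-in for the 32M-entry DRAM lookup table

def classify(src_ip, dst_ip, src_port, dst_port, proto, length):
    key = flow_key(src_ip, dst_ip, src_port, dst_port, proto)
    entry = flows.setdefault(key, {"packets": 0, "bytes": 0, "action": "forward"})
    entry["packets"] += 1
    entry["bytes"] += length
    entry["last_seen"] = time.time()
    return entry["action"]   # forward / drop / redirect, as set by the application

classify("10.0.0.1", "10.0.0.2", 49152, 443, "tcp", 1500)
classify("10.0.0.2", "10.0.0.1", 443, 49152, "tcp", 64)   # reverse direction
key = flow_key("10.0.0.1", "10.0.0.2", 49152, 443, "tcp")
print(flows[key]["packets"])   # 2 -- both directions hit the same entry
```

Flow shunting then amounts to setting a flow entry’s action to “drop” so the adapter discards that flow’s packets before they ever reach the host.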
LightReading has just posted a 5-minute video interview with Kirk Saban (Xilinx’s Senior Director for FPGA and SoC Product Management and Marketing) discussing some of the aspects of the newly announced Zynq UltraScale+ RFSoCs with on-chip RF ADCs and DACs. These devices are going to revolutionize the design of all sorts of high-end equipment that must deal with high-speed analog signals in markets as diverse as 5G communications, broadband cable, test and measurement, and aerospace/defense.
Today, Xilinx announced five new Zynq UltraScale+ RFSoC devices with all of the things you expect in a Xilinx Zynq UltraScale+ SoC—a 4-core APU with 64-bit Arm Cortex-A53 processor cores, a 2-core RPU with two 32-bit Arm Cortex-R5 processors, and ultra-fast UltraScale+ programmable logic—plus revolutionary new additions: 12-bit, RF-class ADCs; 14-bit, RF-class DACs; and integrated SD-FEC (soft-decision forward error correction) cores.
Just in case you missed the plurals in that last sentence, it’s not one but multiple RF ADCs and DACs.
That means you can bring RF analog signals directly into these chips, process those signals using high-speed programmable logic along with thousands of DSP48E2 slices, and then output processed RF analog signals—using the same device to do everything. In addition, if you’re decoding and/or encoding data, two of the announced Zynq UltraScale+ RFSoC family members incorporate SD-FEC IP cores that support LDPC coding/decoding and Turbo decoding for applications including 5G wireless communications, backhaul, DOCSIS, and LTE. If you’re dealing with RF signals and high-speed communications, you know just how revolutionary these parts are.
For everyone else not accustomed to handling RF analog signals… well you’ll just have to take my word for it. These devices are revolutionary.
Here’s a conceptual block diagram of a Zynq UltraScale+ RFSoC:
Here’s a table that shows you some of the resources available on the new Zynq UltraScale+ RFSoC devices:
As you can see from the table, you can get as many as eight 12-bit, 4Gsamples/sec ADCs or sixteen 12-bit, 2Gsamples/sec ADCs on one device. You can also get eight or sixteen 14-bit, 6.4Gsamples/sec DACs on the same device. Two of the Zynq UltraScale+ RFSoCs also incorporate eight SD-FECs. In addition, there are plenty of logic cells, DSP slices, and RAM on these devices to build just about anything you can imagine. (With my instrumentation background, I can imagine new classes of DSOs and VSTs (Vector Signal Transceivers), for example.)
You get numerous benefits by basing your design on a Zynq UltraScale+ RFSoC device. The first and most obvious is real estate: putting the ADCs, DACs, processors, programmable logic, DSPs, memory, and programmable I/O on one device saves a tremendous amount of board space and means you won’t be running high-speed traces across the PCB to hook all of these blocks together.
Next, you save the complexity of dealing with high-speed converter interfaces like JESD204B/C. The analog converters are already interfaced to the processors and logic inside of the device. Done. Debugged. Finished.
You also save the power associated with those high-speed interfaces. That alone can amount to several watts of power savings. These benefits are reviewed in a new White Paper titled “All Programmable RF-Sampling Solutions.”
Need to get an FPGA-based, high-performance network server or appliance designed and fielded quickly? If so, take a serious look at the heavyweight combination of a BittWare XUPP3R PCIe board based on any one of three Xilinx Virtex UltraScale+ FPGAs (VU7P, VU9P, or VU11P) and LDA Technologies’ slick 1U e4 FPGA chassis, which is designed to bring all of the BittWare XUPP3R’s I/O ports on its QSFP, PCIe, and high-speed serial expansion ports to the 48-port front panel, like so:
LDA Technologies’ 1U e4 FPGA chassis
BittWare’s XUPP3R PCIe card can support this many high-speed GbE ports because Virtex UltraScale+ FPGAs have that many bulletproof, high-speed SerDes transceivers.
BittWare XUPP3R PCIe board based on Virtex UltraScale+ FPGAs
You can use the combination of the BittWare XUPP3R PCIe card and LDA Technologies’ e4 FPGA chassis to develop a variety of network equipment including:
LDA Technologies’ e4 FPGA chassis is specifically designed to accept a PCIe FPGA card like BittWare’s XUPP3R and it has several features designed to specifically support the needs of such a card including:
Dual-redundant power supply with load balancing
On-board Ethernet switch permits simple port-to-port switching without FPGA intervention
On-board COM Express socket that accepts Type 6 and Type 10 processor cards
Locally stored FPGA configuration with 1-second boot time
Precision 100MHz OCXO (oven-controlled crystal oscillator) for applications that require precise timing
The e4 FPGA chassis fits in a 1U rack space and it’s only 12 inches deep, so you can mount two back-to-back in a 1U rack slot with the front I/O ports pointing forward and the back I/O ports pointing out the rear of the rack.
For a 10-minute BittWare video that gives you even more information about this card/chassis combo, click here.
This week, EXFO announced and demonstrated its FTBx-88400NGE Power Blazer 400G Ethernet Tester at the ECOC 2017 optical communications conference in Gothenburg, Sweden, using a Xilinx VCU140 FPGA design platform as an interoperability target. The VCU140 development platform is based on a Xilinx Virtex UltraScale+ VU9P FPGA. EXFO’s FTBx-88400NGE Power Blazer offers advanced testing for the full suite of new 400G technologies, including support for FlexE (Flex Ethernet), 400G Ethernet, and high-speed transceiver validation. The FlexE function supports one or more bonded 100GBASE-R PHYs carrying multiple Ethernet MACs operating at rates of 10, 40, or n×25Gbps. Flex Ethernet is a key data-center technology that helps data centers deliver faster links ahead of emerging 400G solutions.
Here’s a photo of the ECOC 2017 demo:
This demonstration is yet one more proof point for the 400GbE standard, which will be used in a variety of high-speed communications applications including data-center interconnect, next-generation switch and router line cards, and high-end OTN transponders.
Every device family in the Xilinx UltraScale+ portfolio (Virtex UltraScale+ FPGAs, Kintex UltraScale+ FPGAs, and Zynq UltraScale+ MPSoCs) has members with 28Gbps-capable GTY transceivers. That’s likely to be important to you as the number and variety of small, 28Gbps interconnects grow. You have many choices in such interconnect these days, including:
The following 5.5-minute video demonstrates all of these interfaces operating with 25.78Gbps lanes on Xilinx VCU118 and KCU116 Eval Kits, as concisely explained (as usual) by Xilinx’s “Transceiver Marketing Guy” Martin Gilpatric. Martin also discusses some of the design challenges associated with these high-speed interfaces.
But first, as a teaser, I could not resist showing you the wide-open IBERT eye on the 25.78Gbps Samtec FireFly AOC:
Netcope Technologies’ NFB-100G2Q NIC broke industry records for 100GbE performance earlier this year by achieving 148.8M packets/sec throughput on DPDK (the Data Plane Development Kit) for 64-byte packets—which is 100GbE’s theoretical maximum. That’s good news if you’re deploying NFV applications. Going faster is the name of the game, after all. That performance—tracking the theoretical maximum as defined by line rate and packet size—continues as the frame size gets larger. You can see that from this performance graph from the Netcope Web site:
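That 148.8M packets/sec figure is exactly the 100GbE line rate for minimum-size frames, because every 64-byte frame also occupies 20 bytes of preamble and inter-frame gap on the wire:

```python
# Theoretical maximum packet rate for 64-byte frames on 100GbE.
frame_bytes = 64
overhead_bytes = 8 + 12       # preamble/SFD (8B) + minimum inter-frame gap (12B)
bits_on_wire = (frame_bytes + overhead_bytes) * 8   # 672 bits per frame slot

max_pps = 100e9 / bits_on_wire
print(f"{max_pps / 1e6:.1f} Mpps")   # 148.8 Mpps
```

The same formula, applied to larger frame sizes, traces out the theoretical-maximum curve that the Netcope graph tracks.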
Earlier this year, the University of New Hampshire’s InterOperability Laboratory (UNH-IOL) hosted a 25G and 50G Plugfest, and everybody came to the party to test the compatibility of their implementations with one another. The long list of partiers included:
“The 25 Gigabit Ethernet Consortium is an open organization to all third parties who wish to participate as members to enable the transmission of Ethernet frames at 25 or 50 Gigabit per second (Gbps) and to promote the standardization and improvement of the interfaces for applicable products.”
From the Consortium’s press release about the plugfest:
“The testing demonstrated a high degree of multi-vendor interoperability and specification conformance.”
To paraphrase Douglas Adams’ The Hitchhiker’s Guide to the Galaxy: “400GE is fast. Really fast. You just won’t believe how vastly, hugely, mind-bogglingly fast it is.”
Xilinx, Microsoft/Azure Networking, and Juniper held a 400GE panel at OFC 2017 that explored the realities of the 400GE ecosystem, deployment models, and why the time for 400GE has arrived. The half-hour video below is from OFC 2017. Xilinx’s Mark Gustlin discusses the history of Ethernet from 10Mbps in the 1980s to today’s 400GE, including an explanation of lower-speed variants and why they exist. He also provides technical explanations of why the 400GE IEEE specs look the way they do and what 400GE optical modules will look like as they evolve. Microsoft/Azure Networking’s Brad Booth describes what he expects Azure’s multi-campus, data-center networking architecture to look like in 2019 and how he expects 400GE to fit into that architecture. Finally, Juniper’s David Ofelt discusses how the 400GE development model has flipped—the hyperscale developers and system vendors now drive the evolution and the carriers follow their lead—and touches on the technical issues that have held up 400GE development and what happens when we max out on optical-module density (we’re almost there).
For more information about 400GE in Xcell Daily, see:
Xilinx announced the addition of the P4₁₆ network programming language for SDN applications to its SDNet Development Environment for high-speed (1Gbps to 100Gbps) packet processing back in May. (See “The P4 has landed: SDNet 2017.1 gets P4-to-FPGA compilation capability for 100Gbps data-plane packet processing.”) An OFC 2017 panel session in March—presented by Xilinx, Barefoot Networks, Netcope Technologies, and MoSys—discussed the adoption of P4, the emergent high-level language for packet processing, and early implementations of P4 for FPGA and ASIC targets. Here’s a half-hour video of that panel discussion.
Metamako decided that it needed more than one Xilinx UltraScale FPGA to deliver the low latency and high performance it wanted from its newest networking platform. The resulting design is a 1RU or 2RU box that houses one, two, or three Kintex UltraScale or Virtex UltraScale+ FPGAs, connected by “near-zero” latency links. The small armada of FPGAs means that the platform can run multiple networking applications in parallel—very quickly. This new networking platform allows Metamako to expand far beyond its traditional market—financial transaction networking—into other realms such as medical imaging, SDR (software-defined radio), industrial control, and telecom. The FPGAs are certainly capable of implementing tasks in all of these applications with extremely high performance.
Metamako’s Triple-FPGA Networking Platform
The Metamako platform offers an extensive range of standard networking features including data fan-out, scalable broadcast, connection monitoring, patching, tapping, time-stamping, and a deterministic port-to-FPGA latency of just 3nsec. Metamako also provides a developer’s kit with the platform with features that include:
A Simplified Stack - One device houses the FPGA(s), x86 server, and Layer 1 switch, ensuring that all hardware components work in sync.
Integration with existing FPGA Tools – Platform-specific adapters for programming the FPGA(s) are embedded in the Metamako device, allowing for quick and easy (remote) access to the device by the FPGA development tools.
Layer 1+ Crosspoint Functionality – Includes all core Metamako Layer 1 functions such as market-scalable broadcast, connection monitoring, remote patching, tapping, and timestamping.
SFP Agnostic – Metamako’s Layer 1+ switch is SFP agnostic, which saves FPGA developers time and effort in having to interface with lots of different types of SFPs.
Feature Rich – Standard enterprise features include access control, syslog, SNMP, packet stats, tcpdump, JSON RPC API, time series data, and streaming telemetry.
Easy Application Deployment - Metamako's platform includes a built-in application infrastructure that allows developers to wrap applications into simple packages for deployment.
This latest networking platform from Metamako demonstrates a key attribute of Xilinx All Programmable technology: the ability to fully differentiate a product by exploiting the any-to-any connectivity and high-speed processing capabilities of Xilinx silicon using Xilinx’s development tools. No other chip technology could provide Metamako with a comparable mix of extreme connectivity, speed, and design flexibility.
When someone asks where Xilinx All Programmable devices are used, I find it a hard question to answer because there’s such a very wide range of applications—as demonstrated by the thousands of Xcell Daily blog posts I’ve written over the past several years.
Now, there’s a 5-minute “Powered by Xilinx” video with clips from several companies using Xilinx devices for applications including:
Machine learning for manufacturing
Autonomous cars, drones, and robots
Real-time 4K, UHD, and 8K video and image processing
VR and AR
High-speed networking by RF, LED-based free-air optics, and fiber
Light Reading’s International Group Editor Ray Le Maistre recently interviewed David Levi, CEO of Ethernity Networks, who discusses the company’s FPGA-based All Programmable ACE-NIC, a Network Interface Controller with 40Gbps throughput. The carrier-grade ACE-NIC accelerates vEPC (virtual Evolved Packet Core, a framework for virtualizing the functions required to converge voice and data on 4G LTE networks) and vCPE (virtual Customer Premise Equipment, a way to deliver routing, firewall security and virtual private network connectivity services using software rather than dedicated hardware) applications by 50x, dramatically reducing end-to-end latency associated with NFV platforms. Ethernity’s ACE-NIC is based on a Xilinx Kintex-7 FPGA.
“The world is crazy about our solution—it’s amazing,” says Levi in the Light Reading video interview.
Ethernity Networks All Programmable ACE-NIC
Because Ethernity implements its NIC IP in a Kintex-7 FPGA, it was natural for Le Maistre to ask Levi when his company would migrate to an ASIC. Levi’s answer surprised him:
“We offer a game changer... We invested in technology—which is covered by patents—that consumes 80% less logic than competitors. So essentially, a solution that you may want to deliver without our patents will cost five times more on FPGA… With this kind of solution, we succeed over the years in competing with off-the-shelf components… with the all-programmable NIC, operators enjoy the full programmability and flexibility at an affordable price, which is comparable to a rigid, non-programmable ASIC solution.”
In other words, Ethernity plans to stay with All Programmable devices for its products. In fact, Ethernity Networks announced last year that it had successfully synthesized its carrier-grade switch/router IP for the Xilinx Zynq UltraScale+ MPSoC and that throughput performance increases to 60Gbps per IP core with the 16nm device—and to 120Gbps with two instances of that core. “We are going to use this solution for novel SDN/NFV market products, including embedded SR-IOV (single-root input/output virtualization), and for high-density port solutions,” said Levi.
Towards the end of the video interview, Levi looks even further into the future when he discusses Amazon Web Services’ (AWS’) recent support of FPGA acceleration. (That’s the Amazon EC2 F1 compute instance based on Xilinx Virtex UltraScale+ FPGAs rolled out earlier this year.) Because it’s already based on Xilinx All Programmable devices, Ethernity’s networking IP runs on the Amazon EC2 F1 instance. “It’s an amazing opportunity for the company [Ethernity],” said Levi. (Try doing that in an ASIC.)
When discussed in Xcell Daily two years ago, Exablaze’s 48-port ExaLINK Fusion Ultra Low Latency Switch and Application Platform with the company’s FastMUX option was performing fast Ethernet port aggregation on as many as 15 Ethernet ports with a blazingly fast 100nsec latency. (See “World’s fastest Layer 2 Ethernet switch achieves 110nsec switching using 20nm Xilinx UltraScale FPGAs.”) With its new FastMUX upgrade—available free to existing customers with a current support contract as a field-installable firmware upgrade—Exablaze has cut that latency by more than half, to an industry-leading 49nsec (actually, between 48.79nsec and 58.79nsec). The FastMUX option aggregates 15 server connections into a single upstream port. All 48 ExaLINK Fusion ports, including the FastMux ports, are cross-point enabled, so they support Layer 1 features such as tapping for logging, patching for failover, and packet counters and signal-quality statistics for monitoring.
The ExaLINK Fusion platform is based on a Xilinx 20nm UltraScale FPGA, which gave Exablaze the ability to initially create the fast switching and aggregation hardware and massive 48-port connectivity, and then to improve the product’s design by taking advantage of the FPGA’s reprogrammability—a firmware upgrade that can be performed in the field.
Perhaps you think DPDK (Data Plane Development Kit) is a high-speed data-movement standard that’s strictly for networking applications. Perhaps you think DPDK is an Intel-specific specification. Perhaps you think DPDK is restricted to the world of host CPUs and ASICs. Perhaps you’ve never heard of DPDK—given its history, that’s certainly possible. If any of those statements is correct, keep reading this post.
Originally, DPDK was a set of data-plane libraries and NIC (network interface controller) drivers developed by Intel for fast packet processing on Intel x86 microprocessors. That is the DPDK origin story. Last April, DPDK became a Linux Foundation Project. It lives at DPDK.org and is now processor agnostic.
DPDK consists of several main libraries that you can use to:
Send and receive packets while minimizing the number of CPU cycles needed (usually less than 80)
Develop fast packet-capture algorithms
Run 3rd-party fast-path stacks
So far, DPDK certainly sounds like a networking-specific development kit but, as Atomic Rules’ CTO Shep Siegel says, “If you can make your data-movement problem look like a packet-movement problem,” then DPDK might be a helpful shortcut in your development process.
Siegel knows more than a bit about DPDK because his company has just released Arkville, a DPDK-aware FPGA/GPP data-mover IP block and DPDK PMD (Poll Mode Driver) that allow Linux DPDK applications to offload server cycles to FPGA gates, in tandem with the Linux Foundation’s 17.05 release of the open-source DPDK libraries. Atomic Rules’ Arkville release is compatible with Xilinx Vivado 2017.1 (the latest version of the Vivado Design Suite), which was released in April. Currently, Atomic Rules provides two sample designs:
Four-Port, Four-Queue 10 GbE example (Arkville + 4×10 GbE MAC)
Single-Port, Single-Queue 100 GbE example (Arkville + 1×100 GbE MAC)
(Atomic Rules’ example designs for Arkville were compiled with Vivado 2017.1 as well.)
These examples are data movers; Arkville is a packet conduit. This conduit presents a DPDK interface on the CPU side and AXI interfaces on the FPGA side. There’s a convenient spot in the Arkville conduit where you can add your own hardware for processing those packets. That’s where the CPU offloading magic happens.
Atomic Rules’ Arkville IP works well with all Xilinx UltraScale devices but it works especially well with Xilinx UltraScale+ All Programmable devices that provide two integrated PCIe Gen3 x16 controllers. (That includes devices in the Kintex UltraScale+ and Virtex UltraScale+ FPGA families and the Zynq UltraScale+ MPSoC device families.)
Because, as BittWare’s VP of Network Products Craig Lund says, “100G Ethernet is hard. It’s not clear that you can use PCIe to get [that bit rate] into a server [using one PCIe Gen3 x16 interface]. From the PCIe specs, it looks like it should be easy, but it isn’t.” If you are handling minimum-size packets, says Lund, there are lots of them: more than 148 million per second. If you’re handling big packets, then you need a lot of bandwidth. Either use case presents a throughput challenge to a single PCIe Root Complex. In practice, you really need two.
BittWare has implemented products using the Atomic Rules Arkville IP, based on its XUPP3R PCIe card, which incorporates a Xilinx Virtex UltraScale+ VU13P FPGA. One distinctive feature of this BittWare board is its two PCIe Gen3 x16 ports: one available on an edge connector and the other on an optional serial expansion port. The second port can be connected to a second PCIe slot for added bandwidth.
However, even that’s not enough, says Lund. You don’t just need two PCIe Gen3 x16 slots; you need two PCIe Gen3 Root Complexes, and that means you need a 2-socket motherboard with two physical CPUs to handle the traffic. Here’s a simplified block diagram that illustrates Lund’s point:
BittWare’s XUPP3R PCIe Card has two PCIe Gen3 x16 ports: one on an edge connector and the other on an optional serial expansion port for added bandwidth
BittWare has used its XUPP3R PCIe card and the Arkville IP to develop two additional products:
DFC Design’s Xenie FPGA module product family pairs a Xilinx Kintex-7 FPGA (a 70T or a 160T) with a Marvell Alaska X 88X3310P 10GBASE-T PHY on a small board. The module breaks out six of the Kintex-7 FPGA’s 12.5Gbps GTX transceivers and three full FPGA I/O banks (for a total of 150 single-ended I/O or up to 72 differential pairs) with configurable I/O voltage to two high-speed, high-pin-count, board-to-board connectors. A companion Xenie BB Carrier board accepts the Xenie FPGA board and breaks out the high-speed GTX transceivers into a 10GBASE-T RJ45 connector, an SFP+ optical cage, and four SDI connectors (two inputs and two outputs).
Here’s a block diagram and photo of the Xenie FPGA module:
Xenie FPGA module based on a Xilinx Kintex-7 FPGA
And here’s a photo of the Xenie BB Carrier board that accepts the Xenie FPGA module: