Next week at OFC 2017 in Los Angeles, Acacia Communications, Optelian, Precise-ITC, Spirent, and Xilinx will present the industry’s first interoperability demo supporting 200/400GbE connectivity over standardized OTN and DWDM. Putting that succinctly, the demo is all about packing more bits/λ, so that you can continue to use existing fiber instead of laying more.
Callite-C4 400GE/OTN Transponder IP from Precise-ITC instantiated in a Xilinx Virtex UltraScale+ VU9P FPGA will map native 200/400GbE traffic—generated by test equipment from Spirent—into 2x100 and 4x100 OTU4-encapsulated signals. The 200GbE and 400GbE standards are still in flux, so instantiating the Precise-ITC transponder IP in an FPGA allows the design to quickly evolve with the standards with no BOM or board changes. Concise translation: faster time to market with much less risk.
Callite-C4 400GE/OTN Transponder IP Block Diagram
Optelian’s TMX-2200 200G muxponder, scheduled for release later this year, will muxpond the OTU4 signals into 1x200Gbps or 2x200Gbps DP-16QAM using Acacia Communications’ CFP2-DCO coherent pluggable transceiver.
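The "more bits/λ" arithmetic behind DP-16QAM can be sketched in a few lines. This is a back-of-the-envelope calculation, not a description of the Acacia module: it assumes 16QAM carries 4 bits per symbol, that dual polarization doubles that figure, and it ignores FEC and framing overhead, so real symbol rates will be higher. The function names are made up for illustration.

```python
import math

def bits_per_symbol(constellation_points: int, polarizations: int = 2) -> int:
    """Bits carried per symbol period, summed over all polarizations."""
    return int(math.log2(constellation_points)) * polarizations

def min_symbol_rate_gbaud(payload_gbps: float, constellation_points: int) -> float:
    """Lower bound on symbol rate; assumes no FEC or framing overhead."""
    return payload_gbps / bits_per_symbol(constellation_points)

dp_16qam_bits = bits_per_symbol(16)   # 4 bits x 2 polarizations
dp_qpsk_bits = bits_per_symbol(4)     # 2 bits x 2 polarizations
print(f"DP-16QAM: {dp_16qam_bits} bits/symbol, DP-QPSK: {dp_qpsk_bits} bits/symbol")
print(f"200Gbps on one wavelength needs at least "
      f"{min_symbol_rate_gbaud(200, 16):.0f} Gbaud with DP-16QAM")
```

Under these assumptions, DP-16QAM carries twice as many bits per symbol as DP-QPSK, which is where the extra capacity per wavelength comes from.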
The Optelian and Precise-ITC exhibit booths at OFC 2017 are 4139 and 4141 respectively.
This week, EETimes’ Junko Yoshida published an article titled “Xilinx AI Engine Steers New Course” that gathers some comments from industry experts and from Xilinx with respect to Monday’s reVISION stack announcement. To recap, the Xilinx reVISION stack is a comprehensive suite of industry-standard resources for developing advanced embedded-vision systems based on machine learning and machine inference.
As Xilinx Senior Vice President of Corporate Strategy Steve Glaser tells Yoshida, “Xilinx designed the stack to ‘enable a much broader set of software and systems engineers, with little or no hardware design expertise to develop, intelligent vision guided systems easier and faster.’”
“While talking to customers who have already begun developing machine-learning technologies, Xilinx identified ‘8 bit and below fixed point precision’ as the key to significantly improve efficiency in machine-learning inference systems.”
Yoshida also interviewed Karl Freund, Senior Analyst for HPC and Deep Learning at Moor Insights & Strategy, who said:
“Artificial Intelligence remains in its infancy, and rapid change is the only constant.” In this circumstance, Xilinx seeks “to ease the programming burden to enable designers to accelerate their applications as they experiment and deploy the best solutions as rapidly as possible in a highly competitive industry.”
She also quotes Loring Wirbel, a Senior Analyst at The Linley Group, who said:
“What’s interesting in Xilinx's software offering, [is that] this builds upon the original stack for cloud-based unsupervised inference, Reconfigurable Acceleration Stack, and expands inference capabilities to the network edge and embedded applications. One might say they took a backward approach versus the rest of the industry. But I see machine-learning product developers going a variety of directions in trained and inference subsystems. At this point, there's no right way or wrong way.”
There’s a lot more information in the EETimes article, so you might want to take a look for yourself.
Next week at the OFC Optical Networking and Communication Conference & Exhibition in Los Angeles, Xilinx will be in the Ethernet Alliance booth demonstrating the industry’s first, standard-based, multi-vendor 400GE network. A 400GE MAC and PCS instantiated in a Xilinx Virtex UltraScale+ VU9P FPGA will be driving a Finisar 400GE CFP8 optical module, which in turn will communicate with a Spirent 400G test module over a fiber connection.
In addition, Xilinx will be demonstrating:
If you’re visiting OFC, be sure to stop by the Xilinx booth (#1809).
MRAM (magnetic RAM) maker Everspin wants to make it easy for you to connect its 256Mbit DDR3 ST-MRAM devices (and its soon-to-be-announced 1Gbit ST-MRAMs) to Xilinx UltraScale FPGAs, so it now provides a software script for the Vivado MIG (Memory Interface Generator) that adapts the MIG DDR3 controller to the ST-MRAM’s unique timing and control requirements. Everspin has been shipping MRAMs for more than a decade and, according to this EETimes.com article by Dylan McGrath, it’s still the only company to have shipped commercial MRAM devices.
Nonvolatile MRAM’s advantage is that it has no wearout failure, as opposed to Flash memory for example. This characteristic gives MRAM huge advantages over Flash memory in applications such as server-class enterprise storage. MRAM-based storage cards require no wear leveling and their read/write performance does not degrade over time, unlike Flash-based SSDs.
As a result, Everspin also announced its nvNITRO line of NVMe storage-accelerator cards. The initial cards, the 1Gbyte nvNITRO ES1GB and 2Gbyte nvNITRO ES2GB, deliver 1,500,000 IOPS with 6μsec end-to-end latency. When Everspin's 1Gbit ST-MRAM devices become available later this year, the card capacities will increase to 4 to 16Gbytes.
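Those two figures can be related through Little's law (average requests in flight = throughput × latency). The sketch below is only a sanity check. It assumes both numbers are steady-state averages measured under the same load, which the announcement doesn't state.

```python
def outstanding_requests(iops: float, latency_seconds: float) -> float:
    """Little's law: average number of I/O requests in flight."""
    return iops * latency_seconds

# Figures quoted for the nvNITRO cards; treating them as simultaneous
# steady-state averages is an assumption.
in_flight = outstanding_requests(1_500_000, 6e-6)
print(f"about {in_flight:.0f} requests in flight on average")
```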
Here’s a photo of the card:
Everspin nvNITRO Storage Accelerator
If it looks familiar, perhaps you’re recalling the preview of this board from last year’s SC16 conference in Salt Lake City. (See “Everspin’s NVMe Storage Accelerator mixes MRAM, UltraScale FPGA, delivers 1.5M IOPS.”)
If you look at the photo closely, you’ll see that the hardware platform for this product is the Alpha Data ADM-PCIE-KU3 PCIe accelerator card, loaded with 1 or 2Gbyte Everspin ST-MRAM DIMMs. Everspin has added its own IP to the Alpha Data card, based on a Kintex UltraScale KU060 FPGA, to create an MRAM-based NVMe controller.
As I wrote in last year’s post:
“There’s a key point to be made about a product like this. The folks at Alpha Data likely never envisioned an MRAM-based storage accelerator when they designed the ADM-PCIE-KU3 PCIe accelerator card but they implemented their design using an advanced Xilinx UltraScale FPGA knowing that they were infusing flexibility into the design. Everspin simply took advantage of this built-in flexibility in a way that produced a really interesting NVMe storage product.”
It’s still an interesting product, and now Everspin has formally announced it.
A paper describing the superior performance of an FPGA-based, speech-recognition implementation over similar implementations on CPUs and GPUs won a Best Paper Award at FPGA 2017 held in Monterey, CA last month. The paper, titled “ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA” and written by authors from Stanford U, DeePhi Tech, Tsinghua U, and Nvidia, describes a speech-recognition algorithm using LSTM (Long Short-Term Memory) models with load-balance-aware pruning implemented on a Xilinx Kintex UltraScale KU060 FPGA. The implementation runs at 200MHz and draws 41W (for the FPGA board) slotted into a PCIe chassis. Compared to Core i7 CPU/Pascal Titan X GPU implementations of the same algorithm, the FPGA-based implementation delivers 43x/3x more raw performance and 40x/11.5x better energy efficiency, according to the FPGA 2017 paper. So the FPGA implementation is both faster and more energy-efficient. Pick any two.
Here’s a block diagram of the resulting LSTM speech-recognition design:
The paper describes the algorithm and implementation in detail, which probably contributed to this paper winning the conference’s Best Paper Award. This work was supported by the National Natural Science Foundation of China.
Today, Aldec announced its latest FPGA-based HES prototyping board—the HES-US-440—with a whopping 26M ASIC gate capacity. This board is based on the Xilinx Virtex UltraScale VU440 FPGA and it also incorporates a Xilinx Zynq Z-7100 SoC that acts as the board’s peripheral controller and host interface. The announcement includes a new release of Aldec’s HES-DVM Hardware/Software Validation Platform that enables simulation acceleration and emulation use modes for the HES-US-440 board in addition to the physical prototyping capabilities. You can also use this prototyping board directly to implement HPC (high-performance computing) applications.
Aldec HES-US-440 Prototyping Board, based on a Xilinx Virtex UltraScale VU440 FPGA
The Aldec HES-US-440 board packs a wide selection of external interfaces to ease your prototyping work including four FMC HPC connections, PCIe, USB 3.0 and USB 2.0 OTG, UART/USB bridge, QSFP+, 1Gbps Ethernet, HDMI, SATA; has on-board NAND and SPI Flash memories; and incorporates two microSD slots.
Here’s a block diagram of the HES-US-440 prototyping board:
Aldec HES-US-440 Prototyping Board Block Diagram
For more information about the Aldec HES-US-440 prototyping board and Aldec’s HES-DVM Hardware/Software Validation Platform, please contact Aldec directly.
Berten DSP’s GigaX API for the Xilinx Zynq SoC creates a high-speed, 200Mbps full-duplex communications channel between a GbE port and the Zynq SoC’s PL (programmable logic) through an attached SDRAM buffer and an AXI DMA controller IP block. Here’s a diagram to clear up what’s happening:
The software API implements IP filtering and manages TCP/UDP headers, which help you implement a variety of hardware-accelerated Ethernet systems including Ethernet bridges, programmable network nodes, and network offload appliances. Here’s a performance curve illustrating the kind of throughput you can expect:
Please contact Berten DSP directly for more information about the GigaX API.
With a month left in the Indiegogo funding period, the MATRIX Voice open-source voice platform campaign stands at 289% of its modest $5000 funding goal. MATRIX Voice is the third crowdfunding project by MATRIX Labs, based in Miami, Florida. The MATRIX Voice platform is a 3.14-inch circular circuit board capable of continuous voice recognition and compatible with the latest voice-based, cognitive cloud-based services including Microsoft Cognitive Service, Amazon Alexa Voice Service, Google Speech API, Wit.ai, and Houndify. The MATRIX Voice board, based on a Xilinx Spartan-6 LX4 FPGA, is designed to plug directly onto a low-cost Raspberry Pi single-board computer or it can be operated as a standalone board. You can get one of these boards, due to be shipped in May, for as little as $45, if you’re quick. (Already, 61 of the 230 early-bird special-price boards are pledged.)
Here’s a photo of the MATRIX Voice board:
This image of the top of the MATRIX Voice board shows the locations for the seven rear-mounted MEMS microphones, seven RGB LEDs, and the Spartan-6 FPGA. The bottom of the board includes a 64Mbit SDRAM and a connector for the Raspberry Pi board.
Because this is the latest in a series of developer boards from MATRIX Labs (see last year’s project: “$99 FPGA-based Vision and Sensor Hub Dev Board for Raspberry Pi on Indiegogo—but only for the next two days!”), there’s already a sophisticated, layered software stack for the MATRIX Voice platform that includes a HAL (Hardware Abstraction Layer) with the FPGA code and C++ library, an intermediate layer with a streaming interface for the sensors and vision libraries (for the Raspberry Pi camera), and a top layer with the MATRIX OS and high-level APIs. Here’s a diagram of the software stack:
And now, who better to describe this project than the originators:
As the BittWare video below explains, CPUs are simply not able to process 100GbE packet traffic without hardware acceleration. BittWare’s new StreamSleuth, to be formally unveiled at next week’s RSA Conference in San Francisco (Booth S312), adroitly handles blazingly fast packet streams thanks to a hardware assist from an FPGA. And as the subhead in the title slide of the video presentation says, StreamSleuth lets you program its FPGA-based packet-processing engine “without the hassle of FPGA programming.”
(Translation: you don’t need Verilog or VHDL proficiency to get this box working for you. You get all of the FPGA’s high-performance goodness without the bother.)
That said, as BittWare’s Network Products VP & GM Craig Lund explains, this is not an appliance that comes out of the box ready to roll. You need (and want) to customize it. You might want to add packet filters, for example. You might want to actively monitor the traffic. And you definitely want the StreamSleuth to do everything at wire-line speeds, which it can. “But one thing you do not have to do,” says Lund, “is learn how to program an FPGA.” You still get the performance benefits of FPGA technology, without the hassle. That means that a much wider group of network and data-center engineers can take advantage of BittWare’s StreamSleuth.
As Lund explains, “100GbE is a different creature” than prior, slower versions of Ethernet. Servers cannot directly deal with 100GbE traffic and “that’s not going to change any time soon.” The “network pipes” are now getting bigger than the server’s internal “I/O pipes.” This much traffic entering a server this fast clogs the pipes and also causes “cache thrash” in the CPU’s L3 cache.
Sounds bad, doesn’t it?
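To put a rough number on how bad, here's a sketch of the per-packet time budget at 100GbE. It assumes standard Ethernet framing overhead (8 bytes of preamble and start-of-frame delimiter plus a 12-byte minimum interframe gap) and a hypothetical 3GHz CPU core. Neither figure comes from BittWare's video.

```python
PREAMBLE_AND_SFD_BYTES = 8    # sent ahead of every Ethernet frame
INTERFRAME_GAP_BYTES = 12     # minimum idle time between frames

def packets_per_second(line_rate_bps: float, frame_bytes: int) -> float:
    """Maximum packet rate on a fully loaded link for a given frame size."""
    wire_bits = (frame_bytes + PREAMBLE_AND_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return line_rate_bps / wire_bits

def cycles_per_packet(line_rate_bps: float, frame_bytes: int, cpu_hz: float) -> float:
    """CPU clock cycles available per packet if one core saw every packet."""
    return cpu_hz / packets_per_second(line_rate_bps, frame_bytes)

ASSUMED_CPU_HZ = 3.0e9  # hypothetical core clock, for illustration only

for frame in (64, 1518):
    pps = packets_per_second(100e9, frame)
    cycles = cycles_per_packet(100e9, frame, ASSUMED_CPU_HZ)
    print(f"{frame:>5}-byte frames: {pps / 1e6:6.1f} Mpps, {cycles:6.1f} cycles/packet")
```

With minimum-size frames, that works out to roughly 149 million packets per second, which leaves about 20 clock cycles per packet on the assumed core. That's not much room for a cache miss.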
What you want is to reduce the network traffic of interest down to something a server can look at. To do that, you need filtering. Lots of filtering. Lots of sophisticated filtering. More sophisticated filtering than what’s available in today’s commodity switches and firewall appliances. Ideally, you want a complete implementation of the standard BPF/pcap filter language running at line rate on something really fast, like a packet engine implemented in a highly parallel FPGA.
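For a feel of what such a filter selects, here's a small software model of one pcap-style expression. It's plain Python, not the BPF/pcap language itself and not StreamSleuth's GUI or API. The `Packet` class, the sample traffic, and the `matches` function are all invented for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

@dataclass(frozen=True)
class Packet:
    """Hypothetical, heavily simplified packet record."""
    proto: str
    src: str
    dst: str
    dst_port: int

WATCHED_NET = ip_network("10.0.0.0/8")

def matches(pkt: Packet) -> bool:
    """Software model of: tcp and dst port 443 and src net 10.0.0.0/8"""
    return (pkt.proto == "tcp"
            and pkt.dst_port == 443
            and ip_address(pkt.src) in WATCHED_NET)

traffic = [
    Packet("tcp", "10.1.2.3", "192.0.2.10", 443),    # matches every clause
    Packet("udp", "10.1.2.3", "192.0.2.10", 443),    # wrong protocol
    Packet("tcp", "172.16.0.9", "192.0.2.10", 443),  # source outside 10.0.0.0/8
    Packet("tcp", "10.9.9.9", "192.0.2.10", 80),     # wrong destination port
]

of_interest = [p for p in traffic if matches(p)]
print(f"{len(of_interest)} of {len(traffic)} packets forwarded to the server")
```

The point of doing this in an FPGA is that every packet gets tested at line rate and only the matching ones reach the server.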
The same thing holds true for attack mitigation at 100GbE line rates. Commodity switching hardware isn’t going to do this for you at 100GbE (10GbE yes but 100GbE, “no way”) and you can’t do it in software at these line rates. “The solution is FPGAs” says Lund, and BittWare’s StreamSleuth with FPGA-based packet processing gets you there now.
Software-based defenses cannot withstand Denial of Service (DoS) attacks at 100GbE line rates. FPGA-accelerated packet processing can.
So what’s that FPGA inside of the BittWare StreamSleuth doing? It comes preconfigured for packet filtering, load balancing, and routing. (“That’s a Terabit router in there.”) To go beyond these capabilities, you use the BPF/pcap language to program your requirements into the StreamSleuth’s 100GbE packet processor using a GUI or APIs. That packet processor is implemented with a Xilinx Virtex UltraScale+ VU9P FPGA.
Here’s what the guts of the BittWare StreamSleuth look like:
And here’s a block diagram of the StreamSleuth’s packet processor:
The Virtex UltraScale+ FPGA resides on a BittWare XUPP3R PCIe board. If that rings a bell, perhaps you read about that board here in Xcell Daily last November. (See “BittWare’s UltraScale+ XUPP3R board and Atomic Rules IP run Intel’s DPDK over PCIe Gen3 x16 @ 150Gbps.”)
Finally, here’s the just-released BittWare StreamSleuth video with detailed use models and explanations:
For more information about the StreamSleuth, contact BittWare directly or go see the company’s StreamSleuth demo at next week’s RSA conference. For more information about the packet-processing capabilities of Xilinx All Programmable devices, click here. And for information about the new Xilinx Reconfigurable Acceleration Stack, click here.
Amazon Web Services (AWS) rolled out the F1 instance for cloud application development based on Xilinx Virtex UltraScale+ VU9P FPGAs last November. (See “Amazon picks Xilinx UltraScale+ FPGAs to accelerate AWS, launches F1 instance with 8x VU9P FPGAs per instance.”) It appears from the following LinkedIn post that people are using it already to do some pretty interesting things:
If you’re interested in Cloud computing applications based on the rather significant capabilities of Xilinx-based hardware application acceleration, check out the Xilinx Acceleration Zone.
Accolade’s newly announced ATLAS-1000 Fully Integrated 1U OEM Application Acceleration Platform pairs a Xilinx Kintex UltraScale KU060 FPGA on its motherboard with an Intel x86 processor on a COM Express module to create a network-security application accelerator. The ATLAS-1000 platform integrates Accolade’s APP (Advanced Packet Processor), instantiated in the Kintex UltraScale FPGA, which delivers acceleration features for line-rate packet processing including lossless packet capture, nanosecond-precision packet timestamping, packet merging, packet filtering, flow classification, and packet steering. The platform accepts four 10G SFP+ or two 40G QSFP pluggable optical modules. Although the ATLAS-1000 is designed as a flow-through security platform, especially for bump-in-the-wire applications, there’s also 1Tbyte worth of on-board local SSD storage.
Accolade Technology's ATLAS-1000 Fully Integrated 1U OEM Application Acceleration Platform
Here’s a block diagram of the ATLAS-1000 platform:
All network traffic enters the FPGA-based APP for packet processing. Packet data is then selectively forwarded to the x86 CPU COM Express module depending on the defined application policy.
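Policy-driven steering of this kind can be modeled in software as follows. This is only a sketch of the general idea (classify each flow, forward selected flows to a CPU queue, let the rest pass through), not Accolade's APP or its API. The queue count, the CRC-based hash, and the port-based policy are assumptions chosen for illustration.

```python
import zlib
from typing import NamedTuple, Optional, Set

class FiveTuple(NamedTuple):
    src: str
    dst: str
    sport: int
    dport: int
    proto: int

NUM_CPU_QUEUES = 4  # hypothetical number of receive queues on the x86 side

def flow_queue(flow: FiveTuple) -> int:
    """Hash the 5-tuple so every packet of a flow lands on the same queue."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % NUM_CPU_QUEUES

def steer(flow: FiveTuple, watched_ports: Set[int]) -> Optional[int]:
    """Return a CPU queue for flows the policy selects, or None to pass through."""
    if flow.dport in watched_ports:
        return flow_queue(flow)
    return None

policy_ports = {53, 443}  # example policy: inspect DNS and HTTPS only
dns = FiveTuple("198.51.100.7", "192.0.2.1", 40000, 53, 17)
ntp = FiveTuple("198.51.100.7", "192.0.2.1", 40001, 123, 17)

print("DNS flow ->", steer(dns, policy_ports))
print("NTP flow ->", steer(ntp, policy_ports))
```

Hashing on the full 5-tuple keeps a flow's packets in order on one queue, which is the usual reason for steering by flow rather than by packet.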
Please contact Accolade Technology directly for more information about the ATLAS-1000.
Aquantia has packed its Ethernet PHY—capable of operating at 10Gbps over 100m of Cat 6a cable (or 5Gbps down to 100Mbps over 100m of Cat 5e cable)—with a Xilinx Kintex-7 FPGA, creating a universal Gigabit Ethernet component with extremely broad capabilities. Here’s a block diagram of the new AQLX107 device:
This Aquantia device gives you a space-saving, one-socket solution for a variety of Ethernet designs including controllers, protocol converters, and anything-to-Ethernet bridges.
Please contact Aquantia for more information about this unique Ethernet chip.
The Linley Cloud Hardware Conference (formerly our Data Center Conference) is coming to the Hyatt Regency Hotel in Santa Clara, CA on February 8. This full-day, single-track event focuses on the processors, accelerators, Ethernet controllers, new memory technologies, and interconnects used for cloud computing and networking. The conference includes a special afternoon panel titled “Accelerating the Cloud” that will be moderated by The Linley Group’s Principal Analyst Jag Bolaria with the following panelists:
Until Feb 2, you can snag a free ticket to the conference (and a free breakfast) by clicking here if you’re a cloud-service provider, network-service provider, network-equipment vendor, server OEM, system designer, software developer, member of the press, or work in the financial community. (That’s a pretty wide net.) After that date, it’s going to cost you $195 to attend if you’re in that net or $795 if you’re not.
Time to start swimming.
Edico Genome and Dell EMC have developed a bundled compute-and-storage solution for rapid, cost-effective and accurate analysis of next-generation bio-sequencing data. The bundle consists of Edico Genome’s DRAGEN processor integrated into a 1U Dell 4130 server with Dell EMC’s Isilon scale-out network-attached storage (NAS). Edico Genome’s DRAGEN bio-IT processor is designed to analyze sequencing data quickly using the hardware acceleration of a Xilinx FPGA. (For more information about the Edico Genome DRAGEN processor, see “FPGA-based Edico Genome Dragen Accelerator Card for IBM OpenPOWER Server Speeds Exome/Genome Analysis by 60x.”)
For more information about this Edico/Dell bio-processor bundle, click here.
Work started on CCIX, the cache-coherent interconnect for accelerators, a little over a year ago. The CCIX specification describes an interconnect that makes workload handoff from server CPUs to hardware accelerators as simple as passing a pointer. This capability enables a whole new class of accelerated data center applications.
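As a loose software analogy only (CCIX is a hardware interconnect, and nothing below uses a CCIX API), the difference between copying a buffer to a worker and handing over a reference to shared memory looks like this in Python. The `accelerate_in_place` function is a made-up stand-in for an accelerator.

```python
def accelerate_in_place(view: memoryview) -> None:
    """Stand-in for an accelerator that works directly on shared memory."""
    for i in range(len(view)):
        view[i] = (view[i] * 2) % 256

workload = bytearray([1, 2, 3, 4])

copied = bytes(workload)        # copy-based handoff: the data is duplicated
shared = memoryview(workload)   # pointer-style handoff: no bytes are copied

accelerate_in_place(shared)

print("original buffer after handoff:", list(workload))
print("stale copy:                   ", list(copied))
```

The worker's results show up in the original buffer without any copy back, while the duplicated data is already out of date. Cache coherency is what makes the pointer-style handoff safe when the worker is a separate piece of hardware.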
Xilinx VP of Silicon architecture Gaurav Singh discussed CCIX at the recent Xilinx Technology Briefing held at SC16 in Salt Lake City. His talk covers many CCIX details and you can watch him discuss these topics in this 9-minute video from the briefing:
The video below shows Ravi Sunkavalli, the Xilinx Sr. Director of Data Center Solutions, discussing how advanced FPGAs like devices based on the Xilinx UltraScale architecture can aid you in developing high-speed networking and storage equipment as data centers migrate to faster internal networking speeds. Sunkavalli posits that CPUs, which are largely used for networking and storage applications connected with today’s 10G networks, quickly run out of gas at 40G and 100G networking speeds. FPGAs can provide “bump-in-the-wire” acceleration for high-speed networking ports thanks to the large number of fast compute elements and the high-speed transceivers incorporated into devices like the Xilinx UltraScale and UltraScale+ FPGAs.
Examples of networking applications already handled by FPGAs include VNF (Virtual Network Functions) such as VPNs, firewalls, and security. FPGAs are already being used to implement high-speed data center storage functions such as error correction, compression, and security.
The following 8-minute video was recorded during a Xilinx technology briefing at the recent SC16 conference in Salt Lake City:
Last November at SC16 in Salt Lake City, Xilinx Distinguished Engineer Ashish Sirasao gave a 10-minute talk on deploying deep-learning applications using FPGAs with significant performance/watt benefits. Sirasao started by noting that we’re already knee-deep in machine-learning applications: spam filters; cloud-based and embedded voice-to-text converters; and Amazon’s immensely successful, voice-operated Alexa are all examples of extremely successful machine-learning apps in broad use today. More—many more—will follow. These applications all have steep computing requirements.
There are two phases in any machine-learning application. The first is training and the second is deployment. Training is generally done using floating-point implementations so that application developers need not worry about numeric precision. Training is a 1-time event so energy efficiency isn’t all that critical.
Deployment is another matter however.
Putting a trained deep-learning application in a small appliance like Amazon’s Alexa calls for attention to factors such as energy efficiency. Fortunately, said Sirasao, the arithmetic precision of the application can change from training to mass deployment and there are significant energy-consumption gains to be had by deploying fixed-point machine-learning applications. According to Sirasao, you can get accurate machine inference using 8- or 16-bit fixed-point implementations while realizing a 10x gain in energy efficiency for the computing hardware and a 4x gain in memory energy efficiency.
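Here's a minimal sketch of what moving from floating point to 8-bit fixed point can look like. It uses symmetric linear quantization with one scale factor per vector, which is one common scheme. Sirasao's talk doesn't say which scheme Xilinx uses, so treat the method, the function names, and the sample values as assumptions.

```python
def quantize(values, bits=8):
    """Map floats to signed integers using one shared scale factor.

    Assumes at least one value is non-zero.
    """
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def quantized_dot(qa, scale_a, qb, scale_b):
    """Integer multiply-accumulate, rescaled to a float once at the end."""
    accumulator = sum(x * y for x, y in zip(qa, qb))
    return accumulator * scale_a * scale_b

weights = [0.42, -1.30, 0.07, 0.88]        # made-up trained weights
activations = [1.5, 0.25, -2.0, 0.6]       # made-up layer inputs

exact = sum(w * a for w, a in zip(weights, activations))
q_w, s_w = quantize(weights)
q_a, s_a = quantize(activations)
approx = quantized_dot(q_w, s_w, q_a, s_a)

print(f"float result: {exact:.4f}")
print(f"INT8 result:  {approx:.4f}")
```

All of the multiplies and adds in the inner loop operate on small integers, which is where the energy savings come from. The floating-point rescale happens only once per dot product.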
The Xilinx DSP48E2 block implemented in the company’s UltraScale and UltraScale+ devices is especially useful for these machine-learning deployments because its DSP architecture can perform two independent 8-bit operations per clock per DSP block. That translates into nearly double the compute performance, which in turn results in much better energy efficiency. There’s a Xilinx White Paper on this topic titled “Deep Learning with INT8 Optimization on Xilinx Devices.”
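The idea of getting two 8-bit multiplies out of one wide multiplier can be modeled with plain Python integers. In this sketch, two operands that share a common multiplicand are packed into one wide word, multiplied once, and the two products are separated afterward. The 18-bit field offset and the function name are assumptions for illustration, not the DSP48E2's documented port mapping. See the White Paper for the real details.

```python
SHIFT = 18  # assumed offset between the two packed fields

def packed_dual_multiply(a: int, b: int, c: int):
    """Return (a*c, b*c) using a single wide multiplication.

    a, b, and c must be signed 8-bit values (-128..127).
    """
    for value in (a, b, c):
        if not -128 <= value <= 127:
            raise ValueError("operands must fit in signed 8 bits")

    packed = (a << SHIFT) + b        # both operands in one wide word
    product = packed * c             # the one and only multiply

    # The low field holds b*c as an 18-bit two's-complement number.
    low = product & ((1 << SHIFT) - 1)
    if low >= 1 << (SHIFT - 1):
        low -= 1 << SHIFT

    # Removing the low field leaves a*c in the upper bits.
    high = (product - low) >> SHIFT
    return high, low

print(packed_dual_multiply(57, -113, 94))
print((57 * 94, -113 * 94))
```

The subtraction before the shift matters: when `b*c` is negative it borrows from the upper field, so the upper bits alone would be off by one.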
Further, Xilinx recently announced its Acceleration Stack for machine-learning (and other cloud-based applications), which allows you to focus on developing your application rather than FPGA programming. You can learn about the Xilinx Acceleration Stack here.
Finally, here’s the 10-minute video with Sirasao’s SC16 talk:
Do you have a big job to do? How about a terabit router bristling with optical interconnect? Maybe you need a DSP monster for phased-array radar or sonar. Beamforming for advanced 5G applications using MIMO antennas? Some other high-performance application with mind-blowing processing and I/O requirements?
You need to look at Xilinx Virtex UltraScale+ FPGAs with their massive data-flow and routing capabilities, massive memory bandwidth, and massive I/O bandwidth. These attributes sweep away design challenges caused by performance limits of lesser devices.
Now you can quickly get your hands on a Virtex UltraScale+ Eval Kit so you can immediately start that challenging design work. The new eval kit is the Xilinx VCU118 with an on-board Virtex UltraScale+ VU9P FPGA. Here’s a photo of the board included with the kit:
Xilinx VCU118 Eval Board with Virtex UltraScale+ VU9P FPGA
The VCU118 eval kit’s capabilities spring from the cornucopia of on-chip resources provided by the Virtex UltraScale+ VU9P FPGA including:
If you can’t build what you need with the VCU118’s on-board Virtex UltraScale+ VU9P FPGA—and it’s sort of hard to believe that’s even possible—just remember, there are even larger parts in the Virtex UltraScale+ FPGA family.
Was it just two weeks ago that Amazon announced the FPGA-accelerated F1 instance of its AWS (Amazon Web Services) in developer preview form at Amazon’s AWS re:Invent 2016 event in Las Vegas, Nevada? (See “Amazon picks Xilinx UltraScale+ FPGAs to accelerate AWS, launches F1 instance with 8x VU9P FPGAs per instance.”)
The next day at the same event, NGCodec demonstrated its hardware-accelerated RealityCodec 4K video codec running on the Amazon AWS’ FPGA-accelerated EC2 F1 instance. (See “NGCodec announces high-speed 4K video compression for Amazon AWS’ new FPGA-accelerated EC2 F1 instance.”)
Add two more applications up and running on the FPGA-accelerated AWS EC2 F1 instance:
If you’ve been reading the Xilinx Xcell Daily blogs for a while, then both Ryft and Edico Genome might be familiar to you. Both companies previously implemented high-performance cloud appliances based on Xilinx FPGAs. (See “FPGA-based Ryft ONE search accelerator delivers 100x performance advantage over Apache Spark in the data center” and “FPGA-based Edico Genome Dragen Accelerator Card for IBM OpenPOWER Server Speeds Exome/Genome Analysis by 60x.”)
Now that Amazon is standardizing FPGA acceleration for cloud computing using multiple Xilinx Virtex UltraScale+ FPGAs, both Ryft and Edico are porting their applications to the Amazon AWS cloud.
Can you say “rapid deployment”? I knew you could.
Today, NGCodec demonstrated its hardware-accelerated RealityCodec 4K video codec running on the Amazon AWS’ FPGA-accelerated EC2 F1 instance, announced just yesterday at Amazon’s AWS re:Invent 2016 event in Las Vegas, Nevada. (See “Amazon picks Xilinx UltraScale+ FPGAs to accelerate AWS, launches F1 instance with 8x VU9P FPGAs per instance.”) Amazon AWS customers will be able to buy NGCodec’s RealityCodec when it becomes available in the AWS Marketplace and will run it on the Amazon EC2 F1 instance, which is a hardware-accelerated offering based on multiple Xilinx UltraScale+ FPGAs packed into the AWS server chassis. NGCodec’s RealityCodec running on the Amazon EC2 F1 instance delivers ultra-high-performance video compression (with up to 4K resolution) and ultra-low, sub-frame latency for cloud-based VR and AR. (See NGCodec’s press release here.)
NGCodec’s RealityCodec is an excellent example of the type of cloud-based application that can benefit from FPGA-based hardware acceleration. Amazon’s announcement yesterday of a standardized hardware-accelerated offering for AWS follows an accelerating trend towards offering FPGA-based acceleration to cloud-services customers. Like Amazon, many of the major cloud service providers have announced deployment of FPGA technology in their Hyperscale data centers to drive their services business in an extremely competitive market. FPGAs are the perfect complement to highly agile cloud computing environments because they are programmable and can be hardware-optimized for any new application or algorithm. The inherent ability of an FPGA to reconfigure and be reprogrammed over time is perhaps its greatest advantage in a fast-moving field.
For more information about FPGA-based hardware acceleration in the data center, check out the new Xilinx Acceleration Zone and take a look at this White Paper from Moor Insights & Strategy, which describes the new Xilinx Reconfigurable Acceleration Stack. (See “Xilinx Reconfigurable Acceleration Stack speeds programming of machine learning, data analytics, video-streaming apps.”)
Jeff Barr, Chief Evangelist at Amazon Web Services, just unveiled the accelerated F1 instance of its AWS (Amazon Web Services) in developer preview form. The rollout came in the form of a blog titled “Developer Preview – EC2 Instances (F1) with Programmable Hardware.”
“One of the more interesting routes to a custom, hardware-based solution is known as a Field Programmable Gate Array, or FPGA. In contrast to a purpose-built chip which is designed with a single function in mind and then hard-wired to implement it, an FPGA is more flexible. It can be programmed in the field, after it has been plugged in to a socket on a PC board. Each FPGA includes a fixed, finite number of simple logic gates. Programming an FPGA is “simply” a matter of connecting them up to create the desired logical functions (AND, OR, XOR, and so forth) or storage elements (flip-flops and shift registers). Unlike a CPU which is essentially serial (with a few parallel elements) and has fixed-size instructions and data paths (typically 32 or 64 bit), the FPGA can be programmed to perform many operations in parallel, and the operations themselves can be of almost any width, large or small.
“This highly parallelized model is ideal for building custom accelerators to process compute-intensive problems. Properly programmed, an FPGA has the potential to provide a 30x speedup to many types of genomics, seismic analysis, financial risk analysis, big data search, and encryption algorithms and applications.
“I hope that this sounds awesome and that you are chomping at the bit to use FPGAs to speed up your own applications!
“Today we are launching a developer preview of the new F1 instance. In addition to building applications and services for your own use, you will be able to package them up for sale and reuse in AWS Marketplace. Putting it all together, you will be able to avoid all of the capital-intensive and time-consuming steps that were once a prerequisite to the use of FPGA-powered applications, using a business model that is more akin to that used for every other type of software. We are giving you the ability to design your own logic, simulate and verify it using cloud-based tools, and then get it to market in a matter of days.
“Equipped with Intel Broadwell E5 2686 v4 processors (2.3 GHz base speed, 2.7 GHz Turbo mode on all cores, and 3.0 GHz Turbo mode on one core), up to 976 GiB of memory, up to 4 TB of NVMe SSD storage, and one to eight FPGAs, the F1 instances provide you with plenty of resources to complement your core, FPGA-based logic. The FPGAs are dedicated to the instance and are isolated for use in multi-tenant environments.
“Here are the specs on the FPGA (remember that there are up to eight of these in a single F1 instance):
“In instances with more than one FPGA, dedicated PCIe fabric allows the FPGAs to share the same memory address space and to communicate with each other across a PCIe Fabric at up to 12 Gbps in each direction. The FPGAs within an instance share access to a 400 Gbps bidirectional ring for low-latency, high bandwidth communication (you’ll need to define your own protocol in order to make use of this advanced feature).”
Amazon is also releasing a developer tool called AMI, “a set of developer tools that you can use in the AWS Cloud at no charge,” for AWS F1 application development.
Note: For additional information on the extensive support Xilinx provides for hardware acceleration in cloud environments, click over to the Xilinx Acceleration Zone, where you’ll find helpful information about the newly announced Reconfigurable Acceleration Stack. (Also, see “Xilinx Reconfigurable Acceleration Stack speeds programming of machine learning, data analytics, video-streaming apps.”)
Want to see how fast machine inference can go and how efficient it can be? The video below shows you how fast the AlexNet image-classification algorithm runs (better than 1800 image classifications/sec)—and how efficiently it runs (<50W)—using an INT8 (8-bit integer) implementation. The demo on the video shows AlexNet running in an open-source Caffe deep-learning framework, implemented with the xDNN deep neural network library running on a Xilinx UltraScale FPGA in the Xilinx Kintex UltraScale FPGA Acceleration Development Kit.
All of the above components are part of the newly announced Xilinx Reconfigurable Acceleration Stack.
Note: If you implemented this classification application using INT16 instead, you’d get about half the performance, as mentioned in the video and discussed in detail in the previous Xcell Daily blog post, “Counter-Intuitive: Fixed-Point Deep-Learning Inference Delivers 2x to 6x Better CNN Performance with Great Accuracy.”
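As a rough worked example of what those figures imply, the short sketch below turns the quoted bounds (better than 1800 classifications/sec at less than 50W) into a lower bound on efficiency and applies the INT16 estimate. The variable names are illustrative, and treating “about half” as exactly 0.5 is an assumption.

```python
# Figures quoted in the post; both are bounds, so the result is a lower bound.
INT8_IMAGES_PER_SEC = 1800.0   # "better than 1800 image classifications/sec"
MAX_POWER_WATTS = 50.0         # "<50W"
INT16_RELATIVE_SPEED = 0.5     # assumption: "about half" taken as exactly half


def images_per_sec_per_watt(images_per_sec, watts):
    """Throughput efficiency in classifications per second per watt."""
    return images_per_sec / watts


int8_efficiency_floor = images_per_sec_per_watt(INT8_IMAGES_PER_SEC, MAX_POWER_WATTS)
int16_images_per_sec = INT8_IMAGES_PER_SEC * INT16_RELATIVE_SPEED

print(f"INT8: at least {int8_efficiency_floor:.0f} images/sec/W")
print(f"INT16 estimate: about {int16_images_per_sec:.0f} images/sec")
```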
Here’s the video showing FPGA-based image classification in action:
Intuitively, you might think that the more resolution you throw at deep-learning inference, the more accurate the result.
Nope. Not true.
Human intuition does not always guide you to a superior solution in the fields of AI, machine learning, and inference. In this case, the counter-intuitive result gets you near-maximum inference accuracy with greatly improved performance and reduced power consumption resulting in significantly better performance per watt. The technical detail is all there in a new Xilinx White Paper titled “Deep Learning with INT8 Optimization on Xilinx Devices.”
Research has shown that 32-bit floating-point computations are not required in deep learning inferences to obtain the best accuracy. For many applications such as image classification, INT8 (or even lower-precision) fixed-point computations deliver nearly identical inference accuracy compared with floating-point results. Here’s a table from the Xilinx White Paper that shows accuracy results for fine-tuned CNNs (convolutional neural networks) based on fixed-point computations that validates this claim. (The numbers in parentheses indicate accuracy without fine-tuning.):
Note that the reduced-precision fixed-point computations and 32-bit floating-point computations deliver essentially the same inference accuracy for all six of these CNN benchmark applications.
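To get a feel for why reduced precision costs so little accuracy, here is a minimal sketch, not the White Paper’s method, that quantizes a dot product (the core operation in a CNN layer) to INT8 and compares it with the floating-point result. The symmetric per-tensor scaling scheme, the function names `quantize_int8` and `int8_dot`, and the random test vectors are all assumptions made for illustration.

```python
import random


def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(weights, activations):
    """Dot product computed on INT8 values, then rescaled back to a float."""
    qw, w_scale = quantize_int8(weights)
    qa, a_scale = quantize_int8(activations)
    accumulator = sum(w * a for w, a in zip(qw, qa))  # pure integer arithmetic
    return accumulator * w_scale * a_scale


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(256)]
activations = [random.uniform(0.0, 1.0) for _ in range(256)]

float_result = sum(w * a for w, a in zip(weights, activations))
int8_result = int8_dot(weights, activations)
print(f"float: {float_result:.4f}  int8: {int8_result:.4f}")
```

The multiply-accumulate loop touches only small integers; the two scale factors are applied once at the end. Real networks add fine-tuning and per-layer calibration, which this sketch leaves out.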
So why should you bother with floating-point computations at all? That’s an excellent question for you to ponder once you let the data override your intuition. You know that floating-point calculations consume more power; you know they consume more resources to implement; and that fact becomes increasingly important when creating massively parallel CNNs.
There are, in fact, several reasons to employ fixed-point computations instead of floating-point computations based on these results. Your design delivers:
Although the first two advantages may seem obvious, the third—better resource utilization—takes some explaining, which the Xilinx White Paper describes in detail. Based on this and other research, FPGA-based floating-point DSPs turn out to be a poor match for many hyperscale applications including machine learning inference. (At the same time, other research suggests that floating-point DSPs as implemented in FPGAs fall well short of the compute efficiency attained by GPUs optimized for CNN training.)
Xilinx’s fixed-point DSP48E2 architecture used in its UltraScale and UltraScale+ FPGAs is optimized for reduced-precision integer computations because you can pack two INT8 operations into every clock tick in each Xilinx DSP48E2 slice thanks to its wide, 27x18-bit multiplier, 48-bit accumulator, and other architectural enhancements. The DSP slices in competitive FPGAs cannot accomplish this feat. (Again, see the Xilinx White Paper for the gory technical details of how this integer operand packing works and how it essentially doubles CNN performance.)
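The arithmetic idea behind that operand packing can be sketched in a few lines. This is a simplified software model under stated assumptions, not the DSP48E2 implementation described in the White Paper: two signed INT8 values `a` and `b` that share a common multiplicand `c` are packed into one wide operand, multiplied once, and the two products are separated afterwards. The 18-bit lane offset and the function name `packed_multiply` are choices made for this illustration.

```python
SHIFT = 18                  # lane offset; b*c always fits in 18 signed bits
MASK = (1 << SHIFT) - 1


def packed_multiply(a, b, c):
    """Return (a*c, b*c) for signed 8-bit inputs using one wide multiplication."""
    for value in (a, b, c):
        if not -128 <= value <= 127:
            raise ValueError("operands must be signed 8-bit integers")

    packed = (a << SHIFT) + b        # fits within a signed 27-bit operand
    product = packed * c             # equals (a*c << 18) + b*c

    low = product & MASK             # lower lane as an unsigned 18-bit field
    if low >= 1 << (SHIFT - 1):      # reinterpret the field as signed
        low -= 1 << SHIFT
    high = (product - low) >> SHIFT  # undo the lower lane's borrow, then shift
    return high, low


print(packed_multiply(-7, 12, 5))
print(packed_multiply(127, -128, -128))
```

The subtraction before the shift matters: when `b*c` is negative it borrows from the upper lane, so the upper product has to be corrected before it is read out. Hardware that accumulates many such products needs further care with lane widths, which is where the White Paper’s details come in.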
The proof’s in the performance data, so here’s a figure taken from the same Xilinx White Paper that graphically illustrates the superior efficiency of fixed-point CNN implementations using Xilinx UltraScale and UltraScale+ FPGAs:
You can see from this figure that you get significantly more deep-learning GOPS/watt from fixed-point CNN implementations using Xilinx UltraScale and UltraScale+ FPGAs, when compared to competitive devices. Compared to Intel's Arria 10 and Stratix 10 devices as shown in the above figure, Xilinx devices deliver 2X to 6X better GOPS/watt efficiency for deep-learning inference operations—at essentially the same accuracy level attained by 32-bit floating-point implementations.
For more information, you might want to spend some time investigating the resources on the new Xilinx Acceleration Zone Web page, which discusses myriad facets of hyperscale cloud acceleration using Xilinx FPGA technology including the new Xilinx Acceleration Development Kit based on the Xilinx Kintex UltraScale KU115 FPGA.
Intel’s DPDK (Data Plane Development Kit) is a set of software libraries that improves packet processing performance on x86 CPU hosts by as much as 10x. According to Intel, its DPDK plays a critical role in SDN and NFV applications. Last week at SC16 in Salt Lake City, BittWare demonstrated Intel’s DPDK running on a Xeon CPU and streaming packets over a PCIe Gen3 x16 interface at an aggregate rate of 150Gbps (transmit + receive) to and from BittWare’s new XUPP3R PCIe board using Atomic Rules’ Arkville DPDK-aware data mover IP instantiated in the 16nm Xilinx Virtex UltraScale+ VU9P FPGA on BittWare’s board. The Arkville DPDK-aware data mover marshals packets between the IP block implemented in the FPGA’s programmable logic and the CPU host’s memory using the Intel DPDK API/ABI. Atomic Rules’ Arkville IP plus a high-speed MAC looks like a line-rate-agnostic, bare-bones L2 NIC.
BittWare’s XUPP3R PCIe board with an on-board Xilinx Virtex UltraScale+ VU9P FPGA
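To put the 150Gbps aggregate figure in context, the sketch below works out the raw capacity of a PCIe Gen3 x16 link. It uses the standard Gen3 parameters (8 GT/s per lane, 128b/130b line coding); packet-header and flow-control overhead are deliberately ignored, so the achievable payload rate is lower than the raw figure computed here. The names are illustrative.

```python
GEN3_GT_PER_SEC_PER_LANE = 8.0    # PCIe Gen3 signaling rate per lane
ENCODING_EFFICIENCY = 128 / 130   # 128b/130b line coding
LANES = 16
DEMO_AGGREGATE_GBPS = 150.0       # transmit + receive, as quoted for the demo


def raw_gbps_per_direction(lanes=LANES):
    """Raw link bandwidth in one direction, before protocol overhead."""
    return GEN3_GT_PER_SEC_PER_LANE * ENCODING_EFFICIENCY * lanes


link_aggregate_gbps = 2 * raw_gbps_per_direction()
utilization = DEMO_AGGREGATE_GBPS / link_aggregate_gbps

print(f"Raw per direction: {raw_gbps_per_direction():.1f} Gbps")
print(f"Raw aggregate:     {link_aggregate_gbps:.1f} Gbps")
print(f"Demo utilization:  {utilization:.0%} of raw aggregate")
```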
Here’s a very short video of BittWare’s VP of Systems & Solutions Ron Huizen explaining his company’s SC16 demo:
Here’s an equally short video made by Atomic Rules with a bit more info:
If this all looks vaguely familiar, perhaps you’re remembering an Xcell Daily post that appeared just last May where BittWare demonstrated an Atomic Rules UDP Offload Engine running on its XUSP3S PCIe board, which is based on a Xilinx Virtex UltraScale VU095 FPGA. (See “BittWare and Atomic Rules demo UDP Offload Engine @ 25 GbE rates; BittWare intros PCIe Networking card for 4x 100 GbE.”) For the new XUPP3R PCIe board, BittWare has now jumped from the 20nm Virtex UltraScale FPGAs to the latest 16nm Virtex UltraScale+ FPGAs.
Everspin, “The MRAM Company,” took an off-the-shelf Alpha Data ADM-PCIE-KU3 PCIe accelerator card, loaded 1Gbyte of MRAM DIMMs on the card, reprogrammed the on-board Kintex UltraScale KU060 FPGA to create an MRAM-based NVMe controller, and got…
From non-volatile, no-wearout-failure MRAM.
The folks at Alpha Data handed me a data sheet for the resulting Everspin NVMe card, the ES1GB-N02 Storage Accelerator, at this week’s SC16 conference in Salt Lake City. Here’s a scan of that data sheet:
Everspin makes MRAMs with DDR3 pin-level interfaces, but these non-volatile memory devices have unique timing requirements that differ from the DDR3 SDRAM standard. Because the pin-level interface matches, it’s relatively easy to create an MRAM-based DDR3 SODIMM that snaps right into the existing SDRAM socket on the Alpha Data ADM-PCIE-KU3 card. Modify the SDRAM controller in the Kintex UltraScale FPGA to accommodate the MRAM’s timing requirements and, voilà, you’ve created an MRAM storage accelerator card.
There’s a key point to be made about a product like this. The folks at Alpha Data likely never envisioned an MRAM-based storage accelerator when they designed the ADM-PCIE-KU3 PCIe accelerator card, but they implemented their design using an advanced Xilinx UltraScale FPGA knowing that they were infusing flexibility into the design. Everspin simply took advantage of this built-in flexibility in a way that produced a really interesting NVMe storage product.
Isn’t that the sort of flexibility you’d like to have in your products?
(Note: MRAM is magnetic RAM.)
Alpha Data’s booth at this week’s SC16 conference in Salt Lake City held the company’s latest top-of-the-line FPGA accelerator card, the ADM-PCIE-9V3, based on the 16nm Xilinx Virtex UltraScale+ VU3P-2 FPGA. Announced just this week, the card also features two QSFP28 sockets that each accommodate one 100GbE connection or four 25GbE connections. If you have a full-height slot available, you can add two more 100GbE interfaces using Samtec FireFly Micro Flyover Optical modules and run four 100GbE interfaces simultaneously. All of this high-speed I/O capability comes courtesy of the 40 32.75Gbps SerDes ports on the Virtex UltraScale+ VU3P FPGA.
Alpha Data ADM-PCIE-9V3 Accelerator Card based on a Xilinx Virtex UltraScale+ VU3P-2 FPGA
To back up the board’s extreme Ethernet bandwidth, the ADM-PCIE-9V3 board incorporates two banks of 72-bit DDR4-2400 SDRAM with ECC and a per-bank capacity of 8Gbytes for a total of 16Gbytes of on-board SDRAM. All of this fits on a half-length, low-profile PCIe card with a PCIe Gen4 x8 or PCIe Gen3 x16 host connection, and the board supports the OpenPOWER CAPI coherent interface. (The PCIe configuration is programmable, thanks to the on-board Virtex UltraScale+ FPGA.)
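For a rough sense of what two such SDRAM banks can deliver, here is a back-of-the-envelope sketch of theoretical peak memory bandwidth. It assumes that the quoted 2400 speed grade means 2400 mega-transfers/sec and that each 72-bit bank carries 64 data bits plus 8 ECC bits, which is the usual arrangement but is not stated in the post. Sustained bandwidth in a real design will be lower.

```python
DATA_BITS_PER_BANK = 64       # assumption: 72-bit bank = 64 data bits + 8 ECC bits
TRANSFERS_PER_SEC = 2400e6    # assumption: "2400" means 2400 MT/s
BANKS = 2
GBYTES_PER_BANK = 8           # per-bank capacity quoted in the post


def peak_gbytes_per_sec(data_bits, transfers_per_sec):
    """Theoretical peak bandwidth of one bank in GB/s (10^9 bytes/sec)."""
    return data_bits / 8 * transfers_per_sec / 1e9


per_bank_bandwidth = peak_gbytes_per_sec(DATA_BITS_PER_BANK, TRANSFERS_PER_SEC)
total_bandwidth = per_bank_bandwidth * BANKS
total_capacity = GBYTES_PER_BANK * BANKS

print(f"Peak per bank: {per_bank_bandwidth:.1f} GB/s")
print(f"Peak total:    {total_bandwidth:.1f} GB/s")
print(f"Capacity:      {total_capacity} Gbytes")
```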
Taken as a whole, this new accelerator card delivers serious processing and I/O firepower along every dimension you might care to measure, whether it’s Ethernet bandwidth, memory capacity, or processing power.
The Alpha Data ADM-PCIE-9V3 board is based on a Xilinx Virtex UltraScale+ FPGA so it can serve as a target for the Xilinx SDAccel development environment, which delivers a CPU- and GPU-like development environment for application developers who wish to develop high-performance code using OpenCL, C, or C++ while targeting ready-to-go, plug-in FPGA hardware. In addition, Alpha Data offers an optional Board Support Package for the ADM-PCIE-9V3 accelerator board with example FPGA designs, application software, a mature API, and driver support for Microsoft Windows and Linux to further ease cloud-scale application development and deployment in hyperscale data centers.
This week at SC16 in Salt Lake City, Smart IOPS demonstrated its FPGA-powered Data Engine NVMe SSD, which delivers 1.7M IOPS. The company claims that’s 4x the rate of competing NVMe SSDs. The secret, besides the embedded Xilinx Kintex UltraScale FPGA running the show in hardware, is Smart IOPS’ TruRandom technology, which uses pattern-recognition heuristics baked into the FPGA logic to speed read/write transactions between the host CPU and the Data Engine’s NAND Flash storage. This technology makes sustained random and sequential read/write transactions indistinguishable, meaning they run...
Smart IOPS Data Engine NVMe SSD
Smart IOPS is offering the Data Engine NVMe SSD in 2 to 10Tbyte capacities and three flavors: T2, T2D, and T4. The T2 Data Engines employ 16nm MLC NAND Flash memory; the T2D Data Engines employ 3D MLC NAND Flash memory; and the T4 Data Engines employ 15nm MLC NAND Flash memory. The different types of flash affect the drives’ speeds as shown in these specs:
Smart IOPS Data Engine NVMe SSD specifications
Smart IOPS also packages one or more of its Data Engine SSDs in a rack-mounted Flash Appliance.
The on-board Xilinx Kintex UltraScale FPGA implements all of the functions in the Smart IOPS Data Engine including the PCIe Gen3 host interface; NAND Flash control; and of course the company’s proprietary, patent-pending, speed-multiplying TruRandom heuristics.
Today marks the launch of the Xilinx Reconfigurable Acceleration Stack for reducing the programming hurdles associated with accelerating workloads in hyperscale datacenters in three acceleration stack categories:
Here’s a graphical overview of the material you’ll find in the Xilinx Reconfigurable Acceleration Stack:
The Xilinx Reconfigurable Acceleration Stack already includes several libraries:
DNN – A Deep Neural Network (DNN) library from Xilinx, which is a highly optimized library for building deep learning inference applications. This library is designed for maximum compute efficiency at 16-bit and 8-bit integer data types.
GEMM – A General Matrix Multiply (GEMM) library from Xilinx, which is based on the level-3 Basic Linear Algebra Subprograms (BLAS). This library delivers optimized performance at 16-bit and 8-bit integer data types and supports matrices of any size.
HEVC Decoder & Encoder – HEVC/H.265 is the latest video-compression standard from the MPEG and ITU standards bodies. HEVC/H.265 is the successor to the H.264 video-compression standard and it can reduce video bandwidth requirements by as much as 50% relative to H.264. Xilinx provides two HEVC/H.265 video encoders: a high-quality, flexible, real-time encoder to address the majority of video-centric data-center workloads and an alternate encoder for non-camera generated content. One decoder supports all forms of encoded HEVC/H.265 video from either encoder.
Data Mover (SQL) – The SQL data-mover library makes it easy to accelerate data analytics workloads using a Xilinx FPGA. The data-mover library orchestrates standard connections to SQL databases by sending blocks of data from database tables to the FPGA accelerator card’s on-chip memory over a PCIe interface. The library automatically maximizes PCIe bandwidth between the host CPU and the FPGA-based hardware accelerator.
Compute Kernel (SQL) – A library that accelerates numerous core SQL functions on the FPGA hardware accelerator including decimal type, date type, scan, compare, and filter. The library’s compute functions optimally exploit the on-board FPGA’s massive hardware parallelism.
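The division of labor between the two SQL libraries can be pictured with a small software model. This is only a conceptual sketch, not the Xilinx API: `move_blocks` stands in for the data mover that ships blocks of table rows to the accelerator, and `filter_kernel` stands in for a scan/compare/filter compute kernel. The row schema, the block size, and every function name here are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Row = Tuple[int, str, float]  # hypothetical schema: (order_id, ship_date, amount)


def move_blocks(rows: Iterable[Row], block_size: int) -> Iterator[List[Row]]:
    """Data-mover stand-in: hand rows to the accelerator one block at a time."""
    block: List[Row] = []
    for row in rows:
        block.append(row)
        if len(block) == block_size:
            yield block
            block = []
    if block:  # final, partially filled block
        yield block


def filter_kernel(block: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Compute-kernel stand-in: scan a block, compare each row, keep the matches."""
    return [row for row in block if predicate(row)]


def accelerated_query(rows: Iterable[Row], predicate: Callable[[Row], bool],
                      block_size: int = 4) -> List[Row]:
    """Run the filter block by block and gather the matching rows."""
    results: List[Row] = []
    for block in move_blocks(rows, block_size):
        results.extend(filter_kernel(block, predicate))
    return results


orders = [(i, f"2016-11-{i + 1:02d}", 10.0 * i) for i in range(10)]
matches = accelerated_query(
    orders, lambda row: row[2] >= 50.0 and row[1] >= "2016-11-06"
)
print([row[0] for row in matches])
```

In this model the blocks are processed one after another on the CPU. The point of the FPGA version is that rows within a block, and several blocks at once, can be compared in parallel in hardware while the data mover keeps the PCIe link busy.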
Three of the top seven hyperscale cloud companies including Baidu have already deployed Xilinx FPGAs for hardware acceleration. Last month, Baidu announced that it had designed a Xilinx Kintex UltraScale FPGA into an accelerator card and was using pools of these cards to accelerate machine learning inference. Qualcomm and IBM have announced strategic collaborations with Xilinx for data-center acceleration and the IBM engagement has already resulted in a storage and networking acceleration framework called CAPI SNAP that eases the creation of accelerated applications such as NoSQL using Xilinx FPGAs. (See last month’s Xcell Daily blog post “OpenPOWER’s CAPI SNAP Framework eases the task of developing high-performance, FPGA-based accelerators for data centers.”)
In addition, Xilinx has been leading an industry initiative toward the development of an intelligent, cache-coherent interconnect called CCIX. Xilinx, along with AMD, ARM, Huawei, IBM, Mellanox, and Qualcomm, formed the CCIX Consortium in May 2016. The initiative’s membership has tripled in just five months, and the CCIX Consortium has announced the Release1 specification covering the physical, data-link, and protocol layers, which is now available to the consortium’s members. (See “CCIX Consortium develops Release1 of its fully cache-coherent interconnect specification, grows to 22 members.”)
Today, Xilinx announced four members of a new Virtex UltraScale+ HBM device family that combines high-performance 16nm Virtex UltraScale+ FPGAs with 32 or 64Gbits of HBM (high-bandwidth memory) DRAM in one device. The resulting devices deliver a 20x improvement in memory bandwidth relative to DDR SDRAM—more than enough to keep pace with the needs of 400G Ethernet, multiple 8K digital-video channels, or high-performance hardware acceleration for cloud servers.
These new Virtex UltraScale+ HBM devices are part of the 3rd generation of Xilinx 3D FPGAs, which began with the Virtex-7 2000T that Xilinx started shipping way, way back in 2011. (See “Generation-jumping 2.5D Xilinx Virtex-7 2000T FPGA delivers 1,954,560 logic cells using 6.8 BILLION transistors (PREVIEW!)”) Xilinx co-developed this 3D IC technology with TSMC and the Virtex UltraScale+ HBM devices represent the current, production-proven state of the art.
Here’s a table listing salient features of these four new Virtex UltraScale+ HBM devices:
Each of these devices incorporates 32 or 64Gbits of HBM DRAM with more than 1000 I/O lines connecting each HBM stack through the silicon interposer to the logic device, which contains a hardened HBM memory controller that manages one or two HBM devices. This memory controller has 32 high-performance AXI channels, allowing high-bandwidth interconnect to the Virtex UltraScale+ devices’ programmable logic and access to many routing channels in the FPGA fabric. Any AXI port can access any physical memory location in the HBM devices.
In addition, these Virtex UltraScale+ HBM FPGAs are the first Xilinx devices to offer the new, high-performance CCIX cache-coherent interface announced just last month. (See “CCIX Consortium develops Release1 of its fully cache-coherent interconnect specification, grows to 22 members.”) CCIX simplifies the design of offload accelerators for hyperscale data centers by providing low-latency, high-bandwidth, fully coherent access to server memory. The specification employs a subset of full coherency protocols and is ISA-agnostic, meaning that the specification’s protocols are independent of the attached processors’ architecture and instruction sets. CCIX pairs well with HBM and the new Xilinx UltraScale+ HBM FPGAs provide both in one package.
Here’s an 8-minute video with additional information about the new Virtex UltraScale+ HBM devices:
Are you attending Supercomputing 2016 (SC16) in Salt Lake City next week? Would you like to learn about reconfigurable hardware acceleration for data centers? (Hint: Think superior performance/watt.) Well, you’re in luck. There’s a free, 1-hour briefing on this topic taking place right next door to the conference in the Utah Museum of Contemporary Art on the morning of November 16.
Xilinx is hosting the briefing and if you’d like to attend this space-limited event and hear from Xilinx engineers and researchers about how FPGAs are accelerating the widest range of data center workloads, click here to learn more and to register.