The 1-minute video appearing below shows two 56Gbps, PAM-4 demos from the recent OFC 2017 conference. The first demo shows a CEI-56G-MR (medium-reach, 50cm, chip-to-chip and low-loss backplane) connection between a Xilinx 56Gbps PAM-4 test chip communicating through a QSFP module over a cable to a Credo device. A second PAM-4 demo using CEI-56G-LR (long-reach, 100cm, backplane-style) interconnect shows a Xilinx 56Gbps PAM-4 test chip communicating over a Molex backplane to a Credo device, which is then communicating with a GlobalFoundries device over an FCI backplane, which is then communicating over a TE backplane back to the Xilinx device. This second demo illustrates the growing, multi-company ecosystem supporting PAM-4.
For more information about the Xilinx PAM-4 test chip, see “3 Eyes are Better than One for 56Gbps PAM4 Communications: Xilinx silicon goes 56Gbps for future Ethernet,” and “Got 90 seconds to see a 56Gbps demo with an instant 2x upgrade from 28G to 56G backplane? Good!”
Looking for a relatively painless overview of the current state of the art for high-speed Ethernet used in data centers and for telecom? You should take a look at this just-posted, 30-minute video of a panel discussion at OFC2017 titled “400GE from Hype to Reality.” The panel members included:
Gustlin starts by discussing the history of 400GbE’s development, starting with a study group organized in 2013. Today, the 400GbE spec is at draft 3.1 and the plan is to produce a final standard by December 2017.
Booth answers a very simple question in his talk: “”Yes, we ill” use 400GbE in the data center. He then proceeds to give a fairly detailed description of the data centers and networking used to create Microsoft’s Azure cloud-computing platform.
Ofelt describes the genesis of the 400GbE standard. Prior to 400G, says Ofelt, system vendors worked with end users (primarily telecom companies) to develop faster Ethernet standards. Once a standard appeared, ther would be a deployment ramp. Although 400GbE development started that way, the people building hyperscale data centers sort of took over and they want to deploy 400GbE at scale, ASAP.
Don’t be fooled by the title of this panel. There’s plenty of discussion about 25GbE through 100GbE and 200GbE as well, so if you’re needing a quick update on high-speed Ethernet’s status, this 30-minute video is for you.
As of today, Amazon Web Services (AWS) has made the FPGA-accelerated Amazon EC2 F1 compute instance generally available to all AWS customers. (See the new AWS video below and this Amazon blog post.) The Amazon EC2 F1 compute instance allows you to create custom hardware accelerators for your application using cloud-based server hardware that incorporates multiple Xilinx Virtex UltraScale+ VU9P FPGAs. Each Amazon EC2 F1 compute instance can include as many as eight FPGAs, so you can develop extremely large and capable, custom compute engines with this technology. According to the Amazon video, use of the FPGA-accelerated F1 instance can accelerate applications in diverse fields such as genomics research, financial analysis, video processing (in addition to security/cryptography and machine learning) by as much as 30x over general-purpose CPUs.
Access through Amazon’s FPGA Developer AMI (an Amazon Machine Image within the Amazon Elastic Compute Cloud (EC2)) and the AWS Hardware Developer Kit (HDK) on Github. Once your FPGA-accelerated design is complete, you can register it as an Amazon FPGA Image (AFI), and deploy it to your F1 instance in just a few clicks. You can reuse and deploy your AFIs as many times, and across as many F1 instances as you like and you can list it in the AWS Marketplace.
The Amazon EC2 F1 compute instance reduces the time a cost needed to develop secure, FPGA-accelerated applications in the cloud and has now made access quite easy through general availability.
Here’s the new AWS video with the general-availability announcement:
The Amazon blog post announcing general availability lists several companies already using the Amazon EC2 F1 instance including:
You are never going to get past a certain performance barrier by compiling C for a software-programmable processor. At some point, you need hardware acceleration.
As an analogy: You can soup up a car all you want; it’ll never be an airplane.
Sure, you can bump the processor clock rate. You can add processor cores and distribute the tasks. Both of these approaches increase power consumption, so you’ll need a bigger and more expensive power supply; they increase heat generation, which means you will need better cooling and probably a bigger heat sink or a fan (or another fan); and all of these things increase BOM costs.
Are you sure you want to take that path? Really?
OK, you say. This blog’s from an FPGA company (actually, Xilinx is an “All Programmable” company), so you’ll no doubt counsel me to use an FPGA to accelerate these tasks and I don’t want to code in Verilog or VHDL, thank you very much.
Not a problem. You don’t need to.
You can get the benefit of hardware acceleration while coding in C or C++ using the Xilinx SDSoC development environment. SDSoC produces compiled software automatically coupled to hardware accelerators and all generated directly from your high-level C or C++ code.
That’s the subject of a new Chalk Talk video just posted on the eejournal.com Web site. Here’s one image from the talk:
This image shows three complex embedded tasks and the improvements achieved with hardware acceleration:
A beefier software processor or multiple processor cores will not get you 1000x more performance—or even 30x—no matter how you tweak your HLL code, and software coders will sweat bullets just to get a few percentage points of improvement. For such big performance leaps, you need hardware.
Here’s the 14-minute Chalk Talk video:
Samtec recorded a demo of its FireFly FQSFP twinax cable assembly carrying four 28Gbps lanes from a Xilinx Virtex UltraScale+ VU9P FPGA on a VCU118 eval board to a QSFP optical cage at the recent OFC 2017 conference in Los Angeles. (The Virtex UltraScale+ VU9P FPGA has 120 GTY transceivers capable of 32.75Gbps operation and the VCU118 eval kit includes the Samtec FireFly daughtercard with cable assembly.) Samtec’s FQSFP assembly plugs mid-board into a FireFly connector on the VCU118 board. The 28Gbps signals then “fly over” the board through to the QSFP cage and loop back over the same path, where they are received back into the FPGA. The demonstration shows 28Gbps performance on all four links with zero bit errors.
As explained in the video, the advantage to using the Samtec FireFly flyover system is that it takes the high-speed 28Gbps signals out of the pcb-design equation, making the pcb easier to design and less expensive to manufacture. Significant savings in pcb manufacturing cost can result for large board designs, which no longer need to deal with signal-integrity issues and controlled-impedance traces for such high-speed routes.
Samtec has now posted the 2-minute video from OFC 2017 on YouTube and here it is:
Note: Martin Rowe recently published a related technical article about the Samtec FireFly system titled "High-speed signals jump over PCB traces" on the EDN.com Web site.
Here’s a 90-second video showing a 56Gbps Xilinx test chip with a 56Gbps PAM4 SerDes transceiver operating with plenty of SI margin and better than 10-12 error rate over a backplane originally designed for 28Gbps operation.
Note: This working demo employs a Xilinx test chip. The 56Gbps PAM4 SerDes is not yet incorporated into a product. Not yet.
For more information about this test chip, see “3 Eyes are Better than One for 56Gbps PAM4 Communications: Xilinx silicon goes 56Gbps for future Ethernet.”
Today, Xilinx posted information about the new $2995 Kintex UltraScale+ KCU116 Eval Kit on Xilinx.com. If you’re looking to get into the UltraScale+ FPGAs’ GTY transceiver races—to 32.75Gbps—this is a great kit to start with. The kit includes:
Here’s a nice shot of the KCU116 board from the kit’s quickstart guide:
Kintex UltraScale+ KCU116 Eval Board
One of the key features of this board are the four SFP+ optical cages there on the left. Those handle 25Gbps optical modules, driven of course by four of the KU5P FPGA’s GTY transceivers.
Take a look.
Xcell Daily discussed DeePhi Tech’s Zynq-based CNN acceleration processor last year in connection with the Hot Chips 2016 conference. (See “DeePhi’s Zynq-based CNN processor is faster, more energy efficient than CPUs or GPUs.”) DeePhi’s founder Song Yao appears in a new Powered by Xilinx video this week giving many more details including some fascinating information about an early customer, ZeroTech—China’s second largest drone maker.
DeePhi provides the entire stack needed to develop machine-learning applications based on neural networks including the development software, algorithms, and a neural-network processor that runs efficiently on the Xilinx Zynq SoC. This technology is particularly good for deep-learning, vision-based embedded apps such as drones, robotics, surveillance cameras, and for cloud-computing applications as well.
The video also provides more details on ZeroTech’s use of DeePhi’s machine-learning technology for object detection, pedestrian detection, and gesture recognition—all in a drone that nestles in your hand.
Song Yao explains that DeePhi’s tools provide a GPU-like development environment while taking advantage of the superior efficiency of neural networks implemented with programmable logic. In addition, DeePhi can change the neural network’s architecture to further optimize the design for specific applications.
Finally, he explains that you can use these Zynq-based implementations in applications where GPUs will simply not work due to power-consumption restrictions. In fact, last year at Hot Chips 2016 he reportedly said, “The FPGA based DPU platform achieves an order of magnitude higher energy efficiency over GPU on image recognition and speech detection.”
Here’s the new, 3-minute Powered by Xilinx video:
Adam Taylor and Xilinx’s Sr. Product Manager for SDSoC and Embedded Vision Nick Ni have just published an article on the EE News Europe Web site titled “Machine learning in embedded vision applications.” That title’s pretty self-explanatory, but there are a few points I’d like to highlight. Then you can go read the full article yourself.
As the article states, “Machine learning spans several industry mega trends, playing a very prominent role within not only Embedded Vision (EV), but also Industrial Internet of Things (IIoT) and Cloud Computing.” In other words, if you’re designing products for any embedded market, you might well find yourself at a competitive disadvantage if you’re not adding machine-learning features to your road map.
This article closely ties machine learning with neural networks (including Feed-forward Neural Networks (FNNs), Recurrent Neural Networks (RNNs), and Deep Neural Networks (DNNs), and Convolutional Neural Networks (CNNs)). Neural networks are not programmed; they’re trained. Then, if they’re part of an embedded design, they’re deployed. Training is usually done using floating-point neural-network implementations but, for efficiency (power and cost), deployed neural networks can use fixed-point representations with very little or no loss of accuracy. (See “Counter-Intuitive: Fixed-Point Deep-Learning Inference Delivers 2x to 6x Better CNN Performance with Great Accuracy.”)
The programmable logic inside of Xilinx FPGAs, Zynq SoCs, and Zynq UltraScale+ MPSoCs is especially good at implementing fixed-point neural networks, as described in this article by Nick Ni and Adam Taylor. (Go read the article!)
Meanwhile, this is a good time to remind you of the recent Xilinx introduction of the reVISION stack for neural network development using Xilinx All Programmable devices. For more information about the Xilinx reVISION stack, see:
Image Matters’ Origami B20 module, based on a Xilinx Kintex UltraScale KU060 FPGA, is a small 94x53mm module that you can use to perform all sorts of high-speed processing. (See “Image Matters launches Origami Ecosystem for developing advanced 4K/8K video apps using the FPGA-based Origami module.”) For example, you can use it for a variety of video-compression applications using various IP compression cores including MPEG, JPEG-2000, and TICO. You can also use it for cloud-computing and neural-network applications such as image detection. The key thing is that the small Origami B20 module puts everything you need to run the FPGA on the one small module including SDRAM, Flash memory, the power supply, a backup battery, and security features (including tamper protection).
Here’s a short, 2.5-minute, Powered by Xilinx video with more information about the Origami B20 module:
Next week at OFC 2017 in Los Angeles, Acacia Communications, Optelian, Precise-ITC, Spirent, and Xilinx will present the industry’s first interoperability demo supporting 200/400GbE connectivity over standardized OTN and DWDM. Putting that succinctly, the demo is all about packing more bits/λ, so that you can continue to use existing fiber instead of laying more.
Callite-C4 400GE/OTN Transponder IP from Precise-ITC instantiated in a Xilinx Virtex UltraScale+ VU9P FPGA will map native 200/400GbE traffic—generated by test equipment from Spirent—into 2x100 and 4x100 OTU4-encapsulated signals. The 200GbE and 400GbE standards are still in flux, so instantiating the Precise-ITC transponder IP in an FPGA allows the design to quickly evolve with the standards with no BOM or board changes. Concise translation: faster time to market with much less risk.
Callite-C4 400GE/OTN Transponder IP Block Diagram
Optelian’s TMX-2200 200G muxponder, scheduled for release later this year, will muxpond the OTU4 signals into 1x200Gbps or 2x200Gbps DP-16QAM using Acacia Communications’ CFP2-DCO coherent pluggable transceiver.
The Optelian and Precise-ITC exhibit booths at OFC 2017 are 4139 and 4141 respectively.
This week, EETimes’ Junko Yoshida published an article titled “Xilinx AI Engine Steers New Course” that gathers some comments from industry experts and from Xilinx with respect to Monday’s reVISION stack announcement. To recap, the Xilinx reVISION stack is a comprehensive suite of industry-standard resources for developing advanced embedded-vision systems based on machine learning and machine inference.
As Xilinx Senior Vice President of Corporate Strategy Steve Glaser tells Yoshida, “Xilinx designed the stack to ‘enable a much broader set of software and systems engineers, with little or no hardware design expertise to develop, intelligent vision guided systems easier and faster.’”
“While talking to customers who have already begun developing machine-learning technologies, Xilinx identified ‘8 bit and below fixed point precision’ as the key to significantly improve efficiency in machine-learning inference systems.”
Yoshida also interviewed Karl Freund, Senior Analyst for HPC and Deep Learning at Moor Insights & Strategy, who said:
“Artificial Intelligence remains in its infancy, and rapid change is the only constant.” In this circumstance, Xilinx seeks “to ease the programming burden to enable designers to accelerate their applications as they experiment and deploy the best solutions as rapidly as possible in a highly competitive industry.”
She also quotes Loring Wirbel, a Senior Analyst at The Linley group, who said:
“What’s interesting in Xilinx's software offering, [is that] this builds upon the original stack for cloud-based unsupervised inference, Reconfigurable Acceleration Stack, and expands inference capabilities to the network edge and embedded applications. One might say they took a backward approach versus the rest of the industry. But I see machine-learning product developers going a variety of directions in trained and inference subsystems. At this point, there's no right way or wrong way.”
There’s a lot more information in the EETimes article, so you might want to take a look for yourself.
Next week at the OFC Optical Networking and Communication Conference & Exhibition in Los Angeles, Xilinx will be in the Ethernet Alliance booth demonstrating the industry’s first, standard-based, multi-vendor 400GE network. A 400GE MAC and PCS instantiated in a Xilinx Virtex UltraScale+ VU9P FPGA will be driving a Finisar 400GE CFP8 optical module, which in turn will communicate with a Spirent 400G test module over a fiber connection.
In addition, Xilinx will be demonstrating:
If you’re visiting OFC, be sure to stop by the Xilinx booth (#1809).
MRAM (magnetic RAM) maker Everspin wants to make it easy for you to connect its 256Mbit DDR3 ST-MRAM devices (and it’s soon-to-be-announced 1Gbit ST-MRAMs) to Xilinx UltraScale FPGAs, so it now provides a software script for the Vivado MIG (Memory Interface Generator) that adapts the MIG DDR3 controller to the ST-MRAM’s unique timing and control requirements. Everspin has been shipping MRAMs for more than a decade and, according to this EETimes.com article by Dylan McGrath, it’s still the only company to have shipped commercial MRAM devices.
Nonvolatile MRAM’s advantage is that it has no wearout failure, as opposed to Flash memory for example. This characteristic gives MRAM huge advantages over Flash memory in applications such as server-class enterprise storage. MRAM-based storage cards require no wear leveling and their read/write performance does not degrade over time, unlike Flash-based SSDs.
As a result, Everspin also announced its nvNITRO line of NVMe storage-accelerator cards. The initial cards, the 1Gbyte nvNITRO ES1GB and 2Gbyte nvNITRO ES2GB, deliver 1,500,000 IOPS with 6μsec end-to-end latency. When Everspin's 1Gbit ST-MRAM devices become available later this year, the card capacities will increase to 4 to 16Gbytes.
Here’s a photo of the card:
Everspin nvNITRO Storage Accelerator
If it looks familiar, perhaps you’re recalling the preview of this board from last year’s SC16 conference in Salt Lake City. (See “Everspin’s NVMe Storage Accelerator mixes MRAM, UltraScale FPGA, delivers 1.5M IOPS.”)
If you look at the photo closely, you’ll see that the hardware platform for this product is the Alpha Data ADM-PCIE-KU3 PCIe accelerator card, loaded 1 or 2Gbyte Everspin ST-MRAM DIMMs. Everspin has added its own IP to the Alpha Data card, based on a Kintex UltraScale KU060 FPGA, to create an MRAM-based NVMe controller.
As I wrote in last year’s post:
“There’s a key point to be made about a product like this. The folks at Alpha Data likely never envisioned an MRAM-based storage accelerator when they designed the ADM-PCIE-KU3 PCIe accelerator card but they implemented their design using an advanced Xilinx UltraScale FPGA knowing that they were infusing flexibility into the design. Everspin simply took advantage of this built-in flexibility in a way that produced a really interesting NVMe storage product.”
It’s still an interesting product, and now Everspin has formally announced it.
A paper describing the superior performance of an FPGA-based, speech-recognition implementation over similar implementations on CPUs and GPUs won a Best Paper Award at FPGA 2017 held in Monterey, CA last month. The paper—titled “ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA” and written by authors from Stanford U, DeePhi Tech, Tsinghua U, and Nvidia—describes a speech-recognition algorithm using LSTM (Long Short-Term Memory) models with load-balance-aware pruning implemented on a Xilinx Kintex UltraScale+ KU060 FPGA. The implementations runs at 200MHz and draws 41W (for the FPGA board) slotted into a PCIe chassis. Compared to Core i7 CPU/Pascal Titan X GPU implementations of the same algorithm, the FPGA-based implementation delivers 43x/3x more raw performance and 40x/11.5x better energy efficiency, according to the FPGA 2017 paper. So the FPGA implementation is both faster and more energy-efficient. Pick any two.
Here’s a block diagram of the resulting LSTM speech-recognition design:
The paper describes the algorithm and implementation in detail, which probably contributed to this paper winning the conference’s Best Paper Award. This work was supported by the National Natural Science Foundation of China.
Today, Aldec announced its latest FPGA-based HES prototyping board—the HES-US-440—with a whopping 26M ASIC gate capacity. This board is based on the Xilinx Virtex UltraScale VU440 FPGA and it also incorporates a Xilinx Zynq Z-7100 SoC that acts as the board’s peripheral controller and host interface. The announcement includes a new release of Aldec’s HES-DVM Hardware/Software Validation Platform that enables simulation acceleration and emulation use modes for the HES-US-440 board in addition to the physical prototyping capabilities. You can also use this prototyping board directly to implement HPC (high-performance computing) applications.
Aldec HES-US-440 Prototyping Board, based on a Xilinx Virtex UltraScale VU440 FPGA
The Aldec HES-US-440 board packs a wide selection of external interfaces to ease your prototyping work including four FMC HPC connections, PCIe, USB 3.0 and USB 2.0 OTG, UART/USB bridge, QSFP+, 1Gbps Ethernet, HDMI, SATA; has on-board NAND and SPI Flash memories; and incorporates two microSD slots.
Here’s a block diagram of the HES-US-440 prototyping board:
Aldec HES-US-440 Prototyping Board Block Diagram
For more information about the Aldec HES-US-440 prototyping board and Aldec’s HES-DVM Hardware/Software Validation Platform, please contact Aldec directly.
Berten DSP’s GigaX API for the Xilinx Zynq SoC creates a high-speed, 200Mbps full-duplex communications channel between a GbE port and the Zynq SoC’s PS (programmable logic) through an attached SDRAM buffer and an AXI DMA controller IP block. Here’s a diagram to clear up what’s happening:
The software API implements IP filtering and manages TCP/UDP headers, which help you implement a variety of hardware-accelerated Ethernet systems including Ethernet bridges, programmable network nodes, and network offload appliances. Here’s a performance curve illustrating the kind of throughput you can expect:
Please contact Berten DSP directly for more information about the GigaX API.
With a month left in the Indiegogo funding period, the MATRIX Voice open-source voice platform campaign stands at 289% of its modest $5000 funding goal. MATRIX Voice is the third crowdfunding project by MATRIX Labs, based on Miami, Florida. The MATRIX Voice platform is a 3.14-inch circular circuit board capable of continuous voice recognition and compatible with the latest voice-based, cognitive cloud-based services including Microsoft Cognitive Service, Amazon Alexa Voice Service, Google Speech API, Wit.ai, and Houndify. The MATRIX Voice board, based on a Xilinx Spartan-6 LX4 FPGA, is designed to plug directly onto a low-cost Raspberry Pi single-board computer or it can be operated as a standalone board. You can get one of these boards, due to be shipped in May, for as little as $45—if you’re quick. (Already, 61 of the 230 early-bird special-price boards are pledged.)
Here’s a photo of the MATRIX Voice board:
This image of the top of the MATRIX Voice board shows the locations for the seven rear-mounted MEMS microphones, seven RGB LEDs, and the Spartan-6 FPGA. The bottom of the board includes a 64Mbit SDRAM and a connector for the Raspberry Pi board.
Because this is the latest in a series of developer boards from MATRIX Labs (see last year’s project: “$99 FPGA-based Vision and Sensor Hub Dev Board for Raspberry Pi on Indiegogo—but only for the next two days!”), there’s already a sophisticated, layered software stack for the MATRIX Voice platform that include a HAL (Hardware Abstraction Layer) with the FPGA code and C++ library, an intermediate layer with a streaming interface for the sensors and vision libraries (for the Raspberry Pi camera), and a top layer with the MATRIX OS and high-level APIs. Here’s a diagram of the software stack:
And now, who better to describe this project than the originators:
As the BittWare video below explains, CPUs are simply not able to process 100GbE packet traffic without hardware acceleration. BittWare’s new Streamsleuth, to be formally unveiled at next week’s RSA Conference in San Francisco (Booth S312), adroitly handles blazingly fast packet streams thanks to a hardware assist from an FPGA. And as the subhead in the title slide of the video presentation says, StreamSleuth lets you program its FPGA-based packet-processing engine “without the hassle of FPGA programming.”
(Translation: you don’t need Verilog or VHDL proficiency to get this box working for you. You get all of the FPGA’s high-performance goodness without the bother.)
That said, as BittWare’s Network Products VP & GM Craig Lund explains, this is not an appliance that comes out of the box ready to roll. You need (and want) to customize it. You might want to add packet filters, for example. You might want to actively monitor the traffic. And you definitely want the StreamSleuth to do everything at wire-line speeds, which it can. “But one thing you do not have to do, says Lund, “is learn how to program an FPGA.” You still get the performance benefits of FPGA technology—without the hassle. That means that a much wider group of network and data-center engineers can take advantage of BittWare’s StreamSleuth.
As Lund explains, “100GbE is a different creature” than prior, slower versions of Ethernet. Servers cannot directly deal with 100GbE traffic and “that’s not going to change any time soon.” The “network pipes” are now getting bigger than the server’s internal “I/O pipes.” This much traffic entering a server this fast clogs the pipes and also causes “cache thrash” in the CPU’s L3 cache.
Sounds bad, doesn’t it?
What you want is to reduce the network traffic of interest down to something a server can look at. To do that, you need filtering. Lots of filtering. Lots of sophisticated filtering. More sophisticated filtering than what’s available in today’s commodity switches and firewall appliances. Ideally, you want a complete implementation of the standard BPF/pcap filter language running at line rate on something really fast, like a packet engine implemented in a highly parallel FPGA.
The same thing holds true for attack mitigation at 100Gbe line rates. Commodity switching hardware isn’t going to do this for you at 100GbE (10GbE yes but 100GbE, “no way”) and you can’t do it in software at these line rates. “The solution is FPGAs” says Lund, and BittWare’s StreamSleuth with FPGA-based packet processing gets you there now.
Software-based defenses cannot withstand Denial of Service (DoS) attacks at 100GbE line rates. FPGA-accelerated packet processing can.
So what’s that FPGA inside of the BittWare Streamsleuth doing? It comes preconfigured for packet filtering, load balancing, and routing. (“That’s a Terabit router in there.”) To go beyond these capabilities, you use the BPF/pcap language to program your requirements into the the StreamSleuth’s 100GbE packet processor using a GUI or APIs. That packet processor is implemented with a Xilinx Virtex UltraScale+ VU9P FPGA.
Here’s what the guts of the BittWare StreamSleuth look like:
And here’s a block diagram of the StreamSleuth’s packet processor:
The Virtex UltraScale+ FPGA resides on a BittWare XUPP3R PCIe board. If that rings a bell, perhaps you read about that board here in Xcell Daily last November. (See “BittWare’s UltraScale+ XUPP3R board and Atomic Rules IP run Intel’s DPDK over PCIe Gen3 x16 @ 150Gbps.”)
Finally, here’s the just-released BittWare StreamSleuth video with detailed use models and explanations:
For more information about the StreamSleuth, contact BittWare directly or go see the company’s StreamSleuth demo at next week’s RSA conference. For more information about the packet-processing capabilities of Xilinx All Programmable devices, click here. And for information about the new Xilinx Reconfigurable Acceleration Stack, click here.
Amazon Web Services (AWS) rolled out the F1 instance for cloud application development based on Xilinx Virtex UltraScale+ Plus VU0P FPGAs last November. (See “Amazon picks Xilinx UltraScale+ FPGAs to accelerate AWS, launches F1 instance with 8x VU9P FPGAs per instance.) It appears from the following LinkedIn post that people are using it already to do some pretty interesting things:
If you’re interested in Cloud computing applications based on the rather significant capabilities of Xilinx-based hardware application acceleration, check out the Xilinx Acceleration Zone.
Accolade’s newly announced ATLAS-1000 Fully Integrated 1U OEM Application Acceleration Platform pairs a Xilinx Kintex UltraScale KU060 FPGA on its motherboard with an Intel x86 processor on a COM Express module to create a network-security application accelerator. The ATLAS-1000 platform integrates Accolade’s APP (Advanced Packet Processor), instantiated in the Kintex UltraScale FPGA, which delivers acceleration features for line-rate packet processing including lossless packet capture, nanosecond-precision packet timestamping, packet merging, packet filtering, flow classification, and packet steering. The platform accepts four 10G SFP+ or two 40G QSFP pluggable optical modules. Although the ATLAS-1000 is designed as a flow-through security platform, especially for bump-in-the-wire applications, there’s also 1Tbyte worth of on-board local SSD storage.
Accolade Technology's ATLAS-1000 Fully Integrated 1U OEM Application Acceleration Platform
Here’s a block diagram of the ATLAS-1000 platform:
All network traffic enters the FPGA-based APP for packet processing. Packet data is then selectively forwarded to the x86 CPU COM Express module depending on the defined application policy.
Please contact Accolade Technology directly for more information about the ATLAS-1000.
Aquantia has packed its Ethernet PHY—capable of operating at 10Gbps over 100m of Cat 6a cable (or 5Gbps down to 100Mbps over 100m of Cat 5e cable)—with a Xilinx Kintex-7 FPGA, creating a universal Gigabit Ethernet component with extremely broad capabilities. Here’s a block diagram of the new AQLX107 device:
This Aquantia device gives you a space-saving, one-socket solution for a variety of Ethernet designs including controllers, protocol converters, and anything-to-Ethernet bridges.
Please contact Aquantia for more information about this unique Ethernet chip.
The Linley Cloud Hardware Conference (formerly our Data Center Conference) is coming to the Hyatt Regency Hotel in Santa Clara, CA on February 8. This full-day, single-track event focuses on the processors, accelerators, Ethernet controllers, new memory technologies, and interconnects used for cloud computing and networking. The conference includes a special afternoon panel titled “Accelerating the Cloud” that will be moderated by The Linley Group’s Principal Analyst Jag Bolaria with the following panelists:
Until Feb 2, you can snag a free ticket to the conference (and a free breakfast) by clicking here if you’re a cloud-service provider, network-service provider, network-equipment vendor, server OEM, system designer, software developer, member of the press, or work in the financial community. (That’s a pretty wide net.) After that date, it’s going to cost you $195 to attend if you’re in that net or $795 if you’re not.
Time to start swimming.
Edico Genome and Dell EMC have developed a bundled compute-and-storage solution for rapid, cost-effective and accurate analysis of next-generation bio-sequencing data. The bundle consists of Edico Genome’s DRAGEN processor integrated into a 1U Dell 4130 server with Dell EMC’s Isilon scale-out networked attached storage (NAS). Edico Genome’s DRAGEN bio-IT processor is designed to analyze sequencing data quickly using the hardware acceleration of a Xilinx FPGA. (For more information about the Edico Genome DRAGEN processor, see “FPGA-based Edico Genome Dragen Accelerator Card for IBM OpenPOWER Server Speeds Exome/Genome Analysis by 60x.”)
For more information about this Edico/Dell bio-processor bundle, click here.
Work started on CCIX, the cache-coherent interconnect for accelerators, a little over a year ago. The CCIX specification describes an interconnect that makes workload handoff from server CPUs to hardware accelerators as simple as passing a pointer. This capability enables a whole new class of accelerated data center applications.
Xilinx VP of Silicon architecture Gaurav Singh discussed CCIX at the recent Xilinx Technology Briefing held at SC16 in Salt Lake City. His talk covers many CCIX details and you can watch him discuss these topics in this 9-minute video from the briefing:
The video below shows Ravi Sunkavalli, the Xilinx Sr. Director of Data Center Solutions, discussing how advanced FPGAs like devices based on the Xilinx UltraScale architecture can aid you in developing high-speed networking and storage equipment as data centers migrate to faster internal networking speeds. Sunkavalli posits that CPUs, which are largely used for networking and storage applications connected with today’s 10G networks, quickly run out of gas at 40G and 100G networking speeds. FPGAs can provide “bump-in-the-wire” acceleration for high-speed networking ports thanks to the large number of fast compute elements and the high-speed transceivers incorporated into devices like the Xilinx UltraScale and UltraScale+ FPGAs.
Examples of networking applications already handled by FPGAs include VNF (Virtual Network Functions) such as VPNs, firewalls, and security. FPGAs are already being used to implement high-speed data center storage functions such as error correction, compression, and security.
The following 8-minute video was recorded during a Xilinx technology briefing at the recent SC16 conference in Salt Lake City:
Last November at SC16 in Salt Lake City, Xilinx Distinguished Engineer Ashish Sirasao gave a 10-minute talk on deploying deep-learning applications using FPGAs with significant performance/watt benefits. Sirasao started by noting that we’re already knee-deep in machine-learning applications: spam filters; cloud-based and embedded voice-to-text converters; and Amazon’s immensely successful, voice-operated Alexa are all examples of extremely successful machine-learning apps in broad use today. More—many more—will follow. These applications all have steep computing requirements.
There are two phases in any machine-learning application. The first is training and the second is deployment. Training is generally done using floating-point implementations so that application developers need not worry about numeric precision. Training is a 1-time event so energy efficiency isn’t all that critical.
Deployment is another matter however.
Putting a trained deep-learning application in a small appliance like Amazon’s Alexa calls for attention to factors such as energy efficiency. Fortunately, said Sirasao, the arithmetic precision of the application can change from training to mass deployment and there are significant energy-consumption gains to be had by deploying fixed-point machine-learning applications. According to Sirasao, you can get accurate machine inference using 8- or 16-bit fixed-point implementations while realizing a 10x gain in energy efficiency for the computing hardware and a 4x gain in memory energy efficiency.
The Xilinx DSP48E2 block implemented in the company’s UltraScale and UltraScale+ devices is especially useful for these machine-learning deployments because its DSP architecture can perform two independent 8-bit operations per clock per DSP block. That translates into nearly double the compute performance, which in turn results in much better energy efficiency. There’s a Xilinx White Paper on this topic titled “Deep Learning with INT8 Optimization on Xilinx Devices.”
Further, Xilinx recently announced its Acceleration Stack for machine-learning (and other cloud-based applications), which allows you to focus on developing your application rather than FPGA programming. You can learn about the Xilinx Acceleration Stack here
Finally, here’s the 10-minute video with Sirasao’s SC16 talk:
Do you have a big job to do? How about a terabit router bristling with optical interconnect? Maybe you need a DSP monster for phased-array radar or sonar. Beamforming for advanced 5G applications using MIMO antennas? Some other high-performance application with mind-blowing processing and I/O requirements?
You need to look at Xilinx Virtex UltraScale+ FPGAs with their massive data-flow and routing capabilities, massive memory bandwidth, and massive I/O bandwidth. These attributes sweep away design challenges caused by performance limits of lesser devices.
Now you can quickly get your hands on a Virtex UltraScale+ Eval Kit so you can immediately start that challenging design work. The new eval kit is the Xilinx VCU118 with an on-board Virtex UltraScale+ VU9P FPGA. Here’s a photo of the board included with the kit:
Xilinx VCU118 Eval Board with Virtex UltraScale+ VU9P FPGA
The VCU118 eval kit’s capabilities spring from the cornucopia of on-chip resources provided by the Virtex UltraScale+ VU9P FPGA including:
If you can’t build what you need with the VCU118’s on-board Virtex UltraScale+ VU9P FPGA—and it’s sort of hard to believe that’s even possible—just remember, there are even larger parts in the Virtex UltraScale+ FPGA family.
Was it just two weeks ago that Amazon announced the FPGA-accelerated F1 instance of its AWS (Amazon Web Services) in developer preview form at Amazon’s AWS re:Invent 2016 event in Las Vegas, Nevada? (See “Amazon picks Xilinx UltraScale+ FPGAs to accelerate AWS, launches F1 instance with 8x VU9P FPGAs per instance.”)
The next day at the same event, NGCodec demonstrated its hardware-accelerated RealityCodec 4K video codec running on the Amazon AWS’ FPGA-accelerated EC2 F1 instance. (See “NGCodec announces high-speed 4K video compression for Amazon AWS’ new FPGA-accelerated EC2 F1 instance.”)
Add two more applications up and running on the FPGA-accelerated AWS EC2 F1 instance:
If you’ve been reading the Xilinx Xcell Daily blogs for a while, then both Ryft and Edico Genome might be familiar to you. Both companies previously implemented high-performance cloud appliances based on Xilinx FPGAs. (See “FPGA-based Ryft ONE search accelerator delivers 100x performance advantage over Apache Spark in the data center” and “FPGA-based Edico Genome Dragen Accelerator Card for IBM OpenPOWER Server Speeds Exome/Genome Analysis by 60x.”)
Now that Amazon is standardizing FPGA-accelerated for cloud computing using multiple Xilinx Virtex UltraScale+ FPGAs, both Ryft and Edico are porting their applications to the Amazon AWS cloud.
Can you say “rapid deployment”? I knew you could.
Today, NGCodec demonstrated its hardware-accelerated RealityCodec 4K video codec running on the Amazon AWS’ FPGA-accelerated EC2 F1 instance, announced just yesterday at Amazon’s AWS re:Invent 2016 event in Las Vegas, Nevada. (See “Amazon picks Xilinx UltraScale+ FPGAs to accelerate AWS, launches F1 instance with 8x VU9P FPGAs per instance.”) Amazon AWS customers will be able to buy NGCodec’s RealityCodec when it becomes available in the AWS Marketplace and will run it on the Amazon EC2 F1 instance, which is a hardware-accelerated offering based on multiple Xilinx UltraScale+ FPGAs packed into the AWS server chassis. NGCodec’s RealityCodec running on the Amazon EC2 F1 instance delivers ultra-high-performance video compression (with up to 4K resolution) and ultra-low, sub-frame latency for cloud-based VR and AR. (See NGCodec’s press release here.)
NGCodec’s RealityCodec is an excellent example of the type of cloud-based application that can benefit from FPGA-based hardware acceleration. Amazon’s announcement yesterday of a standardized hardware-accelerated offering for AWS follows an accelerating trend towards offering FPGA-based acceleration to cloud-services customers. Like Amazon, many of the major cloud service providers have announced deployment of FPGA technology in their Hyperscale data centers to drive their services business in an extremely competitive market. FPGAs are the perfect complement to highly agile cloud computing environments because they are programmable and can be hardware-optimized for any new application or algorithm. The inherent ability of an FPGA to reconfigure and be reprogrammed over time is perhaps its greatest advantage in a fast-moving field.
For more information about FPGA-based hardware acceleration in the data center, check out the new Xilinx Acceleration Zone and take a look at this White Paper from Moor Insights & Strategy, which describes the new Xilinx Reconfigurable Acceleration Stack. (See “Xilinx Reconfigurable Acceleration Stack speeds programming of machine learning, data analytics, video-streaming apps.”)