PrecisionFDA, a cloud-based platform created by the US FDA to benchmark genomic analysis technologies and advance regulatory science, recently conducted a challenge called “Hidden Treasures – Warm Up.” The challenge tested the ability and accuracy of genomic analysis pipelines to find in silico injected variants in FASTQ files from exome sequencing of reference cell lines. (FASTQ files are text-based files used to store biological sequences using ASCII encoding.) PrecisionFDA announced the results of this challenge at the Festival of Genomics, held in Boston on October 4, 2017. There were 86 valid entries from 30 participants. Out of the 86 entries, 45 found all 50 injected variants. Among entries catching all 50 injected variants, Edico Genome’s DRAGEN V2 Germline Pipeline received the highest score in five of the six accuracy metrics: SNP recall and SNP F-score, and indel precision, indel recall and indel F-score. Edico’s entry placed second on the sixth metric: SNP precision.
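The six accuracy metrics named above are precision, recall, and F-score, each computed separately for SNPs and indels. Here is a minimal sketch of the standard definitions. The post does not describe the challenge's actual scoring procedure, so the function name and the example counts are assumptions made for illustration only.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Return (precision, recall, F-score) from variant-call counts.

    Textbook definitions only; this is not the challenge's scoring code.
    """
    called = true_pos + false_pos   # every variant the pipeline reported
    actual = true_pos + false_neg   # every variant in the truth set
    precision = true_pos / called if called else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    # The F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical counts, not taken from the challenge results.
p, r, f = precision_recall_f(true_pos=90, false_pos=10, false_neg=30)
print(f"precision={p:.2f} recall={r:.2f} F-score={f:.2f}")
```

With these made-up counts, precision is 0.90, recall is 0.75, and the F-score is about 0.82. A pipeline must score well on both precision and recall to get a high F-score.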
DRAGEN V2 is the second iteration of the DRAGEN Germline Pipeline, which employs improved sample-specific calibration of the sequencer and sample prep error models, as well as an improved mapper and aligner algorithm. The DRAGEN Genome pipeline is capable of ultra-fast analysis of Next Generation Sequencing (NGS) data, reducing the time required for analyzing a whole genome at 30x coverage from about 10 hours to approximately 22 minutes. This pipeline harnesses the tremendous power of the DRAGEN Bio-IT Platform and includes highly optimized algorithms for mapping, aligning, sorting, duplicate marking, haplotype variant calling, compression and decompression.
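To put the quoted analysis times in perspective, here is a rough sketch of the speed-up they imply. It uses only the "about 10 hours" and "approximately 22 minutes" figures above. The genomes-per-day estimate assumes analyses run back to back with no setup time, which is a simplification rather than a claim from Edico Genome.

```python
BASELINE_MINUTES = 10 * 60   # "about 10 hours" for a 30x whole genome
DRAGEN_MINUTES = 22          # "approximately 22 minutes"
MINUTES_PER_DAY = 24 * 60


def speedup(baseline_minutes, accelerated_minutes):
    """Ratio of the baseline run time to the accelerated run time."""
    return baseline_minutes / accelerated_minutes


def genomes_per_day(minutes_per_genome):
    """Whole genomes finished in 24 hours, assuming back-to-back runs."""
    return MINUTES_PER_DAY // minutes_per_genome


print(f"speed-up: {speedup(BASELINE_MINUTES, DRAGEN_MINUTES):.1f}x")
print(f"genomes/day at baseline: {genomes_per_day(BASELINE_MINUTES)}")
print(f"genomes/day accelerated: {genomes_per_day(DRAGEN_MINUTES)}")
```

Under these assumptions the speed-up is roughly 27x, which takes one analysis system from about 2 genomes per day to about 65.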
Edico Genome’s DRAGEN pipeline is available as an FPGA-based hardware platform for on-site use and in Amazon’s AWS EC2 F1 instance in the AWS cloud. In both cases, the extreme acceleration comes from running the pipeline on Xilinx All Programmable devices. In the case of the AWS EC2 F1 instance, the Xilinx device used is the Virtex UltraScale+ VU9P FPGA.
For more information about Edico Genome’s DRAGEN pipeline, see:
According to this just-posted Xilinx press release: “… Alibaba Cloud, the cloud computing arm of Alibaba Group, has chosen Xilinx for next-generation FPGA acceleration in their public cloud.” Alibaba Cloud is calling this FPGA-accelerated service its “F2” instance and has seen acceleration factors as large as 30x with the F2 instance versus the same applications running on cloud-based CPUs. The company expects its customers to use this new capability to accelerate applications including data analytics, genomics, video processing, and machine learning.
Alibaba Cloud is the largest cloud provider in China and its F2 instances are available as of today.
Twelve student and industry teams competed for 30 straight hours in the Xilinx Hackathon 2017 competition over the weekend at the Summit Retreat Center in the Xilinx corporate facility located in Longmont, Colorado. Each team member received a Digilent PYNQ-Z1 dev board, which is based on a Xilinx Zynq Z-7020 SoC, and then used their fertile imaginations to conceive of and develop working code for an application using the open-source, Python-based PYNQ development environment, which is based on self-documenting Jupyter Notebooks. The online electronics and maker retailer Sparkfun, located just down the street from the Xilinx facility in Longmont, supplied boxes of compatible peripheral boards with sensors and motor controllers to spur the team members’ imaginations. Several of the teams came from local universities including the University of Colorado at Boulder and the Colorado School of Mines in Golden, Colorado. At the end of the competition, eleven of the teams presented their results using their Jupyter Notebooks. Then came the prizes.
For the most part, team members had never used the PYNQ-Z1 boards and were not familiar with using programmable logic. In part, that was the intent of the Hackathon—to connect teams of inexperienced developers with appropriate programming tools and see what develops. That’s also the reason that Xilinx developed PYNQ: so that software developers and students could take advantage of the improved embedded performance made possible by the Zynq SoC’s programmable hardware without having to use ASIC-style (HDL) design tools to design hardware (unless they want to do so, of course).
Here are the projects developed by the teams, in the order presented during the final hour of the Hackathon (links go straight to the teams’ Github repositories with their Jupyter notebooks that document the projects with explanations and “working” code):
Team John Cena’s Voice-Controlled Mobile Robot
Team “Joy of Pink” developed an emoji generator based on facial interpretation using Microsoft’s cloud-based Azure Emotion API
Team Caffeine’s Audio Fiend Tone-Based Robotic Controller
After the presentations, the judges deliberated for a few minutes using multiple predefined criteria and then awarded the following prizes:
Congratulations to the winners and to all of the teams who spent 30 hours with each other in a large room in Colorado to experience the joy of hacking code to tackle some tough problems. (A follow-up blog will include a photographic record of the event so that you can see what it was like.)
For more information about the PYNQ development environment and the Digilent PYNQ-Z1 board, see “Python + Zynq = PYNQ, which runs on Digilent’s new $229 pink PYNQ-Z1 Python Productivity Package.”
Voice-controlled systems are suddenly a thing thanks to Amazon’s Alexa and Google Home. But how do you get reliable, far-field voice recognition and robust voice recognition in the presence of noise? That’s the question being answered by Aaware with its $199 Far-Field Development Platform. This system couples as many as 13 MEMS microphones (you can use fewer in a 1D linear or 2D array) with a Xilinx Zynq Z-7010 SoC to pre-filter incoming voice, delivering a clean voice data stream to local or cloud-based voice recognition hardware. The system has a built-in wake word (like “Alexa” or “OK, Google”) that triggers the unit’s filtering algorithms.
Here’s a video showing you the Aaware Far-Field Development Platform in action:
Aaware’s technology makes significant use of the Zynq Z-7010 SoC’s programmable-logic and DSP processing capabilities to implement and accelerate the company’s sound-capture technologies including:
You’ll find more technology details for the Aaware Far-Field Development Platform here.
Please contact Aaware directly for more information.
Late last month, I wrote about an announcement by DNAnexus and Edico Genome that described a huge reduction in the cost and time to analyze genomic information, enabled by Amazon’s FPGA-accelerated AWS EC2 F1 instance. (See “Edico Genome and DNAnexus announce $20, 90-minute genome analysis on Amazon’s FPGA-accelerated AWS EC2 F1 instance.”) The AWS Partner Network blog has just published more details in an article written by Amazon’s Aaron Friedman, titled “How DNAnexus and Edico Genome are Powering Precision Medicine on Amazon Web Services (AWS).”
The details are exciting to say the least. The article begins with this statement:
“Diagnosing the medical mysteries behind acutely ill babies can be a race against time, filled with a barrage of tests and misdiagnoses. During the first few days of life, a few hours can save or seal the fate of patients admitted to the neonatal intensive care units (NICUs) and pediatric intensive care units (PICUs). Accelerating the analysis of the medical assays conducted in these hospitals can improve patient outcomes, and, in some cases, save lives.”
Then, if you read far enough into the post, you find this statement:
“Rady Children’s Institute for Genomic Medicine is one of the global leaders in advancing precision medicine. To date, the institute has sequenced the genomes of more than 3,000 children and their family members to diagnose genetic diseases. 40% of these patients are diagnosed with a genetic disease, and 80% of these receive a change in medical management. This is a remarkable rate of change in care, considering that these are rare diseases and often involve genomic variants that have not been previously observed in other individuals.”
This example is merely a road sign, pointing the way to even more exciting developments in FPGA-accelerated, cloud-based computing to come. Well-known Silicon Valley venture capitalist Jim Hogan directly addressed these developments in a speech at San Jose State University just a couple of weeks ago. (See “Four free training videos (two hour's worth) on using Xilinx SDAccel to create apps for Amazon AWS EC2 F1 instances.”)
The Amazon AWS EC2 F1 instance is a cloud service that’s based on multiple Xilinx Virtex UltraScale+ VU9P FPGAs installed in Amazon’s Web servers. For more information on the AWS EC2 F1 Instance in Xcell Daily, see:
Karl Freund, a Senior Analyst at Moor Insights & Strategy, has just published an article on Forbes.com titled “Amazon And Xilinx Deliver New FPGA Solutions” that discusses Amazon’s use of Xilinx Virtex UltraScale+ FPGAs in the Amazon AWS EC2 F1 instance and how those resources are now being used more widely to distribute cloud-based applications through AWS Marketplace’s Amazon Machine Images (AMIs). Freund gave three specific examples of companies using these resources to offer accelerated, cloud-based services:
Freund also notes that cloud-based applications from these three companies are not “GPU-friendly,” which means that these applications benefit far more from FPGA-based acceleration than they do GPU acceleration.
NGCodec, Ryft, and Edico Genome have all appeared in Xcell Daily posts. For more information, see:
Need to get an FPGA-based, high-performance network server or appliance designed and fielded quickly? If so, take a serious look at the heavyweight combination of a BittWare XUPP3R PCIe board based on any one of three Xilinx Virtex UltraScale+ FPGAs (VU7P, VU9P, or VU11P) and LDA Technologies’ slick 1U e4 FPGA chassis, which is designed to bring all of the BittWare XUPP3R’s I/O ports on its QSFP, PCIe, and high-speed serial expansion ports to the 48-port front panel, like so:
LDA Technologies’ 1U e4 FPGA chassis
BittWare’s XUPP3R PCIe card can support this many high-speed GbE ports because the Virtex UltraScale+ FPGAs have that many bulletproof high-speed SerDes ports.
BittWare XUPP3R PCIe board based on Virtex UltraScale+ FPGAs
You can use the combination of the BittWare XUPP3R PCIe card and LDA Technologies’ e4 FPGA chassis to develop a variety of network equipment including:
LDA Technologies’ e4 FPGA chassis is specifically designed to accept a PCIe FPGA card like BittWare’s XUPP3R and it has several features designed to specifically support the needs of such a card including:
The e4 FPGA chassis fits in a 1U rack space and it’s only 12 inches deep, so you can mount two back-to-back in a 1U rack slot with the front I/O ports pointing forward and the back I/O ports pointing out the rear of the rack.
For a 10-minute BittWare video that gives you even more information about this card/chassis combo, click here.
Earlier this week at San Jose State University (SJSU), Jim Hogan, one of Silicon Valley’s most successful venture capitalists, gave a talk on the disruptive effects that cognitive science and AI are already having on society. In a short portion of that talk, Hogan discussed how he and one of his teams developed the world’s most experienced lung-cancer radiologist—an AI app—for $75:
Hogan’s trained AI radiologist can look at lung images and find possibly cancerous tumors based on thousands of cases in the CDC database. However, said Hogan, the US Veterans Administration has a database with millions of cases. Yes, his team used that database for training too.
Hogan predicted that something like 25 million AI apps like his lung-cancer-specific radiologist will be developed over the next few years. His $75 example is meant to prove the cost feasibility of developing that many useful apps.
Hogan made me a believer.
In a related connection to AWS app development, Xilinx has just posted four training videos showing you how to develop FPGA-accelerated apps using Xilinx’s SDAccel on Amazon’s AWS EC2 F1 instance. That’s nearly two hours of free training available at your desk. (For more information on the AWS EC2 F1 instance, see “SDAccel for cloud-based application acceleration now available on Amazon’s AWS EC2 F1 instance” and “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available”.)
Here are the videos:
Note: You can watch Jim Hogan’s 90-minute presentation at SJSU by clicking here.
Sounds like a riddle, doesn’t it? How do you squeeze 1024 Xilinx Kintex UltraScale FPGAs plus 16Tbytes of DDR4 SDRAM into a standard 19-inch data-center rack and why would you do that? I can’t tell you how you would do it but I can tell you how IBM Research did it. They started with the design of the FPGA card, mounting a Kintex UltraScale KU060 FPGA on a PCIe card along with a big chunk of DDR4 SDRAM and a Cypress Semiconductor PSoC with an on-chip ARM Cortex-M3 processor for housekeeping over USB. They also instantiated a 10GBASE-KR 10Gbps backplane Ethernet NIC in the FPGA. (This is definitely an application where you want those bulletproof Xilinx UltraScale MGT SerDes transceivers.)
The card looks like this:
Next, IBM Research stuffed 32 of these cards into a half-rack “sled”—a passive carrier board that electronically aggregates the boards with an Intel FM6000 multi-layer switch chip that funnels the 10GbE connections into eight 40GbE optical connections. Then, they bolted two sleds into a 2U, 19-inch rack chassis that connects to the rest of the rack via the 16 40GbE ports on the north side of the Ethernet switches. Install 16 of these chassis into a rack, add 50kW of power and water cooling, and there you have it.
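The build-up described above can be checked with a little arithmetic. The sketch below uses only the counts quoted in this post. The per-FPGA power figure is an upper bound, on the assumption that the 50kW budget also feeds the Ethernet switches and other rack overhead. Treating 1 Tbyte as 1024 Gbytes is also an assumption.

```python
CARDS_PER_SLED = 32          # one Kintex UltraScale KU060 per card
SLEDS_PER_CHASSIS = 2
CHASSIS_PER_RACK = 16
RACK_DRAM_TBYTES = 16
RACK_POWER_KW = 50
CARD_LINK_GBPS = 10          # 10GBASE-KR NIC in each FPGA
UPLINKS_PER_SLED = 8
UPLINK_GBPS = 40             # 40GbE optical connections

fpgas_per_rack = CARDS_PER_SLED * SLEDS_PER_CHASSIS * CHASSIS_PER_RACK
dram_per_card_gbytes = RACK_DRAM_TBYTES * 1024 / fpgas_per_rack

# Compare each sled's aggregate card bandwidth with its uplink bandwidth.
sled_downlink_gbps = CARDS_PER_SLED * CARD_LINK_GBPS
sled_uplink_gbps = UPLINKS_PER_SLED * UPLINK_GBPS

uplinks_per_chassis = UPLINKS_PER_SLED * SLEDS_PER_CHASSIS
rack_uplink_tbps = uplinks_per_chassis * CHASSIS_PER_RACK * UPLINK_GBPS / 1000

# Upper bound: the rack budget also covers switches and other overhead.
max_power_per_fpga_w = RACK_POWER_KW * 1000 / fpgas_per_rack

print(f"FPGAs per rack:        {fpgas_per_rack}")
print(f"DRAM per card:         {dram_per_card_gbytes:.0f} Gbytes")
print(f"Sled down/up links:    {sled_downlink_gbps}/{sled_uplink_gbps} Gbps")
print(f"40GbE ports/chassis:   {uplinks_per_chassis}")
print(f"Rack uplink bandwidth: {rack_uplink_tbps:.2f} Tbps")
print(f"Power per FPGA (max):  {max_power_per_fpga_w:.1f} W")
```

The numbers line up with the text: 32 x 2 x 16 gives the 1024 FPGAs, each card gets 16Gbytes of DDR4, and each sled's 320Gbps of card links matches its 320Gbps of uplinks, so the switch is not oversubscribed.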
What do you have? Allow me to quote from the conclusion of the IBM paper titled “An FPGA Platform for Hyperscalers,” presented at last month’s IEEE Hot Interconnects conference in Santa Clara:
“…we first compared the network performance of our disaggregated FPGA with that obtained from bare-metal servers, virtual machines, and containers. The results showed that standalone disaggregated FPGAs outperform them in terms of network latency and throughput by a factor of up to 35x and 73x, respectively. We also observed that the Ethernet NIC integrated within the FPGA fabric was consuming less than 10% of the total FPGA resources.”
Note: The open-source TCP/IP stack used in this system was developed by Xilinx and the Systems Group at ETH Zurich and can be found at: http://github.com/fpgasystems/fpga-network-stack. It is written in Vivado HLS and supports thousands of concurrent connections at a rate of 10Gbps.
This week, EXFO announced and demonstrated its FTBx-88400NGE Power Blazer 400G Ethernet Tester at the ECOC 2017 optical communications conference in Gothenburg, Sweden using a Xilinx VCU140 FPGA design platform as an interoperability target. The VCU140 development platform is based on a Xilinx Virtex UltraScale+ VU9P FPGA. EXFO’s FTBx-88400NGE Power Blazer offers advanced testing for the full suite of new 400G technologies including support for FlexE (Flex Ethernet), 400G Ethernet, and high-speed transceiver validation. The Flex Ethernet (FlexE) function supports one or more bonded 100GBASE-R PHYs supporting multiple Ethernet MACs operating at rates of 10, 40, or n x 25Gbps. Flex Ethernet is a key data center technology that helps data centers deliver links that are faster than emerging 400G solutions.
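The FlexE description above boils down to a capacity rule: client MACs at 10, 40, or n x 25Gbps share a group of bonded 100GBASE-R PHYs. The sketch below checks only that rule. The function names are hypothetical, and the model deliberately ignores how FlexE actually schedules clients onto the PHYs, so treat it as an illustration rather than an implementation of the FlexE specification.

```python
PHY_RATE_GBPS = 100  # each bonded PHY is a 100GBASE-R link


def is_valid_client_rate(rate_gbps):
    """True for the client MAC rates named in the post: 10, 40, or n x 25."""
    return rate_gbps in (10, 40) or (rate_gbps > 0 and rate_gbps % 25 == 0)


def fits_in_group(client_rates_gbps, num_phys):
    """True if the clients' total rate fits in the bonded group's capacity."""
    for rate in client_rates_gbps:
        if not is_valid_client_rate(rate):
            raise ValueError(f"unsupported client rate: {rate}Gbps")
    return sum(client_rates_gbps) <= num_phys * PHY_RATE_GBPS


# One 100G PHY carrying 10G, 40G, and 50G clients is exactly full.
print(fits_in_group([10, 40, 50], num_phys=1))
# Two bonded PHYs (200G) cannot carry 225G of client traffic.
print(fits_in_group([100, 100, 25], num_phys=2))
```

Bonding is what lets a single client exceed the rate of one PHY: a 150Gbps client (6 x 25) passes this check on a two-PHY group but not on a single PHY.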
Here’s a photo of the ECOC 2017 demo:
This demonstration is yet one more proof point for the 400GbE standard, which will be used in a variety of high-speed communications applications including data-center interconnect, next-generation switch and router line cards, and high-end OTN transponders.
Ryft has announced that it now offers its Ryft Cloud cloud-based search and analysis tools on Amazon’s FPGA-accelerated AWS EC2 F1 instance through Amazon’s AWS Marketplace. When Xcell Daily last covered Ryft, the company had introduced the Ryft ONE, an FPGA-accelerated data analytics platform. (See “FPGA-based Ryft ONE search accelerator delivers 100x performance advantage over Apache Spark in the data center.”)
Now you can access Ryft’s accelerated search and analysis algorithms instantly through Amazon’s EC2 F1 compute instance, which gets its acceleration from multiple Xilinx Virtex UltraScale+ VU9P FPGAs. According to Ryft, FPGA acceleration using the AWS EC2 F1 instance boosts application performance by 91X compared to traditional CPU-based cloud analytics.
How fast is that? Ryft has published a benchmark chart that shows you just how fast that is:
The announcement includes a link to a Ryft White Paper titled “Powering Elastic Search in the Cloud: Transform High-Performance Analytics in the AWS Cloud for Fast, Data-Driven Decisions.”
For more information about Amazon’s AWS EC2 F1 instance, see:
SDAccel—Xilinx’s development environment for accelerating cloud-based applications using C, C++, or OpenCL—is now available on Amazon’s AWS EC2 F1 instance. (Formal announcement here.) The Amazon EC2 F1 compute instance allows you to create custom hardware accelerators for your application using cloud-based server hardware that incorporates multiple Xilinx Virtex UltraScale+ VU9P FPGAs. SDAccel automates the acceleration of software applications by building application-specific FPGA kernels for the AWS EC2 F1. You can also use HDLs including Verilog and VHDL to define hardware accelerators in SDAccel. With this release, you can access SDAccel through the AWS FPGA developer AMI.
For more information about Amazon’s AWS EC2 F1 instance, see:
For more information about SDAccel, see:
What happens when you host a genomic analysis application on the FPGA-accelerated Amazon AWS EC2 F1 instance? You get Edico Genome’s and DNAnexus’ dramatic announcement of a $20, 90-minute offer to analyze an entire human genome. Edico Genome previously ported the DRAGEN pipeline to Amazon’s FPGA instances and DNAnexus customers can now leverage Edico Genome’s DRAGEN app as a turnkey solution. DNAnexus provides a global network for sharing and managing genomic data and tools to accelerate genomics. New and existing DNAnexus customers have access to the DRAGEN app.
The two companies have launched a promotion, lasting from Aug. 28 to Oct. 31, where whole-genome analysis on the AWS EC2 F1 2x instances costs $20 and takes about an hour and a half. In the next few weeks, Edico Genome’s DRAGEN will be available through DNAnexus on the F1 16x instances as well, which reduces analysis time to 20 minutes or so. Whole-exome analysis will cost about $5 during the promotional period.
The Amazon AWS EC2 F1 instance is a cloud service that’s based on multiple Xilinx Virtex UltraScale+ VU9P FPGAs installed in Amazon’s Web servers.
For more information about Edico Genome’s DRAGEN processor and genome analysis in Xcell Daily, see:
BrainChip Holdings has just announced the BrainChip Accelerator, a PCIe server-accelerator card that simultaneously processes 16 channels of video in a variety of video formats using spiking neural networks rather than convolutional neural networks (CNNs). The BrainChip Accelerator card is based on a 6-core implementation of BrainChip’s Spiking Neural Network (SNN) processor instantiated in an on-board Xilinx Kintex UltraScale FPGA.
Here’s a photo of the BrainChip Accelerator card:
BrainChip Accelerator card with six SNNs instantiated in a Kintex UltraScale FPGA
Each BrainChip core performs fast, user-defined image scaling, spike generation, and SNN comparison to recognize objects. The SNNs can be trained using low-resolution images as small as 20x20 pixels. According to BrainChip, SNNs as implemented in the BrainChip Accelerator cores excel at recognizing objects in low-light, low-resolution, and noisy environments.
The BrainChip Accelerator card can process 16 channels of video simultaneously with an effective throughput of more than 600 frames per second while dissipating a mere 15W for the entire card. According to BrainChip, that’s a 7x improvement in frames/sec/watt when compared to a GPU-accelerated CNN-based, deep-learning implementation for neural networks like GoogleNet and AlexNet. Here’s a graph from BrainChip illustrating this claim:
SNNs mimic human brain function (synaptic connections, neuron thresholds) more closely than do CNNs and rely on models based on spike timing and intensity. Here’s a graphic from BrainChip comparing a CNN model with the Spiking Neural Network model:
For more information about the BrainChip Accelerator card, please contact BrainChip directly.
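As a rough check on the throughput and power figures BrainChip quotes above, the sketch below works out the frames/sec/watt they imply. It takes "more than 600 frames per second" as exactly 600, so the results are lower bounds. The GPU baseline is inferred from the claimed 7x advantage and is not a number BrainChip published in this post.

```python
CHANNELS = 16
TOTAL_FPS = 600            # "more than 600 frames per second"
CARD_POWER_W = 15          # "a mere 15W for the entire card"
CLAIMED_ADVANTAGE = 7      # "a 7x improvement in frames/sec/watt"

fps_per_watt = TOTAL_FPS / CARD_POWER_W
fps_per_channel = TOTAL_FPS / CHANNELS

# Inferred, not quoted: the GPU efficiency that a 7x advantage implies.
implied_gpu_fps_per_watt = fps_per_watt / CLAIMED_ADVANTAGE

print(f"card efficiency:        {fps_per_watt:.1f} frames/sec/watt")
print(f"per-channel throughput: {fps_per_channel:.1f} frames/sec")
print(f"implied GPU efficiency: {implied_gpu_fps_per_watt:.1f} frames/sec/watt")
```

That works out to at least 40 frames/sec/watt for the card, about 37.5 frames/sec per video channel, and an implied GPU baseline of roughly 5.7 frames/sec/watt.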
ARM, Cadence, TSMC, and Xilinx have announced a collaboration to develop a CCIX (Cache Coherent Interconnect for Accelerators) test chip in TSMC’s 7nm FinFET process technology with a 2018 completion date. The test chip will demonstrate multiple ARM CPUs, CMN-600 coherent on-chip bus, and foundation IP communicating to other chips including Xilinx’s Virtex UltraScale+ FPGAs over the coherent, 25Gbps CCIX fabric. Cadence is supplying the CCIX controller and PHY IP for the test chip as well as PCIe Gen 4, DDR4 PHY, and Peripheral IP blocks. In addition, Cadence verification and implementation tools are being used to design and build the test chip. According to the announced plan, the test chip tapes out early in the first quarter of 2018, with silicon availability expected in the second half of 2018.
You can’t understand the importance of this announcement if you aren’t fully up to speed on CCIX, which Xcell Daily has discussed a few times in the recent past.
CCIX simplifies the design of offload accelerators for hyperscale data centers by providing low-latency, high-bandwidth, fully coherent access to server memory. The specification employs a subset of full coherency protocols and is ISA-agnostic, meaning that the specification’s protocols are independent of the attached processors’ architecture and instruction sets. Full coherency is unique to the CCIX specification. It permits accelerators to cache processor memory and processors to cache accelerator memory.
CCIX is designed to provide coherent interconnection between server processors and hardware accelerators, memory, and among hardware accelerators as shown below:
Sample CCIX Configurations
The CCIX Consortium announced Release1 of the CCIX spec a little less than a year ago. CCIX Consortium members Xilinx and Amphenol FCI demonstrated a CCIX interface operating at 25Gbps using two Xilinx 16nm UltraScale+ devices through an Amphenol/FCI PCI Express CEM connector and a trace card earlier this year.
As the CCIX Consortium’s Web site says:
“CCIX simplifies the development and adoption by extending well-established data center hardware and software infrastructure. This ultimately allows system designers to seamlessly integrate the right combination of heterogeneous components to address their specific system needs.”
For more information, see these earlier Xcell Daily CCIX blog posts:
Edico Genome has been developing genetic-analysis algorithms for a while now. (See this Xcell Daily story from 2015, “FPGA-based Edico Genome Dragen Accelerator Card for IBM OpenPOWER Server Speeds Exome/Genome Analysis by 60x”). The company originally planned to accelerate its algorithm by developing an ASIC, but decided this was a poor implementation choice because of the rapid development of its algorithms. Once you develop an ASIC, it’s frozen in time. Instead, Edico Genome found that Xilinx FPGAs were an ideal match for the company’s development needs and so the company developed the Dragen Accelerator Card for exome/genome analysis.
This hardware was well suited to Edico Genome’s customers that wanted to have on-site hardware for genomic analysis but the last couple of years have seen a huge movement to cloud-based apps including genomic analysis. So Edico Genome moved its algorithms to Amazon’s AWS EC2 F1 Instance, which offers accelerated computing thanks to Xilinx UltraScale+ VU9P FPGAs. (See “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available.”)
Edico Genome now offers cloud-based genomic processing and genomic storage in the cloud through Amazon’s AWS EC2 F1 Instance. Like its genomic analysis algorithms, the company’s cloud-based genomic storage takes advantage of the FPGA acceleration offered by Amazon’s AWS EC2 F1 Instance to achieve 2x to 4x compression. When you’re dealing with the human genome, you’re talking about storing 80Gbytes per genome so fast, 2x to 4x compression is a pretty important benefit.
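To see why that compression range matters, here is a small sketch of the storage arithmetic. It uses the 80Gbytes-per-genome figure and the 2x to 4x ratios quoted above. The 1,000-genome cohort is a hypothetical example and not a number from Edico Genome.

```python
GENOME_GBYTES = 80           # raw storage per human genome, per the post
COMPRESSION_RATIOS = (2, 4)  # the quoted 2x to 4x range


def compressed_gbytes(raw_gbytes, ratio):
    """Storage needed after compressing raw_gbytes by the given ratio."""
    return raw_gbytes / ratio


def cohort_tbytes(genomes, ratio=1):
    """Total storage in Tbytes (1 Tbyte = 1000 Gbytes) for a cohort."""
    return genomes * compressed_gbytes(GENOME_GBYTES, ratio) / 1000


for ratio in COMPRESSION_RATIOS:
    per_genome = compressed_gbytes(GENOME_GBYTES, ratio)
    print(f"{ratio}x: {per_genome:.0f} Gbytes per genome, "
          f"{cohort_tbytes(1000, ratio):.0f} Tbytes per 1,000 genomes")
```

Each genome shrinks from 80Gbytes to between 40 and 20Gbytes, so the hypothetical 1,000-genome cohort drops from 80Tbytes to between 40 and 20Tbytes.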
This is all explained by Edico Genome’s VP of Engineering Rami Mehio in an information-packed 3-minute video:
Need to build a networking monster for financial services, low-latency trading, or cloud-based applications? The raw materials you need are already packed into Silicom Denmark’s SmartNIC fb4CGg3@VU PCIe card, which is based on a Xilinx Virtex UltraScale or Virtex UltraScale+ FPGA:
Silicom Denmark’s SmartNIC fb4CGg3@VU PCIe card
The SmartNIC fb4CGg3@VU PCIe card includes complete NIC functionality (TCP Offload Engine (TOE), UDP Offload Engine, and drivers).
Please contact Silicom Denmark directly for more information about the SmartNIC fb4CGg3@VU PCIe card.
Yesterday, Amazon announced a preview of an OpenCL development flow for the AWS EC2 F1 Instance, which is an FPGA-accelerated cloud-computing service based on Xilinx Virtex UltraScale+ VU9P FPGAs. According to Amazon, “…developers with little to no FPGA experience, will find a familiar development experience and now can use the cloud-scale availability of FPGAs to supercharge their applications.” In addition, wrote Amazon: “The FPGA Developer AMI now enables a graphical design canvas, enabling faster AFI development using a graphical flow, and leveraging pre-integrated verified IP blocks,” and "We have also upgraded the FPGA Developer AMI to Vivado 2017.1 SDx, improving the synthesis quality and runtime capabilities."
A picture is worth 1000 words:
For more information and to sign up for the preview, please visit Amazon’s preview page.
For more information about the Amazon EC2 F1 Instance based on Xilinx Virtex UltraScale+ FPGAs, see “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available” and “AWS does a deep-dive video on the Amazon EC2 F1 Instance, a cloud accelerator based on Xilinx Virtex UltraScale+ FPGAs.”
Curious about using Amazon’s AWS EC2 F1 Instance? Want a head start? Falcon Computing in Santa Clara, California has a 2-day seminar just for you titled “Accelerate Applications on AWS EC2 F1.” It’s being taught by Professor Jason Cong from the Computer Science Department at the U. of California in Los Angeles and it’s taking place on September 28-29 at Falcon’s HQ.
Here’s the agenda:
Please contact Falcon Computing directly for more information about this Amazon AWS EC2 F1 Instance Seminar.
Xilinx has announced at HUAWEI CONNECT 2017 that Huawei’s new, accelerated cloud service and its FPGA Accelerated Cloud Server (FACS) are based on Xilinx Virtex UltraScale+ VU9P FPGAs. The Huawei FACS platform allows users to develop, deploy, and publish new FPGA-based services and applications on the Huawei Public Cloud with a 10-50x speed-up for compute-intensive cloud applications such as machine learning, data analytics, and video processing. Huawei has more than 15 years of experience in the development of FPGA systems for telecom and data center markets. "The Huawei FACS is a fully integrated hardware and software platform offering developer-to-deployment support with best-in-class industry tool chains and access to Huawei's significant FPGA engineering expertise," said Steve Langridge, Director, Central Hardware Institute, Huawei Canada Research Center.
The FPGA Accelerated Cloud Server is available on the Huawei Public Cloud today. To register for the public beta, please visit http://www.hwclouds.com/product/fcs.html. For more information on the Huawei Cloud, please visit www.huaweicloud.com.
For more information, see this page.
Fidus Systems based the design of its Sidewinder-100 PCIe NVMe Storage Controller on a Xilinx Zynq UltraScale+ MPSoC ZU19EG for many reasons, but among the most important are PCIe Gen3/4 capability; high-speed, bulletproof SerDes for the board’s two 100Gbps-capable QSFP optical network cages; the vast I/O flexibility inherent in Xilinx All Programmable devices to control DDR SDRAM, to drive the two SFF-8643 Mini SAS connectors for off-board SSDs, etc.; the immense processing capabilities that come from the six on-chip ARM processor cores (four 64-bit ARM Cortex-A53 MPCore processors and two 32-bit ARM Cortex-R5 MPCore processors); and the big chunk of on-chip programmable logic based on the Xilinx UltraScale architecture. The same attributes that made the Zynq UltraScale+ MPSoC a good foundation for a high-performance NVMe controller like the Sidewinder-100 also make the board an excellent development target for a truly wide variety of hardware designs, including just about anything you might imagine.
The Sidewinder-100’s significant performance advantage over SCSI and SAS storage arrays comes from its use of NVMe Over Fabrics technology to reduce storage transaction latencies. In addition, there are two on-board M.2 connectors available for docking NVMe SSD cards. The board also accepts two DDR4 SO-DIMMs that are independently connected to the Zynq UltraScale+ MPSoC’s PS (processing system) and PL (programmable logic). That independent connection allows the PS-connected DDR4 SO-DIMM to operate at 1866Mtransfers/sec and the PL-connected DDR4 SO-DIMM to operate at 2133Mtransfers/sec.
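The two SO-DIMM transfer rates translate into peak memory bandwidth as sketched below. The calculation assumes a 64-bit data bus on each SO-DIMM. That is the usual width for a non-ECC DDR4 SO-DIMM, but this post does not state the widths Fidus actually uses, so the results are theoretical peaks under that assumption.

```python
def peak_bandwidth_gbytes_per_s(mtransfers_per_s, bus_width_bits=64):
    """Theoretical peak bandwidth in Gbytes/sec for a DDR interface.

    bus_width_bits defaults to 64, an assumption about the SO-DIMM data bus.
    """
    bytes_per_transfer = bus_width_bits // 8
    return mtransfers_per_s * 1e6 * bytes_per_transfer / 1e9


ps_bandwidth = peak_bandwidth_gbytes_per_s(1866)  # PS-connected SO-DIMM
pl_bandwidth = peak_bandwidth_gbytes_per_s(2133)  # PL-connected SO-DIMM

print(f"PS DDR4 peak: {ps_bandwidth:.1f} Gbytes/sec")
print(f"PL DDR4 peak: {pl_bandwidth:.1f} Gbytes/sec")
```

Under the 64-bit assumption, the PS-side memory peaks at about 14.9Gbytes/sec and the PL-side memory at about 17.1Gbytes/sec.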
All of this makes for a great PCIe Gen4 development platform, as you can see from this photo:
Fidus Sidewinder-100 PCIe NVMe Storage Controller
Because Fidus is a design house, it had general-purpose uses in mind for the Sidewinder-100 PCIe NVMe Storage Controller from the start. The board makes an excellent, ready-to-go development platform for any sort of high-performance PCIe Gen 3 or Gen4 development and Fidus would be happy to help you develop something else using this platform.
Oh, and one more thing. Tucked onto the bottom of the Sidewinder-100 PCIe NVMe Storage Controller Web page is this interesting PCIe Power and Loopback Adapter:
Fidus PCIe Power and Loopback Adapter
It’s just the thing you’ll need to bring up a PCIe card on the bench without a motherboard. After all, PCIe Gen4 motherboards are scarce at the moment and this adapter looks like it should cost a lot less than a motherboard with a big, power-hungry processor on board. Just look at that tiny dc power connector to operate the adapter!
Please contact Fidus Systems directly for more information about the Sidewinder-100 PCIe NVMe Storage Controller and the PCIe Power and Loopback Adapter.
Even though I knew this was coming, it’s still hard to write this blog post without grinning. Last week, acknowledged FPGA-based processor wizard Jan Gray of Gray Research LLC presented a Hot Chips poster titled “GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Framework: A 1680-core, 26 MB SRAM Parallel Processor Overlay on Xilinx UltraScale+ VU9P.” Allow me to unpack that title and the details of the GRVI Phalanx for you.
Let’s start with 1680 “austere” processing elements in the GRVI Phalanx, which are based on the 32-bit RISC-V processor architecture. (Is that parallel enough for you?) The GRVI processing element design follows “Jan’s Razor”: In a chip multiprocessor, cut nonessential resources from each CPU, to maximize CPUs per die. Thus, a GRVI processing element is a 3-stage, user-mode RV32I core minus a few nonessential bits and pieces. It looks like this:
A GRVI Processing Element
Each GRVI processing element requires ~320 LUTs and runs at 375MHz. Typical of a Jan Gray design, the GRVI processing element is hand-mapped and -floorplanned into the UltraScale+ architecture and then stamped 1680 times into the Virtex UltraScale+ VU9P FPGA on a VCU118 Eval Kit.
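As a rough back-of-the-envelope check, the sketch below multiplies out the per-core figures quoted so far. It uses only the core count and SRAM total from the poster title and the approximate LUT count per core. The variable names are illustrative, treating 1 MB as 1024 KB is an assumption, and the totals ignore routers, cluster interconnect, and accelerators.

```python
# Back-of-the-envelope totals for the GRVI Phalanx, using figures quoted in the text.
CORES = 1680           # processing elements (from the poster title)
LUTS_PER_CORE = 320    # approximate: the text says "~320 LUTs" per processing element
SRAM_MB = 26           # total on-chip SRAM (from the poster title)

# LUTs consumed by the processing elements alone (no routers or accelerators)
total_pe_luts = CORES * LUTS_PER_CORE

# Assumption: 1 MB = 1024 KB, so this is an average share, not a per-core allocation
sram_kb_per_core = SRAM_MB * 1024 / CORES

print(f"LUTs used by processing elements alone: ~{total_pe_luts:,}")
print(f"Average SRAM per core: ~{sram_kb_per_core:.1f} KB")
```

The point of the exercise: even an "austere" core adds up to more than half a million LUTs once you stamp it 1680 times, which is why trimming each core matters so much.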
Now, dropping a bunch of processor cores onto a large device like the Virtex UltraScale+ VU9P FPGA is interesting but less than useful unless you give all of those cores some memory to operate out of, some way for the processors to communicate with each other and with the world beyond the FPGA package, and some way to program the overall machine.
Therefore, the GRVI processing elements are packaged in clusters containing as many as eight processing elements with 32 to 128Kbytes of RAM, and additional accelerator(s). Each cluster is tied to the other on-chip clusters and to the external-world I/O through a HOPLITE router to a NOC (network on chip) with 100Gbps links between nodes. The HOPLITE router is an FPGA-optimized, directional router designed for a 2D torus network.
A GRVI Phalanx cluster looks like this:
A GRVI Phalanx Cluster
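To give a feel for how a directional 2D torus moves a message between clusters, here is a minimal sketch of dimension-ordered routing on a torus whose links run one way only (east along X, south along Y) and wrap around at the edges. It illustrates the general idea and is not Gray's HOPLITE implementation: the grid size, the function names, and the X-then-Y order are all assumptions, and the sketch ignores contention and whatever a real router does when two messages want the same link.

```python
def next_hop(pos, dest, cols, rows):
    """Return the next node on a one-way 2D torus (hypothetical model)."""
    x, y = pos
    dest_x, dest_y = dest
    if x != dest_x:                  # first travel east until the column matches
        return ((x + 1) % cols, y)
    if y != dest_y:                  # then travel south until the row matches
        return (x, (y + 1) % rows)
    return pos                       # already at the destination


def route(src, dest, cols, rows):
    """List every node visited from src to dest, inclusive."""
    path = [src]
    while path[-1] != dest:
        path.append(next_hop(path[-1], dest, cols, rows))
    return path


# Hypothetical 4x4 torus: going from (3, 1) to (1, 0) wraps around both edges.
print(route((3, 1), (1, 0), cols=4, rows=4))
```

Because links only run one way in each dimension, a message headed "west" or "north" simply keeps going east or south until it wraps around, which keeps each router's switching logic small.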
According to Gray’s poster, there is currently a multithreaded C++ compiler with a message-passing runtime layered on top of a RISC-V RV32IMA GCC compiler, with future plans to support OpenCL, P4, and other programming tools.
Now if all that were not enough (and you will find a lot more packed into Gray’s poster), there’s a Xilinx Virtex UltraScale+ VU9P available to you. It’s as near as your keyboard and Web browser on the Amazon AWS EC2 F1.2XL and F1.16XL instances and Jan Gray is working on putting the GRVI Phalanx on that platform as well.
Incredibly, it’s all in that Hot Chips poster.
Now that Amazon has made the FPGA-accelerated Amazon EC2 F1 compute instance generally available to all AWS customers (see “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available”), just about anyone with an Internet connection and a Web browser can get access to the latest Xilinx All Programmable UltraScale+ devices from anywhere. Xilinx has just published a new video demonstrating the use of its Vivado IP Integrator, a graphical design tool, with the AWS EC2 F1 compute instance.
Why use Vivado IP Integrator? As the video says, there are five main reasons:
Here’s the 5-minute video:
Xcell Daily covered an announcement by Baidu about its use of Xilinx Kintex UltraScale FPGAs for the acceleration of cloud-based applications last October. (See “Baidu Adopts Xilinx Kintex UltraScale FPGAs to Accelerate Machine Learning Applications in the Data Center.”) Today, Baidu discussed more architectural particulars of its FPGA-acceleration efforts at the Hot Chips conference in Cupertino, California, according to Nicole Hemsoth’s article appearing on the NextPlatform.com site (“An Early Look at Baidu’s Custom AI and Analytics Processor”).
“…Baidu has a new processor up its sleeve called the XPU… The architecture they designed is aimed at this diversity with an emphasis on compute-intensive, rule-based workloads while maximizing efficiency, performance and flexibility, says Baidu researcher, Jian Ouyang. He unveiled the XPU today at the Hot Chips conference along with co-presenters from FPGA maker, Xilinx…
“‘The FPGA is efficient and can be aimed at specific workloads but lacks programmability,’ Ouyang explains. ‘Traditional CPUs are good for general workloads, especially those that are rule-based and they are very flexible. GPUs aim at massive parallelism and have high performance. The XPU is aimed at diverse workloads that are compute-intensive and rule-based with high efficiency and performance with the flexibility of a CPU,’ Ouyang says. The part that is still lagging, as is always the case when FPGAs are involved, is the programmability aspect. As of now there is no compiler, but he says the team is working to develop one…
“‘To support matrix, convolutional, and other big and small kernels we need a massive math array with high bandwidth, low latency memory and with high bandwidth I/O,’ Ouyang explains. ‘The XPU’s DSP units in the FPGA provide parallelism, the off-chip DDR4 and HBM interface push on the data movement side and the on-chip SRAM provide the memory characteristics required.’”
According to Hemsoth’s article, “The XPU has 256 cores clustered with one shared memory for data synchronization… Somehow the all 256 cores are running at 600MHz.”
For more details, see Hemsoth’s article on the NextPlatform.com Web site.
Every device family in the Xilinx UltraScale+ portfolio (Virtex UltraScale+ FPGAs, Kintex UltraScale+ FPGAs, and Zynq UltraScale+ MPSoCs) has members with 28Gbps-capable GTY transceivers. That’s likely to be important to you as the number and forms of small, 28Gbps interconnect grow. You have many choices for such interconnect these days, including:
The following 5.5-minute video demonstrates all of these interfaces operating with 25.78Gbps lanes on Xilinx VCU118 and KCU116 Eval Kits, as concisely explained (as usual) by Xilinx’s “Transceiver Marketing Guy” Martin Gilpatric. Martin also discusses some of the design challenges associated with these high-speed interfaces.
But first, as a teaser, I could not resist showing you the wide-open IBERT eye on the 25.78Gbps Samtec FireFly AOC:
Now that’s a desirable eye.
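If you are wondering why the lanes run at 25.78Gbps rather than a round 25Gbps, the short sketch below works through the arithmetic. The post itself does not name the line coding, so treat the details as assumptions drawn from common 100G Ethernet practice: four lanes per link, an exact lane rate of 25.78125Gbps, and 64b/66b encoding.

```python
# Why 25.78Gbps? A sketch assuming 100G Ethernet-style signaling (not stated in the post).
LANE_RATE_GBPS = 25.78125   # the text rounds this to 25.78Gbps
LANES = 4                   # assumption: a four-lane link

# Raw signaling rate across all lanes
raw_gbps = LANE_RATE_GBPS * LANES

# Assumption: 64b/66b line coding, so 64 of every 66 transmitted bits carry data
payload_gbps = raw_gbps * 64 / 66

print(f"Raw rate across {LANES} lanes: {raw_gbps} Gbps")
print(f"Rate after 64b/66b encoding: {payload_gbps:.1f} Gbps")
```

Under those assumptions, the slightly odd lane rate is just what is left over after the encoding overhead is added to a 100Gbps data stream split four ways.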
Here’s the new video:
Amazon Web Services (AWS) is now offering the Xilinx SDAccel Development Environment as a private preview. SDAccel empowers hardware designers to easily deploy their RTL designs in the AWS F1 FPGA instance. It also automates the acceleration of code written in C, C++ or OpenCL by building application-specific accelerators on the F1. This limited time preview is hosted in a private GitHub repo and supported through an AWS SDAccel forum. To request early access, click here.
The HTG-910 Low-Profile PCIe Development Platform from Hitech Global teams a Virtex UltraScale+ (VU9P, VU13P) or Virtex UltraScale VU190 FPGA with two QSFP28 (4x25G) optical cages, two Samtec FireFly Micro Flyover ports (each capable of 100Gbps operation), and 34Gbytes of DDR4 SDRAM in three independent banks. There’s also a Z-Ray interposer capable of carrying 16 32.75Gbps GTY SerDes transceiver ports from the FPGA to a high-speed mezzanine card.
Here’s a block diagram of the card:
And here’s a photo:
This is one big, fast PCIe card that should be capable of implementing just about anything you can think up.
Last week, stealth startup Burlywood in Longmont, Colorado, unstealthed and announced TrueFlash, the industry’s first modular NAND Flash memory controller for SSDs. The controller is designed to manage media (like NAND Flash memory) that can exhibit high defect and error rates. It is designed to scale to 100Tbytes and beyond, accommodates 3D TLC and QLC Flash devices from multiple sources in the same controller, and can be tuned to specific environments and workloads. According to the company’s brief explanation on its Web site (they are just unstealthing), the Burlywood SSD controller IP “allows for rapid integration of customer specified requirements across interfaces, protocols, FTL, QoS, capacity, flash types, and form-factor.” No doubt all of that flexibility comes from the implementation technology: a Xilinx UltraScale+ FPGA.
For more information about the Burlywood TrueFlash SSD controller, please contact the company directly.
Two new papers, one about hardware and one about software, describe the Snowflake CNN accelerator and accompanying Torch7 compiler developed by several researchers at Purdue U. The papers are titled “Snowflake: A Model Agnostic Accelerator for Deep Convolutional Neural Networks” (the hardware paper) and “Compiling Deep Learning Models for Custom Hardware Accelerators” (the software paper). The authors of both papers are Andre Xian Ming Chang, Aliasger Zaidy, Vinayak Gokhale, and Eugenio Culurciello from Purdue’s School of Electrical and Computer Engineering and the Weldon School of Biomedical Engineering.
In the abstract, the hardware paper states:
“Snowflake, implemented on a Xilinx Zynq XC7Z045 SoC is capable of achieving a peak throughput of 128 G-ops/s and a measured throughput of 100 frames per second and 120 G-ops/s on the AlexNet CNN model, 36 frames per second and 116 Gops/s on the GoogLeNet CNN model and 17 frames per second and 122 G-ops/s on the ResNet-50 CNN model. To the best of our knowledge, Snowflake is the only implemented system capable of achieving over 91% efficiency on modern CNNs and the only implemented system with GoogLeNet and ResNet as part of the benchmark suite.”
The primary goal of the Snowflake accelerator design was computational efficiency. Efficiency and bandwidth are the two primary factors influencing accelerator throughput. The hardware paper says that the Snowflake accelerator achieves 95% computational efficiency and that it can process networks in real time. Because it is implemented on a Xilinx Zynq Z-7045, power consumption is a miserly 5W according to the software paper, well within the power budget of many embedded systems.
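One simple way to see where efficiency figures like these come from is to divide each measured throughput in the abstract by the quoted peak. The sketch below does exactly that. Defining computational efficiency as measured throughput over peak throughput is an assumption on my part (the papers may define it more carefully), and the function and variable names are illustrative.

```python
# Efficiency sketch: measured throughput divided by peak throughput (assumed definition).
PEAK_GOPS = 128  # peak throughput quoted in the hardware paper's abstract

# Measured throughput per CNN model, in G-ops/s, from the same abstract
measured_gops = {"AlexNet": 120, "GoogLeNet": 116, "ResNet-50": 122}


def efficiency(measured, peak=PEAK_GOPS):
    """Fraction of peak throughput actually achieved."""
    return measured / peak


for model, gops in measured_gops.items():
    print(f"{model}: {efficiency(gops):.1%} of peak")
```

By this measure, GoogLeNet comes in at about 91% of peak and ResNet-50 at about 95%, which lines up with the efficiency figures quoted from the paper.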
The hardware paper also states:
“Snowflake with 256 processing units was synthesized on Xilinx's Zynq XC7Z045 FPGA. At 250MHz, AlexNet achieved in 93.6 frames/s and 1.2GB/s of off-chip memory bandwidth, and 21.4 frames/s and 2.2GB/s for ResNet18.”
Here’s a block diagram of the Snowflake machine architecture from the software paper, from the micro level on the left to the macro level on the right:
There’s room for future performance improvement, notes the hardware paper:
“The Zynq XC7Z045 device has 900 MAC units. Scaling Snowflake up by using three compute clusters, we will be able to utilize 768 MAC units. Assuming an accelerator frequency of 250 MHz, Snowflake will be able to achieve a peak performance of 384 G-ops/s. Snowflake can be scaled further on larger FPGAs by increasing the number of clusters.”
This is where I point out that a Zynq Z-7100 SoC has 2020 “MAC units” (actually, DSP48E1 slices)—which is a lot more than you find on the Zynq Z-7045 SoC—and the Zynq UltraScale+ ZU15EG MPSoC has 3528 DSP48E2 slices—which is much, much larger still. If speed and throughput are what you desire in a CNN accelerator, then either of these parts would be worthy of consideration for further development.
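The scaling arithmetic in that quote is easy to reproduce. The sketch below counts each multiply-accumulate as two operations per clock, which is inferred from the quoted numbers (768 MAC units at 250MHz giving 384 G-ops/s) rather than stated outright. The figures for the larger devices are purely hypothetical upper bounds: they assume every DSP slice becomes a MAC unit and that the design still closes timing at 250MHz, neither of which the papers claim.

```python
def peak_gops(mac_units, freq_mhz=250, ops_per_mac=2):
    """Peak G-ops/s, counting each multiply-accumulate as two operations."""
    return mac_units * ops_per_mac * freq_mhz / 1000


# Configurations described in the papers
print("256 MACs (one cluster):    ", peak_gops(256), "G-ops/s")
print("768 MACs (three clusters): ", peak_gops(768), "G-ops/s")

# Hypothetical ceilings if every DSP slice on a larger device served as a MAC unit
print("Zynq Z-7100, 2020 DSP slices:            ", peak_gops(2020), "G-ops/s")
print("Zynq UltraScale+ ZU15EG, 3528 DSP slices:", peak_gops(3528), "G-ops/s")
```

Real designs rarely use every DSP slice, so read the last two lines as ceilings, not predictions.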
This week, Everspin launched its line of MRAM-based nvNITRO NVMe Storage Accelerator cards with an incredible performance spec: up to 1.46 million IOPS for random 4Kbyte mixed 70/30 read/write operations. In the world of IOPS, that’s very fast. In fact, it’s roughly 3x faster than an Intel P4800X Optane SSD card, which is spec’ed at up to 500K IOPS for random 4Kbyte mixed 70/30 read/write operations. Multiple factors contribute to the nvNITRO Storage Accelerator’s speed, including Everspin’s new 1Gbit Spin Torque Magnetoresistive RAM (ST-MRAM) with high-speed, DDR4, SDRAM-compatible I/O; a high-performance, MRAM-specific memory controller IP block compatible with NVMe 1.1+; and the Xilinx Kintex UltraScale KU060 FPGA that implements the MRAM controller and the board’s PCIe Gen3 x8 host interface. Everspin’s nvNITRO NVMe cards will ship in Q4 of 2017 and will be available in 1 and 2Gbyte capacities.
Everspin’s nvNITRO NVMe card
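To put those IOPS numbers in perspective, the sketch below converts them to data throughput and compares the result with the raw bandwidth of a PCIe Gen3 x8 link. The IOPS figures come from the specs quoted above. Reading "4Kbyte" as 4096 bytes is an assumption, and the PCIe figure uses the standard Gen3 signaling rate of 8 GT/s per lane with 128b/130b encoding while ignoring packet and protocol overhead, so treat it as a ceiling.

```python
IOPS_NVNITRO = 1_460_000   # Everspin nvNITRO spec quoted above
IOPS_P4800X = 500_000      # Intel P4800X Optane spec quoted above
BLOCK_BYTES = 4 * 1024     # assumption: "4Kbyte" means 4096 bytes


def throughput_gbytes_per_s(iops, block_bytes=BLOCK_BYTES):
    """Data moved per second, in decimal gigabytes."""
    return iops * block_bytes / 1e9


# PCIe Gen3: 8 GT/s per lane, 128b/130b encoding, 8 lanes, 8 bits per byte.
# Packet and protocol overhead are ignored, so this is an upper bound.
pcie_gen3_x8_gbytes_per_s = 8e9 * (128 / 130) * 8 / 8 / 1e9

nvnitro_gbytes_per_s = throughput_gbytes_per_s(IOPS_NVNITRO)
speedup = IOPS_NVNITRO / IOPS_P4800X

print(f"nvNITRO at 1.46M IOPS: {nvnitro_gbytes_per_s:.2f} GB/s")
print(f"PCIe Gen3 x8 ceiling:  {pcie_gen3_x8_gbytes_per_s:.2f} GB/s")
print(f"IOPS ratio vs P4800X:  {speedup:.2f}x")
```

Under these assumptions the card's peak random-I/O traffic sits comfortably below what the host interface can carry, and the IOPS ratio works out to just under 3x.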
MRAM delivers several significant advantages over other memory technologies used to implement NVMe cards. It’s nonvolatile, so no backup power is needed. In addition, ST-MRAM has very high endurance (see chart below), so the nvNITRO card accommodates unlimited drive writes per day, eliminates the need for wear-leveling algorithms that steal memory cycles in NAND-Flash storage, and exhibits no degradation in read/write performance over time.
Everspin’s ST-MRAM has low write times and high write endurance
So much for the chart’s Y axis. You can see from the X axis that Everspin’s ST-MRAM has a very fast write speed—it’s about as fast as DRAM—which is one reason that the nvNITRO Storage Accelerator has such fast read/write performance.
There’s one more line in the Everspin nvNITRO NVMe Storage Accelerator’s data sheet that’s worth mentioning:
“Customer-defined features using own RTL with programmable FPGA”
There appears to be room for your own custom code in that Kintex UltraScale KU060 FPGA that implements the PCIe interface and ST-MRAM controller on the nvNITRO Storage Accelerator card. You can add your own special sauce to the design with no incremental BOM cost. Try doing that with an ASSP!