

Javier Alejandro Varela and Professor Dr.-Ing. Norbert Wehn of the University of Kaiserslautern’s Microelectronic Systems Design Research Group have just published a White Paper titled “Running Financial Risk Management Applications on FPGA in the Amazon Cloud.” The last sentence in the White Paper’s abstract reads:



“…our FPGA implementation achieves a 10x speedup on the compute intensive part of the code, compared to an optimized parallel implementation on multicore CPU, and it delivers a 3.5x speedup at application level for the given setup.”



The University of Kaiserslautern’s Microelectronic Systems Design Research Group has been accelerating financial applications with FPGAs attached to high-performance computing systems since 2010. That research has recently migrated to cloud-based computing systems, including Amazon’s EC2 F1 instance, which is based on Xilinx Virtex UltraScale+ FPGAs. The results in this White Paper are based on OpenCL code and the Xilinx SDAccel development environment.
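The relationship between the 10x speedup on the compute-intensive kernel and the 3.5x speedup at application level follows Amdahl’s law. As an illustrative sketch (not from the White Paper), here’s how to recover the implied fraction of runtime spent in the accelerated kernel, assuming the remainder of the application runs unaccelerated:

```python
def amdahl_speedup(f, kernel_speedup):
    """Application-level speedup when a fraction f of runtime is accelerated."""
    return 1.0 / ((1.0 - f) + f / kernel_speedup)

def accelerated_fraction(app_speedup, kernel_speedup):
    """Invert Amdahl's law: 1/app = (1 - f) + f/kernel, solved for f."""
    return (1.0 - 1.0 / app_speedup) / (1.0 - 1.0 / kernel_speedup)

f = accelerated_fraction(3.5, 10.0)       # the two speedups quoted in the abstract
print(round(f, 3))                        # ~0.794 of runtime sits in the kernel
print(round(amdahl_speedup(f, 10.0), 2))  # recovers the 3.5x application figure
```

In other words, the quoted numbers imply roughly 80% of the application’s runtime was in the accelerated portion of the code.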




For more information about Amazon’s AWS EC2 F1 instance in Xcell Daily, see:







Looking to turbocharge Amazon’s Alexa or Google Home? Aaware’s Zynq-based kit is the tool you need.

by Xilinx Employee ‎01-04-2018 02:13 PM - edited ‎01-05-2018 12:34 PM (4,753 Views)


How do you get reliable, far-field voice recognition; robust, directional voice recognition in the presence of strong background noise; and multiple wake words for voice-based cloud services such as Amazon’s Alexa and Google Home? Aaware has an answer with its $199, Zynq-based Far-Field Development Platform. (See “13 MEMS microphones plus a Zynq SoC gives services like Amazon’s Alexa and Google Home far-field voice recognition clarity.”) A new Powered by Xilinx Demo Shorts video gives you additional info and another demo. (That’s a Zynq-based krtkl snickerdoodle processing board in the video.)





Looking for a quick explanation of the FPGA-accelerated AWS EC2 F1 instance? Here’s a 3-minute video

by Xilinx Employee ‎12-19-2017 10:45 AM - edited ‎12-19-2017 10:49 AM (8,394 Views)


The AWS EC2 F1 compute instance allows you to create custom hardware accelerators for your application using cloud-based server hardware that incorporates multiple Xilinx Virtex UltraScale+ VU9P FPGAs. Several companies now list applications for FPGA-accelerated AWS EC2 F1 instances in the AWS Marketplace in application categories including:



  • Video processing
  • Data analytics
  • Genomics
  • Machine learning



Here’s a 3-minute video overview recorded at the recent SC17 conference in Denver:







High-frequency trading is all about speed, which explains why Aldec’s new reconfigurable HES-HPC-HFT-XCVU9P PCIe card for high-frequency trading (HFT) apps is powered by a Xilinx Virtex UltraScale+ VU9P FPGA. That’s about as fast as you can get with any sort of reprogrammable or reconfigurable technology. The Virtex UltraScale+ FPGA directly connects to all of the board’s critical, high-speed interface ports—Ethernet, QSFP, and PCIe x16—and implements the communications protocols for those standard interfaces as well as the memory control and interface for the board’s three QDR-II+ memory modules. Consequently, there’s no time-consuming chip-to-chip interconnection. Picoseconds count in HFT applications, so the FPGA’s ability to implement all of the card’s logic is a real competitive advantage for Aldec. The new FPGA accelerator is extremely useful for implementing time-sensitive trading strategies such as Market Making, Statistical Arbitrage, and Algorithmic Trading and is compatible with 1U and larger trading systems.



Aldec HES-HPC-HFT-XCVU9P PCIe card .jpg 



Aldec’s HES-HPC-HFT-XCVU9P PCIe card for high-frequency trading apps—Powered by a Xilinx Virtex UltraScale+ FPGA





Here’s a block diagram of the board:




Aldec HES-HPC-HFT-XCVU9P PCIe card block diagram.jpg



Aldec’s HES-HPC-HFT-XCVU9P PCIe card block diagram




Please contact Aldec directly for more information about the HES-HPC-HFT-XCVU9P PCIe card.




An article titled “Living on the Edge” by Farhad Fallah, one of Aldec’s Application Engineers, on the New Electronics Web site recently caught my eye because it succinctly describes why FPGAs are so darn useful for many high-performance, edge-computing applications. Here’s an example from the article:


“The benefits of Cloud Computing are many-fold… However, there are a few disadvantages to the cloud too, the biggest of which is that no provider can guarantee 100% availability.”


There’s always going to be some delay when you ship data to the cloud for processing. You will need to wait for the answer. The article continues:


“Edge processing needs to be high-performance and in this respect an FPGA can perform several different tasks in parallel.”


The article goes on to describe a 4-camera ADAS demo based on Aldec’s TySOM-2-7Z100 prototyping board, shown at this year’s Embedded Vision Summit in Santa Clara, California. (The TySOM-2-7Z100 board is based on the Xilinx Zynq Z-7100 SoC, the largest member of the Zynq SoC family.)





Aldec TySOM-2-Z100 Prototyping Board.jpg 


Aldec’s TySOM-2-7Z100 prototyping board




Then the article describes the significant performance boost that the Zynq SoC’s FPGA fabric provides:


“The processing was shared between a dual-core ARM Cortex-A9 processor and FPGA logic (both of which reside within the Zynq device) and began with frame grabbing images from the cameras and applying an edge detection algorithm (‘edge’ here in the sense of physical edges, such as objects, lane markings etc.). This is a computational-intensive task because of the pixel-level computations being applied (i.e. more than 2 million pixels). To perform this task on the ARM CPU a frame rate of only 3 per second could have been realized, whereas in the FPGA 27.5 fps was achieved.”


That’s nearly a 10x performance boost thanks to the on-chip FPGA fabric. Could your application benefit similarly?
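To make the workload concrete, here’s a minimal pure-Python Sobel edge detector of the kind the article describes. The image and implementation are illustrative only; the actual demo ran this pixel-level computation in the Zynq SoC’s FPGA fabric:

```python
# Minimal Sobel edge detector over a grayscale image (list of lists of ints).
# Illustrative sketch only -- the real ADAS demo runs equivalent per-pixel
# logic in FPGA fabric at 27.5 fps.

GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal-gradient kernel
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical-gradient kernel

def sobel(image):
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(GX[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(GY[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = abs(gx) + abs(gy)   # cheap gradient-magnitude estimate
    return out

# A vertical edge: dark left half, bright right half.
img = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = sobel(img)
print(edges[2])  # strongest response where the brightness changes
```

Every output pixel needs nine multiply-accumulates per kernel, which is exactly the kind of regular, data-parallel arithmetic that maps well onto FPGA fabric.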





Eideticom’s NoLoad (NVMe Offload) platform uses FPGA-based acceleration on PCIe FPGA cards and in cloud-based FPGA servers to provide storage and compute acceleration through standardized NVMe and NVMe over Fabrics protocols. The NoLoad product itself is a set of IP that implements the NoLoad accelerator. The company is offering Hardware Eval Kits that target two FPGA-based PCIe cards: Nallatech’s 250S FlashGT+ card, based on a Xilinx Kintex UltraScale+ KU15P FPGA, and the Alpha Data ADM-PCIE-9V3, based on a Xilinx Virtex UltraScale+ VU3P FPGA.


The NoLoad platform allows networked systems to share FPGA acceleration resources across the network fabric. For example, Eideticom offers an FPGA-accelerated Reed-Solomon Erasure Coding engine that can supply codes to any storage facility on the network.
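Full Reed-Solomon coding requires Galois-field arithmetic, but the core idea of erasure coding is easy to sketch. This illustrative Python example (not Eideticom’s implementation) uses single-parity XOR coding, which can rebuild any one lost data block from the survivors:

```python
def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)          # stored alongside the data blocks

# Simulate losing block 1, then rebuild it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt)  # b'EFGH'
```

Reed-Solomon generalizes this to tolerate multiple simultaneous block failures, at the cost of much heavier arithmetic, which is why offloading it to an FPGA engine pays off.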


Here’s a 6-minute video that explains the Eideticom NoLoad offering with a demo from the Xilinx booth at the recent SC17 conference:






For more information about the Nallatech 250S+ SSD accelerator, see “Nallatech 250S+ SSD accelerator boosts storage speed of four M.2 NVMe drives using Kintex UltraScale+ FPGA.”



For more information about the Alpha Data ADM-PCIE-9V3, see “Blazing new Alpha Data PCIe Accelerator card sports Virtex UltraScale+ VU3P FPGA, 4x 100GbE ports, 16Gbytes of DDR4 SDRAM.”


There was a live AWS EC2 F1 application-acceleration Developer’s Workshop at Amazon’s re:Invent 2017 last month. If you couldn’t make it, don’t worry: it’s now online, and you can run through it in about two hours (I’m told). This workshop teaches you how to develop accelerated applications using the AWS F1 OpenCL flow and the Xilinx SDAccel development environment for the AWS EC2 F1 platform, which uses Xilinx Virtex UltraScale+ FPGAs as high-performance hardware accelerators.


The architecture of the AWS EC2 F1 platform looks like this:



AWS EC2 F1 Architecture.jpg 


AWS EC2 F1 Architecture




This developer workshop is divided into four modules. Amazon recommends that you complete each module before proceeding to the next.


  1. Connecting to your F1 instance 
    You will start an EC2 F1 instance based on the FPGA developer AMI and connect to it using a remote desktop client. Once connected, you will confirm you can execute a simple application on F1.
  2. Experiencing F1 acceleration 
    AWS F1 instances are ideal to accelerate complex workloads. In this module you will experience the potential of F1 by using FFmpeg to run both a software implementation and an F1-optimized implementation of an H.265/HEVC encoder.
  3. Developing and optimizing F1 applications with SDAccel 
    You will use the SDAccel development environment to create, profile, and optimize an F1 accelerator. The workshop focuses on the Inverse Discrete Cosine Transform (IDCT), a compute-intensive function at the heart of video codecs.
  4. Wrap-up and next steps 
    Explore next steps to continue your F1 experience after the re:Invent 2017 Developer Workshop.
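To get a feel for what the module-3 kernel computes, here’s a naive Python reference for the 1D inverse DCT. The workshop’s SDAccel version is written in OpenCL and heavily optimized, so treat this only as a functional sketch:

```python
import math

def dct_1d(samples):
    """Naive O(N^2) forward DCT-II (orthonormal normalization)."""
    n = len(samples)
    out = []
    for u in range(n):
        c = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        out.append(c * sum(samples[x] * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                           for x in range(n)))
    return out

def idct_1d(coeffs):
    """Naive O(N^2) inverse DCT -- the kernel the workshop accelerates.
    Hardware implementations use fast fixed-point variants instead."""
    n = len(coeffs)
    out = []
    for x in range(n):
        s = coeffs[0] * math.sqrt(1.0 / n)
        for u in range(1, n):
            s += coeffs[u] * math.sqrt(2.0 / n) * \
                 math.cos(math.pi * (2 * x + 1) * u / (2 * n))
        out.append(s)
    return out

block = [52, 55, 61, 66, 70, 61, 64, 73]     # one row of pixel samples
recovered = idct_1d(dct_1d(block))
print([round(v) for v in recovered])          # round-trips to the input
```

In a real codec this runs on every 8x8 block of every frame, which is why it dominates decode time and makes such an attractive acceleration target.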



Access the online AWS EC2 F1 Developer’s Workshop here.



For more information about Amazon’s AWS EC2 F1 instance in Xcell Daily, see:









Accolade’s new ANIC-200Kq Flow Classification and Filtering Adapter brings packet processing, storage optimization, and scalable Flow Classification at 100GbE through two QSFP28 optical cages. Like the company’s ANIC-200Ku Lossless Packet Capture adapter introduced last year, the ANIC-200Kq board is based on a Xilinx UltraScale FPGA so it’s able to run a variety of line-speed packet-processing algorithms including the company’s new “Flow Shunting” feature.




Accolade ANIC-200Kq Flow Classification and Filtering Adapter.jpg 


Closeup view of the QSFP28 ports on Accolade’s ANIC-200Kq Flow Classification and Filtering Adapter




The new ANIC-200Kq adapter differs from the older ANIC-200Ku adapter in its optical I/O ports: the ANIC-200Kq incorporates two QSFP28 optical cages, while the ANIC-200Ku incorporates two CFP2 cages. Both the QSFP28 and CFP2 interfaces accept SR4 and LR4 modules. The QSFP28 cages put Accolade’s ANIC-200Kq adapter squarely in the 25, 40, and 100GbE arenas, and because QSFP28 is fast becoming the universal form factor for new data center installations, the new adapter gives data center architects additional flexibility when designing their optical networks.



For more information in Xcell Daily about Accolade’s fast Flow Classification and Filtering Adapters, see:







The upcoming Xilinx Developer Forum in Frankfurt, Germany on January 9 will feature a hands-on Developer Lab titled “Accelerating Applications with FPGAs on AWS.” During this afternoon session, you’ll gain valuable hands-on experience with the FPGA-accelerated AWS EC2 F1 instance and hear from a special guest speaker from Amazon Web Services. Attendance is limited on a first-come, first-served basis, so you must register here.



For more information about Amazon’s AWS EC2 F1 instance in Xcell Daily, see:










Netcope’s NP4 is a cloud-based programming tool that lets you specify networking behavior for the company’s high-performance, programmable Smart NICs (based on Xilinx Virtex UltraScale+ and Virtex-7 FPGAs) using declarations written in P4, a high-level, network-specific programming language. The programming process involves the following steps:


  1. Write the P4 code.
  2. Upload your code to the NP4 cloud.
  3. Wait for the application to autonomously translate your P4 code into VHDL and synthesize the FPGA configuration.
  4. Download the firmware bitstream and upload it to the FPGA on your Netcope NIC.


Netcope calls NP4 its “Firmware as a Service” offering. If you are interested in trying NP4, you can request free trial access to the cloud service here.



Netcope NFB-200G2QL Programmable NIC.jpg


Netcope Technologies’ NFB-200G2QL 200G Ethernet Smart NIC based on a Virtex UltraScale+ FPGA




For more information about Netcope and P4 in Xcell Daily, see:




For more information about Netcope’s FPGA-based NICs in Xcell Daily, see:








Karl Freund’s article titled “Amazon AWS And Xilinx: A Progress Report” appeared on Forbes.com today. Freund is a Moor Insights & Strategy Senior Analyst for Deep Learning and High-Performance Computing (HPC). He describes Amazon’s FPGA-based AWS EC2 F1 instance offering this way:



“…the cloud leader [Amazon] is laying the foundation to simplify FPGA adoption by creating a marketplace for accelerated applications built on Xilinx [Virtex UltraScale+] FPGAs.”



Freund then discusses what’s happened since Amazon announced its AWS EC2 F1 instance a year ago. Here are his seven highlights:


  1. "AWS has now deployed the F1 instances to four regions, with more to come…”

  2. “To support the Asian markets, where AWS has limited presence, Xilinx has won over support from the Alibaba and Huawei cloud operations.” (Well, that one’s not really about Amazon, but let’s keep it in anyway, shall we?)

  3. “Xilinx has launched a global developer outreach program, and has already trained over 1,000 developers [on the use of AWS EC2 F1] at three Xilinx Developer Forums—with more to come.”

  4. “Xilinx has recently released a Machine Learning (ML) Amazon Machine Instance (AMI), bringing the Xilinx Reconfigurable Acceleration Stack (announced last year) for ML Inference to the AWS cloud.”

  5. “Xilinx partner Edico Genome recently achieved a Guinness World Record for decoding human genomes, analyzing 1000 full human genomes on 1000 F1 instances in 2 hours, 25 minutes; a remarkable 100-fold improvement in performance…”

  6. “AWS has added support for Xilinx SDAccel programming environment to all AWS regions for solution developers…”

  7. “Xilinx partner Ryft has built an impressive analytic platform on F1, enabling near-real-time analytics by eliminating data preparation bottlenecks…”



The rest of Freund’s article discusses Ryft’s AWS Marketplace offering in more detail and concludes with this:



“…at least for now, Amazon.com, Huawei, Alibaba, Baidu, and Tencent have all voted for Xilinx.”




For extensive Xcell Daily coverage about the AWS EC2 F1 instance, see:















Titan IC’s newest addition to the AWS Marketplace based on the FPGA-accelerated AWS EC2 F1 instance is the Hyperion F1 10G RegEx File Scan, a high-performance file-search and file-scanning application that can process 1Tbyte of data with as many as 800,000 user-defined regular expressions in less than 15 minutes. The Hyperion F1 10G RegEx File Scan application leverages the processing power of the AWS EC2 F1 instance’s multiple Xilinx Virtex UltraScale+ VU9P FPGAs to speed the scanning of files using complex pattern and string matching, attaining a throughput as high as 10Gbps.
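A quick sanity check of those numbers (illustrative arithmetic, not from Titan IC) confirms that 1Tbyte at 10Gbps does indeed fit inside the 15-minute claim:

```python
# Scanning 1 TB of data at a sustained 10 Gb/s throughput.
tb_bits = 1e12 * 8          # 1 terabyte expressed in bits
rate_bps = 10e9             # 10 Gb/s scan throughput
minutes = tb_bits / rate_bps / 60
print(round(minutes, 1))    # ~13.3 minutes, under the quoted 15
```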


Here’s a block diagram showing the Hyperion F1 10G RegEx File Scan application running in an AWS EC2 f1.2xlarge instance:




Titan IC Hyperion F1 10G RegEx File Scan on AWS EC2 F1.jpg 




You can get more details about this application here in the AWS Marketplace.





For more information about Amazon’s AWS EC2 F1 instance in Xcell Daily, see:









This month at SC17 in Denver, Nallatech was showing its new 250S+ high-performance SSD-accelerator PCIe card, which uses a Xilinx Kintex UltraScale+ KU15P FPGA to implement an NVMe SSD controller/accelerator and the board’s PCIe Gen4 x8 interface. You can plug SSDs or NVMe cables into the card’s four M.2 NVMe slots, so one card can control as many as four on- or off-board drives. The card comes in 3.84Tbyte and 6.4Tbyte versions with on-board M.2 NVMe SSDs and can control a drive array as large as 25.6Tbytes using NVMe cables.



Nallatech 250S NVMe accelerator card.jpg 


Nallatech 250S+ NVMe SSD accelerator card based on a Xilinx Kintex UltraScale+ FPGA





Nallatech 250S NVMe accelerator card with NVMe cables.jpg 


Nallatech 250S+ NVMe SSD accelerator card with NVMe cables




Here are the card’s specs:



Nallatech 250S NVMe accelerator card specs.jpg 



And here’s a block diagram of the Nallatech 250S+ NVMe accelerator card:




Nallatech 250S NVMe accelerator card block diagram.jpg 



As you can see, the Kintex UltraScale+ FPGA implements the entire logic design on the card, driving the PCIe connector, managing the four attached NVMe SSDs, directly controlling and operating the card’s on-board DDR4-2400 SDRAM cache, and even implementing the card's JTAG interface.



For more information about the Nallatech 250S+ NVMe accelerator card, please contact Nallatech directly.





When Xcell Daily last looked at Netcope Technologies’ NFB-200G2QL FPGA-based 200G Ethernet Smart NIC with its cool NASA-scoop heat sink in August, it had broken industry records for 100GbE performance with a throughput of 148.8M packets/sec on DPDK (the Data Plane Development Kit)—the theoretical maximum for 64-byte packets over 100GbE. (See “Netcope breaks 100GbE record @ 148.8M packets/sec (the theoretical max) with NFB-100G2Q FPGA-based NIC, then goes faster at 200GbE.”) At the time, all Netcope would say was that the NFB-200G2QL PCIe card was “based on a Xilinx Virtex UltraScale+ FPGA.” Well, Netcope was at SC17 in Denver earlier this month and has been expanding the capabilities of the board. It’s now capable of sending or receiving packets at a 200Gbps line rate with zero packet loss, still using “the latest Xilinx FPGA chip Virtex UltraScale+,” which I was told at Netcope’s SC17 booth is a Xilinx Virtex UltraScale+ VU7P FPGA.
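The 148.8M packets/sec ceiling is easy to verify: on the wire, every Ethernet frame carries 20 bytes of overhead beyond the frame itself (the 8-byte preamble/SFD plus the 12-byte minimum inter-frame gap). A quick back-of-envelope check:

```python
# Why 148.8 Mpps is the 100GbE line-rate ceiling for minimum-size packets.
LINE_RATE = 100e9           # bits per second
FRAME = 64                  # minimum Ethernet frame, bytes
OVERHEAD = 8 + 12           # preamble/SFD + inter-frame gap, bytes

pps = LINE_RATE / ((FRAME + OVERHEAD) * 8)
print(round(pps / 1e6, 1))  # 148.8 million packets per second
```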




Netcope NFB-200G2QL Programmable NIC.jpg 



Netcope Technologies’ NFB-200G2QL 200G Ethernet Smart NIC based on a Virtex UltraScale+ FPGA




One trick to doing this: using two PCIe Gen3 x16 slots to get packets to/from the server CPU(s). Why two slots? Because Netcope discovered that its 200G Smart NIC PCIe card could transfer about 110Gbps worth of packets over one PCIe Gen3 x16 slot and the theoretical maximum traffic throughput for one such slot is 128Gbps. That means 200Gbps will not pass through the eye of this 1-slot needle. Hence the need for two PCIe slots, which will carry the 200Gbps worth of packets with a comfortable margin. Where’s that second PCIe Gen3 interface coming from? Over a cable attached to the Smart NIC board and implemented in the board’s very same Xilinx Virtex UltraScale+ VU7P FPGA, of course. The company has written a White Paper describing this technique titled “Overcoming the Bandwidth Limitations of PCI Express.”
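The two-slot reasoning can be sketched numerically using the figures quoted above (illustrative arithmetic only):

```python
import math

# One PCIe Gen3 x16 slot tops out at 128 Gb/s raw, and Netcope measured
# roughly 110 Gb/s of achievable packet throughput per slot in practice.
GEN3_X16_RAW = 128e9        # theoretical slot bandwidth, bits/s
OBSERVED = 110e9            # achievable packet throughput per slot, bits/s
TARGET = 200e9              # 200GbE line rate, bits/s

slots = math.ceil(TARGET / OBSERVED)
print(slots)                                 # 2 slots needed
print(round(slots * OBSERVED / TARGET, 2))   # 1.1x headroom at 200 Gb/s
```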


And yes, there’s a short video showing this Netcope sorcery as well:







In the short video below, Xilinx Product Marketing Manager Kamran Khan demonstrates GoogleNet running at 10K images/sec on Amazon’s AWS EC2 F1 using eight Virtex UltraScale+ FPGAs in an f1.16xlarge configuration. The same video also shows DeepDetect, an open-source deep-learning app, running in real time and classifying images from a Webcam’s live video stream.





For more information about Amazon’s AWS EC2 F1 instance in Xcell Daily, see:









Last week at SC17 in Denver, BittWare announced its TeraBox 1432D 1U FPGA server box, a modified Dell PowerEdge C4130 with a new front panel that exposes 32 100GbE QSFP ports from as many as four of the company’s FPGA accelerator cards. (That’s a total front-panel I/O bandwidth of 3.2Tbps!) The new 1U box doubles the I/O rack density with respect to the company’s previous 4U offering.



BittWare TeraBox 1432D.jpg 



BittWare’s TeraBox 1432D 1U FPGA Server Box exposes 32 100GbE QSFP ports on its front panel





The TeraBox 1432D server box can be outfitted with four of the company’s XUPP3R boards, which are based on Xilinx Virtex UltraScale+ FPGAs (VU7P, VU9P, or VU11P) and can each be fitted with eight QSFPs: four QSFP cages on the board itself and four more on a daughter card connected to the XUPP3R board via a cable to an FMC connector. This configuration underscores the extreme I/O density and capability of Virtex UltraScale+ FPGAs.




BittWare TeraBox 1432D Detail.jpg


BittWare TeraBox 1432D interior detail



The new BittWare TeraBox 1432D will be available in Q1 2018 with the XUPP3R FPGA accelerator board. According to the announcement, BittWare will also release the Virtex UltraScale+ VU13P-based XUPVV4 board in 2018; it, too, will fit in the TeraBox 1432D.


Here’s a 3-minute video from SC17 with a walkthrough of the TeraBox 1432D 1U FPGA server box by BittWare's GM and VP of Network Products Craig Lund:







According to an announcement released today:


“Xilinx, Inc. (XLNX) and Huawei Technologies Co., Ltd. today jointly announced the North American debut of the Huawei FPGA Accelerated Cloud Server (FACS) platform at SC17. Powered by Xilinx high performance Virtex UltraScale+ FPGAs, the FACS platform is differentiated in the marketplace today.


“Launched at the Huawei Connect 2017 event, the Huawei Cloud provides FACS FP1 instances as part of its Elastic Compute Service. These instances enable users to develop, deploy, and publish new FPGA-based services and applications through easy-to-use development kits and cloud-based EDA verification services. Both expert hardware developers and high-level language users benefit from FP1 tailored instances suited to each development flow.


"...The FP1 demonstrations feature Xilinx technology which provides a 10-100x speed-up for compute intensive cloud applications such as data analytics, genomics, video processing, and machine learning. Huawei FP1 instances are equipped with up to eight Virtex UltraScale+ VU9P FPGAs and can be configured in a 300G mesh topology optimized for performance at scale."



Huawei’s FP1 FPGA accelerated cloud service is available on the Huawei Public Cloud today. To register for the public beta, click here.




One of the several demos in the Xilinx booth during this week’s SC17 conference in Denver was a working demo of the CCIX (Cache Coherent Interconnection for Accelerators) protocol, which simplifies the design of offload accelerators for hyperscale data centers by providing low-latency, high-bandwidth, fully coherent access to server memory.  The demo shows L2 switching acceleration using an FPGA to offload a host processor. The CCIX protocol manages a hardware cache in the FPGA, which is coherently linked to the host processor’s memory. Cache updates take place in the background without software intervention through the CCIX protocol. If cache entries are invalidated in the host memory, the CCIX protocol automatically invalidates the corresponding cache entries in the FPGA’s memory.
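Here’s a toy software model (my sketch, not CCIX itself and not the demo code) of the invalidation behavior described above: host writes automatically invalidate matching lines in an attached device cache, so the accelerator never reads stale data:

```python
# Toy model of hardware cache coherency: a device-side cache mirrors host
# memory, and host writes invalidate matching device entries automatically.
# In real CCIX this all happens in hardware, with no software intervention.

class DeviceCache:
    def __init__(self, host):
        self.host = host
        self.lines = {}
        host.attach(self)

    def read(self, addr):
        if addr not in self.lines:        # miss: fetch coherently from host
            self.lines[addr] = self.host.mem[addr]
        return self.lines[addr]

    def invalidate(self, addr):
        self.lines.pop(addr, None)

class HostMemory:
    def __init__(self):
        self.mem = {}
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def write(self, addr, value):
        self.mem[addr] = value
        for cache in self.caches:         # snoop: invalidate stale copies
            cache.invalidate(addr)

host = HostMemory()
host.write(0x100, "flow-table-v1")
dev = DeviceCache(host)
print(dev.read(0x100))      # line now cached on the device
host.write(0x100, "flow-table-v2")
print(dev.read(0x100))      # stale line was invalidated; refetched coherently
```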


Senior Staff Design Engineer Sunita Jain gave Xcell Daily a 3-minute explanation of the demo, which shows a 4.5x improvement in packet transfers using CCIX versus software-controlled transfers:






There’s one thing to note about this demo. Although the CCIX standard calls for using the PCIe protocol as a transport layer at 25Gbps/lane, which is faster than PCIe Gen4, this demo only demonstrates the CCIX protocol and is using the significantly slower PCIe Gen1 for the transport layer.


For more information about the CCIX protocol as discussed in Xcell Daily, see:









This week at SC17 in Denver, Everspin was showing some impressive performance numbers for the MRAM-based nvNITRO NVMe Accelerator Card that the company introduced earlier this year. As discussed in a previous Xcell Daily blog post, the nvNITRO NVMe Accelerator Card is based on the company’s non-volatile ST-MRAM chips and a Xilinx Kintex UltraScale KU060 FPGA implements the MRAM controller and the board’s PCIe Gen3 x8 host interface. (See “Everspin’s new MRAM-based nvNITRO NVMe card delivers Optane-crushing 1.46 million IOPS (4Kbyte, mixed 70/30 read/write).”)


The target application of interest at SC17 was high-frequency trading, where every microsecond you can shave off of system response times directly adds dollars to the bottom line, so the ROI on a product like the nvNITRO NVMe Accelerator Card that cuts transaction times is easy to calculate.




Everspin nvNITRO NVMe card.jpg 



Everspin MRAM-based nvNITRO NVMe Accelerator Card




It turns out that a common thread, and one of the bottlenecks, in high-frequency trading applications is the Apache Log4j event-logging utility. Incoming packets arrive at a variable rate (the traffic is bursty), and the logging utility must keep up with the highest possible burst rates to ensure that every event is logged. Piping these events directly into SSD storage places a low ceiling on the burst rate a system can handle. Inserting an nvNITRO NVMe Accelerator Card as a circular buffer in series with the incoming event stream, as shown below, boosts Log4j performance by 9x.
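The circular-buffer idea is easy to model in software. In this illustrative Python sketch (the rates and capacity are made up, and the real nvNITRO buffer works in MRAM hardware), a fast ring absorbs a burst that a slower backing store drains over time:

```python
from collections import deque

# A fast ring buffer in front of a slow log store absorbs traffic bursts
# so that no events are dropped. All numbers here are illustrative.

class RingBuffer:
    def __init__(self, capacity):
        self.q = deque()
        self.capacity = capacity
        self.dropped = 0

    def push(self, event):
        if len(self.q) >= self.capacity:
            self.dropped += 1        # a lossless logger must never get here
        else:
            self.q.append(event)

    def drain(self, n):
        """Move up to n events to the slow backing store per time step."""
        return [self.q.popleft() for _ in range(min(n, len(self.q)))]

ring = RingBuffer(capacity=1000)
stored = []
# Bursty arrivals: 300 events/tick for 3 ticks, then idle.
# The slow store drains only 100 events/tick.
for tick in range(10):
    arrivals = 300 if tick < 3 else 0
    for i in range(arrivals):
        ring.push((tick, i))
    stored.extend(ring.drain(100))

print(ring.dropped)    # 0 -- the ring absorbed the burst
print(len(stored))     # 900 events safely logged
```

Without the ring, anything above the 100-events/tick drain rate would be lost during the burst; with it, the system only needs enough buffer capacity to cover the burst's excess.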




Everspin nvNITRO Circular Buffer.jpg 




Proof of efficacy appears in the chart below, which shows the much lower latency and much better determinism provided by the nvNITRO card:




Everspin nvNITRO Latency Reduction.jpg 




One more thing of note: As you can see by one of the labels on the board in the photo above, Everspin’s nvNITRO card is now available as Smart Modular Technologies’ MRAM NVM Express Accelerator Card. Click here for more information.




Ryft is one of several companies now offering FPGA-accelerated applications based on Amazon’s AWS EC2 F1 instance. Ryft was at SC17 in Denver this week with a sophisticated, cloud-based data-analytics demo built on machine learning and deep learning. The demo classified 50,000 images from one data file using a neural network, merged the classified image files with log data from another file to create a super metadata file, and then provided fast image retrieval using many criteria, including image classification, a watch-list match (“look for a gun” or “look for a truck”), or geographic location using the Google Earth database. The entire demo made use of geographically separated servers containing the files, in conjunction with Amazon’s AWS Cloud. The point of the demo was to show Ryft’s ability to provide “FPGAs as a Service” (FaaS) in an easy-to-use manner with any neural network of your choice, any framework (Caffe, TensorFlow, MXNet), and the popular RESTful API.


This was a complex, live demo and it took Ryft’s VP of Products Bill Dentinger six minutes to walk me through the entire thing, even moving as quickly as possible. Here’s the 6-minute video of Bill giving a very clear explanation of the demo details:





Note: Ryft does a lot of work with US government agencies and as of November 15 (yesterday), Amazon’s AWS EC2 F1 instance based on Xilinx Virtex UltraScale+ FPGAs is available on GovCloud. (See “Amazon’s FPGA-accelerated AWS EC2 F1 instance now available on Amazon’s GovCloud—as of today.”)


Amazon’s FPGA-accelerated AWS EC2 F1 instance now available on Amazon’s GovCloud—as of today

by Xilinx Employee ‎11-15-2017 02:46 PM - edited ‎11-15-2017 02:54 PM (10,771 Views)


“Amazon EC2 F1 instances are now available in the AWS GovCloud (US) region.” Amazon posted the news on its AWS Web site today, and Amazon’s Senior Director of Business Development and Product, Gadi Hutt, announced it during his introductory speech at a special half-day Amazon AWS EC2 F1 instance dev lab held at SC17 in Denver the same morning. According to the Amazon Web page, “With this launch, F1 instances are now available in four AWS regions, specifically US East (N. Virginia), US West (Oregon), EU (Ireland) and AWS GovCloud (US).”


Nearly 100 developers attended the lab and listened to Hutt’s presentation along with two AWS F1 instance customers, Ryft and NGCodec. The presentations were followed by a 2-hour hands-on lab.



Amazons Gadi Hutt presents to an AWS EC2 F1 Lab at SC17.jpg


Amazon's Gadi Hutt presents to an AWS EC2 F1 hands-on lab at SC17




The Amazon EC2 F1 compute instance allows you to create custom hardware accelerators for your application using cloud-based server hardware that incorporates multiple Xilinx Virtex UltraScale+ VU9P FPGAs. Each Amazon EC2 F1 compute instance can include as many as eight FPGAs, so you can develop extremely large and capable, custom acceleration engines with this technology. According to Amazon, use of the FPGA-accelerated F1 instance can accelerate applications in diverse fields such as genomics research, financial analysis, video processing (in addition to security/cryptography and machine learning) by as much as 30x over general-purpose CPUs.


For more information about Amazon’s AWS EC2 F1 instance in Xcell Daily, see:








Xilinx demos Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit at SC17 in Denver

by Xilinx Employee ‎11-15-2017 02:34 PM - edited ‎11-15-2017 02:45 PM (11,990 Views)


This week, if you were in the Xilinx booth at SC17, you would have seen demos of the new Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit (available in actively and passively cooled versions). Both versions are based on Xilinx Virtex UltraScale+ VU9P FPGAs with 64Gbytes of on-board DDR4 SDRAM.  




Xilinx VCU1525 Active.jpg 


Xilinx Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit, actively cooled version




Xilinx VCU1525_Passive_Photshopped.jpg 



Xilinx Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit, passively cooled version




Xilinx had several VCU1525 Acceleration Development Kits running various applications at SC17. Here’s a short 90-second video from the show floor with two of those applications, edge-to-cloud video analytics and machine learning, narrated by Xilinx Senior Engineering Manager Khang Dao:





Note: For more information about the Xilinx Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit, contact your friendly neighborhood Xilinx or Avnet sales representative.



The new Mellanox Innova-2 Adapter Card teams the company’s ConnectX-5 Ethernet controller with a Xilinx Kintex UltraScale+ KU15P FPGA to accelerate computing, storage, and networking in data centers. According to the announcement, “Innova-2 is based on an efficient combination of the state-of-the-art ConnectX-5 25/40/50/100Gb/s Ethernet and InfiniBand network adapter with Xilinx UltraScale FPGA accelerator.” The adapter card has a PCIe Gen4 host interface.




Mellanox Innova-2 Adapter Card.jpg 


Mellanox’s Innova-2 PCIe Adapter Card




Key features of the card include:


  • Dual-port 25Gbps Ethernet via SFP cages
  • TLS/SSL, IPsec crypto offloads
  • Mellanox ConnectX-5 Ethernet controller and Xilinx Kintex UltraScale+ FPGA for either “bump-on-the-wire” or “look-aside” acceleration
  • Low-latency RDMA and RDMA over Converged Ethernet (RoCE)
  • OVS and Erasure Coding offloads
  • Mellanox PeerDirect communication acceleration
  • End-to-end QoS and congestion control
  • Hardware-based I/O virtualization



Innova-2 is available in multiple pre-programmed configurations for security applications with encryption acceleration such as IPsec or TLS/SSL. According to Mellanox, Innova-2 boosts performance by 6x for security applications while reducing total cost of ownership by 10x compared to alternatives.


Innova-2 enables SDN and virtualized acceleration and offloads for Cloud infrastructure. The on-board programmable resources allow deep-learning training and inferencing applications to achieve faster performance and better system utilization by offloading algorithms into the card’s Kintex UltraScale+ FPGA and the ConnectX acceleration engines.


The adapter card is also available as an unprogrammed card, open for customers’ specific applications. Mellanox provides configuration and management tools to support the Innova-2 Adapter Card across Windows, Linux, and VMware distributions.


Please contact Mellanox directly for more information about the Innova-2 Adapter Card.






Accolade’s new Flow-Shunting feature for its FPGA-based ANIC network adapters lets you more efficiently drive packet traffic through existing 10/40/100GbE data center networks by offloading host servers. It does this by eliminating the processing and/or storage of unwanted traffic flows, as identified by the properly configured Xilinx UltraScale FPGA on the ANIC adapter. By offloading servers and reducing storage requirements, flow shunting can deliver operational cost savings throughout the data center.


The new Flow Shunting feature is a subset of the existing Flow Classification capabilities built into the FPGA-based Advanced Packet Processor in the company’s ANIC network adapters. (The company has written a technology brief explaining the capability.) Here’s a simplified diagram of what’s happening inside of the ANIC adapter:



Accolade Flow Shunting.jpg 



The Advanced Packet Processor in each ANIC adapter performs a series of packet-processing functions including flow classification (outlined in red). The flow classifier inspects each packet, determines whether that packet is part of a new flow or an existing flow, and then updates the associated lookup table (LUT)—which resides in a DRAM bank—with the flow classification. The LUT has room to store as many as 32 million unique IP flow entries. Each flow entry includes standard packet-header information (source/destination IP, protocol, etc.) along with flow metadata including total packet count, byte count, and the last time a packet was seen. The same flow entry tracks information about both flow directions to maintain a bi-directional context. With this information, the ANIC adapter can take specific actions on an individual flow. Actions might include forwarding, dropping, or re-directing the packets in each flow.


These operations form the basis for flow shunting, which permits each application to decide from which flow(s) it does and does not want to receive data traffic. Intelligent, classification-based flow shunting allows an application to greatly reduce the amount of data it must analyze or handle, which frees up server CPU resources for more pressing tasks.
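The flow-classification bookkeeping described above runs in hardware on the ANIC adapter's FPGA, but its logic can be sketched in software. Here is a minimal Python model (hypothetical names, not Accolade's API): a flow table keyed by a normalized 5-tuple so both directions map to one entry, with per-flow metadata and a per-flow shunting action:

```python
from dataclasses import dataclass

@dataclass
class FlowEntry:
    packets: int = 0
    bytes: int = 0
    last_seen: float = 0.0
    action: str = "forward"   # shunting decision: forward, drop, or redirect

def flow_key(src_ip, dst_ip, proto, src_port, dst_port):
    # Sort the two endpoints so both directions of a conversation map to
    # the same key -- the bi-directional context the adapter maintains.
    a = (src_ip, src_port)
    b = (dst_ip, dst_port)
    return (proto,) + (a + b if a <= b else b + a)

class FlowTable:
    """Software stand-in for the DRAM-resident LUT (up to 32M entries)."""
    def __init__(self):
        self.table = {}

    def classify(self, src_ip, dst_ip, proto, src_port, dst_port, size, now):
        key = flow_key(src_ip, dst_ip, proto, src_port, dst_port)
        entry = self.table.setdefault(key, FlowEntry())  # new vs. existing flow
        entry.packets += 1
        entry.bytes += size
        entry.last_seen = now
        return entry.action   # the action applied to this packet

# Example: both directions of one TCP connection hit the same entry.
ft = FlowTable()
ft.classify("10.0.0.1", "10.0.0.2", "tcp", 1234, 80, 1500, 0.0)
ft.classify("10.0.0.2", "10.0.0.1", "tcp", 80, 1234, 40, 0.1)
```

Setting `entry.action = "drop"` for an unwanted flow is the software analog of shunting: every subsequent packet of that flow is discarded before it ever reaches the host.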



For more information about Accolade’s UltraScale-based ANIC network adapters, see “Accolade 3rd-gen, dual-port, 100G PCIe Packet Capture Adapter employs UltraScale FPGAs to classify 32M unique flows at once.”




Today, Microsoft, Mocana, Infineon, Avnet, and Xilinx jointly introduced a highly integrated, high-assurance IIoT (industrial IoT) system based on the Microsoft Azure Cloud and Microsoft’s Azure IoT Device SDK and Azure IoT Edge runtime package, Mocana’s IoT Security Platform, Infineon’s OPTIGA TPM (Trusted Platform Module) 2.0 security cryptocontroller chip, and the Avnet UltraZed-EG SOM based on the Xilinx Zynq UltraScale+ EG MPSoC.


The Mocana IoT Security Platform stack looks like this:



Mocana IoT Security Platform.jpg 


Mocana IoT Security Platform stack




Here’s a photo of the dev board that combines all of these elements:




Infineon, Avnet, Xilinx, Microsoft, Mocana IIoT Board.jpg




The Avnet UltraZed-EG SOM appears in the lower left and the Infineon OPTIGA TPM 2.0 security chip resides on a Pmod carrier plugged into the top of the board.


If you’re interested in learning more about this highly integrated IIoT hardware/software solution, click here.




Today, Xilinx announced plans to invest $40M to expand research and development engineering work in Ireland on artificial intelligence and machine learning for strategic markets including cloud computing, embedded vision, IIoT (industrial IoT), and 5G wireless communications. The company already has active development programs in these categories and today’s announcement signals an acceleration of development in these fields. The development was formally announced in Dublin today by The Tánaiste (Deputy Prime Minister of Ireland) and Minister for Business, Enterprise and Innovation, Frances Fitzgerald T.D., and by Kevin Cooney, Senior Vice President, Chief Information Officer and Managing Director EMEA, Xilinx Inc. The new investment is supported by the Irish government through IDA Ireland.


Xilinx first established operations in Dublin in 1995. Today, the company employs 350 people at its EMEA headquarters in Citywest, Dublin, where it operates a research, product development, engineering, and an IT center along with centralized supply, finance, legal, and HR functions. Xilinx also has R&D operations in Cork, which the company established in 2001.



Xilinx Ireland.jpg 


Xilinx’s Ireland Campus


I’ve written several times about Amazon’s AWS EC2 F1 instance, a cloud-based acceleration service based on multiple Xilinx Virtex UltraScale+ VU9P FPGAs. (See “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available.”) The VINEYARD Project, a pan-European effort to significantly increase the performance and energy efficiency of data centers by leveraging the advantages of hardware accelerators, is using Amazon’s EC2 F1 instance to develop Apache Spark accelerators. VINEYARD project coordinator Christoforos Kachris of ICCS/NTU gave a presentation on "Hardware Acceleration of Apache Spark on Energy Efficient FPGAs" at SPARK Summit 2017, and a video of his presentation appears below.


Kachris’ presentation details experiments on accelerating machine-learning (ML) applications running on the Apache Spark cluster-computing framework by developing hardware-accelerated IP. The central idea is to create ML libraries that can be seamlessly invoked by programs simply by calling the appropriate library. No other program changes are needed to get the benefit of hardware acceleration. Raw data passes from a Spark Worker through a pipe, a Python API, and a C API to the FPGA acceleration IP and returns to the Spark Worker over a similar, reverse path.


The VINEYARD development team first prototyped its idea by creating a small model of the AWS EC2 F1 cloud-based system using four Digilent PYNQ-Z1 dev boards networked together via Ethernet and the Python-based, open-source PYNQ software development environment. Digilent’s PYNQ-Z1 dev boards are based on Xilinx Zynq Z-7020 SoCs. Even this small prototype doubled performance relative to a Xeon server.


Having proved the concept, the VINEYARD development team scaled up to the AWS EC2 F1 and achieved a 3x to 10x performance improvement (cost normalized against an AWS instance with non-accelerated servers).


Here’s the 26-minute video presentation:





According to Yin Qi, Megvii’s chief exec, his company is developing a “brain” for visual computing. Beijing-based Megvii develops some of the most advanced image-recognition and AI technology in the world. The company’s Face++ facial-recognition algorithms run on the cloud and in edge devices such as the MegEye-C3S security camera, which runs Face++ algorithms locally and can capture more than 100 facial images in each 1080P video frame at 30fps.
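A quick back-of-the-envelope calculation makes those camera figures concrete — the per-second workload implied by 100 faces per 1080p frame at 30fps:

```python
# Throughput implied by the MegEye-C3S figures quoted above:
# up to 100 faces detected per 1080p frame, at 30 frames per second.
faces_per_frame = 100
frames_per_second = 30
pixels_per_frame = 1920 * 1080          # 1080p resolution

faces_per_second = faces_per_frame * frames_per_second
pixels_per_second = pixels_per_frame * frames_per_second

print(faces_per_second)    # 3000 face detections per second
print(pixels_per_second)   # 62208000 pixels per second to scan
```

Sustaining roughly 3,000 face detections per second at the edge is the kind of load that motivates running the convolution stages in programmable logic rather than on an embedded CPU alone.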



MegEye-C3S Security Camera.jpg 



MegEye-C3S Facial-Recognition Camera based on Megvii’s Face++ technology




In its early days, Megvii ran its algorithms on GPUs, but quickly discovered the high cost and power disadvantages of GPU acceleration. The company switched to the Xilinx Zynq SoC and is able to run deep convolution on the Zynq SoC’s programmable logic while quantitative analysis runs simultaneously on the Zynq SoC’s Arm Cortex-A9 processors. The heterogeneous processing resources of the Zynq SoC allow Megvii to optimize the performance of its recognition algorithms for lowest cost and minimum power consumption in edge equipment such as the MegEye-C3S camera.



MegEye-C3S Security Camera exploded diagram.jpg 


MegEye-C3S Facial-Recognition Camera exploded diagram showing Zynq SoC (on right)



Here’s a 5-minute video where Megvii’s Sam Xie, GM of Branding and Marketing, and Jesson Liu, Megvii’s hardware leader, explain how their company has been able to attract more than 300,000 developers to the Face++ platform and how the Xilinx Zynq SoC has aided the company in developing the most advanced recognition products in the cloud and on the edge:








Now that Amazon has made the FPGA-accelerated AWS EC2 F1 instance based on multiple Xilinx Virtex UltraScale+ VU9P FPGAs generally available, the unbound imaginations of some really creative people have been set free. Case in point: the cloud-based FireSim hardware/software co-design environment and simulation platform for designing and simulating systems based on the open-source RocketChip RISC-V processor. The Computer Architecture Research Group at UC Berkeley is developing FireSim. (See “Bringing Datacenter-Scale Hardware-Software Co-design to the Cloud with FireSim and Amazon EC2 F1 Instances.”)


Here’s what FireSim looks like:



FireSim diagram.jpg 



According to the AWS blog cited above, FireSim addresses several hardware/software development challenges. Here are some direct quotes from the AWS blog:


1: “FPGA-based simulations have traditionally been expensive, difficult to deploy, and difficult to reproduce. FireSim uses public-cloud infrastructure like F1, which means no upfront cost to purchase and deploy FPGAs. Developers and researchers can distribute pre-built AMIs and AFIs, as in this public demo (more details later in this post), to make experiments easy to reproduce. FireSim also automates most of the work involved in deploying an FPGA simulation, essentially enabling one-click conversion from new RTL to deploying on an FPGA cluster.”


2: “FPGA-based simulations have traditionally been difficult (and expensive) to scale. Because FireSim uses F1, users can scale out experiments by spinning up additional EC2 instances, rather than spending hundreds of thousands of dollars on large FPGA clusters.”


3: “Finding open hardware to simulate has traditionally been difficult. Finding open hardware that can run real software stacks is even harder. FireSim simulates RocketChip, an open, silicon-proven, RISC-V-based processor platform, and adds peripherals like a NIC and disk device to build up a realistic system. Processors that implement RISC-V automatically support real operating systems (such as Linux) and even support applications like Apache and Memcached. We provide a custom Buildroot-based FireSim Linux distribution that runs on our simulated nodes and includes many popular developer tools.”


4: “Writing hardware in traditional HDLs is time-consuming. Both FireSim and RocketChip use the Chisel HDL, which brings modern programming paradigms to hardware description languages. Chisel greatly simplifies the process of building large, highly parameterized hardware components.”



Using high-speed FPGA technology to simulate hardware isn’t a new idea. Using an inexpensive, cloud-based version of that same FPGA technology to develop hardware and software from your laptop while sitting in a coffee house in Coeur d’Alene, Timbuktu, or Ballarat—now that is something new.





Envious of all the cool FPGA-accelerated applications showing up on the Amazon AWS EC2 F1 instance, like the Edico Genome DRAGEN Genome Pipeline that set a Guinness World Record last week, the DeePhi ASR (Automatic Speech Recognition) Neural Network announced yesterday, Ryft’s cloud-based search and analysis tools, or NGCodec’s RealityCodec video encoder?






Well, you can shake off that green monster by signing up for the free, live, half-day Amazon AWS EC2 F1 instance and SDAccel dev lab being held at SC17 in Denver on the morning of November 15 at The Studio Loft in the Denver Performing Arts Complex (1400 Curtis Street), just across the street from the Denver Convention Center where SC17 is being held. Xilinx is hosting the lab and technology experts from Xilinx, Amazon Web Services, Ryft, and NGCodec will be available onsite.



Here’s the half-day agenda:


8:00 AM               Doors open, Registration, and Continental Breakfast

9:00 AM               Welcome, Technology Discussion, F1 Developer Use Cases and Demos

9:35 AM               Break

9:45 AM               Hands-on Training Begins

12:00 PM             Developer Lab Concludes



A special guest speaker from Amazon Web Services is also on the agenda.


Lab instruction time includes:


  • Step-by-step instructions to connect to an F1 instance
  • Interactive walkthrough of the SDAccel Development Environment
  • Highlights of SDAccel IDE features: compile, debug, profile
  • Instruction for how to develop a sample framework acceleration app



Seats are necessarily limited for a lab like this, so you might want to get your request in immediately. Where? Here.



About the Author

Steve Leibson is the Director of Strategic Marketing and Business Planning at Xilinx. He started as a system design engineer at HP in the early days of desktop computing, then switched to EDA at Cadnetix, and subsequently became a technical editor for EDN Magazine. He's served as Editor in Chief of EDN Magazine, Embedded Developers Journal, and Microprocessor Report. He has extensive experience in computing, microprocessors, microcontrollers, embedded systems design, design IP, EDA, and programmable logic.