

Today, NGCodec demonstrated its hardware-accelerated RealityCodec 4K video codec running on Amazon AWS’s FPGA-accelerated EC2 F1 instance, announced just yesterday at Amazon’s AWS re:Invent 2016 event in Las Vegas, Nevada. (See “Amazon picks Xilinx UltraScale+ FPGAs to accelerate AWS, launches F1 instance with 8x VU9P FPGAs per instance.”) Amazon AWS customers will be able to buy NGCodec’s RealityCodec when it becomes available in the AWS Marketplace and run it on the Amazon EC2 F1 instance, which is a hardware-accelerated offering based on multiple Xilinx UltraScale+ FPGAs packed into the AWS server chassis. NGCodec’s RealityCodec running on the Amazon EC2 F1 instance delivers ultra-high-performance video compression (with up to 4K resolution) and ultra-low, sub-frame latency for cloud-based VR and AR. (See NGCodec’s press release here.)


NGCodec’s RealityCodec is an excellent example of the type of cloud-based application that can benefit from FPGA-based hardware acceleration. Amazon’s announcement yesterday of a standardized hardware-accelerated offering for AWS follows an accelerating trend towards offering FPGA-based acceleration to cloud-services customers. Like Amazon, many of the major cloud service providers have announced deployment of FPGA technology in their Hyperscale data centers to drive their services business in an extremely competitive market. FPGAs are the perfect complement to highly agile cloud computing environments because they are programmable and can be hardware-optimized for any new application or algorithm. The inherent ability of an FPGA to reconfigure and be reprogrammed over time is perhaps its greatest advantage in a fast-moving field.


For more information about FPGA-based hardware acceleration in the data center, check out the new Xilinx Acceleration Zone and take a look at this White Paper from Moor Insights & Strategy, which describes the new Xilinx Reconfigurable Acceleration Stack. (See “Xilinx Reconfigurable Acceleration Stack speeds programming of machine learning, data analytics, video-streaming apps.”)







Jeff Barr, Chief Evangelist at Amazon Web Services (AWS), just unveiled the FPGA-accelerated F1 instance for Amazon EC2 in developer preview form. The rollout came in the form of a blog post titled “Developer Preview – EC2 Instances (F1) with Programmable Hardware.”


Barr writes:


“One of the more interesting routes to a custom, hardware-based solution is known as a Field Programmable Gate Array, or FPGA. In contrast to a purpose-built chip which is designed with a single function in mind and then hard-wired to implement it, an FPGA is more flexible. It can be programmed in the field, after it has been plugged in to a socket on a PC board. Each FPGA includes a fixed, finite number of simple logic gates. Programming an FPGA is “simply” a matter of connecting them up to create the desired logical functions (AND, OR, XOR, and so forth) or storage elements (flip-flops and shift registers). Unlike a CPU which is essentially serial (with a few parallel elements) and has fixed-size instructions and data paths (typically 32 or 64 bit), the FPGA can be programmed to perform many operations in parallel, and the operations themselves can be of almost any width, large or small.


“This highly parallelized model is ideal for building custom accelerators to process compute-intensive problems. Properly programmed, an FPGA has the potential to provide a 30x speedup to many types of genomics, seismic analysis, financial risk analysis, big data search, and encryption algorithms and applications.


“I hope that this sounds awesome and that you are chomping at the bit to use FPGAs to speed up your own applications!



“Today we are launching a developer preview of the new F1 instance. In addition to building applications and services for your own use, you will be able to package them up for sale and reuse in AWS Marketplace.  Putting it all together, you will be able to avoid all of the capital-intensive and time-consuming steps that were once a prerequisite to the use of FPGA-powered applications, using a business model that is more akin to that used for every other type of software. We are giving you the ability to design your own logic, simulate and verify it using cloud-based tools, and then get it to market in a matter of days.


“Equipped with Intel Broadwell E5 2686 v4 processors (2.3 GHz base speed, 2.7 GHz Turbo mode on all cores, and 3.0 GHz Turbo mode on one core), up to 976 GiB of memory, up to 4 TB of NVMe SSD storage, and one to eight FPGAs, the F1 instances provide you with plenty of resources to complement your core, FPGA-based logic. The FPGAs are dedicated to the instance and are isolated for use in multi-tenant environments.


“Here are the specs on the FPGA (remember that there are up to eight of these in a single F1 instance):


  • Xilinx UltraScale+ VU9P fabricated using a 16 nm process.
  • 64 GiB of ECC-protected memory on a 288-bit wide bus (four DDR4 channels).
  • Dedicated PCIe x16 interface to the CPU.
  • Approximately 2.5 million logic elements.
  • Approximately 6,800 Digital Signal Processing (DSP) engines.
  • Virtual JTAG interface for debugging.


“In instances with more than one FPGA, dedicated PCIe fabric allows the FPGAs to share the same memory address space and to communicate with each other across a PCIe Fabric at up to 12 Gbps in each direction.  The FPGAs within an instance share access to a 400 Gbps bidirectional ring for low-latency, high bandwidth communication (you’ll need to define your own protocol in order to make use of this advanced feature).”


Amazon is also releasing an FPGA Developer AMI (Amazon Machine Image) containing “a set of developer tools that you can use in the AWS Cloud at no charge,” for AWS F1 application development.


You can sign up for the Amazon EC2 F1 Instances Preview here.



Note: For additional information on the extensive support Xilinx provides for hardware acceleration in cloud environments, click over to the Xilinx Acceleration Zone, where you’ll find helpful information about the newly announced Reconfigurable Acceleration Stack. (Also, see “Xilinx Reconfigurable Acceleration Stack speeds programming of machine learning, data analytics, video-streaming apps.”)










Ask any expert in IIoT (Industrial Internet of Things) circles what the most pressing IIoT problem might be and you will undoubtedly hear “security.” Internet hacking stories are rampant; you generally hear about one a day. With the IoT and IIoT ramping up, they’re going to become more frequent. Over the recent Thanksgiving weekend, the ticket machines and fare-collection system of San Francisco’s Muni light-rail mass-transit system were hacked by ransomware. Agents' computer screens displayed the message "You Hacked, ALL Data Encrypted" beginning Friday night. The attackers demanded 100 Bitcoins, worth about $73,000, to undo the damage. The system was restored by Sunday without paying the ransom, and Muni provided free rides until it recovered.


You do not want to let this happen to your IIoT system design.


How to prevent it? Today (by sheer coincidence, honest), Avnet announced a new security module for its MicroZed IIoT (Industrial Internet of Things) Starter Kit, which is based on a Xilinx Zynq-7000 SoC. The new Avnet Trusted Platform Module Security PMOD places an Infineon OPTIGA TPM (Trusted Platform Module) SLB9670 on a very small plug-in board conforming to the Digilent PMOD peripheral module format. The Infineon TPM SLB9670 is a secure microprocessor that adds hardware security to any system by conforming to the TPM security standard developed by the Trusted Computing Group, an international industry standardization group.


The $29.95 Avnet Trusted Platform Module Security PMOD is essentially a SPI security peripheral that provides many security services to your design based on Trusted Computing Group standards. Provided services include:



  • Strong authentication of platform and users using a unique embedded endorsement certificate
  • Secure storage and management of keys and data
  • Measured and trusted booting for embedded systems
  • Random-number generation, tick counting to trigger the generation of new random numbers, and a dictionary-attack lockout
  • RSA and ECC cryptography and SHA-256 hashing



That’s a lot of security in a 32-pin package and, for development purposes, you can get it on a $30 plug-in PMOD along with a reference design for using the module with the Zynq-based Avnet MicroZed IIoT Starter Kit.


So if you don’t want to see this in your IIoT system:



You Hacked.jpg 


Then think about buying this:



Avnet MicroZed IIoT TPM Module.jpg


Avnet MicroZed IIoT TPM PMOD








Today marks the grand opening of the Xilinx Embedded Vision Developer Zone, designed to jump-start the development of your next-generation vision systems whether you’re a software developer, a hardware developer, or a system architect. It’s a 1-stop shop—a dedicated place—for developers who need to design in multidisciplinary capabilities like sensor fusion, advanced computer-vision algorithms, real-time object detection, and video analytics based on machine learning.


The Embedded Vision Developer Zone contains a vast and deep pool of Xilinx-compatible, video-centric engineering resources including optimized software-development libraries, hardware vision IP, embedded-vision projects, and tutorials. These resources come from Xilinx, Xilinx Alliance Program Members, and community developers.


Here’s what you’ll find in the Embedded Vision Developer Zone:



  • Eval kits
  • Production-ready SOMs
  • Development tools
  • Reference designs and community projects
  • Training resources
  • Support



There’s also a Knowledge Center with articles, video tutorials, text-based tutorials (for traditionalists), and even a collection of vision-related Xcell Daily blog posts to keep you up to date on the latest announcements.


Here’s a very short video with some additional info about the new Xilinx Embedded Vision Developer Zone:






The Vivado Design Suite provides an IP-centric development environment for FPGA-based designs, and this November 30 webinar taught by Doulos, a Xilinx Authorized Training Provider, will teach you how to customize IP from the Vivado IP Catalog, generate output products, and instantiate that IP in your design using Verilog or VHDL.


Topics include:


  • Vivado: an IP-Centric development environment
  • The IP Catalog
  • Alternative IP Flows: Out-Of-Context and Global Synthesis
  • IP Output Products
  • Simulating IP
  • Managing IP both within and outside projects
  • Using IP with revision control systems


The Webinar will occur twice on November 30 to accommodate different time zones around the world. Register here.




Think you can design the lowest-latency network switch on the planet? That’s the challenge of the NetFPGA 2017 design contest. You have until April 13, 2017 to develop a working network switch using the NetFPGA SUME dev board, which is based on a Xilinx Virtex-7 690T FPGA. Contest details are here. (The contest started on November 16.)


Competing designs will be evaluated using OSNT, an Open Source Network Tester, and testbenches will be available online for users to experiment and independently evaluate their design. The competition is open to students of all levels (undergraduate and postgraduate) as well as to non-students. Winners will be announced at the NetFPGA Developers Summit, to be held on Thursday, April 20 through Friday, April 21, 2017 in Cambridge, UK.


Note: There is no need to own a NetFPGA SUME platform to take part in the competition because the competition offers online access to one. However, you may want one for debugging purposes because there’s no online debug access to the online NetFPGA SUME platform. (NetFPGA SUME dev boards are available from Digilent. Click here.)



NetFPGA SUME Board.jpg 


NetFPGA SUME Board (available from Digilent)




Xilinx is changing the way it marks the latest All Programmable devices including all 28nm 7 series devices (FPGAs and the Zynq-7000), Virtex and Kintex UltraScale FPGAs, and UltraScale+ devices (Virtex UltraScale+ and Kintex UltraScale+ FPGAs and Zynq UltraScale+ MPSoCs). The change adds a 2D barcode to the package. This change is not merely cosmetic. The 2D barcode brings you several advantages:


  • The barcode packs a lot more information about the individual device on the outside of the package. (See below)
  • The barcode also provides improved device-level tracking over the entire supply chain from Xilinx to you. That translates into product traceability and device authentication.
  • The 2D barcode also improves your ability to perform optical inspection of incoming goods and of soldered boards.


Device genealogy information encoded in the new Xilinx 2D barcode includes:


  • Device lot
  • Date code
  • Speed
  • Temperature grade
  • SCD information (the SCD is the specification control document for the device)



You can pull all of that information from the 2D barcode using the Xilinx Go app on your smartphone or by uploading an image of the barcode to Xilinx.com.


Here’s a video that details the evolution to 2D barcode marking on Xilinx devices including rollout information:





There’s also a note detailing the features of the new Xilinx 2D barcode. Click here.




By Adam Taylor



Having re-created the base hardware overlay for our PYNQ dev board, we’ll now modify the overlay to add our own memory-mapped peripheral. As we are modifying the base overlay, this will be a new overlay—one that we need to correctly integrate into the PYNQ environment.


While this will be a simple example, we can use the same techniques used here to create as complicated or as simple an overlay as we desire.


To demonstrate how we do this, I am going to introduce a new block memory within the PL that we can read from and write to using the Python environment.






The new blocks are highlighted




To do this we need to do the following in the Vivado design:



  1. Create a new AXI port (port 13) on the AXI Interconnect connected to General Purpose Master 0
  2. Import a new BRAM controller and configure it to have only one port
  3. Use the Block Memory Generator to create a BRAM. Set the mode to BRAM Controller, single port RAM
  4. Map the new BRAM controller to the Zynq SoC’s PS memory map



With these four things completed, we are ready to build the bit file. Once the file has been generated, we are halfway towards building an overlay we can use in our design. The other half requires generating a Tcl script that defines the address map of the bit file. To do this, we use the Vivado Tcl command:


write_bd_tcl <name.tcl>


Once we have the TCL and bit files, we can move on to the next stage, which is to import the files and create the drivers and application.


This is where we need to power on the PYNQ dev board and connect it to the same network as our development PC. Once the PYNQ board is up and running, we can connect to it using a program like WinSCP to upload the bit file and the Tcl file.


Within the current directory structure on the PYNQ board, there is a bit stream directory we can use at:






You will find the files needed to support the base overlay under this directory.






Base overlay and modified overlay following upload



Once this has been uploaded, we need to create a notebook to use it. We need to make use of the existing overlay module provided with the PYNQ package to do this. This module will allow us to download the overlay into the PL of the PYNQ. Once it is downloaded, we need to check that it downloaded correctly, which we can do using the ol.is_loaded() function.
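As a minimal sketch of the download-and-check step described above, assuming the `Overlay` class exposes `download()` and `is_loaded()` as the text says: the helper name `load_overlay` and the bit-file name `part158.bit` are hypothetical, and the `overlay_cls` parameter exists only so that the logic can be exercised off the board with a stand-in class.

```python
def load_overlay(bitfile, overlay_cls=None):
    """Download an overlay into the PL and confirm that it is loaded."""
    if overlay_cls is None:
        # Imported lazily because the pynq package only exists on the board.
        from pynq import Overlay as overlay_cls
    ol = overlay_cls(bitfile)
    ol.download()
    if not ol.is_loaded():
        raise RuntimeError("overlay %s did not load" % bitfile)
    return ol


# On the PYNQ board (the file name is a placeholder for the uploaded bit file):
# ol = load_overlay("part158.bit")
```

Raising an error when the check fails keeps a notebook from carrying on and talking to hardware that is not actually in the PL.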






Downloading the new overlay



The simplest way to interface with the new overlay is to use the MMIO module within the PYNQ Package. This module allows us to interface directly to memory-mapped peripherals. First however, we need to define a new class within which we can declare the functions to interact with the overlay. For this example, I have called my class part158 to follow the blog numbering.







Looking within the class, we have defined the base address and address range using the line:



mmio = MMIO(0x46000000,0x00002000)



Three function definitions in the above figure define:


  • The initialization function (in this case, this function merely writes a 0 to address 0)
  • A function that writes data into the BRAM
  • Another function that reads data from the BRAM.


(Remember that the address increments by 4 for each 32-bit word because the memory map is byte-addressed.)
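For readers who cannot see the figure, here is a rough sketch of what such a class could look like. It is not the original notebook code: the class name `Part158`, the method names, and the `FakeMMIO` stand-in are all assumptions made for illustration. `FakeMMIO` mimics only the `read(offset)` and `write(offset, data)` calls of the PYNQ MMIO module so the sketch runs without a board.

```python
class FakeMMIO:
    """Stand-in for pynq's MMIO so the sketch runs without a board."""

    def __init__(self, base_addr, length):
        self.base_addr = base_addr
        self.length = length
        self._words = {}

    def _check(self, offset):
        # Only word-aligned offsets inside the mapped window are accepted.
        if offset % 4 or not 0 <= offset < self.length:
            raise ValueError("bad offset 0x%X" % offset)

    def write(self, offset, data):
        self._check(offset)
        self._words[offset] = data & 0xFFFFFFFF

    def read(self, offset):
        self._check(offset)
        return self._words.get(offset, 0)


class Part158:
    """Word-indexed access to the BRAM behind the memory-mapped window."""

    BASE_ADDR = 0x46000000    # base address from the MMIO line above
    ADDR_RANGE = 0x00002000   # 8 KiB window, which is 2048 32-bit words

    def __init__(self, mmio_cls=FakeMMIO):
        self.mmio = mmio_cls(self.BASE_ADDR, self.ADDR_RANGE)
        self.mmio.write(0, 0)  # initialization: write a 0 to address 0

    def write_word(self, index, value):
        self.mmio.write(index * 4, value)  # byte offset = word index * 4

    def read_word(self, index):
        return self.mmio.read(index * 4)


# On the board: `from pynq import MMIO`, then Part158(mmio_cls=MMIO).
bram = Part158()
bram.write_word(3, 0xDEADBEEF)
print(hex(bram.read_word(3)))  # prints 0xdeadbeef
```

Indexing by word and multiplying by 4 inside the class keeps the byte-addressing detail in one place.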


With the class defined, we can then write a simple script that writes data to and reads data from the BRAM, as we would for any other function. Initially we will write a simple counting sequence followed by writing in random numbers.
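The test described above can be sketched as follows. The function name `exercise_bram` is hypothetical, and a plain dictionary stands in for the BRAM so the script runs anywhere; on the board, you would pass in the write and read functions of your own class instead.

```python
import random


def exercise_bram(write_word, read_word, n_words=16, seed=0):
    """Write a counting sequence, then random 32-bit values, checking each pass."""
    counting = list(range(n_words))
    for i, value in enumerate(counting):
        write_word(i, value)
    counting_ok = [read_word(i) for i in range(n_words)] == counting

    rng = random.Random(seed)  # seeded so that the run is repeatable
    randoms = [rng.getrandbits(32) for _ in range(n_words)]
    for i, value in enumerate(randoms):
        write_word(i, value)
    random_ok = [read_word(i) for i in range(n_words)] == randoms
    return counting_ok, random_ok


# Dictionary-backed stand-in for the BRAM so the script runs off the board.
memory = {}
result = exercise_bram(memory.__setitem__, memory.get)
print(result)  # prints (True, True)
```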






When I executed the notebook, I received the results below:





Once we have this new hardware overlay up and running, we can create a more complex overlay and interact with it using the MMIO module.



Code is available on GitHub as always.


If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.




  • First Year E Book here
  • First Year Hardback here




 MicroZed Chronicles hardcopy.jpg



  • Second Year E Book here
  • Second Year Hardback here



 MicroZed Chronicles Second Year.jpg





All of Adam Taylor’s MicroZed Chronicles are cataloged here.





Want to see how fast machine inference can go and how efficient it can be? The video below shows how fast the AlexNet image-classification algorithm runs (better than 1800 image classifications/sec) and how efficiently it runs (<50W) using an INT8 (8-bit integer) implementation. The demo in the video shows AlexNet running in the open-source Caffe deep-learning framework, implemented with the xDNN deep neural network library running on a Xilinx UltraScale FPGA in the Xilinx Kintex UltraScale FPGA Acceleration Development Kit.


All of the above components are part of the newly announced Xilinx Reconfigurable Acceleration Stack.


Note: If you implemented this classification application using INT16 instead, you’d get about half the performance, as mentioned in the video and discussed in detail in the previous Xcell Daily blog post, “Counter-Intuitive: Fixed-Point Deep-Learning Inference Delivers 2x to 6x Better CNN Performance with Great Accuracy.”


Here’s the video showing FPGA-based image classification in action:








Intuitively, you might think that the more resolution you throw at deep-learning inference, the more accurate the result.


Nope. Not true.


Human intuition does not always guide you to a superior solution in the fields of AI, machine learning, and inference. In this case, the counter-intuitive result gets you near-maximum inference accuracy with greatly improved performance and reduced power consumption resulting in significantly better performance per watt. The technical detail is all there in a new Xilinx White Paper titled “Deep Learning with INT8 Optimization on Xilinx Devices.”


Research has shown that 32-bit floating-point computations are not required in deep-learning inference to obtain the best accuracy. For many applications such as image classification, INT8 (or even lower-precision) fixed-point computations deliver nearly identical inference accuracy compared with floating-point results. Here’s a table from the Xilinx White Paper showing accuracy results for fine-tuned CNNs (convolutional neural networks) based on fixed-point computations, which validate this claim. (The numbers in parentheses indicate accuracy without fine-tuning.)



INT8 vs Floating-Point Efficiency Graph.jpg




Note that the reduced-precision fixed-point computations and 32-bit floating-point computations deliver essentially the same inference accuracy for all six of these CNN benchmark applications.
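To get a feel for why 8 bits can be enough, here is a small sketch of generic symmetric linear quantization. It is not the fine-tuning flow from the White Paper, and the function names and sample weights are made up for illustration. It maps floating-point values onto signed 8-bit integers and then measures how far the restored values drift from the originals.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the 8-bit integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.82, -0.41, 0.05, -1.27, 0.33]  # made-up sample values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, worst_error)
```

The worst-case error per value is bounded by half of one quantization step, which is small relative to the largest weight in the set.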


So why should you bother with floating-point computations at all? That’s an excellent question for you to ponder once you let the data override your intuition. You know that floating-point calculations consume more power; you know they consume more resources to implement; and that fact becomes increasingly important when creating massively parallel CNNs.


There are, in fact, several reasons to employ fixed-point computations instead of floating-point computations based on these results. Your design delivers:



  1. More performance
  2. With lower power consumption
  3. And with better resource utilization



Although the first two advantages may seem obvious, the third—better resource utilization—takes some explaining, which the Xilinx White Paper describes in detail. Based on this and other research, FPGA-based floating-point DSPs turn out to be a poor match for many hyperscale applications including machine learning inference. (At the same time, other research suggests that floating-point DSPs as implemented in FPGAs fall well short of the compute efficiency attained by GPUs optimized for CNN training.)


Xilinx’s fixed-point DSP48E2 architecture used in its UltraScale and UltraScale+ FPGAs is optimized for reduced-precision integer computations because you can pack two INT8 operations into every clock tick in each Xilinx DSP48E2 slice thanks to its wide, 27x18-bit multiplier, 48-bit accumulator, and other architectural enhancements. The DSP slices in competitive FPGAs cannot accomplish this feat. (Again, see the Xilinx White Paper for the gory technical details of how this integer operand packing works and how it essentially doubles CNN performance.)
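The arithmetic behind this kind of operand packing can be sketched in a few lines. This is only an illustration of the general idea, assuming two INT8 values that share a common multiplicand and an 18-bit shift between them; the function name is hypothetical, and the White Paper remains the reference for how the DSP48E2 slice actually handles it.

```python
def packed_dual_multiply(a, b, c):
    """Compute a*c and b*c (signed 8-bit operands) with one multiplication."""
    for v in (a, b, c):
        if not -128 <= v <= 127:
            raise ValueError("operands must fit in signed 8 bits")
    packed = (a << 18) + b            # fits in a 27-bit signed multiplier input
    product = packed * c              # equals ((a * c) << 18) + (b * c)
    low = product & ((1 << 18) - 1)   # low 18 bits hold b*c in two's complement
    if low >= 1 << 17:
        low -= 1 << 18                # sign-extend the low field
    high = (product - low) >> 18      # remove b*c, then shift down to get a*c
    return high, low


print(packed_dual_multiply(-3, 5, -7))  # prints (21, -35)
```

The subtraction before the final shift is the correction step: when b*c is negative, it borrows from the upper field, so the upper product cannot simply be read off the top bits.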


The proof’s in the performance data, so here’s a figure taken from the same Xilinx White Paper that graphically illustrates the superior efficiency of fixed-point CNN implementations using Xilinx UltraScale and UltraScale+ FPGAs:



INT8 Efficiency Graph.jpg 



You can see from this figure that you get significantly more deep-learning GOPS/watt from fixed-point CNN implementations using Xilinx UltraScale and UltraScale+ FPGAs, when compared to competitive devices. Compared to Intel's Arria 10 and Stratix 10 devices as shown in the above figure, Xilinx devices deliver 2X to 6X better GOPS/watt efficiency for deep-learning inference operations—at essentially the same accuracy level attained by 32-bit floating-point implementations.


For more information, you might want to spend some time investigating the resources on the new Xilinx Acceleration Zone Web page, which discusses myriad facets of hyperscale cloud acceleration using Xilinx FPGA technology including the new Xilinx Acceleration Development Kit based on the Xilinx Kintex UltraScale KU115 FPGA.






BittWare’s UltraScale+ XUPP3R board and Atomic Rules IP run Intel’s DPDK over PCIe Gen3 x16 @ 150Gbps

by Xilinx Employee 11-21-2016 02:02 PM - edited 11-22-2016 06:08 PM


Intel’s DPDK (Data Plane Development Kit) is a set of software libraries that improves packet processing performance on x86 CPU hosts by as much as 10x. According to Intel, its DPDK plays a critical role in SDN and NFV applications. Last week at SC16 in Salt Lake City, BittWare demonstrated Intel’s DPDK running on a Xeon CPU and streaming packets over a PCIe Gen3 x16 interface at an aggregate rate of 150Gbps (transmit + receive) to and from BittWare’s new XUPP3R PCIe board using Atomic Rules’ Arkville DPDK-aware data mover IP instantiated in the 16nm Xilinx Virtex UltraScale+ VU9P FPGA on BittWare’s board. The Arkville DPDK-aware data mover marshals packets between the IP block implemented in the FPGA’s programmable logic and the CPU host's memory using the Intel DPDK API/ABI. Atomic Rules’ Arkville IP plus a high-speed MAC looks like a line-rate-agnostic, bare-bones L2 NIC.
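To put the 150Gbps figure in context, here is a quick back-of-the-envelope calculation of what a PCIe Gen3 x16 link can carry. It uses only the Gen3 signaling rate (8 GT/s per lane) and the 128b/130b line encoding and ignores packet and protocol overhead, so real payload throughput is lower. The function name is hypothetical.

```python
def pcie_gen3_gbps(lanes):
    """Line rate per direction for a PCIe Gen3 link after encoding, in Gbps."""
    gigatransfers_per_lane = 8.0  # Gen3 signals at 8 GT/s per lane
    encoding = 128.0 / 130.0      # 128b/130b line encoding
    return lanes * gigatransfers_per_lane * encoding


per_direction = pcie_gen3_gbps(16)
both_directions = 2 * per_direction
demo_aggregate = 150.0            # Gbps, transmit + receive, from the demo
utilization = 100 * demo_aggregate / both_directions
print(round(per_direction, 1), round(both_directions, 1), round(utilization, 1))
# prints 126.0 252.1 59.5
```

By this rough measure, the demo's aggregate rate is about 60% of the link's encoded line rate in both directions combined.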



Bittware DPDK Demo with XUPP3R Board v2.jpg 


BittWare’s XUPP3R PCIe board with an on-board Xilinx Virtex UltraScale+ VU9P FPGA






Here’s a very short video of BittWare’s VP of Systems & Solutions Ron Huizen explaining his company’s SC16 demo:





Here’s an equally short video made by Atomic Rules with a bit more info:





If this all looks vaguely familiar, perhaps you’re remembering an Xcell Daily post that appeared just last May where BittWare demonstrated an Atomic Rules UDP Offload Engine running on its XUSP3S PCIe board, which is based on a Xilinx Virtex UltraScale VU095 FPGA. (See “BittWare and Atomic Rules demo UDP Offload Engine @ 25 GbE rates; BittWare intros PCIe Networking card for 4x 100 GbE.”) For the new XUPP3R PCIe board, BittWare has now jumped from the 20nm Virtex UltraScale FPGAs to the latest 16nm Virtex UltraScale+ FPGAs.



By Adam Taylor


In last week’s blog, we uploaded a new overlay and used it to run a Sobel filter on the PYNQ-Z1’s HDMI I/O stream. What I want to do, however, is to create my own overlay. We will do this over the next few blogs. (If you’re just joining this discussion, the PYNQ project combines the capability of the Xilinx Zynq SoC and the productivity of the Python programming language. You can read an overview of it in this earlier post: “Adam Taylor's MicroZed Chronicles Part 155: Introducing the PYNQ (Python + Zynq) Dev Board.”)


The recommended way to create a PYNQ hardware overlay is to use the existing base overlay Vivado design as a starting point, so we need to recreate that first. The recommendation exists because the supplied base overlay predefines all the configurations needed for the Zynq SoC’s PS and defines the PS-PL interface.


The first thing we need to do is download the design files from the PYNQ GitHub page. We can easily download or clone the GitHub repository from the repository’s top level.





Once downloaded, we need to extract the files into a working directory. To recreate the project in Vivado, the first thing we need is the correct version of Vivado. The current version of the PYNQ base overlay was created using the 2016.1 release of the Vivado Design Suite, so we need to ensure that we have this version installed. If we do not have it, we need to download it from the Xilinx webpage.


With the correct Vivado version installed, navigate to the following directory:



<working directory> \PYNQ-master\PYNQ-master\Pynq-Z1\vivado\base



Within this directory, you will see a make file that you can run from a command shell. This make file builds the Vivado project for us and allows us to explore the base overlay design for the PYNQ board.


Before we can do this, however, we need to ensure that we have the correct settings in the make file. The command we need depends on whether we are using a PC running Microsoft Windows or Linux.


Open the make file with a text editing program. If you are using Linux then the command should be:




vivado -mode batch -source ${PYNQ_ROOT}/Pynq-Z1/bitstream/base.tcl




However, if you are using Microsoft Windows, then we’ll need the following command:




cmd /c "vivado -mode batch -source ${PYNQ_ROOT}/Pynq-Z1/bitstream/base.tcl"
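If you would rather drive the build from a script than edit the make file by hand, the platform-dependent choice above could be captured in a small Python helper. This is only a sketch: the function name is hypothetical, and it simply reproduces the two command strings shown above without running Vivado.

```python
import platform


def vivado_batch_command(tcl_script, system=None):
    """Return the command line that runs a Tcl script in Vivado batch mode."""
    system = system or platform.system()
    command = "vivado -mode batch -source %s" % tcl_script
    if system == "Windows":
        # On Windows, the command is wrapped in a cmd shell, as shown above.
        return 'cmd /c "%s"' % command
    return command


script = "${PYNQ_ROOT}/Pynq-Z1/bitstream/base.tcl"
print(vivado_batch_command(script, "Linux"))
print(vivado_batch_command(script, "Windows"))
```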





With the correct command in the make file, open a command window in the working directory and type:



make -f makefile



This will rebuild the Vivado project so that we can begin to explore the base overlay design and then start adding in our own functionality. If you have trouble running the make file, you may want to check out Xilinx user guide UG1198, which explains how to configure Make on your system.


Once the script has completed, you will see the base overlay as shown below. It may initially look complex but it’s not so complicated once you analyze it.






The first thing to explore is the interconnectivity implemented using the AXI interfaces. At the simplest level, the Zynq SoC’s High-Performance AXI interconnects are used for interfaces that require DMA to transfer data to or from DDR memory. For the PYNQ overlay, these interconnects are used for the trace buffers for the Arduino and PMOD ports and the video stream.


The Master General-Purpose AXI interfaces are used to control and configure the remaining blocks within the design. Most interestingly, these ports are used to configure the MicroBlaze BRAMs on the fly. The MicroBlaze processor communicates with the PMODs and the Arduino shield port, providing a seamless interface.


Some of the in-depth detail of the base overlay is shown in the tables below:






Now that we have rebuilt the base platform in Vivado and we understand how it is interconnected, we can start building our own PYNQ overlay. The first one will be simple, just to demonstrate the process. After that, we can move on to creating more complex examples.




Code is available on GitHub as always.




Everspin’s NVMe Storage Accelerator mixes MRAM, UltraScale FPGA, delivers 1.5M IOPS

by Xilinx Employee 11-18-2016 02:50 PM - edited 11-18-2016 09:52 PM


Everspin, “The MRAM Company,” took an off-the-shelf Alpha Data ADM-PCIE-KU3 PCIe accelerator card, loaded 1Gbyte of MRAM DIMMs on the card, reprogrammed the on-board Kintex UltraScale KU060 FPGA to create an MRAM-based NVMe controller, and got…





From non-volatile, no-wearout-failure MRAM.


The folks at Alpha Data handed me a data sheet for the resulting Everspin NVMe card, the ES1GB-N02 Storage Accelerator, at this week’s SC16 conference in Salt Lake City. Here’s a scan of that data sheet:



Everspin MRAM Accelerator Data Sheet.jpg 



Everspin makes MRAMs with DDR3 pin-level interfaces, so it’s relatively easy to create an MRAM-based DDR3 SODIMM that snaps right into the existing SDRAM socket on the Alpha Data ADM-PCIE-KU3 card. However, these non-volatile memory devices have unique timing requirements that differ from the DDR3 SDRAM standard. Modify the SDRAM controller in the Kintex UltraScale FPGA to accommodate the MRAM’s timing requirements and, voila, you’ve created an MRAM storage accelerator card.


There’s a key point to be made about a product like this. The folks at Alpha Data likely never envisioned an MRAM-based storage accelerator when they designed the ADM-PCIE-KU3 PCIe accelerator card but they implemented their design using an advanced Xilinx UltraScale FPGA knowing that they were infusing flexibility into the design. Everspin simply took advantage of this built-in flexibility in a way that produced a really interesting NVMe storage product.


Isn’t that the sort of flexibility you’d like to have in your products?



(Note: MRAM is magnetic RAM.)





Alpha Data’s booth at this week’s SC16 conference in Salt Lake City held the company’s latest top-of-the-line FPGA accelerator card, the ADM-PCIE-9V3, based on the 16nm Xilinx Virtex UltraScale+ VU3P-2 FPGA. Announced just this week, the card also features two QSFP28 sockets that each accommodate one 100GbE connection or four 25GbE connections. If you have a full-height slot available, you can add two more 100GbE interfaces using Samtec FireFly Micro Flyover Optical modules and run four 100GbE interfaces simultaneously. All of this high-speed I/O capability comes courtesy of the 40 SerDes ports on the Virtex UltraScale+ VU3P FPGA, each running at up to 32.75Gbps.



Alpha Data ADM-PCIE-9V3.jpg 


Alpha Data ADM-PCIE-9V3 Accelerator Card based on a Xilinx Virtex UltraScale+ VU3P-2 FPGA



To back up the board’s extreme Ethernet bandwidth, the ADM-PCIE-9V3 board incorporates two banks of 72-bit DDR4-2400 SDRAM with ECC and a per-bank capacity of 8Gbytes, for a total of 16Gbytes of on-board SDRAM. All of this fits on a half-length, low-profile PCIe card that features a PCIe Gen4 x8 or a PCIe Gen3 x16 host connection. The board also supports the OpenPOWER CAPI coherent interface. (The PCIe configuration is programmable, thanks to the on-board Virtex UltraScale+ FPGA.)



Taken as a whole, this new accelerator card delivers serious processing and I/O firepower along every dimension you might care to measure, whether it’s Ethernet bandwidth, memory capacity, or processing power.


The Alpha Data ADM-PCIE-9V3 board is based on a Xilinx Virtex UltraScale+ FPGA so it can serve as a target for the Xilinx SDAccel development environment, which delivers a CPU- and GPU-like development environment for application developers who wish to develop high-performance code using OpenCL, C, or C++ while targeting ready-to-go, plug-in FPGA hardware. In addition, Alpha Data offers an optional Board Support Package for the ADM-PCIE-9V3 accelerator board with example FPGA designs, application software, a mature API, and driver support for Microsoft Windows and Linux to further ease cloud-scale application development and deployment in hyperscale data centers.




This week at SC16 in Salt Lake City, Smart IOPS demonstrated its FPGA-powered Data Engine NVMe SSD, which delivers 1.7M IOPS—which the company claims is 4x that of competing NVMe SSDs. The secret, besides the embedded Xilinx Kintex UltraScale FPGA running the show in hardware, is Smart IOPS’ TruRandom technology, which uses pattern-recognition heuristics baked into the FPGA logic to speed read/write transactions between the host CPU and the Data Engine’s NAND Flash storage. This technology makes sustained random and sequential read/write transactions indistinguishable, meaning they run...









Smart IOPS Data Engine NVMe SSD




Smart IOPS is offering the Data Engine NVMe SSD in 2 to 10Tbyte capacities and three flavors: T2, T2D, and T4. The T2 Data Engines employ 16nm MLC NAND Flash memory; the T2D Data Engines employ 3D MLC NAND Flash memory; and the T4 Data Engines employ 15nm MLC NAND Flash memory. The different types of flash affect the drives’ speeds as shown in these specs:







Smart IOPS Data Engine NVMe SSD specifications




Smart IOPS also packages one or more of its Data Engine SSDs in a rack-mounted Flash Appliance.


The on-board Xilinx Kintex UltraScale FPGA implements all of the functions in the Smart IOPS Data Engine including the PCIe Gen3 host interface; NAND Flash control; and of course the company’s proprietary, patent-pending, speed-multiplying TruRandom heuristics.



By Adam Taylor



Having done the easy part and got the Pynq all set up and running a simple “hello world” program, I wanted to look next at the overlays which sit within the PL, how they work, and how we can use the base overlay provided.


What is an overlay? The overlay is a design that’s loaded into the Zynq SoC’s programmable logic (PL). The overlay can be designed to accelerate a function in the programmable logic or provide an interfacing capability using the PL. In short, overlays give Pynq its unique capabilities.


What is important to understand about the overlay is that there is not a Python-to-PL high-level synthesis process involved. Instead, we develop the overlay using one of the standard Xilinx design methodologies (SDSoC, Vivado, or Vivado HLS). Once we’ve created the bit file for the overlay, we then integrate it within the Pynq architecture and establish the required parameters to communicate with it using Python.


Like all things with the Zynq SoC that we have looked at to date, this is very simple. We can easily integrate with the Python environment using the bit file and other files provided with the Vivado build. We do this with the Python MMIO class, which allows us to interact with designs in the PL through memory-mapped reads and writes.  The memory map of the current overlay in the PL is all we need. Of course, we can change the contents of the PL on the fly as our application requires to accelerate functions in the PL.
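
To make the memory-mapped idea concrete, here is a minimal sketch. It does not touch real hardware: `SimulatedMMIO` is a hypothetical stand-in written so that the example runs on any PC, and the base address and register offsets are invented. It mirrors the read-an-offset, write-an-offset style of the Pynq MMIO class. On the board you would create the object from the pynq package instead and take the addresses from your overlay's memory map, so check the Pynq documentation for the exact signatures.

```python
# Hypothetical stand-in for a memory-mapped register window in the PL.
# Addresses and register names below are invented for illustration.

class SimulatedMMIO:
    """Models 32-bit registers addressed by byte offset from a base address."""

    def __init__(self, base_addr, length):
        self.base_addr = base_addr
        self.length = length
        self._regs = {}  # offset -> 32-bit value; unwritten registers read 0

    def _check(self, offset):
        if offset % 4 != 0:
            raise ValueError("offset must be 32-bit aligned")
        if not 0 <= offset < self.length:
            raise ValueError("offset outside the mapped window")

    def write(self, offset, value):
        self._check(offset)
        self._regs[offset] = value & 0xFFFFFFFF

    def read(self, offset):
        self._check(offset)
        return self._regs.get(offset, 0)

BASE_ADDR = 0x40000000   # made-up AXI base address
CTRL_REG = 0x00          # made-up control register offset
DATA_REG = 0x04          # made-up data register offset

mmio = SimulatedMMIO(BASE_ADDR, 0x1000)
mmio.write(CTRL_REG, 0x1)          # e.g. set an enable bit
mmio.write(DATA_REG, 0x12345678)   # pass a 32-bit value to the PL
print(hex(mmio.read(DATA_REG)))
```

The point of the sketch is that, from Python's side, talking to the PL reduces to reading and writing words at known offsets.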


We will be looking more at how we can create our own overlay over the next few weeks. However, if you want to know more in the short term, I suggest you read the Pynq manual here. If you are thinking of developing your own overlay, be sure that you base it on the base overlay Vivado design to ensure that the configuration of the Zynq SoC’s Processor System (PS) and the PS/PL interfaces are correct.


The supplied base overlay provides support for several interfaces including the HDMI port and a wide range of PMODs.


The real power of the Pynq system comes from the open source community developing and sharing overlays. I want to look at a couple of these in the remainder of this blog. These overlays are available via GitHub and provide a Sobel Filter for the HDMI input and output and a FIR filter. You’ll find them here:





The first thing we need to do is to install the packages. For this example, I am going to install the Sobel filter. To do this we need to use a terminal program to download and install the overlay and its associated files.



We can do this using PuTTY and log in easily with the user name and password of Xilinx. The command to install the overlay is then:



sudo -H pip install --upgrade 'git+https://github.com/beja65536/pz1_sobelfilter'





Installing the Sobel Filter



Once this has been downloaded, the next step is to download the zip file containing the Jupyter notebook from GitHub and upload it under the examples directory. This is simple to do. Just select the upload and navigate to the location of the notebook you wish to upload.





This notebook also performs the installation of the overlay if you have not done this via the terminal. However, you only need to do this once.



Once this is uploaded, we can connect the Pynq to an HDMI source and an HDMI monitor and run the example. For this example, I am going to connect the Pynq between the Embedded Vision Kit and the display and then run the notebook.






When I did this, the notebook produced the image below showing the result of the Sobel Filter. Overall, this was very easy to get up and running using a different overlay that is not the base overlay.
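
For readers unfamiliar with the operation this overlay accelerates, here is a software-only sketch of a Sobel edge detector applied to a tiny grayscale frame. It illustrates the arithmetic only. It is not the overlay's implementation, and the function name, the |Gx| + |Gy| magnitude approximation, and the test frame are my own choices.

```python
# Software-only Sobel edge detector on a small grayscale image.
# This illustrates the operation only; it is not the overlay's implementation.

GX = [[-1, 0, 1],
      [-2, 0, 2],
      [-1, 0, 1]]
GY = [[-1, -2, -1],
      [0, 0, 0],
      [1, 2, 1]]

def sobel(image):
    """Return edge magnitudes (|Gx| + |Gy|, clamped to 255); border pixels stay 0."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0
            for dr in range(3):
                for dc in range(3):
                    pixel = image[r + dr - 1][c + dc - 1]
                    gx += GX[dr][dc] * pixel
                    gy += GY[dr][dc] * pixel
            out[r][c] = min(255, abs(gx) + abs(gy))
    return out

# A 5x6 test frame: dark on the left, bright on the right (one vertical edge).
frame = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = sobel(frame)
for row in edges:
    print(row)
```

Every output pixel needs nine multiply-accumulate operations on its neighbors, and each pixel can be computed independently of the others. That is the kind of regular, parallel work that maps well onto the PL.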






Code is available on GitHub as always.


If you want E book or hardback versions of previous MicroZed chronicle blogs, you can get them below.




  • First Year E Book here
  • First Year Hardback here.



MicroZed Chronicles hardcopy.jpg 




  • Second Year E Book here
  • Second Year Hardback here




MicroZed Chronicles Second Year.jpg 




All of Adam Taylor’s MicroZed Chronicles are cataloged here.




Today marks the launch of the Xilinx Reconfigurable Acceleration Stack for reducing the programming hurdles associated with accelerating workloads in hyperscale datacenters in three acceleration stack categories:


  • machine learning
  • data analytics
  • live-video streaming


Here’s a graphical overview of the material you’ll find in the Xilinx Reconfigurable Acceleration Stack:



Xilinx Acceleration Stack.jpg


The libraries already included in the Xilinx Reconfigurable Acceleration Stack are:


DNN – A Deep Neural Network (DNN) library from Xilinx, which is a highly optimized library for building deep learning inference applications.  This library is designed for maximum compute efficiency at 16-bit and 8-bit integer data types.


GEMM – A General Matrix Multiply (GEMM) library from Xilinx, which is based on the level-3 Basic Linear Algebra Subprograms (BLAS). This library delivers optimized performance at 16-bit and 8-bit integer data types and supports matrices of any size.


HEVC Decoder & Encoder – HEVC/H.265 is the latest video-compression standard from the MPEG and ITU standards bodies. HEVC/H.265 is the successor to the H.264 video-compression standard and it can reduce video bandwidth requirements by as much as 50% relative to H.264. Xilinx provides two HEVC/H.265 video encoders: a high-quality, flexible, real-time encoder to address the majority of video-centric data-center workloads and an alternate encoder for non-camera generated content.  One decoder supports all forms of encoded HEVC/H.265 video from either encoder.


Data Mover (SQL) – The SQL data-mover library makes it easy to accelerate data analytics workloads using a Xilinx FPGA. The data-mover library orchestrates standard connections to SQL databases by sending blocks of data from database tables to the FPGA accelerator card’s on-chip memory over a PCIe interface. The library automatically maximizes PCIe bandwidth between the host CPU and the FPGA-based hardware accelerator.


Compute Kernel (SQL) – A library that accelerates numerous core SQL functions on the FPGA hardware accelerator including decimal type, date type, scan, compare, and filter. The library’s compute functions optimally exploit the on-board FPGA’s massive hardware parallelism.
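
To illustrate what the GEMM entry in the list above computes at reduced precision, here is a minimal pure-Python sketch of the level-3 BLAS operation C <- alpha*(A x B) + beta*C with 8-bit integer inputs and a wide accumulator. The function name and the range checks are my own. The sketch says nothing about the Xilinx library's API or how it is implemented in hardware.

```python
# Reference-style integer GEMM: C <- alpha * (A x B) + beta * C.
# Inputs are checked to fit in signed 8 bits; Python ints act as the wide
# accumulator. Names here are illustrative, not the Xilinx library's API.

def gemm_int8(a, b, c, alpha=1, beta=0):
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions of A and B must match")
    n = len(b[0])
    for matrix in (a, b):
        for row in matrix:
            if any(not -128 <= v <= 127 for v in row):
                raise ValueError("inputs must fit in signed 8 bits")
    result = []
    for i in range(m):
        out_row = []
        for j in range(n):
            acc = 0  # wide accumulator: sums of 8-bit products overflow 8 bits
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out_row.append(alpha * acc + beta * c[i][j])
        result.append(out_row)
    return result

A = [[1, -2, 3],
     [4, 5, -6]]          # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]            # 3 x 2
C = [[1, 1],
     [1, 1]]              # 2 x 2
print(gemm_int8(A, B, C, alpha=1, beta=2))
```

The inner loop is a chain of multiply-accumulate operations, and every output element can be computed independently, which is why this operation is a natural fit for an FPGA's DSP resources.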


Three of the top seven hyperscale cloud companies including Baidu have already deployed Xilinx FPGAs for hardware acceleration. Last month, Baidu announced that it had designed a Xilinx Kintex UltraScale FPGA into an accelerator card and was using pools of these cards to accelerate machine learning inference. Qualcomm and IBM have announced strategic collaborations with Xilinx for data-center acceleration and the IBM engagement has already resulted in a storage and networking acceleration framework called CAPI SNAP that eases the creation of accelerated applications such as NoSQL using Xilinx FPGAs. (See last month’s Xcell Daily blog post “OpenPOWER’s CAPI SNAP Framework eases the task of developing high-performance, FPGA-based accelerators for data centers.”)


In addition, Xilinx has been leading an industry initiative toward the development of an intelligent, cache coherent interconnect called CCIX.  Xilinx along with AMD, ARM, Huawei, IBM, Mellanox, and Qualcomm formed the CCIX Consortium in May 2016. The initiative’s membership has since tripled in just five months and the CCIX Consortium announced the Release1 specification covering the physical, data-link, and protocol layers, which is now available to the consortium’s members. (See “CCIX Consortium develops Release1 of its fully cache-coherent interconnect specification, grows to 22 members.”)


There’s a new resource center on www.xilinx.com called the Xilinx Acceleration Zone that you can access for much more information about the new Xilinx Reconfigurable Acceleration Stack.



This week, Techfocus Media’s President Kevin Morris wrote the following in an article published on the EEJournal Web site:


“Designers of FPGA tools should take heed. There is a vast number of different types of users entering the FPGA domain, and the majority are not FPGA experts. If FPGAs are to expand into the numerous new and exciting markets for which they’re suitable, the primary battleground will be tools, not chips. New users should not have to learn FPGA-ese in order to get an FPGA to work in their system. At some point, people with little or no hardware expertise at all will need to be able to customize the function of FPGAs.”


In a nutshell, this paragraph describes the philosophy behind the Xilinx SDx Development Environments including SDAccel, SDSoC, and SDNet. These application-specific development environments are designed to allow people versed in software engineering and other disciplines to get a hardware performance boost from Xilinx All Programmable devices without the need to become FPGA experts (in Morris’ terminology).


Later in the article, Morris writes:


“Higher levels of abstraction in design creation need to replace HDL. System-level design tools need to take into account both the hardware and software components of an application. Tools - particularly lower-level implementation tools such as synthesis and place-and-route - need to move ever closer to full automation.”


He might as well be writing about the Xilinx SDSoC development environment. If this is the sort of development tool you seek, you might want to check it out.




Today, Xilinx announced four members of a new Virtex UltraScale+ HBM device family that combines high-performance 16nm Virtex UltraScale+ FPGAs with 32 or 64Gbits of HBM (high-bandwidth memory) DRAM in one device. The resulting devices deliver a 20x improvement in memory bandwidth relative to DDR SDRAM—more than enough to keep pace with the needs of 400G Ethernet, multiple 8K digital-video channels, or high-performance hardware acceleration for cloud servers.


These new Virtex UltraScale+ HBM devices are part of the 3rd generation of Xilinx 3D FPGAs, a line that began with the Virtex-7 2000T, which Xilinx started shipping way, way back in 2011. (See “Generation-jumping 2.5D Xilinx Virtex-7 2000T FPGA delivers 1,954,560 logic cells using 6.8 BILLION transistors (PREVIEW!)”) Xilinx co-developed this 3D IC technology with TSMC and the Virtex UltraScale+ HBM devices represent the current, production-proven state of the art.


Here’s a table listing salient features of these four new Virtex UltraScale+ HBM devices:



Xilinx UltraScale Plus HBM Device Table.jpg




Each of these devices incorporates 32 or 64Gbits of HBM DRAM with more than 1000 I/O lines connecting each HBM stack through the silicon interposer to the logic device, which contains a hardened HBM memory controller that manages one or two HBM devices. This memory controller has 32 high-performance AXI channels, allowing high-bandwidth interconnect to the Virtex UltraScale+ devices’ programmable logic and access to many routing channels in the FPGA fabric. Any AXI port can access any physical memory location in the HBM devices.


In addition, these Virtex UltraScale+ HBM FPGAs are the first Xilinx devices to offer the new, high-performance CCIX cache-coherent interface announced just last month. (See “CCIX Consortium develops Release1 of its fully cache-coherent interconnect specification, grows to 22 members.”) CCIX simplifies the design of offload accelerators for hyperscale data centers by providing low-latency, high-bandwidth, fully coherent access to server memory. The specification employs a subset of full coherency protocols and is ISA-agnostic, meaning that the specification’s protocols are independent of the attached processors’ architecture and instruction sets. CCIX pairs well with HBM and the new Xilinx UltraScale+ HBM FPGAs provide both in one package.


Here’s an 8-minute video with additional information about the new Virtex UltraScale+ HBM devices:






Time borrowing, a clock-frequency-boosting design technique long available to ASIC designers through automated clock-tree synthesis tools, is now available to designers using Xilinx Virtex UltraScale+ and Kintex UltraScale+ FPGAs and Zynq UltraScale+ MPSoCs with the latest versions of the Xilinx Vivado HLx Design Suite. That was one of Parivallal Kannan’s key messages in yesterday’s ICCAD 2016 presentation titled “Performance-Driven Routing for FPGAs.” The key issue here is balancing logic delays in successive pipeline stages to maximize clock frequency. Usually, that means trying to exactly balance logic delays between pipeline flip-flops. That’s quite a trick and it’s often just not possible to do this, especially in a short amount of time. Time borrowing turns this problem on its head by injecting clock delays in a stage to “borrow” time from the next succeeding stage. It looks like this:



Time Borrowing .jpg

The clock buffer driving FFj in the above figure permanently “borrows” half a nanosecond from the logic stage on the right (between FFj and FFk) using a programmable UltraScale+ clock-buffer delay element, effectively creating a 2.5nsec time period between FFi and FFj on the left side of the figure. This design technique maintains the pipeline’s overall 2nsec clock period and avoids the need to use a slower, 2.5nsec period to accommodate the additional logic in the slower pipeline stage.
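
The arithmetic behind this example can be sketched in a few lines of Python. This is a simplified model with stated assumptions: it ignores setup and hold margins, clock-to-Q delay, and routing effects, and the 1.5nsec delay for the faster stage is an assumed value that the post does not give. It is not how the Vivado tools compute timing.

```python
# Two-stage pipeline FFi -> logic1 -> FFj -> logic2 -> FFk.
# Delaying FFj's clock by `skew_ns` gives logic1 extra time and takes the same
# amount from logic2. Setup/hold margins and clock-to-Q delay are ignored.

def min_clock_period(delay1_ns, delay2_ns, skew_ns=0.0):
    """Smallest clock period that satisfies both stages for a given FFj clock delay."""
    stage1_needs = delay1_ns - skew_ns   # FFj captures late, so stage 1 gains time
    stage2_needs = delay2_ns + skew_ns   # FFj launches late, so stage 2 loses time
    return max(stage1_needs, stage2_needs)

LOGIC1_NS = 2.5   # slow stage, matching the 2.5nsec window in the example
LOGIC2_NS = 1.5   # assumed delay for the fast stage (not given in the post)

no_borrow = min_clock_period(LOGIC1_NS, LOGIC2_NS)
borrowed = min_clock_period(LOGIC1_NS, LOGIC2_NS, skew_ns=0.5)

print(f"Without borrowing: {no_borrow} ns period, {1000 / no_borrow:.0f} MHz")
print(f"With 0.5 ns borrowed: {borrowed} ns period, {1000 / borrowed:.0f} MHz")
```

In this model, borrowing only helps up to the point where the two stages balance. Borrowing more than that simply moves the critical path into the next stage.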


A graph showing 89 results of this time-borrowing technique applied to Xilinx customer designs appears in the figure below:



Time Borrowing Results.jpg



As you can see, the resulting Fmax increase ranges from a low of about 1% to a high of nearly 14% with an average of about 5.5%. That’s some serious potential performance improvement from a relatively simple and automated design technique.


Again, these sorts of optimizations are made possible by the innovations incorporated into the programmable-logic fabric in Xilinx UltraScale+ FPGAs and Zynq UltraScale+ MPSoCs.



VisualApplets 3.jpg


Silicon Software’s VisualApplets has long been a handy GUI-based tool for designers creating high-performance, image-processing systems using FPGAs. The company is now offering a free e-book that shows you how the latest version, VisualApplets 3, lets you create such systems with Silicon Software’s V-series frame grabbers or compatible Baumer LX VisualApplets video cameras in as little as 1 week.


Click here to sign up for a free copy of the book.




Are you attending Supercomputing 2016 (SC16) in Salt Lake City next week? Would you like to learn about reconfigurable hardware acceleration for data centers? (Hint: Think superior performance/watt.) Well, you’re in luck. There’s a free, 1-hour briefing on this topic taking place right next door to the conference in the Utah Museum of Contemporary Art on the morning of November 16.


Xilinx is hosting the briefing and if you’d like to attend this space-limited event and hear from Xilinx engineers and researchers about how FPGAs are accelerating the widest range of data center workloads, click here to learn more and to register.




Adam Taylor's MicroZed Chronicles Part 155: Introducing the PYNQ (Python + Zynq) Dev Board

by Xilinx Employee ‎11-06-2016 03:59 PM - edited ‎11-11-2016 11:41 AM (7,997 Views)


By Adam Taylor


Having recently received a Xilinx/Digilent PYNQ Dev Board, I want to spend some time looking at this rather exciting Zynq-based board. For those not familiar with the PYNQ, it combines the capability of the Zynq and the productivity of the Python programming language and it comes in a rather eye-catching pink color.





PYNQ up and running on my desk



Hardware-wise, PYNQ incorporates an on-board Xilinx Zynq Z-7020 SoC, 512Mbytes of DDR SDRAM, HDMI In and Out, Audio In and Out, two PMOD ports, and support for the popular Arduino Interface Header. We can configure the board from either the SD card or QSPI. On its own, PYNQ would be a powerful development board. However, there are even more exciting aspects to this board that enable us to develop applications that use the Zynq SoC’s Programmable Logic.


The Zynq SoC runs a Linux kernel with a specific package that supports all of the PYNQ’s capabilities. Using this package, it is possible to place hardware overlays (in reality bit files developed in Vivado) into the programmable logic of the Zynq.


The base PYNQ overlay supports all of the PYNQ interfaces as shown below:





PYNQ PL hardware overlay



Within the supplied software environment, the PYNQ hardware and interfaces are supported by the Pynq Package. This package allows you to use the Python language to drive PYNQ’s GPIO, video, and audio interfaces along with a wide range of PMOD boards. We use this package within the code we have developed and documented using the Jupyter notebook, which is the next part of the PYNQ framework.


As engineers, we ought to be familiar with the Python language and Linux, even if we are not experts in either. However, we may be unfamiliar with Jupyter notebooks. These are Web-based, interactive environments that allow us to run code and to include widgets, documentation, plots, and even video within the Jupyter notebook Web pages.


A Jupyter notebook server runs under the Linux operating system that’s running on the PYNQ’s Zynq SoC. We use this interface to develop our PYNQ applications. Jupyter notebooks and overlays are the core of the PYNQ development methodology and over the next series of blogs we are going to explore how we can use these notebooks and overlays and even develop our own as required.


Let’s look at how we can power up the board and get our first “hello world” program running. We’ll develop a simple program that allows us to understand the process flow.


The first thing to do is to configure an SD card with the latest kernel image, which we can download from here. With this downloaded, the next step is to write the ISO file to the SD card using an application like Win Disk Imager (if we are using Microsoft Windows).


Insert the SD card into the PYNQ board (check that the jumper is set for SD boot) and connect a network cable to the Ethernet port. Power the board up and, once it boots, we can connect to the PYNQ board using a browser.


In a new browser window, enter the address http://pynq:9090, which will take us to a log-on page where we enter the username Xilinx. From there we will see the Jupyter notebook’s welcome page:




The PYNQ welcome page



Clicking on “Welcome to Pynq.ipynb” will open a welcome page that tells us how to navigate around the notebook and where to find supporting material.


For this example, we are going to create our own very simple example to demonstrate the flow, as I mentioned earlier. Again, we run the Python programs from within the Jupyter notebook. We can see which programs we currently have running on the PYNQ by clicking on the “Running” tab, which is present on most notebook pages. Initially we have no notebooks running, so clicking on it right now will only show us that there are no running notebooks.




Notebooks running on the PYNQ



To create your own example, click on the examples page and then click on “New.” Select “notebooks Python 3” from the icon on the right:




Creating a new notebook



This will create a new notebook called untitled. We can change the name to whatever we desire by clicking on “untitled,” which will open a dialog box to allow us to change the name. I am going to name my example after the number of this MicroZed Chronicles blog post (Part 155).





Changing the name of the Notebook



The next thing we wish to do is enter the code we wish to run on the PYNQ. Within the notebook, we can mark text as either Code, Markdown, Heading, or Raw NBConvert.





We can mark text as either Code, Markdown, Heading, or Raw NBConvert



For now, select “code” (if it is not already selected) and enter the code: print("hello world")






The code to run in the notebook



We click the play button to run this very short program. With the box selected and all being well, you will see the result appear as below:





Running the code





 Result of Running the Code



If we look under the running tab again, we will see that this time there is a running application:




Running Notebooks




If we wish to stop the notebook from running then we click on the shutdown button.


Next time, we will look at how we can use the PYNQ in more complex scenarios.


We can also use the PYNQ board as a traditional Zynq-based development board if desired. This makes the PYNQ one of the best dev board choices available now.


Note, you can also log on to the PYNQ board using a terminal programme like PuTTY with the username and password Xilinx.





Code is available on GitHub as always.








Pentek 71861 4-channel 200MHz ADC with DDC.jpg 

Pentek has taken proven DSP designs in its existing Cobalt and Onyx module product lines and migrated these designs to what the company is calling its Jade architecture based on 20nm Xilinx Kintex UltraScale FPGAs. As a result, according to Pentek, the modules based on the new Jade architecture deliver a 50% DSP performance boost while using 18% less power. There’s also a 39% cost reduction. Now that’s a technology migration story worth studying!


The first Pentek product to employ the Jade architecture is the Model 71861 XMC module with four 200MHz ADC channels and programmable multiband DDCs (digital downconverters). This module is designed for radar and software-defined radio applications. In addition to the XMC form factor, similar products are available in PCIe, 3U & 6U VPX, AMC, and 3U & 6U cPCI form factors. Pentek offers versions of these modules for both commercial and rugged environments.


Here’s a block diagram of the Pentek Model 71861 4-channel ADC module:





Pentek Model 71861 Block Diagram



Note that the Kintex UltraScale FPGA implements every bit of the module’s digital system design except for the on-board Flash and DDR4 memory and the clock generator. The FPGA has access to all of the module’s data and control paths, permitting factory-defined functions to control the system’s data multiplexing, channel selection, data packing, gating, triggering, and the on-board acquisition and processing memory—which can be as large as 5Gbytes. All of these functions exist in the FPGA as IP blocks.


Here’s a block diagram of the IP design in the Pentek Model 71861 XMC module’s Kintex UltraScale FPGA:





Pentek Model 71861 FPGA Block Diagram



If this block diagram looks somewhat familiar, then perhaps you’re recalling similar block diagrams that appeared in this year’s July Xcell Daily blog post about Pentek’s Model 71791 L-band tuner in an XMC form factor and the earlier February Xcell Daily blog post about Pentek’s Cobalt 71664. (See “Pentek L-band Tuner XMC card handles 975-2175MHz RF with DDC and control help from a Virtex-7 FPGA” and “Virtex-6 FPGA powers Pentek VITA 49 Radio Transport Standard CompactPCI/AMC/PCIe/VPX Modules for SDR.”)


The titles of those blog posts along with the similar block diagrams tell you that Pentek has created three broad, board-level product generations by moving its IP—DDC, data packing, flow control, metadata generator, and DMA engine—from the Xilinx Virtex-6 FPGA family to the Virtex-7 FPGA family and on to the Kintex UltraScale FPGA family.


Pentek is not just spanning Xilinx FPGA generations with this latest product family based on its new Jade architecture; the company is also taking advantage of the pin compatibility engineered into the Xilinx UltraScale FPGA product line to offer customers a range of products with different price/performance ratios.


More products = more choices for the customer.


As the Pentek press release states:


“The [modules'] Kintex UltraScale FPGA site can be populated with a range of FPGAs to match the specific requirements of the processing task, spanning the KU035 through KU115. The KU115 features 5520 DSP48E2 slices and is ideal for modulation/demodulation, encoding/decoding, encryption/decryption, and channelization of the signals between transmission and reception. For applications not requiring large DSP resources or logic, a lower-cost FPGA can be installed.”


Now that’s true design reuse in action!


Note: For an additional discussion of ways to use Xilinx UltraScale device pin compatibility, see today’s earlier Xcell Daily post “PRO DESIGN adds five Virtex UltraScale modules and one Kintex UltraScale module to its growing FPGA prototyping line.”





Last week at ARM TechCon in Santa Clara, California, PRO DESIGN introduced five new FPGA modules based on different Xilinx Virtex UltraScale FPGAs for its proFPGA prototyping system. The company also introduced a new prototyping module based on a Xilinx Kintex UltraScale FPGA for the same system.


The five boards based on Virtex UltraScale FPGAs include:


  • The proFPGA XCVU190
  • The proFPGA XCVU160
  • The proFPGA XCVU125
  • The proFPGA XCVU095
  • The proFPGA XCVU080



proFPGA XCVU190 FPGA Module



Note that the module designation also identifies the Xilinx FPGA used as the foundation for the module’s design.


Here’s a block diagram of the proFPGA XCVU190 FPGA Module:





proFPGA XCVU190 FPGA Module Block Diagram



PRO DESIGN also introduced the proFPGA XCKU115 FPGA module, based on a Xilinx Kintex UltraScale device with a similar part number.


All of these modules are now shipping and complement existing proFPGA modules including the proFPGA XCVU440 module introduced early this year (see “FPGA Prototyping board based on a large Xilinx FPGA has rated capacity of 30M ASIC gates”) and the proFPGA Zynq 7000 module, introduced a couple of years ago and based on Xilinx Zynq Z-7045 or Z-7100 SoCs. All of these proFPGA modules plug into the company’s proFPGA motherboards that handle one, two, or four modules. You can mix and match different modules to achieve the exact mix of programmable resources that you need for prototyping your system. The company also provides its proFPGA Builder design environment for creating system designs and partitioning these designs across multiple FPGAs using its proFPGA modules.


What’s the secret behind launching so many new hardware products in such a short time? It’s having a well-thought-out platform approach to product-family development, aided in no small way by the careful engineering behind the Xilinx UltraScale FPGA packaging, which produces a large number of pin-compatible devices. Here’s a revealing chart from Xilinx User Guide UG575, pragmatically titled “UltraScale Device Packaging and Pinouts,” that clearly illustrates this Xilinx UltraScale feature:



UltraScale Pin Compatibility Table with Highlighting.jpg 


Note the six pin-compatible UltraScale FPGAs shown for the B2104 package. They’re the same six Xilinx UltraScale devices PRO DESIGN used to implement its new proFPGA modules. Pin compatibility is an important consideration for creating multi-member, end-product families and Xilinx UltraScale FPGAs are designed to support this forward-thinking design style. PRO DESIGN’s six new FPGA prototyping modules based on UltraScale devices prove that this design approach works well.



For more information about these six new proFPGA modules, contact PRO DESIGN directly.



Over the weekend, the HiRISE (High Resolution Imaging Science Experiment) camera in NASA’s MRO (Mars Reconnaissance Orbiter) sent the following hi-res image of the Schiaparelli module’s landing site on Mars. The image identifies three new visible features on the Martian landscape, corresponding to what appears to be the final resting places for the probe’s parachute and rear heat shield, the Schiaparelli module itself, and the front heat shield. (The image scale is 29.5 cm/pixel.) Schiaparelli is the landing component of the ExoMars 2016 mission—a joint project between the European and Russian Space Agencies.






Zooming in on Schiaparelli components on Mars. Copyright NASA/JPL-Caltech/U. of Arizona



Analysis of this image leaves little doubt that the Schiaparelli module crashed on Mars during the attempted landing but only after sending 600Mbytes of invaluable scientific data back to Earth.


NASA’s MRO has been orbiting Mars since early 2006 following an August, 2005 launch. The HiRISE camera has been capturing 5 to 20 images per day, every day, over the past 10 years. The camera is based on 14 CCDs with 14 FPGAs in the camera’s CCD Processing and Memory Module (CPMM). Each of those FPGAs is a radiation-tolerant Xilinx Virtex 300E device configured to perform a variety of tasks including control, signal processing, lookup table compression, data storage, maintenance, and external I/O. (More technical details here.)


Each of the 180nm, 1.8V Xilinx Virtex-E devices contains 6912 logic cells. These devices were first introduced in 1998. These days, you get that many logic cells in some of the smallest Xilinx devices thanks to the 100x density improvements made possible by newer IC process nodes.


As you can see, Xilinx FPGAs have a very long history of being used for challenging imaging applications.






Xilinx now has four families of cost-optimized All Programmable devices to help you build systems:




  • Artix-7: The cost-optimized Xilinx FPGA family with 6.6Gbps serial transceivers for designs that need compatibility with high-speed serial I/O protocols such as PCIe Gen2 and USB 3.0. The Artix-7 family now has two smaller members: the Artix-7 A12T and A25T with 12,800 and 23,360 logic cells respectively.




Here’s a 12-minute video that further clarifies the options you now have:





Adam Taylor’s MicroZed Chronicles, Part 154: SDSoC Tracing

by Xilinx Employee on ‎10-31-2016 09:48 AM (3,904 Views)


By Adam Taylor


The Xilinx SDSoC Development Environment allows us to create a complex system with functions in both the Zynq SoC’s PS (Processor System) and PL (Programmable Logic). To achieve the best system performance and optimization, we need to see the interaction between the PS and PL. This is where tracing comes in. Unlike AXI profiling, which lets us check that we are using the AXI links between the PS and PL optimally, tracing shows us the detailed interaction taking place between the software and the hardware acceleration. To do this, SDSoC instruments the design and includes several additional blocks to trace events within the PL design, enabling tracing for the global solution.


We can enable tracing using the SDSoC Project Overview: select the Enable Event Tracing check box and set the active build to SDDebug. We are then able to build our application with the trace functionality built in. I recommend that you clean the build first (Project -> Clean), which helps ensure that there are no issues.





Using SDSoC Project Overview to configure a build with built-in trace



For this blog post, I am going to target the Avnet ZedBoard with standalone OS and use a matrix multiplication example.


Once SDSoC has generated the build files, we need to run the trace design from within SDSoC itself. We need to connect both the JTAG and UART to our development PC to run things from within the SDSoC environment. If we are using a Linux operating system, then we also need to connect the Ethernet port to the same network as the development PC.


Power up the target board and then, within the SDSoC Project Explorer, expand your working project. Beneath the SDDebug folder, right-click on the resultant ELF file, then select Debug and Trace Application.






Executing the trace application



This will then program the bit file into the target board, execute the instrumented design, capture the trace results, and upload them to SDSoC. If we have a terminal window connected to the target board to capture the UART’s output, we will see the results of the program being executed:






UART results of the program




The trace application executes within SDSoC and uploads a trace log and a graphical view of this log. This log shows the starting and stopping of the software, data transfers, and hardware acceleration:





SDSoC Trace Log






Results of tracing an example application

(Orange = Software, Green = Accelerated function and Blue = Transfer)
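To make the three event classes concrete, here is a minimal sketch that totals the time spent in each class from a list of start/stop events. The event tuple layout, the event names, the timestamps, and the `summarize` function are hypothetical; SDSoC's actual trace log format is not reproduced here.

```python
from collections import defaultdict

# Hypothetical event layout: (timestamp, kind, name, phase).
# The values are invented for illustration, not captured from a real run.
EVENTS = [
    (0,   "software",    "main",    "start"),
    (10,  "transfer",    "dma_in",  "start"),
    (40,  "transfer",    "dma_in",  "stop"),
    (40,  "accelerator", "mmult",   "start"),
    (100, "accelerator", "mmult",   "stop"),
    (100, "transfer",    "dma_out", "start"),
    (120, "transfer",    "dma_out", "stop"),
    (130, "software",    "main",    "stop"),
]


def summarize(events):
    """Return total elapsed time per event kind from paired start/stop events."""
    open_events = {}            # (kind, name) -> start timestamp
    totals = defaultdict(int)   # kind -> accumulated duration
    for ts, kind, name, phase in sorted(events, key=lambda e: e[0]):
        key = (kind, name)
        if phase == "start":
            open_events[key] = ts
        elif phase == "stop":
            # Assumes every stop has a matching earlier start.
            totals[kind] += ts - open_events.pop(key)
    return dict(totals)
```

With the sample data, `summarize(EVENTS)` attributes 130 time units to software, 50 to transfers, and 60 to the accelerated function. A breakdown like this is the kind of information the graphical trace view lets you read off at a glance, for example whether data transfers dominate the accelerated function's run time.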




Running the trace creates a new project containing the trace details, which can be seen within the project explorer.





New Project containing trace data



Tracing instruments the PL side of the design so I ran the same build with and without tracing enabled to compare the differences in resource allocation. The results appear below:






Resource Utilization with Tracing enabled (left) and Normal Build (right)



There is a small difference between the two builds but implementing trace within the design does not significantly increase the resources required.


I also made a short video for this blog post. Here it is:







Code is available on Github as always.


If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.




  • First Year E Book here
  • First Year Hardback here



 MicroZed Chronicles hardcopy.jpg




  • Second Year E Book here
  • Second Year Hardback here



MicroZed Chronicles Second Year.jpg




All of Adam Taylor’s MicroZed Chronicles are cataloged here.






Today, the OpenPOWER Accelerator Workgroup announced development of the CAPI SNAP Framework, which makes it easier for developers to harness the power of FPGA acceleration in data center applications. (“SNAP” stands for “Storage, Networking, and Acceleration Programming.”) The CAPI SNAP Framework is being developed with significant contributions from IBM, Xilinx, Rackspace, Eideticom, Reconfigure.io, Alpha-Data, and Nallatech. According to today’s announcement, CAPI SNAP:



  • Makes it easy for developers to create new, specialized algorithms for FPGA acceleration using high-level programming languages such as C++ and Go instead of more conventional and lesser-known FPGA-programming languages such as VHDL and Verilog.


  • Makes FPGA acceleration more accessible to ISVs and other organizations to bring faster data analysis to their users.


  • Leverages the OpenPOWER Foundation ecosystem to continually drive collaborative innovation.



CAPI, the “Coherent Accelerator Processor Interface,” appeared back in early 2014 (see “IBM’s OpenPOWER Foundation allows 3rd parties to enhance the POWER8 processor with coherent hardware accelerators using CAPI”) and rolled out as OpenCAPI, an open, evolved, ISA-agnostic standard, earlier this month. (See “OpenCAPI Consortium rolls out coherent, ISA-agnostic interconnect for Hyperscale Data Centers.”) CAPI SNAP aims to make it easier to use CAPI’s high-speed, coherent interconnect to develop high-performance, application-specific, FPGA-based hardware accelerators that are closely connected to server processors.



All Star PC.jpg 

Twenty-six years ago, working as a Senior Regional Editor for EDN Magazine, I wrote the “All-Star PC Project” series of five articles detailing the construction of the world’s most capable PC (at the time) that could run the PC-based CAD and CAE software that was just starting to appear. As part of that project, I needed a capable SCSI card that could run the All-Star PC’s two 338Mbyte Seagate Wren Runner SCSI drives (HUGE for the time) as primary disk storage and the 2.5Gbyte Exabyte 8mm SCSI Tape Cartridge Subsystem (back when tape was still a good backup medium for PCs). The best SCSI controller card I could find back then was the Always Technology IN-2000 AS SCSI Host Adapter, which was based on a first-generation Xilinx XC2018 FPGA and a second-generation XC3020 FPGA. As it turns out, there’s a story there.


The All-Star PC ran Microsoft DOS and was using QEMM memory management, supplied by Quarterdeck, to get beyond the existing DOS RAM limit of 640Kbytes. (Remember that?) However, QEMM created problems for SCSI controllers like the Adaptec AHA-1540A, which employed first-party DMA. There was no way to keep the Adaptec drivers synchronized with the QEMM memory mapping so the DMA didn’t work. The Always IN-2000 SCSI adapter didn’t have this problem. Its drivers were compatible with QEMM.




IN-2000 Baseball Card.jpg


IN-2000 SCSI Adapter Baseball Card from the All-Star PC Project series. You can see the Xilinx XC2018 and XC3020 FPGAs in the center of the board.



Replacing the Adaptec board with the Always IN-2000 solved my QEMM/DMA problem and, after properly configuring SCSI addresses so that the IN-2000 didn’t try to initialize the Exabyte tape drive as a disk, the All-Star PC booted. However, performance was 25% less than what I’d gotten from the Adaptec card. A BIOS ROM change for the Always IN-2000 adapter fixed this problem but then a new problem arose. Writes to the hard drives were not working properly. The data was corrupt.


To understand why, you need some information about the All-Star PC’s motherboard. Unlike nearly all other PC motherboards, the All-Star PC’s Cheetah Gold 425 motherboard from Cheetah International was not based on an LSI chip set. Instead, it was based on PLDs and TTL MSI parts to accommodate the brand new, 25MHz Intel 80486 microprocessor (very, very fast for the time). If you know about PC AT architecture, you know that the PC AT bus was derived from the Intel x86 bus and ran synchronously with it. However, that system architecture doesn’t work when your processor runs at 25MHz instead of 4.77, 6, 8, or 10MHz because the PC AT bus would run far too fast for the expansion cards of the day. So Cheetah had developed a PLD-based motherboard that operated the expansion bus independently from the microprocessor clock. During our efforts to boost SCSI performance, we’d upped the PC AT bus speed from 6MHz to 8MHz and that 33% speed increase broke the IN-2000.
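The arithmetic behind that speed increase is easy to check. The sketch below computes the percentage increase and the shrinking clock period for the two bus speeds. The 150 ns path delay and the single-cycle timing check are made-up simplifications, used only to show how a logic path can pass at 6MHz and fail at 8MHz; they are not measurements from the IN-2000.

```python
def percent_increase(old_mhz, new_mhz):
    """Relative speed increase, in percent, when going from old_mhz to new_mhz."""
    return (new_mhz - old_mhz) / old_mhz * 100.0


def period_ns(freq_mhz):
    """Clock period in nanoseconds for a clock frequency given in MHz."""
    return 1000.0 / freq_mhz


def meets_timing(path_delay_ns, freq_mhz):
    """Simplified check: the path must settle within one clock period."""
    return path_delay_ns <= period_ns(freq_mhz)


# Hypothetical figure for illustration only.
HYPOTHETICAL_PATH_DELAY_NS = 150.0
```

Going from 6MHz to 8MHz is a 33% increase, and the clock period drops from about 167 ns to 125 ns. Under the assumed 150 ns delay, `meets_timing(HYPOTHETICAL_PATH_DELAY_NS, 6)` holds while `meets_timing(HYPOTHETICAL_PATH_DELAY_NS, 8)` does not.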


As I wrote in the article:


“Johan Olstenius, the IN-2000's designer, discovered a logic path on the adapter that was too slow to handle the faster SCSI transfers. Fortunately, the IN-2000 employs Xilinx field-programmable gate arrays (FPGAs) for most of the board’s logic, so the correction required only a ROM change. A small EPROM on the IN-2000 holds the FPGA configuration information, which is loaded automatically by the Xilinx parts at power up. Olstenius rewired some of the circuitry in the Xilinx chip to speed up the slow logic path. This exercise underscores the advantages of the ‘soft hardware’ design approach employed on the IN-2000.”


I wrote that paragraph back in early 1990 and there are a lot of FPGA-related statements in that article that still apply to system design today. In particular, Xilinx devices are still used to implement the logic for entire systems including bus interfaces, and for the very same reasons. As recently as this week, I’ve discussed several such products including:







Coincidentally, that’s three very different, high-performance products implementing an entire digital system including the bus interface with three successive Xilinx FPGA generations—and all announced just this week. A lot of things have changed since 1990. Designers don’t use XC2000 and XC3000 parts for new designs any more. Speeds have gotten faster and systems have gotten a lot more complex. However, “…the advantages of the ‘soft hardware’ design approach employed on the IN-2000” cited in the article still apply even after a quarter of a century.




Personal notes: The founder of Cheetah International, Ron Sartore, became my good friend during the All-Star PC project and we’re having dinner tonight, so this seemed an especially appropriate day to publish this post. However, it would not have been possible to write today’s blog if another good friend, EDN’s Senior Technical Editor Martin Rowe, had not sent me the pages from EDN’s All-Star PC Project series last week. He sliced them out of spare archived issues. Martin was responsible for saving more than 60 years of EDN from the dumpster and getting them archived in the Gordon McKay Library at the John A. Paulson School of Engineering and Applied Sciences at Harvard, where they belong.


I’ve made PDF archives of the five articles in the All-Star PC Project series. You’ll find them at the bottom of this Web page. Thanks, Martin!

About the Author
  • Steve Leibson is the Director of Strategic Marketing and Business Planning at Xilinx. He started as a system design engineer at HP in the early days of desktop computing, then switched to EDA at Cadnetix, and subsequently became a technical editor for EDN Magazine. He's served as Editor in Chief of EDN Magazine, Embedded Developers Journal, and Microprocessor Report. He has extensive experience in computing, microprocessors, microcontrollers, embedded systems design, design IP, EDA, and programmable logic.