

This week, EXFO announced and demonstrated its FTBx-88400NGE Power Blazer 400G Ethernet Tester at the ECOC 2017 optical communications conference in Gothenburg, Sweden, using a Xilinx VCU140 FPGA design platform as an interoperability target. The VCU140 development platform is based on a Xilinx Virtex UltraScale+ VU9P FPGA. EXFO’s FTBx-88400NGE Power Blazer offers advanced testing for the full suite of new 400G technologies including support for FlexE (Flex Ethernet), 400G Ethernet, and high-speed transceiver validation. The FlexE function supports one or more bonded 100GBASE-R PHYs carrying multiple Ethernet MACs, each operating at a rate of 10, 40, or n x 25Gbps. Flex Ethernet is a key data center technology that helps data centers deliver links that are faster than emerging 400G solutions.


Here’s a photo of the ECOC 2017 demo:




EXFO FTBx-88400NGE Power Blazer Demo.jpg 




This demonstration is yet one more proof point for the 400GbE standard, which will be used in a variety of high-speed communications applications including data-center interconnect, next-generation switch and router line cards, and high-end OTN transponders.




Farhad Fallahlalehzari, an Aldec Application Engineer, has just published a blog on the Aldec Web site titled “Demystifying AXI Interconnection for Zynq SoC FPGA.” So if it’s a mystery, please click over to that blog post so that the topic will no longer mystify.



Adam Taylor’s MicroZed Chronicles, Part 216: Capturing the HDMI video mode with the ADV7611 HDMI FMC

by Xilinx Employee, 09-18-2017 11:33 AM (edited 09-18-2017 11:34 AM)


By Adam Taylor



With the bit file generated, we are now able to create software that configures the ADV7611 Low-Power HDMI Receiver chip and the Zynq SoC’s VTC (Video Timing Controller). If we do this correctly, the VTC will then be able to report the input video mode.


To be able to receive and detect the video mode, the software must perform the following steps:


  • Initialize and configure the Zynq SoC’s I2C controller for master operation at 100kHz
  • Initialize and configure the VTC
  • Configure the ADV7611
  • Sample the VTC once a second, reporting the detected video mode
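The final step in the list above can be sketched as a simple polling loop. This is a minimal illustration in Python rather than the article's actual Zynq application code: `read_active_size` is a hypothetical callback standing in for whatever VTC detector read the real software performs, and the mode table holds only the resolutions named in this post.

```python
import time

# Resolutions named in this post, keyed by (active pixels, active lines).
KNOWN_MODES = {
    (640, 480): "VGA",
    (800, 600): "SVGA",
    (1280, 720): "720P",
    (1280, 1024): "SXGA",
}


def classify_mode(active_pixels, active_lines):
    """Return a mode name for a detected active size, or "unknown"."""
    return KNOWN_MODES.get((active_pixels, active_lines), "unknown")


def poll_video_mode(read_active_size, samples, period_s=1.0, sleep=time.sleep):
    """Sample the detector `samples` times, `period_s` seconds apart.

    `read_active_size` is a hypothetical callback that returns the
    detected (active pixels, active lines) pair.
    """
    reports = []
    for _ in range(samples):
        width, height = read_active_size()
        reports.append(f"{width}x{height} ({classify_mode(width, height)})")
        sleep(period_s)
    return reports
```

Passing the sleep function as a parameter keeps the loop easy to exercise on a desktop without waiting a real second per sample.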





ZedBoard, FMC HDMI, and the PYNQ dev board connected for testing




Configuring the I2C and VTC is very simple. We have done both several times throughout this series (See these links: I2C, VTC.) Configuring the ADV7611 is more complicated and is performed using I2C. This is where this example gets a little complicated as the ADV7611 uses eight internal I2C slave addresses to configure different sub functions.







To reduce address-contention issues, seven of these addresses are user configurable. Only the IO Map has a fixed default address.


I2C addressing uses 7 bits. However, the ADV7611 documentation specifies 8-bit addresses, which include a Read/Write bit. If we do not understand the translation between these 7- and 8-bit addresses, we will experience addressing issues because the Read/Write bit is set or cleared depending on the function we call from XIICPS.h.


The picture below shows the conversion from 8-bit to 7-bit format. The simplest method is to shift the 8-bit address one place to the right.
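As a quick worked example of the shift, here is a small Python sketch. The value 0x98 is used only as an illustrative 8-bit write address; confirm the actual map addresses against the ADV7611 documentation.

```python
def to_7bit(addr8):
    """Drop the Read/Write bit (bit 0) from an 8-bit I2C address."""
    if not 0 <= addr8 <= 0xFF:
        raise ValueError("8-bit I2C address must be in the range 0x00-0xFF")
    return addr8 >> 1


def to_8bit(addr7, read=False):
    """Rebuild the 8-bit form: 7-bit address shifted left, R/W in bit 0."""
    if not 0 <= addr7 <= 0x7F:
        raise ValueError("7-bit I2C address must be in the range 0x00-0x7F")
    return (addr7 << 1) | int(read)


EXAMPLE_ADDR_8BIT = 0x98  # illustrative value, assumed for this sketch

print(hex(to_7bit(EXAMPLE_ADDR_8BIT)))   # 0x4c
print(hex(to_8bit(0x4C, read=True)))     # 0x99
```

The 7-bit result is the form to pass to a driver that appends the Read/Write bit itself.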







We need to create a header file containing the commands to configure each of the ADV7611’s eight sub functions.


This raises the question of where to obtain the information to configure the ADV7611 device. Rather helpfully, the Analog Devices EngineerZone provides several resources, including a recommended register settings guide and several pre-tested scripts that you can download and use to configure the device for most use cases. All we need to do is select the desired use case and incorporate the commands into our header file.


One thing we must be very careful with is that the first command issued to the ADV7611 must be an I2C reset command. You may see a NACK on the I2C bus in response to this command as the reset asserts very quickly. We also need to wait an appropriate period after issuing the reset command before continuing to load commands. In this example, I decided to wait the same time as following a hard reset, which the data sheet specifies as 5msec.


Once 5msec has elapsed following the reset, we can continue loading configuration data, which includes the Extended Display Identification Data (EDID) table. The EDID identifies to the source the capabilities of the display. Without a valid EDID table, the HDMI source will not start transmitting data.
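The reset-then-wait sequence can be sketched as follows. This is a hedged Python illustration of the control flow only, not the article's code: the `write` callback, the `I2cNack` exception, and the command tuples are all hypothetical, and real register values should come from the vendor's scripts.

```python
import time


class I2cNack(Exception):
    """Hypothetical error raised by `write` when the device does not ACK."""


def load_script(commands, write, reset_wait_s=0.005, sleep=time.sleep):
    """Send a list of (addr7, register, value) commands.

    The first command is assumed to be the I2C reset. A NACK on that
    command is tolerated, and the loader then waits `reset_wait_s`
    (5 ms by default) before sending the remaining commands.
    """
    commands = list(commands)
    if not commands:
        return 0
    first, rest = commands[0], commands[1:]
    try:
        write(*first)
    except I2cNack:
        pass  # the reset can assert before the device acknowledges
    sleep(reset_wait_s)
    for command in rest:
        write(*command)  # a NACK here is a genuine error and propagates
    return len(commands)
```

Only the first command is allowed to fail silently; a NACK on any later command still surfaces, which helps catch a wrong 7-bit address early.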


Having properly configured the ADV7611, we may want to read back registers to ensure that it is properly configured or to access the device’s status. To do this successfully, we need to perform what is known as an I2C repeat start in the transaction following the initial I2C write. A repeat start is used when a master issues a write command and then wants to read back the result immediately. Issuing the repeat start prevents another device from interrupting the sequence.


We can configure the I2C controller to issue repeat starts between write and read operations within our software application by using the function call XIicPs_SetOptions(&Iic,XIICPS_REP_START_OPTION). Once we have completed the transaction we need to clear the repeat start option using the XIicPs_ClearOptions(&Iic,XIICPS_REP_START_OPTION) function call. Otherwise we may have issues with communication.
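The set-then-always-clear pairing described above can be modeled in Python with a context manager. This sketch only illustrates the pattern: `FakeI2cController` is a hypothetical stand-in, not the Xilinx driver, and the register address and value are arbitrary example numbers.

```python
from contextlib import contextmanager


class FakeI2cController:
    """Hypothetical stand-in for an I2C master driver (not a real API)."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self.repeated_start = False
        self.pointer = 0
        self.log = []

    def write(self, addr7, data):
        self.pointer = data[0]  # first byte selects the register
        self.log.append(("write", self.repeated_start))

    def read(self, addr7, count):
        self.log.append(("read", self.repeated_start))
        return bytes(self.registers.get(self.pointer + i, 0) for i in range(count))


@contextmanager
def repeated_start(bus):
    """Enable the repeated-start option and always clear it afterwards."""
    bus.repeated_start = True
    try:
        yield bus
    finally:
        bus.repeated_start = False  # cleared even if the transfer fails


def read_register(bus, addr7, register):
    """Write the register pointer, then read one byte back."""
    with repeated_start(bus):
        bus.write(addr7, bytes([register]))
        value = bus.read(addr7, 1)[0]
    return value


bus = FakeI2cController({0xEA: 0x20})  # arbitrary example contents
print(hex(read_register(bus, 0x4C, 0xEA)))  # 0x20
```

Clearing the option in a `finally` clause means a failed transfer cannot leave the option set for later, unrelated transactions.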


Once configured, the ADV7611 starts free running. It will generate HDMI Frames even with no source connected. The VTC will receive these input frames, lock to them and determine the video mode. We can obtain both the timing parameters and video mode by using the VTC API. The video modes that can be detected are:







Initially, in its free-running mode, the ADV7611 outputs video in 640x480 pixel format. Checking the VTC registers, we can also observe that the detector has locked to the incoming sync signals and has detected the mode correctly, as shown in the image below:







With the free-running mode functioning properly, the next step is to stimulate the FMC HDMI with different resolutions to ensure that they are correctly detected.


To test the application, we will use a PYNQ Dev Board. The PYNQ is ideal for this application because it is easily configured for different HDMI video standards using just a few lines of Python, as shown below. The only downside is the PYNQ board does not generate fully compliant 1080P video timing.



SVGA video outputting 800 pixels by 600 lines @ 60Hz






720P video outputting 1280 pixels by 720 Lines @ 60 Hz






SXGA video outputting 1280 pixels by 1024 lines @ 60Hz







Having performed these tests, it is clear the ADV7611 on the FMC HDMI is working as required and is receiving and decoding different HDMI resolutions correctly. At the same time, the VTC is correctly detecting the video mode, enabling us to capture video data on our Zynq SoC or Zynq UltraScale+ MPSoC systems for further processing.


The FMC HDMI has another method of receiving HDMI that equalizes the channel and passes it through to the Zynq SoC’s or Zynq UltraScale+ MPSoC’s PL for decoding. I will create an example design based upon that input over the next few weeks.


Note that we can also use this same approach with a MicroBlaze soft processor core instantiated in a Xilinx FPGA.




Code is available on Github as always.



If you want E book or hardback versions of previous MicroZed chronicle blogs, you can get them below.




First Year E Book here

First Year Hardback here.



MicroZed Chronicles hardcopy.jpg 




Second Year E Book here

Second Year Hardback here



MicroZed Chronicles Second Year.jpg




Ryft deploys cloud-based search and analysis on Amazon’s FPGA-accelerated AWS EC2 F1 instance

by Xilinx Employee, 09-13-2017 11:52 AM (edited 09-13-2017 12:00 PM)


Ryft has announced that it now offers its Ryft Cloud cloud-based search and analysis tools on Amazon’s FPGA-accelerated AWS EC2 F1 instance through Amazon’s AWS Marketplace. When Xcell Daily last covered Ryft, the company had introduced the Ryft ONE, an FPGA-accelerated data analytics platform.  (See “FPGA-based Ryft ONE search accelerator delivers 100x performance advantage over Apache Spark in the data center.”)


Now you can access Ryft’s accelerated search and analysis algorithms instantly through Amazon’s EC2 F1 compute instance, which gets its acceleration from multiple Xilinx Virtex UltraScale+ VU9P FPGAs. According to Ryft, FPGA acceleration using the AWS EC2 F1 instance boosts application performance by 91X compared to traditional CPU-based cloud analytics.


How fast is that? Ryft has published a benchmark chart that shows you just how fast that is:




Ryft Acceleration Chart for AWS EC2 F1.jpg 




The announcement includes a link to a Ryft White Paper titled “Powering Elastic Search in the Cloud: Transform High-Performance Analytics in the AWS Cloud for Fast, Data-Driven Decisions.”




For more information about Amazon’s AWS EC2 F1 instance, see:











SDAccel—Xilinx’s development environment for accelerating cloud-based applications using C, C++, or OpenCL—is now available on Amazon’s AWS EC2 F1 instance. (Formal announcement here.) The Amazon EC2 F1 compute instance allows you to create custom hardware accelerators for your application using cloud-based server hardware that incorporates multiple Xilinx Virtex UltraScale+ VU9P FPGAs. SDAccel automates the acceleration of software applications by building application-specific FPGA kernels for the AWS EC2 F1. You can also use HDLs including Verilog and VHDL to define hardware accelerators in SDAccel. With this release, you can access SDAccel through the AWS FPGA developer AMI.



For more information about Amazon’s AWS EC2 F1 instance, see:









For more information about SDAccel, see:







What happens when you host a genomic analysis application on the FPGA-accelerated Amazon AWS EC2 F1 instance? You get Edico Genome’s and DNAnexus’ dramatic announcement of a $20, 90-minute offer to analyze an entire human genome. Edico Genome previously ported the DRAGEN pipeline to Amazon’s FPGA instances and DNAnexus customers can now leverage Edico Genome’s Dragen app as a turnkey solution. DNAnexus provides a global network for sharing and managing genomic data and tools to accelerate genomics. New and existing DNAnexus customers have access to the DRAGEN app.


The two companies have launched a promotion, lasting from Aug. 28 to Oct. 31, where whole-genome analysis on the AWS EC2 F1 2x instances costs $20 and takes about an hour and a half. In the next few weeks, Edico Genome’s DRAGEN will be available through DNAnexus on the F1 16x instances as well, which reduces analysis time to 20 minutes or so. Whole-exome analysis will cost about $5 during the promotional period.


The Amazon AWS EC2 F1 instance is a cloud service that’s based on multiple Xilinx Virtex UltraScale+ VU9P FPGAs installed in Amazon’s Web servers.




For more information about Edico Genome’s DRAGEN processor and genome analysis in Xcell Daily, see:









BrainChip Holdings has just announced the BrainChip Accelerator, a PCIe server-accelerator card that simultaneously processes 16 channels of video in a variety of video formats using spiking neural networks rather than convolutional neural networks (CNNs). The BrainChip Accelerator card is based on a 6-core implementation of BrainChip’s Spiking Neural Network (SNN) processor instantiated in an on-board Xilinx Kintex UltraScale FPGA.


Here’s a photo of the BrainChip Accelerator card:



BrainChip FPGA Board.jpg 


BrainChip Accelerator card with six SNNs instantiated in a Kintex UltraScale FPGA




Each BrainChip core performs fast, user-defined image scaling, spike generation, and SNN comparison to recognize objects. The SNNs can be trained using low-resolution images as small as 20x20 pixels. According to BrainChip, SNNs as implemented in the BrainChip Accelerator cores excel at recognizing objects in low-light, low-resolution, and noisy environments.


The BrainChip Accelerator card can process 16 channels of video simultaneously with an effective throughput of more than 600 frames per second while dissipating a mere 15W for the entire card. According to BrainChip, that’s a 7x improvement in frames/sec/watt when compared to a GPU-accelerated, CNN-based deep-learning implementation for neural networks like GoogLeNet and AlexNet. Here’s a graph from BrainChip illustrating this claim:




BrainChip Efficiency Chart.jpg 
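To put the quoted figures in perspective, the short Python sketch below works through the arithmetic. It treats 600 frames/sec and 15W as exact values, although the announcement says "more than 600". The GPU baseline is only what the 7x claim would imply under that assumption and is not a figure published here.

```python
def frames_per_second_per_watt(frames_per_second, watts):
    """Efficiency metric used in the comparison: frames/sec divided by watts."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return frames_per_second / watts


card_efficiency = frames_per_second_per_watt(600, 15)
implied_baseline = card_efficiency / 7  # back-calculated from the 7x claim

print(card_efficiency)             # 40.0
print(round(implied_baseline, 1))  # 5.7
```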





SNNs mimic human brain function (synaptic connections, neuron thresholds) more closely than do CNNs and rely on models based on spike timing and intensity. Here’s a graphic from BrainChip comparing a CNN model with the Spiking Neural Network model:





BrainChip Spiking Neural Network comparison.jpg 



For more information about the BrainChip Accelerator card, please contact BrainChip directly.




ARM, Cadence, TSMC, and Xilinx have announced a collaboration to develop a CCIX (Cache Coherent Interconnect for Accelerators) test chip in TSMC’s 7nm FinFET process technology with a 2018 completion date. The test chip will demonstrate multiple ARM CPUs, CMN-600 coherent on-chip bus, and foundation IP communicating to other chips including Xilinx’s Virtex UltraScale+ FPGAs over the coherent, 25Gbps CCIX fabric. Cadence is supplying the CCIX controller and PHY IP for the test chip as well as PCIe Gen 4, DDR4 PHY, and Peripheral IP blocks. In addition, Cadence verification and implementation tools are being used to design and build the test chip. According to the announced plan, the test chip tapes out early in the first quarter of 2018, with silicon availability expected in the second half of 2018.


You can’t understand the importance of this announcement if you aren’t fully up to speed on CCIX, which Xcell Daily has discussed a few times in the recent past.


CCIX simplifies the design of offload accelerators for hyperscale data centers by providing low-latency, high-bandwidth, fully coherent access to server memory. The specification employs a subset of full coherency protocols and is ISA-agnostic, meaning that the specification’s protocols are independent of the attached processors’ architecture and instruction sets. Full coherency is unique to the CCIX specification. It permits accelerators to cache processor memory and processors to cache accelerator memory.


CCIX is designed to provide coherent interconnection between server processors and hardware accelerators, memory, and among hardware accelerators as shown below:



CCIX Configurations.jpg


Sample CCIX Configurations



The CCIX Consortium announced Release1 of the CCIX spec a little less than a year ago. CCIX Consortium members Xilinx and Amphenol FCI demonstrated a CCIX interface operating at 25Gbps using two Xilinx 16nm UltraScale+ devices through an Amphenol/FCI PCI Express CEM connector and a trace card earlier this year.


As the CCIX Consortium’s Web site says:


“CCIX simplifies the development and adoption by extending well-established data center hardware and software infrastructure.  This ultimately allows system designers to seamlessly integrate the right combination of heterogeneous components to address their specific system needs.”


For more information, see these earlier Xcell Daily CCIX blog posts:










By Adam Taylor



When we surveyed the different types of HDMI sources and sinks recently for our Zynq SoC and Zynq UltraScale+ MPSoC designs, one HDMI receiver we discussed was the ADV7611. This device receives three TMDS data streams and converts them into discrete video and audio outputs, which can then be captured and processed. Of course, the ADV7611 is a very capable and somewhat complex device. It requires configuration prior to use. We are going to examine how we can include one within our design.






ZedBoard HDMI Demonstration Configuration




To do this, we need an ADV7611. Helpfully, the FMC HDMI card provides two HDMI inputs, one of which uses an ADV7611. The second equalizes the TMDS data lanes and passes them on directly to the Zynq SoC for decoding.


To demonstrate how we can get this device up and running with our Zynq SoC or Zynq UltraScale+ MPSoC, we will create an example that uses the ZedBoard with the HDMI FMC. For this example, we first need to create a hardware design in Vivado that interfaces with the ADV7611 on the HDMI FMC card. To keep this initial example simple, I will be only receiving the timing signals output by the ADV7611. These signals are:


  • Local Locked Clock (LLC) – The pixel clock.
  • HSync – Horizontal Sync, indicates the start of a new line.
  • VSync – Vertical Sync, indicates the start of a new frame.
  • Video Active – indicates that the pixel data is valid (e.g. we are not in a Sync or Blanking period)


This approach uses the VTC’s (Video Timing Controller’s) detector to receive the sync signals and identify the received video’s timing parameters and video mode. Once the VTC correctly identifies the video mode, we know that we have configured the ADV7611 correctly. It is then a simple step to connect the received pixel data to a Video-to-AXIS IP block and use VDMA to write the received video frames into DDR memory for further processing.


For this example, we need the following IP blocks:


  • VTC (Video Timing Controller) – Configured for detection and to receive sync signals only.
  • ILA – Connected to the sync signals so that we can see that they are toggling correctly—to aid debugging and commissioning.
  • Constant – Set to a constant 1 to drive the clock-enable and detector-enable inputs.


The resulting block diagram appears below. The eagle-eyed will also notice the addition of both a GPIO output and an I2C bus from the processor system. We need these to control and configure the ADV7611.






Simple Architecture to detect the video type



Following power up, the ADV7611 generates no sync signals or video. We must first configure the device, which requires the use of an I2C bus. We therefore need to enable one of the two I2C controllers within the Zynq PS and route the IO to the EMIO so that we can then route the I2C signals (SDA and SCL) to the correct pins on the FMC connector. The ADV7611 is a complex device to configure, with multiple I2C addresses that select different internal functions within the device such as EDID and High-bandwidth Digital Content Protection (HDCP).


We also need to be able to reset the ADV7611 following the application of power to the ZedBoard and FMC HDMI. We use a PS GPIO pin, output via the EMIO, to do this. Using a controllable I/O pin for this function allows the application software to reset the device each time we run the program. This capability is also helpful when debugging the software application because it ensures that we start from a fresh reset each time the program runs, which prevents previous configurations from affecting the next.


With the block diagram completed, all that remains is to build the design with the location constraints (identified below) to connect to the correct pins on the FMC connector for the ADV7611.






Vivado Constraints for the ADV7611 Design




Once Vivado generates the bit file, we are ready to begin configuring the ADV7611. Using the I2C interface this way is quite complex, so we will examine the steps we need to do this in detail in the next blog. However, the image below shows one set of the results from the testing of the completed software as it detects a VGA (640 pixel by 480 line) input:







VTC output when detecting VGA input format















Code is available on Github as always.






By Chetan Khona, Xilinx


The term Industrial IoT (IIoT) refers to a multidimensional, tightly coupled chain of systems involving edge devices, cloud applications, sensors, algorithms, safety, security, vast protocol libraries, human-machine interfaces (HMI), and other elements that must interoperate. If you’re designing equipment destined for IIoT networks, you have a lot of requirements to meet. This article discusses several.


Note: This article has been adapted from a new Xilinx White Paper titled “Key Attributes of an Intelligent IIoT Edge Platform.”



IT-OT Convergence


Some describe the IIoT as a convergence of information technology (IT) and operational technology (OT). The data-intensive nature of IT applications requires all these elements to come together with critical tasks performed reliably and on schedule. There’s usually a far more time-sensitive element to the OT applications. Designers generally meet these diverse IIoT requirements and challenges using embedded electronics at the IIoT edge (e.g., motion controllers, protection relays, programmable logic controllers, and similar systems) because embedded systems support deterministic communication and real-time control.


Equipment operating on IIoT networks at timescales on the order of hundreds of microseconds (or less) often needs to operate in factories and remote locations for decades without being touched, although it can be updated remotely via the networks that connect it. Relying solely on multicore embedded processors in these applications can lead to a series of difficult and costly marketing and engineering trade-offs focused on managing functional timing issues and performance bottlenecks. A more advanced approach that manages determinism, latency, and performance while eliminating interference between the IT and OT domains and within subsystems in the OT domain produces better results.


Sometimes, you just need hardware to meet these challenges because software is just too slow, even when running on multiple processor cores. Augmenting static microprocessor architectures with specialized hardware to create a balanced division of labor is not a new concept in the world of embedded electronics. What is new is the need to adapt both the tasks and the division of labor over time. For example, an upgraded predictive-maintenance algorithm might require more sensor inputs than its predecessor, or entirely new types of sensors with new types of interfaces. These sensors invariably require local processing as well to offload the centralized cloud application that’s crunching the data from all of the edge nodes. Offloading the incremental sensor-processing calculations to hardware maintains the overall loading and avoids overburdening the edge processor.



TSN and Legacy Industrial Networks


The IIoT networks linking these new systems are equally dynamic. They evolve almost daily. Edge and system-wide protocols including OPC-UA (the OPC Foundation Open Platform Communications-Unified Architecture) and DDS (Data Distribution Service for Real-Time Systems) are gaining significant momentum. Both of these protocols benefit from time-sensitive networking (TSN), a deterministic Ethernet-based transport that manages mixed-criticality data streams. TSN significantly advances the vision of a unified network protocol across the edge and throughout the majority of the IIoT solution chain because it supports varying degrees of scheduled traffic alongside best-effort traffic.


The goal is to get TSN integrated into the IIoT Endpoint to enable scheduled traffic versus best-effort traffic with minimum impact on control function timing. Yet TSN is an evolving standard, so using ASICs or ASSP chipsets developed before all aspects of the TSN standard and market-specific profiles are finalized carries some risk. Similarly, a software-only attempt to add TSN support to an existing controller may exhibit unpredictable timing behavior and might not meet timing requirements.


Ultimately, TSN requires a form of time-awareness not available in controllers today. A good TSN implementation requires the addition of both hardware and software—something that’s easily done using a device that integrates processors and programmable hardware like the Xilinx Zynq SoC and Zynq UltraScale+ MPSoC. These devices minimize the effects of adding TSN capabilities by implementing bandwidth-intensive, time-critical functions in hardware without significant impact to the software timing. (Xilinx offers an internally developed, fully standards-compatible, optimized TSN subsystem for the Zynq SoC and Zynq UltraScale+ MPSoC device families.)


Because industrial networking is not new, IIoT systems will need to support the lengthy list of legacy industrial protocols that have been developed and used throughout the industry’s past. This need will exist for many years. Most modern SoCs don’t offer support and cannot easily be retrofitted for even a small fraction of these industrial protocols. In addition, the number of network interfaces that one controller must support can often exceed an SoC’s I/O capabilities. In contrast, the programmable hardware and I/O within Zynq SoCs and Zynq UltraScale+ MPSoCs easily support these legacy protocols without causing the unwanted timing side effects to mainstream software and firmware that a software-based networking approach might cause.




Security and the IIoT


IIoT design must follow a “defense-in-depth” approach to cybersecurity. Defense in depth is a form of multilayered security that reaches all the way from the supply chain to the end-customers’ enterprise and cloud application software. (That’s a very long chain—and one that requires its own article. This article’s scope is the chain of trust for deployed embedded electronics at the IIoT edge.)


With the network extending to the analog-digital boundary, data needs to be secured as soon as it enters the digital domain—usually at the edge. Defense-in-depth security requires a strong hardware root of trust that starts with secure and measured boot operations; run-time security through isolation of hardware, operating systems, and software; and secure communications. The entire network should employ trusted remote attestation servers for independent validation of credentials, certificate authorities, and so forth.


Security is not a static proposition. Five notable revisions have been made to the transport layer security (TLS) secure messaging protocol since 1995, with more to come. Cryptographic algorithms that underpin protocols like TLS can be implemented in software, but such changes on the IT side can adversely affect time-critical OT performance. Architectural tools such as hypervisors and other isolation methods can reduce this impact. Designs based on devices that incorporate programmable hardware, like the Zynq SoC and Zynq UltraScale+ MPSoC, can also pair these software concepts with support for new, and even yet-to-be-defined, cryptographic functions years after equipment deployment.




Edico Genome moves genomic analysis and storage to the cloud using Amazon’s AWS EC2 F1 Instance

by Xilinx Employee, 09-08-2017 09:57 AM (edited 09-08-2017 10:00 AM)


Edico Genome has been developing genetic-analysis algorithms for a while now. (See this Xcell Daily story from 2015, “FPGA-based Edico Genome Dragen Accelerator Card for IBM OpenPOWER Server Speeds Exome/Genome Analysis by 60x”). The company originally planned to accelerate its algorithm by developing an ASIC, but decided this was a poor implementation choice because of the rapid development of its algorithms. Once you develop an ASIC, it’s frozen in time. Instead, Edico Genome found that Xilinx FPGAs were an ideal match for the company’s development needs and so the company developed the Dragen Accelerator Card for exome/genome analysis.


This hardware was well suited to Edico Genome’s customers that wanted to have on-site hardware for genomic analysis but the last couple of years have seen a huge movement to cloud-based apps including genomic analysis. So Edico Genome moved its algorithms to Amazon’s AWS EC2 F1 Instance, which offers accelerated computing thanks to Xilinx UltraScale+ VU9P FPGAs. (See “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available.”)


Edico Genome now offers cloud-based genomic processing and genomic storage in the cloud through Amazon’s AWS EC2 F1 Instance. Like its genomic analysis algorithms, the company’s cloud-based genomic storage takes advantage of the FPGA acceleration offered by Amazon’s AWS EC2 F1 Instance to achieve 2x to 4x compression. When you’re dealing with the human genome, you’re talking about storing 80Gbytes per genome so fast, 2x to 4x compression is a pretty important benefit.
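The storage arithmetic is simple enough to check in a few lines of Python. The sketch uses only the 80Gbytes-per-genome and 2x-to-4x figures quoted above; the helper function name is an illustrative choice, not part of any Edico Genome tool.

```python
def compressed_size_gbytes(raw_gbytes, ratio):
    """Storage needed after compression at the given ratio (2.0 means 2x)."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return raw_gbytes / ratio


RAW_GENOME_GBYTES = 80  # per-genome figure quoted in the post

for ratio in (2, 4):
    size = compressed_size_gbytes(RAW_GENOME_GBYTES, ratio)
    print(f"{ratio}x compression: {size:.0f} Gbytes per genome")
```

At these ratios, each stored genome shrinks from 80Gbytes to between 40Gbytes and 20Gbytes.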


This is all explained by Edico Genome’s VP of Engineering Rami Mehio in an information-packed 3-minute video:






Embedded C Coding Standard Book.jpg 

Embedded Systems Design magazine’s former editor-in-chief Michael Barr published the “Embedded C Coding Standard” a decade ago and now he’d like for you to have a free PDF copy. Developing coding standards is not nearly as much fun as actually developing code, so getting a big head start with a standard developed by one of the world’s foremost embedded software experts is a huge advantage. Getting it for free? That’s even huger.


Oh, and that link above… it leads to an online HTML version of the Embedded C Coding Standard as well.


These are great resources if you are developing embedded systems based on the Xilinx Zynq SoC or Zynq UltraScale+ MPSoC.


Tell Michael that Steve sent you.






A new open-source tool named GUINNESS makes it easy for you to develop binarized (2-valued) neural networks (BNNs) for Zynq SoCs and Zynq UltraScale+ MPSoCs using the SDSoC Development Environment. GUINNESS is a GUI-based tool that uses the Chainer deep-learning framework to train a binarized CNN. In a paper titled “On-Chip Memory Based Binarized Convolutional Deep Neural Network Applying Batch Normalization Free Technique on an FPGA,” presented at the recent 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, authors Haruyoshi Yonekawa and Hiroki Nakahara describe a system they developed to implement a binarized CNN for the VGG-16 benchmark on the Xilinx ZCU102 Eval Kit, which is based on a Zynq UltraScale+ ZU9EG MPSoC. Nakahara presented the GUINNESS tool again this week at FPL2017 in Ghent, Belgium.


According to the IEEE paper, the Zynq-based BNN is 136.8x faster and 44.7x more power efficient than the same CNN running on an ARM Cortex-A57 processor. Compared to the same CNN running on an Nvidia Maxwell GPU, the Zynq-based BNN is 4.9x faster and 3.8x more power efficient.


GUINNESS is now available on GitHub.




ZCU102 Board Photo.jpg 



Xilinx ZCU102 Zynq UltraScale+ MPSoC Eval Kit








Need to build a networking monster for financial services, low-latency trading, or cloud-based applications? The raw materials you need are already packed into Silicom Denmark’s SmartNIC fb4CGg3@VU PCIe card, which is based on a Xilinx Virtex UltraScale or Virtex UltraScale+ FPGA:



  • 1-to-16-lane PCIe Gen1/Gen2/Gen3
  • Optional 2x8 PCIe lanes on a secondary connector
  • Xilinx Virtex UltraScale+ (VU9P) or Virtex UltraScale (VU125 or VU080) FPGA (other FPGA sizes optional)
  • Four QSFP28 ports for 100G, 4x25G, or 4x10G optical modules or direct-attached copper cabling
  • 4Gbytes of DDR4-2400 SDRAM
  • SODIMM sockets for 4Gbytes of DDR4-2133 SDRAM




Silicom Denmark fb4CGg3 PCIe Card.jpg 


Silicom Denmark’s SmartNIC fb4CGg3@VU PCIe card




The SmartNIC fb4CGg3@VU PCIe card includes complete NIC functionality (TCP Offload Engine (TOE), UDP Offload Engine, and drivers).



Please contact Silicom Denmark directly for more information about the SmartNIC fb4CGg3@VU PCIe card.





The Xilinx Technology Showcase 2017 will highlight FPGA-acceleration as used in Amazon’s cloud-based AWS EC2 F1 Instance and for high-performance, embedded-vision designs—including vision/video, autonomous driving, Industrial IoT, medical, surveillance, and aerospace/defense applications. The event takes place on Friday, October 6 at the Xilinx Summit Retreat Center in Longmont, Colorado.


You’ll also have a chance to see the latest ways you can use the increasingly popular Python programming language to create Zynq-based designs. The Showcase is a prelude to the 30-hour Xilinx Hackathon starting immediately after the Showcase. (See “Registration is now open for the Colorado PYNQ Hackathon—strictly limited to 35 participants. Apply now!”)


The Xilinx Technology Showcase runs from 3:00 to 5:00 PM.


Click here for more details and for registration info.




Xilinx Longmont.jpg


Xilinx Colorado, Longmont Facility





For more information about the FPGA-accelerated Amazon AWS EC2 F1 Instance, see:









Amazon previews OpenCL Development Environment to FPGA-accelerated AWS EC2 F1 Instance

by Xilinx Employee ‎09-06-2017 10:40 AM - edited ‎09-06-2017 11:05 AM (2,146 Views)


Yesterday, Amazon announced a preview of an OpenCL development flow for the AWS EC2 F1 Instance, which is an FPGA-accelerated cloud-computing service based on Xilinx Virtex UltraScale+ VU9P FPGAs. According to Amazon, “…developers with little to no FPGA experience, will find a familiar development experience and now can use the cloud-scale availability of FPGAs to supercharge their applications.” In addition, wrote Amazon: “The FPGA Developer AMI now enables a graphical design canvas, enabling faster AFI development using a graphical flow, and leveraging pre-integrated verified IP blocks,” and "We have also upgraded the FPGA Developer AMI to Vivado 2017.1 SDx, improving the synthesis quality and runtime capabilities."


A picture is worth 1000 words:




Amazon AWS EC2 F1 Graphical Design.jpg 





For more information and to sign up for the preview, please visit Amazon’s preview page.



For more information about the Amazon EC2 F1 Instance based on Xilinx Virtex UltraScale+ FPGAs, see “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available” and “AWS does a deep-dive video on the Amazon EC2 F1 Instance, a cloud accelerator based on Xilinx Virtex UltraScale+ FPGAs.”




Curious about using Amazon’s AWS EC2 F1 Instance? Want a head start? Falcon Computing in Santa Clara, California has a 2-day seminar just for you titled “Accelerate Applications on AWS EC2 F1.” It’s being taught by Professor Jason Cong from the Computer Science Department at the U. of California in Los Angeles and it’s taking place on September 28-29 at Falcon’s HQ.


Here’s the agenda:



Falcon Computing AWS F1 Instance Seminar Agenda.jpg 



Register here.


Please contact Falcon Computing directly for more information about this Amazon AWS EC2 F1 Instance Seminar.



For more information about the Amazon EC2 F1 Instance based on Xilinx Virtex UltraScale+ FPGAs, see “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available” and “AWS does a deep-dive video on the Amazon EC2 F1 Instance, a cloud accelerator based on Xilinx Virtex UltraScale+ FPGAs.”



Xilinx has announced at HUAWEI CONNECT 2017 that Huawei’s new, accelerated cloud service and its FPGA Accelerated Cloud Server (FACS) are based on Xilinx Virtex UltraScale+ VU9P FPGAs. The Huawei FACS platform allows users to develop, deploy, and publish new FPGA-based services and applications on the Huawei Public Cloud with a 10-50x speed-up for compute-intensive cloud applications such as machine learning, data analytics, and video processing. Huawei has more than 15 years of experience in the development of FPGA systems for telecom and data center markets. "The Huawei FACS is a fully integrated hardware and software platform offering developer-to-deployment support with best-in-class industry tool chains and access to Huawei's significant FPGA engineering expertise," said Steve Langridge, Director, Central Hardware Institute, Huawei Canada Research Center.


The FPGA Accelerated Cloud Server is available on the Huawei Public Cloud today. To register for the public beta, please visit http://www.hwclouds.com/product/fcs.html. For more information on the Huawei Cloud, please visit www.huaweicloud.com.



For more information, see this page.




Yesterday, Premier Farnell announced that it has added the Xilinx All Programmable device product line including Zynq SoCs, Zynq UltraScale+ MPSoCs, and FPGAs to its line card. That means Xilinx All Programmable devices are available from Farnell element14 in Europe, Newark element14 in North America, and element14 in APAC. Premier Farnell is a business unit of Avnet, Inc.



Adam Taylor’s MicroZed Chronicles, Part 214: Addressing VDMA Issues

by Xilinx Employee ‎09-05-2017 12:04 PM - edited ‎09-06-2017 08:41 AM (3,453 Views)


By Adam Taylor



Video Direct Memory Access (VDMA) is one of the key IP blocks used within many image-processing applications. It allows frames to be moved between the Zynq SoC’s and Zynq UltraScale+ MPSoC’s PS and PL with ease. Once the frame is within the PS domain, we have several processing options available. We can implement high-level image processing algorithms using open-source libraries such as OpenCV and acceleration stacks such as the Xilinx reVISION stack if we wish to process images at the edge. Alternatively, we can transmit frames over Gigabit Ethernet, USB3, PCIe, etc. for offline storage or later analysis.


It can be infuriating when our VDMA-based image-processing chain does not work as intended. Therefore, we are going to look at a simple VDMA example and the steps we can take to ensure that it works as desired.


The simple VDMA example shown below contains the basic elements needed to provide VDMA output to a display. The processing chain starts with a VDMA read that obtains the current frame from DDR memory. To correctly size the data stream width, we use an AXIS subset converter to convert 32-bit data read from DDR memory into a 24-bit format that represents each RGB pixel with 8 bits per color channel. Finally, we output the image with an AXIS-to-video output block that converts the AXIS stream to parallel video with video data and sync signals, using timing provided by the Video Timing Controller (VTC). We can use this parallel video output to drive a VGA, HDMI, or other video display output with an appropriate PHY.
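The 32-bit to 24-bit conversion is easy to model in software. The Python sketch below assumes the unused byte sits in the top 8 bits of each 32-bit word and that the remaining three bytes hold one 8-bit channel each; the real byte layout and channel order depend on how the subset converter's data remapping is configured, so treat both as assumptions. The function names are hypothetical.

```python
def xrgb32_to_rgb24(word):
    """Drop the (assumed unused) top byte of a 32-bit word, leaving 24 pixel bits."""
    return word & 0x00FFFFFF


def split_channels(pixel24):
    """Split a 24-bit pixel into three 8-bit values: upper, middle, lower byte."""
    return (pixel24 >> 16) & 0xFF, (pixel24 >> 8) & 0xFF, pixel24 & 0xFF


# Two example words as they might sit in the DDR frame buffer.
frame_words = [0xFF112233, 0x00AABBCC]
pixels = [xrgb32_to_rgb24(w) for w in frame_words]

for p in pixels:
    print(hex(p), split_channels(p))
```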


This example outlines a read case from the PS to the PL and corresponding output. This is a more complicated case than performing a frame capture and VDMA write because we need to synchronize video timing to generate an output.







Simple VDMA-Based Image-Processing Pipeline




So what steps can we take if the VDMA-based image pipeline does not function as intended? To correct the issue:


  1. Check resets and clocks as we would when debugging any application. Ensure that the reset polarity is correct for each module, as there will be mixed polarities. Ensure that the pixel clock is correct for the required video timing and that it is supplied to both the VTC and the AXIS-to-Video-Out blocks. The clock for the AXIS network must also be fast enough to support the image throughput.
  2. Check that the clock enables on both the VTC and AXIS-to-Video-Out blocks are tied to the correct level to enable the clocks.
  3. Check that the VTC is correctly configured, especially if you are using the AXI interface to define the configuration through the application software. When configuring the VTC using AXI, it is important to make sure we have set the source registers to the VTC generator, enabled register updates, and defined the timing parameters required.
  4. Check the connections between the VTC and AXIS-to-Video-Out blocks. Ensure that the horizontal and vertical blanking signals are also connected along with the horizontal and vertical syncs.
  5. Check the AXIS-to-Video-Out timing mode. If we are using VDMA, the timing mode of the AXIS-to-Video-Out block should be set to master. This enables the AXIS-to-Video-Out block to assert back pressure on the AXIS data stream to halt the frame buffer output. This mechanism permits the AXIS-to-Video-Out block to manage the flow of pixels by enabling synchronization and lock. You may also want to increase the size of the internal buffer from the default.
  6. Check that the AXIS-to-Video-Out VTC_ce signal is not connected to the VTC gen clock enable as is the case when configured for slave operation. This will prevent the AXIS-to-Video-Out block from being able to lock to the AXIS video stream.
  7. Insert ILAs. Inserting these within the design allows us to observe the detailed workings of the AXI buses. When commissioning a new image processing pipeline, I insert ILA blocks on the VTC output and the VDMA MM-to-AXIS port so that I can observe the generated timing signals and VDMA output stream. When observing the AXI Stream, the tuser signal identifies the start of frame and the tlast signal represents the end of line. You may also want to observe the AXIS-to-Video-Out 32-bit status output, which provides indication of the locked status along with additional debug information.
  8. Ensure that HSize and Stride are set correctly. These are defined by the application software and configure the VDMA with frame-store information. HSize represents the horizontal size of the image and Stride represents the distance in memory between the image lines. Both HSize and Stride are defined in bytes. As such, when working with U32 or U16 types, take care to correctly set these values to reflect the number of bytes used.
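The last check is the one most easily gotten wrong in software, so here is a small Python sketch of the arithmetic. It is not the VDMA driver API: the function name, the dictionary keys, and the optional alignment parameter are all hypothetical and exist only to illustrate the byte-based calculation.

```python
def vdma_frame_params(width_px, height_lines, bytes_per_pixel, stride_align=1):
    """Compute byte-based frame-store values for a frame buffer.

    stride_align is an assumed, optional padding of each line to a multiple
    of that many bytes; with the default of 1 the stride equals HSize.
    """
    hsize = width_px * bytes_per_pixel                       # bytes per line
    stride = -(-hsize // stride_align) * stride_align        # round up to alignment
    return {
        "hsize": hsize,
        "stride": stride,
        "frame_bytes": stride * height_lines,
    }


# 1280x720 frame stored as one 32-bit word (4 bytes) per pixel.
params = vdma_frame_params(1280, 720, 4)
print(params)

# A common mistake: passing the width in pixels where bytes are expected.
wrong_hsize = 1280
print("bytes missing per line:", params["hsize"] - wrong_hsize)
```

If HSize is programmed with the pixel count instead of the byte count, only a quarter of each 32-bit line is transferred in this example.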



Hopefully, by the time you have checked these points, the issue with your VDMA-based image processing pipeline will have been identified and you can start developing the higher-level image processing algorithms needed for the application.
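When reviewing an ILA capture of the AXI Stream, the tuser and tlast pattern described in the checklist can be verified mechanically. The Python sketch below assumes the capture has been exported as one (tuser, tlast) pair per accepted transfer and that each transfer carries exactly one pixel; both are assumptions, and `check_frame` and `make_frame` are hypothetical helpers rather than part of any Xilinx tool.

```python
def check_frame(beats, width, height):
    """Return a list of problems found in one frame of (tuser, tlast) pairs."""
    problems = []
    if len(beats) != width * height:
        problems.append(f"expected {width * height} beats, got {len(beats)}")
    for i, (tuser, tlast) in enumerate(beats):
        expect_sof = (i == 0)                    # tuser marks start of frame
        expect_eol = (i % width == width - 1)    # tlast marks end of each line
        if bool(tuser) != expect_sof:
            problems.append(f"beat {i}: tuser={tuser}, expected {int(expect_sof)}")
        if bool(tlast) != expect_eol:
            problems.append(f"beat {i}: tlast={tlast}, expected {int(expect_eol)}")
    return problems


def make_frame(width, height):
    """Build a well-formed frame for testing the checker."""
    return [(int(i == 0), int(i % width == width - 1)) for i in range(width * height)]


good = make_frame(4, 3)
bad = list(good)
bad[3] = (0, 0)   # drop the end-of-line marker on the first line

print(check_frame(good, 4, 3))
print(check_frame(bad, 4, 3))
```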



Code is available on Github as always.



If you want E book or hardback versions of previous MicroZed chronicle blogs, you can get them below.




  • First Year E Book here
  • First Year Hardback here.



MicroZed Chronicles hardcopy.jpg 




  • Second Year E Book here
  • Second Year Hardback here



MicroZed Chronicles Second Year.jpg 


by Anthony Boorsma, DornerWorks



Why aren’t you getting all of the performance that you expect after moving a task or tasks from the Zynq PS (processing system) to its PL (programmable logic)? If you used SDSoC to develop your embedded design, there’s help available. Here’s some advice from DornerWorks, a Premier Xilinx Alliance Program member. This blog is adapted from a recent post on the DornerWorks Web site titled “Fine Tune Your Heterogeneous Embedded System with Emulation Tools.”




Thanks to Xilinx’s SDSoC Development Environment, offloading portions of your software algorithm to a Zynq SoC’s or Zynq UltraScale+ MPSoC’s PL (programmable logic) to meet system performance requirements is straightforward. Once you have familiarized yourself with SDSoC’s data-transfer options for moving data back and forth between the PS and PL, you can select the appropriate data mover that represents the best choice for your design. SDSoC’s software estimation tool then shows you the expected performance results.


Yet when performing the ultimate test of execution—on real silicon—the performance of your system sometimes fails to match expectations and you need to discover the cause… and the cure. Because you’ve offloaded software tasks to the PL, your existing software debugging/analysis methods do not fully apply because not all of the processing occurs in the PS.


You need to pinpoint the cause of the unexpected performance gap. Perhaps you made a sub-optimal choice of data mover. Perhaps the offloaded code was not a good candidate for offloading to the PL. You cannot cure the performance problem without knowing its cause.


Just how do you investigate and debug system performance on a Zynq-based heterogeneous embedded system with part of the code running in the PS and part in the PL?


If you are new to the world of debugging PL data processing, you may not be familiar with the options you have for viewing PL data flow. Fortunately, if you used SDSoC to accelerate software tasks by offloading them to the PL, there is an easy solution. SDSoC has an emulation capability for viewing the simulated operation of your PL hardware that uses the context of your overall system.


This emulation capability allows you to identify any timing issues with the data flow into or out of the auto-generated IP blocks that accelerate your offloaded software. The same capability can also show you if there is an unexpected slowdown in the offloaded software acceleration itself.


Using this tool can help you find performance bottlenecks. You can investigate these potential bottlenecks by watching your data flow through the hardware via the displayed emulation signal waveforms. Similarly, you can investigate the interface points by watching the data signals transfer data between the PS and the PL. This information provides key insights that help you find and fix your performance issues.


To demonstrate how you can debug/emulate a hardware-accelerated function, we will focus on a single IP block for simplicity: the matrix multiplier (Mmult) IP block from the Xilinx Multiply and Add (MMADD) example, shown in Figure 1.






Figure 1: Multiplier IP block with Port A expanded to show its signals




We will look at the waveforms for the signals to and from this Mmult IP block in the emulation. Specifically, we will view the A_PORTA signals as shown in the figure above. These signals represent the data input for matrix A, which corresponds to the software input parameter A of the matrix multiplier function.
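When comparing emulation waveforms against expectations, it helps to have known-good values to check against. The Python sketch below is a plain software model of a multiply-and-add computation (A times B, plus C) on small square matrices. It is not the Xilinx example source code, and the function names and the 2x2 test data are chosen only for illustration.

```python
def mmult(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def madd(a, b):
    """Add two matrices element by element."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def mmult_add_golden(a, b, c):
    """Software reference for the multiply-and-add computation: A*B + C."""
    return madd(mmult(a, b), c)


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]

golden = mmult_add_golden(A, B, C)
print(golden)
```

The values of A fed into such a model are the ones you would expect to see arriving on the matrix A input port, in whatever order the data mover delivers them.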


To get started with the emulation, enable generation of the “emulation model” configuration for the build in the SDSoC project settings, as shown in Figure 2.







Figure 2: The mmult Project Settings needed to enable emulation




Next, rebuild your project as normal. After building your project with emulation model support enabled in the configuration, run the emulator by selecting “Start/Stop Emulation” under the “Xilinx Tools” menu option. When a window opens, select “Start” to start the emulator. SDSoC will then automatically launch an instance of Xilinx Vivado, which triggers the auto-generated PL project that SDSoC created for you as a subproject within your SDSoC project.


We specifically want to view the A_PORTA signals of the Mmult IP block. These signals must be added to the Wave Window to be viewed during a simulation. The available Mmult signals can be viewed in the Objects pane by selecting the mmult_1 block in the Scopes pane. To add the A_PORTA signals to the Wave Window, select all of the “A_*” signals in the Objects pane, right click, and select “Add to Wave Window” as shown in Figure 3.







Figure 3: Behavioral Simulation – mmult_1 signals highlighted




Now you can run the emulation and view the signal states in the waveform viewer. Start the emulator by clicking “Run All” from the “Run” drop-down menu as shown in Figure 4.







Figure 4: Start emulation of the PL




Back in SDSoC’s toolchain environment, you can now run a debugging session that connects to this emulation session as it would to your software running on the target. From the “Run” menu option, select “Debug As -> 1 Launch on Emulator (SDSoC Debugger)” to start the debug session as shown in Figure 5.







Figure 5: Connect Debug Session to run the PL emulation




Now you can step or run through your application test code and view the signals of interest in the emulator. Shown below in Figure 6 are the A_PORTA signals we highlighted earlier and their signal values at the end of the PL logic operation using the Mmult and Add example test code.





Figure 6: Emulated mmult_1 signal waveforms




These signals tell us a lot about the performance of the offloaded code now running in the PL and we used familiar emulation tools to obtain this troubleshooting information. This powerful debugging method can help illuminate unexpected behavior in your hardware-accelerated C algorithm by allowing you to peer into the black box of PL processing, thus revealing data-flow behavior that could use some fine-tuning.



Fidus Systems based the design of its Sidewinder-100 PCIe NVMe Storage Controller on a Xilinx Zynq UltraScale+ MPSoC ZU19EG for many reasons, but among the most important are PCIe Gen3/4 capability; high-speed, bulletproof SerDes for the board’s two 100Gbps-capable QSFP optical network cages; the vast I/O flexibility inherent in Xilinx All Programmable devices to control DDR SDRAM, to drive the two SFF-8643 Mini SAS connectors for off-board SSDs, etc.; the immense processing capabilities that come from the six on-chip ARM processor cores (four 64-bit ARM Cortex-A53 MPCore processors and two 32-bit ARM Cortex-R5 MPCore processors); and the big chunk of on-chip programmable logic based on the Xilinx UltraScale architecture. The same attributes that made the Zynq UltraScale+ MPSoC a good foundation for a high-performance NVMe controller like the Sidewinder-100 also make the board an excellent development target for a truly wide variety of hardware designs: just about anything you might imagine.


The Sidewinder-100’s significant performance advantage over SCSI and SAS storage arrays comes from its use of NVMe Over Fabrics technology to reduce storage transaction latencies. In addition, there are two on-board M.2 connectors available for docking NVMe SSD cards. The board also accepts two DDR4 SO-DIMMs that are independently connected to the Zynq UltraScale+ MPSoC’s PS (processing system) and PL (programmable logic). That independent connection allows the PS-connected DDR4 SO-DIMM to operate at 1866Mtransfers/sec and the PL-connected DDR4 SO-DIMM to operate at 2133Mtransfers/sec.
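To put those transfer rates in perspective, the short Python sketch below converts them to theoretical peak bandwidth. It assumes a 64-bit data bus on each SO-DIMM, which is typical for non-ECC modules but is not stated above; sustained bandwidth in a real design will be lower. The function name is hypothetical.

```python
def peak_bandwidth_mb_s(mega_transfers_per_s, bus_width_bits=64):
    """Theoretical peak in MB/s: transfers per second times bytes per transfer."""
    return mega_transfers_per_s * bus_width_bits // 8


ps_dimm = peak_bandwidth_mb_s(1866)   # PS-connected SO-DIMM
pl_dimm = peak_bandwidth_mb_s(2133)   # PL-connected SO-DIMM

print(f"PS DDR4 peak: {ps_dimm} MB/s")
print(f"PL DDR4 peak: {pl_dimm} MB/s")
```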


All of this makes for a great PCIe Gen4 development platform, as you can see from this photo:



Fidus Sidewinder-100 NVMe Storage Controller.jpg


Fidus Sidewinder-100 PCIe NVMe Storage Controller



Because Fidus is a design house, it had general-purpose uses in mind for the Sidewinder-100 PCIe NVMe Storage Controller from the start. The board makes an excellent, ready-to-go development platform for any sort of high-performance PCIe Gen 3 or Gen4 development and Fidus would be happy to help you develop something else using this platform.


Oh, and one more thing. Tucked onto the bottom of the Sidewinder-100 PCIe NVMe Storage Controller Web page is this interesting PCIe Power and Loopback Adapter:



Fidus PCIe Power and Loopback Adapter.jpg 


Fidus PCIe Power and Loopback Adapter



It’s just the thing you’ll need to bring up a PCIe card on the bench without a motherboard. After all, PCIe Gen4 motherboards are scarce at the moment and this adapter looks like it should cost a lot less than a motherboard with a big, power-hungry processor on board. Just look at that tiny dc power connector to operate the adapter!



Please contact Fidus Systems directly for more information about the Sidewinder-100 PCIe NVMe Storage Controller and the PCIe Power and Loopback Adapter.





A commitment to “Any Media over Any Network” when video has rapidly proliferated across all markets requires another commitment: any-to-any video transcoding. That’s because the video you want is often not coded in the format you want (compression standard, bit rate, frame rate, resolution, color depth, etc.). As a result, transcoding has become a big deal and supporting the myriad video formats already available, and the new ones to come, is a big challenge.


Would you like some help? Wish granted.


Xilinx’s Pro AV & Broadcast Video Systems Architect Alex Luccisano is presenting two free, 1-hour Webinars on September 26 that cover video transcoding and how you can use Xilinx Zynq UltraScale+ EV MPSoCs for real-time, multi-stream video transcoding in your next design.



Click here for the 7:00 am (PST), 14:00 (GMT) Webinar on September 26.


Click here for the 10:00 am (PST), 17:00 (GMT) Webinar on September 26.



Avnet publishes article that serves as a Buyer’s Guide for its Zynq-based Dev Boards and SOMs

by Xilinx Employee ‎08-29-2017 02:04 PM - edited ‎08-30-2017 05:25 AM (4,462 Views)


Avnet just published an article titled “Zynq SoMs Decrease Customer Development Times and Costs” that provides a brief-but-good buyer’s guide for several of its Zynq-based dev boards and SOMs including the MicroZed (based on the Xilinx Zynq Z-7010 or Z-7020 SoCs), PicoZed (based on the Zynq Z-7010, 7015, 7020, or Z-7030 SoCs), and the Mini-Module Plus (based on the Xilinx Zynq Z-7045 or Z-7100 SoCs). These three boards give you pre-integrated access to nearly the entire broad line of Zynq Z-7000 dual-ARM-core SoCs.



Avnet PicoZed SOM.jpg 


Avnet PicoZed SOM



The article also lists several important points to consider when contemplating a make-or-buy decision for a Zynq-based board including:



  • “Designing the high-speed DDR3 interface for Zynq requires a deep understanding of transmission line theory. The PCB layout calls for matching trace lengths, controlling impedances and using proper termination. If designed improperly, several PCB spins and months of development times can be wasted.”



  • “Avnet jumpstarts Zynq-based software, firmware and HDL development by providing the necessary tools to get started. MicroZed, PicoZed and Avnet’s entire portfolio of Zynq-based SoM have board support packages (BSPs) available.”



Whichever way you choose to go, the Zynq SoC (and the more powerful Zynq UltraScale+ MPSoC), give you a unique blend of software-based processor horsepower and programmable-logic that delivers hardware-level performance when and where you need it in your design.




A recent Sensorsmag.com article written by Nick Ni and Adam Taylor titled “Accelerating Sensor Fusion Embedded Vision Applications” discusses some of the sensor-fusion principles behind, among other things, 3D stereo vision as used in the Carnegie Robotics Multisense stereo cameras discussed in today’s earlier blog titled “Carnegie Robotics’ FPGA-based GigE 3D cameras help robots sweep mines from a battlefield, tend corn, and scrub floors.” We’re starting to put a large number of sensors into systems, and turning the deluge of raw sensor data into usable information is a tough computational job.


Describing some of that job’s particulars consumes the first half of Ni’s and Taylor’s article. The second half of the article then discusses some implementation strategies based on the new Xilinx reVISION stack, which is built on top of Xilinx Zynq SoCs and Zynq UltraScale+ MPSoCs.


If there are a lot of sensors in your next design, particularly image sensors, be sure to take a look at this article.



Even though I knew this was coming, it’s still hard to write this blog post without grinning. Last week, acknowledged FPGA-based processor wizard Jan Gray of Gray Research LLC presented a Hot Chips poster titled “GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Framework: A 1680-core, 26 MB SRAM Parallel Processor Overlay on Xilinx UltraScale+ VU9P.” Allow me to unpack that title and the details of the GRVI Phalanx for you.


Let’s start with 1680 “austere” processing elements in the GRVI Phalanx, which are based on the 32-bit RISC-V processor architecture. (Is that parallel enough for you?) The GRVI processing element design follows “Jan’s Razor”: In a chip multiprocessor, cut nonessential resources from each CPU, to maximize CPUs per die. Thus, a GRVI processing element is a 3-stage, user-mode RV32I core minus a few nonessential bits and pieces. It looks like this:



GRVI Processing Element.jpg


A GRVI Processing Element




Each GRVI processing element requires ~320 LUTs and runs at 375MHz. Typical of a Jan Gray design, the GRVI processing element is hand-mapped and –floorplanned into the UltraScale+ architecture and then stamped 1680 times into the Virtex UltraScale+ VU9P FPGA on a VCU118 Eval Kit.


Now, dropping a bunch of processor cores onto a large device like the Virtex UltraScale+ VU9P FPGA is interesting but less than useful unless you give all of those cores some memory to operate out of, some way for the processors to communicate with each other and with the world beyond the FPGA package, and some way to program the overall machine.


Therefore, the GRVI processing elements are packaged in clusters containing as many as eight processing elements with 32 to 128Kbytes of RAM, and additional accelerator(s). Each cluster is tied to the other on-chip clusters and to the external-world I/O through a HOPLITE router to a NOC (network on chip) with 100Gbps links between nodes. The HOPLITE router is an FPGA-optimized, directional router designed for a 2D torus network.
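The numbers in the poster title and in the cluster description line up, as the Python sketch below shows. The cluster count and memory total follow directly from the figures quoted here. The hop-count function is an illustration only: it assumes that "directional" means each torus dimension carries traffic one way, it assumes a 30x7 cluster grid purely because 30 times 7 equals 210, and it ignores contention, so it gives an uncontended lower bound rather than a model of the actual HOPLITE router.

```python
CORES = 1680
CORES_PER_CLUSTER = 8      # "as many as eight processing elements"
CLUSTER_RAM_KB = 128       # upper end of the 32-to-128Kbyte range

clusters = CORES // CORES_PER_CLUSTER
total_ram_mb = clusters * CLUSTER_RAM_KB / 1024


def torus_hops(src, dst, cols, rows):
    """Minimum hops between (x, y) nodes on a one-way 2D torus (assumed model)."""
    (sx, sy), (dx, dy) = src, dst
    return (dx - sx) % cols + (dy - sy) % rows


print(clusters, "clusters,", total_ram_mb, "MB of cluster RAM")
print(torus_hops((0, 0), (1, 0), 30, 7))   # neighbor in the direction of travel
print(torus_hops((1, 0), (0, 0), 30, 7))   # neighbor "behind": wraps the whole ring
```

With eight cores and 128Kbytes per cluster, the total works out to 26.25 MB, consistent with the 26 MB quoted in the poster's title.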


A GRVI Phalanx cluster looks like this:



GRVI Phalanx Cluster.jpg


A GRVI Phalanx Cluster




Currently, Gray’s paper says there is a multithreaded C++ compiler with message-passing runtime layered on top of a RISC-V RV32IMA GCC compiler, with future plans to support OpenCL, P4, and other programming tools.


In development: an 80-core educational version of the GRVI Phalanx instantiated in the programmable logic of a “low-end” Zynq Z-7020 SoC on the Digilent PYNQ-Z1 board.


Now if all that were not enough (and you will find a lot more packed into Gray’s poster), there’s a Xilinx Virtex UltraScale+ VU9P available to you. It’s as near as your keyboard and Web browser on the Amazon AWS EC2 F1.2XL and F1.16XL instances and Jan Gray is working on putting the GRVI Phalanx on that platform as well.


Incredibly, it’s all in that Hot Chips poster.


Engineering Advisory Explicit Content.jpg


The following blog post contains explicitly competitive information. If you do not like to read such things or if you live in a country where you’re not supposed to read such things, then stop reading.




In this blog post, I will discuss device performance in a competitive context. Now, whenever you read about “the competition” on a vendor’s Web site, you need to take the information provided with a big grain of salt. It’s hard to believe anything one vendor says about the competition, which is why I so rarely attempt to do so in the Xcell Daily blog.


This post is an exception.


With that caveat stated, let’s rush in where angels fear to tread.


There’s a new 18-page White Paper on the Xilinx.com Web site titled “Measuring Device Performance and Utilization: A Competitive Overview” and written by Frederic Rivoallon, the Vivado HLS and RTL Synthesis Product Manager here at Xilinx. Rivoallon’s White Paper “compares actual Kintex UltraScale FPGA results to Intel’s (formerly Altera) Arria 10, based on publicly available OpenCores designs.” (OpenCores.org declares itself to be “the world’s largest site/community for development of hardware IP cores as open source.”) The data for this White Paper was generated in June, 2017 and is based on the latest versions of the respective design tools available at that time (Vivado Design Suite 2017.1 and Quartus Prime v16.1).


Cutting to the chase, here’s the White Paper’s conclusion, conveniently summarized in the same White Paper’s introduction:


“Verifiable results based on OpenCores designs demonstrate that the Xilinx UltraScale architecture delivers a two-speed-grade performance boost over competing devices while implementing 20% more design content. This boost equates to a generation leap over the closest competitive offering.”


I place in evidence Exhibit 1 (actually Figure 1 in the White Paper), which compares Kintex UltraScale FPGA device utilization versus Arria 10 device utilization and shows that it’s much harder to use all of the Arria 10’s device capacity than it is for the Kintex UltraScale device:




wp496 Figure 1.jpg 




It’s quite reasonable for you to ask “why is this so?” at this point. In fact, you certainly should. I’m told and the White Paper explains that there’s a fundamental architectural reason for this significant utilization disparity. You see it in the architectural difference between a Xilinx UltraScale CLB and an Arria ALM (adaptive logic module). Here’s the picture (which is Figure 2 in the White Paper):




wp496 Figure 2.jpg 




You can see that the two 6-input LUTs in the Arria 10 ALM share four inputs while the two 6-input LUTs in the UltraScale device have independent inputs. (Xilinx UltraScale+ devices employ the same LUT configuration.) There’s no sleight of hand here. Given enough routing resources (which the Xilinx UltraScale architecture has) and a sufficiently clever place-and-route tool (which Vivado has), you will be able to use both 6-input LUTs more often if they have independent inputs than if they have several shared inputs. Hence the greater maximum usable resource capacity for UltraScale and UltraScale+ devices.
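A toy pin-count model makes the argument concrete. The Python sketch below is a deliberate simplification based only on the description above (two 6-input LUTs that share four inputs, versus two 6-input LUTs with independent inputs). It is not either vendor's actual packing rule, and the function names are hypothetical.

```python
def fits_shared_pair(inputs_a, inputs_b, lut_size=6, shared=4):
    """Can two functions share a LUT pair whose LUTs have `shared` common inputs?

    Simplified model: the pair exposes 2*lut_size - shared distinct input pins,
    so the two functions together may use at most that many distinct signals.
    """
    a, b = set(inputs_a), set(inputs_b)
    if len(a) > lut_size or len(b) > lut_size:
        return False
    return len(a | b) <= 2 * lut_size - shared


def fits_independent_pair(inputs_a, inputs_b, lut_size=6):
    """With independent inputs, each function only has to fit its own LUT."""
    return len(set(inputs_a)) <= lut_size and len(set(inputs_b)) <= lut_size


# Two unrelated 6-input functions: 12 distinct signals in total.
f = ["a0", "a1", "a2", "a3", "a4", "a5"]
g = ["b0", "b1", "b2", "b3", "b4", "b5"]

# Two 6-input functions that happen to have four signals in common.
h = ["x0", "x1", "x2", "x3", "a0", "a1"]
k = ["x0", "x1", "x2", "x3", "b0", "b1"]

print(fits_shared_pair(f, g), fits_independent_pair(f, g))
print(fits_shared_pair(h, k), fits_independent_pair(h, k))
```

In this model, unrelated 6-input functions can only be paired when the inputs are independent, which is the effect the White Paper attributes the utilization difference to.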


And now for Exhibit 2. Here’s the associated performance graph showing FMAX for the various OpenCores IP cores (Figure 3 in the White Paper):




wp496 Figure 3.jpg 




As you might expect from a Xilinx White Paper, the UltraScale device performs better after placement and routing. There are many more such Exhibits (charts and graphs) for you to peruse in the White Paper and Xilinx does not always win.


So what?


Well, the purpose of this blog post is twofold. First, I wanted you to be aware of this White Paper. If you’ve read this far, that goal has been achieved. Second, I don’t want you to take my word for it. I am reporting what’s stated in the White Paper but you should know that this White Paper was created in response to a similar White Paper published a few months back by “the competition.” No surprise, the competition’s White Paper came to different conclusions.


So who is right?


As a former Editor-in-Chief of both EDN Magazine and Microprocessor Report, I am well aware of benchmarks. In fact, EEMBC, the industry alliance that developed industry-standard benchmarks for embedded systems, was based on a hands-on project conducted by former EDN editor Markus Levy in 1996 while I was EDN’s Editor-in-Chief. Markus founded EEMBC a year later. I devoted a portion of Chapter 3 in my book “Designing SoCs with Configured Cores” to microprocessor benchmarking and I wrote an entire chapter (Chapter 10) about the history of microprocessor benchmarking for the textbook titled “EDA for IC System Design, Verification, and Testing,” published in 2006. That chapter also discussed some of the many ways to achieve the results you desire from benchmarks. FPGA benchmarks are in a similar state of affairs, going back at least to the 1990s and the famous/infamous PREP benchmark suite.


Here’s what Alexander Carlton at HP in Cupertino, California wrote way back in 1994 in his article on the SPEC Web site titled “Lies, **bleep** Lies, and Benchmarks”:


“It has been said that there are three classes of untruths, and these can be classified (in order from bad to worse) as: Lies, **bleep** Lies, and Benchmarks. Actually, this view is a corollary to the observation that ‘Figures don't lie, but liars can figure...’ Regardless of the derivation of this opinion, criticism of the state of performance marketing has become common in the computer industry press.”



[Editorial note: The blogging tool has modified the article's title to meet its Victorian sense of propriety.]



To my knowledge, no shenanigans were used to achieve the above FPGA benchmark results (I did ask) but I nevertheless caution you to be careful when interpreting the numbers. Here’s how I’d view these White Paper benchmark results:


Your mileage may vary. (Even the US EPA says so.) The only benchmark truly indicative of the device utilization and performance you’ll get for your design is… your design. Benchmarks are merely surrogates for your design.


So go ahead. Download and read the new Xilinx “Measuring Device Performance and Utilization: A Competitive Overview” White Paper, get educated, and then start asking questions.



Hardent, a Xilinx Authorized Training Partner, has announced a 3-day embedded design class based on the Xilinx Zynq UltraScale+ MPSoC and you can attend either in person at one of several North American locations or live over the Internet. Here’s a course outline:


  • Zynq UltraScale+ MPSoC Architecture Overview
  • Zynq MPSoC Processor System (PS)
      • The Application Processing Unit (APU)
      • The Real-Time Processing Unit (RPU)
      • The Platform Management Unit (PMU)
      • The Quick Emulator (QEMU)
  • System-Level Features
      • Boot and Configuration
      • Coherency
      • AXI Interfaces between the PS and PL (Programmable Logic)
      • Power Management
      • Clocks and Resets
      • DDR and QoS
  • Security and Safety
      • System Protection
      • Security and Software
      • ARM TrustZone Technology
  • Linux and the MPSoC {Lectures, Labs}
      • Symmetric Multi-Processor Linux
      • Yocto
      • PetaLinux
  • Virtualization
      • HW-SW Virtualization
      • Introduction to the Xen Hypervisor
      • OpenAMP
  • The Software Ecosystem
      • Software Ecosystem Support
      • FreeRTOS
      • Software Stack



There are eleven scheduled classes, and the first one starts today.


For more information and to register, click here.



Now that Amazon has made the FPGA-accelerated Amazon EC2 F1 compute instance generally available to all AWS customers (see “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available”), just about anyone can get access to the latest Xilinx All Programmable UltraScale+ devices from anywhere, as long as they have an Internet connection and a Web browser. Xilinx has just published a new video demonstrating the use of its Vivado IP Integrator, a graphical design tool, with the AWS EC2 F1 compute instance.


Why use Vivado IP Integrator? As the video says, there are five main reasons:


  • Simplified connectivity
  • Block automation
  • Connectivity automation
  • DRC (design rule checks)
  • Advanced hardware debug



Here’s the 5-minute video:








Baidu details FPGA-based Cloud acceleration with 256-core XPU today at Hot Chips in Cupertino, CA

by Xilinx Employee ‎08-22-2017 11:38 AM - edited ‎08-22-2017 11:40 AM (5,446 Views)


Last October, Xcell Daily covered an announcement by Baidu about its use of Xilinx Kintex UltraScale FPGAs to accelerate cloud-based applications. (See “Baidu Adopts Xilinx Kintex UltraScale FPGAs to Accelerate Machine Learning Applications in the Data Center.”) Today, Baidu discussed more architectural particulars of its FPGA-acceleration efforts at the Hot Chips conference in Cupertino, California, according to Nicole Hemsoth’s article appearing on the NextPlatform.com site (“An Early Look at Baidu’s Custom AI and Analytics Processor”).


Hemsoth writes:


“…Baidu has a new processor up its sleeve called the XPU… The architecture they designed is aimed at this diversity with an emphasis on compute-intensive, rule-based workloads while maximizing efficiency, performance and flexibility, says Baidu researcher, Jian Ouyang. He unveiled the XPU today at the Hot Chips conference along with co-presenters from FPGA maker, Xilinx…


“‘The FPGA is efficient and can be aimed at specific workloads but lacks programmability,’ Ouyang explains. ‘Traditional CPUs are good for general workloads, especially those that are rule-based and they are very flexible. GPUs aim at massive parallelism and have high performance. The XPU is aimed at diverse workloads that are compute-intensive and rule-based with high efficiency and performance with the flexibility of a CPU,’ Ouyang says. The part that is still lagging, as is always the case when FPGAs are involved, is the programmability aspect. As of now there is no compiler, but he says the team is working to develop one…


“‘To support matrix, convolutional, and other big and small kernels we need a massive math array with high bandwidth, low latency memory and with high bandwidth I/O,’ Ouyang explains. ‘The XPU’s DSP units in the FPGA provide parallelism, the off-chip DDR4 and HBM interface push on the data movement side and the on-chip SRAM provide the memory characteristics required.’”


According to Hemsoth’s article, “The XPU has 256 cores clustered with one shared memory for data synchronization… Somehow the all 256 cores are running at 600MHz.”
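
As a rough back-of-the-envelope illustration of the two figures quoted above, the Python sketch below multiplies the core count by the clock rate to get aggregate core-cycles per second. The function name is hypothetical, and the result is not an operations-per-second figure: the article excerpt does not say how much work each core completes per cycle, so treat this only as a way to put the two numbers side by side.

```python
def aggregate_core_cycles_per_second(cores: int, clock_hz: float) -> float:
    """Return the total number of core-cycles per second across all cores.

    This is NOT a throughput figure. The work done per cycle by each core
    is not stated in the quoted article, so no ops/s claim is made here.
    """
    if cores < 0 or clock_hz < 0:
        raise ValueError("cores and clock_hz must be non-negative")
    return cores * clock_hz


# Figures quoted from Hemsoth's article: 256 cores running at 600MHz.
xpu_cycles = aggregate_core_cycles_per_second(256, 600e6)
print(f"{xpu_cycles / 1e9:.1f} billion core-cycles per second")
```

Any real performance estimate would need a per-core, per-cycle work figure, which would have to come from Baidu's own presentation.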


For more details, see Hemsoth’s article on the NextPlatform.com Web site.


About the Author
  • Be sure to join the Xilinx LinkedIn group to get an update for every new Xcell Daily post!
  • Steve Leibson is the Director of Strategic Marketing and Business Planning at Xilinx. He started as a system design engineer at HP in the early days of desktop computing, then switched to EDA at Cadnetix, and subsequently became a technical editor for EDN Magazine. He's served as Editor in Chief of EDN Magazine, Embedded Developers Journal, and Microprocessor Report. He has extensive experience in computing, microprocessors, microcontrollers, embedded systems design, design IP, EDA, and programmable logic.