

Ann Stefora Mutschler has just published an article on the SemiconductorEngineering.com Web site titled “Mixing Interface Protocols” that describes some of the complexities of SoC design—all related to the proliferation of various on- and off-chip I/O protocols. However, the article can just as easily be read as a reason for using programmable-logic devices such as Xilinx Zynq SoCs, Zynq UltraScale+ MPSoCs, and FPGAs in your system designs.


For example, here’s Mutschler’s lead sentence:


“Continuous and pervasive connectivity requires devices to support multiple interface protocols, but that is creating problems at multiple levels because each protocol is based on a different set of assumptions.”


This sentence nicely sums up the last two decades of interface design philosophy for programmable-logic devices. Early on, it became clear that a lot of logic translation was needed to connect early FPGAs to the rest of a system. When Xilinx developed I/O pins with programmable logic levels, it wiped out a big chunk of the market for level-translator chips. When MGTs (multi-gigabit serial transceivers) started to become popular for moving large amounts of data from one subsystem to another, Xilinx moved those onto its devices as well.


So if you’d like a brief glimpse into the chaotic I/O scene that’s creating immense headaches for SoC designers, read through Ann Stefora Mutschler’s new article. If you’d like to sidestep those headaches, just remember that Xilinx’s engineering team has already suffered them for you.




MathWorks has been advocating model-based design using its MATLAB and Simulink development tools for some time because the design technique allows you to develop more complex software with better quality in less time. (See the MathWorks White Paper: “How Small Engineering Teams Adopt Model-Based Design.”) Model-based design employs a mathematical and visual approach to developing complex control and signal-processing systems through the use of system-level modeling throughout the development process—from initial design, through design analysis, simulation, automatic code generation, and verification. These models are executable specifications that consist of block diagrams, textual programs, and other graphical elements. Model-based design encourages rapid exploration of a broader design space than other design approaches because you can iterate your design more quickly, earlier in the design cycle. Further, because these models are executable, verification becomes an integral part of the development process at every step. Hopefully, this design approach results in fewer (or no) surprises at the end of the design cycle.


Xilinx supports model-based design using MATLAB and Simulink through the new Xilinx Model Composer, a design tool that integrates into the MATLAB and Simulink environments. The Xilinx Model Composer includes libraries with more than 80 high-level, performance-optimized, Xilinx-specific blocks including application-specific blocks for computer vision, image processing, and linear algebra. You can also import your own custom IP blocks written in C and C++, which are subsequently processed by Vivado HLS.


Here’s a block diagram that shows you the relationship among MathWorks’ MATLAB, Simulink, and Xilinx Model Composer:




Xilinx Model Composer.jpg 




Finally, here’s a 6-minute video explaining the benefits and use of Xilinx Model Composer:







Digilent’s end-of-the-year sale includes a few Xilinx-based products

by Xilinx Employee, 12-20-2017 11:43 AM (edited 12-20-2017 02:35 PM)


There are a few Xilinx-based dev boards and instruments on sale right now at Digilent. You’ll find them on the “End of the Year Sale” Web page:



  • Zynq-based ZedBoard: $396.00 (marked down from $495.00)


ZedBoard V2.jpg




  • Digital Discovery Logic Analyzer and Pattern Generator (based on a Spartan-6 FPGA): $169.99 (marked down from $199.99)




Digilent Digital Discovery Module.jpg




  • Analog Discovery 2 Maker Bundle (based on a Spartan-6 FPGA): $261.75 (marked down from $349.00)



Digilent Analog Discovery 2 v2.jpg




If you were considering one of these Digilent products, now’s probably the time to buy.







RHS Research’s PicoEVB FPGA dev board based on an Artix-7 A50T FPGA snaps into an M.2 2230 key A or E slot, which is common in newer laptops. The board measures 22x30mm, which is slightly larger than the Artix-7 FPGA and configuration EEPROM mounted on one side of the board. It has a built-in JTAG connection that works natively with Vivado.


Here’s a photo that shows you the board’s size in relation to a US 25-cent piece:




RHS Research PicoEVB.jpg 




Even though the board itself is small, you still get a lot of resources in the Artix-7 A50T FPGA including 52,160 logic cells, 120 DSP48 slices, and 2.7Mbits of BRAM.


Here’s a block diagram of the board:





RHS Research PicoEVB Block Diagram.jpg 



The PicoEVB is available on Crowd Supply. The project was funded at the end of October.


Fairwaves’ XTRX, a “truly” embedded SDR (software-defined radio) module now up as a Crowd Supply crowdfunding project, manages to pack an entire 2x2 MIMO SDR with an RF tuning range of 30MHz to 3.8GHz into a diminutive Mini PCIe format (30x51mm) by pairing Lime Microsystems’ LMS7002M 2nd-generation field-programmable RF Transceiver with a Xilinx Artix-7 35T FPGA. As of today, the project is 84% funded with 27 days left in the funding period and 317 pledges. The industry-standard Mini PCIe form factor allows you to embed the XTRX SDR module just about anywhere. According to Fairwaves, the XTRX is compatible with all of the popular SDR development tool suites.



Fairwaves XTRX SDR Module.jpg 



Fairwaves’ XTRX 2x2 MIMO SDR Mini PCIe Module




Here’s a block diagram of the XTRX SDR module:



Fairwaves XTRX SDR Module Block Diagram.jpg 


Fairwaves’ XTRX 2x2 MIMO SDR Module Block Diagram




Perhaps even more interesting, here’s a comparison chart that Fairwaves developed to point out the advantages of the XTRX SDR module:



Fairwaves XTRX SDR Module Comparison Chart.jpg 




Need something more complex than a 2x2 MIMO arrangement? There’s a PCIe Octopack carrier board that accepts as many as eight XTRX Mini PCIe modules (four on each side), creating a 16x16 MIMO SDR, and you can synchronize multiple Octopack boards to create massive-MIMO configurations.




Fairwaves XTRX SDR Module Octopack.jpg 


Fairwaves’ XTRX 2x2 MIMO SDR Module Octopack




More information about the XTRX is available on the Crowd Supply project page.



Looking for a quick explanation of the FPGA-accelerated AWS EC2 F1 instance? Here’s a 3-minute video

by Xilinx Employee, 12-19-2017 10:45 AM (edited 12-19-2017 10:49 AM)


The AWS EC2 F1 compute instance allows you to create custom hardware accelerators for your application using cloud-based server hardware that incorporates multiple Xilinx Virtex UltraScale+ VU9P FPGAs. Several companies now list applications for FPGA-accelerated AWS EC2 F1 instances in the AWS Marketplace in application categories including:



  • Video processing
  • Data analytics
  • Genomics
  • Machine learning



Here’s a 3-minute video overview recorded at the recent SC17 conference in Denver:







In an article published in EETimes today titled “Programmable Logic Holds the Key to Addressing Device Obsolescence,” Xilinx’s Giles Peckham argues that the use of programmable devices—such as the Zynq SoCs, Zynq UltraScale+ MPSoCs, and FPGAs offered by Xilinx—can help prevent product obsolescence in long-lived products designed for industrial, scientific, and military applications. And that assertion is certainly true. But in this blog, I want to highlight the response by a reader using the handle MWagner_MA who wrote:


“Given the pace of change in FPGAs, I don't know if an FPGA will be a panacea for chip obsolescence issues. However, when changes in system design occur for hooking up new peripherals to a design off board, FPGAs can extend the life of a product 5+ years assuming you can get board-compatible FPGAs. Comm channels are what come to mind. If you use the same electrical interface but have an updated protocol, programmable logic can be a solution. Another solution is that when devices on SPI or I2C busses go obsolete, FPGA code can get updated to accommodate, even changing protocol if necessary assuming the right pins are connected at the other chip (like an A/D).”



MWagner_MA’s response is nuanced and tempered by obvious design experience. However, I must differ with the suggestion that the pace of change in FPGAs means much within the context of product obsolescence. Certainly FPGAs go obsolete, but it takes a long, long time.


Case in point:


I received an email just today from Xilinx about this very topic. (Feel free to insert amusement here about Xilinx’s corporate blogger being on the company’s promotional email list.) The email is about Xilinx’s Spartan-6 FPGAs, which were first announced in 2009. That’s eight or nine years ago. Today’s email states that Xilinx plans to ship Spartan-6 devices “until at least 2027.” That’s another nine or ten years into the future for a resulting product-line lifespan of nearly two decades and that’s not all that unusual for Xilinx parts. In other words, Xilinx FPGAs are in another universe entirely when compared to the rapid pace of obsolescence for semiconductor devices like PC and server processors. That’s something to keep in mind when you’re designing products destined for a long life in the field.


If you want to see the full long-life story for the Spartan-6 FPGA family, click here.




The Raptor from Rincon Research implements a 2x2 MIMO SDR (software-defined radio) in a compact 5x2.675-inch form factor by combining the capabilities of the Analog Devices AD9361 RF Agile Transceiver and the Zynq UltraScale+ ZU9EG MPSoC. The board has an RF tuning range of 70MHz to 6GHz. On-board memory includes 4Gbytes of DDR4 SDRAM, a pair of QSPI Flash memory chips, and an SD card socket. Digital I/O options include three on-board USB connectors (two USB 3.0 ports and one USB 2.0 port) and, through a mezzanine board, 10/100/1000 Ethernet, two SFP+ optical cages, an M.2 SATA port, DisplayPort, and a Samtec FireFly connector. Rincon Research provides the board along with a BSP, drivers, and COTS tool support.


Here’s a block diagram of the Raptor board:



Rincon Research Raptor Block Diagram.jpg


Rincon Research’s Raptor, a 2x2 MIMO SDR Board, Block Diagram




Here are photos of the Raptor main board and its I/O expansion mezzanine board:




Rincon Research Raptor Board.jpg 


Rincon Research’s Raptor 2x2 MIMO SDR Board





Rincon Research Raptor IO Mezzanine Board.jpg 


Rincon Research’s Raptor I/O Expansion Board




Please contact Rincon Research for more information about the Raptor SDR.




By Adam Taylor



For the final MicroZed Chronicles blog of the year, I thought I would wrap up with several tips to help when you are creating embedded-vision systems based on Zynq SoC, Zynq UltraScale+ MPSoC, and Xilinx FPGA devices.


Note: These tips and more will be part of Adam Taylor’s presentation at the Xilinx Developer Forum that will be held in Frankfurt, Germany on January 9.









  1. Design in Flexibility from the Beginning






Video Timing Controller used to detect the incoming video standard



Use the flexibility provided by the Video Timing Controller (VTC) and reconfigurable clocking architectures such as fabric clocks, MMCMs, and PLLs. Using the VTC and associated software running on the PS (processor system) in the Zynq SoC and Zynq UltraScale+ MPSoC, it is possible to detect different video standards from an input signal at run time and to configure the processing and output video timing accordingly. Upon detection of a new video standard, the software running on the PS can configure new clock frequencies for the pixel clock and the image-processing chain, along with reconfiguring the VDMA frame buffers for the new image settings. The VTC’s timing detector identifies the incoming video settings, and its timing generator uses them to produce the new output video timings.
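The run-time reconfiguration flow described above can be sketched in a few lines. This is an illustrative model only: the function names and callbacks are stand-ins for the real clock-wizard and VDMA driver calls, while the pixel-clock values are the standard CEA/VESA figures for each format.

```python
# Software on the PS reads the video standard detected by the VTC and picks
# the matching pixel clock to program into an MMCM/PLL, then reconfigures
# the VDMA frame buffers. All names here are illustrative, not a real API.

VIDEO_STANDARDS = {
    # (h_active, v_active, fps) -> pixel clock in Hz
    (640, 480, 60): 25.175e6,   # VGA
    (1280, 720, 60): 74.25e6,   # 720p60
    (1920, 1080, 60): 148.5e6,  # 1080p60
}

def pixel_clock_for(h_active, v_active, fps):
    """Return the pixel clock (Hz) for a detected video standard."""
    try:
        return VIDEO_STANDARDS[(h_active, v_active, fps)]
    except KeyError:
        raise ValueError("unrecognized video standard")

def on_new_standard(h_active, v_active, fps, configure_clock, configure_vdma):
    """React to a VTC 'new standard detected' event; the two callbacks
    stand in for the real clock and VDMA driver calls."""
    configure_clock(pixel_clock_for(h_active, v_active, fps))
    configure_vdma(h_active, v_active)
```

The dictionary lookup mirrors what the PS software does with the VTC’s detected timing values before it reprograms the clocking and the frame buffers.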




  2. Convert input video to AXI Interconnect as soon as possible to leverage IP and HLS






Converting Data into the AXI Streaming Format




Vivado provides a range of key IP cores that implement most of the functions required by an image-processing chain: functions such as color filter interpolation, color space conversion, VDMA, and video mixing. Similarly, Vivado HLS can generate IP cores that use the AXI interconnect to ease integration within Vivado designs. Therefore, to get maximum benefit from the available IP and tool-chain capabilities, we need to convert incoming video data into the AXI Streaming format as early as possible in the image-processing chain. The Video-In-to-AXI-Stream IP core is an aid here. This core converts video from a parallel format consisting of synchronization signals and pixel values into the desired AXI Streaming format. A good tip when using this IP core: the sync inputs do not need to be timed per a VGA standard; they are edge-triggered. This eases integration with different video formats such as Camera Link, with its frame-valid, line-valid, and pixel information format.
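To make the conversion concrete, here is a small behavioral model (not RTL, and not the actual IP core) of what the parallel-to-AXI-Stream translation does: each valid pixel becomes a stream beat, with TUSER marking the start of frame and TLAST marking the end of each line. The single-frame assumption and data layout are simplifications for illustration.

```python
# Behavioral model of parallel video (frame_valid, line_valid, pixel)
# becoming AXI-Stream beats. TUSER flags start of frame, TLAST flags the
# last pixel of each line. Models one frame only, for clarity.

def video_in_to_axis(samples):
    beats = []
    start_of_frame = True
    for i, (fval, lval, pixel) in enumerate(samples):
        if not (fval and lval):
            continue  # blanking interval: no beat emitted
        # TLAST is asserted when the next sample is not a valid pixel,
        # i.e. line_valid is about to deassert (end of line).
        next_valid = (
            i + 1 < len(samples) and samples[i + 1][0] and samples[i + 1][1]
        )
        beats.append({
            "tdata": pixel,
            "tuser": start_of_frame,  # SOF marker
            "tlast": not next_valid,  # EOL marker
        })
        start_of_frame = False
    return beats
```

Because the line boundary is inferred from the valid signals rather than fixed VGA timing, the same scheme accommodates formats such as Camera Link’s frame-valid/line-valid signaling.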




  3. Use Logic Debugging Resources







Insertion of the ILA monitoring the output stage




Insert integrated logic analyzers (ILAs) at key locations within the image-processing chain. Including these ILAs in the design from day one can help speed commissioning. When implementing an image-processing chain in a new design, I insert ILAs at a minimum in the following locations:


  • Directly behind the receiving IP module—especially if it is a custom block. This ILA enables me to be sure that I am receiving data from the imager / camera.
  • On the output of the first AXI Streaming IP Core. This ILA allows me to be sure the image-processing core has started to move data through the AXI interconnect. If you are using VDMA, remember you will not see activity on the interconnect until you have configured the VDMA via software.
  • On the AXI-Streaming-to-Video-Out IP block, if used. I also consider connecting the video timing controller generator outputs to this ILA as well. This enables me to determine if the AXI-Stream-to-Video-Out block is correctly locked and the VTC is generating output timing.


When combined with the test patterns discussed below, insertion of ILAs allows us to zero in faster on any issues in the design which prevent the desired behavior.




  4. Select an Imager / Camera with a Test Pattern capability






Incorrectly received incrementing test pattern captured by an ILA




If possible, when selecting the imaging sensor or camera for a project, choose one that provides a test-pattern video output. You can then use this standard test pattern to ensure that the reception, decoding, and image-processing chain is configured correctly because you’ll know exactly what the original video signal looks like. You can combine the imager/camera test pattern with ILAs connected close to the data-reception module to determine whether any issues you see when displaying an image are internal to the device and the image-processing chain or are the result of the imager/camera configuration.


We can verify the deterministic pixel values of the test pattern using the ILA. If the pixel values, line length, and number of lines are as we expect, then it is not an imager-configuration issue; more likely, you will find the issue within the receiving module or the image-processing chain. This is especially important when using complex imagers/cameras that require several tens, or sometimes hundreds, of configuration settings to be applied before an image is obtained.
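The check described above is mechanical enough to script against captured ILA data. Here is a minimal sketch, assuming an incrementing test pattern that wraps at 8 bits (the wrap width and function name are assumptions for illustration, not a property of any particular imager):

```python
# Verify ILA-captured pixel lines against an incrementing test pattern.
# A clean result points the finger at the imager configuration; mismatches
# point at the receive module or image-processing chain.

def check_incrementing_pattern(captured_lines, expected_line_len):
    errors = []
    for line_no, line in enumerate(captured_lines):
        if len(line) != expected_line_len:
            errors.append(
                f"line {line_no}: length {len(line)} != {expected_line_len}")
            continue
        for x, pixel in enumerate(line):
            if pixel != (x % 256):  # assumed 8-bit incrementing pattern
                errors.append(f"line {line_no}, pixel {x}: got {pixel}")
    return errors
```

An empty error list means pixel values, line length, and line count all match expectations, which is exactly the triage decision the text describes.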



  5. Include a Test Pattern Generator in your Zynq SoC, Zynq UltraScale+ MPSoC, or FPGA design






Tartan Color Bar Test Pattern




If you include a test-pattern generator within the image-processing chain, you can use it to verify the VDMA frame buffers, output video timing, and decoding prior to the integration of the imager/camera. This reduces integration risks. To gain maximum benefit, configure the test-pattern generator with the same color space and resolution as the final imager, and place it as close to the start of the image-processing chain as possible so that more of the pipeline can be verified. When combined with test-pattern capabilities on the imager, this enables faster identification of any problems.




  6. Understand how Video Direct Memory Access stores data in memory







Video Direct Memory Access (VDMA) allows us to use the processor DDR memory as a frame buffer. This enables access to the images from the processor cores in the PS to perform higher-level algorithms if required. VDMA also provides the buffering required for frame-rate and resolution changes. Understanding how VDMA stores pixel data within the frame buffers is critical if the image-processing pipeline is to work as desired when configured.


One of the major points of confusion when implementing VDMA-based solutions centers on the definition of the frame size within memory. The frame buffer is defined in memory by three parameters: Horizontal Size (HSize), Vertical Size (VSize), and Stride. VSize defines the number of lines in the image. HSize and Stride together define the horizontal size of the image: like VSize, HSize defines the length of each line, but instead of being measured in pixels, it is measured in bytes, so we need to know how many bytes make up each pixel.


The Stride defines the distance between the start of one line and the start of the next. The Stride must be at least equal to the horizontal size; for the most efficient use of DDR memory, make them equal. Increasing the Stride beyond HSize introduces a gap between lines. This gap can be very useful when verifying that the imager data is received correctly because it provides a clear indication of where each line of the image starts and ends within memory.
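The frame-buffer geometry above is easy to get wrong, so here is a small worked sketch. The 1280x720 RGB example numbers are illustrative, not taken from any particular design:

```python
# VDMA frame-buffer geometry: HSize is a line length in BYTES
# (pixels * bytes-per-pixel), Stride is the byte distance between line
# starts, and Stride >= HSize. Any excess is the inter-line gap.

def frame_buffer_layout(h_pixels, v_lines, bytes_per_pixel, stride):
    hsize = h_pixels * bytes_per_pixel
    assert stride >= hsize, "Stride must be at least the horizontal size"
    line_starts = [line * stride for line in range(v_lines)]
    footprint = v_lines * stride  # total bytes occupied by the frame
    return hsize, line_starts, footprint

# Example: 1280x720, 3 bytes/pixel (24-bit RGB), stride padded to 4096:
hsize, starts, size = frame_buffer_layout(1280, 720, 3, 4096)
# hsize = 3840 bytes per line; each line begins on a 4096-byte boundary,
# leaving a 256-byte gap that clearly marks the end of every line.
```

With Stride equal to HSize the lines pack tightly; padding the Stride to a power of two, as here, trades some memory for the easy-to-spot line boundaries the text mentions.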


These six simple techniques have helped me considerably when creating image-processing examples for this blog and solutions for clients, and they significantly ease both the creation and commissioning of designs.


As I said, this is my last blog of the year. We will continue this series in the New Year. Until then I wish you all happy holidays.




You can find the example source code on GitHub.



Adam Taylor’s Web site is http://adiuvoengineering.com/.



If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.




First Year E Book here


First Year Hardback here.




MicroZed Chronicles hardcopy.jpg 




Second Year E Book here


Second Year Hardback here




MicroZed Chronicles Second Year.jpg 







The European Tulipp (Towards Ubiquitous Low-power Image Processing Platforms) project has just published a short, 102-second video that graphically demonstrates the advantages of FPGA-accelerated video processing over CPU-based processing. In this demo, a Sundance EMC²-DP board based on a Xilinx Zynq Z-7030 SoC accepts real-time video from a 720p camera and applies filters to the video before displaying the images on an HDMI screen.


Here are the performance comparisons for the real-time video processing:


  • CPU-based grayscale filter: ~15fps
  • FPGA-accelerated grayscale filter: ~90fps (6x speedup)
  • CPU-based Sobel filter: ~2fps
  • FPGA-accelerated Sobel filter: ~90fps (45x speedup)



The video rotates through the various filters so that you can see the effects.






TSN (time-sensitive networking) is a set of evolving IEEE standards that support a mix of deterministic, real-time, and best-effort traffic over fast Ethernet connections. The TSN set of standards is becoming increasingly important in many industrial networking situations, particularly for the IIoT (Industrial Internet of Things). SoC-e has developed TSN IP that you can instantiate in Xilinx All Programmable devices. (Because the standards are still evolving, implementing the TSN hardware in reprogrammable logic is a good idea.)


In particular, the company offers the MTSN (Multiport TSN Switch IP Core) IP core, which provides precise time synchronization of network nodes using synchronized, distributed local clocks with a reference and IEEE 802.1Qbv for enhanced traffic scheduling. You can currently instantiate the SoC-e core on all of the Xilinx 7 series devices (the Zynq SoC and Spartan-7, Artix-7, Kintex-7, and Virtex-7 FPGAs), Virtex and Kintex UltraScale devices, and all UltraScale+ devices (the Zynq UltraScale+ MPSoCs and Virtex and Kintex UltraScale+ FPGAs).
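To give a feel for the IEEE 802.1Qbv traffic scheduling that the MTSN core implements in hardware, here is a tiny software model of a time-aware shaper. This is generic Qbv behavior, not SoC-e's implementation, and the cycle and window lengths are made-up example numbers:

```python
# Model of an IEEE 802.1Qbv gate-control list: a repeating schedule opens
# and closes per-traffic-class transmission gates, giving scheduled
# (deterministic) traffic a protected window in every cycle.

GATE_CONTROL_LIST = [
    # (window length in microseconds, traffic classes whose gate is open)
    (250, {7}),        # protected window: only scheduled class 7 may send
    (750, {0, 1, 2}),  # best-effort classes share the rest of the cycle
]
CYCLE_US = sum(window for window, _ in GATE_CONTROL_LIST)

def open_classes(t_us):
    """Return the set of traffic classes allowed to transmit at time t."""
    t = t_us % CYCLE_US  # the schedule repeats every cycle
    for window, classes in GATE_CONTROL_LIST:
        if t < window:
            return classes
        t -= window
    return set()
```

Because every switch on the network runs the same schedule against the same synchronized clock (that is what the precise time synchronization is for), the protected window guarantees bounded latency for the scheduled traffic class.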


Here’s a short, three-and-a-half-minute video explaining TSN and the SoC-e MTSN IP:





Tincy YOLO: a real-time, low-latency, low-power object detection system running on a Zynq UltraScale+ MPSoC

by Xilinx Employee, 12-14-2017 10:39 AM (edited 12-15-2017 06:15 AM)


Last week at the NIPS 2017 conference in Long Beach, California, a Xilinx team demonstrated a live object-detection implementation of a YOLO—“you only look once”—network called Tincy YOLO (pronounced “teensy YOLO”) running on a Xilinx Zynq UltraScale+ MPSoC. Tincy YOLO combines reduced precision, pruning, and FPGA-based hardware acceleration to speed network performance by 160x, resulting in a YOLO network capable of operating on video frames at 16fps while dissipating a mere 6W.



Figure 5.jpg 


Live demo of Tincy YOLO at NIPS 2017. Photo credit: Dan Isaacs




Here’s a description of that demo:





TincyYOLO: a real-time, low-latency, low-power object detection system running on a Zynq UltraScale+ MPSoC



By Michaela Blott, Principal Engineer, Xilinx



The Tincy YOLO demonstration shows real-time, low-latency, low-power object detection running on a Zynq UltraScale+ MPSoC device. In object detection, the challenge is to identify objects of interest within a scene and to draw bounding boxes around them, as shown in Figure 1. Object detection is useful in many areas, particularly in advanced driver assistance systems (ADAS) and autonomous vehicles where systems need to automatically detect hazards and to take the right course of action. Tincy YOLO leverages the “you only look once” (YOLO) algorithm, which delivers state-of-the-art object detection. Tincy YOLO is based on the Tiny YOLO convolutional network, which is based on the Darknet reference network. Tincy YOLO has been optimized through heavy quantization and modification to fit into the Zynq UltraScale+ MPSoC’s PL (programmable logic) and Arm Cortex-A53 processor cores to produce the final, real-time demo.



Figure 1.jpg 


Figure 1: YOLO-recognized people with bounding boxes




To appreciate the computational challenge posed by Tiny YOLO, note that it takes 7 billion floating-point operations to process a single frame. Before you can conquer this computational challenge on an embedded platform, you need to pull many levers. Luckily, the all-programmable Zynq UltraScale+ MPSoC platform provides many levers to pull. Figure 2 summarizes the versatile and heterogeneous architectural options of the Zynq platform.



Figure 2.jpg 


Figure 2: Tincy YOLO Platform Overview




The vanilla Darknet open-source neural network framework is optimized for CUDA acceleration but its generic, single-threaded processing option can target any C-programmable CPU. Compiling Darknet for the embedded Arm processors in the Zynq UltraScale+ MPSoC left us with a sobering performance of one recognized frame every 10 seconds. That’s about two orders of magnitude of performance away from a useful ADAS implementation. It also produces a very limited live-video experience.


To create Tincy YOLO, we leveraged several of the Zynq UltraScale+ MPSoC’s architectural features in steps, as shown in Figure 3. Our first major move was to quantize the computation of the network’s twelve inner (a.k.a. hidden) layers by giving them binary weights and 3-bit activations. We then pruned this network to reduce the total operations to 4.5 GOPs/frame.




Figure 3.jpg 


Figure 3: Steps used to achieve a 160x speedup of the Tiny YOLO network




We created a reduced-precision accelerator using a variant of the FINN BNN library (https://github.com/Xilinx/BNN-PYNQ) to offload the quantized layers into the Zynq UltraScale+ MPSoC’s PL. These layers account for more than 97% of all the computation within the network. Moving the computations for these layers into hardware bought us a 30x speedup of their specific execution, which translated into an 11x speedup within the overall application context, bringing the network’s performance up to 1.1fps.


We tackled the remaining outer layers by exploiting the NEON SIMD vector capabilities built into the Zynq UltraScale+ MPSoC’s Arm Cortex-A53 processor cores, which gained another 2.2x speedup. Then we cracked down on the complexity of the initial convolution using maxpool elimination for another 2.2x speedup. This work raised the frame rate to 5.5fps. A final re-write of the network inference to parallelize the CPU computations across all four of the Zynq UltraScale+ MPSoC’s Arm Cortex-A53 processor cores delivered video performance at 16fps.
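Chaining the reported frame rates together shows how the individual steps compose into the overall gain. The intermediate ratios are approximate because the article quotes rounded frame rates, but the end-to-end result is the quoted 160x:

```python
# The Tincy YOLO optimization chain, reconstructed from the frame rates
# reported in the text. Each stage's speedup is the ratio of consecutive
# frame rates; the overall speedup is the ratio of the endpoints.

reported_fps = [
    ("baseline: Darknet on one Cortex-A53 core", 0.1),
    ("quantized inner layers offloaded to PL via FINN", 1.1),
    ("NEON SIMD on outer layers + maxpool elimination", 5.5),
    ("inference parallelized across 4 Cortex-A53 cores", 16.0),
]

stage_speedups = [
    (name, fps / prev_fps)
    for (_, prev_fps), (name, fps) in zip(reported_fps, reported_fps[1:])
]
overall = reported_fps[-1][1] / reported_fps[0][1]  # ~160x end to end
```

Note how the 30x hardware speedup of the offloaded layers shrinks to 11x at the application level: once 97% of the work runs fast in the PL, the remaining CPU-bound layers dominate, which is why the later CPU-side optimizations were still needed.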


The result of these changes appears in Figure 4, which demonstrates better recognition accuracy than Tiny YOLO.




Figure 4.jpg 


Figure 4: Tincy YOLO results






High-frequency trading is all about speed, which explains why Aldec’s new reconfigurable HES-HPC-HFT-XCVU9P PCIe card for high-frequency trading (HFT) apps is powered by a Xilinx Virtex UltraScale+ VU9P FPGA. That’s about as fast as you can get with any sort of reprogrammable or reconfigurable technology. The Virtex UltraScale+ FPGA directly connects to all of the board’s critical, high-speed interface ports—Ethernet, QSFP, and PCIe x16—and implements the communications protocols for those standard interfaces as well as the memory control and interface for the board’s three QDR-II+ memory modules. Consequently, there’s no time-consuming chip-to-chip interconnection. Picoseconds count in HFT applications, so the FPGA’s ability to implement all of the card’s logic is a real competitive advantage for Aldec. The new FPGA accelerator is extremely useful for implementing time-sensitive trading strategies such as Market Making, Statistical Arbitrage, and Algorithmic Trading and is compatible with 1U and larger trading systems.



Aldec HES-HPC-HFT-XCVU9P PCIe card .jpg 



Aldec’s HES-HPC-HFT-XCVU9P PCIe card for high-frequency trading apps—Powered by a Xilinx Virtex UltraScale+ FPGA





Here’s a block diagram of the board:




Aldec HES-HPC-HFT-XCVU9P PCIe card block diagram.jpg



Aldec’s HES-HPC-HFT-XCVU9P PCIe card block diagram




Please contact Aldec directly for more information about the HES-HPC-HFT-XCVU9P PCIe card.




An article titled “Living on the Edge” by Farhad Fallah, one of Aldec’s Application Engineers, on the New Electronics Web site recently caught my eye because it succinctly describes why FPGAs are so darn useful for many high-performance, edge-computing applications. Here’s an example from the article:


“The benefits of Cloud Computing are many-fold… However, there are a few disadvantages to the cloud too, the biggest of which is that no provider can guarantee 100% availability.”


There’s always going to be some delay when you ship data to the cloud for processing. You will need to wait for the answer. The article continues:


“Edge processing needs to be high-performance and in this respect an FPGA can perform several different tasks in parallel.”


The article then continues to describe a 4-camera ADAS demo based on Aldec’s TySOM-2-7Z100 prototyping board that was shown at this year’s Embedded Vision Summit held in Santa Clara, California. (The TySOM-2-7Z100 proto board is based on the Xilinx Zynq Z-7100 SoC—the largest member of the Zynq SoC family.)





Aldec TySOM-2-Z100 Prototyping Board.jpg 


Aldec’s TySOM-2-7Z100 prototyping board




Then the article describes the significant performance boost that the Zynq SoC’s FPGA fabric provides:


“The processing was shared between a dual-core ARM Cortex-A9 processor and FPGA logic (both of which reside within the Zynq device) and began with frame grabbing images from the cameras and applying an edge detection algorithm (‘edge’ here in the sense of physical edges, such as objects, lane markings etc.). This is a computational-intensive task because of the pixel-level computations being applied (i.e. more than 2 million pixels). To perform this task on the ARM CPU a frame rate of only 3 per second could have been realized, whereas in the FPGA 27.5 fps was achieved.”


That’s nearly a 10x performance boost thanks to the on-chip FPGA fabric. Could your application benefit similarly?
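The "pixel-level computation" the quoted passage refers to can be shown in miniature. Here is a plain Sobel edge-detection kernel over a grayscale image held as a list of lists; this is a generic textbook Sobel, not the code from Aldec's demo. A real design applies this arithmetic at every one of 2+ million pixels per frame, which is exactly the data-parallel workload that the Zynq SoC's FPGA fabric accelerates:

```python
# Sobel edge detection: convolve each interior pixel with horizontal (GX)
# and vertical (GY) gradient kernels, then clamp the gradient magnitude
# (approximated as |gx| + |gy|) to the 8-bit range.

GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel(image):
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]  # borders stay 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(GX[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(GY[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = min(255, abs(gx) + abs(gy))
    return out
```

Every output pixel depends only on a 3x3 neighborhood of the input, so the FPGA fabric can compute many of these windows in parallel per clock cycle; a CPU must largely grind through them one at a time, which is where the 3fps-versus-27.5fps gap comes from.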





Eideticom’s NoLoad (NVMe Offload) platform uses FPGA-based acceleration on PCIe FPGA cards and in cloud-based FPGA servers to provide storage and compute acceleration through standardized NVMe and NVMe over Fabrics protocols. The NoLoad product itself is a set of IP that implements the NoLoad accelerator. The company is offering Hardware Eval Kits that target FPGA-based PCIe cards from Nallatech (the 250S FlashGT+ card, based on a Xilinx Kintex UltraScale+ KU15P FPGA) and from Alpha Data (the ADM-PCIE-9V3, based on a Xilinx Virtex UltraScale+ VU3P FPGA).


The NoLoad platform allows networked systems to share FPGA acceleration resources across the network fabric. For example, Eideticom offers an FPGA-accelerated Reed-Solomon Erasure Coding engine that can supply codes to any storage facility on the network.


Here’s a 6-minute video that explains the Eideticom NoLoad offering with a demo from the Xilinx booth at the recent SC17 conference:






For more information about the Nallatech 250S+ SSD accelerator, see “Nallatech 250S+ SSD accelerator boosts storage speed of four M.2 NVMe drives using Kintex UltraScale+ FPGA.”



For more information about the Alpha Data ADM-PCIE-9V3, see “Blazing new Alpha Data PCIe Accelerator card sports Virtex UltraScale+ VU3P FPGA, 4x 100GbE ports, 16Gbytes of DDR4 SDRAM.”


The latest hypervisor to host Wind River’s VxWorks RTOS alongside Linux is the Xen Project Hypervisor, an open-source virtualization platform from the Linux Foundation. DornerWorks has released a version of the Xen Project Hypervisor called Virtuosity (formerly known as the Xen Zynq Distribution) that runs on the Arm Cortex-A53 processor cores in the Xilinx Zynq UltraScale+ MPSoC. Consequently, Wind River has partnered with DornerWorks to provide a Xen Project Hypervisor solution for VxWorks and Linux on the Xilinx Zynq UltraScale+ MPSoC ZCU102 eval kit.


Having VxWorks and Linux running on the same system allows developers to create hybrid software systems that offer the combined advantages of the two operating systems, with VxWorks managing mission-critical functions and Linux managing human-interactive functions and network cloud connection functions.


Wind River has just published a blog about using VxWorks and Linux on the Arm Cortex-A53 processor, concisely titled “VxWorks on Xen on ARM Cortex A53,” written by Ka Kay Achacoso. The blog describes an example system with VxWorks running signal-processing and spectrum-analysis applications. Results are compiled into a JSON string and sent through the virtual network to Ubuntu. On Ubuntu, the Apache2 HTTP server sends results to a browser using Node.js and Chart.js to format the data display.


Here’s a block diagram of the system in the Wind River blog:




Wind River VxWorks and Linux Hybrid System.jpg 


VxWorks and Linux Hybrid OS System




VxWorks runs as a guest OS on top of the unmodified Virtuosity hypervisor.




For more information about DornerWorks Xen Hypervisor (Virtuosity), see:







There was a live AWS EC2 F1 application-acceleration Developer’s Workshop at Amazon’s re:Invent 2017 last month. If you couldn’t make it, don’t worry: it’s now online and you can run through it in about two hours (I’m told). This workshop teaches you how to develop accelerated applications using the AWS F1 OpenCL flow and the Xilinx SDAccel development environment for the AWS EC2 F1 platform, which uses Xilinx Virtex UltraScale+ FPGAs as high-performance hardware accelerators.


The architecture of the AWS EC2 F1 platform looks like this:



AWS EC2 F1 Architecture.jpg 


AWS EC2 F1 Architecture




This developer workshop is divided into four modules. Amazon recommends that you complete each module before proceeding to the next.


  1. Connecting to your F1 instance 
    You will start an EC2 F1 instance based on the FPGA developer AMI and connect to it using a remote desktop client. Once connected, you will confirm you can execute a simple application on F1.
  2. Experiencing F1 acceleration 
    AWS F1 instances are ideal for accelerating complex workloads. In this module, you will experience the potential of F1 by using FFmpeg to run both a software implementation and an F1-optimized implementation of an H.265/HEVC encoder.
  3. Developing and optimizing F1 applications with SDAccel 
    You will use the SDAccel development environment to create, profile, and optimize an F1 accelerator. The workshop focuses on the Inverse Discrete Cosine Transform (IDCT), a compute-intensive function at the heart of all video codecs.
  4. Wrap-up and next steps 
    Explore next steps to continue your F1 experience after the re:Invent 2017 Developer Workshop.



Access the online AWS EC2 F1 Developer’s Workshop here.



For more information about Amazon’s AWS EC2 F1 instance in Xcell Daily, see:









SEGGER has added the RISC-V processor to its list of more than 80 ports for the company’s embOS RTOS, which guarantees 100% deterministic, real-time operation for any embedded device. The embOS RTOS is fully compliant with the MISRA-C:2012 standard, making it suitable for demanding automotive and high-integrity applications. The RISC-V embOS port comes with a BSP (board support package) for Digilent’s $99 Arty evaluation board—based on a Xilinx Artix-7 A35T FPGA—providing a straightforward getting-started experience with SEGGER software on RISC-V. In support of the RTOS, SEGGER offers emWin for constructing user interfaces; the emFile file system; emSSL, emSSH, and emSecure for securing Internet communications; cryptographic and security libraries for encryption, code signing, and authentication (digital signatures); the embOS/IP, emModbus, emUSB-Host, and emUSB-Device communication stacks for Internet and industrial applications; and emLoad for firmware updates from portable storage or over the air.



Arty Board V3.jpg 


Digilent’s $99 Arty dev board is based on a Xilinx Artix-7 FPGA




For more information about instantiating the RISC-V processor architecture in Xilinx All Programmable devices, see:









Accolade’s new ANIC-200Kq Flow Classification and Filtering Adapter brings packet processing, storage optimization, and scalable Flow Classification at 100GbE through two QSFP28 optical cages. Like the company’s ANIC-200Ku Lossless Packet Capture adapter introduced last year, the ANIC-200Kq board is based on a Xilinx UltraScale FPGA so it’s able to run a variety of line-speed packet-processing algorithms including the company’s new “Flow Shunting” feature.




Accolade ANIC-200Kq Flow Classification and Filtering Adapter.jpg 


Closeup view of the QSFP28 ports on Accolade’s ANIC-200Kq Flow Classification and Filtering Adapter




The new ANIC-200Kq adapter differs from the older ANIC-200Ku adapter in its optical I/O ports: the ANIC-200Kq adapter incorporates two QSFP28 optical cages, while the ANIC-200Ku incorporates two CFP2 cages. Both the QSFP28 and CFP2 interfaces accept SR4 and LR4 modules. The QSFP28 optical cages put Accolade’s ANIC-200Kq adapter squarely in the 25, 40, and 100GbE arenas, providing data center architects with additional architectural flexibility when designing their optical networks. For this reason, QSFP28 is fast becoming the universal form factor for new data center installations.



For more information in Xcell Daily about Accolade’s fast Flow Classification and Filtering Adapters, see:







By Adam Taylor


Getting the best performance from our embedded-vision systems often requires that we capture frames individually for later analysis, in addition to displaying them. Programs such as Octave, MATLAB, or ImageJ can analyze these captured frames, allowing us to:


  • Compare the received pixel values against those expected for a test or calibration pattern.
  • Examine the Image Histogram, enabling histogram equalization to be implemented if necessary.
  • Ensure that the integration time of the imager is set correctly for the scene type.
  • Examine the quality of the image sensor to identify defective pixels—for example dead or stuck-at pixels.
  • Determine the noise present in the image. This noise will be due both to inherent imager noise sources—for example, fixed-pattern noise, device noise, and dark current—and to system noise coupled in via power supplies and other sources of electrical noise in the system design.
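To make the histogram bullet concrete, here is a minimal histogram-equalization sketch in Python/NumPy of the kind of offline analysis you might otherwise run in Octave or MATLAB. The function name and the synthetic low-contrast test frame are mine, purely for illustration:

```python
import numpy as np

def equalize_histogram(img: np.ndarray) -> np.ndarray:
    """Equalize an 8-bit grayscale image using its cumulative histogram."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]             # first non-zero CDF entry
    total = img.size
    # Classic equalization mapping: stretch the CDF over the full 0..255 range.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255).astype(np.uint8)
    return lut[img]

# Example: a low-contrast test frame confined to pixel values 100..131.
rng = np.random.default_rng(0)
frame = rng.integers(100, 132, size=(64, 64), dtype=np.uint8)
eq = equalize_histogram(frame)
print(frame.min(), frame.max(), "->", eq.min(), eq.max())
```

After equalization the narrow 100–131 range is stretched across the full 0–255 dynamic range, which is exactly what you would check for when validating integration time against a scene.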


Typically, this testing occurs in the lab as part of hardware design validation, often before the higher levels of the application software are available. Such testing is frequently implemented using a bare-metal approach on the processor system.


If we are using VDMA, the logical point to extract the captured data is from the frame buffer in the DDR SDRAM attached to the Zynq SoC’s or MPSoC’s PS. There are two methods we can use to examine the contents of this buffer:


  • Use the XSCT terminal to read out the frame buffer, then post-process the data using a TCL script.
  • Output the frame buffer over RS232 or Ethernet using the lightweight IP (lwIP) stack, then capture the image data in a terminal for post-processing using a TCL file.


For this example, I am going to use the UltraZed design we created a few weeks ago to examine PL-to-PS image transfers in the Zynq UltraScale+ MPSoC (see here). This design rather helpfully uses the test pattern generator to transfer a test image to a frame buffer in the PS-attached DDR SDRAM. In this example, we will extract the test pattern and convert it into a bit-map (BMP) file. Once we have the bit-map file, we can read it into the analysis program of choice.


BMP files are very simple. In their most basic form, they consist of a BMP header, a Device Independent Bitmap (DIB) header, and the pixel array. In this example, the pixel array will consist of 24-bit pixels, using eight bits each for the blue, green, and red values.


It is important to remember two key facts when generating the pixel array. First, each line must be padded with zeros so that its length in bytes is a multiple of four, allowing 32-bit word access. Second, the BMP image is stored upside down in the array; that is, the first line of the pixel array is the bottom line of the image.


Combined, the two headers total 54 bytes and are structured as shown below:






Bitmap Header Construction






DIB Header Construction




Having understood what is involved in creating the file, all we need to do now is gather the pixel data from the PS-attached DDR SDRAM and output it in the correct format.
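As a rough illustration of the format described above—not Adam’s actual SDK code, which is written in C—here is a minimal Python sketch that writes such a 24-bit BMP, including the 54 bytes of headers, the zero padding, and the bottom-up, blue-green-red pixel order. The function name and incidental field values (such as the 2835 pixels-per-metre resolution) are my own choices:

```python
import struct

def write_bmp(filename, pixels, width, height):
    """Write a 24-bit BMP. `pixels` is a list of rows (top to bottom),
    each row a list of (r, g, b) tuples."""
    row_bytes = width * 3
    padding = (4 - row_bytes % 4) % 4          # pad each row to a 32-bit boundary
    image_size = (row_bytes + padding) * height
    file_size = 54 + image_size                # 14-byte BMP header + 40-byte DIB header

    with open(filename, "wb") as f:
        # BMP file header (14 bytes): magic, file size, reserved, pixel-array offset.
        f.write(struct.pack("<2sIHHI", b"BM", file_size, 0, 0, 54))
        # BITMAPINFOHEADER (40 bytes): header size, width, height, planes, bpp,
        # compression, image size, x/y pixels-per-metre, colors used/important.
        f.write(struct.pack("<IiiHHIIiiII", 40, width, height, 1, 24, 0,
                            image_size, 2835, 2835, 0, 0))
        # Pixel array: bottom row first, each pixel stored blue-green-red.
        for row in reversed(pixels):
            for r, g, b in row:
                f.write(bytes((b, g, r)))
            f.write(b"\x00" * padding)
```

For example, `write_bmp("test.bmp", [[(255, 0, 0)] * 3] * 2, 3, 2)` produces a 78-byte file: 54 header bytes plus two rows of nine pixel bytes, each padded by three zero bytes.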


As we have done several times before in this blog, it is a good idea to double-check that the frame buffer actually contains pixel values before we extract them. We can examine the contents of the frame buffer using the memory viewer in SDK; however, the view we choose affects how easily we can interpret the pixel values, and hence the frame, because of how the VDMA packs the pixels into the frame buffer.


The default view for the Memory viewer is to display 32-bit words as shown below:






TPG Test Pattern in memory




The data we are working with has a pixel width of 24 bits. To ensure efficient use of the DDR SDRAM memory, the VDMA packs the 24-bit pixels into 32-bit values, splitting pixels across locations. This can make things a little confusing when we look at the memory contents for expected pixel values. Because we know the image is formatted as 8-bit RGB, a better view is to configure the memory display to list the memory contents in byte order. We then know that each group of three bytes represents one pixel.
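A quick Python sketch (with made-up byte values) shows why the byte view is easier to interpret than the default 32-bit word view:

```python
import struct

# Twelve bytes = four 24-bit RGB pixels, but only three 32-bit memory words.
pixel_bytes = bytes([0x11, 0x22, 0x33,   # pixel 0
                     0x44, 0x55, 0x66,   # pixel 1
                     0x77, 0x88, 0x99,   # pixel 2
                     0xAA, 0xBB, 0xCC])  # pixel 3

# 32-bit (little-endian) word view: pixels straddle word boundaries, which is
# what makes the default word display hard to read. Word 0 is 0x44332211:
# all of pixel 0 plus the first byte of pixel 1.
words = struct.unpack("<3I", pixel_bytes)
print([hex(w) for w in words])

# Byte view: every group of three bytes is one whole pixel again.
pixels = [tuple(pixel_bytes[i:i + 3]) for i in range(0, len(pixel_bytes), 3)]
print(pixels)
```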







TPG Test Pattern in memory Byte View





Having confirmed that the frame buffer contains image data, I am going to output the BMP information over the RS232 port for this example. I have selected this interface because it is the simplest interface available on many development boards and it takes only a few seconds to read out even a large image.


The first thing I did in my SDK application was to create a structure that defines the header and sets the values as required for this example:





Header Structure in the application




I then created a simple loop that creates three u8 arrays, each the size of the image—one array for each color element. I then used these arrays with the header information to output the BMP information, taking care to use the correct format for the pixel array: a BMP pixel array organizes each pixel’s elements in Blue-Green-Red order:





Body of the Code to Output the Image




Wanting to keep the process automated, without the need to copy and paste the captured output, I used PuTTY as the terminal program to receive the output data. I selected PuTTY because it is capable of saving received data to a log file.





PuTTY Configuration for logging




Of course, this log file contains an ASCII representation of the BMP. To view it, we need to convert it to a binary file of the same values. I wrote a simple TCL script to do this; it reads in the ASCII file and writes out the binary BMP file.
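The blog’s conversion is done with a TCL script (shown below). As a hedged equivalent, here is a Python sketch of the same idea; it assumes the log contains whitespace-separated two-digit hexadecimal byte values, which may differ from the actual log format:

```python
def ascii_log_to_bmp(log_path: str, bmp_path: str) -> None:
    """Convert a terminal log of hex byte values (e.g. '42 4D 36 ...')
    back into a binary BMP file."""
    with open(log_path) as log:
        tokens = log.read().split()          # whitespace-separated hex bytes
    data = bytes(int(tok, 16) for tok in tokens)
    with open(bmp_path, "wb") as out:
        out.write(data)
```

For example, a log containing `42 4D 36 00` (the start of a BMP’s “BM” magic and file size) converts back to those four raw bytes.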





TCL ASCII to Binary Conversion Widget




With this complete, we have a BMP image that we can load into Octave, MATLAB, or another tool for analysis. Below is an example of the tartan color-bar test pattern that I captured from the Zynq frame buffer using this method:






Generated BMP captured from the PS DDR




Now, if we can read from the frame buffer, it follows that we can use the same process to write a BMP image into the frame buffer. This can be especially useful when we want to generate overlays and use them with the video mixer.


We will look at how we do this in a future blog.




You can find the example source code on GitHub.



Adam Taylor’s Web site is http://adiuvoengineering.com/.



If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.




First Year E Book here


First Year Hardback here.




MicroZed Chronicles hardcopy.jpg 



Second Year E Book here


Second Year Hardback here



MicroZed Chronicles Second Year.jpg 






Last month, Dave Jones tore down yet another low-end DSO—the Uni-T Ultra Phosphor UPO2104CS 4-channel, 100MHz, 1Gsamples/sec scope—on his EEVBlog Web site and he found a Xilinx Spartan-6 FPGA inside. (It appears in the video at 22:00.) The Spartan-6 LX45 FPGA manages the DSO’s sample memories which include a fast SRAM and an SDRAM.




Uni-T UPO2014CS DSO.jpg


Uni-T Ultra Phosphor UPO2104CS 4-channel, 100MHz, 1Gsamples/sec DSO





Here’s a photo of the Spartan-6 LX45 FPGA on the DSO’s main board.




Uni-T UPO2014CS Spartan-6 detail.jpg 


Spartan-6 FPGA detail from Uni-T Ultra Phosphor UPO2104CS DSO

(Photo courtesy of Dave Jones)




As with many DSOs that Dave Jones has torn down, the Spartan-6 FPGA manages the flow of captured data from the DSO’s 1Gsample/sec ADC through the DSO’s sample memory. As a result, the scope can capture 30,000 waveforms/sec and as many as 65,000 waveform frames in its SDRAM capture buffer. In the image above, the length-matched signal lines emerging from the top of the Spartan-6 FPGA lead to the fast SRAM capture memory and the length-matched lines emerging from the bottom of the FPGA lead to the SDRAM waveform memory.


Here’s Jones’ 33-minute teardown video. The Spartan-6 FPGA appears at around 22:00 in the video.








Uni-T’s UPO2104CS DSO currently sells for $538.88 on Banggood.com.




At the end of each year for the past few decades, the staff of EDN has sifted through the thousands of electronic products they’ve written about over the year to select the Hot 100 products. In EDN’s opinion, the Xilinx Zynq UltraScale+ RFSoC is one of the Hot 100 products for 2017 in the “RF & Networking” category.


Members of the Xilinx Zynq UltraScale+ RFSoC device family integrate multi-gigasample/sec RF ADCs and DACs, soft-decision forward error correction (SD-FEC) IP blocks, Xilinx UltraScale architecture programmable logic fabric, and an Arm Cortex-A53/Cortex-R5 multi-core processing subsystem into one chip. Thanks to this extremely high integration level, the Zynq UltraScale+ RFSoC is a category killer for the many applications that need “high-speed analog-in, high-speed analog-out, digital-processing-in-the-middle” capabilities. It most assuredly will reduce the size, power, and complexity of traditional antenna structures in many RF applications—especially 5G antenna systems.


As I wrote when the Zynq UltraScale+ RFSoC family won the IET Innovation Award in the Communications category, “There’s simply no other device like the Zynq UltraScale+ RFSoC on the market, as suggested by this award.”




RFSoC Conceptual Diagram.jpg 


Zynq UltraScale+ RFSoC Conceptual Diagram




For more information about the Zynq UltraScale+ RFSoC, see:













Good machine learning depends heavily on large training-data sets, which are not always available. One solution to this problem is transfer learning, which allows a new neural network to use an already-trained neural network as its starting point. Kaan Kara at ETH Zurich has published an example of transfer learning as a Jupyter Notebook for the Zynq- and Python-based PYNQ development environment on GitHub. This demo uses the ZipML-PYNQ overlay to analyze astronomical images of galaxies, sorting them into two classes: images showing merging galaxies and images that don’t.


The work is discussed further in a paper presented at the IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017. The paper is titled “FPGA-Accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-Off.”
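The transfer-learning idea itself can be sketched in a few lines of NumPy. This toy is emphatically not the ZipML-PYNQ flow: the “pre-trained network” here is just a frozen random projection and the data are synthetic. It only shows the key move—reusing a trained (here, frozen) network body and training a small new classifier on top, which needs far less data than training end to end:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a pre-trained network body: a FROZEN feature extractor.
# In a real system this would be a trained CNN; here it is a fixed random
# projection, never updated during the new task's training.
W_pretrained = rng.normal(size=(64, 16)) * 0.1

def features(x):
    return np.tanh(x @ W_pretrained)      # frozen layers: no gradient updates

# Small task-specific synthetic dataset (two classes).
X = rng.normal(size=(200, 64))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Transfer learning: train ONLY the new final logistic classifier.
w, b = np.zeros(16), 0.0
F = features(X)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))   # sigmoid output
    grad_w = F.T @ (p - y) / len(y)          # logistic-loss gradients
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

acc = np.mean(((1.0 / (1.0 + np.exp(-(F @ w + b)))) > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```

Because only the 17 classifier parameters are learned, a couple of hundred labeled examples suffice—the same economy that makes transfer learning attractive when large training sets are unavailable.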










Xcell Daily often features Xilinx-based products and highlights how the unique features of Xilinx’s All Programmable devices have enabled innovative end-product capabilities. Many of these Xcell Daily blog posts originated with videos from the “Powered by Xilinx” program. If you have an awesome product based on one or more Xilinx devices and would like to see it showcased in the “Powered by Xilinx” program, we’d like to know. You can submit a request to be considered for this program here.




Powered by Xilinx Logo.jpg 


The upcoming Xilinx Developer Forum in Frankfurt, Germany on January 9 will feature a hands-on Developer Lab titled “Accelerating Applications with FPGAs on AWS.” During this afternoon session, you’ll gain valuable hands-on experience with the FPGA-accelerated AWS EC2 F1 instance and hear from a special guest speaker from Amazon Web Services. Attendance is limited on a first-come, first-served basis, so you must register, here.



For more information about Amazon’s AWS EC2 F1 instance in Xcell Daily, see:










Netcope’s NP4, a cloud-based programming tool, allows you to specify networking behavior for the company’s high-performance, programmable Smart NICs—based on Xilinx Virtex UltraScale+ and Virtex-7 FPGAs—using declarations written in P4, a network-specific, high-level programming language. The programming process involves the following steps:


  1. Write the P4 code.
  2. Upload your code to the NP4 cloud.
  3. Wait for the application to autonomously translate your P4 code into VHDL and synthesize the FPGA configuration.
  4. Download the firmware bitstream and upload it to the FPGA on your Netcope NIC.


Netcope calls NP4 its “Firmware as a Service” offering. If you are interested in trying NP4, you can request free trial access to the cloud service here.



Netcope NFB-200G2QL Programmable NIC.jpg


Netcope Technologies’ NFB-200G2QL 200G Ethernet Smart NIC based on a Virtex UltraScale+ FPGA




For more information about Netcope and P4 in Xcell Daily, see:




For more information about Netcope’s FPGA-based NICs in Xcell Daily, see:







Designing SDRs (software-defined radios)? MathWorks and Analog Devices have joined together to bring you a free Webinar titled “Radio Deployment on SoC Platforms.” It’s a 45-minute class that discusses hardware and software development for SDR designs using MathWorks’ MATLAB, Simulink, and HDL Coder to:


  • Model and simulate radio designs
  • Verify algorithms in simulation with streaming RF data
  • Deploy radio designs on hardware with HDL and C-code generation


The target hardware here is Analog Devices’ ADRV9361 RF SOM, which is based on ADI’s AD9361 RF Agile Transceiver and Xilinx’s Zynq Z-7035 All-Programmable SoC. ADI’s RF SOM has 2x2 MIMO capability.





Analog Devices’ Zynq-based RF SOM on a Carrier Card




There will be three broadcasts of the Webinar on December 13 to accommodate viewers around the world. Register here. Register even if you cannot attend and you’ll receive a link to a recording of the session.




By Adam Taylor


Over the last couple of weeks, we have examined how we can debug our designs using Micrium’s μC/Probe (Post 1 and Post 2) or with the JTAG to AXI Bridge. However, the best way to minimize time spent debugging is to generate high quality designs in the first place. We can then focus on ensuring that the design functionality is as specified instead of hunting bugs.


To improve the quality of our design, there are several things we can do that help us achieve timing closure and identify design issues and bugs:


  1. Review code to ensure that it complies with coding and design standards and to catch functional issues earlier in the design stage.
  2. Ensure compliance with device/tool-chain-recommended coding standards—for example the Xilinx Ultrafast design methodology.
  3. Correctly constrain the design for clocks, multicycle, and false paths.
  4. Analyze CDCs (Clock Domain Crossings) to ensure that all CDCs are correctly handled.
  5. Perform detailed simulation to test corner cases and boundary conditions.


Over my career, I have spent many hours performing code reviews, checking designs for functionality, and for compliance with coding standards and tool-chain recommendations.


The Blue Pearl Visual Verification Suite is an EDA tool that automates design checking over a range of different customizable rule sets including basic rules, Xilinx Ultrafast design methodology rules, and DO254. The Blue Pearl tools also perform detailed analysis of clocks, counters, state machines, CDCs, paths, and constraints. All of this checking helps engineers gain a better understanding of the functional side of their design. In short, this is a very useful tool set to have in our tool box to improve design quality. Let’s look at how this tool integrates with the Xilinx Vivado design environment and how we can use it on a simple design.


With Blue Pearl installed, the first step is to integrate it with Vivado. To do this we use the Xilinx TCL Store to install the Blue Pearl Visual Verification Suite.





Installing Blue Pearl via the Xilinx TCL Store




Once Blue Pearl is installed, the next step is to create two custom commands. The first command allows us to open a new Blue Pearl project from an open Vivado project. The second command allows updates from Vivado into the Blue Pearl project.


We create these custom commands by selecting Tools -> Custom Commands -> Customize Commands.






Open the Command Customization dialog




This opens a dialog that allows you to create custom commands. For each command, we need to define the callable TCL procedures in the Blue Pearl Visual Verification Suite.






Creating the Launch BPS command



For the “launch BPS” command, we need to use the command:









Creating the Update Command




For the update BPS command, we call the following command:







Once you have completed the addition of the customized commands, you will see two new buttons on the Vivado tool bar.


With the integration completed, we can now use Blue Pearl to analyze and improve the quality of our design if we identify issues that need analysis. Clicking the newly created “launch Blue Pearl” command within a Vivado project opens a new Blue Pearl project for analysis.


As it loads the Vivado design, Blue Pearl checks the code for synthesis and identifies any black boxes. Any syntax errors encountered will be flagged for correction before further analysis can be performed.  


There is an extensive number of checks and analyses that can be run on the loaded design, ranging from basic checks to DO254 compliance. There are so many possible checklist items that it might take a little time to select the checks that are important to you; however, once you’ve specified the checks you want, you can save the rules and use them across multiple projects. What is interesting is that the tool also reports whether each check has been run, not just its pass-or-fail status. This explicit feedback removes the ability of designers to achieve compliance by omission. (And that’s a good thing.)





Blue Pearl Environment





Design Check configuration



As an example, I loaded a project that I am working on to see what the design check and analysis reports look like. The design is simple. It decodes a MIPI stream to frame sync, line sync, and pixel values. While this is a simple design, Blue Pearl still identified a few issues within the code that need consideration to see if they present an issue or not.


The first potential issue identified was in the If/Then/Else (ITE) analysis. The design contains a VHDL process that decodes the MIPI header type. This process is written using an if/elsif structure, which implies a priority encoder, and to differentiate among five different header commands it uses a five-deep if/elsif chain—what Blue Pearl calls a length of five. By default, Blue Pearl generates warnings on lengths greater than three. In this case, no priority is required, and a case statement would provide better synthesis results because there is no need to consider input priority. Although each application is different, you as the engineer need to use your own experience and knowledge of the design to decide whether or not priority is needed.


Along with reporting the length of the if structure, ITE analysis also analyzes the number of conditions within a statement. This is important when an if statement contains several conditions because additional conditions require additional logic resources and routing, which will impact our timing performance.





Identification of if / then /else large length



State machines are, of course, used in designs for control structures. Complex control structures require large state machines, which can be difficult to follow in the RTL. As part of its analysis, Blue Pearl creates visualizations of the state machines within a design, detailing the transitions among states and identifying any unreachable states. I found this capability very useful not only for debugging and verifying the behavior of my own state machines, but also for visualizing third-party designs. This graphical capability definitely helps me understand the designer’s intent.






FSM Analysis Viewer




Blue Pearl also provides the ability to visualize CDCs and paths and to monitor fan-out within a design. These features allow us to identify places in the design where we might want to add CDC-mitigation measures such as re-timing or pipeline registers.





Clock Domain Crossing Analysis






Path Analysis







Flip-Flop Fan out reporting




Having touched lightly on the capabilities of Blue Pearl, I am impressed with the results, once you have taken the time to set up the correct checks and analyses. The analysis allows you to catch potential issues earlier in the design cycle, which should reduce the time spent in the lab hunting bugs. In turn, this frees us to spend more of our time testing functionality.


You can find the example source code on GitHub.



Adam Taylor’s Web site is http://adiuvoengineering.com/.



If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.




First Year E Book here


First Year Hardback here.




MicroZed Chronicles hardcopy.jpg 




Second Year E Book here


Second Year Hardback here



MicroZed Chronicles Second Year.jpg 






Karl Freund’s article titled “Amazon AWS And Xilinx: A Progress Report” appeared on Forbes.com today. Freund is a Moor Insights & Strategy Senior Analyst for Deep Learning and High-Performance Computing (HPC). He describes Amazon’s FPGA-based AWS EC2 F1 instance offering this way:



“…the cloud leader [Amazon] is laying the foundation to simplify FPGA adoption by creating a marketplace for accelerated applications built on Xilinx [Virtex UltraScale+] FPGAs.”



Freund then discusses what’s happened since Amazon announced its AWS EC2 F1 instance a year ago. Here are his seven highlights:


  1. "AWS has now deployed the F1 instances to four regions, with more to come…”

  2. “To support the Asian markets, where AWS has limited presence, Xilinx has won over support from the Alibaba and Huawei cloud operations.” (Well, that one’s not really about Amazon, but let’s keep it in anyway, shall we?)

  3. “Xilinx has launched a global developer outreach program, and has already trained over 1,000 developers [on the use of AWS EC2 F1] at three Xilinx Developer Forums—with more to come.”

  4. “Xilinx has recently released a Machine Learning (ML) Amazon Machine Instance (AMI), bringing the Xilinx Reconfigurable Acceleration Stack (announced last year) for ML Inference to the AWS cloud.”

  5. “Xilinx partner Edico Genome recently achieved a Guinness World Record for decoding human genomes, analyzing 1000 full human genomes on 1000 F1 instances in 2 hours, 25 minutes; a remarkable 100-fold improvement in performance…”

  6. “AWS has added support for Xilinx SDAccel programming environment to all AWS regions for solution developers…”

  7. “Xilinx partner Ryft has built an impressive analytic platform on F1, enabling near-real-time analytics by eliminating data preparation bottlenecks…”



The rest of Freund’s article discusses Ryft’s AWS Marketplace offering in more detail and concludes with this:



“…at least for now, Amazon.com, Huawei, Alibaba, Baidu, and Tencent have all voted for Xilinx.”




For extensive Xcell Daily coverage about the AWS EC2 F1 instance, see:















Like the genie in Aladdin, KORTIQ’s FPGA-based AIScale CNN Accelerator takes pre-trained CNNs (convolutional neural networks)—including industry standards such as ResNet, AlexNet, Tiny Yolo, and VGG-16—compresses them, and fits them into Xilinx’s full range of programmable logic fabrics. Devices such as the Zynq SoC and Zynq UltraScale+ MPSoC have multiple on-chip processors that can provide data to the AIScale CNN Accelerator instantiated in the FPGA fabric and accept its classification output, enabling designs such as single-chip, intelligent industrial or surveillance video cameras.


KORTIQ’s AIScale DeepCompressor compresses the trained CNN and outputs a resulting description file that represents the trained CNN. KORTIQ’s TensorFlow2AIScale translator then prepares the compressed CNN for use with KORTIQ’s AIScale RCC (reconfigurable compute core) IP that performs real-time recognition based on the trained CNN. Because the compressed CNN takes the form of a relatively small description, many such description files can be stored in on- or off-chip memory, making fast switching among trained CNNs quite feasible. Currently, KORTIQ is focusing on embedded vision and computer vision applications such as image classification, object recognition, object tracking, and face recognition.


Here’s a conceptual block diagram of the KORTIQ offering:



KORTIQ AIScale.jpg 




The hardware portion of this product, the AIScale RCC, is a coarse-grained, scalable accelerator that can be instantiated in programmable logic—for example, in the FPGA fabric of a Zynq Z-7020 SoC for small-footprint instances. Larger All Programmable devices such as bigger Zynq SoCs and Zynq UltraScale+ MPSoCs can implement more processing blocks within the accelerator core, which in turn makes the accelerator run even faster. You can use this scalability to increase system performance by picking devices with larger FPGA arrays, or to reduce power consumption by picking smaller devices.


For more information about the AIScale product family, contact KORTIQ directly.



About the Author
Steve Leibson is the Director of Strategic Marketing and Business Planning at Xilinx. He started as a system design engineer at HP in the early days of desktop computing, then switched to EDA at Cadnetix, and subsequently became a technical editor for EDN Magazine. He’s served as Editor in Chief of EDN Magazine, Embedded Developers Journal, and Microprocessor Report. He has extensive experience in computing, microprocessors, microcontrollers, embedded systems design, design IP, EDA, and programmable logic.