OKI IDS and Avnet have jointly announced a new board for developing ADAS (advanced driver assistance systems) and advanced SAE Level 4/5 autonomous driving systems based on two Xilinx UltraScale+ MPSoCs.
Avnet plans to start distributing the board in Japan in February 2018 and will then expand into other parts of Asia. The A4-sized board interfaces to as many as twelve sensors, including cameras and other types of imagers. The board operates on 12V and, according to the announcement, consumes about 20% of the power of similar GPU-based hardware because it employs Xilinx UltraScale+ MPSoCs as its foundation.
Want to see this board in person? You can, at the Xilinx booth at Automotive World 2018 being held at Tokyo Big Sight from January 17th to 19th. (Hall East 6, 54-47)
Ann Steffora Mutschler has just published an article on the SemiconductorEngineering.com Web site titled “Mixing Interface Protocols” that describes some of the complexities of SoC design—all related to the proliferation of various on- and off-chip I/O protocols. However, the article can just as easily be read as a reason for using programmable-logic devices such as Xilinx Zynq SoCs, Zynq UltraScale+ MPSoCs, and FPGAs in your system designs.
For example, here’s Mutschler’s lead sentence:
“Continuous and pervasive connectivity requires devices to support multiple interface protocols, but that is creating problems at multiple levels because each protocol is based on a different set of assumptions.”
This sentence nicely sums up the last two decades of interface-design philosophy for programmable-logic devices. Early on, it became clear that a lot of logic translation was needed to connect early FPGAs to the rest of a system. When Xilinx developed I/O pins with programmable logic levels, it wiped out a big chunk of the market for level-translator chips. When MGTs (multi-gigabit serial transceivers) started to become popular for moving large amounts of data from one subsystem to another, Xilinx moved those onto its devices as well.
So if you’d like to briefly glimpse the chaotic I/O scene that’s creating immense headaches for SoC designers, take a read through Ann Steffora Mutschler’s new article. If you’d like to sidestep those headaches, just remember that Xilinx’s engineering team has already suffered them for you.
In an article published in EETimes today titled “Programmable Logic Holds the Key to Addressing Device Obsolescence,” Xilinx’s Giles Peckham argues that the use of programmable devices—such as the Zynq SoCs, Zynq UltraScale+ MPSoCs, and FPGAs offered by Xilinx—can help prevent product obsolescence in long-lived products designed for industrial, scientific, and military applications. And that assertion is certainly true. But in this blog, I want to highlight the response by a reader using the handle MWagner_MA who wrote:
“Given the pace of change in FPGA's, I don't know if an FPGA will be a panacea for chip obsolescence issues. However, when changes in system design occur for hooking up new peripherals to a design off board, FPGA's can extend the life of a product 5+ years assuming you can get board-compatible FPGA's. Comm channels are what come to mind. If you use the same electrical interface but have an updated protocol, programmable logic can be a solution. Another solution is that when devices on SPI or I2C busses go obsolete, FPGA code can get updated to accomodate, even changing protocol if necessary assuming the right pins are connected at the other chip (like an A/D).”
MWagner_MA’s response is nuanced and tempered by obvious design experience. However, I must take issue with the suggestion that the pace of change in FPGAs is significant in the context of product obsolescence. Certainly FPGAs go obsolete, but it takes a long, long time.
Case in point:
I received an email just today from Xilinx about this very topic. (Feel free to insert amusement here about Xilinx’s corporate blogger being on the company’s promotional email list.) The email is about Xilinx’s Spartan-6 FPGAs, which were first announced in 2009. That’s eight or nine years ago. Today’s email states that Xilinx plans to ship Spartan-6 devices “until at least 2027.” That’s another nine or ten years into the future, for a product-line lifespan of nearly two decades, and that’s not all that unusual for Xilinx parts. In other words, Xilinx FPGAs are in another universe entirely when compared to the rapid pace of obsolescence for semiconductor devices like PC and server processors. That’s something to keep in mind when you’re designing products destined for a long life in the field.
If you want to see the full long-life story for the Spartan-6 FPGA family, click here.
The Raptor from Rincon Research implements a 2x2 MIMO SDR (software-defined radio) in a compact 5x2.675-inch form factor by combining the capabilities of the Analog Devices AD9361 RF Agile Transceiver and the Zynq UltraScale+ ZU9EG MPSoC. The board has an RF tuning range of 70MHz to 6GHz. On-board memory includes 4Gbytes of DDR4 SDRAM, a pair of QSPI Flash memory chips, and an SD card socket. Digital I/O options include three on-board USB connectors (two USB 3.0 ports and one USB 2.0 port) and, through a mezzanine board, 10/100/1000 Ethernet, two SFP+ optical cages, an M.2 SATA port, DisplayPort, and a Samtec FireFly connector. Rincon Research provides the board along with a BSP, drivers, and COTS tool support.
Here’s a block diagram of the Raptor board:
Rincon Research’s Raptor, a 2x2 MIMO SDR Board, Block Diagram
Here are photos of the Raptor main board and its I/O expansion mezzanine board:
Rincon Research’s Raptor 2x2 MIMO SDR Board
Rincon Research’s Raptor I/O Expansion Board
Please contact Rincon Research for more information about the Raptor SDR.
For the final MicroZed Chronicles blog of the year, I thought I would wrap up with several tips to help when you are creating embedded-vision systems based on Zynq SoC, Zynq UltraScale+ MPSoC, and Xilinx FPGA devices.
Note: These tips and more will be part of Adam Taylor’s presentation at the Xilinx Developer Forum that will be held in Frankfurt, Germany on January 9.
Design in Flexibility from the Beginning
Video Timing Controller used to detect the incoming video standard
Use the flexibility provided by the Video Timing Controller (VTC) and reconfigurable clocking architectures such as fabric clocks, MMCMs, and PLLs. Using the VTC and associated software running on the PS (processor system) in the Zynq SoC and Zynq UltraScale+ MPSoC, it is possible to detect different video standards on an input signal at run time and to configure the processing and output video timing accordingly. Upon detection of a new video standard, the software running on the PS can configure new clock frequencies for the pixel clock and the image-processing chain, along with re-configuring the VDMA frame buffers for the new image settings. The VTC's timing detector measures the incoming video timing, and its timing generator can then use those detected settings to produce the new output video timings.
Convert input video to AXI Interconnect as soon as possible to leverage IP and HLS
Converting Data into the AXI Streaming Format
Vivado provides a range of key IP cores that implement most of the functions required by an image-processing chain—functions such as Color Filter Interpolation, Color Space Conversion, VDMA, and Video Mixing. Similarly, Vivado HLS can generate IP cores that use the AXI interconnect to ease integration within Vivado designs. Therefore, to get maximum benefit from the available IP and tool-chain capabilities, we need to convert our incoming video data into the AXI Streaming format as early as possible in the image-processing chain. The Video-In-to-AXI-Stream IP core helps here: it converts video from a parallel format consisting of synchronization signals and pixel values into the desired AXI Streaming format. A good tip when using this IP core is that the sync inputs do not need to be timed as per a VGA standard; they are edge-triggered. This eases integration with different video formats such as Camera Link, with its frame-valid, line-valid, and pixel information format.
Use Logic Debugging Resources
Insertion of the ILA monitoring the output stage
Insert integrated logic analyzers (ILAs) at key locations within the image-processing chain. Including these ILAs from day one in the design can help speed commissioning. When implementing an image-processing chain in a new design, I insert ILAs in, at minimum, the following locations:
Directly behind the receiving IP module—especially if it is a custom block. This ILA enables me to be sure that I am receiving data from the imager / camera.
On the output of the first AXI Streaming IP Core. This ILA allows me to be sure the image-processing core has started to move data through the AXI interconnect. If you are using VDMA, remember you will not see activity on the interconnect until you have configured the VDMA via software.
On the AXI-Streaming-to-Video-Out IP block, if used. I also consider connecting the video timing controller generator outputs to this ILA as well. This enables me to determine if the AXI-Stream-to-Video-Out block is correctly locked and the VTC is generating output timing.
When combined with the test patterns discussed below, inserting ILAs allows us to zero in faster on any issues in the design that prevent the desired behavior.
Select an Imager / Camera with a Test Pattern capability
Incorrectly received incrementing test pattern captured by an ILA
If possible when selecting the imaging sensor or camera for a project, choose one that provides a test-pattern video output. You can then use this standard test pattern to ensure the reception, decoding, and image-processing chain is configured correctly, because you’ll know exactly what the original video signal looks like. You can combine the imager/camera test pattern with ILAs connected close to the data-reception module to determine whether any issues you experience when displaying an image are internal to the device and the image-processing chain or are the result of the imager/camera configuration.
We can verify the deterministic pixel values of the test pattern using the ILA. If the pixel values, line length, and the number of lines are as we expect, then it is not an imager configuration issue. More likely you will find the issue(s) within the receiving module and the image-processing chain. This is especially important when using complex imagers/cameras that require several tens, or sometimes hundreds of configuration settings to be applied before an image is obtained.
Include a Test Pattern Generator in your Zynq SoC, Zynq UltraScale+ MPSoC, or FPGA design
Tartan Color Bar Test Pattern
If you include a test-pattern generator within the image-processing chain, you can use it to verify the VDMA frame buffers, output video timing, and decoding prior to the integration of the imager/camera, which reduces integration risk. To gain maximum benefit, configure the test-pattern generator with the same color space and resolution as the final imager, and place it as close to the start of the image-processing chain as possible so that as much of the pipeline as possible can be verified. When combined with test-pattern capabilities on the imager, this enables faster identification of any problems.
Understand how Video Direct Memory Access stores data in memory
Video Direct Memory Access (VDMA) allows us to use the processor DDR memory as a frame buffer. This enables access to the images from the processor cores in the PS to perform higher-level algorithms if required. VDMA also provides the buffering required for frame-rate and resolution changes. Understanding how VDMA stores pixel data within the frame buffers is critical if the image-processing pipeline is to work as desired when configured.
One of the major points of confusion when implementing VDMA-based solutions centers around the definition of the frame size within memory. The frame buffer is defined in memory by three parameters: Horizontal Size (HSize), Vertical Size (VSize), and Stride. VSize defines the number of lines in the image, while HSize defines the length of each line. However, instead of being measured in pixels, HSize is measured in bytes, so we need to know how many bytes make up each pixel. HSize and Stride together define the horizontal layout of the image in memory.
The Stride defines the distance in bytes between the start of one line and the start of the next. The Stride must be at least equal to the horizontal size; setting it exactly equal makes the most efficient use of the DDR memory, while increasing it introduces a gap between lines. Implementing this gap can be very useful when verifying that the imager data is received correctly because it provides a clear indication of where each line of the image starts and ends within memory.
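To make the byte arithmetic concrete, here is a minimal C sketch. The 24-bit (3-byte) pixel format, the frame widths, and the 64-byte alignment are illustrative assumptions, not values from any particular design:

```c
#include <assert.h>

#define BYTES_PER_PIXEL 3u  /* assumed 24-bit RGB pixels */

/* HSize is the line length in bytes, not pixels. */
unsigned hsize_bytes(unsigned pixels_per_line) {
    return pixels_per_line * BYTES_PER_PIXEL;
}

/* Stride must be >= HSize. Rounding it up to an alignment boundary
   (64 bytes here, an assumed value) leaves a visible gap between
   lines in memory, which helps when debugging data reception. */
unsigned stride_bytes(unsigned pixels_per_line, unsigned align) {
    unsigned h = hsize_bytes(pixels_per_line);
    return (h + align - 1u) / align * align;
}
```

For a 1280-pixel line, HSize is 3840 bytes, which is already a multiple of 64, so Stride equals HSize and no gap appears; a 720-pixel line (2160 bytes) would be padded to a 2176-byte Stride.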
These six simple techniques have helped me considerably when creating image-processing examples for this blog and solutions for clients, and they significantly ease both the creation and commissioning of designs.
As I said, this is my last blog of the year. We will continue this series in the New Year. Until then I wish you all happy holidays.
TSN (time-sensitive networking) is a set of evolving IEEE standards that supports a mix of deterministic, real-time, and best-effort traffic over fast Ethernet connections. The TSN standards are becoming increasingly important in many industrial networking situations, particularly for the IIoT (Industrial Internet of Things). SoC-e has developed TSN IP that you can instantiate in Xilinx All Programmable devices. (Because the standards are still evolving, implementing TSN in reprogrammable hardware is a good idea.)
Last week at the NIPS 2017 conference in Long Beach, California, a Xilinx team demonstrated a live object-detection implementation of a YOLO—“you only look once”—network called Tincy YOLO (pronounced “teensy YOLO”) running on a Xilinx Zynq UltraScale+ MPSoC. Tincy YOLO combines reduced precision, pruning, and FPGA-based hardware acceleration to speed network performance by 160x, resulting in a YOLO network capable of operating on video frames at 16fps while dissipating a mere 6W.
Live demo of Tincy YOLO at NIPS 2017. Photo credit: Dan Isaacs
Here’s a description of that demo:
TincyYOLO: a real-time, low-latency, low-power object detection system running on a Zynq UltraScale+ MPSoC
By Michaela Blott, Principal Engineer, Xilinx
The Tincy YOLO demonstration shows real-time, low-latency, low-power object detection running on a Zynq UltraScale+ MPSoC device. In object detection, the challenge is to identify objects of interest within a scene and to draw bounding boxes around them, as shown in Figure 1. Object detection is useful in many areas, particularly in advanced driver assistance systems (ADAS) and autonomous vehicles where systems need to automatically detect hazards and to take the right course of action. Tincy YOLO leverages the “you only look once” (YOLO) algorithm, which delivers state-of-the-art object detection. Tincy YOLO is based on the Tiny YOLO convolutional network, which is based on the Darknet reference network. Tincy YOLO has been optimized through heavy quantization and modification to fit into the Zynq UltraScale+ MPSoC’s PL (programmable logic) and Arm Cortex-A53 processor cores to produce the final, real-time demo.
Figure 1: YOLO-recognized people with bounding boxes
To appreciate the computational challenge posed by Tiny YOLO, note that it takes 7 billion floating-point operations to process a single frame. Before you can conquer this computational challenge on an embedded platform, you need to pull many levers. Luckily, the all-programmable Zynq UltraScale+ MPSoC platform provides many levers to pull. Figure 2 summarizes the versatile and heterogeneous architectural options of the Zynq platform.
Figure 2: Tincy YOLO Platform Overview
The vanilla Darknet open-source neural network framework is optimized for CUDA acceleration but its generic, single-threaded processing option can target any C-programmable CPU. Compiling Darknet for the embedded Arm processors in the Zynq UltraScale+ MPSoC left us with a sobering performance of one recognized frame every 10 seconds. That’s about two orders of magnitude of performance away from a useful ADAS implementation. It also produces a very limited live-video experience.
To create Tincy YOLO, we leveraged several of the Zynq UltraScale+ MPSoC’s architectural features in steps, as shown in Figure 3. Our first major move was to quantize the computation of the network’s twelve inner (aka hidden) layers by giving them binary weights and 3-bit activations. We then pruned this network to reduce the total operations to 4.5 GOPs/frame.
Figure 3: Steps used to achieve a 160x speedup of the Tiny YOLO network
We created a reduced-precision accelerator using a variant of the FINN BNN library (https://github.com/Xilinx/BNN-PYNQ) to offload the quantized layers into the Zynq UltraScale+ MPSoC’s PL. These layers account for more than 97% of all the computation within the network. Moving the computations for these layers into hardware bought us a 30x speedup of their specific execution, which translated into an 11x speedup within the overall application context, bringing the network’s performance up to 1.1fps.
We tackled the remaining outer layers by exploiting the NEON SIMD vector capabilities built into the Zynq UltraScale+ MPSoC’s Arm Cortex-A53 processor cores, which gained another 2.2x speedup. Then we cracked down on the complexity of the initial convolution using maxpool elimination for another 2.2x speedup. This work raised the frame rate to 5.5fps. A final rewrite of the network inference to parallelize the CPU computations across all four of the Zynq UltraScale+ MPSoC’s Arm Cortex-A53 processor cores delivered video performance at 16fps.
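The way a 30x layer-level gain becomes only an 11x application-level gain is classic Amdahl's-law arithmetic. Here is a small sketch of the calculation; note that the ~16x it predicts is an ideal upper bound, and the observed 11x presumably reflects data-movement and other overheads (my inference, not a figure from the demo team):

```c
#include <assert.h>

/* Amdahl's law: if a fraction p of the work is accelerated by a
   factor s, the overall speedup is 1 / ((1 - p) + p / s). */
double overall_speedup(double accel_fraction, double accel_gain) {
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_gain);
}
```

With p = 0.97 (the quantized layers are >97% of the computation) and s = 30, the ideal overall speedup works out to roughly 16x, so the reported 11x is in the expected range once real-world overhead is accounted for.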
The result of these changes appears in Figure 4, which demonstrates better recognition accuracy than Tiny YOLO.
High-frequency trading is all about speed, which explains why Aldec’s new reconfigurable HES-HPC-HFT-XCVU9P PCIe card for high-frequency trading (HFT) apps is powered by a Xilinx Virtex UltraScale+ VU9P FPGA. That’s about as fast as you can get with any sort of reprogrammable or reconfigurable technology. The Virtex UltraScale+ FPGA directly connects to all of the board’s critical, high-speed interface ports—Ethernet, QSFP, and PCIe x16—and implements the communications protocols for those standard interfaces as well as the memory control and interface for the board’s three QDR-II+ memory modules. Consequently, there’s no time-consuming chip-to-chip interconnection. Picoseconds count in HFT applications, so the FPGA’s ability to implement all of the card’s logic is a real competitive advantage for Aldec. The new FPGA accelerator is extremely useful for implementing time-sensitive trading strategies such as Market Making, Statistical Arbitrage, and Algorithmic Trading and is compatible with 1U and larger trading systems.
Aldec’s HES-HPC-HFT-XCVU9P PCIe card for high-frequency trading apps—Powered by a Xilinx Virtex UltraScale+ FPGA
The NoLoad platform allows networked systems to share FPGA acceleration resources across the network fabric. For example, Eideticom offers an FPGA-accelerated Reed-Solomon Erasure Coding engine that can supply codes to any storage facility on the network.
Here’s a 6-minute video that explains the Eideticom NoLoad offering with a demo from the Xilinx booth at the recent SC17 conference:
The latest hypervisor to host Wind River’s VxWorks RTOS alongside Linux is the Xen Project Hypervisor, an open-source virtualization platform from the Linux Foundation. DornerWorks has released a version of the Xen Project Hypervisor called Virtuosity (formerly known as the Xen Zynq Distribution) that runs on the Arm Cortex-A53 processor cores in the Xilinx Zynq UltraScale+ MPSoC. Consequently, Wind River has partnered with DornerWorks to provide a Xen Project Hypervisor solution for VxWorks and Linux on the Xilinx Zynq UltraScale+ MPSoC ZCU102 eval kit.
Having VxWorks and Linux running on the same system allows developers to create hybrid software systems that offer the combined advantages of the two operating systems, with VxWorks managing mission-critical functions and Linux managing human-interactive functions and network cloud connection functions.
Wind River has just published a blog about using VxWorks and Linux on the Arm Cortex-A53 processor, concisely titled “VxWorks on Xen on ARM Cortex A53,” written by Ka Kay Achacoso. The blog describes an example system with VxWorks running signal-processing and spectrum-analysis applications. Results are compiled into a JSON string and sent through the virtual network to Ubuntu. On Ubuntu, the Apache2 HTTP server sends the results to a browser, using Node.js and Chart.js to format the data display.
Here’s a block diagram of the system in the Wind River blog:
VxWorks and Linux Hybrid OS System
VxWorks runs as a guest OS on top of the unmodified Virtuosity hypervisor.
For more information about DornerWorks Xen Hypervisor (Virtuosity), see:
There was a live AWS EC2 F1 application-acceleration Developer’s Workshop during Amazon’s re:Invent 2017 last month. If you couldn’t make it, don’t worry. It’s now online and you can run through it in about two hours (I’m told). This workshop teaches you how to develop accelerated applications using the AWS F1 OpenCL flow and the Xilinx SDAccel development environment for the AWS EC2 F1 platform, which uses Xilinx Virtex UltraScale+ FPGAs as high-performance hardware accelerators.
The architecture of the AWS EC2 F1 platform looks like this:
AWS EC2 F1 Architecture
This developer workshop is divided into four modules. Amazon recommends that you complete each module before proceeding to the next.
Connecting to your F1 instance: You will start an EC2 F1 instance based on the FPGA developer AMI and connect to it using a remote desktop client. Once connected, you will confirm that you can execute a simple application on F1.
Experiencing F1 acceleration: AWS F1 instances are ideal for accelerating complex workloads. In this module you will experience the potential of F1 by using FFmpeg to run both a software implementation and an F1-optimized implementation of an H.265/HEVC encoder.
Developing and optimizing F1 applications with SDAccel: You will use the SDAccel development environment to create, profile, and optimize an F1 accelerator. The workshop focuses on the Inverse Discrete Cosine Transform (IDCT), a compute-intensive function used at the heart of all video codecs.
Wrap-up and next steps: Explore next steps to continue your F1 experience after the re:Invent 2017 Developer Workshop.
Access the online AWS EC2 F1 Developer’s Workshop here.
For more information about Amazon’s AWS EC2 F1 instance in Xcell Daily, see:
Closeup view of the QSFP28 ports on Accolade’s ANIC-200Kq Flow Classification and Filtering Adapter
The new ANIC-200Kq adapter differs from the older ANIC-200Ku adapter in its optical I/O ports: the ANIC-200Kq incorporates two QSFP28 optical cages while the ANIC-200Ku incorporates two CFP2 cages. Both the QSFP28 and CFP2 interfaces accept SR4 and LR4 modules. The QSFP28 optical cages put Accolade’s ANIC-200Kq adapter squarely in the 25, 40, and 100GbE arenas, providing data-center architects with additional flexibility when designing their optical networks; QSFP28 is fast becoming the universal form factor for new data-center installations.
For more information in Xcell Daily about Accolade’s fast Flow Classification and Filtering Adapters, see:
Getting the best performance from our embedded-vision systems often requires that we capture frames individually for later analysis in addition to displaying them. Programs such as Octave, Matlab, or ImageJ can analyze these captured frames, allowing us to:
Compare the received pixel values against those expected for a test or calibration pattern.
Examine the Image Histogram, enabling histogram equalization to be implemented if necessary.
Ensure that the integration time of the imager is set correctly for the scene type.
Examine the quality of the image sensor to identify defective pixels—for example dead or stuck-at pixels.
Determine the noise present in the image. This noise will be due both to inherent imager noise sources—for example, fixed-pattern noise, device noise, and dark current—and to system noise coupled in via power supplies and other sources of electrical noise in the system design.
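Several of the checks above start from the image histogram. As a minimal sketch, here is how the histogram of an 8-bit grayscale frame can be computed (the interpretation notes in the comments reflect common practice, not anything specific to a particular design):

```c
#include <stddef.h>

/* Computes the histogram of an 8-bit grayscale frame: hist[v] counts
   how many of the n pixels have value v. A histogram spread across the
   full range suggests good use of the dynamic range; one bunched at
   either end suggests the integration time needs adjusting, and it is
   also the input to histogram equalization. */
void image_histogram(const unsigned char *pixels, size_t n,
                     unsigned long hist[256]) {
    for (int v = 0; v < 256; ++v)
        hist[v] = 0;
    for (size_t i = 0; i < n; ++i)
        hist[pixels[i]]++;
}
```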
This testing typically occurs in the lab as part of hardware design validation and is often performed before the higher levels of the application software are available, so it is often implemented using a bare-metal approach on the processor system.
If we are using VDMA, the logical point to extract the captured data is from the frame buffer in the DDR SDRAM attached to the Zynq SoC’s or MPSoC’s PS. There are two methods we can use to examine the contents of this buffer:
Use the XSCT terminal to read out the frame buffer and post-process it using a TCL script.
Output the frame buffer over RS232, or over Ethernet using the lightweight IP (lwIP) stack, and then capture the image data in a terminal for post-processing using a TCL script.
For this example, I am going to use the UltraZed design we created a few weeks ago to examine PL-to-PS image transfers in the Zynq UltraScale+ MPSoC (see here). This design rather helpfully uses the test pattern generator to transfer a test image to a frame buffer in the PS-attached DDR SDRAM. In this example, we will extract the test pattern and convert it into a bit-map (BMP) file. Once we have the bit-map file, we can read it into the analysis program of choice.
BMP files are very simple. In the most basic format, they consist of a BMP Header, Device Independent Bitmap (DIB) Header, and the pixel array. In this example the pixel array will consist of 24-bit pixels, using eight bits each for blue, green and red pixel values.
It is important to remember two key facts when generating the pixel array. First, each line must be padded with zeros so that its length is a multiple of four bytes, allowing 32-bit word access. Second, the BMP image is stored upside down in the array; that is, the first line of the pixel array is the bottom line of the image.
Combined, both headers equal 54 bytes in length and are structured as shown below:
Bitmap Header Construction
DIB Header Construction
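A C sketch of the two headers, written as a packed struct so that sizeof() matches the 54-byte on-disk layout, looks like this (the field names are my own, loosely following the usual BITMAPFILEHEADER/BITMAPINFOHEADER conventions, and `#pragma pack` assumes a GCC-compatible compiler such as the one in the Xilinx SDK):

```c
#include <stdint.h>

#pragma pack(push, 1)                 /* no padding: on-disk layout */
typedef struct {
    /* BMP file header (14 bytes) */
    uint16_t type;                    /* 'BM' = 0x4D42 */
    uint32_t file_size;               /* 54 + padded pixel-array size */
    uint16_t reserved1;
    uint16_t reserved2;
    uint32_t pixel_offset;            /* 54: array starts after headers */
    /* DIB header, BITMAPINFOHEADER (40 bytes) */
    uint32_t dib_size;                /* 40 */
    int32_t  width;
    int32_t  height;                  /* positive => bottom-up rows */
    uint16_t planes;                  /* 1 */
    uint16_t bits_per_pixel;          /* 24 */
    uint32_t compression;             /* 0 = uncompressed */
    uint32_t image_size;
    int32_t  x_pixels_per_meter;
    int32_t  y_pixels_per_meter;
    uint32_t colors_used;
    uint32_t colors_important;
} bmp_headers_t;
#pragma pack(pop)

/* Each row of 24-bit pixels is zero-padded to a multiple of 4 bytes. */
uint32_t bmp_row_bytes(int32_t width) {
    return ((uint32_t)width * 3u + 3u) & ~3u;
}
```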
Having understood what is involved in creating the file, all we need to do now is gather the pixel data from the PS-attached DDR SDRAM and output it in the correct format.
As we have done several times before in this blog, when we extract the pixel values it is a good idea to double-check that the frame buffer actually contains pixel values. We can examine the contents of the frame buffer using the memory viewer in SDK. However, the view we choose affects how easily we can interpret the pixel values, and hence the frame, because of how the VDMA packs the pixels into the frame buffer.
The default view for the Memory viewer is to display 32-bit words as shown below:
TPG Test Pattern in memory
The data we are working with has a pixel width of 24 bits. To ensure efficient use of the DDR SDRAM memory, the VDMA packs the 24-bit pixels into 32-bit values, splitting pixels across locations. This can make things a little confusing when we look at the memory contents for expected pixel values. Because we know the image is formatted as 8-bit RGB, a better view is to configure the memory display to list the memory contents in byte order. We then know that each group of three bytes represents one pixel.
TPG Test Pattern in memory Byte View
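Viewed as bytes, recovering a given pixel is simple address arithmetic, regardless of how the 24-bit pixels straddle the 32-bit words. A small sketch (the blue-green-red byte order here is an assumption; check how your own pipeline packs pixels):

```c
#include <stdint.h>

typedef struct { uint8_t b, g, r; } pixel24_t;

/* Treats the frame buffer as a flat byte stream and recovers the nth
   24-bit pixel: every group of three consecutive bytes is one pixel,
   even when a pixel is split across two 32-bit memory locations. */
pixel24_t get_pixel(const uint8_t *frame_buffer, uint32_t n) {
    const uint8_t *p = frame_buffer + 3u * n;
    pixel24_t px = { p[0], p[1], p[2] };
    return px;
}
```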
Having confirmed that the frame buffer contains image data, I am going to output the BMP information over the RS232 port for this example. I have selected this interface because it is the simplest interface available on many development boards and it takes only a few seconds to read out even a large image.
The first thing I did in my SDK application was to create a structure that defines the header and sets the values as required for this example:
Header Structure in the application
I then wrote a simple loop that fills three u8 arrays, one for each color element, each the size of the image. I used these arrays with the header information to output the BMP data, taking care to use the correct format for the pixel array. A BMP pixel array orders each pixel’s elements as Blue-Green-Red:
Body of the Code to Output the Image
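Since the code itself appears as a screenshot, here is a generic C sketch of the same idea: serializing the three color arrays into a bottom-up, zero-padded BMP pixel array that is then ready to stream out byte by byte. The array names and structure are mine, not the actual code from the screenshot:

```c
#include <stdint.h>
#include <stddef.h>

/* Builds a BMP pixel array from per-color planes (one byte per pixel,
   width*height entries each). Rows are written bottom-up and padded
   with zeros to a multiple of four bytes, per the BMP format.
   Returns the number of bytes written to out[]. */
size_t build_pixel_array(const uint8_t *blue, const uint8_t *green,
                         const uint8_t *red, int width, int height,
                         uint8_t *out) {
    size_t n = 0;
    int pad = (4 - (width * 3) % 4) % 4;
    for (int y = height - 1; y >= 0; --y) {   /* bottom row first */
        for (int x = 0; x < width; ++x) {
            int i = y * width + x;
            out[n++] = blue[i];               /* BMP order: B, G, R */
            out[n++] = green[i];
            out[n++] = red[i];
        }
        for (int p = 0; p < pad; ++p)
            out[n++] = 0;                     /* row padding */
    }
    return n;
}
```

In the SDK application, each byte of this array would then be printed as two hex characters over the UART.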
To keep the process automated and avoid capturing the output by copying and pasting, I used PuTTY as the terminal program to receive the output data. I selected PuTTY because it can save received data to a log file.
Putty Configuration for logging
Of course, this log file contains an ASCII representation of the BMP. To view it, we need to convert it to a binary file containing the same values. I wrote a simple TCL script to do this: it reads in the ASCII file and writes out the binary BMP file.
TCL ASCII to Binary Conversion Widget
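The TCL script itself isn't reproduced here, but the conversion it performs can be sketched in C as follows. This is my own equivalent of the idea, assuming the log holds two ASCII hex characters per byte, possibly interleaved with line endings:

```c
#include <stdint.h>
#include <stddef.h>

static int hex_val(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;                        /* not a hex digit */
}

/* Converts an ASCII-hex capture back into binary: pairs of hex
   characters become bytes, and anything else (CR, LF, stray terminal
   noise) is skipped. Returns the number of bytes produced. */
size_t hex_to_bin(const char *ascii, size_t len, uint8_t *out) {
    size_t n = 0;
    int hi = -1;
    for (size_t i = 0; i < len; ++i) {
        int v = hex_val(ascii[i]);
        if (v < 0) continue;
        if (hi < 0) hi = v;
        else { out[n++] = (uint8_t)((hi << 4) | v); hi = -1; }
    }
    return n;
}
```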
With this complete, we have the BMP image which we can load into Octave, Matlab, or another tool for analysis. Below is an example of the tartan color-bar test pattern that I captured from the Zynq frame buffer using this method:
Generated BMP captured from the PS DDR
Now, if we can read from the frame buffer, it follows that we can use the same process to write a BMP image into the frame buffer. This can be especially useful when we want to generate overlays and use them with the video mixer.
Members of the Xilinx Zynq UltraScale+ RFSoC device family integrate multi-gigasample/sec RF ADCs and DACs, soft-decision forward error correction (SD-FEC) IP blocks, Xilinx UltraScale architecture programmable-logic fabric, and an Arm Cortex-A53/Cortex-R5 multi-core processing subsystem into one chip. Thanks to this extremely high level of integration, the Zynq UltraScale+ RFSoC is a category killer for the many applications that need “high-speed analog-in, high-speed analog-out, digital-processing-in-the-middle” capabilities. It most assuredly will reduce the size, power, and complexity of traditional antenna structures in many RF applications—especially 5G antenna systems.
As I wrote when the Zynq UltraScale+ RFSoC family won the IET Innovation Award in the Communications category, “There's simply no other device like the Zynq UltraScale+ RFSoC on the market, as suggested by this award.”
Zynq UltraScale+ RFSoC Conceptual Diagram
For more information about the Zynq UltraScale+ RFSoC, see:
The upcoming Xilinx Developer Forum in Frankfurt, Germany on January 9 will feature a hands-on Developer Lab titled “Accelerating Applications with FPGAs on AWS.” During this afternoon session, you’ll gain valuable hands-on experience with the FPGA-accelerated AWS EC2 F1 instance and hear from a special guest speaker from Amazon Web Services. Attendance is limited on a first-come, first-served basis, so you must register here.
For more information about Amazon’s AWS EC2 F1 instance in Xcell Daily, see:
Netcope’s NP4, a cloud-based programming tool, allows you to specify networking behavior using declarations written in the P4 network-specific, high-level programming language for the company’s high-performance, programmable Smart NICs based on Xilinx Virtex UltraScale+ and Virtex-7 FPGAs. The programming process involves the following steps:
Write the P4 code.
Upload your code to the NP4 cloud.
Wait for the application to autonomously translate your P4 code into VHDL and synthesize the FPGA configuration.
Download the firmware bitstream and upload it to the FPGA on your Netcope NIC.
Netcope calls NP4 its “Firmware as a Service” offering. If you are interested in trying NP4, you can request free trial access to the cloud service here.
Netcope Technologies’ NFB-200G2QL 200G Ethernet Smart NIC based on a Virtex UltraScale+ FPGA
For more information about Netcope and P4 in Xcell Daily, see:
Karl Freund’s article titled “Amazon AWS And Xilinx: A Progress Report” appeared on Forbes.com today. Freund is a Moor Insights & Strategy Senior Analyst for Deep Learning and High-Performance Computing (HPC). He describes Amazon’s FPGA-based AWS EC2 F1 instance offering this way:
“…the cloud leader [Amazon] is laying the foundation to simplify FPGA adoption by creating a marketplace for accelerated applications built on Xilinx [Virtex UltraScale+] FPGAs.”
Freund then discusses what’s happened since Amazon announced its AWS EC2 F1 instance a year ago. Here are his seven highlights:
"AWS has now deployed the F1 instances to four regions, with more to come…”
“To support the Asian markets, where AWS has limited presence, Xilinx has won over support from the Alibaba and Huawei cloud operations.” (Well, that one’s not really about Amazon, but let’s keep it in anyway, shall we?)
“Xilinx has launched a global developer outreach program, and has already trained over 1,000 developers [on the use of AWS EC2 F1] at three Xilinx Developer Forums—with more to come.”
“Xilinx has recently released a Machine Learning (ML) Amazon Machine Instance (AMI), bringing the Xilinx Reconfigurable Acceleration Stack (announced last year) for ML Inference to the AWS cloud.”
“Xilinx partner Edico Genome recently achieved a Guinness World Record for decoding human genomes, analyzing 1000 full human genomes on 1000 F1 instances in 2 hours, 25 minutes; a remarkable 100-fold improvement in performance…”
“AWS has added support for Xilinx SDAccel programming environment to all AWS regions for solution developers…”
“Xilinx partner Ryft has built an impressive analytic platform on F1, enabling near-real-time analytics by eliminating data preparation bottlenecks…”
The rest of Freund’s article discusses Ryft’s AWS Marketplace offering in more detail and concludes with this:
“…at least for now, Amazon.com, Huawei, Alibaba, Baidu, and Tencent have all voted for Xilinx.”
For extensive Xcell Daily coverage about the AWS EC2 F1 instance, see:
Like the genie in Aladdin, KORTIQ’s FPGA-based AIScale CNN Accelerator takes pre-trained CNNs (convolutional neural networks)—including industry standards such as ResNet, AlexNet, Tiny Yolo, and VGG-16—compresses them, and fits them into Xilinx’s full range of programmable logic fabrics. Devices such as the Zynq SoC and Zynq UltraScale+ MPSoC have multiple on-chip processors that can provide data to the AIScale CNN Accelerator instantiated in the FPGA fabric and accept its classification output, enabling designs such as single-chip, intelligent industrial or surveillance video cameras.
KORTIQ’s AIScale DeepCompressor compresses the trained CNN and outputs a resulting description file that represents the trained CNN. KORTIQ’s TensorFlow2AIScale translator then prepares the compressed CNN for use with KORTIQ’s AIScale RCC (reconfigurable compute core) IP that performs real-time recognition based on the trained CNN. Because the compressed CNN takes the form of a relatively small description, many such description files can be stored in on- or off-chip memory, making fast switching among trained CNNs quite feasible. Currently, KORTIQ is focusing on embedded vision and computer vision applications such as image classification, object recognition, object tracking, and face recognition.
Here’s a conceptual block diagram of the KORTIQ offering:
The hardware portion of this product, the AIScale RCC, is a coarse-grained, scalable accelerator that can be instantiated in programmable logic—for example, in the FPGA fabric of a Zynq Z-7020 SoC for small-footprint instances of the AIScale RCC. Larger All Programmable devices such as bigger Zynq SoCs and Zynq UltraScale+ MPSoCs can implement more processing blocks within the accelerator core, which in turn makes the accelerator go even faster. You can use this feature to scale system performance up by picking devices with larger FPGA arrays or to reduce power consumption by picking smaller devices.
For more information about the AIScale product family, contact KORTIQ directly.
Titan IC’s newest addition to the AWS Marketplace based on the FPGA-accelerated AWS EC2 F1 instance is the Hyperion F1 10G RegEx File Scan, a high-performance file-search and file-scanning application that can process 1Tbyte of data with as many as 800,000 user-defined regular expressions in less than 15 minutes. The Hyperion F1 10G RegEx File Scan application leverages the processing power of the AWS EC2 F1 instance’s multiple Xilinx Virtex UltraScale+ VU9P FPGAs to speed the scanning of files using complex pattern and string matching, attaining a throughput as high as 10Gbps.
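A quick back-of-the-envelope check shows how the “less than 15 minutes” and “10Gbps” claims line up:

```python
# Scanning 1 Tbyte in 15 minutes implies a sustained rate just
# under the quoted 10Gbps peak throughput.
tbytes = 1e12                      # 1 Tbyte of data, in bytes
seconds = 15 * 60                  # 15 minutes
gbps = tbytes * 8 / seconds / 1e9  # sustained throughput in Gbps
print(round(gbps, 1))              # 8.9
```

So the two figures are consistent: sustaining roughly 8.9Gbps for the full 15 minutes gets you through the terabyte, with the 10Gbps peak providing headroom.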
Here’s a block diagram showing the Hyperion F1 10G RegEx File Scan application running in an AWS EC2 f1.2xlarge instance:
You can get more details about this application here in the AWS Marketplace.
For more information about Amazon’s AWS EC2 F1 instance in Xcell Daily, see:
This month at SC17 in Denver, Nallatech was showing its new 250S+ high-performance SSD-accelerator PCIe card, which uses a Xilinx Kintex UltraScale+ KU15P FPGA to implement an NVMe SSD controller/accelerator and the board’s PCIe Gen4 x8 interface. You can plug SSD cards or NVMe cables into the card’s four M.2 NVMe slots, so you can control as many as four on- or off-board drives with one card. The card comes in 3.84Tbyte and 6.4Tbyte versions with on-board M.2 NVMe SSDs and can control a drive array as large as 25.6Tbytes using NVMe cables.
Nallatech 250S+ NVMe SSD accelerator card based on a Xilinx Kintex UltraScale+ FPGA
Nallatech 250S+ NVMe SSD accelerator card with NVMe cables
Here are the card’s specs:
And here’s a block diagram of the Nallatech 250S+ NVMe accelerator card:
As you can see, the Kintex UltraScale+ FPGA implements the entire logic design on the card, driving the PCIe connector, managing the four attached NVMe SSDs, directly controlling and operating the card’s on-board DDR4-2400 SDRAM cache, and even implementing the card's JTAG interface.
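The capacity claim in the announcement multiplies out neatly, since each of the four M.2 slots can host (or cable out to) a 6.4Tbyte drive:

```python
# Maximum drive-array capacity: four M.2 NVMe slots, 6.4 Tbytes each.
slots = 4
tbytes_per_drive = 6.4
max_array_tbytes = slots * tbytes_per_drive
print(max_array_tbytes)  # 25.6
```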
For more information about the Nallatech 250S+ NVMe accelerator card, please contact Nallatech directly.
Netcope Technologies’ NFB-200G2QL 200G Ethernet Smart NIC based on a Virtex UltraScale+ FPGA
One trick to doing this: using two PCIe Gen3 x16 slots to get packets to/from the server CPU(s). Why two slots? Because Netcope discovered that its 200G Smart NIC PCIe card could transfer about 110Gbps worth of packets over one PCIe Gen3 x16 slot and the theoretical maximum traffic throughput for one such slot is 128Gbps. That means 200Gbps will not pass through the eye of this 1-slot needle. Hence the need for two PCIe slots, which will carry the 200Gbps worth of packets with a comfortable margin. Where’s that second PCIe Gen3 interface coming from? Over a cable attached to the Smart NIC board and implemented in the board’s very same Xilinx Virtex UltraScale+ VU7P FPGA, of course. The company has written a White Paper describing this technique titled “Overcoming the Bandwidth Limitations of PCI Express.”
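The bandwidth figures Netcope cites are straightforward to reproduce. PCIe Gen3 signals at 8 GT/s per lane, so a x16 slot carries 128Gbps raw; 128b/130b encoding alone trims that to about 126Gbps, and further protocol overhead (TLP headers, flow control) accounts for the roughly 110Gbps of usable packet throughput the company measured:

```python
# Raw and encoded bandwidth of a PCIe Gen3 x16 slot.
gt_per_lane = 8                      # PCIe Gen3: 8 GT/s per lane
lanes = 16
raw_gbps = gt_per_lane * lanes       # 128 Gbps raw signaling rate
payload_gbps = raw_gbps * 128 / 130  # ~126 Gbps after 128b/130b encoding
print(raw_gbps, round(payload_gbps, 1))  # 128 126.0
```

Either way you count it, one slot falls well short of 200Gbps, which is exactly why the second Gen3 x16 interface is needed.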
And yes, there’s a short video showing this Netcope sorcery as well:
Here’s a real find, courtesy of LinkedIn: a full semester of lecture notes from Philip Koopman’s new Embedded System Software Engineering grad-level class at CMU. That’s more than 40 modules in PDF form online, and they’re packed with info!
What’s this got to do with Xilinx? Anyone writing code for the multiple Arm Cortex processors in Xilinx Zynq SoCs and Zynq UltraScale MPSoCs or for embedded MicroBlaze processors instantiated in Xilinx FPGA fabrics will benefit from the info in these lecture notes. My hat’s off to Professor Koopman for posting these notes and then posting the URL on LinkedIn.
Oh, he’s got a book too titled “Better Embedded System Software,” which you can get from Amazon for $66.05 or from Professor Koopman’s Web site for $69.99 with included shipping. (The book retails for $140.)
If you’ve got some high-speed RF analog work to do, VadaTech’s new AMC598 and VPX598 Quad ADC/Quad DAC modules appear to be real workhorses. The four 14-bit ADCs (using two AD9208 dual ADCs) operate at 3Gsamples/sec and the four 16-bit DACs (using four AD9162 or AD9164 DACs) operate at 12Gsamples/sec. You’re not going to drive those sorts of data rates over the host bus, so the modules carry local memory in the form of three DDR4 SDRAM banks totaling 20Gbytes. A Xilinx Kintex UltraScale KCU115 FPGA (aka the DSP Monster, the largest Kintex UltraScale FPGA family member, with 5520 DSP slices that give you an immense amount of digital signal processing power to bring to bear on those RF analog signals) manages all of the on-board resources (memory, analog converters, and host bus) and handles the blazingly fast data-transfer rates. That combination lets you create RF waveform generators and advanced RF-capture systems for applications including communications and signal intelligence (COMINT/SIGINT), radar, and electronic warfare. You can develop for these modules using Xilinx tools including the Vivado Design Suite HLx Editions and the Xilinx Vivado System Generator for DSP, which can be used in conjunction with MathWorks’ MATLAB and the Simulink model-based design tool.
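To see why the host bus can’t keep up, multiply out the raw converter sample rates (a rough sketch of aggregate converter bandwidth; DAC interpolation typically reduces the data rate the host actually has to deliver):

```python
# Raw aggregate sample data rates for the AMC598's converters.
adc_gbps = 4 * 3e9 * 14 / 1e9   # four 14-bit ADCs at 3 Gsamples/sec
dac_gbps = 4 * 12e9 * 16 / 1e9  # four 16-bit DACs at 12 Gsamples/sec
print(adc_gbps, dac_gbps)       # 168.0 768.0
```

Hundreds of Gbps of converter traffic is far beyond any host-bus interface, so captures and waveforms have to stage through the module’s local DDR4 banks.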
Here’s a block diagram of the AMC598 module:
VadaTech AMC598 Quad ADC/Quad DAC Block Diagram
And here’s a photo of the AMC598 Quad ADC/Quad DAC module:
VadaTech AMC598 Quad ADC/Quad DAC
Note: Please contact VadaTech directly for more information about the AMC598 and VPX598 Quad ADC/Quad DAC modules.
In the short video below, Xilinx Product Marketing Manager Kamran Khan demonstrates GoogleNet running at 10K images/sec on Amazon’s AWS EC2 F1 using eight Virtex UltraScale+ FPGAs in a 16xlarge configuration. The same video also shows open-source, deep-learning app DeepDetect running in real time, classifying images from a Webcam’s real-time video stream.
For more information about Amazon’s AWS EC2 F1 instance in Xcell Daily, see:
Alpha Data had a nice, real-time 3D ray-casting and volume rendering demo in its booth at last week’s SC17. The demo harnesses three Alpha Data ADM-PCIE-KU3 FPGA cards based on Xilinx Kintex UltraScale KU060 FPGAs. A model of a buckyball volumetric data set is distributed to the three cards, which are then teamed to perform ray casting and 3D color rendering in real time. Of particular note: the three FPGA cards perform the video rendering at 43fps while consuming a little more than 2J. The same task running on an i5 x86 processor renders only 2fps while consuming more than 52J. That’s roughly a 20x speed improvement and a 25x improvement in energy consumption, which multiplies out to a roughly 500x speed/energy improvement.
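The ratios are easy to check from the raw demo numbers as quoted (43fps at about 2J on the three-FPGA setup versus 2fps at about 52J on the i5):

```python
# Ratios from the demo numbers as quoted.
fpga_fps, fpga_energy = 43, 2
cpu_fps, cpu_energy = 2, 52
speedup = fpga_fps / cpu_fps              # 21.5x faster
energy_ratio = cpu_energy / fpga_energy   # 26x less energy consumed
combined = speedup * energy_ratio         # ~560x more frames per joule
print(speedup, energy_ratio, round(combined))
```

Using the unrounded figures, the combined frames-per-joule advantage comes out even a bit higher than the rounded headline numbers suggest.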
Last week at SC17 in Denver, BittWare announced its TeraBox 1432D 1U FPGA server box, a modified Dell PowerEdge C4130 with a new front panel that exposes 32 100GbE QSFP ports from as many as four of the company’s FPGA accelerator cards. (That’s a total front-panel I/O bandwidth of 3.2Tbps!) The new 1U box doubles the I/O rack density with respect to the company’s previous 4U offering.
BittWare’s TeraBox 1432D 1U FPGA Server Box exposes 32 100GbE QSFP ports on its front panel
The TeraBox 1432D server box can be outfitted with four of the company’s XUPP3R boards, which are based on Xilinx Virtex UltraScale+ FPGAs (VU7P, VU9P, or VU11P) and can be fitted with eight QSFPs each: four QSFP cages on the board and four more QSFPs on a daughter card connected to the XUPP3R board via a cable to an FMC connector. This configuration underscores the extreme I/O density and capability of Virtex UltraScale+ FPGAs.
BittWare TeraBox 1432D interior detail
The new BittWare TeraBox 1432D will be available Q1 2018 with the XUPP3R FPGA accelerator board. According to the announcement, BittWare will also release the Xilinx Virtex UltraScale+ VU13P-based XUPVV4 in 2018. This new board will also fit in the TeraBox 1432D.
Here’s a 3-minute video from SC17 with a walkthrough of the TeraBox 1432D 1U FPGA server box by BittWare's GM and VP of Network Products Craig Lund:
Thirty years ago, my friends and co-workers Jim Reyer and KB and I would drive to downtown Denver for a long lunch at a dive Mexican bar officially known as “The Brewery Bar II.” But the guy who owned it, the guy who was always perched on a stool inside the door to meet and seat customers, was named Abe Schur, so we called these trips “Abe’s runs.” This week, I found myself in downtown Denver again at the SC17 supercomputer conference at the Colorado Convention Center. The Brewery Bar II is still in business and only 15 blocks away from the convention center, so on a fine, sunny day, I set out on foot for one more Abe’s run.
I arrived about 45 minutes later.
I walked in the front door and 30 years instantly evaporated. I couldn’t believe it but the place didn’t look any different. The same rickety tables. The same neon signs on the wall. The same bar. The same weird red, flocked wallpaper. It was all the same except my friends weren’t there with me and Abe wasn’t sitting on a stool. I’d already known that he’d passed away many years ago.
Also the same was the crowded state of the place at lunch time. The waitress (they don’t have servers at Abe’s) told me there were no tables available but I could eat at the bar. I took a place at the end of the bar and sat next to a guy typing on a laptop. That wasn’t the same as it was 30 years ago.
The bartender came up and asked me what I wanted to drink. I said I’d not been in for more than 25 years and asked if they still served “Tinys.” A Tiny is Abe’s-speak for a large beer. He said “Of course,” so I ordered a Tiny ice tea. (Not the Long Island variety.)
Then he asked me what I wanted to eat. There’s only one response for that at Abe’s and since they still understood what a Tiny was, I answered without ever touching a menu: “One special relleno, green, with sour cream as a neutron moderator.” He asked me if I wanted the green chile hot, mild, or half and half. Thirty years ago, I’d have ordered hot. My digestive system now has three more decades’ worth of mileage on it, so I ordered half and half. Good thing. The chile’s hotness still registered a 6 or 7 on the Abe’s 1-to-10 scale.
After I ordered, the guy with the laptop next to me said “The rellenos are still as good as they were 25 years ago.” Indeed, that’s what he was eating. The ice had broken with Abe’s hot rellenos and so we started talking. The laptop guy’s name was Scott and he maintains cellular antenna installations on towers and buildings. His company owns a lot of cell tower sites in the Denver area.
Scott is very familiar with the changes taking place in cellular infrastructure and cell-site ownership, particularly with the imminent arrival of the latest 5G gear. He told me that the electronics is migrating up the towers to be as near the antennas as possible. “All that goes up there now is 48V power and a fiber,” he said. Scott is also familiar with the migration of the electronics directly into the antennas.
It turns out that Scott is also a ham radio operator, so we talked about equipment. He’s familiar with and has repaired just about everything that’s been on the market going back to tube-based gear but he was especially impressed with the new all-digital Icom rig he now uses most of the time. Scott’s not an engineer, but hams know a ton about electronics, so we started discussing all sorts of things. He’s especially interested in the newer LDMOS power FETs. So much so that he’s lost interest in using high-voltage transmitter tubes. "Why mess with high voltage when I can get just as far with 50V?" he mused.
I was wearing my Xilinx shirt from the SC17 conference, so I took the opportunity to start talking about the very relevant Xilinx Zynq UltraScale+ RFSoC, which is finding its way into a lot of 5G infrastructure equipment. Scott hadn’t heard about it, which really isn’t surprising considering how new it is, but after I described it he said he looked forward to maybe finding one in his next ham rig.
The special relleno, green with sour cream, arrived and one bite immediately took me back three decades again. The taste had not changed one morsel. Scott and I continued to talk for an hour. Sadly, the relleno didn’t last nearly that long.
Scott and I left Abe's together. He got into his truck and I started the 15-block walk back to the convention center. The conversation and the food formed one of those really remarkable time bubbles you sometimes stumble into—and always at Abe’s.
Communications: Xilinx, for its single-chip 5G antenna interface device that dramatically reduces the size, power and complexity of traditional antenna structures.
Note: E&T is the IET's award-winning monthly magazine and associated website.
Xilinx’s Giles Peckham (center) accepts the IET Innovation Award in the Communications Category for the
Zynq UltraScale+ RFSoC from Professor Will Stewart (IET Communications Policy Panel, on left) and
Rick Edwards (Awards emcee, television presenter, and writer/comic, on right). Photo courtesy of IET.
Classifying the Xilinx Zynq UltraScale+ RFSoC device family, with its integrated multi-gigasample/sec RF ADCs and DACs, soft-decision forward error correction (SD-FEC) IP blocks, UltraScale architecture programmable logic fabric, and Arm Cortex-A53/Cortex-R5 multi-core processing subsystem as an “antenna interface device,” even a “Massive-MIMO Antenna Interface” device, sort of shortchanges the RFSoC in my opinion. The Zynq UltraScale+ RFSoC is a category killer for many, many applications that need “high-speed analog-in, high-speed analog-out, digital-processing-in-the-middle” capabilities due to the devices’ extremely high integration level, though it most assuredly will reduce the size, power, and complexity of traditional antenna structures as cited in the IET Innovation Awards literature. There's simply no other device like the Zynq UltraScale+ RFSoC on the market, as suggested by this award. (If you drill down to here on the IET Innovation Awards Web page, you’ll find that the Zynq UltraScale+ RFSoC was indeed Xilinx’s IET Innovation Awards entry in the communications category this year.)
Zynq UltraScale+ RFSoC Conceptual Diagram
The UK-based IET is one of the world’s largest engineering institutions, with more than 168,000 members in 150 countries, so winning one of the IET’s annual Innovation Awards is an honor not to be taken lightly. This year, the Communications category of the IET Innovation Awards was sponsored by GCHQ (Government Communications Headquarters), the UK’s intelligence and security organization responsible for providing signals intelligence and information assurance to the UK’s government and armed forces.
For more information about the IET Innovation Awards and to see all of the various categories, click here for an animated brochure.
For more information about the Zynq UltraScale+ RFSoC, see:
“Xilinx, Inc. (XLNX) and Huawei Technologies Co., Ltd. today jointly announced the North American debut of the Huawei FPGA Accelerated Cloud Server (FACS) platform at SC17. Powered by Xilinx high performance Virtex UltraScale+ FPGAs, the FACS platform is differentiated in the marketplace today.
“Launched at the Huawei Connect 2017 event, the Huawei Cloud provides FACS FP1 instances as part of its Elastic Compute Service. These instances enable users to develop, deploy, and publish new FPGA-based services and applications through easy-to-use development kits and cloud-based EDA verification services. Both expert hardware developers and high-level language users benefit from FP1 tailored instances suited to each development flow.
"...The FP1 demonstrations feature Xilinx technology which provides a 10-100x speed-up for compute intensive cloud applications such as data analytics, genomics, video processing, and machine learning. Huawei FP1 instances are equipped with up to eight Virtex UltraScale+ VU9P FPGAs and can be configured in a 300G mesh topology optimized for performance at scale."
Huawei’s FP1 FPGA accelerated cloud service is available on the Huawei Public Cloud today. To register for the public beta, click here.