Today, Digilent announced a $299 bundle including its Zybo Z7-20 dev board (based on a Xilinx Zynq Z-7020 SoC), a Pcam 5C 5Mpixel (1080P) color video camera, and a Xilinx SDSoC development environment voucher. (That’s the same price as a Zybo Z7-20 dev board without the camera.) The Zybo Z7 dev board includes a new 15-pin FFC connector that allows the board to interface with the Pcam 5C camera over a 2-lane MIPI CSI-2 and I2C interfaces. (This connector is pin-compatible with the Raspberry Pi’s FFC camera port.) The Pcam 5C camera is based on the Omnivision OV5640 image sensor.
Digilent has created the Pcam 5C + Zybo Z7 demo project to get you started. The demo accepts video from the Pcam 5C camera and passes it to a display via the Zybo Z7’s HDMI port. All IP used in the demo (a D-PHY receiver, a CSI-2 decoder, a Bayer-to-RGB converter, and gamma correction) is free and open source, so you can study exactly how the D-PHY and CSI-2 decoding works and then develop your own embedded-vision products.
If you want this deal, you’d better hurry. The offer expires February 23—three weeks from today.
Photonfocus’ MV1-D1280-L01-3D05-1280-G2, based on a LUXIMA LUX1310 image sensor, has a triangulation rate of 948fps at its full 1280x1024-pixel resolution. Narrow its programmable region of interest to 768x16 pixels, however, and the triangulation rate jumps to a blindingly fast 68,800fps.
Photonfocus’ MV1-D1280-L01-3D05-1280-G2 high-speed 3D camera operates at triangulation rates as fast as 68,800fps
As you might expect, the high-speed interface and processing requirements for the sensors in these two 3D-imaging cameras differ significantly. That is precisely why both cameras, like the others in Photonfocus’ MV1 product line, are based on a Xilinx Spartan-6 LX75T FPGA. As discussed in a prior blog post last September, the Spartan-6 FPGA gives Photonfocus an extremely flexible, programmable, real-time vision-processing platform that serves as a foundation for many different types of cameras with very different imaging sensors and very different sensor interfaces, all operating at high speed.
If you’re designing and debugging high-speed logic, as you might for video or radar applications, then perhaps you could use some fast debugging capability. As in really fast. As in much, much faster than JTAG. EXOSTIV Labs has a solution. It’s called the EXOSTIV FPGA Debug Probe, and it uses the bulletproof, high-speed SerDes ports that are pervasive throughout Xilinx All Programmable device families to extract debug data from running devices with great alacrity.
Here’s a 3-minute video showing the EXOSTIV FPGA Debug Probe communicating with a Xilinx Virtex UltraScale VCU108 Eval Kit, connected through the kit’s high-speed QSFP connector, creating a 50Gbps link between the board and the debugger.
Here’s a second 3-minute video with some additional information. This one shows the EXOSTIV Probe and Dashboard being used to monitor 640 signals in a high-speed video interface design:
You observe the captured data using the EXOSTIV Dashboard, as demonstrated in the above video. The probe and software can handle debug data from as many as 32,768 internal nodes per capture. The mind boggles at the potential complexity being handled here.
According to EXOSTIV, the FPGA Debug Probe and Dashboard give you 200,000x more observability into your design than the tools you might currently be using. That’s a major leap in debugging speed and capability that could save you days or weeks of debugging time.
When you’ve exhausted JTAG’s debug capabilities, consider EXOSTIV.
In a new report titled “Hitting the accelerator: the next generation of machine-learning chips,” Deloitte Global predicted that “by the end of 2018, over 25 percent of all chips used to accelerate machine learning in the data center will be FPGAs and ASICs.” The report then continues: “These new kinds of chips should increase dramatically the use of ML, enabling applications to consume less power and at the same time become more responsive, flexible and capable, which is likely to expand the addressable market.” And later in the Deloitte Global report:
“There will also be over 200,000 FPGA and 100,000 ASIC chips sold for ML applications.”
“…the new kinds of chips may dramatically increase the use of ML, enabling applications to use less power and at the same time become more responsive, flexible and capable, which is likely to expand the addressable market…”
“Total 2018 FPGA chip volume for ML would be a minimum of 200,000. The figure is almost certainly going to be higher, but by exactly how much is difficult to predict.”
These sorts of statements are precisely why Xilinx has rapidly expanded its software offerings for machine-learning development from the edge to the cloud. That includes the reVISION stack for developing responsive and reconfigurable vision systems and the Reconfigurable Acceleration stack for developing and deploying platforms at cloud scale.
XIMEA has added two new high-speed industrial cameras to its xiB-64 family: the 1280x864-pixel CB013 capable of imaging at 3500fps and the 1920x1080-pixel CB019 capable of imaging at 2500fps. As with all digital cameras, the story for these cameras starts with the sensors. The CB013 camera is based on a LUXIMA Technology LUX13HS 1.1Mpixel sensor and the CB019 is based on a LUXIMA Technology LUX19HS 2Mpixel sensor. Both cameras use PCIe 3.0 x8 interfaces capable of 64Gbps sustained transfer rates. Use of the PCIe interface allows a host PC to use DMA for direct transfers of the video stream into the computer’s main memory with virtually no CPU overhead.
Both cameras are also based on a Xilinx Kintex UltraScale KU035 FPGA. Why such a fast FPGA in an industrial video camera? The frame rates and 64Gbps PCIe interface transfer rate are all the explanation you need. The Kintex UltraScale KU035 FPGA has 444K system logic cells and 1700 DSP48E2 slices—ample for handling the different sensors in the camera product line and just about any sort of video processing that’s needed. The Kintex UltraScale FPGA also incorporates two integrated (hardened) PCIe Gen3 IP blocks with sixteen bulletproof 16.3Gbps SerDes transceivers to handle the camera’s PCIe Gen3 interface.
When the engineering team at MADV Technology set out to develop the Madventure 360—the world’s smallest, lightest, and thinnest consumer-grade, 360° 4K video camera—in late 2015, it discovered that an FPGA was the only device capable of meeting all of the project’s goals. As a result, the Madventure camera relies on a Xilinx Spartan-6 FPGA to stitch and synchronize the video streams from two image sensors (one aimed to the front, one aimed to the back) while also performing additional image processing. The tiny, palm-sized camera measures only 3.1x2.6 inches and is only 0.9 inches thick. It sells on Amazon at the moment for $309.99, complete with selfie stick and mini tripod.
MADV Madventure 360° video camera
Now you can hear a little about how MADV Technology created this tiny video wonder in this new Powered by Xilinx video:
Photonfocus has just introduced another industrial video camera in its MV1 industrial camera line: the MV1-D1280-L01-1280-G2, a 1280x1024-pixel, 85fps (948fps in burst mode) camera with a GigE interface. It implements all standard features of the MV1 platform as well as burst mode, MROI (multiple regions of interest), and binning. In burst mode, the camera’s internal 2Gbit burst memory can store image sequences for subsequent analysis. The amount of storage depends on image resolution: 250msec at 1024x1024 pixels, 1000msec at 512x512 pixels. The maximum amount of stored video also varies with the size of the specified ROI.
The MV1-D1280-L01-1280-G2 1280x1024-pixel, 85fps (948fps in burst mode) industrial video camera with a GigE interface
Like many of its existing industrial video cameras, Photonfocus’ MV1-D1280-L01-1280-G2 is based on a platform design that uses a Xilinx Spartan-6 FPGA for a foundation. Use of the Spartan-6 FPGA permitted Photonfocus to create an extremely flexible vision-processing platform that serves as a common hardware foundation for several radically different types of rugged, industrial cameras in multiple camera lines. These cameras use very different imaging sensors to meet a wide variety of application requirements. The different sensors have very different sensor interfaces, which is why using the Spartan-6 FPGA—an interfacing wizard if there ever was one—as a foundation technology is such a good idea.
Here are some of the other Xilinx-based Photonfocus cameras covered previously in Xcell Daily:
Embedded-vision applications present many design challenges. A new ElectronicsWeekly.com article by Michaël Uyttersprot, a Technical Marketing Manager at Avnet Silica, titled “Bringing embedded vision systems to market,” discusses these challenges and some solutions.
First, the article enumerates several design challenges including:
Meeting hi-res image-processing demands within cost, size, and power goals
Handling a variety of image-sensor types
Handling multiple image sensors in one camera
Real-time compensation (lens correction) for inexpensive lenses
Distortion correction, depth detection, dynamic range, and sharpness enhancement
Next, the article discusses Avnet Silica’s various design offerings that help engineers quickly develop embedded-vision designs. Products discussed include:
Adam Taylor has been writing about the use of Xilinx All Programmable devices for image-processing platforms for quite a while and he has wrapped up much of what he knows into a 44-minute video presentation, which appears below. Adam is presenting tomorrow at the Xilinx Developer Forum being held in Frankfurt, Germany.
MathWorks has been advocating model-based design using its MATLAB and Simulink development tools for some time because the design technique allows you to develop more complex software with better quality in less time. (See the MathWorks white paper “How Small Engineering Teams Adopt Model-Based Design.”) Model-based design employs a mathematical and visual approach to developing complex control and signal-processing systems through system-level modeling throughout the development process: from initial design, through design analysis, simulation, and automatic code generation, to verification. These models are executable specifications consisting of block diagrams, textual programs, and other graphical elements. Model-based design encourages rapid exploration of a broader design space than other approaches because you can iterate your design more quickly, earlier in the design cycle. Further, because the models are executable, verification becomes an integral part of the development process at every step. Ideally, this design approach results in fewer (or no) surprises at the end of the design cycle.
Xilinx supports model-based design using MATLAB and Simulink through the new Xilinx Model Composer, a design tool that integrates into the MATLAB and Simulink environments. The Xilinx Model Composer includes libraries with more than 80 high-level, performance-optimized, Xilinx-specific blocks including application-specific blocks for computer vision, image processing, and linear algebra. You can also import your own custom IP blocks written in C and C++, which are subsequently processed by Vivado HLS.
Here’s a block diagram that shows you the relationship among MathWorks’ MATLAB, Simulink, and Xilinx Model Composer:
Finally, here’s a 6-minute video explaining the benefits and use of Xilinx Model Composer:
For the final MicroZed Chronicles blog of the year, I thought I would wrap up with several tips to help when you are creating embedded-vision systems based on Zynq SoC, Zynq UltraScale+ MPSoC, and Xilinx FPGA devices.
Note: These tips and more will be part of Adam Taylor’s presentation at the Xilinx Developer Forum that will be held in Frankfurt, Germany on January 9.
Design in Flexibility from the Beginning
Video Timing Controller used to detect the incoming video standard
Use the flexibility provided by the Video Timing Controller (VTC) and reconfigurable clocking architectures such as Fabric Clocks, MMCM, and PLLs. Using the VTC and associated software running on the PS (processor system) in the Zynq SoC and Zynq UltraScale+ MPSoC, it is possible to detect different video standards from an input signal at run time and to configure the processing and output video timing accordingly. Upon detection of a new video standard, the software running on the PS can configure new clock frequencies for the pixel clock and the image-processing chain along with re-configuring VDMA frame buffers for the new image settings. You can use the VTC’s timing detector and timing generator to define the new video timing. To update the output video timings for the new standard, the VTC can use the detected video settings to generate new output video timings.
Convert input video to the AXI Streaming format as soon as possible to leverage IP and HLS
Converting Data into the AXI Streaming Format
Vivado provides a range of key IP cores that implement most of the functions required by an image-processing chain, such as color filter interpolation, color-space conversion, VDMA, and video mixing. Similarly, Vivado HLS can generate IP cores that use the AXI interconnect to ease integration within Vivado designs. Therefore, to get maximum benefit from the available IP and tool-chain capabilities, we need to convert our incoming video data into the AXI Streaming format as early as possible in the image-processing chain. We can use the Video-In-to-AXI-Stream IP core as an aid here. This core converts video from a parallel format, consisting of synchronization signals and pixel values, into the desired AXI Streaming format. A good tip when using this IP core: the sync inputs do not need to be timed as per a VGA standard; they are edge-triggered. This eases integration with different video formats such as Camera Link, with its frame-valid, line-valid, and pixel-information format.
Use Logic Debugging Resources
Insertion of the ILA monitoring the output stage
Insert integrated logic analyzers (ILAs) at key locations within the image-processing chain. Including these ILAs from day one can help speed commissioning of the design. When implementing an image-processing chain in a new design, I insert ILAs, as a minimum, in the following locations:
Directly behind the receiving IP module—especially if it is a custom block. This ILA enables me to be sure that I am receiving data from the imager / camera.
On the output of the first AXI Streaming IP Core. This ILA allows me to be sure the image-processing core has started to move data through the AXI interconnect. If you are using VDMA, remember you will not see activity on the interconnect until you have configured the VDMA via software.
On the AXI-Streaming-to-Video-Out IP block, if used. I also consider connecting the video timing controller generator outputs to this ILA as well. This enables me to determine if the AXI-Stream-to-Video-Out block is correctly locked and the VTC is generating output timing.
When combined with the test patterns discussed below, insertion of ILAs allows us to zero in faster on any issues in the design which prevent the desired behavior.
Select an Imager / Camera with a Test Pattern capability
Incorrectly received incrementing test pattern captured by an ILA
If possible, when selecting the imaging sensor or camera for a project, choose one that provides a test-pattern video output. You can then use this standard test pattern to ensure the reception, decoding, and image-processing chain is configured correctly because you’ll know exactly what the original video signal looks like. You can combine the imager/camera test pattern with ILAs connected close to the data-reception module to determine whether any issues you are experiencing when displaying an image are internal to the device and its image-processing chain or are the result of the imager/camera configuration.
We can verify the deterministic pixel values of the test pattern using the ILA. If the pixel values, the line length, and the number of lines are as we expect, then the problem is not an imager-configuration issue; more likely, you will find the issue(s) within the receiving module or the image-processing chain. This check is especially important when using complex imagers/cameras that require tens or sometimes hundreds of configuration settings to be applied before an image is obtained.
Include a Test Pattern Generator in your Zynq SoC, Zynq UltraScale+ MPSoC, or FPGA design
Tartan Color Bar Test Pattern
If you include a test-pattern generator within the image-processing chain, you can use it to verify the VDMA frame buffers, output video timing, and decoding prior to the integration of the imager/camera, which reduces integration risk. To gain maximum benefit, configure the test-pattern generator with the same color space and resolution as the final imager, and place it as close to the start of the image-processing chain as possible so that more of the pipeline is verified in advance. When combined with test-pattern capabilities on the imager, this enables faster identification of any problems.
Understand how Video Direct Memory Access stores data in memory
Video Direct Memory Access (VDMA) allows us to use the processor DDR memory as a frame buffer. This enables access to the images from the processor cores in the PS to perform higher-level algorithms if required. VDMA also provides the buffering required for frame-rate and resolution changes. Understanding how VDMA stores pixel data within the frame buffers is critical if the image-processing pipeline is to work as desired when configured.
One of the major points of confusion when implementing VDMA-based solutions centers on the definition of the frame size within memory. The frame buffer is defined in memory by three parameters: Horizontal Size (HSize), Vertical Size (VSize), and Stride. HSize and Stride together define the horizontal extent of the image. While VSize defines the number of lines in the image, HSize defines the length of each line. However, instead of being measured in pixels, the horizontal size is measured in bytes, so we need to know how many bytes make up each pixel.
The Stride defines the distance between the start of one line and the start of the next. For efficient use of the DDR memory, the Stride should at least equal the horizontal size. Increasing the Stride introduces a gap between lines. Implementing this gap can be very useful when verifying that the imager data is received correctly because it provides a clear indication of where each line of the image starts and ends within memory.
These six simple techniques have helped me considerably when creating image-processing examples for this blog and solutions for clients, and they significantly ease both the creation and commissioning of designs.
As I said, this is my last blog of the year. We will continue this series in the New Year. Until then I wish you all happy holidays.
The European Tulipp (Towards Ubiquitous Low-power Image Processing Platforms) project has just published a very short, 102-second video that graphically demonstrates the advantages of FPGA-accelerated video processing over CPU-based processing. In this demo, a Sundance EMC²-DP board based on a Xilinx Zynq Z-7030 SoC accepts real-time video from a 720P camera and applies filters to the video before displaying the images on an HDMI screen.
Here are the performance comparisons for the real-time video processing:
Last week at the NIPS 2017 conference in Long Beach, California, a Xilinx team demonstrated a live object-detection implementation of a YOLO—“you only look once”—network called Tincy YOLO (pronounced “teensy YOLO”) running on a Xilinx Zynq UltraScale+ MPSoC. Tincy YOLO combines reduced precision, pruning, and FPGA-based hardware acceleration to speed network performance by 160x, resulting in a YOLO network capable of operating on video frames at 16fps while dissipating a mere 6W.
Live demo of Tincy YOLO at NIPS 2017. Photo credit: Dan Isaacs
Here’s a description of that demo:
TincyYOLO: a real-time, low-latency, low-power object detection system running on a Zynq UltraScale+ MPSoC
By Michaela Blott, Principal Engineer, Xilinx
The Tincy YOLO demonstration shows real-time, low-latency, low-power object detection running on a Zynq UltraScale+ MPSoC device. In object detection, the challenge is to identify objects of interest within a scene and to draw bounding boxes around them, as shown in Figure 1. Object detection is useful in many areas, particularly in advanced driver assistance systems (ADAS) and autonomous vehicles where systems need to automatically detect hazards and to take the right course of action. Tincy YOLO leverages the “you only look once” (YOLO) algorithm, which delivers state-of-the-art object detection. Tincy YOLO is based on the Tiny YOLO convolutional network, which is based on the Darknet reference network. Tincy YOLO has been optimized through heavy quantization and modification to fit into the Zynq UltraScale+ MPSoC’s PL (programmable logic) and Arm Cortex-A53 processor cores to produce the final, real-time demo.
Figure 1: YOLO-recognized people with bounding boxes
To appreciate the computational challenge posed by Tiny YOLO, note that it takes 7 billion floating-point operations to process a single frame. Before you can conquer this computational challenge on an embedded platform, you need to pull many levers. Luckily, the all-programmable Zynq UltraScale+ MPSoC platform provides many levers to pull. Figure 2 summarizes the versatile and heterogeneous architectural options of the Zynq platform.
Figure 2: Tincy YOLO Platform Overview
The vanilla Darknet open-source neural network framework is optimized for CUDA acceleration but its generic, single-threaded processing option can target any C-programmable CPU. Compiling Darknet for the embedded Arm processors in the Zynq UltraScale+ MPSoC left us with a sobering performance of one recognized frame every 10 seconds. That’s about two orders of magnitude of performance away from a useful ADAS implementation. It also produces a very limited live-video experience.
To create Tincy YOLO, we leveraged several of the Zynq UltraScale+ MPSoC’s architectural features in steps, as shown in Figure 3. Our first major move was to quantize the computation of the network’s twelve inner (aka. hidden) layers by giving them binary weights and 3-bit activations. We then pruned this network to reduce the total operations to 4.5 GOPs/frame.
Figure 3: Steps used to achieve a 160x speedup of the Tiny YOLO network
We created a reduced-precision accelerator using a variant of the FINN BNN library (https://github.com/Xilinx/BNN-PYNQ) to offload the quantized layers into the Zynq UltraScale+ MPSoC’s PL. These layers account for more than 97% of all the computation within the network. Moving the computations for these layers into hardware bought us a 30x speedup of their specific execution, which translated into an 11x speedup within the overall application context, bringing the network’s performance up to 1.1fps.
We tackled the remaining outer layers by exploiting the NEON SIMD vector capabilities built into the Zynq UltraScale+ MPSoC’s Arm Cortex-A53 processor cores, which gained another 2.2x speedup. Then we cracked down on the complexity of the initial convolution using maxpool elimination for another 2.2x speedup. This work raised the frame rate to 5.5fps. A final re-write of the network inference to parallelize the CPU computations across all four of the Zynq UltraScale+ MPSoC’s Arm Cortex-A53 processor cores delivered video performance at 16fps.
The result of these changes appears in Figure 4, which demonstrates better recognition accuracy than Tiny YOLO.
An article titled “Living on the Edge” by Farhad Fallah, one of Aldec’s Application Engineers, on the New Electronics Web site recently caught my eye because it succinctly describes why FPGAs are so darn useful for many high-performance, edge-computing applications. Here’s an example from the article:
“The benefits of Cloud Computing are many-fold… However, there are a few disadvantages to the cloud too, the biggest of which is that no provider can guarantee 100% availability.”
There’s always going to be some delay when you ship data to the cloud for processing. You will need to wait for the answer. The article continues:
“Edge processing needs to be high-performance and in this respect an FPGA can perform several different tasks in parallel.”
Then the article describes the significant performance boost that the Zynq SoC’s FPGA fabric provides:
“The processing was shared between a dual-core ARM Cortex-A9 processor and FPGA logic (both of which reside within the Zynq device) and began with frame grabbing images from the cameras and applying an edge detection algorithm (‘edge’ here in the sense of physical edges, such as objects, lane markings etc.). This is a computational-intensive task because of the pixel-level computations being applied (i.e. more than 2 million pixels). To perform this task on the ARM CPU a frame rate of only 3 per second could have been realized, whereas in the FPGA 27.5 fps was achieved.”
That’s nearly a 10x performance boost thanks to the on-chip FPGA fabric. Could your application benefit similarly?
Getting the best performance from our embedded-vision systems often requires capturing individual frames for later analysis in addition to displaying them. Programs such as Octave, MATLAB, or ImageJ can analyze these captured frames, allowing us to:
Compare the received pixel values against those expected for a test or calibration pattern.
Examine the Image Histogram, enabling histogram equalization to be implemented if necessary.
Ensure that the integration time of the imager is set correctly for the scene type.
Examine the quality of the image sensor to identify defective pixels—for example dead or stuck-at pixels.
Determine the noise present in the image. The noise present will be due to both inherent imager noise sources—for example fixed pattern noise, device noise and dark current—and also due to system noise as coupled in via power supplies and other sources of electrical noise in the system design.
Typically, this testing occurs in the lab as part of hardware design validation and is performed before the higher levels of the application software are available. Such testing is often implemented using a bare-metal approach on the processor system.
If we are using VDMA, the logical point to extract the captured data is from the frame buffer in the DDR SDRAM attached to the Zynq SoC’s or MPSoC’s PS. There are two methods we can use to examine the contents of this buffer:
Use an XSCT terminal to read out the frame buffer and post-process it using a TCL script.
Output the frame buffer over RS232 or Ethernet using the lightweight IP (lwIP) stack and then capture the image data in a terminal for post-processing with a TCL file.
For this example, I am going to use the UltraZed design we created a few weeks ago to examine PL-to-PS image transfers in the Zynq UltraScale+ MPSoC (see here). This design rather helpfully uses the test pattern generator to transfer a test image to a frame buffer in the PS-attached DDR SDRAM. In this example, we will extract the test pattern and convert it into a bit-map (BMP) file. Once we have the bit-map file, we can read it into the analysis program of choice.
BMP files are very simple. In the most basic format, they consist of a BMP Header, Device Independent Bitmap (DIB) Header, and the pixel array. In this example the pixel array will consist of 24-bit pixels, using eight bits each for blue, green and red pixel values.
It is important to remember two key facts when generating the pixel array. First, each line must be padded with zeros so that its length is a multiple of four bytes, allowing for 32-bit word access. Second, the BMP image is stored upside down in the array; that is, the first line of the pixel array is the bottom line of the image.
Combined, both headers equal 54 bytes in length and are structured as shown below:
Bitmap Header Construction
DIB Header Construction
Having understood what is involved in creating the file, all we need to do now is gather the pixel data from the PS-attached DDR SDRAM and output it in the correct format.
As we have done several times before in this blog, when we extract the pixel values it is a good idea to double-check that the frame buffer actually contains pixel values. We can examine the contents of the frame buffer using the memory viewer in SDK. However, the view we choose affects how easily we can interpret the pixel values, and hence the frame, because of the way the VDMA packs pixels into the frame buffer.
The default view for the Memory viewer is to display 32-bit words as shown below:
TPG Test Pattern in memory
The data we are working with has a pixel width of 24 bits. To make efficient use of the DDR SDRAM, the VDMA packs the 24-bit pixels into 32-bit values, splitting some pixels across memory locations. This can make things a little confusing when we look at the memory contents for expected pixel values. Because we know the image is formatted as 8-bit RGB, a better view is to configure the memory display to list the contents in byte order; each group of three bytes then represents one pixel.
TPG Test Pattern in memory Byte View
Having confirmed that the frame buffer contains image data, I am going to output the BMP information over the RS232 port for this example. I have selected this interface because it is the simplest interface available on many development boards and it takes only a few seconds to read out even a large image.
The first thing I did in my SDK application was to create a structure that defines the header and sets the values as required for this example:
Header Structure in the application
I then created a simple loop that creates three u8 arrays, each the size of the image, one for each color element. I then used these arrays with the header information to output the BMP data, taking care to use the correct format for the pixel array. A BMP pixel array organizes the pixel elements as Blue-Green-Red:
Body of the Code to Output the Image
Wanting to keep the process automated, without the need to copy and paste the output, I used PuTTY as the terminal program to receive the output data. I selected PuTTY because it can save received data to a log file.
Putty Configuration for logging
Of course, this log file contains an ASCII representation of the BMP. To view it, we need to convert it to a binary file with the same values. I wrote a simple TCL script that performs the conversion, reading in the ASCII file and writing out the binary BMP file.
TCL ASCII to Binary Conversion Widget
With this complete, we have a BMP image that we can load into Octave, MATLAB, or another tool for analysis. Below is an example of the tartan color-bar test pattern that I captured from the Zynq frame buffer using this method:
Generated BMP captured from the PS DDR
Now if we can read from the frame buffer, then it springs to mind that we can use the same process to write a BMP image into the frame buffer. This can be especially useful when we want to generate overlays and use them with the video mixer.
Like the genie in Aladdin, KORTIQ’s FPGA-based AIScale CNN Accelerator takes pre-trained CNNs (convolutional neural networks)—including industry standards such as ResNet, AlexNet, Tiny Yolo, and VGG-16—compresses them, and fits them into Xilinx’s full range of programmable logic fabrics. Devices such as the Zynq SoC and Zynq UltraScale+ MPSoC have multiple on-chip processors that can provide data to the AIScale CNN Accelerator instantiated in the FPGA fabric and accept its classification output, enabling designs such as single-chip, intelligent industrial or surveillance video cameras.
KORTIQ’s AIScale DeepCompressor compresses the trained CNN and outputs a resulting description file that represents the trained CNN. KORTIQ’s TensorFlow2AIScale translator then prepares the compressed CNN for use with KORTIQ’s AIScale RCC (reconfigurable compute core) IP that performs real-time recognition based on the trained CNN. Because the compressed CNN takes the form of a relatively small description, many such description files can be stored in on- or off-chip memory, making fast switching among trained CNNs quite feasible. Currently, KORTIQ is focusing on embedded vision and computer vision applications such as image classification, object recognition, object tracking, and face recognition.
Here’s a conceptual block diagram of the KORTIQ offering:
The hardware portion of this product, the AIScale RCC, is a coarse-grained, scalable accelerator that can be instantiated in programmable logic (for example, in the FPGA fabric of a Zynq Z-7020 SoC for small-footprint instances of the AIScale RCC). Larger All Programmable devices such as larger Zynq SoCs and Zynq UltraScale+ MPSoCs can implement more processing blocks within the accelerator core, which in turn makes the accelerator go even faster. You can use this scalability to increase system performance by picking devices with larger FPGA arrays or to reduce power consumption by picking smaller devices.
For more information about the AIScale product family, contact KORTIQ directly.
Alpha Data had a nice, real-time 3D ray-casting and volume rendering demo in its booth at last week’s SC17. The demo harnesses three Alpha Data ADM-PCIE-KU3 FPGA cards based on Xilinx Kintex UltraScale KU060 FPGAs. A model of a buckyball volumetric data set is distributed to the three cards, which are then teamed to perform ray casting and 3D color rendering in real time. Of particular note: the three FPGA cards perform the video rendering at 43fps while consuming a little more than 2J. The same task running on an i5 x86 processor renders only 2fps while consuming more than 52J. That’s roughly a 20x speed improvement and a 25x improvement in energy consumption, for a combined speed/energy improvement of about 500x.
“Xilinx, Inc. (XLNX) and Huawei Technologies Co., Ltd. today jointly announced the North American debut of the Huawei FPGA Accelerated Cloud Server (FACS) platform at SC17. Powered by Xilinx high performance Virtex UltraScale+ FPGAs, the FACS platform is differentiated in the marketplace today.
“Launched at the Huawei Connect 2017 event, the Huawei Cloud provides FACS FP1 instances as part of its Elastic Compute Service. These instances enable users to develop, deploy, and publish new FPGA-based services and applications through easy-to-use development kits and cloud-based EDA verification services. Both expert hardware developers and high-level language users benefit from FP1 tailored instances suited to each development flow.
"...The FP1 demonstrations feature Xilinx technology which provides a 10-100x speed-up for compute intensive cloud applications such as data analytics, genomics, video processing, and machine learning. Huawei FP1 instances are equipped with up to eight Virtex UltraScale+ VU9P FPGAs and can be configured in a 300G mesh topology optimized for performance at scale."
Huawei’s FP1 FPGA accelerated cloud service is available on the Huawei Public Cloud today. To register for the public beta, click here.
Ryft is one of several companies now offering FPGA-accelerated applications based on Amazon’s AWS EC2 F1 instance. Ryft was at SC17 in Denver this week with a sophisticated, cloud-based data-analytics demo built on machine learning and deep learning. The demo classified 50,000 images from one data file using a neural network, merged the classified image files with log data from another file to create a super metadata file, and then provided fast image retrieval using many criteria, including image classification, a watch-list match (“look for a gun” or “look for a truck”), or geographic location using the Google Earth database. The entire demo used geographically separated servers containing the files in conjunction with Amazon’s AWS Cloud. The point of the demo was to show Ryft’s ability to provide “FPGAs as a Service” (FaaS) in an easy-to-use manner using any neural network of your choice; any framework including Caffe, TensorFlow, or MXNet; and the popular RESTful API.
This was a complex, live demo and it took Ryft’s VP of Products Bill Dentinger six minutes to walk me through the entire thing, even moving as quickly as possible. Here’s the 6-minute video of Bill giving a very clear explanation of the demo details:
This week, if you were in the Xilinx booth at SC17, you would have seen demos of the new Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit (available in actively and passively cooled versions). Both versions are based on Xilinx Virtex UltraScale+ VU9P FPGAs with 64Gbytes of on-board DDR4 SDRAM.
Xilinx Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit, actively cooled version
Xilinx Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit, passively cooled version
Xilinx had several VCU1525 Acceleration Development Kits running various applications at SC17. Here’s a short, 90-second video from the show floor showing two running applications, edge-to-cloud video analytics and machine learning, narrated by Xilinx Senior Engineering Manager Khang Dao:
Note: For more information about the Xilinx Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit, contact your friendly neighborhood Xilinx or Avnet sales representative.
Now, Swift Navigation has just appeared in the latest “Powered by Xilinx” video. In this video, Swift Navigation’s CEO and Founder Timothy Harris describes his company’s use of the Zynq SoC in the Piksi Multi. The Zynq SoC’s programmable logic processes the incoming signals from multiple global-positioning satellite constellations on multiple frequencies and performs measurements on those signals that are normally performed by dedicated hardware. Then the Zynq SoC’s dual-core Arm Cortex-A9 MPCore processor calculates a physical position from those measurements.
The advantages that hardware and software programmability confer on Swift Navigation’s Piksi Multi include the ability to quickly adapt the GNSS module to specific customer requirements and the ability to update, upgrade, and add features to the module via over-the-air transmissions. These capabilities give Swift Navigation a competitive advantage over designs that employ dedicated hardware.
Mercury Systems recently announced the BuiltSAFE GS Multi-Core Renderer, which runs on the multi-core Arm Cortex-A53 processor inside Xilinx Zynq UltraScale+ MPSoCs. The BuiltSAFE GS Multi-Core Renderer, a high-performance, small-footprint OpenGL library designed to render highly complex 3D graphics in safety-critical embedded systems, is certifiable to DO-178C at the highest design assurance level (DAL-A) as well as the highest Automotive Safety Integrity Level (ASIL D). Because it runs on the CPU, performance of the Multi-Core Renderer scales up with more CPU cores, and it can run on Zynq UltraScale+ CG MPSoC variants that do not include the Arm Mali-400 GPU.
According to Mercury’s announcement:
“Hardware certification requirements (DO-254/ED80) present huge challenges when using a graphics-processing unit (GPU), and the BuiltSAFE GS Multi-Core Renderer is the ideal solution to this problem. It uses a deterministic, processor architecture-independent model optimized for any multicore-based platform to maximize performance and minimize power usage. All of the BuiltSAFE Graphics Libraries use industry standard OpenGL API specifications that are compatible with most new and legacy applications, but it can also be completely tailored to meet any customer requirements.”
Please contact Mercury Systems for more information about the BuiltSAFE GS Multi-Core Renderer.
XIMEA has announced an 8K version of its existing xiB series of PCIe embedded-vision cameras. The new camera, called the CB500, incorporates a CMOSIS CMV50000 sensor with 47.6Mpixel (7920x6004) resolution at 12-bit conversion depth. The camera is available in color or monochrome versions and can stream 30fps in 8-bit/pixel transport mode (22fps in 12-bit/pixel transport mode). Both versions employ a 20Gbps PCIe Gen2 x4 system interface.
Ximea 8K, 47.6Mpixel CB500 xiB embedded-vision camera with PCIe interface
Like many of its cameras, the XIMEA CB500 relies on the programmability of a Xilinx FPGA to accommodate the different interface needs and processing requirements of its sensor. In the case of the CB500, the FPGA is an Artix-7 A75T.
For information about the XIMEA CB500 8K camera, please contact XIMEA directly.
For more information about other XIMEA embedded-vision cameras based on Xilinx All Programmable devices, see:
So far, all of my image-processing examples have used only one sensor and produced one video stream within the Zynq SoC or Zynq UltraScale+ MPSoC PL (programmable logic). However, if we want to work with multiple sensors or overlay information like telemetry on a video frame, we need to do some video mixing.
Video mixing merges several different video streams together to create one output stream. In our designs we can use this merged video stream in several ways:
Tile together multiple video streams to be displayed on a larger display. For example, stitching multiple images into a 4K display.
Blend together multiple image streams as vertical layers to create one final image. For example, adding an overlay or performing sensor fusion.
To do this within our Zynq SoC or Zynq UltraScale+ MPSoC system, we use the Video Mixer IP core, which comes with the Vivado IP library. This IP core mixes as many as eight image streams plus a final logo layer. The image streams are provided to the core via AXI Stream or AXI memory-mapped inputs; you can choose which interface on a stream-by-stream basis. The IP core’s merged-video output uses an AXI Stream.
To demonstrate how we can use the video mixer, I am going to update the MiniZed FLIR Lepton project to use the 10-inch touch display and merge in a second video stream generated by a TPG (test pattern generator). Using the 10-inch touch display gives me a larger screen to demonstrate the concept. This screen has been sitting in my office for a while now, so it’s time it became useful.
Upgrading to the 10-inch display is easy. All we need to do in the Vivado design is increase the pixel clock frequency (fabric clock 2) from 33.33MHz to 71.1MHz. Along with adjusting the clock frequency, we also need to set the ALI3 controller block to 71.1MHz.
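The 71.1MHz figure falls out of the display's total frame timing, blanking included. Here's a quick sketch, assuming CVT-RB-style totals of 1440x823 for the 1280x800 panel at 60Hz (illustrative numbers; check the panel's datasheet for the real blanking intervals):

```c
#include <assert.h>
#include <stdint.h>

/* Pixel clock in Hz for a given total frame size and refresh rate.
 * The totals include blanking, not just the active 1280x800 area. */
uint64_t pixel_clock_hz(uint32_t h_total, uint32_t v_total, uint32_t fps)
{
    return (uint64_t)h_total * v_total * fps;
}
```

With the assumed totals, 1440 x 823 x 60 = 71,107,200 Hz, which is the ~71.1MHz we set for fabric clock 2 and the ALI3 controller.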
Now include a video mixer within the MiniZed Vivado design. Enable layer one and select a streaming interface with global alpha control enabled. Enabling a layer’s global alpha control allows the video mixer to blend that layer with the layer beneath on a pixel-by-pixel basis. Pixels are then merged according to the defined alpha value rather than simply overriding the pixels on the layer beneath. The alpha value for each layer ranges between 0 (transparent) and 1 (opaque) and is defined within an 8-bit register.
Insertion of the Video Mixer and Video Test Pattern Generator
Enabling layer 1, for AXI streaming and Global Alpha Blending
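Conceptually, each 8-bit color channel is blended with the layer beneath using the familiar fixed-point alpha equation. This sketch shows the arithmetic only; the IP core's internal rounding may differ slightly:

```c
#include <assert.h>
#include <stdint.h>

/* Blend one 8-bit color channel of a layer over the pixel beneath it.
 * alpha is the layer's 8-bit global alpha: 0 = transparent, 255 = opaque. */
uint8_t alpha_blend(uint8_t layer_px, uint8_t below_px, uint8_t alpha)
{
    /* out = alpha * layer + (1 - alpha) * below, in 8-bit fixed point */
    return (uint8_t)((alpha * layer_px + (255 - alpha) * below_px) / 255);
}
```

At alpha = 255 the layer fully overrides the pixel beneath; at alpha = 0 it is invisible; values in between mix the two, which is exactly the behavior shown in the merged-layer screenshots below.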
The FLIR camera provides the first image stream. However, we need a second image stream for this example, so we’ll instantiate a video TPG core and connect its output to the video mixer’s layer 1 input. Be sure to clock both the video mixer and the test pattern generator with the high-speed video clock used in the image-processing chain. Build the design and export it to SDK.
We configure the video mixer in SDK using the xv_mix.h API, which provides the functions needed to control the video mixer.
The principle of the mixer is simple. There is a master layer, and you declare its vertical and horizontal size using the API. For this example, using the 10-inch display, we set the size to 1280 pixels by 800 lines. We can then fill this image space with the layers, either tiling or overlapping them as desired for our application.
Each layer has an alpha register to control blending along with X and Y origin registers and height and width registers. These registers tell the mixer how it should create the final image. Positional location for a layer that does not fill the entire display area is referenced from the top left of the display. Here’s an illustration:
Video Mixing Layers, concept. Layer 7 is a reduced-size image in this example.
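A simple way to picture the positional registers: each layer is a rectangle referenced from the display's top-left corner, and it must lie entirely inside the master layer to be fully visible. The struct and field names below are illustrative, not the IP core's actual register map:

```c
#include <assert.h>
#include <stdint.h>

/* A mixer layer's position and size, referenced from the display's
 * top-left corner (names are illustrative, not the IP's registers). */
typedef struct {
    uint32_t x, y;          /* origin of the layer on the display */
    uint32_t width, height; /* layer dimensions in pixels/lines   */
} layer_pos;

/* A layer must lie entirely inside the master layer to display fully. */
int layer_fits(const layer_pos *l, uint32_t master_w, uint32_t master_h)
{
    return (l->x + l->width  <= master_w) &&
           (l->y + l->height <= master_h);
}
```

A sanity check like this is worth running on each layer's settings before writing them to the core, since a layer hanging off the edge of the 1280x800 master area won't display as intended.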
To demonstrate the effects of layering in action, I used the test pattern generator to create a 200x200-pixel checkerboard pattern with the video mixer’s TPG-layer alpha set to opaque so that it overrides the FLIR image. Here’s what that looks like:
Test Pattern FLIR & Test Pattern Generator Layers merged, test pattern has higher alpha.
Then I set the alpha to a lower value, enabling merging of the two layers:
Test Pattern FLIR & Generator Layers merged, test pattern alpha lower.
We can also use the video mixer to tile images as shown below. I added three more TPGs to create this image.
Four tiled video streams using the mixer
The video mixer is a good tool to have in our toolbox when creating image-processing or display solutions. It is very useful if we want to merge the outputs of multiple cameras working in different parts of the electromagnetic spectrum. We’ll look at this sort of thing in future blogs.
You can find the example source code on GitHub.
Today, Xilinx announced plans to invest $40M to expand research and development engineering work in Ireland on artificial intelligence and machine learning for strategic markets including cloud computing, embedded vision, IIoT (industrial IoT), and 5G wireless communications. The company already has active development programs in these categories and today’s announcement signals an acceleration of development in these fields. The development was formally announced in Dublin today by The Tánaiste (Deputy Prime Minister of Ireland) and Minister for Business, Enterprise and Innovation, Frances Fitzgerald T.D., and by Kevin Cooney, Senior Vice President, Chief Information Officer and Managing Director EMEA, Xilinx Inc. The new investment is supported by the Irish government through IDA Ireland.
Xilinx first established operations in Dublin in 1995. Today, the company employs 350 people at its EMEA headquarters in Citywest, Dublin, where it operates a research, product development, engineering, and an IT center along with centralized supply, finance, legal, and HR functions. Xilinx also has R&D operations in Cork, which the company established in 2001.
According to Yin Qi, Megvii’s chief exec, his company is developing a “brain” for visual computing. Beijing-based Megvii develops some of the most advanced image-recognition and AI technology in the world. The company’s Face++ facial-recognition algorithms run on the cloud and in edge devices such as the MegEye-C3S security camera, which runs Face++ algorithms locally and can capture more than 100 facial images in each 1080P video frame at 30fps.
MegEye-C3S Facial-Recognition Camera based on Megvii’s Face++ technology
In its early days, Megvii ran its algorithms on GPUs, but quickly discovered the high cost and power disadvantages of GPU acceleration. The company switched to the Xilinx Zynq SoC and is able to run deep convolution on the Zynq SoC’s programmable logic while quantitative analysis runs simultaneously on the Zynq SoC’s Arm Cortex-A9 processors. The heterogeneous processing resources of the Zynq SoC allow Megvii to optimize the performance of its recognition algorithms for lowest cost and minimum power consumption in edge equipment such as the MegEye-C3S camera.
MegEye-C3S Facial-Recognition Camera exploded diagram showing Zynq SoC (on right)
Here’s a 5-minute video where Megvii’s Sam Xie, GM of Branding and Marketing, and Jesson Liu, Megvii’s hardware leader, explain how their company has been able to attract more than 300,000 developers to the Face++ platform and how the Xilinx Zynq SoC has aided the company in developing the most advanced recognition products in the cloud and on the edge:
Yesterday, DeePhi Tech announced several new deep-learning products at an event held in Beijing. All of the products are based on DeePhi’s hardware/software co-design technologies for neural network (NN) and AI development and use deep compression and Xilinx All Programmable technology as a foundation. Central to all of these products is DeePhi’s Deep Neural Network Development Kit (DNNDK), an integrated framework that permits NN development using popular tools and libraries such as Caffe, TensorFlow, and MXNet to develop and compile code for DeePhi’s DPUs (Deep Learning Processor Units). DeePhi has developed two FPGA-based DPUs: the Aristotle Architecture for convolutional neural networks (CNNs) and the Descartes Architecture for Recurrent Neural Networks (RNNs).
DeePhi’s DNNDK Design Flow
DeePhi’s Aristotle Architecture
DeePhi’s Descartes Architecture
DeePhi’s approach to NN development using Xilinx All Programmable technology uniquely targets the company’s carefully optimized, hand-coded DPUs instantiated in programmable logic. In the new book “FPGA Frontiers,” published by Next Platform Press, DeePhi’s co-founder and CEO Song Yao describes using his company’s DPUs: “The algorithm designer doesn’t need to know anything about the underlying hardware. This generates instruction instead of RTL code, which leads to compilation in 60 seconds.” The benefits are rapid development and the ability to concentrate on NN code development rather than the mechanics of FPGA compilation, synthesis, and placement and routing.
Part of yesterday’s announcement included two PCIe boards oriented towards vision processing that implement DeePhi’s Aristotle Architecture DPU. One board, based on the Xilinx Zynq Z-7020 SoC, handles real-time CNN-based video analysis including facial detection for more than 30 faces simultaneously for 1080p, 18fps video using only 2 to 4 watts. The second board, based on a Xilinx Zynq UltraScale+ ZU9 MPSoC, supports simultaneous, real-time video analysis for 16 channels of 1080p, 18fps video and draws only 30 to 60 watts.
DeePhi PCIe NN board based on a Xilinx Zynq Z-7020 SoC
DeePhi PCIe NN board based on a Xilinx Zynq UltraScale+ ZU9 MPSoC
For more information about these products, please contact DeePhi Tech directly.
RedZone Robotics’ Solo—a camera-equipped, autonomous sewer-inspection robot—gives operators a detailed, illuminated view of the inside of a sewer pipe by crawling the length of the pipe and recording video of the conditions it finds inside. A crew can deploy a Solo robot in less than 15 minutes and then move to another site to launch yet another Solo robot, thus conducting several inspections simultaneously and cutting the cost per inspection. The treaded robot traverses the pipeline autonomously and then returns to the launch point for retrieval. If the robot encounters an obstruction or blockage, it attempts to negotiate the problem three times before aborting the inspection and returning to its entry point. The robot fits into pipes as small as eight inches in diameter and even operates in pipes that contain some residual waste water.
Justin Starr, RedZone’s VP of Technology, says that the Solo inspection robot uses its on-board Spartan FPGA for image processing and for AI. Image-processing algorithms compensate for lens aberrations and also perform a level of sensor fusion for the robot’s multiple sensors. “Crucial” AI routines in the Spartan FPGA help the robot keep track of where it is in the pipeline and tell the robot what to do when it encounters an obstruction.
Starr also says that RedZone is already evaluating Xilinx Zynq devices to extend the robot’s capabilities. “It’s not enough for the Solo to just grab information about what it sees, but let’s actually look at those images. Let’s have the Solo go through that inspection data in real time and generate a preliminary report of what it saw. It used to be the stuff of science fiction but now it’s becoming reality.”
Want to see the Solo in action? Here’s a 3-minute video:
Here’s a hot-off-the-camera, 3-minute video showing a demonstration of two ZCU106 dev boards based on the Xilinx Zynq UltraScale+ ZU7EV MPSoCs with integrated H.265 hardware encoders and decoders. The first ZCU106 board in this demo processes an input stream from a 4K MIPI video camera by encoding it, packetizing it, and then transmitting it over a GigE connection to the second board, which depacketizes, decodes, and displays the video stream on a 4K monitor. Simultaneously, the second board performs the same encoding, packetizing, and transmission of another video stream from a second 4K MIPI camera to the first ZCU106 board, which displays the second video stream on another 4K display.
Note that the integrated H.265 hardware codecs in the Zynq UltraScale+ ZU7EV MPSoC can handle as many as eight simultaneous video streams in both directions.
Here’s the short video demo of this system in action:
For more information about the ZCU106 dev board and the Zynq UltraScale+ EV MPSoCs, contact your friendly, neighborhood Xilinx or Avnet sales representative.
One ongoing area we have been examining is image processing. We’ve looked at the algorithms and how to capture images from different sources. A few weeks ago, we looked at the different methods we could use to receive HDMI data and followed up with an example using an external CODEC (P1 & P2). In this blog we are going to look at using internal IP cores to receive HDMI images in conjunction with the Analog Devices AD8195 HDMI buffer, which equalizes the line. Equalization is critical when using long HDMI cables.
Nexys board, FMC HDMI and the Digilent PYNQ-Z1
To do this I will be using the Digilent FMC HDMI card, which provisions one of its channels with an AD8195. The AD8195 on the FMC HDMI card needs a 3v3 supply, which is not available on the ZedBoard unless I break out my soldering iron. Instead, I broke out my Digilent Nexys Video trainer board, which comes fitted with an Artix-7 FPGA and an FMC connector. This board has built-in support for HDMI RX and TX, but the HDMI RX path on this board supports only 1m of HDMI cable, while the AD8195 on the FMC HDMI card supports cable runs of up to 20m, far more useful in many distributed applications. So we’ll add the FMC HDMI card.
First, I instantiated a MicroBlaze soft microprocessor system in the Nexys Video card’s Artix-7 FPGA to control the simple image-processing chain needed for this example. Of course, you can implement the same approach to the logic design that I outline here using a Xilinx Zynq SoC or Zynq UltraScale+ MPSoC. The Zynq PS simply replaces the MicroBlaze.
The hardware design we need to build this system is:
MicroBlaze controller with local memory, AXI UART, MicroBlaze Interrupt controller, and DDR Memory Interface Generator.
DVI2RGB IP core to receive the HDMI signals and convert them to a parallel video format.
Video Timing Controller, configured for detection.
ILA connected between the VTC and the DVI2RGB cores, used for verification.
Clock Wizard used to generate a 200MHz clock, which supplies the DDR MIG and DVI2RGB cores. All other cores are clocked by the MIG UI clock output.
Two 3-bit GPIO modules. The first module sets VADJ to 3v3 on the HDMI FMC. The second module enables the AD8195 and provides hot-plug detection.
The final step in this hardware build is to map the interface pins from the AD8195 to the FPGA’s I/O pins through the FMC connector. We’ll use the TMDS_33 SelectIO standard for the HDMI clock and data lanes.
Once the hardware is built, we need to write some simple software to perform the following:
Disable the VADJ regulator using pin 2 on the first GPIO port.
Set the desired output voltage on VADJ using pins 0 & 1 on the first GPIO port.
Enable the VADJ regulator using pin 2 on the first GPIO port.
Enable the AD8195 using pin 0 on the second GPIO port.
Enable pre-equalization using pin 1 on the second GPIO port.
Assert the Hot-Plug Detection signal using pin 2 on the second GPIO port.
Read the registers within the VTC to report the modes and status of the video received.
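The GPIO steps above reduce to a handful of bit masks written to the two 3-bit ports. On the target, these values would go out through the Xilinx GPIO driver (for example, XGpio_DiscreteWrite); the pin assignments follow the list above, but the actual 3v3 voltage-select code is an assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions on the two 3-bit GPIO ports (from the steps above). */
#define VADJ_SEL_MASK  0x3u       /* port 0, pins 0-1: VADJ voltage select */
#define VADJ_EN        (1u << 2)  /* port 0, pin 2: VADJ regulator enable  */
#define AD8195_EN      (1u << 0)  /* port 1, pin 0: AD8195 enable          */
#define PRE_EQ_EN      (1u << 1)  /* port 1, pin 1: pre-equalization       */
#define HPD_ASSERT     (1u << 2)  /* port 1, pin 2: hot-plug detect        */

/* Value for the first GPIO port: voltage-select code plus the regulator
 * enable bit. The select code for 3v3 is an assumption, not board fact. */
uint32_t vadj_port_value(uint32_t sel_code, int enable)
{
    uint32_t v = sel_code & VADJ_SEL_MASK;
    if (enable)
        v |= VADJ_EN;
    return v;
}

/* Value for the second GPIO port with the AD8195 fully enabled. */
uint32_t ad8195_port_value(void)
{
    return AD8195_EN | PRE_EQ_EN | HPD_ASSERT;
}
```

Following the sequence in the list, you would first write the select code with the enable bit clear, then rewrite it with the enable bit set, and finally bring up the second port.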
To test this system, I used a Digilent PYNQ-Z1 board to generate different video modes. The first step in verifying that this interface is working is to use the ILA to check that the pixel clock is received and that its DLL is locked, along with generating horizontal and vertical sync signals and the correct pixel values.
Provided the sync signals and pixel clock are present, the VTC will be able to detect and classify the video mode. The application software will then report the detected mode via the terminal window.
ILA Connected to the DVI to RGB core monitoring its output
Software running on the Nexys Video detecting SVGA mode (800 pixels by 600 lines)
With the correct video mode being detected by the VTC, we can now configure a VDMA write channel to move the image from the logic into a DDR frame buffer.
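Before configuring the VDMA write channel, it helps to size the frame buffer for the detected mode. A small sketch, assuming 32-bit (4-byte) pixels and a stride equal to the line width (real designs often round the stride up for alignment):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per line (stride) for the VDMA write channel.
 * bytes_per_px is 4 for a 32-bit pixel format. */
uint32_t fb_stride(uint32_t h_active, uint32_t bytes_per_px)
{
    return h_active * bytes_per_px;
}

/* Total DDR frame-buffer size for one frame. */
uint32_t fb_bytes(uint32_t h_active, uint32_t v_active, uint32_t bytes_per_px)
{
    return fb_stride(h_active, bytes_per_px) * v_active;
}
```

For the SVGA mode detected above, an 800x600 frame at 4 bytes per pixel needs just under 2MB of DDR per frame, so even triple buffering fits comfortably in the board's DDR memory.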