We have detected your current browser version is not the latest one. Xilinx.com uses the latest web technologies to bring you the best online experience possible. Please upgrade to a Xilinx.com supported browser:Chrome, Firefox, Internet Explorer 11, Safari. Thank you!

Compute Acceleration: GPU or FPGA? New White Paper gives you numbers

by Xilinx Employee ‎06-14-2017 02:24 PM - edited ‎06-14-2017 02:28 PM (3,431 Views)


Cloud computing and application acceleration for a variety of workloads including big-data analytics, machine learning, video and image processing, and genomics are big data-center topics and if you’re one of those people looking for acceleration guidance, read on. If you’re looking to accelerate compute-intensive applications such as automated driving and ADAS or local video processing and sensor fusion, this blog post’s for you to. The basic problem here is that CPUs are too slow and they burn too much power. You may have one or both of these challenges. If so, you may be considering a GPU or an FPGA as an accelerator in your design.


How to choose?


Although GPUs started as graphics accelerators, primarily for gamers, a few architectural tweaks and a ton of software have made them suitable as general-purpose compute accelerators. With the right software tools, it’s not too difficult to recode and recompile a program to run on a GPU instead of a CPU. With some experience, you’ll find that GPUs are not great for every application workload. Certain computations such as sparse matrix math don’t map onto GPUs well. One big issue with GPUs is power consumption. GPUs aimed at server acceleration in a data-center environment may burn hundreds of watts.


With FPGAs, you can build any sort of compute engine you want with excellent performance/power numbers. You can optimize an FPGA-based accelerator for one task, run that task, and then reconfigure the FPGA if needed for an entirely different application. The amount of computing power you can bring to bear on a problem is scary big. A Virtex UltraScale+ VU13P FPGA can deliver 38.3 INT8 TOPS (that’s tera operations per second) and if you can binarize the application, which is possible with some neural networks, you can hit 500TOPS. That’s why you now see big data-center operators like Baidu and Amazon putting Xilinx-based FPGA accelerator cards into their server farms. That’s also why you see Xilinx offering high-level acceleration programming tools like SDAccel to help you develop compute accelerators using Xilinx All Programmable devices.


For more information about the use of Xilinx devices in such applications including a detailed look at operational efficiency, there’s a new 17-page White Paper titled “Xilinx All Programmable Devices: A Superior Platform for Compute-Intensive Systems.”






Although humans once served as the final inspectors for pcbs, today’s component dimensions and manufacturing volumes mandate the use of camera-based automated optical inspection (AOI) systems. Amfax has developed a 3D AOI system—the a3Di—that uses two lasers to make millions of 3D measurements with better than 3μm accuracy. One of the company’s customers uses an a3Di system to inspect 18,000 assembled pcbs per day.


The a3Di control system is based on a National Instruments (NI) cRIO-9075 CompactRIO controller—with an integrated Xilinx Virtex-5 LX25 FPGA—programmed with NI’s LabVIEW systems engineering software. The controller manages all aspects of the a3Di AOI system including monitoring and control of:



  • Machine motors
  • Control switches
  • Optical position sensors
  • Inverters
  • Up and downstream SMEMA (Surface Mount Equipment Manufacturers Association) conveyor control
  • Light tower
  • Pneumatics
  • Operator manual controls for width PCB control
  • System emergency stop



The system provides height-graded images like this:




Amfax 3D PCB image.jpg 


3D Image of a3Di’s Measurement Data: Colors represent height, with Z resolution down to less than a micron. The blue section at the top indicates signs of board warp. Laser etched component information appears on some of the ICs.




The a3Di system then compares this image against a stored golden reference image to detect manufacturing defects.


Amfax says that it has found the CompactRIO system to be “CompactRIO system has proven to be a dependable, reliable, and cost-effective.” In addition, the company found it could get far better timing resolution with the CompactRIO system than the 1msec resolution usually provided by PLC controllers.



This project was a 2017 NI Engineering Impact Award Finalist in the Electronics and Semiconductor category last month at NI Week. It is documented in this NI case study.



Linc, Perrone Robotics’ autonomous Lincoln MKZ automobile, took a drive around the Perrone paddock at the TU Automotive autonomous vehicle show in Detroit last week and Dan Isaacs, Xilinx’s Director Connected Systems in Corporate Marketing, was there to shoot photos and video. Perrone’s Linc test vehicle operates autonomously using the company’s MAX (Mobile Autonomous X), a “comprehensive full-stack, modular, real-time capable, customizable, robotics software platform for autonomous (self-driving) vehicles and general purpose robotics.” MAX runs on multiple computing platforms including one based on an Iveia controller, which is based on an Iveia Atlas SOM, which in turn is based on a Xilinx Zynq UltraScale+ MPSoC. The Zynq UltraScale+ MPSoC handles the avalanche of data streaming from the vehicle’s many sensors to ensure that the car travels the appropriate path and avoids hitting things like people, walls and fences, and other vehicles. That’s all pretty important when the car is driving itself in public. (For more information about Perrone Robotics’ MAX, see “Perrone Robotics builds [Self-Driving] Hot Rod Lincoln with its MAX platform, on a Zynq UltraScale+ MPSoC.”)


Here’s a photo of Perrone’s sensored-up Linc autonomous automobile in the Perrone Robotics paddock at TU Automotive in Detroit:



Perrone Robotics Linc Autonomous Driving Lincoln MKZ.jpg 



And here’s a photo of the Iveia control box with the Zynq UltraScale+ MPSoC inside, running Perrone’s MAX autonomous-driving software platform. (Note the controller’s small size and lack of a cooling fan):



Iveia Autonomous Driving Controller for Perrone Robotics.jpg 



Opinions about the feasibility of autonomous vehicles are one thing. Seeing the Lincoln MKZ’s 3800 pounds of glass, steel, rubber, and plastic being controlled entirely by a little silver box in the trunk, that’s something entirely different. So here’s the video that shows Perrone Robotics’ Linc in action, driving around the relative safety of the paddock while avoiding the fences, pedestrians, and other vehicles:




When someone asks where Xilinx All Programmable devices are used, I find it a hard question to answer because there’s such a very wide range of applications—as demonstrated by the thousands of Xcell Daily blog posts I’ve written over the past several years.


Now, there’s a 5-minute “Powered by Xilinx” video with clips from several companies using Xilinx devices for applications including:


  • Machine learning for manufacturing
  • Cloud acceleration
  • Autonomous cars, drones, and robots
  • Real-time 4K, UHD, and 8K video and image processing
  • VR and AR
  • High-speed networking by RF, LED-based free-air optics, and fiber
  • Cybersecurity for IIoT


That’s a huge range covered in just five minutes.


Here’s the video:






A wide range of commercial, government, and social applications require precise aerial imaging. These application range from the management of high-profile, international-scale humanitarian and disaster relief programs to everyday commercial use—siting large photovoltaic arrays for example. Satellites can capture geospatial imagery across entire continents, often at the expense of spatial resolution. Satellites also lack the flexibility to image specific areas on demand. You must wait until the satellite is above the real estate of interest. Spookfish Limited in Australia along with ICON Technologies have developed the Spookfish Airborne Imaging Platform (SAIP) based on COTS (commercial off-the-shelf) products including National Instruments’ (NI’s) PXIe modules and LabVIEW systems engineering software that can capture precise images with resolutions of 6cm/pixel to better than 1cm/pixel from a light aircraft cruising at 160 knots at altitudes to 12,000 feet.


The 1st-generation SAIP employs one or more cameras installed in a tube attached to the belly of a light aircraft. Success with the initial prototype led to the development of a 2nd-generation design with two camera tubes. The system has continued to grow and now accommodates as many as three camera tubes with as many as four cameras per tube.


The multiple cameras must be steered precisely in continuous, synchronized motion while recording camera angles, platform orientation, and platform acceleration. All of this data is used to post-process the image data. At typical operating altitudes and speeds, the cameras must be steered with millidegree precision and the camera angles and platform position must be logged with near-microsecond accuracy and precision. Spookfish then uses a suite of open-source and proprietary computer-vision and photogrammetry techniques to process the imagery, which results in orthophotos, elevation data, and 3D models.


Here’s a block diagram of the Spookfish SAIP:



Spookfish SAIP Block diagram.jpg 




The NI PXIe system in the SAIP design consists of a PXIe-1082DC chassis, a PXIe-8135 RT controller, a PXI-6683H GPS/PPS synchronization module, a PXIe-6674T clock and timing module, a PXIe-7971R FlexRIO FPGA Module, and a PXIe-4464 sound and vibration module. (The PXIe7971R FlexRIO module is based on a Xilinx Kintex-7 325T FPGA. The PXI-6683H synchronization module and the PXIe-6674T clock and timing module are both based on Xilinx Virtex-5 FPGAs.)


Here’s an aerial image captured by an SAIP system at 6cm/pixel:



Spookfish SAIP image at 6cm per pixel.jpg 



And here’s a piece of an aerial image taken by an SAIP system at 1.5cm/pixel:



Spookfish SAIP image at 6cm per pixel.jpg 




During its multi-generation development, the SAIP system quickly evolved far beyond its originally envisioned performance specification as new requirements arose. For example, initial expectations were that logged data would only need to be tagged with millisecond accuracy. However, as the project progressed, ICON Technologies and NI improved the system’s timing accuracy and precision by three orders of magnitude.


NI’s FPGA-based FlexRIO technology was also crucial in meeting some of these shifting performance targets. Changing requirements pushed the limits of some of the COTS interfaces, so custom FlexRIO interface implementations optimized for the tasks were developed as higher-speed replacements. Often, NI’s FlexRIO technology is employed for the high-speed computation available in the FPGA’s DSP slices, but in this case it was the high-speed programmable I/O that was needed.


Spookfish and ICON Technologies are now developing the next-generation SAIP system. Now that the requirements are well understood, they’re considering a Xilinx FPGA-based or Zynq-based NI CompactRIO controller as a replacement for the PXIe system. NI’s addition of TSN (time-sensitive networking) to the CompactRIO family’s repertoire makes such a switch possible. (For more information about NI’s TSN capabilities, see “IOT and TSN: Baby you can drive my [slot] car. TSN Ethernet network drives slot cars through obstacles at NI Week.”)




This project was a 2017 NI Engineering Impact Award finalist in the Energy category last month at NI Week. It is documented in this NI case study.


DFC Design’s Xenie FPGA module product family pairs a Xilinx Kintex-7 FPGA (a 70T or a 160T) with a Marvell Alaska X 88X3310P 10GBASE-T PHY on a small board. The module breaks out six of the Kintex-7 FPGA’s 12.5Gbps GTX transceivers and three full FPGA I/O banks (for a total of 150 single-ended I/O or up to 72 differential pairs) with configurable I/O voltage to two high-speed, high-pin-count, board-to-board connectors. A companion Xenie BB Carrier board accepts the Xenie FPGA board and breaks out the high-speed GTX transceivers into a 10GBASE-T RJ45 connector, an SFP+ optical cage, and four SDI connectors (two inputs and two outputs).


Here’s a block diagram and photo of the Xenia FPGA module:





DFC Design Xenia FPGA Module.jpg 



Xenia FPGA module based on a Xilinx Kintex-7 FPGA




And here’s a photo of the Xenie BB Carrier board that accepts the Xenia FPGA module:




DFC Design Xenia BB Carrier Board.jpg 


Xenia BB Carrier board



These are open-source designs.


DFC Design has developed a UDP core for this design, available on OpenCores.org and has published two design examples: an Ethernet example and a high-speed camera design.


Here’s a block diagram of the Ethernet example:





DFC Design Ethernet Example System.jpg 



Please contact DFC Design directly for more information.




LMI TechnologiesGocator 3210 is a smart, metrology-grade, stereo-imaging snapshot sensor that produces 3D point clouds of scanned objects with 35μm accuracy over fields as large as 100x154mm at 4fps. The diminutive (190x142x49mm) Gocator 3210 pairs a 2Mpixel stereo camera with an industrial LED-based illuminator that projects structured blue light onto the subject to aid precise measurement of object width, height, angles, and radii. An integral Xilinx Zynq SoC accelerates these measurements so that the Gocator 3210 can scan objects at 4Hz, which LMI says is 4x the speed of such a sensor setup feeding raw data to a host CPU for processing. This fast scanning speed means that parts can pass by the Gocator for inspection on a production line without stopping for the measurement to be made. The Gocator uses a GigE interface for host connection.



LMI Technologies Gocator 3210.jpg


LMI Technologies Gocator 3210 3D Smart Stereo Vision Sensor



LMI provides a browser-based GUI to process the point clouds and 3D models generated by the Gocator. That means the processing—which includes the calculation of object width, height, angles, and radii—all takes place inside of the Gocator. No additional host software is required.


Here’s a photo of LMI’s GUI showing a 3D scan of an automotive cylinder head (a typical application for this type of sensor):




LMI Gocator GUI.jpg



LMI also offers an SDK so that you can develop sophisticated inspection programs that run on the Gocator. The company has also produced an extensive series of interesting training videos for the Gocator sensor family.


Finally, here’s a short (3 minutes) but information-dense video explaining the Gocator’s features and capabilities:






LMI’s VP of Sales Len Chamberlain has just published a blog titled “Meeting the Demand for Application-Specific 3D Solutions” that further discusses the Gocator 3210’s features and applications.



A paper titled “Evaluating Rapid Application Development with Python for Heterogeneous Processor-based FPGAs” that discusses the advantages and efficiencies of Python-based development using the PYNQ development environment—based on the Python programming language and Jupyter Notebooks—and the Digilent PYNQ-Z1 board, which is based on the Xilinx Zynq SoC, recently won the Best Short Paper award at the 25th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2017) held in Napa, CA. The paper’s authors—Senior Computer Scientist Andrew G. Schmidt, Computer Scientist Gabriel Weisz, and Research Director Matthew French from the USC Viterbi School of Engineering’s Information Sciences Institute—evaluated the impact of, the performance implications, and the bottlenecks associated with using PYNQ for application development on Xilinx Zynq devices. The authors then compared their Python-based results against existing C-based and hand-coded implementations.



The authors do a really nice job of describing what PYNQ is:



“The PYNQ application development framework is an open source effort designed to allow application developers to achieve a “fast start” in FPGA application development through use of the Python language and standard “overlay” bitstreams that are used to interact with the chip’s I/O devices. The PYNQ environment comes with a standard overlay that supports HDMI and Audio inputs and outputs, as well as two 12-pin PMOD connectors and an Arduino-compatible connector that can interact with Arduino shields. The default overlay instantiates several MicroBlaze processor cores to drive the various I/O interfaces. Existing overlays also provide image filtering functionality and a soft-logic GPU for experimenting with SIMT [single instruction, multiple threads] -style programming. PYNQ also offers an API and extends common Python libraries and packages to include support for Bitstream programming, directly access the programmable fabric through Memory-Mapped I/O (MMIO) and Direct Memory Access (DMA) transactions without requiring the creation of device drivers and kernel modules.”



They also do a nice job of explaining what PYNQ is not:



“PYNQ does not currently provide or perform any high-level synthesis or porting of Python applications directly into the FPGA fabric. As a result, a developer still must use create a design using the FPGA fabric. While PYNQ does provide an Overlay framework to support interfacing with the board’s IO, any custom logic must be created and integrated by the developer. A developer can still use high-level synthesis tools or the aforementioned Python-to-HDL projects to accomplish this task, but ultimately the developer must create a bitstream based on the design they wish to integrate with the Python [code].”



Consequently, the authors did not simply rely on the existing PYNQ APIs and overlays. They also developed application-specific kernels for their research based on the Redsharc project (see “Redsharc: A Programming Model and On-Chip Network for Multi-Core Systems on a Programmable Chip”) and they describe these extensions in the FCCM 2017 paper as well.




Redsharc Project.jpg




So what’s the bottom line? The authors conclude:


“The combining of both Python software and FPGA’s performance potential is a significant step in reaching a broader community of developers, akin to Raspberry Pi and Ardiuno. This work studied the performance of common image processing pipelines in C/C++, Python, and custom hardware accelerators to better understand the performance and capabilities of a Python + FPGA development environment. The results are highly promising, with the ability to match and exceed performances from C implementations, up to 30x speedup. Moreover, the results show that while Python has highly efficient libraries available, such as OpenCV, FPGAs can still offer performance gains to software developers.”


In other words, there’s a vast and unexplored territory—a new, more efficient development space—opened to a much broader system-development audience by the introduction of the PYNQ development environment.


For more information about the PYNQ-Z1 board and PYNQ development environment, see:







The new PALTEK DS-VU 3 P-PCIE Data Brick places a Xilinx Virtex UltraScale+ VU3P FPGA along with 8Gbytes of DDR4-2400 SDRAM, two VITA57.1 FMC connectors, and four Samtec FireFly Micro Flyover ports on one high-bandwidth, PCIe Gen3 with a x16 host connector. The card aims to provide FPGA-based hardware acceleration for applications including 2K/4K video processing, machine learning, big data analysis, financial analysis, and high-performance computing.



Paltek Data Brick.jpg 


PALTEK Data Brick packs Virtex UltraScale+ VU3P FPGA onto a PCIe card




The Samtec Micro Flyover ports accept both ECUE copper twinax and ECUO optical cables. The ECUE twinax cables are for short-reach applications and have a throughput of 28Gbps per channel. The ECUO optical cables operate at a maximum data rate of 14Gbps per channel and are available with as many as 12 simplex or duplex channels (with 28Gbps optical channels in development at Samtec).


For broadcast video applications, PALTEK also offers companion 12G-SDI Rx and 12G-SDI-Tx cards that can break out eight 12G-SDI video channels from one FireFly connection.


Please contact PALTEK directly for more information about these products.




 For more information about the Samtec FireFly system, see:







A Tale of Two Cameras: You’re gonna need a bigger FPGA

by Xilinx Employee on ‎05-05-2017 04:02 PM (4,595 Views)


Cutting-edge industrial and medical camera maker NET (New Electronic Technology) had a table at this week’s Embedded Vision Summit where the company displayed two generations of GigE cameras: the GigEPRO and CORSIGHT. Both of these camera families include multiple cameras that accommodate a wide range of monochrome and color image sensors. There are sixteen different C- and CS-mount cameras in the GigEPRO family with sensors ranging from WVGA (752x480 pixels) to WQUXGA (3664x2748 pixels) and a mix of global and rolling shutters. The CORSIGHT family includes eleven cameras with sensors ranging from WVGA (752x480 pixels) to QSXGA (2592x1944 pixels), or one of two line-scan sensors (2048 or 4096 pixels), with a mix of global and rolling shutters. In addition to its Gigabit Ethernet interface, the CORSIGHT cameras have WiFi, Bluetooth, USB 2.0, and optional GSM interfaces. Both the GigEPRO and CORSIGHT cameras are user-programmable and have on-board, real-time image processing, which can be augmented with customer-specific image-processing algorithms.



NET GigEPRO Camera.jpg 



GigEPRO Camera from NET







CORSIGHT Camera from NET




You program both cameras with NET’s GUI-based SynView Software Development Kit, which generates code for controlling the NET cameras and for processing the acquired images. When you create a program, SynView automatically determines if the required functionality is available in camera hardware. If not, SynView will do the necessary operations in software (although this increases the host CPU’s load). NET’s GigEPRO and CORSIGHT cameras are capable of performing significant on-board image processing right out of the box including Bayer decoding for color cameras, LUT (Lookup Table) conversion, white balance, gamma, brightness, contrast, color correction, and saturation.


Which leads to the question: What’s performing all of these real-time, image-processing functions in NET’s GigEPRO and CORSIGHT cameras?


Xilinx FPGAs, of course. (This should not be a surprise. After all, you’re reading a post in the Xilinx Xcell Daily blog.)


The GigEPRO cameras are based on Spartan-6 FPGAs—an LX45, LX75, or LX100 depending on the family member. At the Embedded Vision Summit, Dr. Thomas Däubler, NET’s Managing Director and CTO, explained to me that “the FPGAs are what give the GigEPRO cameras their PRO features.” In fact, there is user space reserved in the larger FPGAs for customer-specific algorithms to be performed in real time inside of the camera itself. What sort of algorithms? Däubler gave me two examples: laser triangulation and Q-code recognition. In fact, he said, some of NET’s customers perform all of the image processing and analysis in the camera and never send the image to the host—just the results of the analysis. Of course, this distributed-processing approach greatly reduces the host CPU’s processing load and therefore allows one host computer to handle many more cameras.


Here’s a photo from the Summit showing a NEW GigEPRO camera inspecting a can on a spinning platform while reading the label on the can:




NET GigEPRO Camera Inspects Object on Spinning Table and Reads Label.jpg 



NET GigEPRO Camera Inspects Object on Spinning Table and Reads Label



There’s a second important reason for using the FPGA in NET’s GigEPRO cameras: the FPGAs create a hardware platform that allowed NET to develop the sixteen GigEPRO family members that handle many different image sensors with varied hardware interfaces and timing requirements. NET relied on the Spartan-6 FPGAs’ I/O programmability to help with this aspect of the camera family’s design.


So when it came time for NET to develop a new intelligent camera family—the recently introduced CORSIGHT smart vision system—with even more features, did NET’s design engineers continue to use FPGAs for real-time image processing?


Of course they did. For the newer camera, and for the same reasons, they chose the Xilinx Artix-7 FPGA family.


And here’s the CORSIGHT camera in action:




NET CORSIGHT Camera Inspects Object on Spinning Table and Reads Label.jpg



NET CORSIGHT Camera Inspects Object on Spinning Table




Note: For more information about the GigEPRO and CORSIGHT camera families, and the SynView software, please contact NET directly.




This week, just in time for the Embedded Vision Summit in Santa Clara, Aldec announced its TySOM-2A Embedded Prototyping Board based on a Xilinx Zynq Z-7030 SoC. The board features a combination of memories (1Gbyte of DDR3 SDRAM, SPI flash memory, EEPROM, microSD), communication interfaces (2× Gigabit Ethernet, 4× USB 2.0, UART-via-USB, Wi-Fi, Bluetooth, HDMI 1.4), an FMC connector, and other miscellaneous modules (LEDs, DIP switches, XADC, RTC, accelerometer, temperature sensor). Here’s a photo of the TySOM-2A board:





Aldec TySOM-2A Board.jpg



Aldec TySOM-2A Embedded Prototyping Board based on a Xilinx Zynq Z-7030 SoC



In its booth at the Summit, Aldec demonstrated a real-time, face-detection reference design running on the Zynq SoC. The program depends on the accelerated processing capabilities of the Zynq SoC’s programmable logic to run this complex code, processing a 1280x720-pixel video stream in real time. The most computationally intensive parts of the code including edge detection, color-space conversion, and frame merging were off-loaded from Zynq SoC’s ARM Cortex-A9 processor to the device’s programmable logic using Xilinx’s SDSoC Development Environment.


Here’s a very short video showing the demo:






This week at the Embedded Vision Summit in Santa Clara, CA, Mario Bergeron demonstrated a design he’d created that combines real-time visible and IR thermal video streams from two different sensors. (Bergeron is a Senior FPGA/DSP Designer with Avnet.) The demo runs on an Avnet PicoZed SOM (System on Module) based on a Xilinx Zynq Z-7030 SoC. The PicoZed SOM is the processing portion of the Avnet PicoZed Embedded Vision Kit. An FMC-mounted Python-1300-C image sensor supplies the visible video stream in this demo and a FLIR Systems Lepton image sensor supplies the 60x80-pixel IR video stream. The Lepton IR sensor connects to the PicoZed SOM over a Pmod connector on the PicoZed.


Here’s a block diagram of this demo:



Avnet reVISION demo with PicoZed Embedded Vision Kit.jpg 



Bergeron integrated these two video sources and developed the code for this demo using the new Xilinx reVISION stack, which includes a broad range of development resources for vision-centric platform, algorithm, and application development. The Xilinx SDSoC Development Environment and the Vivado Design Suite including the Vivado HLS high-level synthesis tool are all part of the reVISION stack, which also incorporates OpenCV libraries and machine-learning frameworks such as Caffe.


In this demo, Bergeron’s design takes the visible image stream and performs a Sobel edge extraction on the video. Simultaneously, the design also warps and resizes the IR Thermal image stream so that the Sobel edges can be combined with the thermal image. The Sobel and resizing algorithms come from the Xilinx reVISION stack library and Bergeron wrote the video-combining code in C. He then synthesized these three tasks in hardware to accelerate them because they were the most compute-intensive tasks in the demo. Vivado HLS created the hardware accelerators for these tasks directly from the C code and SDSoC connected the accelerator cores to the ARM processor with DMA hardware and generated the software drivers.


Here’s a diagram showing the development process for this demo and the resulting system:



Avnet reVISION demo Project Diagram.jpg 



In the video below, Bergeron shows that the unaccelerated Sobel algorithm running in software consumes 100% of an ARM Cortex-A9 processor in the Zynq Z-7030 SoC and still only achieves about one frame/sec—far too slow. By accelerating this algorithm in the Zynq SoC’s programmable logic using SDSoC and Vivado HLS, Bergeron cut the processor load by more than 80% and achieved real-time performance. (By my back-of-the envelope calculation, that’s about a 150x speedup: going from 1 to 30 frames/sec and cutting the processor load by more than 80%.)


Here’s the 5-minute video of this fascinating demo:







For more information about the Avnet PicoZed Embedded Vision Kit, see “Avnet’s $1500, Zynq-based PicoZed Embedded Vision Kit includes Python-1300-C camera and SDSoC license.”



For more information about the Xilinx reVISION stack, see “Xilinx reVISION stack pushes machine learning for vision-guided applications all the way to the edge,” and “Yesterday, Xilinx announced the reVISION stack for software-defined embedded-vision apps. Today, there’s two demo videos.”


How to tackle KVM (Keyboard/Video/Mouse) challenges at 4K and beyond: Any Media Over Any Network

by Xilinx Employee ‎05-04-2017 11:05 AM - edited ‎05-04-2017 11:08 AM (3,677 Views)


We’ve had KVM (keyboard, video, mouse) switches for controlling multiple computers from one set of user-interface devices for a long, long time. Go back far enough, and you were switching RS-232 ports to control multiple computers or other devices with one serial terminal. Here’s what they looked like back in the day:



Old KVM Switch.jpg 



In those days, these KVM switches could be entirely mechanical. Now, they can’t. There are different video resolutions, different coding and compression standards, there’s video over IP (Ethernet), etc. Today’s KVM switch is also a many-to-many converter. Your vintage rotary switch isn’t going to cut it for today’s Pro AV and Broadcast applications.


If you need to meet this kind of design challenge—today—you need low-latency video codecs like H.265/HEVC and compression standards such as TICO; you need 4K and 8K video resolution with conversion to and from HD; and you need compatibility and interoperability with all sorts of connectivity standards including 3G/12G SGI and high-speed Ethernet. In short, you need “Any Media Over Any Network” capability and you need all of that without exploding your BOM cost.


Where are you going to get it?


Well, considering that this is the Xilinx Xcell Daily blog, it’s a good bet that you’re going to hear about the capabilities of at least one Xilinx All Programmable device.


Actually, this blog is about a couple of upcoming Webinars being held on May 23 titled “Any Media Over Any Network: KVM Extenders, Switches and KVM-over-IP.” The Webinars are identical but are being held at two different times to accommodate worldwide time zones. In this webinar, Xilinx will show you how you can use the Zynq UltraScale+ MPSoC in KVM applications. The webinar will highlight how Xilinx and its partners’ video-processing and -connectivity IP cores along with the integrated H.265/HEVC codec in the three Zynq UltraScale+ MPSoC EV family members can quickly and easily address new opportunities in the KVM market.



  • Register here for the free webinar being held at 7am Pacific Daylight Time (UTC-08:00).


  • Register here for the free webinar being held at 10am Pacific Daylight Time (UTC-08:00).







In this 40-minute webinar, Xilinx will present a new approach that allows you to unleash the power of the FPGA fabric in Zynq SoCs and Zynq UltraScale+ MPSoCs using hardware-tuned OpenCV libraries, with a familiar C/C++ development environment and readily available hardware development platforms. OpenCV libraries are widely used for algorithm prototyping by many leading technology companies and computer vision researchers. FPGAs can achieve unparalleled compute efficiency on complex algorithms like dense optical flow and stereo vision in only a few watts of power.


This Webinar is being held on July 12. Register here.


Here’s a fairly new, 4-minute video showing a 1080p60 Dense Optical Flow demo, developed with the Xilinx SDSoC Development Environment in C/C++ using OpenCV libraries:





For related information, see Application Note XAPP1167, “Accelerating OpenCV Applications with Zynq-7000 All Programmable SoC using Vivado HLS Video Libraries.”


Plethora IIoT develops cutting‑edge solutions to Industry 4.0 challenges using machine learning, machine vision, and sensor fusion. In the video below, a Plethora IIoT Oberon system monitors power consumption, temperature, and the angular speed of three positioning servomotors in real time on a large ETXE-TAR Machining Center for predictive maintenance—to spot anomalies with the machine tool and to schedule maintenance before these anomalies become full-blown faults that shut down the production line. (It’s really expensive when that happens.) The ETXE-TAR Machining Center is center-boring engine crankshafts. This bore is the critical link between a car’s engine and the rest of the drive train including the transmission.




Plethora IIoT Oberon System.jpg 




Plethora uses Xilinx Zynq SoCs and Zynq UltraScale+ MPSoCs as the heart of its Oberon system because these devices’ unique combination of software-programmable processors, hardware-programmable FPGA fabric, and programmable I/O allow the company to develop real-time systems that implement sensor fusion, machine vision, and machine learning in one device.


Initially, Plethora IIoT’s engineers used the Xilinx Vivado Design Suite to develop their Zynq-based designs. Then they discovered Vivado HLS, which allows you to take algorithms in C, C++, or SystemC directly to the FPGA fabric using hardware compilation. The engineers’ first reaction to Vivado HLS: “Is this real or what?” They discovered that it was real. Then they tried the SDSoC Development Environment with its system-level profiling, automated software acceleration using programmable logic, automated system connectivity generation, and libraries to speed programming. As they say in the video, “You just have to program it and there you go.”


Here’s the video:





Plethora IIoT is showcasing its Oberon system in the Industrial Internet Consortium (IIC) Pavilion during the Hannover Messe Show being held this week. Several other demos in the IIC Pavilion are also based on Zynq All Programmable devices.


intoPIX announces IP core support for 8K TICO video compression with <1msec end-to-end latency

by Xilinx Employee ‎04-21-2017 02:01 PM - edited ‎04-21-2017 02:16 PM (3,243 Views)


Today, intoPIX announced that it’s lightweight TICO video-compression IP cores for Xilinx FPGAs can now support frame resolutions and rates to 8K60p as well as the previously supported HD and 4K resolutions. Currently, the compression cores support 10-bit, 4:2:2 workflows but intoPIX also disclosed in a published table (see below) that a future release of the IP core will support 4:4:4 color sampling. The TICO compression standard simplifies the management of live and broadcast video streams over existing video network infrastructures based on SDI and Ethernet by reducing the bandwidth requirements of high-definition and ultra-high-definition video at compression ratios as large as 5:1 (visually lossless at ratios to 4:1). TICO compression supports live video streams through its low latency—less than 1msec end-to-end.


Conveniently, intoPIX has published a comprehensive chart showing its various TICO compression IP cores and the Xilinx FPGAs that can support them. Here’s the intoPIX chart:



intoPIX TICO Compression Table for Xilinx FPGAs.jpg 



Note that the most cost-effective Xilinx FPGAs including the Spartan-6 and Artix-7 families support TICO compression at HD and even some UHD/4K video formats while the Kintex-7, Virtex-7, and UltraScale device families support all video formats through 8K.


Please contact intoPIX for more information about these IP cores.




Mega65 Logo.jpgThe MEGA65 is an open-source microcomputer modeled on the incredibly successful Commodore 64/65 circa 1982-1990. Ye olde Commodore 64 (C64)—introduced in 1982—was based on an 8-bit MOS Technology 6510 microprocessor, which was derived from the very popular 6502 processor that powered the Apple II, Atari 400/800, and many other 8-bit machines in the 1980s. The 6510 processor added an 8-bit parallel I/O port to the 6502, which no doubt dropped the microcomputer’s BOM cost a buck or two. According to Wikipedia, “The 6510 was only widely used in the Commodore 64 home computer and its variants.” Also according to Wikipedia, “For a substantial period (1983–1986), the C64 had between 30% and 40% share of the US market and two million units sold per year, outselling the IBM PC compatibles, Apple Inc. computers, and the Atari 8-bit family of computers.”


Now that is indeed a worthy computer to serve as a “Jurassic Park” candidate and therefore, the non-profit MEGA (Museum of Electronic Games & Art), “dedicated to the preservation of our digital heritage,” is supervising the physical recreation of the Commodore 64 microcomputer (mega65.org). It’s called the MEGA65 and it’s software-compatible with the original Commodore 64, only faster. (The 6510 processor emulation in the MEGA65 runs at 48MHz compared to the original MOS Technology 6510’s ~1MHz clock rate.) MEGA65 hardware designs and software are open-source (LGPL).


How do you go about recreating the hardware of a machine that’s been gone for 25 years? Fortunately, it’s a lot easier than extracting DNA from the stomach contents of ancient mosquitos trapped in amber. Considering that this blog is appearing in Xcell Daily on the Xilinx Web site, the answer’s pretty obvious: you use an FPGA. And that’s exactly what’s happening.


A few days ago, the MEGA65 team celebrated initial bringup of the MEGA65 pcb. You can read about the bringup efforts here and here is a photo of the pcb:



MEGA65 pcb.jpg 


The first MEGA65 PCB




The MEGA65 pcb is designed to fit into the existing Commodore 65 plastic case. (The Commodore 65 was prototyped but not put into production.)


Sort of gives a new meaning to “single-chip microcomputer,” does it not. That big chip in the middle of the board is an Xilinx Artix-7 A200T. It implements the Commodore 64’s entire motherboard in one programmable logic device. Yes, that includes the RAM. The Artix-7 A200T FPGA has 13.14Mbits of on-chip block RAM. That’s more than 1.5Mbytes of RAM, or 25x more RAM than the original Commodore 64 motherboard, which used eight 4164 64Kbit, 150nsec DRAMs for RAM storage. The video’s a bit improved too, from 160x200 pixels, with a maximum of four colors per 4x8 character block, or 320x200 pixels, with a maximum of two colors per 8x8 character block, to a more modern 1920x1200 pixels with 12-bit color (23-bit color is planned). Funny what 35 years of semiconductor evolution can produce.


What’s the project’s progress status? Here’s a snapshot from the MEGA65 site:




MEGA65 Progress.jpg



MEGA65 Project Status




And here’s a video of the MEGA65 in action:






Remember, what you see and hear is running on a Xilinx Artix-7 A200T, configured to emulate an entire Commodore 64 microcomputer. Most of the code in this video was written in the Jurassic period of microcomputer development. If you’re of a certain age, these old programs should bring a chuckle or perhaps just a smile to your lips.



Note: You’ll find a MEGA65 project log by Antti Lukats here.







Basic problem: When you’re fighting aliens to save the galaxy wearing your VR headset, having a wired tether to pump the video to your headset is really going to crimp your style. Spin around to blast that battle droid sneaking up on you from behind is just as likely to garrote you as save your neck. What to do? How will you successfully save the galaxy?


Well, NGCodec and Celeno Communications have a demo for you in the NGCodec booth (N2635SP-A) at NAB in the Las Vegas Convention Center next week. Combine NGCodec’s low-latency H.265/HEVC “RealityCodec” video coder/decoder IP with Celeno’s 5GHz 802.11ac WiFi connection and you have a high-definition (2160x1200), high-frame-rate (90 frames/sec) wireless video connection over a 15Mbps wireless channel. This demo uses a 250:1 video compression setting to fit the video into the 15Mbps channel.


In the demo, a RealityCodec hardware instance in a Xilinx Virtex UltraScale+ VU9P FPGA on a PCIe board plugged into a PC running Windows 10 compresses generated video in real time. The PC sends the compressed, 15Mbps video stream to a Celeno 802.11ac WiFi radio, which transmits the video over a standard 5GHz 802.11ac WiFi connection. Another Celeno WiFi radio receives the compressed video stream and sends it to a second RealityCodec for decoding. The decoder hardware is instantiated in a relatively small Xilinx Kintex-7 325T FPGA. The decoded video stream feeding the VR goggles requires 6Gbps of bandwidth, which is why you want to compress it for RF transmission.


Of course, if you’re going to polish off the aliens quickly, you really need that low compression latency. Otherwise, you’re dead meat and the galaxy’s lost. A bad day all around.


Here’s a block diagram of the NAB demo:



NGCodec Wireless VR Demo for NAB.jpg 






You are never going to get past a certain performance barrier by compiling C for a software-programmable processor. At some point, you need hardware acceleration.


As an analogy: You can soup up a car all you want; it’ll never be an airplane.


Sure, you can bump the processor clock rate. You can add processor cores and distribute the tasks. Both of these approaches increase power consumption, so you’ll need a bigger and more expensive power supply; they increase heat generation, which means you will need better cooling and probably a bigger heat sink or a fan (or another fan); and all of these things increase BOM costs.


Are you sure you want to take that path? Really?


OK, you say. This blog’s from an FPGA company (actually, Xilinx is an “All Programmable” company), so you’ll no doubt counsel me to use an FPGA to accelerate these tasks and I don’t want to code in Verilog or VHDL, thank you very much.


Not a problem. You don’t need to.


You can get the benefit of hardware acceleration while coding in C or C++ using the Xilinx SDSoC development environment. SDSoC produces compiled software automatically coupled to hardware accelerators and all generated directly from your high-level C or C++ code.


That’s the subject of a new Chalk Talk video just posted on the eejournal.com Web site. Here’s one image from the talk:



SDSoC Acceleration Results.jpg



This image shows three complex embedded tasks and the improvements achieved with hardware acceleration:



  • 2-camera, 3D disparity mapping – 292x speed improvement


  • Sobel filter video processing – 30x speed improvement


  • Binary neural network – 1000x speed improvement



A beefier software processor or multiple processor cores will not get you 1000x more performance—or even 30x—no matter how you tweak your HLL code, and software coders will sweat bullets just to get a few percentage points of improvement. For such big performance leaps, you need hardware.


Here’s the 14-minute Chalk Talk video:






By Adam Taylor


So far, we have examined the FPGA hardware build for the Aldec TySOM-2 FPGA Prototyping board example in Vivado, which is a straightforward example of a simple image-processing chain. This hardware design allows an image to be received, stored in DDR SDRAM attached to the Zynq SoC’s PS, and then output to an HDMI display. What the hardware design at the Vivado level does not do is perform any face-detection functions. And to be honest, why would it?


With the input and output paths of the image-processing pipeline defined, we can use the untapped resources of the Zynq SoC’s PL and PS/PL interconnects to create the application at a higher level. We need to use SDSoC to do this, which allows us to develop our design using a higher-level language like C or C++ and then move the defined functionality from the PS into the PL—to accelerate that function.


The Vivado design we examined last week forms an SDSoC Platform, which we can use with the Linux operating system to implement the final design. The use of Linux allows us to use OpenCV within the Zynq SoC’s PS cores to support creation of the example design. If we develop with the new Xilinx reVISION stack, we can go even further and accelerate some of the OpenCV functions.


The face-detection example supplied with the TySOM-2 board implements face detection using a Pixel Intensity Comparison-based Object (PICO) detection framework developed by N Markus et al. The PICO framework scans the image with a cascade of binary classifiers. This PICO-based approach permits more efficient implementations that do not require the computation of integral images, HOG Pyramids, etc.


In this example, we need to define a frame buffer within the device tree blob to allow the Linux application to access the images stored within the Zynq SoC’s PS DDR SDRAM. The Linux application then uses “Video for Linux 2” (V4L2) to access this frame buffer and to allow further processing.







Once we get an image frame from the frame buffer, the software application can process it. The application will do the following things:


  1. Receive the input frame from the DDR SDRAM frame buffer using the V4L2 Linux Driver.
  2. Convert the input frame YUV4:2:2 format as received by the Blue Eagle camera into grey scale. This conversion extracts the Lumina component as the greyscale value.
  3. Perform the PICO object detection on the greyscale frame.
  4. Perform Sobel edge detection on the faces detected within the PICO object detector output.
  5. Perform further YUV to RGB conversion on the original received image frame.
  6. Use the OpenCV Circle function to highlight detected faces.
  7. Output the image to the HDMI port in the RGBA 8:8:8:8 format using the libdrm library within the Linux OS.


Looking at the above functions, not all of them can be accelerated in to the hardware. In this example, the conversion from YUV to greyscale, Sobel Edge Detection, and YUV-to-RGB conversion can be accelerated using the PL to increase performance.


Moving these functions into the PL is as easy as selecting the two functions we wish to accelerate with hardware and then clicking on build to create the example.  







Once this was completed, the algorithm ran as expected using both the PS and PL in the Zynq SoC.






Using this approach allows us to exploit both the Zynq SoC’s PL and PS for image processing without the need to implement a fixed RTL design in Vivado. In short, this ability allows us to use a known good platform design to implement image capture and display across several different applications. Meanwhile, the use of SDSoC also allows us to exploit the Zynq SoC’s PL at a higher level without the need to develop the HDL from scratch, reducing development time.



My code is available on Github as always.


If you want E book or hardback versions of previous MicroZed chronicle blogs, you can get them below.




  • First Year E Book here
  • First Year Hardback here.



MicroZed Chronicles hardcopy.jpg 



  • Second Year E Book here
  • Second Year Hardback here


MicroZed Chronicles Second Year.jpg 




Later this month at the NAB Show in Las Vegas, you’ll be able to see several cutting-edge video demos based on the Xilinx Zynq SoC and Zynq UltraScale+ MPSoC in the Omnitek booth (C7915). First up is an HEVC video encoder demo using the embedded, hardened video codec built into the Zynq UltraScale+ ZU7EV MPSoC on a ZCU106 eval board. (For more information about the ZCU106 board, see “New Video: Zynq UltraScale+ EV MPSoCs encode and decode 4K60p H.264 and H.265 video in real time.”)


Next up is a demo of Omnitek’s HDMI 2.0 IP core, announced earlier this year. This core consists of separate transmit and receive subsystems. The HDMI 2.0 Rx subsystem can convert an HDMI video stream (up to 4KP60) into an RGB/YUV video AXI4-Stream and places AUX data in an auxiliary AXI4-Stream. The HDMI 2.0 Tx subsystem converts an RGB/YUV video AXI4-Stream plus AUX data into an HDMI video stream. This IP features a reduced resource count (small footprint in the programmable logic) and low latency.


Finally, Omnitek will be demonstrating a new addition to its OSVP Video Processor Suite: a real-time Image Signal Processing (ISP) Pipeline Subsystem, which can create an RGB video stream from raw image-sensor outputs. The ISP pipeline includes blocks that perform image cropping, defective-pixel correction, black-level compensation, vignette correction, automatic white balancing, and Bayer filter demosaicing.




Omnitek ISP Pipeline Subsystem.jpg



Omnitek’s Image Signal Processing (ISP) Pipeline Subsystem





Both the HDMI 2.0 and ISP Pipeline Subsystem IP are already proven on Xilinx All Programmable devices including all 7 series devices (Artix-7, Kintex-7, and Virtex-7), Kintex UltraScale and Virtex UltraScale devices, Kintex UltraScale+ and Virtex UltraScale+ devices, and Zynq-7000 SoCs and Zynq UltraScale+ MPSoCs.




NMI, a non-profit organization dedicated to improving electronic engineering and manufacturing in the UK, is organizing a one-day, machine-vision event for May 18 titled “Implementing Machine Vision with FPGA & SoC Platforms.” MBDA Missile Systems is hosting the event in its Stevenage location in the UK. (That’s roughly midway between London and Cambridge for those of us who are geographically challenged.)


Key themes for the event will include: OpenCV with FPGAs and SoCs, ADAS, Robotic Guided Vision/Drones, Industry 4.0, Defense, and Machine Learning.


Register here.



As a follow-on to last month’s announcement that RFEL had supplied the UK’s Defence Science and Technology Laboratory (DSTL) with two of its Zynq-based HALO Rapid Prototype Development Systems (RPDS), RFEL has now announced that DSTL has contracted with a three-company team to develop an adaptive, real-time, FPGA-based vision platform “to solve complex defence vision and surveillance problems, facilitating the rapid incorporation of best-in-class video processing algorithms while simultaneously bridging the gap between research prototypes and deployable equipment.” The three company team includes RFEL, 4Sight Imaging, and team leader Plextek.


The press release explains, “This innovative work draws together the best aspects of two approaches to video processing: high performance, bespoke FPGA processing supporting the computationally intensive tasks, and the flexibility (but lower performance) of CPU-based processing. This heterogeneous, hybrid approach is possible by using contemporary system-on-chip (SoC) devices, such as Xilinx’s Zynq devices, that provide embedded ARM CPUs with closely coupled FPGA fabric. The use of a modular FPGA design, with generic interfaces for each module, enables FPGA functions, which are traditionally inflexible, to be dynamically re-configured under software control.”





HALO Rapid Prototype Development Systems (RPDS)





  • For more information about the broad range of hardware, software, and development-tool technologies for vision-system development in the Xilinx reVISION stack, click here.



By Adam Taylor



Having introduced the Aldec TySOM-2 FPGA Prototyping Board, based on the Xilinx Zynq SoC, and the face detection application running on it, I thought it would be a good idea to take a more detailed examination of the face-detection application’s architecture.


The face detection example uses one Blue Eagle camera, which is connected to the Aldec FMC-ADAS card. The processed frames showing the detected face are output via the TySOM-2 board’s HDMI port. What is worth pointing out is that the application running on the TySOM-2 board, face detection in this case, is enabled by the software. The Zynq PL (programmable logic) hardware design provides the capability to interface with the camera, for sharing the video frames with the Zynq PS (processing system) through the DDR SDRAM, and for display output.


Any application could be implemented—not just face detection. It could be object tracking. I could be corner detection. It could be anything. This is one of the things that makes development of image-processing systems on the Zynq so powerful. We can use the same base platform on the TySOM-2 board and customize the application in software. Of course, we can also use the Xilinx SDSoC development environment to further accelerate the algorithm into the TySOM-2 platform’s remaining resources to increase performance.


The Blue Eagle camera transmits the video stream using a, FPD-Link III link. These links use a high-speed, bi-directional CML (Current Mode Logic) link to transfer the image data. An FPD-Link III receiving device (a TI DS90UB914Q-Q1 FPD-Link III SER/DES) is used on the ADAS FMC to implement this camera interface. This device is configured for the application in hand using the I2C peripheral in the Zynq SoC’s PS. This device provides video to the Zynq PL in a parallel format: the parallel data bits, HSync, VSync, and a pixel clock.






We need to process the frames and store them within the Zynq PS’ DDR SDRAM using Video DMA (Direct Memory Access) to ensure that we can access the image frames within DDR memory using the Zynq SoC’s ARM Cortex-A9 processor. We need to use several IP blocks that come as standard IP within Vivado to implement this. These IP blocks transfer data using the AXI streaming protocol--AXIS.


Therefore, the first thing needed is to convert the received video in parallel format into an AXIS stream. Once the video is in the correct format, we can use the VDMA IP block to transfer video data to and from the Zynq PS’ DDR SDRAM, where the software running on the Zynq SoC’s ARM Cortex-A9 processors can access the frames and implement the application algorithms.


Unlike previous examples we have examined, which used a single AXI High Performance (AXI HP) port, this example uses two of the Zynq SOC’s AXI HP interface ports, one in each direction. This configuration requires a slightly more complicated DMA architecture because we’ll need two VDMA IP Blocks. Within the Zynq PL, the AXI standard used for most IP blocks is AXI 4.0 while the ports on the Zynq SoC implement AXI 3.0. Therefore, we need to use an AXI Interconnect or a protocol convertor to convert between the two standards.







This use of two interfaces will make no performance difference when compared to a single HP AXI interface because the S0 and S1 AXI HP Ports on the Zynq SoC which are used by this configuration are multiplexed down to the M0 port on the memory interconnect and finally connected to the S3 port on the DDR SDRAM controller. This is shown below in the interconnection diagram from UG585, the TRM for the Zynq SoC.







Once the VDMA is implemented, the design then perform color-space conversion, chroma resampling, and finally passes to an on-screen display module. Once this has been completed, the video stream must be converted from AXIS to parallel video, which can then be output to the HDMI transmitter.


With this hardware platform completed, the next step is to write the software to create the application. For this we have the choice of using SDK or using SDSoC, which adds the ability to accelerate some of the application algorithm functions using programmable logic. As this example is implemented on the Zynq Z-7100 SoC, which has a significant amount of free, on-chip programmable resources following the implementation of the base platform, we’ll be using SDSoC for this example. We will look at the software architecture next time.


My code is available on Github as always.


If you want E book or hardback versions of previous MicroZed chronicle blogs, you can get them below.




  • First Year E Book here
  • First Year Hardback here.



MicroZed Chronicles hardcopy.jpg 



  • Second Year E Book here
  • Second Year Hardback here


MicroZed Chronicles Second Year.jpg 


Adam Taylor’s MicroZed Chronicles Part 183: Introducing the Aldec TySOM 2 and a Face-Detection app

by Xilinx Employee ‎04-05-2017 01:55 PM - edited ‎04-05-2017 03:48 PM (3,905 Views)


By Adam Taylor


So far on this journey, most of the boards we have looked at have been fitted with either the Zynq Z-7010 or Z-7020 SoCs. The new Aldec TySOM 2 board comes which with either a Zynq Z-7045 or Z-7100 device fitted, making it the most powerful Zynq-based board we have looked at to date. Especially with the Z-7100 SoC fitted as is the example Aldec has provided to me.




Aldec Tysom 2 - Adam Taylor.jpg




The TySOM 2 board is intended for development prototyping. As such, it provides you with a range of I/O pins, broken out on two FMC connectors that connect to 288 of the Zynq Z-7100 SoC’s 362 I/O pins and all 16 GTX lanes. It also provides some simple user peripherals including switches and LEDS along with an HDMI port connected to the Zynq SoC’s PL (programmable logic). Meanwhile, the Zynq PS (processing system) provides four USB 2.0 ports, Ethernet, and a USB/UART for connectivity and 1Gbyte of DDR memory. In short, the Aldec TySOM 2 board has everything we need to create a very power single board computer.


Here’s a block diagram of the TySOM 2 board:




Aldec TySOM 2 Block Diagram.jpg


Aldec TySOM 2 board block diagram



Of course, there’s a range of FPGA Mezzanine Cards (FMC) available from Aldec and other vendors to enable prototyping over a wide range of applications including vision, IIOT and ADAS. Aldec supplied my board with the ADAS daughter board, which enables the connection of up to five cameras using FPD-Link III connections. As FMC is an ANSI standard, there are a wide range of 3rd-party FMCs available, which further widen the prototyping options to support applications such as Software Defined Radio.


As I mentioned before, the Zynq Z-7100 SoC is the most powerful Zynq device we have examined to date. So what does the Z-7100 bring to the party that we have not seen before (not including the PL’s increased logic resources)? The most obvious addition is the provision of the 16 GTX transceivers that support data rates to 12.5Gbps. You can also use these high-speed serial links to implement Gen1 (2.5 Gbps) and Gen2 (5.0 Gbps) PCIe interfaces. Multi-lane solutions are also possible. The Z-7100 can support as many as 8 lanes if we so desire.


We also gain access to high performance I/O pins for the first-time, which introduce digitally controlled, on-chip termination for better signal integrity. Zynq Z-7020 devices and below only provide High Range (HR) I/O, which handle a wider range of I/O voltages (1.2V to 3.3V) although with reduced performance. When it comes to logic resources, the Zynq Z-7100 SoC is very impressive. It gives us 444K logic cells, 2020 DSP slices, 26.5Mbits of block RAM, and 554,800 flip flops.


We will look more in detail at how we can use this development board over the next few weeks. However, Aldec shipped this board pre-installed with a face-detection application, which connects to a single camera using the ADAS FMC and an HDMI display. When I connected it all up and ran the application, the example sprung to life and detected my face as I moved about in front of the supplied camera:



Aldec Face Detection using a Zynq Z-7100.jpg 



My code is available on Github as always.


If you want E book or hardback versions of previous MicroZed chronicle blogs, you can get them below.




  • First Year E Book here
  • First Year Hardback here.



MicroZed Chronicles hardcopy.jpg 




  • Second Year E Book here
  • Second Year Hardback here




MicroZed Chronicles Second Year.jpg 

Mentor’s DRS360 autonomous driving platform is based on the Xilinx Zynq UltraScale+ MPSoC

by Xilinx Employee ‎04-04-2017 10:46 AM - edited ‎04-04-2017 10:46 AM (3,756 Views)


Mentor has just announced the DRS360 platform for developing autonomous driving systems based on the Xilinx Zynq UltraScale+ MPSoC. The automotive-grade DRS360 platform is already designed and tested for deployment in ISO 26262 ASIL D-compliant systems.


This platform offers comprehensive sensor-fusion capabilities for multiple cameras, radar, LIDAR, and other sensors while offering “dramatic improvements in latency reduction, sensing accuracy and overall system efficiency required for SAE Level 5 autonomous vehicles.” In particular, the DRS360 platform’s use of the Zynq UltraScale+ MPSoC permits the use of “raw data sensors,” thus avoiding the power, cost, and size penalties of microcontrollers and the added latency of local processing at the sensor nodes.


Eliminating pre-processing microcontrollers from all system sensor nodes brings many advantages to the autonomous-driving system design including improved real-time performance, significant reductions in system cost and complexity, and access to all of the captured sensor data for a maximum-resolution, unfiltered model of the vehicle’s environment and driving conditions.


Mentor DRS360 platform Block diagram.jpg



Rather than try to scale lower levels of ADAS up, Mentor’s DRS360 platform is optimized for Level 5 autonomous driving, and it’s engineered to easily scale down to Levels 4, 3 and even 2. This approach makes it far easier to develop systems at the appropriate level for the system you’re developing because the DRS360 platform is already designed to handle the most complex tasks from the beginning.






If you’re working with any sort of video, there’s a new 4-minute demo video you need to see. This video shows two new Zynq UltraScale+ EV MPSoC devices working in tandem to decode and display 4K60p streaming video in both H.264 and H.265 video formats in real time. Zynq UltraScale+ EV MPSoC devices incorporate hardened, low-latency H.264 and H.265 video codecs (encode and decode). The demo employs two Xilinx ZCU106 boards in the following configuration:




Zynq UltraScale Plus EV Video Codec Demo Diagram.jpg




The first ZCU106 extracts the 4K60p video stream from a USB stick at 60Mbps, decodes the video, and displays it on a local monitor using a DisplayPort interface. At the same time, the on-board Zynq UltraScale+ EV device re-encodes the video using the on-chip H.265 encoder, which reduces the video bit rate to 10Mbps thanks to the improved encoding efficiency of the H.265 standard. The board then transmits the resulting 10Mbps video stream over a wired Ethernet connection to a second ZCU106 board, which decodes the video and displays it on a second monitor. The entire process occurs with such low latency that it’s hard to see any delay between the two displayed video streams.


Here’s the video demo:








Here’s a 40-minute teardown video of a Vision Research Phantom v5 high-speed high-speed, 1024x1024-pixel, 1000frames/sec video camera (circa 2001) from tesla500’s YouTube video channel. His methodical teardown and excellent system-level explanation uncovers a couple of “huge” Xilinx XC4020 FPGAs (circa 2000) on the timing and interface boards and Xilinx XC9500 CPLDs implementing the timing and control on the four high-speed capture-memory boards. There’s also a Hitachi SH-2 32-bit RISC processor with a hardware MAC (for DSP) on the timing board.


The XC4020 FPGAs are 3rd-generation devices that each have 784 CLBs (1560 LUTs total). They were big in their day but they’re very small now. These days, I think you could implement all of the digital timing and control circuitry in this camera including the SH-2 processor’s capabilities using the smallest single-core Zynq Z-7007S SoC—with the ARM Cortex-A9 processor in the Zynq SoC running considerably more than 20x faster than the turn-of-the-millennium SH-2 processor’s roughly 28MHz maximum clock rate.


Of course, Vision Research has moved far beyond 1000 frames/sec over the past 17 years. Its latest cameras can go 1000x faster than that, hitting 1M frames/sec when configured with the company’s FAST option (fast indeed!), while the Phantom v5 is no longer listed even on the company’s “discontinued cameras” page. Nevertheless, I found tesla500’s teardown and explanations fascinating and valuable.


Of course, Xilinx All Programmable devices have long been used to design advanced video equipment like the Vision Research Phantom v5 high-speed camera. Which allows me to quickly remind you of the recent launch of the Xilinx reVISION stack launch for embedded-vision applications. (See “Xilinx reVISION stack pushes machine learning for vision-guided applications all the way to the edge.”)


And now, here’s tesla500’s Vision Research Phantom v5 high-speed camera teardown video:









Xcell Daily discussed DeePhi Tech’s Zynq-based CNN acceleration processor last year in connection with the Hot Chips 2016 conference. (See “DeePhi’s Zynq-based CNN processor is faster, more energy efficient than CPUs or GPUs.”) DeePhi’s founder Song Yao appears in a new Powered by Xilinx video this week giving many more details including some fascinating information about an early customer, ZeroTech—China’s second largest drone maker.


DeePhi provides the entire stack needed to develop machine-learning applications based on neural networks including the development software, algorithms, and a neural-network processor that runs efficiently on the Xilinx Zynq SoC. This technology is particularly good for deep-learning, vision-based embedded apps such as drones, robotics, surveillance cameras, and for cloud-computing applications as well.


The video also provides more details on ZeroTech’s use of DeePhi’s machine-learning technology for object detection, pedestrian detection, and gesture recognition—all in a drone that nestles in your hand.


Song Yao explains that DeePhi’s tools provide a GPU-like development environment while taking advantage of the superior efficiency of neural networks implemented with programmable logic. In addition, DeePhi can change the neural network’s architecture to further optimize the design for specific applications.


Finally, he explains that you can use these Zynq-based implementations in applications where GPUs will simply not work due to power-consumption restrictions. In fact, last year at Hot Chips 2016 he reportedly said, “The FPGA based DPU platform achieves an order of magnitude higher energy efficiency over GPU on image recognition and speech detection.”


Here’s the new, 3-minute Powered by Xilinx video:





How to use machine learning for embedded vision—and many other embedded applications

by Xilinx Employee ‎03-30-2017 10:02 AM - edited ‎03-30-2017 12:00 PM (3,577 Views)


Image3.jpg Adam Taylor and Xilinx’s Sr. Product Manager for SDSoC and Embedded Vision Nick Ni have just published an article on the EE News Europe Web site titled “Machine learning in embedded vision applications.” That title’s pretty self-explanatory, but there are a few points I’d like to highlight. Then you can go read the full article yourself.


As the article states, “Machine learning spans several industry mega trends, playing a very prominent role within not only Embedded Vision (EV), but also Industrial Internet of Things (IIoT) and Cloud Computing.” In other words, if you’re designing products for any embedded market, you might well find yourself at a competitive disadvantage if you’re not adding machine-learning features to your road map.


This article closely ties machine learning with neural networks (including Feed-forward Neural Networks (FNNs), Recurrent Neural Networks (RNNs), and Deep Neural Networks (DNNs), and Convolutional Neural Networks (CNNs)). Neural networks are not programmed; they’re trained. Then, if they’re part of an embedded design, they’re deployed. Training is usually done using floating-point neural-network implementations but, for efficiency (power and cost), deployed neural networks can use fixed-point representations with very little or no loss of accuracy. (See “Counter-Intuitive: Fixed-Point Deep-Learning Inference Delivers 2x to 6x Better CNN Performance with Great Accuracy.”)


The programmable logic inside of Xilinx FPGAs, Zynq SoCs, and Zynq UltraScale+ MPSoCs is especially good at implementing fixed-point neural networks, as described in this article by Nick Ni and Adam Taylor. (Go read the article!)


Meanwhile, this is a good time to remind you of the recent Xilinx introduction of the reVISION stack for neural network development using Xilinx All Programmable devices. For more information about the Xilinx reVISION stack, see:
















About the Author
  • Be sure to join the Xilinx LinkedIn group to get an update for every new Xcell Daily post! ******************** Steve Leibson is the Director of Strategic Marketing and Business Planning at Xilinx. He started as a system design engineer at HP in the early days of desktop computing, then switched to EDA at Cadnetix, and subsequently became a technical editor for EDN Magazine. He's served as Editor in Chief of EDN Magazine, Embedded Developers Journal, and Microprocessor Report. He has extensive experience in computing, microprocessors, microcontrollers, embedded systems design, design IP, EDA, and programmable logic.