
 

In this 40-minute webinar, Xilinx will present a new approach that allows you to unleash the power of the FPGA fabric in Zynq SoCs and Zynq UltraScale+ MPSoCs using hardware-tuned OpenCV libraries, with a familiar C/C++ development environment and readily available hardware development platforms. OpenCV libraries are widely used for algorithm prototyping by many leading technology companies and computer vision researchers. FPGAs can achieve unparalleled compute efficiency on complex algorithms like dense optical flow and stereo vision in only a few watts of power.

 

This Webinar is being held on July 12. Register here.

 

Here’s a fairly new, 4-minute video showing a 1080p60 Dense Optical Flow demo, developed with the Xilinx SDSoC Development Environment in C/C++ using OpenCV libraries:

 

 

 

 

For related information, see Application Note XAPP1167, “Accelerating OpenCV Applications with Zynq-7000 All Programmable SoC using Vivado HLS Video Libraries.”

 

Plethora IIoT develops cutting‑edge solutions to Industry 4.0 challenges using machine learning, machine vision, and sensor fusion. In the video below, a Plethora IIoT Oberon system monitors power consumption, temperature, and the angular speed of three positioning servomotors in real time on a large ETXE-TAR Machining Center for predictive maintenance—to spot anomalies with the machine tool and to schedule maintenance before these anomalies become full-blown faults that shut down the production line. (It’s really expensive when that happens.) The ETXE-TAR Machining Center is center-boring engine crankshafts. This bore is the critical link between a car’s engine and the rest of the drive train including the transmission.

 

 

 

Plethora IIoT Oberon System.jpg 

 

 

 

Plethora uses Xilinx Zynq SoCs and Zynq UltraScale+ MPSoCs as the heart of its Oberon system because these devices’ unique combination of software-programmable processors, hardware-programmable FPGA fabric, and programmable I/O allows the company to develop real-time systems that implement sensor fusion, machine vision, and machine learning in one device.

 

Initially, Plethora IIoT’s engineers used the Xilinx Vivado Design Suite to develop their Zynq-based designs. Then they discovered Vivado HLS, which allows you to take algorithms in C, C++, or SystemC directly to the FPGA fabric using hardware compilation. The engineers’ first reaction to Vivado HLS: “Is this real or what?” They discovered that it was real. Then they tried the SDSoC Development Environment with its system-level profiling, automated software acceleration using programmable logic, automated system connectivity generation, and libraries to speed programming. As they say in the video, “You just have to program it and there you go.”

 

Here’s the video:

 

 

 

 

Plethora IIoT is showcasing its Oberon system in the Industrial Internet Consortium (IIC) Pavilion during the Hannover Messe Show being held this week. Several other demos in the IIC Pavilion are also based on Zynq All Programmable devices.

 

intoPIX announces IP core support for 8K TICO video compression with <1msec end-to-end latency

by Xilinx Employee, 04-21-2017

 

Today, intoPIX announced that its lightweight TICO video-compression IP cores for Xilinx FPGAs can now support frame resolutions and rates up to 8K60p as well as the previously supported HD and 4K resolutions. Currently, the compression cores support 10-bit, 4:2:2 workflows, but intoPIX also disclosed in a published table (see below) that a future release of the IP core will support 4:4:4 color sampling. The TICO compression standard simplifies the management of live and broadcast video streams over existing video network infrastructures based on SDI and Ethernet by reducing the bandwidth requirements of high-definition and ultra-high-definition video at compression ratios as large as 5:1 (visually lossless at ratios up to 4:1). TICO compression supports live video streams through its low latency—less than 1msec end-to-end.

 

Conveniently, intoPIX has published a comprehensive chart showing its various TICO compression IP cores and the Xilinx FPGAs that can support them. Here’s the intoPIX chart:

 

 

intoPIX TICO Compression Table for Xilinx FPGAs.jpg 

 

 

Note that the most cost-effective Xilinx FPGAs including the Spartan-6 and Artix-7 families support TICO compression at HD and even some UHD/4K video formats while the Kintex-7, Virtex-7, and UltraScale device families support all video formats through 8K.

 

Please contact intoPIX for more information about these IP cores.

 

 

 

Mega65 Logo.jpg

The MEGA65 is an open-source microcomputer modeled on the incredibly successful Commodore 64/65 circa 1982-1990. Ye olde Commodore 64 (C64)—introduced in 1982—was based on an 8-bit MOS Technology 6510 microprocessor, which was derived from the very popular 6502 processor that powered the Apple II, Atari 400/800, and many other 8-bit machines in the 1980s. The 6510 processor added an 8-bit parallel I/O port to the 6502, which no doubt dropped the microcomputer’s BOM cost a buck or two. According to Wikipedia, “The 6510 was only widely used in the Commodore 64 home computer and its variants.” Also according to Wikipedia, “For a substantial period (1983–1986), the C64 had between 30% and 40% share of the US market and two million units sold per year, outselling the IBM PC compatibles, Apple Inc. computers, and the Atari 8-bit family of computers.”

 

Now that is indeed a worthy computer to serve as a “Jurassic Park” candidate and therefore, the non-profit MEGA (Museum of Electronic Games & Art), “dedicated to the preservation of our digital heritage,” is supervising the physical recreation of the Commodore 64 microcomputer (mega65.org). It’s called the MEGA65 and it’s software-compatible with the original Commodore 64, only faster. (The 6510 processor emulation in the MEGA65 runs at 48MHz compared to the original MOS Technology 6510’s ~1MHz clock rate.) MEGA65 hardware designs and software are open-source (LGPL).

 

How do you go about recreating the hardware of a machine that’s been gone for 25 years? Fortunately, it’s a lot easier than extracting DNA from the stomach contents of ancient mosquitos trapped in amber. Considering that this blog is appearing in Xcell Daily on the Xilinx Web site, the answer’s pretty obvious: you use an FPGA. And that’s exactly what’s happening.

 

A few days ago, the MEGA65 team celebrated initial bring-up of the MEGA65 PCB. You can read about the bring-up efforts here, and here is a photo of the PCB:

 

 

MEGA65 pcb.jpg 

 

The first MEGA65 PCB

 

 

 

The MEGA65 PCB is designed to fit into the existing Commodore 65 plastic case. (The Commodore 65 was prototyped but not put into production.)

 

Sort of gives a new meaning to “single-chip microcomputer,” does it not? That big chip in the middle of the board is a Xilinx Artix-7 A200T. It implements the Commodore 64’s entire motherboard in one programmable logic device. Yes, that includes the RAM. The Artix-7 A200T FPGA has 13.14Mbits of on-chip block RAM. That’s more than 1.5Mbytes of RAM, or 25x more RAM than the original Commodore 64 motherboard, which used eight 4164 64Kbit, 150nsec DRAMs for RAM storage. The video’s a bit improved too, from 160x200 pixels with a maximum of four colors per 4x8 character block, or 320x200 pixels with a maximum of two colors per 8x8 character block, to a more modern 1920x1200 pixels with 12-bit color (23-bit color is planned). Funny what 35 years of semiconductor evolution can produce.
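As a quick back-of-the-envelope check on that 25x figure (simple arithmetic, assuming the block RAM is used as plain storage):

\[
\frac{13.14\ \mathrm{Mbits}}{8\ \mathrm{bits/byte}} \approx 1.64\ \mathrm{Mbytes},
\qquad
8 \times 64\ \mathrm{Kbits} = 64\ \mathrm{Kbytes},
\qquad
\frac{1.64\ \mathrm{Mbytes}}{64\ \mathrm{Kbytes}} \approx 25.6
\]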

 

What’s the project’s progress status? Here’s a snapshot from the MEGA65 site:

 

 

 

MEGA65 Progress.jpg

 

 

MEGA65 Project Status

 

 

 

And here’s a video of the MEGA65 in action:


 

 

 

 

 

Remember, what you see and hear is running on a Xilinx Artix-7 A200T, configured to emulate an entire Commodore 64 microcomputer. Most of the code in this video was written in the Jurassic period of microcomputer development. If you’re of a certain age, these old programs should bring a chuckle or perhaps just a smile to your lips.

 

 

Note: You’ll find a MEGA65 project log by Antti Lukats here.

 

 

 

 

 

 

Basic problem: When you’re fighting aliens to save the galaxy while wearing your VR headset, a wired tether pumping video to that headset is really going to crimp your style. Spinning around to blast the battle droid sneaking up behind you is just as likely to garrote you as to save your neck. What to do? How will you successfully save the galaxy?

 

Well, NGCodec and Celeno Communications have a demo for you in the NGCodec booth (N2635SP-A) at NAB in the Las Vegas Convention Center next week. Combine NGCodec’s low-latency H.265/HEVC “RealityCodec” video coder/decoder IP with Celeno’s 5GHz 802.11ac WiFi connection and you have a high-definition (2160x1200), high-frame-rate (90 frames/sec) wireless video connection over a 15Mbps wireless channel. This demo uses a 250:1 video compression setting to fit the video into the 15Mbps channel.

 

In the demo, a RealityCodec hardware instance in a Xilinx Virtex UltraScale+ VU9P FPGA on a PCIe board plugged into a PC running Windows 10 compresses generated video in real time. The PC sends the compressed, 15Mbps video stream to a Celeno 802.11ac WiFi radio, which transmits the video over a standard 5GHz 802.11ac WiFi connection. Another Celeno WiFi radio receives the compressed video stream and sends it to a second RealityCodec for decoding. The decoder hardware is instantiated in a relatively small Xilinx Kintex-7 325T FPGA. The decoded video stream feeding the VR goggles requires 6Gbps of bandwidth, which is why you want to compress it for RF transmission.

 

Of course, if you’re going to polish off the aliens quickly, you really need that low compression latency. Otherwise, you’re dead meat and the galaxy’s lost. A bad day all around.

 

Here’s a block diagram of the NAB demo:

 

 

NGCodec Wireless VR Demo for NAB.jpg 

 

 

 

 

 

You are never going to get past a certain performance barrier by compiling C for a software-programmable processor. At some point, you need hardware acceleration.

 

As an analogy: You can soup up a car all you want; it’ll never be an airplane.

 

Sure, you can bump the processor clock rate. You can add processor cores and distribute the tasks. Both of these approaches increase power consumption, so you’ll need a bigger and more expensive power supply; they increase heat generation, which means you will need better cooling and probably a bigger heat sink or a fan (or another fan); and all of these things increase BOM costs.

 

Are you sure you want to take that path? Really?

 

OK, you say. This blog’s from an FPGA company (actually, Xilinx is an “All Programmable” company), so you’ll no doubt counsel me to use an FPGA to accelerate these tasks. And I don’t want to code in Verilog or VHDL, thank you very much.

 

Not a problem. You don’t need to.

 

You can get the benefit of hardware acceleration while coding in C or C++ by using the Xilinx SDSoC development environment. SDSoC automatically produces compiled software coupled to hardware accelerators, all generated directly from your high-level C or C++ code.
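To give a feel for what that flow looks like from the programmer’s side, here’s a minimal sketch (the function, image size, and threshold are hypothetical examples, not taken from the Chalk Talk): you write an ordinary C/C++ function, select it for hardware acceleration in SDSoC, and the tool builds the accelerator and the PS/PL data movers for you.

#include <stdint.h>

#define WIDTH  1920
#define HEIGHT 1080

// Hypothetical acceleration candidate: a pixel-wise threshold over one frame.
// Written as plain C/C++ it runs on the processor; selected for acceleration
// in SDSoC, it is synthesized into the programmable logic and the tool
// generates the data movers that stream the frame between PS and PL.
void threshold_frame(const uint8_t in[WIDTH * HEIGHT],
                     uint8_t out[WIDTH * HEIGHT],
                     uint8_t level)
{
    for (int i = 0; i < WIDTH * HEIGHT; i++) {
        out[i] = (in[i] > level) ? 255 : 0;
    }
}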

 

That’s the subject of a new Chalk Talk video just posted on the eejournal.com Web site. Here’s one image from the talk:

 

 

SDSoC Acceleration Results.jpg

 

 

This image shows three complex embedded tasks and the improvements achieved with hardware acceleration:

 

 

  • 2-camera, 3D disparity mapping – 292x speed improvement

 

  • Sobel filter video processing – 30x speed improvement

 

  • Binary neural network – 1000x speed improvement

 

 

A beefier software processor or multiple processor cores will not get you 1000x more performance—or even 30x—no matter how you tweak your HLL code, and software coders will sweat bullets just to get a few percentage points of improvement. For such big performance leaps, you need hardware.

 

Here’s the 14-minute Chalk Talk video:

 

 

 

 

 

By Adam Taylor

 

So far, we have examined the FPGA hardware build for the Aldec TySOM-2 FPGA Prototyping board example in Vivado, which is a straightforward example of a simple image-processing chain. This hardware design allows an image to be received, stored in DDR SDRAM attached to the Zynq SoC’s PS, and then output to an HDMI display. What the hardware design at the Vivado level does not do is perform any face-detection functions. And to be honest, why would it?

 

With the input and output paths of the image-processing pipeline defined, we can use the untapped resources of the Zynq SoC’s PL and PS/PL interconnects to create the application at a higher level. We need to use SDSoC to do this, which allows us to develop our design using a higher-level language like C or C++ and then move the defined functionality from the PS into the PL—to accelerate that function.

 

The Vivado design we examined last week forms an SDSoC Platform, which we can use with the Linux operating system to implement the final design. The use of Linux allows us to use OpenCV within the Zynq SoC’s PS cores to support creation of the example design. If we develop with the new Xilinx reVISION stack, we can go even further and accelerate some of the OpenCV functions.

 

The face-detection example supplied with the TySOM-2 board implements face detection using a Pixel Intensity Comparison-based Object (PICO) detection framework developed by N. Markus et al. The PICO framework scans the image with a cascade of binary classifiers. This PICO-based approach permits more efficient implementations that do not require the computation of integral images, HOG pyramids, etc.

 

In this example, we need to define a frame buffer within the device tree blob to allow the Linux application to access the images stored within the Zynq SoC’s PS DDR SDRAM. The Linux application then uses “Video for Linux 2” (V4L2) to access this frame buffer and to allow further processing.
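For readers who haven’t used V4L2 before, the sketch below shows the basic single-buffer capture flow the application follows (the device node, resolution, and pixel format here are illustrative assumptions, not the TySOM-2 source code):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/videodev2.h>
#include <cstdio>

int main()
{
    // Illustrative device node; the actual node depends on the device tree.
    int fd = open("/dev/video0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    // Set the capture format (sizes and pixel format are assumptions).
    v4l2_format fmt = {};
    fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    fmt.fmt.pix.width = 1280;
    fmt.fmt.pix.height = 720;
    fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_YUYV;   // packed YUV 4:2:2
    ioctl(fd, VIDIOC_S_FMT, &fmt);

    // Request one memory-mapped buffer backed by the frame buffer in DDR.
    v4l2_requestbuffers req = {};
    req.count = 1;
    req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    req.memory = V4L2_MEMORY_MMAP;
    ioctl(fd, VIDIOC_REQBUFS, &req);

    v4l2_buffer buf = {};
    buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    buf.memory = V4L2_MEMORY_MMAP;
    buf.index = 0;
    ioctl(fd, VIDIOC_QUERYBUF, &buf);
    void *frame = mmap(nullptr, buf.length, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, buf.m.offset);

    // Queue the buffer, start streaming, and wait for one filled frame.
    ioctl(fd, VIDIOC_QBUF, &buf);
    int type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    ioctl(fd, VIDIOC_STREAMON, &type);
    ioctl(fd, VIDIOC_DQBUF, &buf);    // "frame" now holds one YUV 4:2:2 image

    // ...process the frame here, then clean up.
    ioctl(fd, VIDIOC_STREAMOFF, &type);
    munmap(frame, buf.length);
    close(fd);
    return 0;
}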

 

 

Image1.jpg

 

 

 

Once we get an image frame from the frame buffer, the software application can process it. The application will do the following things:

 

  1. Receive the input frame from the DDR SDRAM frame buffer using the V4L2 Linux Driver.
  2. Convert the input frame from the YUV 4:2:2 format received from the Blue Eagle camera into greyscale. This conversion extracts the luma component as the greyscale value.
  3. Perform the PICO object detection on the greyscale frame.
  4. Perform Sobel edge detection on the faces detected within the PICO object detector output.
  5. Perform further YUV to RGB conversion on the original received image frame.
  6. Use the OpenCV Circle function to highlight detected faces.
  7. Output the image to the HDMI port in the RGBA 8:8:8:8 format using the libdrm library within the Linux OS.

 

Looking at the above functions, not all of them can be accelerated into hardware. In this example, the conversion from YUV to greyscale, Sobel edge detection, and YUV-to-RGB conversion can be accelerated using the PL to increase performance.
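As an illustration of what one of these acceleration candidates looks like at the C level, here is a minimal greyscale-conversion sketch (hypothetical function and frame size, assuming a packed YUYV ordering for the 4:2:2 data, which may differ from the actual camera pipeline): the function is ordinary C/C++, and SDSoC synthesizes it into the PL when it is selected for acceleration.

#include <stdint.h>

#define WIDTH  1280
#define HEIGHT 720

// Extract the luma (Y) component from a packed YUYV 4:2:2 frame to form the
// greyscale image used by the PICO detector (step 2 in the list above).
// In a packed YUYV stream (Y0 U0 Y1 V0 ...), every even byte is a luma sample.
void yuyv_to_grey(const uint8_t in[WIDTH * HEIGHT * 2],
                  uint8_t grey[WIDTH * HEIGHT])
{
    for (int i = 0; i < WIDTH * HEIGHT; i++) {
        grey[i] = in[2 * i];
    }
}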

 

Moving these functions into the PL is as easy as selecting the two functions we wish to accelerate with hardware and then clicking on build to create the example.  

 

 

Image2.jpg

 

 

 

Once this was completed, the algorithm ran as expected using both the PS and PL in the Zynq SoC.

 

 

Image3.jpg

 

 

Using this approach allows us to exploit both the Zynq SoC’s PL and PS for image processing without the need to implement a fixed RTL design in Vivado. In short, this ability allows us to use a known good platform design to implement image capture and display across several different applications. Meanwhile, the use of SDSoC also allows us to exploit the Zynq SoC’s PL at a higher level without the need to develop the HDL from scratch, reducing development time.

 

 

My code is available on Github as always.

 

If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.

 

 

 

  • First Year E Book here
  • First Year Hardback here.

 

 

MicroZed Chronicles hardcopy.jpg 

  

 

  • Second Year E Book here
  • Second Year Hardback here

 

MicroZed Chronicles Second Year.jpg 

 

 

 

Later this month at the NAB Show in Las Vegas, you’ll be able to see several cutting-edge video demos based on the Xilinx Zynq SoC and Zynq UltraScale+ MPSoC in the Omnitek booth (C7915). First up is an HEVC video encoder demo using the embedded, hardened video codec built into the Zynq UltraScale+ ZU7EV MPSoC on a ZCU106 eval board. (For more information about the ZCU106 board, see “New Video: Zynq UltraScale+ EV MPSoCs encode and decode 4K60p H.264 and H.265 video in real time.”)

 

Next up is a demo of Omnitek’s HDMI 2.0 IP core, announced earlier this year. This core consists of separate transmit and receive subsystems. The HDMI 2.0 Rx subsystem converts an HDMI video stream (up to 4Kp60) into an RGB/YUV video AXI4-Stream and places AUX data in an auxiliary AXI4-Stream. The HDMI 2.0 Tx subsystem converts an RGB/YUV video AXI4-Stream plus AUX data into an HDMI video stream. This IP features a reduced resource count (a small footprint in the programmable logic) and low latency.

 

Finally, Omnitek will be demonstrating a new addition to its OSVP Video Processor Suite: a real-time Image Signal Processing (ISP) Pipeline Subsystem, which can create an RGB video stream from raw image-sensor outputs. The ISP pipeline includes blocks that perform image cropping, defective-pixel correction, black-level compensation, vignette correction, automatic white balancing, and Bayer filter demosaicing.

 

 

 

Omnitek ISP Pipeline Subsystem.jpg

 

 

Omnitek’s Image Signal Processing (ISP) Pipeline Subsystem

 

 

 

 

Both the HDMI 2.0 and ISP Pipeline Subsystem IP are already proven on Xilinx All Programmable devices including all 7 series devices (Artix-7, Kintex-7, and Virtex-7), Kintex UltraScale and Virtex UltraScale devices, Kintex UltraScale+ and Virtex UltraScale+ devices, and Zynq-7000 SoCs and Zynq UltraScale+ MPSoCs.

 

 

 

NMI, a non-profit organization dedicated to improving electronic engineering and manufacturing in the UK, is organizing a one-day machine-vision event on May 18 titled “Implementing Machine Vision with FPGA & SoC Platforms.” MBDA Missile Systems is hosting the event at its Stevenage location in the UK. (That’s roughly midway between London and Cambridge for those of us who are geographically challenged.)

 

Key themes for the event will include: OpenCV with FPGAs and SoCs, ADAS, Robotic Guided Vision/Drones, Industry 4.0, Defense, and Machine Learning.

 

Register here.

 

 

As a follow-on to last month’s announcement that RFEL had supplied the UK’s Defence Science and Technology Laboratory (DSTL) with two of its Zynq-based HALO Rapid Prototype Development Systems (RPDS), RFEL has now announced that DSTL has contracted with a three-company team to develop an adaptive, real-time, FPGA-based vision platform “to solve complex defence vision and surveillance problems, facilitating the rapid incorporation of best-in-class video processing algorithms while simultaneously bridging the gap between research prototypes and deployable equipment.” The three-company team comprises RFEL, 4Sight Imaging, and team leader Plextek.

 

The press release explains, “This innovative work draws together the best aspects of two approaches to video processing: high performance, bespoke FPGA processing supporting the computationally intensive tasks, and the flexibility (but lower performance) of CPU-based processing. This heterogeneous, hybrid approach is possible by using contemporary system-on-chip (SoC) devices, such as Xilinx’s Zynq devices, that provide embedded ARM CPUs with closely coupled FPGA fabric. The use of a modular FPGA design, with generic interfaces for each module, enables FPGA functions, which are traditionally inflexible, to be dynamically re-configured under software control.”

 

 

RFEL HALO RPDS.jpg
 

 

HALO Rapid Prototype Development Systems (RPDS)

 

 

 

 

  • For more information about the broad range of hardware, software, and development-tool technologies for vision-system development in the Xilinx reVISION stack, click here.

 

 

By Adam Taylor

 

 

Having introduced the Aldec TySOM-2 FPGA Prototyping Board, based on the Xilinx Zynq SoC, and the face-detection application running on it, I thought it would be a good idea to examine the face-detection application’s architecture in more detail.

 

The face-detection example uses one Blue Eagle camera, which is connected to the Aldec FMC-ADAS card. The processed frames showing the detected face are output via the TySOM-2 board’s HDMI port. What is worth pointing out is that the application running on the TySOM-2 board, face detection in this case, is enabled by the software. The Zynq PL (programmable logic) hardware design provides the capability to interface with the camera, to share the video frames with the Zynq PS (processing system) through the DDR SDRAM, and to drive the display output.

 

Any application could be implemented—not just face detection. It could be object tracking. It could be corner detection. It could be anything. This is one of the things that makes development of image-processing systems on the Zynq so powerful. We can use the same base platform on the TySOM-2 board and customize the application in software. Of course, we can also use the Xilinx SDSoC development environment to further accelerate the algorithm into the TySOM-2 platform’s remaining resources to increase performance.

 

The Blue Eagle camera transmits the video stream over an FPD-Link III link, which uses high-speed, bi-directional CML (Current Mode Logic) signaling to transfer the image data. An FPD-Link III receiving device (a TI DS90UB914Q-Q1 FPD-Link III SER/DES) on the ADAS FMC implements this camera interface. This device is configured for the application at hand using the I2C peripheral in the Zynq SoC’s PS, and it provides video to the Zynq PL in a parallel format: the parallel data bits, HSync, VSync, and a pixel clock.

 

 

Image1.jpg 

 

 

We need to process the frames and store them within the Zynq PS’ DDR SDRAM using Video DMA (Direct Memory Access) to ensure that we can access the image frames within DDR memory using the Zynq SoC’s ARM Cortex-A9 processor. We need to use several IP blocks that come as standard IP within Vivado to implement this. These IP blocks transfer data using the AXI streaming protocol (AXIS).

 

Therefore, the first thing needed is to convert the received video in parallel format into an AXIS stream. Once the video is in the correct format, we can use the VDMA IP block to transfer video data to and from the Zynq PS’ DDR SDRAM, where the software running on the Zynq SoC’s ARM Cortex-A9 processors can access the frames and implement the application algorithms.

 

Unlike previous examples we have examined, which used a single AXI High Performance (AXI HP) port, this example uses two of the Zynq SoC’s AXI HP interface ports, one in each direction. This configuration requires a slightly more complicated DMA architecture because we’ll need two VDMA IP blocks. Within the Zynq PL, the AXI standard used for most IP blocks is AXI 4.0, while the ports on the Zynq SoC implement AXI 3.0. Therefore, we need to use an AXI Interconnect or a protocol converter to convert between the two standards.

 

 

Image2.jpg

 

 

 

This use of two interfaces makes no performance difference compared to a single AXI HP interface because the S0 and S1 AXI HP ports on the Zynq SoC, which are used by this configuration, are multiplexed down to the M0 port on the memory interconnect and finally connected to the S3 port on the DDR SDRAM controller. This is shown below in the interconnection diagram from UG585, the TRM for the Zynq SoC.

 

 

 

Image3.jpg 

 

 

Once the VDMA is implemented, the design performs color-space conversion and chroma resampling, and then passes the video to an on-screen display module. Once this has been completed, the video stream must be converted from AXIS back to parallel video, which can then be output to the HDMI transmitter.

 

With this hardware platform completed, the next step is to write the software to create the application. For this we have the choice of using SDK or using SDSoC, which adds the ability to accelerate some of the application algorithm functions using programmable logic. As this example is implemented on the Zynq Z-7100 SoC, which has a significant amount of free, on-chip programmable resources following the implementation of the base platform, we’ll be using SDSoC for this example. We will look at the software architecture next time.

 

My code is available on Github as always.

 

If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.

 

 

 

  • First Year E Book here
  • First Year Hardback here.

 

 

MicroZed Chronicles hardcopy.jpg 

  

 

  • Second Year E Book here
  • Second Year Hardback here

 

MicroZed Chronicles Second Year.jpg 

 

Adam Taylor’s MicroZed Chronicles Part 183: Introducing the Aldec TySOM 2 and a Face-Detection app

by Xilinx Employee, 04-05-2017

 

By Adam Taylor

 

So far on this journey, most of the boards we have looked at have been fitted with either the Zynq Z-7010 or Z-7020 SoC. The new Aldec TySOM 2 board comes with either a Zynq Z-7045 or Z-7100 device fitted, making it the most powerful Zynq-based board we have looked at to date, especially with the Z-7100 SoC fitted, as on the example board Aldec has provided to me.

 

 

 

Aldec Tysom 2 - Adam Taylor.jpg

 

 

 

The TySOM 2 board is intended for development prototyping. As such, it provides you with a range of I/O pins, broken out on two FMC connectors that connect to 288 of the Zynq Z-7100 SoC’s 362 I/O pins and all 16 GTX lanes. It also provides some simple user peripherals including switches and LEDs, along with an HDMI port connected to the Zynq SoC’s PL (programmable logic). Meanwhile, the Zynq PS (processing system) provides four USB 2.0 ports, Ethernet, and a USB/UART for connectivity, plus 1Gbyte of DDR memory. In short, the Aldec TySOM 2 board has everything we need to create a very powerful single-board computer.

 

Here’s a block diagram of the TySOM 2 board:

 

 

 

Aldec TySOM 2 Block Diagram.jpg

 

Aldec TySOM 2 board block diagram

 

 

Of course, there’s a range of FPGA Mezzanine Cards (FMCs) available from Aldec and other vendors to enable prototyping over a wide range of applications including vision, IIoT, and ADAS. Aldec supplied my board with the ADAS daughter board, which enables the connection of up to five cameras using FPD-Link III connections. Because FMC is an ANSI standard, there is a wide range of 3rd-party FMCs available, which further widens the prototyping options to support applications such as Software Defined Radio.

 

As I mentioned before, the Zynq Z-7100 SoC is the most powerful Zynq device we have examined to date. So what does the Z-7100 bring to the party that we have not seen before (not including the PL’s increased logic resources)? The most obvious addition is the provision of the 16 GTX transceivers that support data rates to 12.5Gbps. You can also use these high-speed serial links to implement Gen1 (2.5 Gbps) and Gen2 (5.0 Gbps) PCIe interfaces. Multi-lane solutions are also possible. The Z-7100 can support as many as 8 lanes if we so desire.

 

We also gain access to high-performance I/O pins for the first time, which introduce digitally controlled, on-chip termination for better signal integrity. Zynq Z-7020 devices and below provide only High Range (HR) I/O, which handles a wider range of I/O voltages (1.2V to 3.3V), although with reduced performance. When it comes to logic resources, the Zynq Z-7100 SoC is very impressive. It gives us 444K logic cells, 2020 DSP slices, 26.5Mbits of block RAM, and 554,800 flip-flops.

 

We will look in more detail at how we can use this development board over the next few weeks. However, Aldec shipped this board pre-installed with a face-detection application, which connects to a single camera using the ADAS FMC and to an HDMI display. When I connected it all up and ran the application, the example sprang to life and detected my face as I moved about in front of the supplied camera:

 

 

Aldec Face Detection using a Zynq Z-7100.jpg 

 

 

My code is available on Github as always.

 

If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.

 

 

 

  • First Year E Book here
  • First Year Hardback here.

 

 

MicroZed Chronicles hardcopy.jpg 

 

  

 

  • Second Year E Book here
  • Second Year Hardback here

 

 

 

MicroZed Chronicles Second Year.jpg 

Mentor’s DRS360 autonomous driving platform is based on the Xilinx Zynq UltraScale+ MPSoC

by Xilinx Employee, 04-04-2017

 

Mentor has just announced the DRS360 platform for developing autonomous driving systems based on the Xilinx Zynq UltraScale+ MPSoC. The automotive-grade DRS360 platform is already designed and tested for deployment in ISO 26262 ASIL D-compliant systems.

 

This platform offers comprehensive sensor-fusion capabilities for multiple cameras, radar, LIDAR, and other sensors while offering “dramatic improvements in latency reduction, sensing accuracy and overall system efficiency required for SAE Level 5 autonomous vehicles.” In particular, the DRS360 platform’s use of the Zynq UltraScale+ MPSoC permits the use of “raw data sensors,” thus avoiding the power, cost, and size penalties of microcontrollers and the added latency of local processing at the sensor nodes.

 

Eliminating pre-processing microcontrollers from all system sensor nodes brings many advantages to the autonomous-driving system design including improved real-time performance, significant reductions in system cost and complexity, and access to all of the captured sensor data for a maximum-resolution, unfiltered model of the vehicle’s environment and driving conditions.

 

Mentor DRS360 platform Block diagram.jpg

 

 

Rather than trying to scale up lower levels of ADAS, Mentor’s DRS360 platform is optimized for Level 5 autonomous driving and engineered to scale down easily to Levels 4, 3, and even 2. This approach makes it far easier to develop a system at the appropriate level because the DRS360 platform is already designed to handle the most complex tasks from the beginning.

 

 

 

 

 

If you’re working with any sort of video, there’s a new 4-minute demo video you need to see. This video shows two new Zynq UltraScale+ EV MPSoC devices working in tandem to decode and display 4K60p streaming video in both H.264 and H.265 video formats in real time. Zynq UltraScale+ EV MPSoC devices incorporate hardened, low-latency H.264 and H.265 video codecs (encode and decode). The demo employs two Xilinx ZCU106 boards in the following configuration:

 

 

 

Zynq UltraScale Plus EV Video Codec Demo Diagram.jpg

 

 

 

The first ZCU106 extracts the 4K60p video stream from a USB stick at 60Mbps, decodes the video, and displays it on a local monitor using a DisplayPort interface. At the same time, the on-board Zynq UltraScale+ EV device re-encodes the video using the on-chip H.265 encoder, which reduces the video bit rate to 10Mbps thanks to the improved encoding efficiency of the H.265 standard. The board then transmits the resulting 10Mbps video stream over a wired Ethernet connection to a second ZCU106 board, which decodes the video and displays it on a second monitor. The entire process occurs with such low latency that it’s hard to see any delay between the two displayed video streams.

 

Here’s the video demo:

 

 

 

 

 

 

 

Here’s a 40-minute teardown video of a Vision Research Phantom v5 high-speed, 1024x1024-pixel, 1000-frames/sec video camera (circa 2001) from tesla500’s YouTube video channel. His methodical teardown and excellent system-level explanation uncover a couple of “huge” Xilinx XC4020 FPGAs (circa 2000) on the timing and interface boards and Xilinx XC9500 CPLDs implementing the timing and control on the four high-speed capture-memory boards. There’s also a Hitachi SH-2 32-bit RISC processor with a hardware MAC (for DSP) on the timing board.

 

The XC4020 FPGAs are 3rd-generation devices that each have 784 CLBs (1560 LUTs total). They were big in their day but they’re very small now. These days, I think you could implement all of the digital timing and control circuitry in this camera including the SH-2 processor’s capabilities using the smallest single-core Zynq Z-7007S SoC—with the ARM Cortex-A9 processor in the Zynq SoC running considerably more than 20x faster than the turn-of-the-millennium SH-2 processor’s roughly 28MHz maximum clock rate.

 

Of course, Vision Research has moved far beyond 1000 frames/sec over the past 17 years. Its latest cameras can go 1000x faster than that, hitting 1M frames/sec when configured with the company’s FAST option (fast indeed!), while the Phantom v5 is no longer listed even on the company’s “discontinued cameras” page. Nevertheless, I found tesla500’s teardown and explanations fascinating and valuable.

 

Of course, Xilinx All Programmable devices have long been used to design advanced video equipment like the Vision Research Phantom v5 high-speed camera, which allows me to quickly remind you of the recent launch of the Xilinx reVISION stack for embedded-vision applications. (See “Xilinx reVISION stack pushes machine learning for vision-guided applications all the way to the edge.”)

 

And now, here’s tesla500’s Vision Research Phantom v5 high-speed camera teardown video:

 

 

 

 

 

 

 

 

Xcell Daily discussed DeePhi Tech’s Zynq-based CNN acceleration processor last year in connection with the Hot Chips 2016 conference. (See “DeePhi’s Zynq-based CNN processor is faster, more energy efficient than CPUs or GPUs.”) DeePhi’s founder Song Yao appears in a new Powered by Xilinx video this week giving many more details including some fascinating information about an early customer, ZeroTech—China’s second largest drone maker.

 

DeePhi provides the entire stack needed to develop machine-learning applications based on neural networks including the development software, algorithms, and a neural-network processor that runs efficiently on the Xilinx Zynq SoC. This technology is particularly good for deep-learning, vision-based embedded apps such as drones, robotics, surveillance cameras, and for cloud-computing applications as well.

 

The video also provides more details on ZeroTech’s use of DeePhi’s machine-learning technology for object detection, pedestrian detection, and gesture recognition—all in a drone that nestles in your hand.

 

Song Yao explains that DeePhi’s tools provide a GPU-like development environment while taking advantage of the superior efficiency of neural networks implemented with programmable logic. In addition, DeePhi can change the neural network’s architecture to further optimize the design for specific applications.

 

Finally, he explains that you can use these Zynq-based implementations in applications where GPUs will simply not work due to power-consumption restrictions. In fact, last year at Hot Chips 2016 he reportedly said, “The FPGA based DPU platform achieves an order of magnitude higher energy efficiency over GPU on image recognition and speech detection.”

 

Here’s the new, 3-minute Powered by Xilinx video:

 

 

 

 

How to use machine learning for embedded vision—and many other embedded applications

by Xilinx Employee, 03-30-2017

 

Image3.jpg

Adam Taylor and Xilinx’s Sr. Product Manager for SDSoC and Embedded Vision Nick Ni have just published an article on the EE News Europe Web site titled “Machine learning in embedded vision applications.” That title’s pretty self-explanatory, but there are a few points I’d like to highlight. Then you can go read the full article yourself.

 

As the article states, “Machine learning spans several industry mega trends, playing a very prominent role within not only Embedded Vision (EV), but also Industrial Internet of Things (IIoT) and Cloud Computing.” In other words, if you’re designing products for any embedded market, you might well find yourself at a competitive disadvantage if you’re not adding machine-learning features to your road map.

 

This article closely ties machine learning with neural networks (including Feed-forward Neural Networks (FNNs), Recurrent Neural Networks (RNNs), Deep Neural Networks (DNNs), and Convolutional Neural Networks (CNNs)). Neural networks are not programmed; they’re trained. Then, if they’re part of an embedded design, they’re deployed. Training is usually done using floating-point neural-network implementations but, for efficiency (power and cost), deployed neural networks can use fixed-point representations with very little or no loss of accuracy. (See “Counter-Intuitive: Fixed-Point Deep-Learning Inference Delivers 2x to 6x Better CNN Performance with Great Accuracy.”)
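To make the float-to-fixed step concrete, here is a minimal sketch of a simple symmetric INT8 weight quantization (purely illustrative; the actual Xilinx and framework tool flows are more sophisticated than this):

#include <cstdint>
#include <cstddef>
#include <cmath>
#include <algorithm>
#include <vector>

// Quantize trained floating-point weights to 8-bit fixed point with a single
// symmetric scale factor. Inference then runs on int8 values, which map
// efficiently onto FPGA DSP slices and logic.
std::vector<int8_t> quantize_int8(const std::vector<float>& w, float& scale)
{
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;   // real-world value of 1 LSB

    std::vector<int8_t> q(w.size());
    for (std::size_t i = 0; i < w.size(); i++) {
        long v = std::lround(w[i] / scale);
        q[i] = static_cast<int8_t>(std::min(127L, std::max(-127L, v)));
    }
    return q;   // approximate reconstruction: w[i] is roughly q[i] * scale
}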

 

The programmable logic inside of Xilinx FPGAs, Zynq SoCs, and Zynq UltraScale+ MPSoCs is especially good at implementing fixed-point neural networks, as described in this article by Nick Ni and Adam Taylor. (Go read the article!)

 

Meanwhile, this is a good time to remind you of the recent Xilinx introduction of the reVISION stack for neural network development using Xilinx All Programmable devices. For more information about the Xilinx reVISION stack, see:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Image Matters’ Origami B20 module, based on a Xilinx Kintex UltraScale KU060 FPGA, is a small 94x53mm module that you can use to perform all sorts of high-speed processing. (See “Image Matters launches Origami Ecosystem for developing advanced 4K/8K video apps using the FPGA-based Origami module.”) For example, you can use it for a variety of video-compression applications using various IP compression cores including MPEG, JPEG-2000, and TICO. You can also use it for cloud-computing and neural-network applications such as image detection. The key thing is that the Origami B20 module puts everything you need to run the FPGA on one small module, including SDRAM, Flash memory, the power supply, a backup battery, and security features (including tamper protection).

 

Here’s a short, 2.5-minute, Powered by Xilinx video with more information about the Origami B20 module:

 

 

 

 

 

By Adam Taylor

 

A couple of weeks ago, I talked about the Xilinx reVISION stack and the support it provides for OpenVX and OpenCV. One of the most exciting things I explained was how we could accelerate several OpenCV functions (which include the OpenVX core functions) using the Zynq SoC’s programmable logic. What I did not look at was the other significant part of the reVISION stack: its support for machine learning.

 

Machine learning is increasingly important for embedded-vision applications because it helps systems evolve from being vision-enabled to being vision-guided autonomous systems. Machine learning is often used in embedded-vision applications to identify and classify information contained within an image. The embedded-vision system uses these identifications and classifications to make informed decisions in real time, enabling increased interaction with the environment.

 

For those unfamiliar with machine learning, it is most often implemented by creating and training a neural network. Neural networks are modelled on the human cerebral cortex in that each neuron receives an input, processes it, and communicates the processed signal to another neuron. Neural networks typically consist of an input layer, internal (hidden) layer(s), and an output layer.

 

 

Image1.jpg

 

 

 

Those familiar with machine learning may have come across the term “deep learning.” This is where there are several hidden layers in the neural network, allowing more complex machine-learning algorithms to be implemented.

 

When working with neural networks in embedded-vision applications, we need to use a 2D network. This is where Convolutional Neural Networks (CNNs) are used. CNNs are deep-learning networks that contain several convolutional and sub-sampling layers along with a separate, fully connected network to perform the final classification. Within the convolution layer, the input image will be broken down into several overlapping smaller tiles.

 

The results from this convolution layer are passed through an activation layer to create an activation map, which is then sub-sampled and fed through additional stages before the final, fully connected network performs the classification. The exact implementation of the CNN varies depending upon the network architecture chosen (GoogLeNet, SSD, AlexNet). However, a CNN will typically contain at least the following elements (a minimal sketch of these stages follows the list):

 

 

  • Convolution – Identifies features within the image
  • Rectified Linear Unit (reLU) – Activation layer that creates an activation map following the convolution
  • Max Pooling – Performs sub-sampling between layers
  • Fully Connected layer – Performs the final classification
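For readers who have not seen these stages written out in code, here is a minimal, self-contained sketch of a single convolution + ReLU + max-pooling pass (toy input size and kernel values, purely illustrative; real networks stack many such layers and learn the kernel weights during training):

#include <algorithm>
#include <cstdio>

int main()
{
    const int N = 6;
    float in[N][N] = {};                     // toy 6x6 input "image"
    in[2][2] = 1.0f; in[3][3] = 2.0f;

    float k[3][3] = {{0, 1, 0},              // toy 3x3 convolution kernel
                     {1, 4, 1},
                     {0, 1, 0}};

    float conv[N - 2][N - 2];                // "valid" convolution + ReLU
    for (int y = 0; y < N - 2; y++)
        for (int x = 0; x < N - 2; x++) {
            float acc = 0.0f;
            for (int ky = 0; ky < 3; ky++)
                for (int kx = 0; kx < 3; kx++)
                    acc += k[ky][kx] * in[y + ky][x + kx];
            conv[y][x] = std::max(0.0f, acc);          // ReLU activation
        }

    float pooled[2][2];                      // 2x2 max pooling (sub-sampling)
    for (int y = 0; y < 2; y++)
        for (int x = 0; x < 2; x++)
            pooled[y][x] = std::max(
                std::max(conv[2 * y][2 * x],     conv[2 * y][2 * x + 1]),
                std::max(conv[2 * y + 1][2 * x], conv[2 * y + 1][2 * x + 1]));

    printf("pooled[0][0] = %f\n", pooled[0][0]);       // feeds the next layer
    return 0;
}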

 

 

The weights used for each of these elements are determined via training, and one of the CNN’s advantages is the relative ease of training the network. Training requires large data sets and high-performance computers to correctly determine the weights for each stage.

 

To ease the development of machine-learning applications, many engineers use a framework like Caffe, which supports the implementation and training of machine learning. The use of frameworks allows us to work at a higher level and maximize reuse. Using a framework, we don’t need to start from scratch each time we develop an application.

 

The Xilinx reVISION stack provides an integrated Caffe framework flow, which allows us to take the prototxt definition of the network and the trained weights and deploy the machine-learning application. (Note that network training is separate and distinct from deployment.) To enable this, the Xilinx reVISION stack provides several hardware-accelerated functions that can be implemented within the Zynq SoC’s or Zynq UltraScale+ MPSoC’s PL (programmable logic) to create the machine-learning inference engine. The reVISION stack also provides examples for a wide range of network structures, enabling us to get up and running with our machine-learning application without the need to initially compile the PL design. Once we are happy with the machine-learning application, we can then use the SDSoC flow to develop our own embedded-vision application containing the optimized machine-learning application.

 

 

Image2.jpg 

 

 

Using the Zynq PL provides an optimal implementation that delivers faster response times when interacting with the embedded-vision system’s environment. This is especially true as machine-learning applications are increasingly implemented using fixed-point integers like INT8, which are ideal for implementation in DSP elements.
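A minimal sketch of the kind of INT8 multiply-accumulate that maps onto those DSP elements (illustrative only, not the reVISION library code):

#include <cstdint>
#include <cstddef>

// INT8 dot product with a wide accumulator, the core operation of a
// fixed-point inference engine. Each 8-bit multiply-accumulate maps
// naturally onto a DSP slice in the programmable logic.
int32_t dot_int8(const int8_t* activations, const int8_t* weights, std::size_t n)
{
    int32_t acc = 0;
    for (std::size_t i = 0; i < n; i++) {
        acc += static_cast<int32_t>(activations[i]) * static_cast<int32_t>(weights[i]);
    }
    return acc;
}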

 

Machine learning is going to be a hot area for several applications. So I will be coming back to this topic in detail as the MicroZed Chronicles progress—with some examples of course.

 

 

If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.

 

 

 

  • First Year E Book here
  • First Year Hardback here.

 

 

MicroZed Chronicles hardcopy.jpg 

  

 

 

  • Second Year E Book here
  • Second Year Hardback here

 

 

 

MicroZed Chronicles Second Year.jpg 

 

 

The just-announced VICO-4 TICO SDI Converter from Village Island employs visually lossless 4:1 TICO compression to funnel 4K60p video (arriving on four 3G-SDI video streams or one 12G-SDI stream) onto a single 3G-SDI output stream, which reduces infrastructure costs for transport, cabling, routing, and compression in broadcast networks.
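A rough bandwidth check shows why a 4:1 ratio is enough (this assumes a 10-bit, 4:2:2 4K60p source and roughly 3Gbps of 3G-SDI capacity; both figures are assumptions for illustration, not Village Island’s numbers):

\[
3840 \times 2160 \times 60\ \mathrm{frames/s} \times 20\ \mathrm{bits/pixel} \approx 9.95\ \mathrm{Gbps},
\qquad
\frac{9.95\ \mathrm{Gbps}}{4} \approx 2.5\ \mathrm{Gbps},
\]

which fits within a single 3G-SDI link.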

 

 

 

Village Island VICO-4.jpg

 

 

VICO-4 4:1 SDI Converter from Village Island

 

 

 

Here’s a block diagram of what’s going on inside of Village Island’s VICO-4 TICO SDI Converter:

 

 

Village Island VICO-4 Block Diagram.jpg 

 

And here’s a diagram showing you what broadcasters can do with this sort of box:

 

 

Village Island VICO-4 Distribution Diagram.jpg

 

 

 

The reason this is even possible in a real-time broadcast environment is that the lightweight intoPIX TICO compression algorithm has very low latency (just a few video lines) when implemented in hardware as IP. (Software-based, frame-by-frame video compression is totally out of the question in an application such as this because it introduces too much delay.)

 

Looking at the VICO-4’s main (and only) circuit board shows one main chip implementing the 4:1 compression and signal multiplexing. And that chip is… a Xilinx Kintex UltraScale KU035 FPGA. It has plenty of on-chip programmable logic for the TICO compression IP, and its sixteen 16.3Gbps transceiver ports are more than enough to handle the 3G- and 12G-SDI I/O required by this application.

 

 

Village Island VICO-4 pcb.jpg 

 

 

Note: Paltek is distributing Village Island’s VICO-4 board in Japan as an OEM component. The board needs 12Vdc at ~25VA.

 

 

 

For more information about TICO compression IP, see:

 

 

 

 

 

 

 

Laser-based, industrial 3D Camera from VRmagic resolves complex surfaces with 1/64 sub-pixel accuracy

by Xilinx Employee, 03-23-2017


VRmagic LineCam3D.jpg

A configurable, COG (center-of-gravity) laser-line extraction algorithm allows VRmagic’s LineCam3D to resolve complex surface contours with 1/64 sub-pixel accuracy. (The actual measurement precision, which can be as small as a micrometer, depends on the optics attached to the camera.) The camera must process the captured video internally because, at its maximum 1KHz scan rate, there would be far more raw contour data than can be pumped over the camera’s GigE Vision interface. The algorithm therefore runs in real time on the camera’s internal Xilinx 7 series FPGA, which is paired with a TI DaVinci SoC to handle other processing chores and 2Gbytes of DDR3 SDRAM. The camera’s imager is a 2048x1088-pixel CMOSIS CMV2000 CMOS image sensor with a pipelined global shutter. The VRmagic LineCam3D also has a 2D imaging mode that permits the extraction of additional object information, such as surface printing, that would not appear on the contour scans (as demonstrated in the photo below).
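To give a feel for the algorithm, here is a minimal center-of-gravity sketch for a single sensor column (illustrative only; the camera’s FPGA implementation is configurable, fully pipelined, and processes every column of every scan in real time):

#include <cstdint>

// Center-of-gravity (COG) extraction of the laser-line position in one sensor
// column: the sub-pixel peak position is the intensity-weighted mean of the
// row indices. The result is returned in 1/64-pixel units, matching the
// camera's stated sub-pixel resolution.
int32_t laser_line_cog_q6(const uint16_t* column, int rows, uint16_t threshold)
{
    uint64_t weighted = 0;   // sum of (row index * intensity)
    uint64_t total    = 0;   // sum of intensity
    for (int y = 0; y < rows; y++) {
        if (column[y] >= threshold) {          // only pixels lit by the laser
            weighted += static_cast<uint64_t>(y) * column[y];
            total    += column[y];
        }
    }
    if (total == 0) return -1;                 // no laser line in this column
    return static_cast<int32_t>((weighted * 64) / total);   // Q6 fixed point
}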

 

Here’s a composite photo of the camera’s line-scan contour output (upper left), the original object being scanned (lower left), and the image of the object constructed from the contour scans (right):

 

 

VRmagic LineCam3D Output.jpg 

 

In laser-triangulation measurement setups, the camera’s lens plane is not parallel to the scanned object’s image plane, which means that only a relatively small part of the laser-scanned image would normally be in focus due to limited depth of focus. To compensate for this, the LineCam3D integrates a 10° tilt-shift adapter into its rugged IP65/67 aluminum housing, to expand the maximum in-focus object height. Anyone familiar with photographic tilt-shift lenses—mainly used for architectural photography in the non-industrial world—immediately recognizes this as the Scheimpflug principle, which increases depth of focus by tilting the lens relative to both the imager plane and the subject plane. It’s fascinating that this industrial camera incorporates this ability into the camera body so that any C-mount lens can be used as a tilt-shift lens.

 

 

For more information about the LineCam3D camera, please contact VRmagic directly.

 

 

 

The Sundance DSP PXIe700 module is a 3U PXIe card with an on-board Xilinx Kintex-7 FPGA (a 325T or a 410T), so it can perform nearly any signal-processing or control task you can imagine.

 

 

Sundance PXIe-700 Kintex-7 Module Photo.jpg 

 

 

Sundance PXIe700 module based on a Xilinx Kintex-7 FPGA

 

 

Here’s a block diagram of the Sundance PXIe700 module:

 

 

Sundance PXIe-700 Kintex-7 Module.jpg

 

 

Sundance PXIe700 module based on a Xilinx Kintex-7 FPGA, Block Diagram

 

 

 

Sundance provides this board with the SCom IP core, which communicates with the host through the PCIe interface and provides the user logic instantiated in the Kintex-7 FPGA with a multichannel streaming interface to the host CPU, along with sample applications. Other IP cores, a Windows driver, a DLL, and user-interface software are also available. The PXIe700 data sheet also mentions a VideoGuru toolset that can turn this hardware into a video test center for NTSC, VGA, DVI, SMPTE, GigE Vision, and other video standards. (Contact Sundance DSP for more details.)

 

The Sundance product page also shows the PXIe700 board with a couple of Sundance FMC modules attached as example applications:

 

 

Sundance PXIe-700 With attached FMC-DAQ2p5.jpg 

 

 

Sundance PXIe700 with attached FMC-DAQ2p5 multi-Gsample/sec ADC and DAC card

 

 

 

 

Sundance PXIe-700 With attached FMC-ADC500-5.jpg

 

 

Sundance PXIe700 with attached FMC-ADC500-5 5-channel, 500Msamples/sec, 16-bit ADC card

 

 

 

 

 

 

I did not go to Embedded World in Nuremberg this week, but apparently SemiWiki’s Bernard Murphy was there, and he’s published his observations about three Zynq-based reference designs that he saw running in Aldec’s booth on the company’s Zynq-based TySOM embedded dev and prototyping boards.

 

 

Aldec TySOM-2 Prototyping Board.jpg

 

Aldec TySOM-2 Embedded Prototyping Board

 

 

 

Murphy published this article titled “Aldec Swings for the Fences” on SemiWiki and wrote:

 

 

“At the show, Aldec provided insight into using the solution to model the ARM core running in QEMU, together with a MIPI CSI-2 solution running in the FPGA. But Aldec didn’t stop there. They also showed off three reference designs designed using this flow and built on their TySOM boards.

 

“The first reference design targets multi-camera surround view for ADAS (automotive – advanced driver assistance systems). Camera inputs come from four First Sensor Blue Eagle systems, which must be processed simultaneously in real-time. A lot of this is handled in software running on the Zynq ARM cores but the computationally-intensive work, including edge detection, colorspace conversion and frame-merging, is handled in the FPGA. ADAS is one of the hottest areas in the market and likely to get hotter since Intel just acquired Mobileye.

 

“The next reference design targets IoT gateways – also hot. Cloud interface, through protocols like MQTT, is handled by the processors. The gateway supports connection to edge devices using wireless and wired protocols including Bluetooth, ZigBee, Wi-Fi and USB.

 

“Face detection for building security, device access and identifying evil-doers is also growing fast. The third reference design is targeted at this application, using similar capabilities to those on the ADAS board, but here managing real-time streaming video as 1280x720 at 30 frames per second, from an HDR-CMOS image sensor.”

 

The article contains a photo of the Aldec TySOM-2 Embedded Prototyping Board, which is based on a Xilinx Zynq Z-7045 SoC. According to Murphy, Aldec developed the reference designs using its own and other design tools including the Aldec Riviera-PRO simulator and QEMU. (For more information about the Zynq-specific QEMU processor emulator, see “The Xilinx version of QEMU handles ARM Cortex-A53, Cortex-R5, Cortex-A9, and MicroBlaze.”)

 

Then Murphy wrote this:

 

“So yes, Aldec put together a solution combining their simulator with QEMU emulation and perhaps that wouldn’t justify a technical paper in DVCon. But business-wise they look like they are starting on a much bigger path. They’re enabling FPGA-based system prototype and build in some of the hottest areas in systems today and they make these solutions affordable for design teams with much more constrained budgets than are available to the leaders in these fields.”

 

 

 

Image3.jpg

AEye is the latest iteration of the eye-tracking technology developed by EyeTech Digital Systems. The AEye chip is based on the Zynq Z-7020 SoC. It’s located immediately adjacent to the imaging sensor, which creates compact, stand-alone systems. This technology is finding its way into diverse vision-guided systems in the automotive, AR/VR, and medical diagnostic arenas. According to EyeTech, the Zynq SoC’s unique abilities allow the company to create products it could not build any other way.

 

With the advent of the reVISION stack, EyeTech is looking to expand its product offerings into machine learning, as discussed in this short, 3-minute video:

 

 

 

 

 

 

For more information about EyeTech, see:

 

 

 

 

EETimes’ Junko Yoshida with some expert help analyzes this week’s Xilinx reVISION announcement

by Xilinx Employee, 03-15-2017

 

Image3.jpg

This week, EETimes’ Junko Yoshida published an article titled “Xilinx AI Engine Steers New Course” that gathers some comments from industry experts and from Xilinx with respect to Monday’s reVISION stack announcement. To recap, the Xilinx reVISION stack is a comprehensive suite of industry-standard resources for developing advanced embedded-vision systems based on machine learning and machine inference.

 

(See “Xilinx reVISION stack pushes machine learning for vision-guided applications all the way to the edge.”)

 

As Xilinx Senior Vice President of Corporate Strategy Steve Glaser tells Yoshida, “Xilinx designed the stack to ‘enable a much broader set of software and systems engineers, with little or no hardware design expertise, to develop intelligent vision guided systems easier and faster.’”

 

Yoshida continues:

 

While talking to customers who have already begun developing machine-learning technologies, Xilinx identified ‘8 bit and below fixed point precision’ as the key to significantly improve efficiency in machine-learning inference systems.

 

 

Yoshida also interviewed Karl Freund, Senior Analyst for HPC and Deep Learning at Moor Insights & Strategy, who said:

 

Artificial Intelligence remains in its infancy, and rapid change is the only constant.” In this circumstance, Xilinx seeks “to ease the programming burden to enable designers to accelerate their applications as they experiment and deploy the best solutions as rapidly as possible in a highly competitive industry.

 

 

She also quotes Loring Wirbel, a Senior Analyst at The Linley group, who said:

 

What’s interesting in Xilinx's software offering, [is that] this builds upon the original stack for cloud-based unsupervised inference, Reconfigurable Acceleration Stack, and expands inference capabilities to the network edge and embedded applications. One might say they took a backward approach versus the rest of the industry. But I see machine-learning product developers going a variety of directions in trained and inference subsystems. At this point, there's no right way or wrong way.

 

 

There’s a lot more information in the EETimes article, so you might want to take a look for yourself.

 

 

 

Zynq + PYNQ + Python + BNNs: Machine inference does not get any easier… or faster

by Xilinx Employee, 03-14-2017

 

Machine learning and machine inference based on CNNs (convolutional neural networks) are the latest way to classify images and, as I wrote in Monday’s blog post about the new Xilinx reVISION announcement, “The last two years have generated more machine-learning technology than all of the advancements over the previous 45 years and that pace isn't slowing down.” (See “Xilinx reVISION stack pushes machine learning for vision-guided applications all the way to the edge.”) The challenge now is to make the CNNs run faster while consuming less power. It would be nice to make them easier to use as well.

 

OK, that’s a setup. A paper published last month at the 25th International Symposium on Field Programmable Gate Arrays titled “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference” describes a method to speed up CNN-based inference while cutting power consumption by reducing CNN precision in the inference machines. As the paper states:

 

…a growing body of research demonstrates this approach [CNN] incorporates significant redundancy. Recently, it has been shown that neural networks can classify accurately using one- or two-bit quantization for weights and activations.  Such a combination of low-precision arithmetic and small memory footprint presents a unique opportunity for fast and energy-efficient image classification using Field Programmable Gate Arrays (FPGAs). FPGAs have much higher theoretical peak performance for binary operations compared to floating point, while the small memory footprint removes the off-chip memory bottleneck by keeping parameters on-chip, even for large networks. Binarized Neural Networks (BNNs), proposed by Courbariaux et al., are particularly appealing since they can be implemented almost entirely with binary operations, with the potential to attain performance in the teraoperations per second (TOPS) range on FPGAs.
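
To make that “binary operations” point concrete, here is a minimal C++ sketch of the XNOR-popcount trick that underlies binarized inference. This is my own illustration of the general technique, not code from the FINN framework; the function name, 64-bit word width, and bit-packing convention are assumptions.

#include <cstdint>
#include <bitset>

// Minimal sketch of binarized-inference arithmetic (illustrative only, not FINN code).
// Weights and activations are constrained to {-1, +1} and packed one bit per value
// (1 encodes +1, 0 encodes -1). A dot product then reduces to XNOR plus popcount,
// which maps onto FPGA LUTs instead of DSP multipliers.
// Assumes the vector length 'bits' is a multiple of 64 so no padding bits are counted.
int binary_dot(const uint64_t* weights, const uint64_t* activations, int words, int bits)
{
    int matches = 0;
    for (int i = 0; i < words; ++i) {
        matches += static_cast<int>(std::bitset<64>(~(weights[i] ^ activations[i])).count());
    }
    // matches - mismatches = 2 * matches - bits
    return 2 * matches - bits;
}

In a BNN this sum is typically compared against a threshold rather than accumulated in floating point, which is part of why the authors can keep all parameters on-chip and reach TOPS-range throughput.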

 

The paper then describes the techniques developed by the authors to generate BNNs and instantiate them into FPGAs. The results, based on experiments using a Xilinx ZC706 eval kit built around a Zynq Z-7045 SoC, are impressive:

 

When it comes to pure image throughput, our designs outperform all others. For the MNIST dataset, we achieve an FPS which is over 48/6x over the nearest highest throughput design [1] for our SFC-max/LFC-max designs respectively. While our SFC-max design has lower accuracy than the networks implemented by Alemdar et al., our LFC-max design outperforms their nearest accuracy design by over 6/1.9x for throughput and FPS/W respectively. For other datasets, our CNV-max design outperforms TrueNorth for FPS by over 17/8x for the CIFAR-10 / SVHN datasets respectively, while achieving 9.44x higher throughput than the design by Ovtcharov et al., and 2.2x over the fastest results reported by Hegde et al. Our prototypes have classification accuracy within 3% of the other low-precision works, and could have been improved by using larger BNNs.

 

There’s something even more impressive, however. This design approach to creating BNNs is so scalable that it now runs on a low-end platform—the $229 Digilent PYNQ-Z1. (Digilent’s academic price for the PYNQ-Z1 is only $65!) Xilinx Research Labs in Ireland, NTNU (Norwegian U. of Science and Technology), and the U. of Sydney have released an open-source Binarized Neural Network (BNN) overlay for the PYNQ-Z1 based on the work described in the above paper.

 

According to Giulio Gambardella of Xilinx Research Labs, “…running on the PYNQ-Z1 (a smaller Zynq Z-7020), [the overlay] can achieve 168,000 image classifications per second with 102µsec latency on the MNIST dataset with 98.40% accuracy, and 1700 images per second with 2.2msec latency on the CIFAR-10, SVHN, and GTSRB datasets, with 80.1%, 96.69%, and 97.66% accuracy respectively, running at under 2.5W.”

 

 

PYNQ-Z1.jpg

 

Digilent PYNQ-Z1 board, based on a Xilinx Zynq Z-7020 SoC

 

 

 

Because the PYNQ-Z1 programming environment centers on Python and the Jupyter development environment, the package includes a number of Jupyter notebooks that demonstrate what the overlay can do through live code running on the PYNQ-Z1 board, along with equations, visualizations, explanatory text, and program results (including images).

 

There are also examples of this BNN in practical application:

 

 

 

 

For more information about the Digilent PYNQ-Z1 board, see “Python + Zynq = PYNQ, which runs on Digilent’s new $229 pink PYNQ-Z1 Python Productivity Package.”

 

 

 

EEJournal’s Kevin Morris weighs in on Monday’s Xilinx reVISION stack launch for embedded-vision apps

by Xilinx Employee ‎03-14-2017 01:35 PM - edited ‎03-22-2017 07:20 AM (1,144 Views)

 

Image3.jpg

Today, EEJournal’s Kevin Morris published a review article titled “Teaching Machines to See: Xilinx Launches reVISION” following Monday’s announcement of the Xilinx reVISION stack for developing vision-guided applications. (See “Xilinx reVISION stack pushes machine learning for vision-guided applications all the way to the edge.”)

 

Morris writes:

 

“But vision is one of the most challenging computational problems of our era. High-resolution cameras generate massive amounts of data, and processing that information in real time requires enormous computing power. Even the fastest conventional processors are not up to the task, and some kind of hardware acceleration is mandatory at the edge. Hardware acceleration options are limited, however. GPUs require too much power for most edge applications, and custom ASICs or dedicated ASSPs are horrifically expensive to create and don’t have the flexibility to keep up with changing requirements and algorithms.

 

“That makes hardware acceleration via FPGA fabric just about the only viable option. And it makes SoC devices with embedded FPGA fabric - such as Xilinx Zynq and Altera SoC FPGAs - absolutely the solutions of choice. These devices bring the benefits of single-chip integration, ultra-low latency and high bandwidth between the conventional processors and the FPGA fabric, and low power consumption to the embedded vision space.”

 

Later on, Morris gets to the fly in the ointment:

 

“Oh, yeah. There’s still that ‘almost impossible to program’ issue.”

 

And then he gets to the solution:

 

reVISION, announced this week, is a stack - a set of tools, interfaces, and IP - designed to let embedded vision application developers start in their own familiar sandbox (OpenVX for vision acceleration and Caffe for machine learning), smoothly navigate down through algorithm development (OpenCV and NN frameworks such as AlexNet, GoogLeNet, SqueezeNet, SSD, and FCN), targeting Zynq devices without the need to bring in a team of FPGA experts. reVISION takes advantage of Xilinx’s previously-announced SDSoC stack to facilitate the algorithm development part. Xilinx claims enormous gains in productivity for embedded vision development - with customers predicting cuts of as much as 12 months from current schedules for new product and update development.

 

In many systems employing embedded vision, it’s not just the vision that counts. Increasingly, information from the vision system must be processed in concert with information from other types of sensors such as LiDAR, SONAR, RADAR, and others. FPGA-based SoCs are uniquely agile at handling this sensor fusion problem, with the flexibility to adapt to the particular configuration of sensor systems required by each application. This diversity in application requirements is a significant barrier for typical “cost optimization” strategies such as the creation of specialized ASIC and ASSP solutions.

 

The performance rewards for system developers who successfully harness the power of these devices are substantial. Xilinx is touting benchmarks showing their devices delivering an advantage of 6x images/sec/watt in machine learning inference with GoogLeNet @batch = 1, 42x frames/sec/watt in computer vision with OpenCV, and ⅕ the latency on real-time applications with GoogLeNet @batch = 1 versus “NVidia Tegra and typical SoCs.” These kinds of advantages in latency, performance, and particularly in energy-efficiency can easily be make-or-break for many embedded vision applications.

 

 

But don’t take my word for it; read Morris’ article yourself.

 

 

 

 

 

As part of today’s reVISION announcement of a new, comprehensive development stack for embedded-vision applications, Xilinx has produced a 3-minute video showing you just some of the things made possible by this announcement.

 

Here it is:

 

 

Adam Taylor’s MicroZed Chronicles, Part 177: Introducing the reVISION stack

by Xilinx Employee ‎03-13-2017 10:39 AM - edited ‎03-22-2017 07:19 AM (2,017 Views)

 

By Adam Taylor

 

Several times in this series, we have looked at image processing using the Avnet EVK and the ZedBoard. Along with the basics, we have examined object tracking using OpenCV running on the Zynq SoC’s or Zynq UltraScale+ MPSoC’s PS (processing system) and using HLS with its video library to generate image-processing algorithms for the Zynq SoC’s or Zynq UltraScale+ MPSoC’s PL (programmable logic, see blogs 140 to 148 here).

 

Xilinx’s reVISION is an embedded-vision development stack that provides support for a wide range of frameworks and libraries commonly used in embedded-vision applications. Most exciting, from my point of view, is that the stack includes acceleration-ready OpenCV functions.

 

Image1.jpg 

 

 

The stack itself is split into three layers. Once we select or define our platform, we will be mostly working at the application and algorithm layers. Let’s take a quick look at the layers of the stack:

 

  1. Platform layer: This is the lowest level of the stack and is the one on which the remaining stack layers are built. This layer includes platform definitions of the hardware and the software environment. Should we choose not to use a predefined platform, we can generate a custom platform using Vivado.

 

  2. Algorithm layer: Here we create our application using SDSoC and the platform definition for the target hardware. It is within this layer that we can use the acceleration-ready OpenCV functions along with predefined and optimized implementations for convolutional neural network (CNN) developments such as inference accelerators within the PL.

 

  3. Application development layer: The highest layer of the stack. This is where high-level frameworks such as Caffe and OpenVX are used to complete the application.

 

As I mentioned above, one of the most exciting aspects of the reVISION stack is the ability to accelerate a wide range of OpenCV functions using the Zynq SoC’s or Zynq UltraScale+ MPSoC’s PL. We can group the OpenCV functions that can be hardware-accelerated using the PL into four categories:

 

  1. Computation – Includes functions such as absolute difference between two frames, pixel-wise operations (addition, subtraction, and multiplication), gradient, and integral operations.
  2. Input Processing – Supports bit-depth conversions, channel operations, histogram equalization, remapping, and resizing.
  3. Filtering – Supports a wide range of filters including Sobel, Custom Convolution, and Gaussian filters.
  4. Other – Provides a wide range of functions including Canny edge detection, FAST and Harris corner detection, thresholding, SVM, HoG, LK optical flow, histogram computation, etc.

 

What is very interesting about these function calls is that we can optimize them for resource usage or performance within the PL. The main optimization method is specifying the number of pixels to be processed during each clock cycle. For most accelerated functions, we can choose to process either one or eight pixels per clock. Processing more pixels per clock cycle reduces latency but increases resource utilization, while processing one pixel per clock minimizes the resource requirements at the cost of increased latency. We control the number of pixels processed per clock via the function call.
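
To illustrate the trade-off, here is a generic Vivado HLS-style sketch of the principle. It is not the reVISION library’s actual API; the function name and the NPC template parameter are my own stand-ins. Packing NPC pixels into each stream word and unrolling the inner loop is what trades PL resources for latency.

#include <ap_int.h>
#include <hls_stream.h>

// Generic HLS-style illustration of the pixels-per-clock trade-off; not the
// reVISION library API. NPC pixels arrive packed into each stream word; the
// unrolled inner loop processes all of them in one clock cycle, so NPC = 8
// lowers latency roughly 8x versus NPC = 1 at the cost of more PL resources.
template <int NPC>
void threshold_npc(hls::stream<ap_uint<8 * NPC> >& src,
                   hls::stream<ap_uint<8 * NPC> >& dst,
                   int rows, int cols, ap_uint<8> thresh)
{
    const int words = rows * cols / NPC;  // assumes cols is a multiple of NPC
    for (int i = 0; i < words; ++i) {
#pragma HLS PIPELINE II=1
        ap_uint<8 * NPC> in = src.read();
        ap_uint<8 * NPC> out = 0;
        for (int p = 0; p < NPC; ++p) {
#pragma HLS UNROLL
            ap_uint<8> pix = in.range(8 * p + 7, 8 * p);
            out.range(8 * p + 7, 8 * p) = (pix > thresh) ? ap_uint<8>(255) : ap_uint<8>(0);
        }
        dst.write(out);
    }
}

In the accelerated reVISION functions the same choice of one or eight pixels per clock is made through the function call itself, so the application code stays the same while the synthesized hardware scales.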

 

Over the next few blogs, we will look more at the reVISION stack and how we can use it. However, in the best Blue Peter tradition, the image below shows the result of running a reVISION Harris corner-detection OpenCV function accelerated within the PL.

 

 

Image2.jpg

 

 

Accelerated Harris Corner Detection in the PL

 

 

 

 

Code is available on GitHub as always.

 

If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.

 

 

 

  • First Year E Book here
  • First Year Hardback here.

 

 

MicroZed Chronicles hardcopy.jpg 

 

 

 

  • Second Year E Book here
  • Second Year Hardback here

 

 

MicroZed Chronicles Second Year.jpg

 

Xilinx reVISION stack pushes machine learning for vision-guided applications all the way to the edge

by Xilinx Employee ‎03-13-2017 07:37 AM - edited ‎03-22-2017 07:19 AM (3,776 Views)

 

Image3.jpg

Today, Xilinx announced a comprehensive suite of industry-standard resources for developing advanced embedded-vision systems based on machine learning and machine inference. It’s called the reVISION stack and it allows design teams without deep hardware expertise to use a software-defined development flow to combine efficient machine-learning and computer-vision algorithms with Xilinx All Programmable devices to create highly responsive systems. (Details here.)

 

The Xilinx reVISION stack includes a broad range of development resources for platform, algorithm, and application development including support for the most popular neural networks: AlexNet, GoogLeNet, SqueezeNet, SSD, and FCN. Additionally, the stack provides library elements such as pre-defined and optimized implementations of CNN layers, which are required to build custom neural networks (DNNs and CNNs). The machine-learning elements are complemented by a broad set of acceleration-ready OpenCV functions for computer-vision processing.

 

For application-level development, Xilinx supports industry-standard frameworks including Caffe for machine learning and OpenVX for computer vision. The reVISION stack also includes development platforms from Xilinx and third parties, which support various sensor types.

 

The reVISION development flow starts with a familiar, Eclipse-based development environment; the C, C++, and/or OpenCL programming languages; and associated compilers all incorporated into the Xilinx SDSoC development environment. You can now target reVISION hardware platforms within the SDSoC environment, drawing from a pool of acceleration-ready, computer-vision libraries to quickly build your application. Soon, you’ll also be able to use the Khronos Group’s OpenVX framework as well.

 

For machine learning, you can use popular frameworks including Caffe to train neural networks. Within one Xilinx Zynq SoC or Zynq UltraScale+ MPSoC, you can use Caffe-generated .prototxt files to configure a software scheduler running on one of the device’s ARM processors to drive CNN inference accelerators—pre-optimized for and instantiated in programmable logic. For computer vision and other algorithms, you can profile your code, identify bottlenecks, and then designate specific functions that need to be hardware-accelerated. The Xilinx system-optimizing compiler then creates an accelerated implementation of your code, automatically including the required processor/accelerator interfaces (data movers) and software drivers.
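
As a rough illustration of what designating a function for hardware acceleration looks like at the source level, here is a hedged sketch: the function, the fixed 1080p frame size, and the pragma choices are mine, not taken from a reVISION example design. An SDSoC-accelerated function is ordinary C/C++ plus data-mover hints.

#include <stdint.h>

// Hypothetical per-pixel kernel selected for hardware acceleration in the
// SDSoC GUI (or via the sds++ command-line flow). The SDS pragmas tell the
// system-optimizing compiler how to generate the data movers between the
// ARM processors and the programmable-logic accelerator.
#pragma SDS data access_pattern(src:SEQUENTIAL, dst:SEQUENTIAL)
#pragma SDS data copy(src[0:1920*1080], dst[0:1920*1080])
void invert_frame(const uint8_t src[1920 * 1080], uint8_t dst[1920 * 1080])
{
    for (int i = 0; i < 1920 * 1080; ++i) {
#pragma HLS PIPELINE II=1
        dst[i] = 255 - src[i];  // stand-in for a real vision kernel
    }
}

The profiling and function-selection steps described above determine which functions get this treatment; everything else continues to run as software on the ARM cores.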

 

The Xilinx reVISION stack is the latest in an evolutionary line of development tools for creating embedded-vision systems. Xilinx All Programmable devices have long been used to develop such vision-based systems because these devices can interface to any image sensor and connect to any network—which Xilinx calls any-to-any connectivity—and they provide the large amounts of high-performance processing horsepower that vision systems require.

 

Initially, embedded-vision developers used the existing Xilinx Verilog and VHDL tools to develop these systems. Xilinx introduced the SDSoC development environment for HLL-based design two years ago and, since then, SDSoC has dramatically and successfully shortened development cycles for thousands of design teams. Xilinx’s new reVISION stack now enables an even broader set of software and systems engineers to develop intelligent, highly responsive embedded-vision systems faster and more easily using Xilinx All Programmable devices.

 

And what about the performance of the resulting embedded-vision systems? How do their performance metrics compare against systems based on embedded GPUs or the typical SoCs used in these applications? Xilinx-based systems significantly outperform the best of this group, which employ Nvidia devices. Benchmarks of the reVISION flow using Zynq SoC targets against the Nvidia Tegra X1 have shown as much as:

 

  • 6x better images/sec/watt in machine learning
  • 42x higher frames/sec/watt for computer-vision processing
  • 1/5th the latency, which is critical for real-time applications

 

Image1.jpg 

 

There is huge value in a very rapid and deterministic system response time and, for many systems, the faster response of a design that has been accelerated using programmable logic can mean the difference between success and catastrophic failure. For example, the figure below shows the difference in response time between a car’s vision-guided braking system created with the Xilinx reVISION stack running on a Zynq UltraScale+ MPSoC and a similar system based on an Nvidia Tegra device. At 65mph, the Xilinx embedded-vision system’s faster response time stops the vehicle 5 to 33 feet sooner, depending on how the Nvidia-based system is implemented. Five to 33 feet could easily mean the difference between a safe stop and a collision.

 

 

Image2.jpg 

 

(Note: This example appears in the new Xilinx reVISION backgrounder.)
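
As a quick sanity check on those distances (my arithmetic, not a figure from the backgrounder): at 65 mph a vehicle covers roughly 95 feet per second, so a 5-to-33-foot difference in stopping point corresponds to an end-to-end latency advantage on the order of 50 to 350 milliseconds:

$$
65\ \text{mph} = \frac{65 \times 5280}{3600}\ \tfrac{\text{ft}}{\text{s}} \approx 95.3\ \tfrac{\text{ft}}{\text{s}},
\qquad
\Delta t = \frac{\Delta d}{v} \approx \frac{5\ \text{ft}}{95.3\ \text{ft/s}} \approx 52\ \text{ms}
\quad\text{to}\quad
\frac{33\ \text{ft}}{95.3\ \text{ft/s}} \approx 346\ \text{ms}.
$$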

 

 

The last two years have generated more machine-learning technology than all of the advancements over the previous 45 years, and that pace isn't slowing down. Many new types of neural networks for vision-guided systems have emerged along with new techniques that make deployment of these neural networks much more efficient. No matter what you develop today or implement tomorrow, the hardware and I/O reconfigurability and software programmability of Xilinx All Programmable devices can “future-proof” your designs, whether that means implementing new algorithms in existing hardware, interfacing to new and improved sensing technology, or adding an all-new sensor type (such as LIDAR or time-of-flight sensors) to improve a vision-based system’s safety and reliability through advanced sensor fusion.

 

Xilinx is pushing even further into vision-guided, machine-learning applications with the new Xilinx reVISION Stack and this announcement complements the recently announced Reconfigurable Acceleration Stack for cloud-based systems. (See “Xilinx Reconfigurable Acceleration Stack speeds programming of machine learning, data analytics, video-streaming apps.”) Together, these new development resources significantly broaden your ability to deploy machine-learning applications using Xilinx technology—from inside the cloud to the very edge.

 

 

You might also want to read “Xilinx AI Engine Steers New Course” by Junko Yoshida on the EETimes.com site.

 

 
