Last week at the NIPS 2017 conference in Long Beach, California, a Xilinx team demonstrated a live object-detection implementation of a YOLO—“you only look once”—network called Tincy YOLO (pronounced “teensy YOLO”) running on a Xilinx Zynq UltraScale+ MPSoC. Tincy YOLO combines reduced precision, pruning, and FPGA-based hardware acceleration to speed network performance by 160x, resulting in a YOLO network capable of operating on video frames at 16fps while dissipating a mere 6W.
Live demo of Tincy YOLO at NIPS 2017. Photo credit: Dan Isaacs
Here’s a description of that demo:
Tincy YOLO: a real-time, low-latency, low-power object detection system running on a Zynq UltraScale+ MPSoC
By Michaela Blott, Principal Engineer, Xilinx
The Tincy YOLO demonstration shows real-time, low-latency, low-power object detection running on a Zynq UltraScale+ MPSoC device. In object detection, the challenge is to identify objects of interest within a scene and to draw bounding boxes around them, as shown in Figure 1. Object detection is useful in many areas, particularly in advanced driver assistance systems (ADAS) and autonomous vehicles where systems need to automatically detect hazards and to take the right course of action. Tincy YOLO leverages the “you only look once” (YOLO) algorithm, which delivers state-of-the-art object detection. Tincy YOLO is based on the Tiny YOLO convolutional network, which is based on the Darknet reference network. Tincy YOLO has been optimized through heavy quantization and modification to fit into the Zynq UltraScale+ MPSoC’s PL (programmable logic) and Arm Cortex-A53 processor cores to produce the final, real-time demo.
Figure 1: YOLO-recognized people with bounding boxes
To appreciate the computational challenge posed by Tiny YOLO, note that it takes 7 billion floating-point operations to process a single frame. Before you can conquer this computational challenge on an embedded platform, you need to pull many levers. Luckily, the all-programmable Zynq UltraScale+ MPSoC platform provides many levers to pull. Figure 2 summarizes the versatile and heterogeneous architectural options of the Zynq platform.
Figure 2: Tincy YOLO Platform Overview
The vanilla Darknet open-source neural network framework is optimized for CUDA acceleration but its generic, single-threaded processing option can target any C-programmable CPU. Compiling Darknet for the embedded Arm processors in the Zynq UltraScale+ MPSoC left us with a sobering performance of one recognized frame every 10 seconds. That’s about two orders of magnitude of performance away from a useful ADAS implementation. It also produces a very limited live-video experience.
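To put the baseline in numbers, the short Python sketch below combines the two figures quoted so far: 7 billion floating-point operations per frame and one frame every 10 seconds. The variable names are illustrative, and the calculation assumes the operation count per frame stays fixed, which is exactly the assumption that quantization and pruning later remove.

```python
# Figures quoted in the article.
FLOPS_PER_FRAME = 7e9               # Tiny YOLO floating-point operations per frame
BASELINE_SECONDS_PER_FRAME = 10.0   # single-threaded Darknet on the Arm cores
TARGET_FPS = 16.0                   # frame rate of the final demo

baseline_fps = 1.0 / BASELINE_SECONDS_PER_FRAME

# Sustained throughput the CPU-only build actually achieves.
baseline_gflops = FLOPS_PER_FRAME * baseline_fps / 1e9

# Throughput needed to reach the target if the work per frame were unchanged.
required_gflops = FLOPS_PER_FRAME * TARGET_FPS / 1e9

print(f"baseline: {baseline_fps:.1f} fps, about {baseline_gflops:.1f} GFLOP/s")
print(f"16 fps on the unmodified network: about {required_gflops:.0f} GFLOP/s")
```

The gap between those two throughput figures is the 160x that the rest of the work has to close, either by computing faster or by computing less.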
To create Tincy YOLO, we leveraged several of the Zynq UltraScale+ MPSoC’s architectural features in steps, as shown in Figure 3. Our first major move was to quantize the computation of the network’s twelve inner (a.k.a. hidden) layers by giving them binary weights and 3-bit activations. We then pruned this network to reduce the total work to 4.5 billion operations (4.5 GOPs) per frame.
Figure 3: Steps used to achieve a 160x speedup of the Tiny YOLO network
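As a rough illustration of what binary weights and 3-bit activations mean for the arithmetic, here is a minimal Python sketch. It is not the Tincy YOLO implementation: the sign-based binarization, the uniform activation quantizer, the clipping range and all function names are assumptions chosen for clarity, and a real network would be retrained with quantization in the loop rather than quantized after the fact.

```python
def binarize_weights(weights):
    """Map each weight to +1 or -1 by sign (zero maps to +1). Assumed scheme."""
    return [1 if w >= 0 else -1 for w in weights]


def quantize_activation(x, bits=3, max_value=1.0):
    """Clip x to [0, max_value] and round it to an integer level in 0..2**bits - 1."""
    levels = (1 << bits) - 1                 # 7 steps for 3 bits
    clipped = min(max(x, 0.0), max_value)
    return int(clipped / max_value * levels + 0.5)   # round to nearest level


def quantized_dot(weights, activations):
    """Dot product with binary weights: only additions and subtractions remain."""
    w_bin = binarize_weights(weights)
    a_q = [quantize_activation(a) for a in activations]
    return sum(a if w > 0 else -a for w, a in zip(w_bin, a_q))


# Hypothetical example values, not taken from the network.
weights = [0.42, -0.17, 0.03, -0.88]
activations = [0.9, 0.1, 0.5, 0.3]

print(binarize_weights(weights))                       # [1, -1, 1, -1]
print([quantize_activation(a) for a in activations])   # [6, 1, 4, 2]
print(quantized_dot(weights, activations))             # 7
```

With weights restricted to +1 and -1, every multiply in the dot product collapses to an add or a subtract on small integers, which is the kind of arithmetic that maps compactly onto programmable logic.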
We created a reduced-precision accelerator using a variant of the FINN BNN library (https://github.com/Xilinx/BNN-PYNQ) to offload the quantized layers into the Zynq UltraScale+ MPSoC’s PL. These layers account for more than 97% of all the computation within the network. Moving the computations for these layers into hardware bought us a 30x speedup of their specific execution, which translated into an 11x speedup within the overall application context, bringing the network’s performance up to 1.1fps.
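The step from a 30x layer speedup to an 11x application speedup is the familiar Amdahl's-law effect: the part left on the CPU limits the total. The Python sketch below inverts Amdahl's law to estimate what share of the baseline runtime the offloaded layers would have to occupy for the two quoted figures to agree. It is a simplified model with assumed function names. It ignores data movement between the processors and the PL, and a share of runtime is not the same quantity as the 97% share of operations quoted above.

```python
def overall_speedup(accel_fraction, accel_speedup):
    """Amdahl's law: accel_fraction of the runtime is sped up by accel_speedup."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_speedup)


def implied_fraction(overall, accel_speedup):
    """Invert Amdahl's law: runtime share that must be accelerated to reach `overall`."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / accel_speedup)


LAYER_SPEEDUP = 30.0   # quoted speedup of the offloaded layers
APP_SPEEDUP = 11.0     # quoted speedup of the whole application

fraction = implied_fraction(APP_SPEEDUP, LAYER_SPEEDUP)
ceiling = 1.0 / (1.0 - fraction)   # limit if the offloaded part took zero time

print(f"implied runtime share of the offloaded layers: {fraction:.1%}")
print(f"check: {overall_speedup(fraction, LAYER_SPEEDUP):.1f}x overall")
print(f"ceiling from hardware offload alone: {ceiling:.1f}x")
```

Under this model the offloaded layers account for roughly 94% of the baseline runtime, and no amount of further hardware acceleration of those layers alone could exceed about 17x overall. That is why the remaining steps target the layers still running on the Arm cores.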
We tackled the remaining outer layers by exploiting the NEON SIMD vector capabilities built into the Zynq UltraScale+ MPSoC’s Arm Cortex-A53 processor cores, which gained another 2.2x speedup. Then we cracked down on the complexity of the initial convolution using maxpool elimination for another 2.2x speedup. This work raised the frame rate to 5.5fps. A final rewrite of the network inference to parallelize the CPU computations across all four of the Zynq UltraScale+ MPSoC’s Arm Cortex-A53 processor cores delivered video performance at 16fps.
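The individual gains multiply, which the following Python sketch makes explicit using only the figures quoted in the text. The step labels are paraphrases, and the multicore factor is derived here from the reported 5.5fps and 16fps rather than stated in the article.

```python
BASELINE_FPS = 0.1   # one frame every 10 seconds

# (step, multiplicative speedup quoted in the article)
steps = [
    ("offload quantized layers to the PL", 11.0),
    ("NEON SIMD for the outer layers", 2.2),
    ("maxpool elimination in the first convolution", 2.2),
]

fps = BASELINE_FPS
for name, factor in steps:
    fps *= factor
    print(f"{name}: {factor}x -> {fps:.2f} fps")

REPORTED_FPS_BEFORE_MULTICORE = 5.5
FINAL_FPS = 16.0

# Derived values, not quoted in the article.
multicore_factor = FINAL_FPS / REPORTED_FPS_BEFORE_MULTICORE
total_speedup = FINAL_FPS / BASELINE_FPS

print(f"four-core parallelization: about {multicore_factor:.1f}x -> {FINAL_FPS:.0f} fps")
print(f"total: about {total_speedup:.0f}x over the baseline")
```

Multiplying the rounded factors gives about 5.3fps before the multicore step, slightly below the reported 5.5fps, which is consistent with the quoted factors being rounded. The four-core rewrite then contributes roughly 2.9x, short of an ideal 4x.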
The result of these changes appears in Figure 4, which demonstrates better recognition accuracy than Tiny YOLO.