“Vision processing is so much fun,” said VanGogh Imaging’s CEO Ken Lee, speaking at last month’s Embedded Vision Summit in Boston. Lee continued: “It’s at the intersection of all the disciplines: hardware, software, algorithm, and math developments—to do all these amazing things.” Lee has extensive experience in 3D imaging, and his company, VanGogh Imaging, is working on bringing several 3D-imaging products to the mass market. The medical industry shifted from 2D to 3D imaging over a period of three years, said Lee, but 3D imaging has been harder to get into high-volume applications because of the cost of 3D sensors, which had been $2000 to $3000. Low-cost 3D sensors have arrived, he continued, and “prices are coming down dramatically.” As a result, VanGogh Imaging is developing a 3D-vision plugin for the Unity mobile gaming engine on Android and an associated Xilinx-specific 3D-vision library. It’s not surprising that VanGogh Imaging would want to enter the high-volume mass market with its 3D expertise, but the Xilinx vision library takes a bit of explanation.
VanGogh initially developed its in-house 3D-vision function library in C and C++, with some additional math functions developed in MATLAB. Here’s an example of one such function: the Iterative Closest Point (ICP) algorithm, which matches point clouds to register two 3D images. VanGogh’s designs generally require 50 iterations of this function, which is very processor-intensive because there are so many points to match.
The bandwidth-eating XYZ distance formula used in the ICP algorithm is:
D = (x₁ − x₂)² + (y₁ − y₂)² + (z₁ − z₂)²
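To make the computational load concrete, here is a minimal sketch (not VanGogh’s actual code) of the brute-force nearest-neighbor step inside an ICP iteration. The point struct and function names are illustrative assumptions; the kernel shows why N points per cloud cost N × N evaluations of the distance formula above per iteration:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

struct Point { float x, y, z; };  // one XYZ point in a cloud

// Squared XYZ distance from the formula above. The square root is
// unnecessary for nearest-neighbor comparison, so it is omitted.
static inline float dist2(const Point &a, const Point &b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Brute-force nearest-neighbor search: for each source point, find the
// index of the closest target point. With N points per cloud, this is
// N x N distance evaluations per ICP iteration.
std::vector<std::size_t> nearestNeighbors(const std::vector<Point> &src,
                                          const std::vector<Point> &tgt) {
    std::vector<std::size_t> match(src.size(), 0);
    for (std::size_t i = 0; i < src.size(); ++i) {
        float best = std::numeric_limits<float>::max();
        for (std::size_t j = 0; j < tgt.size(); ++j) {
            const float d = dist2(src[i], tgt[j]);
            if (d < best) { best = d; match[i] = j; }
        }
    }
    return match;
}
```

Each nested loop pass is independent of the others, which is exactly what makes this function a good candidate for hardware offload later in the story.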
Initially, VanGogh prototyped its 3D-vision application on a PC, restricting it to one processor core of an Intel i7 because VanGogh knew it would be porting the application to ARM and Android. After the application was up and running on the PC, VanGogh had to rewrite most of its 3D libraries for the ARM processor because they were “too heavy,” said Lee. VanGogh had to clean up the original 3D algorithms and change data structures to suit the more limited embedded environment. Still, the retooled application did not quite run in real time. (30 frames/sec was the target rate.)
Over the past year, said Lee, VanGogh Imaging discovered the Xilinx Zynq All Programmable SoC, which combines two 1GHz ARM Cortex-A9 MPCore processors with a closely coupled block of programmable logic. Lee admitted he hadn’t used FPGAs in 15 years, and his recollection was that they were a “real pain” to use back then. Looking at the Zynq SoC architecture, however, he saw a really good fit for this project: the programmable logic could potentially offload enough of the computational load to make VanGogh’s 3D-vision application run in real time at the desired frame rate.
VanGogh took the ICP nearest-neighbor function, which represents 80% of the computational load, and moved just that one function into the Zynq SoC’s on-chip programmable logic. Everything else stayed on the ARM processor. Performance improved immediately.
VanGogh ran a bench test on 1000 points, which requires 1 million distance calculations (1000 × 1000) per iteration; with 50 iterations, that’s 50 million calculations. The algorithm ran at 250 msec per 50 iterations (4 frames/sec) on the PC using one Intel i7 processor core, an order of magnitude short of real time. Running on the ARM processor alone, performance dropped to 1 frame/sec, even further from the target.
The software port was done with the Xilinx Vivado HLS tool, which provides an automated way to go from C to programmable logic. VanGogh Imaging doesn’t have FPGA engineers, but its software engineers were able to use Vivado HLS to achieve their performance goals. Using the ARM processor and the programmable logic of the Xilinx Zynq 7020 All Programmable SoC on the Avnet ZedBoard (used for prototyping), the ICP 3D algorithm ran in 25 msec, which is 40 frames/sec: target performance achieved, and according to Lee the code is not yet fully optimized.
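For readers who, like Lee, haven’t touched FPGAs in years: Vivado HLS consumes ordinary C/C++ written in a restricted style (fixed-size arrays, no dynamic allocation) plus pragma directives that guide the hardware it generates. The sketch below is an assumption about what such a kernel might look like, not VanGogh’s actual source; `HLS PIPELINE` is a real Vivado HLS directive, and plain software compilers simply ignore unrecognized pragmas, so the same code still runs on the CPU for verification:

```cpp
#include <cfloat>

#define N 1000  // points per cloud, matching the bench test above

// Nearest-neighbor kernel in an HLS-friendly C style: the point clouds
// arrive as separate fixed-size x/y/z arrays, and the result is the
// index of the closest target point for each source point. The pragma
// asks the HLS compiler to pipeline the inner loop so that one distance
// comparison completes per clock cycle; a software compiler ignores it.
void nn_kernel(const float sx[N], const float sy[N], const float sz[N],
               const float tx[N], const float ty[N], const float tz[N],
               int match[N]) {
    for (int i = 0; i < N; ++i) {
        float best = FLT_MAX;
        int bestIdx = 0;
        for (int j = 0; j < N; ++j) {
#pragma HLS PIPELINE II=1
            const float dx = sx[i] - tx[j];
            const float dy = sy[i] - ty[j];
            const float dz = sz[i] - tz[j];
            const float d = dx * dx + dy * dy + dz * dz;
            if (d < best) { best = d; bestIdx = j; }
        }
        match[i] = bestIdx;
    }
}
```

Because every source point is matched independently, further directives could replicate this pipeline into multiple parallel instances, which is the headroom Lee describes next.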
The nice thing, said Lee, is that there’s more parallelization possible using the programmable logic in the Zynq SoC. “You can have 20 or 30 cores running in parallel,” he explained. The key is balancing resources with speed. First, just get going. Then evolve to the production-ready design. There’s no need to take everything into the FPGA. Be selective. Take the most time-consuming functions and work your way down until you achieve the target performance.
Here’s a 4-minute video excerpt of Lee’s presentation from the Embedded Vision Summit, courtesy of the Embedded Vision Alliance:
And here’s a link to a video of the full presentation on the Embedded Vision Alliance site (free registration required, but it’s a great idea anyway if you’re involved in vision applications):