This is a guest post from Quenton Hall, AI System Architect for Industrial, Vision, Healthcare and Sciences Markets.
In our previous post, we briefly outlined the higher-level problems that have created the need for optimized accelerators. As a poignant reminder of the problem, let’s now consider the computational cost and power consumption associated with a very simple image classification algorithm.
Leveraging the data points provided by Mark Horowitz, we can consider the relative power consumption of our image classifier at differing numerical precisions. While Mark’s energy estimates were made for the 45nm node, industry experts have suggested that these data points continue to scale to current semiconductor process geometries. That is to say, the energy cost of an INT8 operation remains an order of magnitude lower than that of an FP32 operation, regardless of whether the process geometry is 45nm or 16nm.
Source: Bill Dally (Stanford), Cadence Embedded Neural Network Summit, February 1, 2017
Power consumption may be computed as follows:
Power (W) = Energy (J) / Operation × Operations / s
From this equation, we can see that there are only two levers for reducing power consumption: decrease the energy required to perform each operation, decrease the number of operations performed, or pull both levers at once.
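As a minimal sketch, this relationship can be expressed in a few lines of Python (the function name and pJ-based interface are a convenience for this post, not part of any toolchain):

```python
def power_watts(energy_per_op_pj: float, ops_per_inference: float,
                inferences_per_s: float) -> float:
    """Power (W) = energy per operation (J) * operations per second."""
    return energy_per_op_pj * 1e-12 * ops_per_inference * inferences_per_s

# Example: 4.4 pJ/op, 7.7e9 ops/inference, 1000 inferences/s
print(power_watts(4.4, 7.7e9, 1000))  # ≈ 33.88 W
```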
For our image classifier, we will choose ResNet50 as a target. ResNet offers near state-of-the-art image classification performance while simultaneously offering the advantage that it requires fewer parameters (weights) than many comparable networks with similar performance.
To deploy ResNet50 we must compute ~7.7 billion operations per inference. This means that for every image we would like to classify, we incur a “computational cost” of 7.7 × 10^9 operations.
Now, let’s consider a relatively high-volume inference application in which we might wish to classify 1000 images per second. Sticking with Mark’s 45nm energy estimates, we arrive at the following:
Power = (4 pJ + 0.4 pJ) / Op × 7.7 × 10^9 Ops / Image × 1000 Images / s ≈ 33.88 W
As the first dimension for innovation, we can quantize the network from FP32 to 8-bit integer operations. This reduces power consumption by more than an order of magnitude. While FP32 precision is desirable during training to facilitate backpropagation, it adds little value at inference time for pixel data. Numerous studies and papers have shown that in many applications, it is possible to analyze the distribution of the weights at each layer and quantize across that distribution while maintaining the pre-quantized prediction accuracy within very reasonable margins.
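A minimal sketch of the idea, assuming simple symmetric per-tensor quantization (production tools such as DNNDK use more sophisticated, calibration-driven schemes):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: map the weight range onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Hypothetical layer weights drawn from a normal distribution
rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.abs(w_hat - w).max() <= scale / 2 + 1e-9)  # True: error bounded by half a step
```

Each FP32 weight is now stored and multiplied as a single signed byte, at the cost of a bounded rounding error per value.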
Quantization research has also shown that 8-bit integer values are a good “general purpose” solution for pixel data, and that for many inner layers of a typical network it is possible to quantize down to 3-4 bits with only minimal loss in prediction accuracy. The Xilinx Research Labs team, led by Michaela Blott, has focused on Binary Neural Network (BNN) research and deployment for several years, with some incredible results (see FINN and PYNQ for more details).
Today, our focus with DNNDK (soon, Vitis AI) is on quantizing network inference to INT8. It is no coincidence that a single DSP slice in a modern Xilinx FPGA can compute two 8-bit multiply operations in a single clock cycle. The 16nm UltraScale+ MPSoC device family includes more than 15 device variants, scaling from hundreds of DSP slices to thousands, while maintaining application and OS compatibility. The maximum fCLK of the 16nm DSP slice tops out at 891MHz. A mid-sized MPSoC device is thus a very capable computational accelerator.
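As a back-of-the-envelope illustration of what that DSP fabric offers, peak INT8 multiply throughput scales with slice count and clock (the slice count below is a hypothetical mid-sized device, not a specific part number):

```python
# Each DSP slice computes two 8-bit multiplies per clock cycle (per the text).
dsp_slices = 2000        # hypothetical mid-sized MPSoC; actual counts vary by device
fclk_hz = 891e6          # maximum DSP fCLK quoted for the 16nm slice
peak_ops_per_s = 2 * dsp_slices * fclk_hz
print(peak_ops_per_s / 1e12)  # ≈ 3.56 tera-multiplies/s, before any real-world derating
```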
Now, let’s consider the implications of migrating to INT8 math from FP32:
Power = (0.2 pJ + 0.03 pJ) / Op × 7.7 × 10^9 Ops / Image × 1000 Images / s ≈ 1.77 W
In his talk, Mark proposed that a solution for the computational efficiency problem is to use dedicated, purpose-built accelerators. His vision holds for ML inference.
What the above analysis does not consider is that we would also see at least a four-fold decrease in external DDR traffic relative to FP32. As you might expect, the power cost associated with external memory access is considerably higher than it is for internal memory. If we simply leverage Mark’s data points, the energy cost of a DRAM access is around 1.3-2.6nJ, while the energy cost of an L1 memory access might be 10-100pJ. The energy cost of external DRAM access is thus at least an order of magnitude higher than that of accessing internal memory (such as the BlockRAM and UltraRAM found in Xilinx SoCs).
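To put rough numbers on that, here is a hedged back-of-envelope comparison; the per-access energies are picked from the ranges above, and the one-access-per-weight assumption ignores caching and reuse:

```python
weights = 25.5e6             # approximate ResNet50 parameter count
dram_nj_per_access = 1.3     # low end of the DRAM range above
onchip_pj_per_access = 50    # assumed mid-range on-chip access energy
e_dram = weights * dram_nj_per_access * 1e-9      # joules to fetch all weights from DRAM
e_onchip = weights * onchip_pj_per_access * 1e-12  # joules for the same fetches on-chip
print(e_dram / e_onchip)  # DRAM costs ~26x more energy under these assumptions
```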
In addition to the benefits afforded by quantization, we can use network pruning techniques to reduce the computational workload required for inference. Using the Xilinx DNNDK AI Optimizer tool, it is possible to reduce the computational workload for an image classification model trained on ILSVRC2012 (ImageNet, 1000 classes) by 30-40%, with less than 1% loss of accuracy. Furthermore, if we reduce the number of predicted classes, we can increase these gains even more. The reality is that most real-world image classification networks are trained on a limited number of classes, making pruning beyond this watermark possible. For reference, one of our pruned VGG-SSD implementations, trained on four classes, requires 17 GOPs versus 117 GOPs for the original network, with no loss in accuracy! Who said that VGG wasn’t memory efficient?
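The AI Optimizer performs coarse-grained (channel-level) pruning; as a simplified illustration of the general principle, here is unstructured magnitude pruning, which simply zeros the smallest-magnitude weights:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the smallest-magnitude fraction of weights (unstructured pruning)."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

# Hypothetical layer weights; in practice the network is fine-tuned after pruning
rng = np.random.default_rng(1)
w = rng.normal(0, 0.05, size=(256, 256)).astype(np.float32)
pruned = magnitude_prune(w, 0.3)
print(1 - np.count_nonzero(pruned) / pruned.size)  # ≈ 0.3 sparsity
```

Channel pruning goes a step further by removing entire filters, so the saved operations translate directly into fewer MACs on real hardware rather than scattered zeros.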
However, if we simply assume that we are training our classifier on ILSVRC2012, we find that we can typically reduce the compute workload by ~30% with pruning. Taking this into account, we arrive at the following:
Power = (0.2 pJ + 0.03 pJ) / Op × (0.7 × 7.7 × 10^9) Ops / Image × 1000 Images / s ≈ 1.24 W
Compare this to the original 33.88W estimate for FP32 inference.
While this analysis fails to consider many variables (confounders!?!), it seems obvious that there is a significant opportunity for optimization. And so, while we continue to search for the elusive “Panacea of Compute Saturation”, consider the context of Andrew Ng’s assertion that “AI is The New Electricity”. I don’t think he was trying to suggest that AI should require more electricity, only that AI is of extremely high value and tremendous impact. So, let’s keep a cool head about ML inference. There is simply no need to get hot under the collar, nor to provide liquid cooling for high-performance inference designs.
In Part 3 of this post, we will discuss the use of purpose-built “efficient” neural network models, and how they can be leveraged in Xilinx applications to afford even bigger efficiency gains. Until then, check out Chapter 7 in the DNNDK SDK User’s Guide so that you might better understand the level of inference performance that is possible in adaptable hardware, situated at The Edge and beyond.