This is a guest post from Quenton Hall, AI System Architect for Industrial, Scientific and Medical applications.
In 2014, Stanford Professor Mark Horowitz published a paper entitled “Computing’s Energy Problem (and what we can do about it)”. This seminal paper discussed the challenges that the semiconductor industry faces related to the breakdown of Dennard Scaling and Moore’s Law.
If I can be so bold, I would like to borrow and adapt the title of Mark’s paper so that I might provide some perspectives as to why you should consider specialized hardware for Machine Learning inference applications.
First, let’s consider the problem. In approximately 2005, processor core clock frequencies stopped scaling. Shrinking process geometry and decreasing core voltages no longer offer the same advantages that they once did. The fundamental problem is that computing has hit the power density (W/mm²) wall.
If we put more cores on the same die, we can increase the number of ops within the same power budget, provided we also reduce the clock frequency somewhat to account for the energy used by the additional cores. It is not by coincidence that AMD and Intel released their first dual-core processors in 2005-2006. However, as we continue to try to increase the number of cores, we must consider the energy per op and the silicon area per op. Moreover, we also need to ensure that we can efficiently parallelize our algorithm by N, where N is the number of cores. A universal solution that achieves this for all algorithms, a “Panacea of Compute Saturation”, remains elusive, and today the problem is best addressed through the application of adaptable hardware.
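To make the "parallelize by N" point concrete, here is a minimal sketch using Amdahl's law, which is not mentioned in Mark's paper summary above but is the standard way to reason about this limit. The function name and the parallel fractions chosen are purely illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Amdahl's law: speedup over a single core when only a fraction
    of the work can be spread across n_cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


# Illustrative (assumed) parallel fractions, not measurements of any real workload.
for p in (0.50, 0.90, 0.99):
    row = ", ".join(f"N={n}: {amdahl_speedup(p, n):.2f}x" for n in (2, 8, 64))
    print(f"parallel fraction {p:.2f} -> {row}")
```

Even a small serial portion caps the benefit of adding cores, which is why simply increasing N does not saturate the compute for every algorithm.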
It turns out that whether your processor design is implemented using a multi-core CPU, GPU or SoC, the overall breakdown in power consumption at a processor level will be roughly the same. If we were to guesstimate a breakdown as follows, we might not be that far off:
Cores = 30%
Internal memory (L1, L2, L3) = 30%
External memory (DDR) = 40%
What we fail to consider in the above analysis is that there exists an additional plane of optimization: implementing specialized hardware accelerators. Specialized hardware can be optimized to execute a specific function at a very high level of efficiency. Such hardware is typically designed to reduce external memory accesses, reducing both latency and power consumption. Specialized hardware can be optimized such that the data motion portion of a given algorithm will use localized memory (BlockRAM, UltraRAM) for the storage of intermediate results.
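As a rough back-of-the-envelope sketch, we can take the guesstimated 30/30/40 breakdown from earlier and ask what happens to total power if an accelerator removes some fraction of the external memory traffic. The reduction figure used below is a hypothetical input, not a measured result, and the model assumes the components are independent (in practice, moving data into local memory would shift some cost to the internal memory share).

```python
# Guesstimated shares from the breakdown above (fractions of total power).
BREAKDOWN = {"cores": 0.30, "internal_memory": 0.30, "external_memory": 0.40}


def relative_power(breakdown: dict, reductions: dict) -> float:
    """Total power relative to the baseline (1.0).

    `reductions` maps a component name to the fraction of that component's
    power that is removed (0.0 = unchanged, 1.0 = eliminated)."""
    return sum(
        share * (1.0 - reductions.get(name, 0.0))
        for name, share in breakdown.items()
    )


# Hypothetical scenario: half of the external memory power is eliminated.
scenario = {"external_memory": 0.5}
print(f"relative power: {relative_power(BREAKDOWN, scenario):.2f}")
```

Because external memory is the largest single share in this guesstimate, it is also where a given percentage reduction has the largest effect on the total.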
Designing an efficient accelerator is a multi-dimensional design problem:
How do we implement hardware that is optimized to process our specific algorithm?
Mark expressed this most effectively as having to move the algorithm from the “space of all algorithms” to a “restricted space”.
How do we keep the accelerator fed with data to ensure that our compute accelerator is saturated on every clock cycle?
How do we minimize communication overhead?
How can we optimize the dynamic range of the operators we are processing?
How do we minimize the use of external, or even local memory?
How do we eliminate instruction processing pipeline overhead?
How do we schedule operations to ensure data reuse, thereby minimizing memory traffic and maximizing the number of ops between memory accesses?
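The last question, scheduling for data reuse, can be illustrated with a simplified traffic model for an N x N matrix multiply. This is a sketch under stated assumptions: the function name, the square-tile scheme, and the idea that one tile each of A, B and C fits in local memory are all illustrative, and it does not describe how any particular accelerator schedules its work.

```python
def matmul_traffic(n: int, tile: int) -> dict:
    """Estimate ops and external-memory words moved for C = A x B (n x n),
    assuming tile x tile blocks of A, B and C are held in local memory."""
    if n % tile != 0:
        raise ValueError("this sketch assumes tile divides n evenly")
    ops = 2 * n**3                      # one multiply and one add per inner step
    tiles_per_dim = n // tile
    # Every output tile streams tiles_per_dim tile pairs (one of A, one of B).
    reads = tiles_per_dim**2 * tiles_per_dim * 2 * tile**2
    writes = n**2                       # each element of C is written once
    words = reads + writes
    return {"ops": ops, "words": words, "ops_per_word": ops / words}


# tile=1 models no reuse: both operands are fetched for every multiply-add.
for t in (1, 4, 16):
    r = matmul_traffic(64, t)
    print(f"tile={t:2d}: words moved={r['words']:7d}, ops/word={r['ops_per_word']:.2f}")
```

In this model the read traffic falls in proportion to the tile size, so the number of ops between memory accesses grows with the amount of local storage that the schedule can exploit.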
In Part 2 of this post, we will discuss and evaluate how Xilinx’s adaptable hardware and DNNDK address these challenges, specifically as they relate to machine learning inference. Until next time, I would suggest that you review Mark’s excellent talk on this subject, and then ponder how you might use adaptable hardware to your strategic advantage in your next design.