Power has become the #1 design consideration in most battery-operated and wall-powered systems, and Xilinx decided to meet these power challenges head-on with its 20nm UltraScale devices. Here are 19 ways that Xilinx 20nm UltraScale FPGAs can cut power consumption in your next system design:
1. Process technology: TSMC manufactures Xilinx 20nm UltraScale devices using its 20SoC process, which employs TSMC’s second-generation gate-last HKMG (high-K metal-gate) and third-generation SiGe (silicon-germanium) strain technology to deliver improved performance at lower power. TSMC's 20SoC process technology can deliver 30% better speed at 1.9X the device density compared to the company’s 28nm process technology.
2. Voltage scaling: TSMC’s 20SoC process has both a high-performance mode (Vcc = 0.95V) and a low-power mode (Vcc = 0.9V). The 20SoC high-performance mode offers better performance than TSMC’s 28HP and 28HPL processes with lower static power. The low-power mode offers 65% lower static power than TSMC’s 28HP process. In addition, the Vcc headroom in devices manufactured with TSMC’s 20SoC process allows Xilinx to select portions of the power-distribution curve that perform well even at a reduced Vcc of 0.9V. Running at 0.9V reduces dynamic power by approximately 10%.
20nm UltraScale Performance versus Power: A significant advantage
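The roughly 10% dynamic-power reduction quoted in item 2 is consistent with the usual first-order CMOS model, in which dynamic power scales as C·V²·f. The following is only a back-of-the-envelope check under the assumption that switched capacitance and clock frequency are the same in both modes; the function name is illustrative and does not come from any Xilinx tool.

```python
def dynamic_power_ratio(v_new: float, v_ref: float) -> float:
    """Ratio of dynamic power at v_new to that at v_ref.

    Assumes P = C * V^2 * f with C and f unchanged, which is a
    first-order model and not a datasheet calculation.
    """
    return (v_new / v_ref) ** 2


# Low-power mode (0.90V) relative to high-performance mode (0.95V).
ratio = dynamic_power_ratio(0.90, 0.95)
reduction_pct = (1.0 - ratio) * 100.0
print(f"Dynamic power at 0.90V is {ratio:.3f}x the 0.95V value "
      f"({reduction_pct:.1f}% lower)")
```

Under these assumptions the quadratic voltage term alone accounts for a reduction of about 10.2%, in line with the approximate figure above.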
3. Selected lowest-power devices: Xilinx 20nm UltraScale FPGAs that can run at either 0.95V or 0.9V are designated -1L, based on their speed grade at 0.95V. The performance of -1L UltraScale devices is identical to that of a -1 speed grade at 0.95V and similar to that of a -1 device when run at 0.9V, but the -1L designation signifies that the device’s static-power consumption is exceptionally low. At 0.9V, the lower Vcc alone offers a static power reduction of approximately 30%. Xilinx screens -1L devices for tighter speed and leakage specifications than other UltraScale FPGAs. In other words, only the lowest leakage and highest performance UltraScale parts become -1L devices.
4. Managing process variation in 3D ICs: Larger 20nm UltraScale FPGAs are actually 3D ICs that employ Xilinx's second-generation stacked silicon interconnect (SSI) technology, which interconnects multiple FPGA die in one package. Xilinx actively controls the static power leakage of the entire 3D IC assembly by combining higher-leakage and lower-leakage die (all within spec) in each package. The result is a much lower maximum process leakage specification for the overall packaged device when compared to a single monolithic die with the same programmable-logic density.
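To see why mixing higher-leakage and lower-leakage die lowers the worst-case package leakage, consider a toy model. The sketch below assumes two-die packages, made-up leakage values, and a simple "pair highest with lowest" rule; none of these come from Xilinx, whose actual die-selection method is not described here.

```python
from typing import List, Tuple


def pair_dies(leakages: List[float]) -> List[Tuple[float, float]]:
    """Pair the lowest-leakage die with the highest, the second lowest
    with the second highest, and so on. Assumes an even number of dies."""
    ordered = sorted(leakages)
    half = len(ordered) // 2
    return [(ordered[i], ordered[-1 - i]) for i in range(half)]


# Hypothetical per-die static leakage in watts, all within an assumed spec.
die_leakage_w = [1.0, 1.4, 2.0, 2.6, 3.1, 3.5]

packages = pair_dies(die_leakage_w)
worst_package = max(low + high for low, high in packages)

# Without any pairing control, the spec would have to cover the case
# where two worst-case dies end up in the same package.
worst_if_unmanaged = 2 * max(die_leakage_w)

print(f"Worst paired package: {worst_package:.1f} W")
print(f"Worst case without pairing: {worst_if_unmanaged:.1f} W")
```

In this toy data set the managed worst case is well below the unmanaged one, which is the effect item 4 describes: the maximum leakage specification can be set against the controlled combination rather than against the worst individual die.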
5. Cutting I/O power through 3D IC integration: SSI-based 3D IC technology improves I/O interconnect power efficiency by 100X in terms of bandwidth/W compared to the equivalent I/O bandwidth built with conventional multi-chip designs. This dramatic power reduction results from keeping all connections on-chip, which requires significantly less power than is needed to drive signals off-chip. This design concept delivers high speed at low power.
6. Low-power design does not stop at the process level: Xilinx focused on power efficiency from every angle at the 20nm process node. Dozens of options were evaluated based on the percentage of dynamic power reduction that each could yield as well as the associated risks and the time required to implement. Each power-reduction technique was also judged for its impact on performance, cost, design-flow methodology, and overall schedule. The selected options were implemented in all Xilinx 20nm UltraScale devices.
7. ASIC-Like clocking delivers power savings: Clock routing and clock buffers in the UltraScale architecture have been entirely redesigned to provide vastly greater flexibility compared to all previous FPGA architectures. An abundance of clock routing and clock distribution tracks in both horizontal and vertical directions yields hundreds of global-capable clock buffers with thousands of placement options, more than 20X the number of global-capable clock buffers in previous architectures. In essence, the “center” of a clock network (the point from where clock skew starts to accumulate) can be placed in any clock region in an UltraScale FPGA. Clock networks run only where they are needed, the same as in an ASIC. The UltraScale architecture provides clock networks with the lowest skew and fastest performance available from programmable-logic devices, and these clock networks consume only the power needed to get clock signals from their source to their destinations.
UltraScale ASIC-Like Clocking
8. Fine-grained clock gating: Dynamic clock power can be further reduced by fine-grained clock gating. Clock drivers are dynamically gated off when associated logic in a design is not in use. This feature can be asserted statically or dynamically with a granularity of a single clock cycle. In the largest 20nm UltraScale devices, there are thousands of leaf-gateable clocks in addition to the familiar globally gateable clocks. Most of the clock-tree power (CV²f) is actually at the horizontal-buffer and leaf-clock levels because that is the level at which thousands of loads are driven. Clock gating at this level cuts dynamic power considerably. In addition, reducing the fanout drops clock buffer power because the buffer now drives only a few loads. This too reduces clock-tree power consumption. With the larger quantity of gateable clocks, some designs based on 20nm UltraScale devices can save 10–15% in clock-tree power, depending on the enable rate.
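The dependence on enable rate can be illustrated with a first-order model built on the same CV²f relationship. This is a minimal sketch, assuming that a fixed fraction of the clock-tree capacitance sits behind gateable leaf clocks and toggles only while its enable is asserted; the function, its parameters, and all numeric values are hypothetical and chosen only for illustration.

```python
def gated_clock_power(c_farads: float, v_volts: float, f_hz: float,
                      gateable_fraction: float, enable_rate: float) -> float:
    """First-order clock-tree power with leaf-level gating.

    Ungated power is C * V^2 * f. The gateable share of the capacitance
    switches only for the fraction of cycles given by enable_rate.
    """
    ungated = c_farads * v_volts ** 2 * f_hz
    effective = (1.0 - gateable_fraction) + gateable_fraction * enable_rate
    return ungated * effective


# Hypothetical clock tree: 2 nF switched capacitance, 0.95V, 500 MHz.
C, V, F = 2e-9, 0.95, 500e6

ungated = gated_clock_power(C, V, F, gateable_fraction=0.0, enable_rate=1.0)
gated = gated_clock_power(C, V, F, gateable_fraction=0.5, enable_rate=0.75)

saving_pct = (1.0 - gated / ungated) * 100.0
print(f"Ungated: {ungated:.3f} W, gated: {gated:.3f} W, "
      f"saving: {saving_pct:.1f}%")
```

In this model the saving is simply the gateable fraction multiplied by (1 - enable rate), so logic that is enabled most of the time yields little benefit, while rarely enabled logic yields the most.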
9. Use more of each CLB to reduce the number of CLBs used: The UltraScale architecture employs an enhanced Configurable Logic Block (CLB) that makes use of the available CLB resources more efficient. Numerous changes within the CLB structure provide added flexibility to the possible packing options. Every 6-input LUT is combined with two flip-flops. Each flip-flop has dedicated inputs and outputs, enabling all the components within a CLB to be used together or completely independently of one another. The flip-flops benefit from the increased quantity and flexibility of their control signals, with double the quantity of available clock-enable signals, optional “ignore” inputs on the clock-enable and reset ports, optional reset inversion allowing both active-high and active-low resets on flip-flops within the same CLB, and an additional clock signal for shift registers and distributed RAM. Collectively, these enhancements allow the Vivado Design Suite tools to pack many more design components (often functionally unrelated to each other) into a single CLB. Such designs consume the lowest possible power by achieving the best overall device utilization.
Use more of each UltraScale CLB to use fewer CLBs
10. Fewer CLBs mean less inter-CLB routing: Dramatic increases in CLB utilization enable tightly packed, high-performance designs. Denser packing ultimately results in less wire length and thus less wire capacitance—which contributes to lowering the total power consumption of a design.
11. Switch off unused Block RAMs: The UltraScale architecture supports power gating of unused block RAM. Static leakage from block RAMs substantially contributes to overall device leakage.
12. Cascading Block RAMs to lower dynamic power: UltraScale block RAM supports high-speed memory cascading, where dedicated data-cascade routing and output multiplexing permit the construction of faster large block-RAM arrays with dramatically lower dynamic power requirements. Multiple block RAMs can be cascaded as required with no impact on block-RAM timing. This feature minimizes the number of active block RAMs at any given instant, which further reduces dynamic power consumption.
13. Use fewer DSP slices: Xilinx significantly enhanced the Virtex-7 FPGA's DSP slice, already the industry's performance leader, for the UltraScale architecture. These enhancements permit faster digital signal processing while consuming fewer routing and logic resources outside of the DSP block. For example, the UltraScale architecture DSP block’s wider 27 x 18-bit multipliers can implement IEEE Std 754 double-precision arithmetic using two-thirds fewer DSP blocks compared with the same function implemented with the DSP blocks in Xilinx 7 series devices.
14. Reduce I/O power: I/O power has become a significant contributor to total power consumption. As programmable devices have evolved, core power has been greatly reduced but until recently (with the advent of the Xilinx 7 series All Programmable families), I/O power had not. Especially in memory-intensive applications, massive I/O requirements can consume as much as 50% of a design’s total power budget. Xilinx aggressively reduced I/O power in the 7 series FPGAs through programmable slew rates and drive strengths. UltraScale devices employ the same power-saving features.
15. Use DDR4 memory: The UltraScale architecture takes memory interfacing to a new level by enabling multiple DDR3/4-capable SDRAM memory controllers and including integrated DDR physical-layer (PHY) blocks on chip. You will see a 20% reduction in power when moving from DDR3 to DDR4, because DDR4 operates at a lower voltage of 1.2V.
16. Reduce high-speed serial transceiver power: The SerDes transceivers in Xilinx 20nm UltraScale devices have been optimized for high performance and low jitter and offer several low-power operating features. The UltraScale architecture-based GTH transceiver has been redesigned to cut the total power requirement by 50% compared to the GTX and GTH transceivers in the 7 series FPGAs.
17. Disable DFE when not needed: Many non-backplane applications do not need to use the decision feedback equalizer (DFE) circuitry in a SerDes transceiver. The DFE burns extra power so Xilinx UltraScale devices allow designers to switch off the DFE when using the SerDes ports for other applications. To save power, you can turn off the DFE circuitry and use the linear equalizer (LE) by itself. The LE uses much less power than the DFE because of its lower Rx gain and minimal circuitry.
18. Add hardened IP blocks: Replacing soft IP with an integrated block can reduce power consumption of that block by as much as 10X. Xilinx implements an integrated Interlaken IP core for chip-to-chip connectivity that scales to 150Gbps. The Xilinx IP core is based on the industry's leading and most widely deployed implementation. It is a flexible, high-performance, low-power implementation of the Interlaken protocol specification rev 1.2 that supports 12.5Gbps and 25Gbps transceivers. Combined with the transceiver technology in the UltraScale architecture and a flexible protocol layer, the integrated IP core minimizes the pin and power overhead of chip-to-chip interconnect. Integrated IP also exhibits lower latency than an equivalent soft IP solution, enabling performance that has not previously been possible.
Save power with hardened IP
19. Bake power reduction into the design tools: The Vivado Design Suite directly supports many of these UltraScale architectural power-reduction features. For example, the Vivado Design Suite power gates portions of a design by generating logic that drives leaf-clock buffer enables. The tool also automatically generates logic to support both static and dynamic power gating for block RAMs and can infer cascaded block RAMs.
Note: This blog post is based on the new White Paper “Power Reduction in Next-Generation UltraScale Architecture” by Srinivasa Kolluri. See the White Paper for additional technical details.