The news on DDR memory failures is not good, according to FuturePlus Systems’ co-founder Barbara Aichinger, who spoke at this week’s Memcon 2016 event in Santa Clara, California. Memory errors continue to plague the industry, and they are proving to be expensive. According to one paper co-authored by Facebook:
“…servers were flagged for memory repair if they had more than 100 correctable errors per week…” and “Under our more aggressive/proactive repair policy, we find that on the average around 46% of servers that have errors end up being repaired each month.”
Aichinger then did the math for the Memcon audience:
About 2% of Facebook’s servers have a memory failure every month.
46% of those servers with monthly memory failures get a memory DIMM swap.
Assuming Facebook has on the order of 100,000 servers, that’s 920 DIMM swaps per month.
There are 720 hours in a month.
Therefore, “Facebook is swapping out DIMMs [in its servers] every hour of every day of every month all year long!”
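Aichinger's arithmetic is easy to reproduce. The short Python sketch below only restates the figures quoted above; the 100,000-server fleet size is the assumption made in the talk, not a published Facebook number, and the variable names are purely illustrative.

```python
# Figures quoted in the talk; the fleet size is an assumed round number.
servers = 100_000
monthly_failure_rate = 0.02   # about 2% of servers have a memory failure each month
repair_fraction = 0.46        # 46% of those servers get a DIMM swap
hours_per_month = 30 * 24     # 720 hours in a 30-day month

swaps_per_month = servers * monthly_failure_rate * repair_fraction
swaps_per_hour = swaps_per_month / hours_per_month

print(f"{swaps_per_month:.0f} swaps/month, {swaps_per_hour:.2f} swaps/hour")
```

Under these assumptions the rate works out to a little more than one DIMM swap per hour, which is where the "every hour of every day" line comes from.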
Aichinger said this week that DDR4 SDRAM also exhibits row-hammer failures. Want proof? Check out this White Paper from SGI titled “The Row Hammer Effect: Enhancing Memory RAS.” Worse, says Aichinger, error rates are vendor dependent.
Certainly, one way to induce memory errors in all kinds of DRAM including DDR4 SDRAMs is to violate memory protocols. All DDR specs including JEDEC’s DDR4 spec contain rules, many rules, about event ordering. For example:
Do not activate a bank that’s already open
Do not precharge a bank that’s closed
Do not read or write a page that’s not open
Timing violations are yet another way to cause memory errors.
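To make the three ordering rules above concrete, here is a minimal Python sketch of a checker that walks a command trace and tracks which banks are open. It is a toy model, not the JEDEC task group's audit and not how any FuturePlus product works: the trace format (tuples of command name and bank number), the command mnemonics, and the function name `check_trace` are all assumptions made for illustration, and the sketch ignores timing parameters, refresh, and bank groups entirely.

```python
def check_trace(trace):
    """Check a list of (command, bank) tuples against three ordering rules.

    Returns a list of (index, message) tuples, one per violation found.
    The trace format is hypothetical; real captures carry far more detail.
    """
    open_banks = set()   # banks that currently have an activated (open) row
    violations = []
    for index, (command, bank) in enumerate(trace):
        if command == "ACT":
            if bank in open_banks:
                violations.append((index, "ACT to a bank that is already open"))
            open_banks.add(bank)
        elif command == "PRE":
            if bank not in open_banks:
                violations.append((index, "PRE to a bank that is closed"))
            open_banks.discard(bank)
        elif command in ("RD", "WR"):
            if bank not in open_banks:
                violations.append((index, command + " to a bank with no open page"))
        else:
            raise ValueError("unknown command: " + repr(command))
    return violations


# A made-up trace containing one violation of each rule.
trace = [("ACT", 0), ("RD", 0), ("ACT", 0), ("PRE", 0), ("PRE", 0), ("WR", 1)]
for index, message in check_trace(trace):
    print(index, message)
```

A real audit would also have to check the timing rules mentioned above, which requires timestamps on every command and the timing values programmed for the specific DIMM.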
Row hammer is a way to intentionally cause a memory error by repeatedly activating a row in the SDRAM. Repeated activation causes charge leakage in adjacent rows. Leak enough charge into a victim row and you can flip bits in that row even though you’re not explicitly accessing it. (And perhaps you cannot access that particular row because of memory management policies, so intentional row hammer is a form of hacking.)
This is your SDRAM on Row Hammer. Any questions?
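One way to picture what a row-hammer check looks for is to count activations per row and flag any row that is activated suspiciously often, along with its neighbors as potential victims. The Python sketch below does only that. Everything specific in it is an assumption: the function name `find_hammered_rows`, the `(bank, row)` trace format, the idea that the trace covers a single refresh window, and above all the threshold, which depends on the device and is not given in this article. It also assumes logically adjacent row addresses are physically adjacent, which is not guaranteed on real SDRAMs.

```python
from collections import Counter


def find_hammered_rows(activations, threshold):
    """Map each heavily activated (bank, row) to its assumed neighbor rows.

    activations: iterable of (bank, row) pairs, one per ACT command, assumed
                 to have been captured within a single refresh window.
    threshold:   hypothetical activation count that counts as hammering.
    """
    counts = Counter(activations)
    report = {}
    for (bank, row), count in counts.items():
        if count >= threshold:
            # Assumes adjacent row numbers are physically adjacent rows.
            neighbors = [(bank, r) for r in (row - 1, row + 1) if r >= 0]
            report[(bank, row)] = neighbors
    return report


# Made-up data with a deliberately tiny threshold so the example stays short.
activations = [(0, 100)] * 5 + [(0, 7)] * 2 + [(1, 0)] * 5
print(find_hammered_rows(activations, threshold=5))
```

Counting activations only identifies candidate aggressor rows; whether bits actually flip in the neighboring rows depends on the particular SDRAM.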
Aichinger is standing atop this particular soapbox at the moment because she’s working with a JEDEC Task Group to produce a protocol-checks document for auditing SDRAM use. At Memcon, Aichinger proposed a compliance audit that can determine whether or not JEDEC specifications are being met in a specific application.
Aichinger’s company, FuturePlus Systems, makes DIMM interposers that are pretty handy for this purpose. The interposers allow you to connect high-speed DSOs and logic analyzers directly to DIMM sockets so that you can monitor memory activity at speed to detect protocol violations caused by a number of issues including:
BIOS programming errors
Incorrect SPD programming (the SPD is an on-DIMM EEPROM with timing info about the specific SDRAMs on the DIMM)
Memory controller violations
Row hammer exploits
FuturePlus Systems also offers a self-contained piece of test equipment called the FS2800 DDR Detective that can perform this type of audit on DDR3, LPDDR3, DDR4, and LPDDR4 SDRAMs at speeds up to DDR4-3200. (You’ll find a wealth of information about these memory-error problems on FuturePlus Systems’ Web site, www.ddrdetective.com, and on the company’s blog.)
So where does that leave you? Where does it leave your design? Well, if you’re using an on-chip memory controller like the ones available in Xilinx Zynq-7000 SoCs and Zynq UltraScale+ MPSoCs, you’re going to want to be really careful about how you program these memory controllers and you might well want to perform an audit to ensure that you’re operating the DDR memory correctly. I’m sure FuturePlus Systems would be happy to sell or rent a DDR Detective to you for an audit.
Beyond that, you may well want a special memory controller for your application’s specialized needs. If so, you’ll either need to design an ASIC or develop the controller using the programmable logic in a Xilinx All Programmable device. The programmable-logic route will get you to the finish line much faster and with more flexibility.