The term “row hammer” in relation to DDR3 SDRAM has appeared on my PC screen several times in the past few weeks—often enough to break into my conscious thoughts. Today, I received a White Paper about row hammer in DDR3 SDRAMs from Barbara Aichinger, founder of memory tool vendor FuturePlus. Barbara’s White Paper got my attention and if you’re designing with DDR3 SDRAMs, she should get your attention too.
“A recently discovered source of DRAM errors is the row hammer effect, which happens when a large number of activations to a particular aggressor row in the DRAM degrades the charge on the capacitor(s) in a nearby victim row, resulting in one or more bit errors in the victim row (Fig. 2). Manufacturers have released little public information on this error, and no currently published DDR DRAM specification forbids row hammering. However, it is a significant enough issue that one manufacturer makes test equipment that can detect row hammer conditions. Emerging solutions will prevent the memory usage patterns that could cause row hammering errors, or they will use potential future memory features like targeted row refresh to prevent data loss.”
Here’s Fig. 2 from Greenberg’s article:
Greenberg is currently the director of product marketing for DDR Controller IP at Synopsys but when I worked with him at Denali Software, he was always the go-to guy for all things memory. You can see why.
So just what is happening here? Aichinger’s FuturePlus White Paper says:
“In the quest to get memories smaller and faster memory vendors have had to make trade offs. One of these is very small physical geometries. These small geometries put memory cells very close together and as such one memory cell’s charge can leak into an adjacent one causing a bit flip. It has come to the attention of the industry that this is indeed happening under certain conditions. Very simply the problem occurs when the memory controller under command of the software causes an ACTIVATE command to a single row address repetitively. If the physically adjacent rows have not been ACTIVATED or Refreshed recently the charge from the over ACTIVATED row leaks into the dormant adjacent rows and causes a bit to flip. This failure mechanism has been coined ‘Row Hammer’ as a row of memory cells are being ‘hammered’ with ACTIVATE commands. Once this failure occurs a Refresh command from the Memory Controller solidifies the error into the memory cell. Current understanding is that the charge leakage does not damage the physical the memory cell which makes repeated memory tests to try to find the failing device useless.”
I wondered about the conditions that might trigger the row-hammer problem and the FuturePlus White Paper supplies a ready answer:
“If the software running does repeated accesses to a single location the memory controller will generate excessive ACTIVATE commands. Currently there is nothing in the memory controller design to prevent this from happening. Software often uses repetitive accesses to check to see if a task has been completed. This is a very common occurrence in software architecture and referred to as a Semaphore. Several tasks or threads will communicate with each other using a shared location in the memory. Thus they all need to repeatedly access these shared locations in order to communicate.”
FuturePlus’ FS2800 DDR Detective is a piece of test gear that monitors your system’s accesses to DDR3 SDRAM and counts row accesses. Too many accesses to one row between refresh cycles could trigger the row-hammer problem. One suggested solution is to double the refresh frequency; however, that creates two new problems: it reduces SDRAM throughput and it increases operating power.
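To make the idea of counting row accesses between refreshes concrete, here is a minimal sketch in C. It is not how the FS2800 works internally (the White Paper excerpts above don't say); the names `row_monitor_t`, `monitor_activate`, and `monitor_refresh`, the eight-row bank, and the threshold of 5 are all hypothetical values picked for illustration.

```c
#include <stdint.h>
#include <string.h>

#define NUM_ROWS         8  /* hypothetical: a tiny bank, for illustration only */
#define HAMMER_THRESHOLD 5  /* hypothetical: not a vendor-specified limit */

typedef struct {
    uint32_t activations[NUM_ROWS]; /* ACTIVATEs seen per row since the last refresh */
} row_monitor_t;

/* Call when a refresh completes: every row has been recharged, so start over. */
void monitor_refresh(row_monitor_t *m)
{
    memset(m->activations, 0, sizeof m->activations);
}

/* Record one ACTIVATE to a row. Returns 1 once that row has been activated
 * more than HAMMER_THRESHOLD times since the last refresh, otherwise 0. */
int monitor_activate(row_monitor_t *m, unsigned row)
{
    if (row >= NUM_ROWS)
        return 0; /* ignore rows outside the modeled bank */
    m->activations[row]++;
    return m->activations[row] > HAMMER_THRESHOLD;
}
```

Calling `monitor_activate()` for every ACTIVATE and `monitor_refresh()` at each refresh flags any row that is opened too often before its neighbors get recharged. A real threshold would have to come from the memory vendor or from measurement.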
Here’s a photo of the FuturePlus FS2800 DDR Detective with a DDR3 SDRAM interposer:
If you’re thinking that error correction might save you here (wouldn’t that be nice?), the FuturePlus White Paper sets you straight:
“Error correction techniques built into the DDR3 standard such as ECC are expensive to implement, add additional latency to every Read and Write transaction and will only correct a single bit error and only detect a double bit error. Anything beyond two bits of error in a 64 bit transaction will go undetected. Thus in many ways ECC is a false sense of security if users feel that this will save them.”
There’s another problem as well: accessing any SDRAM takes tens or hundreds of processor cycles. In other words, it’s slow. That makes SDRAM a poor choice for storing frequently accessed data such as semaphores. Embedded system designers using the Xilinx Zynq SoC have other readily available storage alternatives for small data structures such as semaphores. In particular, the Zynq SoC has 256 Kbytes of fast, on-chip SRAM that can’t exhibit row hammer because it’s SRAM, not DDR3 SDRAM. All you need to do is make sure the semaphores and other small, shared, and frequently used data structures are located in this SRAM.
Is that possible? Certainly if you write all of your code in assembly language, it’s quite possible. However, that’s not realistic with today’s embedded systems, which are mostly programmed in C and C++.
To get a better answer, I called my go-to person for such matters: Jack Ganssle of the Ganssle Group. Jack’s history with embedded systems goes all the way back to the Intel 8080. He also owned a very successful in-circuit emulator company in the 1980s, so he’s been soaking up embedded systems design knowledge for decades. These days, Jack consults on embedded systems design projects and, as far as I’m concerned, he knows everything you need to know about writing embedded firmware.
Jack told me that you can’t isolate structures like semaphores in C, but you can use the linker to restrict certain structures to a specific memory address range. These days, the most likely person to do this will be the system architect who sets up the hardware environment for the software engineers.
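Here is a hedged sketch of what that linker-based placement can look like with a GNU toolchain. Standard C has no way to say where a variable lives, so the sketch uses GCC's `section` attribute (a compiler extension) to tag the variable and leaves the actual placement to the linker script. The section name `.ocm_data`, the macro `OCM_DATA`, and the names `task_done_flag`, `signal_done`, and `is_done` are all hypothetical; the addresses in the commented linker fragment are deliberately left as placeholders.

```c
#include <stdint.h>

/* Hypothetical section name. It only ends up in on-chip SRAM if the linker
 * script maps it there, for example with a GNU ld fragment such as:
 *
 *   MEMORY   { OCM (rwx) : ORIGIN = <on-chip RAM base>, LENGTH = <size> }
 *   SECTIONS { .ocm_data : { *(.ocm_data) } > OCM }
 */
#if defined(__GNUC__) && defined(__ELF__)
#define OCM_DATA __attribute__((section(".ocm_data")))
#else
#define OCM_DATA /* no-op on toolchains without this extension */
#endif

/* A frequently polled flag, tagged so the linker can keep it out of DDR3. */
OCM_DATA volatile uint32_t task_done_flag = 0;

/* Producer side: mark the task as finished. */
void signal_done(void)
{
    task_done_flag = 1;
}

/* Consumer side: poll the flag. With the right linker script, each poll
 * reads on-chip SRAM rather than activating a DRAM row. */
int is_done(void)
{
    return task_done_flag != 0;
}
```

The C source only names the section; the address range is decided in one place, the linker script, which fits the idea that the system architect sets this up once for the software team.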
Jack also pointed out that the same row-hammer problem might occur with a stack stored in DDR3 SDRAM, though I think that’s less likely. Stacks easily exceed the size of one SDRAM row, but it certainly could happen inside a really tight loop that’s executed a very large number of times.
Then Jack sent me this email:
“Thanks, Steve. This is very interesting. I suspect there are far more failure modes than just semaphores. For instance:
void delay(int delay_time)
{
    for (int i = 0; i < delay_time; i++);
}
This is a delay routine which everyone will claim no one uses; wise people use a timer interrupt. Yet I see this in code all of the time. delay_time is in RAM, and if in DDR3 the loop will hammer the DRAM.
Since a complete refresh occurs slowly - on the order of msec - and the code runs fast, there are approximately a zillion ways a lot of reads to one address can happen before its row is refreshed.
Then there are IIR/FIR filters, signal processing, image processing and the like which often must read a location many times. People may complain that a small error in these cases isn't important - who cares if there's a one-pixel error on a screen for 1/30 of a second? But with ECC if there's a 2 bit error which can't be corrected, but can be detected, most likely the OS will crash the application as there's generally no good recovery.”
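Jack's "zillion" can be given a rough number. The sketch below divides a refresh window by the minimum time between two ACTIVATEs to the same bank. The function name `max_activations` is made up for this example, and the figures used afterward (a 64 ms refresh window and a row cycle time of about 50 ns) are assumed, typical-looking DDR3 values; check the datasheet for a real part.

```c
#include <stdint.h>

/* Upper bound on the number of ACTIVATE commands that can target one row
 * within a single refresh window.
 *   refresh_window_ns: time between refreshes of a given row (assumed input)
 *   row_cycle_ns:      minimum ACTIVATE-to-ACTIVATE time (assumed input)
 * Returns 0 if row_cycle_ns is 0, to avoid dividing by zero. */
uint64_t max_activations(uint64_t refresh_window_ns, uint64_t row_cycle_ns)
{
    if (row_cycle_ns == 0)
        return 0;
    return refresh_window_ns / row_cycle_ns;
}
```

With the assumed values, `max_activations(64000000, 50)` gives 1,280,000 possible activations of a single row before its neighbors are refreshed. Doubling the refresh rate halves the window, and `max_activations(32000000, 50)` gives 640,000, which shows why faster refresh shrinks the exposure without removing it.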
Note: After this blog posted, I received the following from Austin Lesea, a Principal Engineer at Xilinx and a real expert in these matters:
"A common misconception is that the standard single error correct, double error detect (secded) code cannot detect more than 2 errors.
Not true, it can detect any number of errors greater than 2, but not 100%. It can detect them to only one in 2 to the exponent of the number of bits in the code word accuracy (different ECC methods are better or worse but still not 100%). Some errors of more than 2 bits can not be caught, but in reality, almost all of them are caught.
The physical mechanism for this error may be such that almost all errors are one weak bit (different parts may have different behaviors).
For example, a soft error caused by a neutron strike will never create an upset in a bit, then skip three bits, and then cause another bit upset. Errors are always adjacent because the particle strike is localized (based on geometry and sizes). So, the error signature is a must to understand how to apply ECC properly.
Is there a better ECC for the hammer case? You bet. Has anyone thought of it? Not yet. Are they working on it? I am sure the smarter folks are."
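To see Austin's point in miniature, here is a toy SECDED code in C: an extended Hamming code that protects 4 data bits with 4 check bits. It is only a sketch for illustration, not the code used on real ECC memory, which protects 64-bit words with a wider code. The function names and the `ecc_status_t` type are invented for this example.

```c
#include <stdint.h>

typedef enum { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE } ecc_status_t;

/* XOR of the positions (1..7) of all set bits; zero for a valid codeword. */
static unsigned syndrome(uint8_t cw)
{
    unsigned s = 0;
    for (unsigned pos = 1; pos < 8; pos++)
        if (cw & (1u << pos))
            s ^= pos;
    return s;
}

/* Parity (0 or 1) taken over all eight bits of the codeword. */
static unsigned parity8(uint8_t cw)
{
    unsigned p = 0;
    for (unsigned pos = 0; pos < 8; pos++)
        p ^= (cw >> pos) & 1u;
    return p;
}

/* Encode the low 4 bits of data. Data bits sit at positions 3, 5, 6, 7;
 * Hamming check bits at 1, 2, 4; the overall parity bit at position 0. */
uint8_t secded_encode(uint8_t data)
{
    static const unsigned data_pos[4] = {3, 5, 6, 7};
    uint8_t cw = 0;
    for (unsigned i = 0; i < 4; i++)
        if (data & (1u << i))
            cw |= (uint8_t)(1u << data_pos[i]);
    unsigned s = syndrome(cw);
    for (unsigned p = 1; p <= 4; p <<= 1) /* set check bits so the syndrome is 0 */
        if (s & p)
            cw |= (uint8_t)(1u << p);
    cw |= (uint8_t)parity8(cw);           /* make the overall parity even */
    return cw;
}

/* Extract the 4 data bits from a codeword. */
uint8_t secded_data(uint8_t cw)
{
    return (uint8_t)(((cw >> 3) & 1u) | (((cw >> 5) & 1u) << 1) |
                     (((cw >> 6) & 1u) << 2) | (((cw >> 7) & 1u) << 3));
}

/* Check a codeword, repairing it in place if it looks like a single-bit error. */
ecc_status_t secded_decode(uint8_t *cw)
{
    unsigned s = syndrome(*cw);
    unsigned p = parity8(*cw);
    if (s == 0 && p == 0)
        return ECC_OK;
    if (p == 1) {                      /* odd number of flips: assume exactly one */
        *cw ^= (uint8_t)(1u << s);     /* s == 0 means the parity bit itself flipped */
        return ECC_CORRECTED;
    }
    return ECC_UNCORRECTABLE;          /* even number of flips with nonzero syndrome */
}
```

Flip one bit of `secded_encode(0xB)` and `secded_decode()` repairs it; flip two and it reports `ECC_UNCORRECTABLE`. Flip the three bits at positions 3, 5, and 6, though, and the decoder reports `ECC_CORRECTED` while handing back the wrong data, because it cannot tell a three-bit error from a one-bit error. In this tiny code every three-bit error is miscorrected that way. With a wider code, many multi-bit syndromes point at no valid bit position and can be flagged instead, which fits Austin's remark that almost all of them are caught.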
If you’d prefer to watch a 5-minute video about this problem rather than reading the White Paper, here’s a FuturePlus video covering the same topics: