05-16-2018 09:13 AM
I have a custom Zynq UltraScale+ board running Linux, and I am experiencing a problem when accessing the PS DDR from Linux.
The device has DDR4 ECC memory, which I can memtest successfully from U-Boot or from the Xilinx bare-metal application. However, when I read or write to the memory from Linux (e.g. using the memtester application), I get EDAC CE and UE errors, and the test fails with any appreciable amount of memory (>=500k).
What is odd is that Linux is able to use the DDR well enough to boot, and run basic applications etc.
Does anyone know why this may be happening?
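For anyone trying to reproduce this, one way to quantify the CE and UE errors is to snapshot the EDAC counters before and after a memtester run. The following is only a minimal sketch: it assumes the kernel's EDAC driver exposes the usual `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/mcN/`, and the helper names `read_edac_counts` and `edac_delta` are made up for illustration.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} for each EDAC memory controller.

    edac_root is a parameter so the function can be pointed at a copy of the
    sysfs tree; the default is the standard EDAC sysfs location.
    """
    counts = {}
    for mc in sorted(Path(edac_root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


def edac_delta(before, after):
    """Return per-controller increases in CE and UE counts between two snapshots."""
    delta = {}
    for name, now in after.items():
        then = before.get(name, {"ce": 0, "ue": 0})
        delta[name] = {"ce": now["ce"] - then["ce"], "ue": now["ue"] - then["ue"]}
    return delta


if __name__ == "__main__":
    # Take one snapshot here, run the memory test, then take a second snapshot
    # and pass both to edac_delta().
    print(read_edac_counts())
```

Taking a snapshot, running the test, and passing both snapshots to `edac_delta` shows how many correctable and uncorrectable errors the run produced, which makes it easier to compare configurations.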
05-17-2018 01:27 PM
05-21-2018 02:19 AM
Thank you for your reply.
My DDR is 8GB, which I suppose is fairly large; however, I get the same behavior when I tell U-Boot/Linux about only a much smaller portion of the memory (say 2GB).
The FSBL does claim to be initializing the full memory on boot in its debug log.
I've tried disabling ECC, and interestingly the memory test does take much longer to fail in Linux, but it does still fail eventually.
I have also checked the read and write eyes using the standalone test as you suggested, and the read and write eyes look good. The read eye widths are all between 414 and 438ps wide (66-70%), and the write eyes are all between 414 and 455ps wide (66-72%).
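As a quick sanity check on those figures, each width and percentage pair implies a reference interval that the test treats as 100%. The sketch below only does that arithmetic; it assumes the percentage is simply the eye width divided by a fixed reference interval, which the test output quoted here does not confirm.

```python
def implied_full_interval_ps(width_ps, percent):
    """Return the interval (in ps) that would correspond to 100%,
    assuming percent = width / interval * 100."""
    return width_ps * 100.0 / percent


# Values quoted above: (eye width in ps, reported percentage).
reported = {
    "read, narrowest": (414, 66),
    "read, widest": (438, 70),
    "write, narrowest": (414, 66),
    "write, widest": (455, 72),
}

for label, (width_ps, percent) in reported.items():
    interval = implied_full_interval_ps(width_ps, percent)
    print(f"{label}: {width_ps} ps at {percent}% implies ~{interval:.0f} ps for 100%")
```

Because the percentages are rounded, the implied interval comes out only approximately the same for each pair (somewhere around 625 to 632 ps). Comparing that against the bit time expected for the configured data rate is a cheap way to confirm which speed the eye test actually ran at.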
Interestingly, if I halve the memory speed to 600MHz and disable ECC, I can get the memory to test reliably (I left memtester running in a loop all Friday afternoon without a failure), but it would be good to understand why the memory test fails under all other circumstances (enabling ECC, running at full speed, or both).
It seems that you are correct about the memory being marginal for some reason, but I am struggling to understand why.
05-21-2018 11:25 AM
05-29-2018 09:18 AM
Thanks for that config parameter.
I've been running lots of further tests on this with different combinations of bus widths, speeds, and ECC enabled/disabled.
One interesting scenario is that, with the 2T timing parameter enabled, I can get the memory to work at full speed and full width (1200MHz, 64-bit), provided that ECC is disabled. Enabling ECC on the same configuration causes failure.
This to me feels like it could be something other than an ECC issue, because I can't see any reason why enabling ECC should make the memtest fail. Intuitively the signal integrity on the data lines should be very similar whether ECC is enabled or not?
05-31-2018 12:57 AM
So after further investigation, I believe that I am experiencing two separate issues.
One is caused by the layout. After reviewing UG583, I noted that on my board the clock line isn't skewed with respect to the address/command/control lines as required by Table 2-26 (p.55). This means there occasionally isn't quite enough setup time for the address lines, causing one of the byte lanes not to be written correctly. This also explains why the 2T timing parameter solves this, since adding an extra clock cycle of latency gives the chips more than enough setup time.
I am still confused as to why enabling ECC does not work. With the 2T timing parameter enabled and ECC disabled, the memory interface never experiences errors, but as soon as ECC is enabled, errors are very common (far more common than with both ECC and the 2T timing parameter disabled). I can't really explain why this is the case, since the ECC IC is identical to all of the other DDR ICs and sits in the middle of the bank. The ECC is being initialized by the FSBL, and in any case my bare-metal memory test always writes before reading, which rules out initialization issues.
I have two questions therefore:
- When a memory test fails due to an ECC error in bare metal, and you end up in the SError handler, is there any way to read the data that the memory returned? Whenever I try to read from the debugger I get read issues.
- I notice from UG1087 that there are two AC bit delay line registers which one can set to delay the clock slightly (ACBDLR0 and ACBDLR16). Is there any documentation on these registers (i.e. how much delay is given by each bit)? It seems that I could use these registers to skew my clock and overcome the setup time issue without using the 2T timing parameter.
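Adjusting a delay-line field in a register such as ACBDLR0 normally comes down to a read-modify-write of one bit field. The sketch below shows only that bit manipulation on a plain integer. The field position and width used in the example (shift 0, width 6) are hypothetical placeholders, not values taken from UG1087, and the actual register read and write are left out.

```python
def get_field(reg_value, shift, width):
    """Extract a bit field from a 32-bit register value."""
    return (reg_value >> shift) & ((1 << width) - 1)


def set_field(reg_value, shift, width, field_value):
    """Return reg_value with one bit field replaced, leaving all other bits untouched."""
    mask = (1 << width) - 1
    if not 0 <= field_value <= mask:
        raise ValueError(f"field value {field_value} does not fit in {width} bits")
    cleared = reg_value & ~(mask << shift) & 0xFFFFFFFF
    return cleared | (field_value << shift)


# Hypothetical layout: a 6-bit delay field at bit 0. Check UG1087 for the real one.
HYPOTHETICAL_SHIFT = 0
HYPOTHETICAL_WIDTH = 6

old_value = 0x00000000  # stand-in for the value read back from the register
new_value = set_field(old_value, HYPOTHETICAL_SHIFT, HYPOTHETICAL_WIDTH, 0x15)
print(hex(new_value))
```

Keeping the helper free of any hardware access means it can be checked on a host machine before the computed value is written to the PHY.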
05-31-2018 12:09 PM
Good work! That all seems plausible.
It sounds like there is something wrong with your ECC byte somehow.
Starting in 2018.2, we will have fixed the issue where ECC errors cause the memory test to stall in the SError handler. From an architecture perspective, the correct value is returned; it is just that the SError diverts your execution.
Yes, it is possible to phase shift the address interfaces. See this Answer Record:
Let me know if phase shifting helps, I've not had to use it yet.
06-01-2018 01:39 AM
Yes, I agree that something seems odd with the ECC byte, and I'll continue investigating. It just seems strange, since it is identical to the other bytes.
Thanks for the link to the answer record. Unfortunately, I think using that register would do the opposite of what I need! In my case the clock line is the same length as the address lines at each IC, so I need to delay the clock with respect to the address lines, but the registers in AR70867 delay the address lines with respect to the clock.
I have tried using ACBDLR0 and ACBDLR16 to achieve this. Empirical measurements showed that the delay lines were ~3ps per tap, but this is very much pushing the limits of what my oscilloscope can measure! I haven't had any success with this so far, so it may be necessary to leave the 2T timing parameter in place for this revision of the PCB.
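To get a feel for how far the delay lines can reach, the ~3ps per tap measurement can be turned into a tap count for a given target skew. This is only a rough sketch: the per-tap delay is the approximate scope measurement above, the 100 ps target in the example is an arbitrary illustration rather than a figure from UG583, and the optional `max_taps` limit is a hypothetical parameter for whatever the field width turns out to allow.

```python
import math


def taps_for_delay(target_ps, ps_per_tap=3.0, max_taps=None):
    """Return the smallest tap count giving at least target_ps of delay.

    ps_per_tap defaults to the rough ~3 ps measured on the bench.
    max_taps, if given, is the largest setting the delay field can hold.
    """
    if target_ps < 0 or ps_per_tap <= 0:
        raise ValueError("target_ps must be >= 0 and ps_per_tap must be > 0")
    taps = math.ceil(target_ps / ps_per_tap)
    if max_taps is not None and taps > max_taps:
        raise ValueError(f"{taps} taps needed, but the field allows only {max_taps}")
    return taps


# Arbitrary illustrative target, not a requirement taken from UG583.
print(taps_for_delay(100))
```

If the required skew needs more taps than the field can hold, the delay lines alone cannot make up for the layout, which would be consistent with having to keep the 2T timing parameter.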
08-05-2019 01:33 AM
Hi @wastie,
Here's how you apply the 2T timing parameter:
I hope this helps you find your problem!