josh_tyler
Contributor
2,206 Views
Registered: ‎04-10-2018

Zynq UltraScale+ PS DDR EDAC errors when accessing memory.

I have a Zynq UltraScale+ custom board running Linux, and am experiencing a problem when accessing the PS DDR from Linux.

The device has DDR4 ECC memory, which I can memtest successfully from U-Boot or the Xilinx bare-metal application, but when I read or write to the memory from Linux (e.g. using the memtester application) I get EDAC CE and UE errors, and the test fails with any appreciable amount of memory (>=500k).
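
For readers unfamiliar with the EDAC terms: a CE is a corrected (single-bit) error and a UE is an uncorrectable one. Below is a minimal sketch of how a SECDED code over a 64-bit word with 8 check bits separates the two cases. It assumes a textbook extended Hamming (72,64) layout and is an illustration only, not the code the Zynq DDR controller actually implements. The names `encode` and `check` are hypothetical.

```python
N_POS = 72  # positions 0..71: 64 data bits, 7 Hamming check bits, 1 overall parity bit


def _syndrome(word: int) -> int:
    """XOR together the positions (1..71) of all set bits."""
    s = 0
    for pos in range(1, N_POS):
        if (word >> pos) & 1:
            s ^= pos
    return s


def _odd_parity(word: int) -> bool:
    """True if the word contains an odd number of set bits."""
    return bin(word).count("1") % 2 == 1


def encode(data: int) -> int:
    """Place 64 data bits at the non-power-of-two positions and add check bits."""
    word, bit = 0, 0
    for pos in range(1, N_POS):
        if (pos & (pos - 1)) == 0:    # 1, 2, 4, ..., 64 are reserved for check bits
            continue
        word |= ((data >> bit) & 1) << pos
        bit += 1
    s = _syndrome(word)
    for k in range(7):                # set check bits so that the syndrome becomes zero
        if (s >> k) & 1:
            word |= 1 << (1 << k)
    if _odd_parity(word):             # position 0 holds overall even parity
        word |= 1
    return word


def check(word: int):
    """Classify a received word as ("ok" | "CE" | "UE", possibly corrected word)."""
    s, odd = _syndrome(word), _odd_parity(word)
    if s == 0 and not odd:
        return "ok", word
    if odd and s < N_POS:             # single flip: the syndrome names its position
        return "CE", word ^ (1 << s)
    return "UE", word                 # even number of flips, or out-of-range syndrome


if __name__ == "__main__":
    w = encode(0x0123456789ABCDEF)
    print(check(w)[0])                          # clean word
    print(check(w ^ (1 << 37))[0])              # one flipped bit
    print(check(w ^ (1 << 37) ^ (1 << 5))[0])   # two flipped bits
```

In this sketch a single flipped bit anywhere in the 72-bit word is repaired and reported as a CE, while two flipped bits can only be detected, which corresponds to a UE.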

What is odd is that Linux is able to use the DDR well enough to boot, and run basic applications etc.

Does anyone know why this may be happening?

9 Replies
dylan
Xilinx Employee
2,177 Views
Registered: ‎07-30-2007

Is there anything else strange with your memory, such as it being very large?

The main thing is that the FSBL will initialize all of the DRAM ECC on boot. If this is not done, obviously you'll get errors on access.

Linux is a stronger memory test than a synthetic one, so you may have marginal memory for some reason. I'd suggest disabling ECC, checking the read and write eyes with the standalone test, and also trying to boot Linux.
josh_tyler
Contributor
2,149 Views
Registered: ‎04-10-2018

Hi Dylan,

Thank you for your reply.

My DDR is 8GB, which I suppose is fairly large; however, I get the same behavior when I only tell U-Boot/Linux about a much smaller portion of the memory (say 2GB).

The FSBL does claim to be initializing the full memory on boot in its debug log.

I've tried disabling ECC, and interestingly the memory test does take much longer to fail in Linux, but it does still fail eventually.

I have also checked the read and write eyes using the standalone test as you suggested, and they look good. The read eyes are all between 414 and 438 ps wide (66-70%), and the write eyes are all between 414 and 455 ps wide (66-72%).
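
As a quick arithmetic check on these figures, dividing each eye width by its quoted percentage gives the unit interval the percentages are relative to, roughly 626 to 632 ps here (the percentages are rounded). The sketch below does that division and, for comparison, computes the unit interval for a given DDR clock, assuming two data transfers per clock cycle. The function names are made up for illustration, and the thread does not state the speed at which the eye test was run.

```python
def implied_ui_ps(eye_width_ps: float, eye_percent: float) -> float:
    """Unit interval implied by an eye width and its percentage of the UI."""
    return eye_width_ps / (eye_percent / 100.0)


def ui_from_clock_ps(clock_mhz: float) -> float:
    """Unit interval for a double-data-rate interface: half a clock period."""
    return 1e6 / (2.0 * clock_mhz)


if __name__ == "__main__":
    # (width in ps, percent) pairs quoted in the post above
    for width, pct in [(414, 66), (438, 70), (455, 72)]:
        print(f"{width} ps at {pct}% -> UI of about {implied_ui_ps(width, pct):.0f} ps")
    # Clock frequencies mentioned elsewhere in the thread
    for mhz in (600, 1200):
        print(f"{mhz} MHz clock -> UI of about {ui_from_clock_ps(mhz):.0f} ps")
```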

Interestingly, if I halve the memory speed to 600MHz and disable the ECC, I can get the memory to test reliably (I left memtester running in a loop all Friday afternoon without a failure), but it would be good to understand why the memory test fails under all other circumstances (enabling ECC, running at full speed, or both).

It seems that you are correct about the memory being marginal for some reason, but I am struggling to understand why.

dylan
Xilinx Employee
2,134 Views
Registered: ‎07-30-2007

Hi, I agree, it does sound like a marginal interface.

The fact that slowing down the interface improves the behavior makes it sound more like a signal integrity issue.

I agree that, going by the eye test, your read and write data eyes are likely good (you didn't specify your original speed). This checks the DQ pins. However, the eye test does not test the command/address pins or the DQS pins themselves, only their nominal use.

If it seems to be an address issue, consider trying the 2T timing parameter, which gives additional setup time for address pins, as a debug step on the PS IP parameters: CONFIG.PSU__DDRC__ENABLE_2T_TIMING = 1

Of course you'll want to step back and compare your schematic and layout against the UG583 PCB Guidelines.
josh_tyler
Contributor
2,092 Views
Registered: ‎04-10-2018

Hi Dylan,

Thanks for that config parameter.

I've been running lots of further tests on this with different combinations of bus widths, speeds, and ECC enabled/disabled.

One interesting scenario is that, with the 2T timing parameter enabled, I can get the memory to work at full speed and full width (1200MHz, 64-bit), provided that ECC is disabled. Enabling ECC on the same configuration causes failure.

This to me feels like it could be something other than an ECC issue, because I can't see any reason why enabling ECC should make the memtest fail. Intuitively, shouldn't the signal integrity on the data lines be very similar whether ECC is enabled or not?

josh_tyler
Contributor
2,073 Views
Registered: ‎04-10-2018

So after further investigation, I believe that I am experiencing two separate issues.

One is caused by the layout. After reviewing UG583, I noted that the clock line isn't skewed with respect to the address/command/control lines as per Table 2-26 (p.55). This means there occasionally isn't quite enough setup time for the address lines, causing one of the byte lanes to not be written correctly. This also explains why the 2T timing parameter solves the problem, since adding an extra clock cycle of latency gives the chips more than enough setup time.
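
To put a number on "an extra clock cycle", here is a small sketch assuming the 1200 MHz DDR clock mentioned earlier in the thread, and taking 2T timing to mean that the command/address signals are held for two clock cycles instead of one. The helper names are hypothetical and the figures are plain arithmetic, not values taken from UG583.

```python
def clock_period_ps(clock_mhz: float) -> float:
    """Clock period (tCK) in picoseconds for a clock frequency in MHz."""
    return 1e6 / clock_mhz


def extra_setup_ps(clock_mhz: float, n_t: int = 2) -> float:
    """Additional command/address setup time of nT timing relative to 1T."""
    return (n_t - 1) * clock_period_ps(clock_mhz)


if __name__ == "__main__":
    for mhz in (600, 1200):
        print(f"{mhz} MHz: tCK = {clock_period_ps(mhz):.1f} ps, "
              f"2T adds {extra_setup_ps(mhz):.1f} ps of setup time")
```

At 1200 MHz one clock period is about 833 ps, so the extra cycle is large compared with the sub-cycle clock skew that a layout guideline asks for.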

I am still confused as to why enabling ECC does not work. With the 2T timing parameter enabled, the memory interface never experiences errors, but as soon as ECC is enabled, errors are very common (far more common than with both ECC and the 2T timing parameter disabled). I can't really explain why this is the case, since the ECC IC is identical to all of the other DDR ICs and sits in the middle of the bank. The ECC is being initialized by the FSBL, and besides that, my bare-metal memory test always writes before reading, which rules out any initialization issues.

I have two questions therefore:

- When a memory test fails due to an ECC error in bare metal, and you end up in the SError handler, is there any way to read the data that the memory returned? Whenever I try to read from the debugger I get read issues.

- I notice from UG1087 that there are two AC bit delay line registers which one can set to delay the clock slightly (ACBDLR0 and ACBDLR16). Is there any documentation on these registers (i.e. how much delay is given by each bit)? It seems that I could use these registers to skew my clock and overcome the setup time issue without using the 2T timing parameter.

dylan
Xilinx Employee
2,057 Views
Registered: ‎07-30-2007

Good work! That all seems plausible.

It sounds like there is something wrong with your ECC byte somehow.

Starting in 2018.2, we will have fixed the issue of ECC errors causing the memory test to stall in the SError handler. From an architecture perspective, the correct value is returned; it is just that the SError diverts your execution.

Yes, it is possible to phase shift the address interfaces. See this Answer Record:

https://www.xilinx.com/support/answers/70867.html

Let me know if phase shifting helps; I've not had to use it yet.

josh_tyler
Contributor
2,041 Views
Registered: ‎04-10-2018

Yes, I agree that something seems odd with the ECC byte; I'll continue investigating. It just seems strange, since it is identical to the other bytes.

Thanks for that link to the Answer Record. Unfortunately, I think using that register would do the opposite of what I need! In my case the clock line is the same length as the address lines at each IC, so I need to delay the clock with respect to the address lines, but the registers in AR70867 delay the address lines with respect to the clock.

I have tried using ACBDLR0 and ACBDLR16 to achieve this. Empirical measurements showed that the delay lines were ~3ps per tap, but this is very much pushing the limits of what my oscilloscope can measure! I haven't had any success with this so far, so it may be necessary to leave the 2T timing parameter in place for this revision of the PCB.
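
For scale, here is a rough sketch of how far the bit delay lines could move the clock at the ~3 ps per tap measured above. The 6-bit field width (taps 0 to 63) is an assumption that should be checked against the ACBDLR0 description in UG1087, the target skews are hypothetical, and `taps_for_delay` is an illustrative helper rather than anything from the Xilinx tools.

```python
import math
from typing import Optional

PS_PER_TAP = 3.0   # approximate value measured in this thread
MAX_TAP = 63       # ASSUMPTION: 6-bit delay field; verify against UG1087


def taps_for_delay(target_ps: float,
                   ps_per_tap: float = PS_PER_TAP,
                   max_tap: int = MAX_TAP) -> Optional[int]:
    """Smallest tap count giving at least target_ps, or None if out of range."""
    taps = math.ceil(target_ps / ps_per_tap)
    return taps if taps <= max_tap else None


if __name__ == "__main__":
    print(f"maximum delay is about {MAX_TAP * PS_PER_TAP:.0f} ps")
    for target in (50, 100, 300):  # hypothetical target skews in ps
        print(f"{target} ps -> taps: {taps_for_delay(target)}")
```

Under these assumptions the delay line tops out at a fraction of the clock period, so a skew larger than that would not be reachable with this register field alone.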

wastie
Adventurer
1,032 Views
Registered: ‎02-12-2008

Can I set the 2T timing parameter in 2018.3? How would I do this?

josh_tyler
Contributor
1,024 Views
Registered: ‎04-10-2018

Hi @wastie,

Here's how you apply the 2T timing parameter:

  1. Open your processor block design
  2. Click the Zynq UltraScale+ MPSoC block
  3. Under the 'Block Properties' subwindow on the left, change to the 'Properties' tab
  4. Expand the 'CONFIG' subtree
  5. Click the magnifying glass at the top of the 'Block Properties' subwindow, and type '2T' into the box
  6. The only result should be 'PSU__DDRC__ENABLE_2T_TIMING'
  7. Change the value of this property from '0' to '1'

I hope this helps you find your problem!