04-24-2018 04:22 PM - edited 04-25-2018 12:30 PM
We are currently bringing up a new ZU19EG (XCZU19EG-2FFVC1760I) based board using the 2017.2 tools. There are 5x MT40A512M16HA-083E:A devices comprising the x72 DDR4 bank on the PS side. To simplify debug, ECC is disabled and the speed is set to 1600 MT/s. While running the FSBL via JTAG we discovered that the SDRAM write/read eye training was failing (and since it was checked in a while loop, it never exited).
The psu_init.c code was updated to break out of the training loops for debugging. After stepping completely through psu_ddr_phybringup_data, the resulting PGSR0 value is 0x84c844ff, which indicates a read eye training error, a write level adjustment error, and a DQS gate training error. The layout complies with the UG583 routing rules, and the power distribution also looks good. Given the board fab I am unable to probe much beyond the termination resistors.
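For reference, the training status can be decoded mechanically. The sketch below is my attempt at a PGSR0 (0xFD080030) decoder, assuming the error-bit layout of the Synopsys DDR PHY PUB as documented in the ZynqMP register reference (UG1087); the bit positions should be double-checked there before relying on it.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed PGSR0 error-bit layout (Synopsys DDR PHY PUB, per UG1087);
 * verify each position against the register reference. */
static const struct { int bit; const char *name; } pgsr0_errs[] = {
    {19, "VERR   (VREF training error)"},
    {20, "ZCERR  (impedance calibration error)"},
    {21, "WLERR  (write leveling error)"},
    {22, "QSGERR (DQS gate training error)"},
    {23, "WLAERR (write leveling adjustment error)"},
    {24, "RDERR  (read bit deskew error)"},
    {25, "WDERR  (write bit deskew error)"},
    {26, "REERR  (read eye training error)"},
    {27, "WEERR  (write eye training error)"},
    {28, "CAERR  (CA training error)"},
};

/* Print every error bit set in a PGSR0 value; return a bitmask of
 * which table entries fired so callers can test the result. */
unsigned decode_pgsr0(uint32_t pgsr0)
{
    unsigned fired = 0;
    for (unsigned i = 0; i < sizeof pgsr0_errs / sizeof pgsr0_errs[0]; i++) {
        if (pgsr0 & (1u << pgsr0_errs[i].bit)) {
            printf("PGSR0.%s\n", pgsr0_errs[i].name);
            fired |= 1u << i;
        }
    }
    return fired;
}
```

With this layout, 0x84c844ff decodes to VERR, QSGERR, WLAERR, and REERR, which matches the errors listed above.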
Before I go digging into every controller register I wanted to ask if anyone else has experienced anything similar with the MPSoC. I also have a few questions about the psu_ddr_phybringup_data function.
There is a prog_reg call to address 0xFD080014U; I cannot find what this register is. The TRM describes a step-by-step training sequence, but in psu_ddr_phybringup_data all of the steps appear to be issued at once, including the INIT bit, which is recommended to be set with a separate register write. It appears this way across multiple ZU boards, so I assume it's standard. There is also an MR write command issued to 0xFD070014U; how does the data value map to the addr/bg/ba pins on the bus?
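On the MR write question, my understanding (to be verified against UG1087) is that the ZynqMP DDRC follows the Synopsys uMCTL2 convention: the data written to MRCTRL1 (0xFD070014) is driven onto the DRAM address pins A[17:0] during the MRS, and the mode register number in MRCTRL0 (0xFD070010) lands on the BG/BA pins. A hedged sketch of composing the two register values, with field positions that are my reading of the register reference:

```c
#include <stdint.h>

/* Assumed uMCTL2-style MRCTRL0 fields; confirm positions in UG1087. */
#define MRCTRL0_MR_WR        (1u << 31)               /* self-clearing "go" bit */
#define MRCTRL0_MR_ADDR(mr)  (((uint32_t)(mr) & 0xfu) << 12)  /* -> BG/BA pins  */
#define MRCTRL0_MR_RANK(r)   (((uint32_t)(r) & 0x3u) << 4)    /* target rank    */

/* MRCTRL0 value to write mode register `mr` on rank `rank`. */
uint32_t mrctrl0_word(unsigned mr, unsigned rank)
{
    return MRCTRL0_MR_WR | MRCTRL0_MR_ADDR(mr) | MRCTRL0_MR_RANK(rank);
}

/* MRCTRL1 value: the raw MR payload that appears on A[17:0]. */
uint32_t mrctrl1_word(uint32_t mr_data)
{
    return mr_data & 0x3ffffu;
}
```

So an MRS to, say, MR6 would write `mrctrl1_word(payload)` to 0xFD070014 and then `mrctrl0_word(6, 0)` to 0xFD070010; the controller serializes the pin-level command.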
My plan of action is to issue individual training commands to isolate the sequences for debug. Any advisement is greatly appreciated. Thank you.
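A sketch of what that plan could look like, assuming the PIR register at 0xFD080004 uses the Synopsys PUB bit layout; the step-bit positions are my reading of the register reference and must be verified in UG1087 before use:

```c
#include <stdint.h>

/* Assumed PIR (0xFD080004) step bits, Synopsys PUB layout; verify in UG1087. */
#define DDR_PHY_PIR    ((volatile uint32_t *)0xFD080004u)
#define DDR_PHY_PGSR0  ((volatile uint32_t *)0xFD080030u)
#define PIR_INIT       (1u << 0)   /* trigger bit                  */
#define PIR_WL         (1u << 9)   /* write leveling               */
#define PIR_QSGATE     (1u << 10)  /* DQS gate training            */
#define PIR_WLADJ      (1u << 11)  /* write leveling adjustment    */
#define PIR_RDDSKW     (1u << 12)  /* read bit deskew              */
#define PIR_WRDSKW     (1u << 13)  /* write bit deskew             */
#define PIR_RDEYE      (1u << 14)  /* read eye centering           */
#define PIR_WREYE      (1u << 15)  /* write eye centering          */
#define PGSR0_IDONE    (1u << 0)

/* Compose the PIR write for one step: the step bit plus the INIT trigger. */
uint32_t pir_word(uint32_t step)
{
    return step | PIR_INIT;
}

/* Run one training step with a bounded poll instead of an infinite
 * while loop; returns PGSR0 so the caller can check the per-step
 * error bit. Hardware-only, obviously. */
uint32_t run_one_step(uint32_t step)
{
    *DDR_PHY_PIR = pir_word(step);
    for (int i = 0; i < 1000000; i++)
        if (*DDR_PHY_PGSR0 & PGSR0_IDONE)
            break;
    return *DDR_PHY_PGSR0;
}
```

Issuing the steps one at a time (WL, then QSGATE, then WLADJ, and so on) and checking PGSR0 after each should isolate which sequence first fails.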
04-25-2018 04:19 AM
I also wanted to include the DPLL configuration and the DDR configuration setup information.
04-30-2018 09:09 AM
04-30-2018 12:41 PM
Our current implementation has the DM pins tied low through 39.2 Ω resistors at the devices, given the intent to use ECC on the PS side (https://forums.xilinx.com/t5/Memory-Interfaces/x72-DDR4-NO-DM-NO-DBI-Pins-still-connected/td-p/763918). The termination resistors are physical (not buried) components and could be removed. Is the recommendation to enable DM intended to drive that IO, or to change the controller configuration? They are 0201s in a densely populated area, so I want to verify before we rework.
I am also installing the 17.4 tools on a separate machine to avoid DLC pod driver conflicts. Will respond back once that is tested.
04-30-2018 01:26 PM
If I re-enable ECC what should the DM/DBI settings be at that point? Back to NO DM and NO DBI?
04-30-2018 01:29 PM
04-30-2018 02:19 PM
Alright, my plan is to run with "DM no DBI" in the 17.4 tools and see if that improves anything. It's looking like we will want to remove the DM pull-downs, but I am still unclear on the proper connections. If ECC is enabled the DM pins shouldn't be used, but should I still set the core configuration to "DM no DBI"? On the ZCU102 I see that "DM no DBI" is the default and that the DIMM is a x72 with ECC, so if it works there hopefully it works here.
04-30-2018 03:00 PM
05-01-2018 07:18 AM
Looking at the layout I see that the DM signals were not length matched with the rest of the DQ bus, since they were not expected to be used once ECC was enabled (per the discussion I posted earlier). How sensitive is the DM line to length matching given the expected operation? Some are within tolerance, but a few of the bytes' DM signals are ~250 mils off, outside the 107 mil derated value for 1600 MT/s operation.
When attempting to run with "no DM no DBI" I see that PSU_DDR_PHY_DTCR0_DTWBDDM is still set to 1. Is this intended? PSU_DDR_PHY_DX(x)GCR1_DMEN is also still set. What should those be set to in a "no DM no DBI" config? I also don't see an entry for DX1GCR1. Once the DM pull-down resistors are removed I am going to try the "no DM no DBI" setting again with the above registers updated.
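A minimal sketch of the cleanup I have in mind, with a placeholder address and bit position that would need to be confirmed in UG1087 before use:

```c
#include <stdint.h>

/* Hypothetical "no DM no DBI" cleanup: clear DTCR0.DTWBDDM and each
 * DX(n)GCR1.DMEN after psu_init has run. The address and bit position
 * below are PLACEHOLDERS; look up the real values in UG1087. */
#define DDR_PHY_DTCR0   0xFD080200u      /* assumed DTCR0 address    */
#define DTWBDDM_MASK    (1u << 14)       /* assumed bit position     */

/* Read-modify-write helper that clears only the given field. */
uint32_t clear_field(uint32_t reg_val, uint32_t mask)
{
    return reg_val & ~mask;
}

/* On hardware:
 *   volatile uint32_t *r = (volatile uint32_t *)DDR_PHY_DTCR0;
 *   *r = clear_field(*r, DTWBDDM_MASK);
 * and the same for each DX(n)GCR1.DMEN field, n = 0..8. */
```

The helper is trivial, but doing every change through one read-modify-write path makes it easy to log exactly which training-related fields were touched.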
05-01-2018 10:56 AM
With the DM pull-downs removed and "DM no DBI" set in the HDF I am still seeing 0x84c844ff in PGSR0. The errors indicate a read eye training error, a write leveling adjustment error, a DQS gate training error, and a VREF training error. If I inspect the DQS gate status registers (DX(#)RSR1, starting with DX0RSR1) I don't see any errors asserted. The DQS gating, latency, and delay registers all have similar values across bytes. Since PGSR0.QSGERR is set I would expect an error bit for at least one byte in DX(#)RSR1.
For the read eye training I do see DX0GSR2.REERR with an ESTAT value of 0 ("initial read data miscompare before centering"). The VREF ESTAT indicates "final check for DRAM VREF failed".
This may still indicate a write issue. What other actions can I take here?
05-01-2018 12:50 PM
By adding some DQS gate system latency (b'1) at 0xFD0807C0 [4:0], the QSGERR error bit is no longer asserted. The three remaining failures are VREF training, write level adjustment, and read eye training.
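For anyone following along, the field manipulation is just a 5-bit read-modify-write. The sketch below assumes 0xFD0807C0 [4:0] as stated above and that the other byte lanes have equivalent registers at a fixed stride (check UG1087 for their addresses):

```c
#include <stdint.h>

/* Replace the 5-bit DGSL field (bits [4:0]) of a gate timing
 * register value, leaving the rest of the register untouched. */
uint32_t set_dgsl(uint32_t reg_val, unsigned dgsl)
{
    return (reg_val & ~0x1fu) | (dgsl & 0x1fu);
}

/* On hardware:
 *   volatile uint32_t *gtr = (volatile uint32_t *)0xFD0807C0u;
 *   *gtr = set_dgsl(*gtr, 1);  // add one clock of gate system latency
 */
```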
05-01-2018 02:00 PM
For the write level adjustment I see that it is failing on byte 7. I've captured the write leveling registers below, but I cannot determine why it is happening or how to correct this particular byte. Its layout doesn't seem out of the ordinary compared to the others.
Layout (x16 parts):
Byte     Trace len    WLPRD  WLD  WDQD  WLSL  WDQSL  CAC trace len
Byte 0   2380 mils    75     74   39    0     1      CAC1 = 3200 mils
Byte 1   2083 mils    73     72   37    0     1
Byte 2   2169 mils    75     78   3D    0     1      CAC2 = 3621 mils
Byte 3   2880 mils    76     76   3B    0     1
Byte 4   2237 mils    75     74   39    0     1      CAC3 = 4130 mils
Byte 5   1753 mils    75     74   38    0     1
Byte 6   3043 mils    75     74   39    0     1      CAC4 = 4545 mils
Byte 7   2298 mils    73     DA   2A    3     2
Byte 8   3315 mils    (ECC, not used)                CAC5 = 4980 mils + term
Any ideas on how I can adjust/correct byte 7?
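One rough sanity check on these numbers, assuming WLD is the trained LCDL tap count and WLPRD is the measured taps per clock period (my reading of the PUB registers; verify in the databook): byte 7's trained delay lands near two clock periods where every other byte lands near one, on top of its larger WLSL/WDQSL system-latency values.

```c
/* Express a byte lane's trained write-leveling delay as a fraction of
 * a clock period, under the assumption that WLD counts LCDL taps and
 * WLPRD is the measured taps-per-period for that lane's delay line. */
double wl_fraction(unsigned wld, unsigned wlprd)
{
    return (double)wld / (double)wlprd;
}

/* Byte 7: 0xDA / 0x73 = ~1.90 periods.
 * Byte 0: 0x74 / 0x75 = ~0.99 periods. */
```

If that reading is right, byte 7's result is a gross outlier rather than a fine-tuning miss, which may point at the clock-to-DQS relationship for that lane rather than the deskew itself.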
05-01-2018 03:19 PM
You can try changing to a 32-bit width so that byte 7 is not used, and then run the read/write eye tests.
It looks like you are running fairly slow, which worries me. What does full speed look like?
Also, there were a few improvements to some registers in 2018.1, which may be worthwhile to test with.
The individual skew of the DM pins should not be too much of a factor, since per-bit write deskew is performed. Termination, however, does matter.
Generally, DDR4 issues end up being board issues at this point of the silicon/software. Power supply issues seem to be the most common.
05-04-2018 08:31 AM
Thanks for the feedback. Switching to 32-bit did not fix the issue. I am running slow at 1600 to minimize any SI effects that may be inherent in the design, given the initial issues. We are in the process of upgrading our licenses to support the 18.1 tools.
Unfortunately May is conference/WG meeting month, so testing will be limited over the coming weeks. I did want to share the attached images in the hope that someone might have an "ah-ha" moment; I certainly haven't. I captured the attached signals on byte 1, which is suspected to be the worst byte lane: it is the first chip on the CAC fly-by, its DQ bus uses through vias (tiny stubs), its stitching vias (circled) are sparser than the other data bytes', and it sits at the end of the VREF pour. The stitching vias for the other data bytes are blind vias matched to the signal layer; byte 1 is the only byte going completely through the board. Even so, this byte fails training the same as the others.
The captured signals are with the psu_init default settings: ODT set to 40 Ω on the DRAMs, the Vref level set to 76% of VDDQ (~0.91 V), and ZCAL completed on the PHY. The DQ11 read transaction Vil looks high, but this may be a function of the measurement setup (which we are also refining). The same goes for the DQ11 write: the Vih looks low. The DQS looks accurate in both cases given the Vref settings. Perhaps I can just twist some Vref/ODT knobs to get this working (wishful thinking)?
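For reference, the Vref arithmetic can be checked against the DDR4 spec: MR6 Range 1 encodes VrefDQ as 60.0% + 0.65% per code step, so a 76% target rounds to code 25, i.e. 76.25% of VDDQ, which is ~0.915 V at VDDQ = 1.2 V and consistent with the ~0.91 V above. A small sketch of that spec arithmetic (not a register recipe; the code still has to land in MR6 and be trained per byte):

```c
/* JEDEC DDR4 MR6 VrefDQ, Range 1: Vref = 60.0% + 0.65% * code,
 * for codes 0..50 (60.0% up to 92.5% of VDDQ). */
unsigned vref_pct_to_mr6_code(double pct)
{
    return (unsigned)((pct - 60.0) / 0.65 + 0.5);  /* round to nearest */
}

double mr6_code_to_vref_pct(unsigned code)
{
    return 60.0 + 0.65 * (double)code;
}
```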
I know this is not nearly the complete picture and I don't expect Xilinx to help me debug my board but any additional direction on things to check is much appreciated. I haven't found a smoking gun to correct for in layout yet and as seen in the picture updates will be very impactful for the entire board given the density.
We are in the process of updating our IBIS simulations as a recent tool update (AD18) broke our original simulation. We are now implementing the design in ADS. I am also working to measure the PDN and ensure everything remains in spec. Everything was over designed and our initial measurements showed VDDQ, VTT, VREF, and VPP all within spec during training but we are going to measure again.
Any additional advice is greatly appreciated!!! Thank you.
05-31-2018 12:55 PM - edited 05-31-2018 03:36 PM
We are back on this debug and hoping you can steer us in the right direction. In summary, we get through PHY init and through write leveling, but we are failing read leveling (at 1600 MT/s) on all bytes and cannot determine why. We have probed all of byte 0 and a subset of the images is attached. You can see that bit 6 and the clock both look good (the rest of the bits look just as good). We have also checked our PDN with a very fast scope probe and nothing goes out of regulation during training.
What I can't make sense of is why the read leveling is failing. As far as we can tell the data and CAC buses look good. We also examined the delay between the controller ODT turning on and the DRAM driving the MPR data. For an MPR read command Micron specifies a command-to-DQS turnaround of PL(0) + AL(0) + CL(11). The image shows the read command being issued (CAS_n) and the 13.75 ns of CL at the 1.25 ns clock period. You will notice in the long DQ6 capture that the ODT is on for some time before the DRAM starts sending data; not knowing exactly how the read leveling algorithm uses the MPR, this may be expected.
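The latency arithmetic above, spelled out as a sketch (PL, AL, CL, and tCK as defined in the DDR4 spec; the values are from the Micron datasheet for this speed bin):

```c
/* Command-to-DQS time for a read: (PL + AL + CL) clocks after the
 * read command. At 1600 MT/s the clock period tCK is 1.25 ns, so
 * PL = 0, AL = 0, CL = 11 gives 13.75 ns. */
double read_latency_ns(unsigned pl, unsigned al, unsigned cl, double tck_ns)
{
    return (double)(pl + al + cl) * tck_ns;
}
```

Anything on the scope materially beyond 13.75 ns from CAS_n to the DQS preamble would then be unexplained flight time or gate-placement error rather than configured latency.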
All of our hardware measurements have looked good. Is there perhaps a configuration issue we are overlooking?
Thanks in advance for any additional comments.
06-05-2018 12:14 PM
Just wanted to provide an update. If we set a DGSL value (based on layout) prior to running read leveling, we see no errors in QSGERR and QSGDONE is asserted, but the resulting DGSL value is reduced by 1. We've tested with different delay values and the resulting DGSL is always reduced by 1 once training completes. Assuming that the QSGDONE assertion is valid, we continue with training and pass read deskew but fail write deskew. We are investigating further, but find it odd that the DQS gate sequence cannot complete on its own without setting DGSL first. The results are very repeatable, so we are inclined to trust what is reported in PGSR0. Will update on what we find regarding write deskew.
06-06-2018 02:04 PM
One more update: the P/N polarity from the controller to the DRAM is wrong. The _p pins are going to the _c pins and vice versa. I don't suppose there is any polarity flexibility on the PS-side DRAM controller?
06-07-2018 06:38 AM