The analysis in this blog entry is based on a real customer issue where they were seeing DDR4 Calibration errors in hardware. The failures were inconsistent from one board to another and from build to build. This blog will show some of the debug techniques we used to narrow down the root cause and fix the issue.
In the end, the issue was caused by set_false_path constraints in the user XDC overriding the MIG IP constraints. This highlights the danger of using set_false_path incorrectly.
This is part three of the Using the Methodology Report series. For all entries in the series, see here.
The user had a design that used the Vivado and SDx flows. The design included two 64-bit DDR4 interfaces running at 2000 Mbps. The design closed timing, but calibration failures were observed on one and sometimes both of the DDR4 interfaces.
The hardware failures were build dependent:
The passing build worked on multiple boards
The failing build failed on multiple boards
Either or both interfaces failed most of the time
The failing bits varied from build to build
The failure signature points to a timing constraint or clock domain crossing (CDC) issue, so we use the following steps to debug.
1) Add an ILA and re-implement the design.
The failure now goes away or moves to different bits.
2) Use the incremental implementation flow to preserve the failure signature.
3) Add pipeline stages to the ILA to ease timing closure.
The goal of this test is to find an expected pattern during the failing stage to narrow down the failing bits.
4) Try a Pblock to keep the placement of the MIG IPs very close together.
In this case it does not change the failure signature.
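Steps 2 and 4 above can be sketched in Vivado Tcl for a non-project flow. The checkpoint paths, the instance name u_ddr4_0, and the clock-region range are hypothetical placeholders, and the two experiments are meant to be run in separate Vivado sessions rather than as one script.

```tcl
# Experiment A: incremental implementation.
# Reuse placement and routing from the failing build so that the added
# ILA disturbs the failure signature as little as possible.
# (Both checkpoint paths are hypothetical.)
open_checkpoint ./runs/with_ila_post_opt.dcp
read_checkpoint -incremental ./runs/failing_build_routed.dcp
place_design
route_design
report_incremental_reuse -file incremental_reuse.rpt

# Experiment B (separate session): keep one DDR4 interface inside a Pblock.
# u_ddr4_0 and the clock-region range are hypothetical.
open_checkpoint ./runs/failing_build_post_opt.dcp
create_pblock pblock_ddr4_0
add_cells_to_pblock [get_pblocks pblock_ddr4_0] [get_cells u_ddr4_0]
resize_pblock [get_pblocks pblock_ddr4_0] -add {CLOCKREGION_X0Y0:CLOCKREGION_X1Y1}
place_design
route_design
```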
The interface that passes timing is failing in hardware
The interface that fails timing is passing in hardware
Based on the above it looks like the issue might be that some MIG constraints are being overridden either by the user or the Vivado flow.
The next step is to review the user’s XDC constraints.
When we do, we notice that the user has set many false path constraints between clocks.
We now run the recommended suite of reports. The key ones are report_methodology and report_cdc.
Report MIG set_max_delay (to find out whether these constraints were being ignored)
Root cause analysis:
MIG set_max_delay paths were not being ignored.
Max delay is reported by report_timing
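One way to confirm that a max delay constraint is still in effect is to report the paths between the two clock domains involved and look at the exception and requirement shown for each path. The sketch below assumes hypothetical clock names (clk_user and clk_ddr_ui); the real names would come from report_clocks on the open design.

```tcl
# Hypothetical clock names; replace with the clocks on either side
# of the MIG crossing being checked.
set src_clk [get_clocks clk_user]
set dst_clk [get_clocks clk_ddr_ui]

# Report the worst setup paths between the two domains.
# In the resulting report, check the requirement and any timing
# exception listed in each path header.
report_timing -from $src_clk -to $dst_clk -delay_type max -max_paths 10 -nworst 1 -file mig_crossing_paths.rpt
```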
We find the below critical CDC warnings on some of the MIG paths (Fabric to PHY).
Now compare these paths to examples in the MIG example design (created using the IP Integrator flow) which are safely timed.
Based on these findings, we remove all false path constraints added by the user and report timing again without re-implementing the full design.
The report shows timing failures of more than 3 ns in the worst case for both DDR4_rx/tx.
We can leverage the scoping mechanism of the report timing summary to focus analysis on the MIG interface only.
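As a minimal sketch of the scoping mentioned above, the -cells option of report_timing_summary restricts the analysis to a given hierarchy. The instance name u_ddr4_0 is a hypothetical stand-in for the MIG instance in the design.

```tcl
# u_ddr4_0 is a hypothetical hierarchical instance name for one MIG core.
set mig_cell [get_cells u_ddr4_0]

# Limit the timing summary to paths that belong to the MIG hierarchy.
report_timing_summary -cells $mig_cell -max_paths 10 -file ddr4_0_timing_summary.rpt
```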
We now find that false path constraints added by the user were causing fabric to PHY paths to be ignored.
Remove the above false path from the target XDC and re-implement the design.
The design is timing clean again.
The CDC report now shows that the previously ignored paths are safely timed.
When we test the bit file on hardware, both DDR4 interfaces pass calibration consistently.
Be very careful with the set_false_path constraint!
This constraint can very easily lead to paths that are required to be timed being ignored. Be careful when using wildcards in such constraints, and do not set false paths between entire clock domains unless you are very sure that no paths between them need to be timed. The consequences can be hardware failures followed by a difficult and lengthy debug process.
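The XDC fragment below illustrates the difference between a clock-to-clock false path and a narrowly targeted exception. In Vivado's exception precedence, set_false_path takes priority over set_max_delay, which is why a broad user false path can silently cancel an IP's own max delay constraints. All clock and cell names here are hypothetical, and the 5.000 ns value is only a placeholder.

```tcl
# Risky (shown commented out): this cuts every path between the two
# domains, including IP-internal crossings that depend on their own
# set_max_delay constraints. Clock names are hypothetical.
# set_false_path -from [get_clocks clk_user] -to [get_clocks clk_ddr_ui]

# Narrower: name only the user-owned synchronizer endpoints.
set_false_path -from [get_cells {user_logic/status_reg}] -to [get_cells {user_logic/status_sync_reg[0]}]

# Alternative: keep the crossing bounded instead of ignoring it.
set_max_delay -datapath_only -from [get_cells {user_logic/gray_cnt_reg[*]}] -to [get_cells {user_logic/gray_sync_reg[*]}] 5.000
```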
When you are getting hardware failures on a timing clean design, there are a number of checks that can be run in Vivado. These should always be run, especially after Place and Route. Being timing clean is not enough on its own; you still need to complete these checks.
1) Report clock interaction:
This gives information about all clocks in the design.
2) Report Methodology
If Report Clock Interaction shows unsafe or user-ignored paths, run Report Methodology and focus on the critical warnings.
3) Report CDC
In this example Report CDC helped to identify some critical paths ignored due to user constraints.
Comparing these results with the MIG example design helps to find suspected paths out of the millions of paths present in a design.
Use the scoping mechanism to narrow down analysis to the selected module.
4) Report Exceptions:
This gives information about paths ignored due to timing exceptions (if any) such as set_false_path or set_clock_groups.
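The four checks above might be run from the Tcl console on an open implemented design roughly as follows. The output file names are arbitrary, and the options shown are one reasonable selection rather than a required set.

```tcl
# Run with a routed design open (for example after open_checkpoint).

# 1) Clock pair overview, including unsafe and user-ignored pairs.
report_clock_interaction -delay_type min_max -file clock_interaction.rpt

# 2) Methodology checks; review the critical warnings first.
report_methodology -file methodology.rpt

# 3) Detailed CDC report listing each crossing.
report_cdc -details -file cdc.rpt

# 4) All timing exceptions, then only those that are fully overridden.
report_exceptions -file exceptions.rpt
report_exceptions -ignored -file exceptions_ignored.rpt
```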
For very large designs, parsing through millions of paths would be very challenging and time consuming.
For faster turn-around, narrow down the scope of reporting using the below commands:
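A minimal sketch of such scoped reporting is shown here. The instance name u_ddr4_0 and the clock names are hypothetical, and these are illustrative commands, not necessarily the ones used in the original debug session.

```tcl
# Hypothetical MIG instance and clock names.
set mig_cell [get_cells u_ddr4_0]

# Restrict the CDC and methodology reports to the MIG hierarchy.
report_cdc -details -cells $mig_cell -file ddr4_0_cdc.rpt
report_methodology -cells $mig_cell -file ddr4_0_methodology.rpt

# Or restrict the CDC report to a single clock pair.
report_cdc -details -from [get_clocks clk_user] -to [get_clocks clk_ddr_ui] -file user_to_ddr_cdc.rpt
```

Running the same scoped reports on the MIG example design gives a known-good baseline to compare against.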