The analysis in this blog entry is based on a real customer issue in which rare bit flips were seen in the field. It shows some of the debug techniques we used to narrow down the root cause and fix the issue.
In the end, the issue was caused by incorrect handling of clock domain crossings (CDC), which was highlighted by the report_methodology and report_cdc reports.
This is part four of the Using the Methodology Report series. For all entries in the series, see here.
Our customer had tens of thousands of Zynq-7000 series-based products deployed in the field, all developed using Vivado 2013.4. Their end customer reported corrupted data packets from a handful of cards. Investigation indicated that in every case, a bit flipped at exactly the same location in the design.
Root cause analysis:
To narrow down the search, we first asked for the location of these registers in the netlist.
We requested the DCPs so that we could review the design using various reports.
Although a power supply issue normally causes more random failures, in parallel we requested VCCINT/VCCAUX/VCCIO measurements taken during operation to check levels and noise, as outlined in the “Hardware debug best practices” in (Xilinx Answer 62181).
We also requested the schematic to review whether sufficient decoupling capacitors were used.
We were quickly able to rule out a power supply issue as the root cause.
Once we received the DCPs, our first action was to run report_timing_summary, report_methodology, report_drc and report_cdc using the latest version of Vivado.
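As a rough sketch, the four reports can be generated from a checkpoint in a Vivado Tcl session as follows. The checkpoint name design_routed.dcp and the report file names are placeholders, not the customer's actual files.

```tcl
# Hypothetical checkpoint name; substitute the DCP under review.
open_checkpoint design_routed.dcp

# Write each report to a file so the findings can be reviewed side by side.
report_timing_summary -file timing_summary.rpt
report_methodology    -file methodology.rpt
report_drc            -file drc.rpt
report_cdc -details   -file cdc.rpt
```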
Several issues were immediately identified.
The most important finding related to the suspected FFs was flagged by the report_methodology LUTAR-1 check: LUT drives async reset alert
The FFs have an asynchronous reset that is driven by a path two logic levels deep:
This is dangerous as the LUTs (red arrows) can glitch and trigger an unexpected reset.
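To look at this finding in isolation, the methodology report can be restricted to a single check. This is a minimal sketch that assumes the design is already open in Vivado; the output file name is a placeholder.

```tcl
# Run only the LUTAR-1 check (LUT drives async reset) and save the result.
report_methodology -checks [get_methodology_checks LUTAR-1] -file lutar1.rpt
```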
The second most serious finding related to clock domain crossings and how they were constrained.
report_cdc found about 40,000 paths with non-recommended CDC architecture:
Unsafe clock domain crossing can cause issues downstream or upstream of the flipping FFs and might be the real cause of the behavior observed.
In terms of constraints, there were several serious violations reported by the report_methodology TIMING-24 check: Overridden Max delay datapath only.
When we removed the set_clock_groups -asynchronous constraint and replaced it with set_max_delay -datapath_only, using the minimum clock period of each clock pair as the delay value, we saw very severe timing violations of -5.8 ns due to 11 logic levels between asynchronous clocks.
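The constraint change described here might look like the following in XDC. The clock names clk_a and clk_b and the 5.000 ns value are illustrative assumptions only; for each clock pair the value should be the smaller of the two clock periods.

```tcl
# Before: the clock pair is cut completely, so no path between the clocks is
# timed, and any set_max_delay -datapath_only on these paths is overridden
# (this is what TIMING-24 reports).
# set_clock_groups -asynchronous -group [get_clocks clk_a] -group [get_clocks clk_b]

# After: bound the datapath delay in both directions instead.
# clk_a, clk_b and 5.000 are hypothetical; use the real clocks and periods.
set_max_delay -datapath_only -from [get_clocks clk_a] -to [get_clocks clk_b] 5.000
set_max_delay -datapath_only -from [get_clocks clk_b] -to [get_clocks clk_a] 5.000
```

With constraints of this form, the crossing paths are timed again, which is what exposes long combinational paths between asynchronous clocks.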
A second round of reviews revealed false path constraints on virtually all resets in the design, which had been added to help close timing. From experience we know this is very dangerous: if the bits of a state machine come out of reset at different times, the state machine can enter an illegal state, fail to recover, and make the design operate incorrectly.
Even if resets are asynchronous, the reset deassertion needs to be timed, so you can never ignore timing on resets. In cases where you could, you would need to ask whether you actually need a reset at all: not having one would save valuable routing resources and make the SR pin available for control set remapping, making the design smaller because logic functions could be partially mapped to these SR pins.
After the reported issues were fixed (LUT-driven asynchronous reset, CDC, CDC constraints) and new firmware was deployed to the field, these rare bit flips were no longer observed.
Advances in Vivado reporting (methodology, CDC) enabled us to successfully debug and resolve a rare bit flip issue.
When in doubt, never hesitate to review a design again with the latest version of Vivado. It will include CDC analysis and the most up-to-date set of methodology checks which might not have been available at the time of the original design.