05-02-2017 09:00 AM
Recently our one-year-old design started to misbehave.
We are working with a custom board populated with an Artix-7 device. The design includes the licensed TEMAC IP.
We have two Ethernet ports, and we basically forward and filter certain IP packets between them.
Recently, the generated bitstreams started to produce corrupted results such as packet corruption. The same bitstream can produce different results on different boards of the same hardware revision. Also, a small change in an unrelated part of the design can make the Ethernet functions break or recover from one synthesis run to the next.
I have searched around and tried to implement correct CDCs, timing constraints, and synchronous resets, but to no avail.
One common behaviour is Ethernet packet corruption, and it is so specific that it may give a clue about my problem. At times, only the MSB of the Ethernet packet data gets corrupted, arriving with the opposite bit value. When I make a small change in an unrelated part of the design, this can get corrected. Also, most of the time, trying to analyze the problem with ChipScope removes the problem from the design, so I have not been able to pinpoint the cause.
I guess I am getting lucky or unlucky placements at times, which would mean that the design is under-constrained, over-constrained, or badly constrained.
I have used set_max_delay -datapath_only constraints for the async clocks. I suspected the problem was in the FIFO resets, so I have redesigned them to be synchronous.
I would be glad if someone with experience could point me in the right direction so that I can investigate further and solve the issue we are having.
05-02-2017 10:30 AM
What you describe is often the result of timing not being met, true. But I take it you have already gone down that path to no avail. The next largest cause of designs acting poorly is signal integrity. Signal integrity includes excessive jitter, which can cause timing to be missed. For example, try setting the system_jitter value to 300 ps (default is 100 ps). If that 'fixes' the issues, I suggest actually measuring the system jitter, entering the measured value in the tools, and re-running the design so that you know you are OK.
05-17-2017 10:44 PM
I have been looking into this for the last two weeks. I actually measured the system jitter (which is 400 ps) and put it into the design but that doesn't seem to solve my issues.
We have been working with the hardware engineers, inspecting power rails for levels and glitches.
I've redesigned the TEMAC part of the design, but the issue persists.
When the random failure happens, the same configuration bitstream produces Ethernet packet output like this, depending on the board:
Board #1 : Correct Data Output : 01 02 03 04 05 06 ...
Board #2 : Corrupted Data Output : 01 12 03 14 05 16 ...
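Comparing the two dumps byte by byte makes the pattern explicit. The following is a minimal Python sketch that uses only the six bytes shown above; the variable names are illustrative and not part of the design.

```python
# Byte streams copied from the two board outputs above (first six bytes only).
good = bytes.fromhex("01 02 03 04 05 06")
bad = bytes.fromhex("01 12 03 14 05 16")

# XOR exposes exactly which bits differ at each byte position.
diff = [g ^ b for g, b in zip(good, bad)]

for i, (g, b, d) in enumerate(zip(good, bad, diff)):
    flipped = [bit for bit in range(8) if d >> bit & 1]
    print(f"byte {i}: good={g:02x} bad={b:02x} xor={d:02x} flipped bits={flipped}")
```

In this sample the XOR mask is 0x10 on every second byte and 0x00 on the others, so a single bit (bit 4) is inverted on alternating bytes while everything else is intact.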
05-18-2017 07:08 AM
If jitter were causing the timing failures, I would expect setting the system jitter to 400 ps to improve things compared with the default (the error should happen less frequently). If it does, put in a larger value (600 ps).
If it is no better, and temperature doesn't make it any better or worse, then it might be a clock domain crossing issue. Look for data paths between clock domains in your design.
05-19-2017 12:07 PM
@mcetinsoy how are you dealing with the async clocks? Do you have properly designed async FIFOs? If you are transferring multi-bit buses between clock domains, set_max_delay -datapath_only is not going to be enough.
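To illustrate why a delay constraint alone does not protect a multi-bit bus, here is a rough Python sketch. It is a simplified model, not a timing simulation: it assumes each bit that changes can independently be captured in its old or new state when the destination clock samples during the transition. The function names are hypothetical, and the Gray-coded case is shown only as the usual comparison for counter-like buses (such as FIFO pointers), not as a description of this design.

```python
from itertools import product

def possible_samples(old, new, width):
    """All values a receiver could capture if each changing bit
    independently shows either its old or its new state."""
    changing = [b for b in range(width) if (old ^ new) >> b & 1]
    results = set()
    for choice in product((0, 1), repeat=len(changing)):
        value = old
        for bit, take_new in zip(changing, choice):
            if take_new:
                value ^= 1 << bit  # this bit has already switched
        results.add(value)
    return results

def to_gray(n):
    """Standard binary-to-Gray conversion."""
    return n ^ (n >> 1)

# A binary counter stepping from 7 (0111) to 8 (1000): all four bits change.
binary_seen = possible_samples(7, 8, 4)
# The same step Gray-coded: only one bit changes.
gray_seen = possible_samples(to_gray(7), to_gray(8), 4)

print("binary:", sorted(binary_seen))
print("gray:  ", sorted(gray_seen))
```

Under this model the binary step can be captured as any of the 16 four-bit values, while the Gray-coded step can only be captured as the old or the new value. Arbitrary data buses cannot be Gray-coded, which is why they normally go through an async FIFO or a handshake rather than a bare register crossing.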