06-20-2012 03:53 AM
I have had a rather interesting week trying to track down timing/clock-domain-crossing issues that suddenly emerged in my logic after a few slight changes.
I'm using an SP605 Development Board with Xilinx ISE 14.1.
My project consists of several external asynchronous signals, which I'm using the FPGA to convert into block data. The block data is then sent over a PCIe interface (based loosely on xapp1052).
There are two clock domains used in the project: a 48MHz clock for my logic and the 62.5MHz PCIe clock. The two clock domains are derived from different sources and bear no relation to one another. I have been very cautious with cross clock domain signals and have used synchronisers and FIFOs where required. My clock domains are constrained properly, and I have used FROM/TO TIG constraints to remove the unconstrained paths (CCD) from the timing report. I have set INPUT_JITTER and SYSTEM_JITTER constraints to values far higher than the expected jitter.
The logic is failing in very peculiar ways, which simulation and my mental checking rule out as "impossible". State machines don't seem to be following logical paths, and signals are being set when it should be "impossible". Obviously, my HDL could be wrong, but the unpredictable nature gave me a gut feeling that it was a timing or cross clock domain problem (which I have seen a few times before).
I spent 2 days adding ChipScope ILA cores to observe the problem. This sometimes caused the problem to evaporate completely, usually when I was observing the culprit signals. There have never been any timing-related errors, and I have triple-checked the CCD signals, as well as verified my methods with some reading.
It finally occurred to me that the logic only failed when I had a ChipScope ILA inserted that used a clock in a different domain to the observed signals. An example of this would be using the 48MHz trigger clock to observe the state of the initiator reset signal of the PCIe block. I'm about 85% sure that this is the case, and my notes agree, but due to the irregularity of the errors I can't be 100% sure this is the cause.
Is there any reason that clocking the ILA from a different clock domain than the signals it observes would cause spurious errors in the logic? It seems counterintuitive that an ILA core, there only to sense a signal's state, could affect the signal.
I always check the timing report, including unconstrained paths, none of which are failing. I'm really clutching at straws to understand how and why the logic is acting the way it is.
Before I get told off: I'm aware it's not overly useful to have a 48MHz clock observing a 62.5MHz signal, but it gives me an idea of what was happening on the PCIe side of things in relation to the 48MHz side.
I'm doing my best to learn and understand, so if you have any advice or ideas I would be very grateful to hear them.
06-20-2012 06:17 AM
My best guess is that somewhere you have missed a clock domain crossing issue in your design,
and it comes and goes with changes in the placement. Adding a ChipScope to the design definitely
changes placement, and perhaps when you are bringing signals from two different clock domains
into ChipScope it more severely affects the placement. ChipScope itself is usually placed as a
compact block of one or more BRAMs in the same column.
The most likely culprit is the use of an asynchronous signal in an "if" condition. In a one-hot
encoded FSM, this can cause both branches of the "if" to be taken (FSM goes two-hot) or
neither branch (FSM goes zero-hot). Using a "safe" FSM implementation can cause the FSM
to recover from the condition, but doesn't prevent it. Often having a particular set of relative
path delays can cause the problem to be masked - hence the symptoms change with each
placement, regardless of what logic has changed in the design.
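The two-hot/zero-hot failure can be illustrated with a small behavioural model. This is only a sketch in Python, not HDL and not code from the design in question: it assumes a hypothetical two-state one-hot FSM (IDLE, RUN) whose RTL says "in IDLE, if a then go to RUN, else stay in IDLE". After synthesis each state flip-flop gets its own copy of that condition, so an unsynchronised input that changes near the clock edge can be seen as different values by the two flops. The names `next_state`, `a_at_idle_ff` and `a_at_run_ff` are made up for the illustration.

```python
def next_state(idle, run, a_at_idle_ff, a_at_run_ff):
    """One clock edge of a hypothetical two-state one-hot FSM.

    a_at_idle_ff / a_at_run_ff are the values of the same asynchronous
    input 'a' as captured by each state flop. With a synchronised input
    they are always equal; with an unsynchronised one they may differ
    because the routing delay to each flop is different.
    """
    next_idle = idle and not a_at_idle_ff      # "else" branch: stay in IDLE
    next_run = run or (idle and a_at_run_ff)   # "if" branch: go to RUN
    return next_idle, next_run


def classify(state):
    """Label a state vector by how many one-hot bits are set."""
    hot = sum(state)
    return {0: "zero-hot", 1: "one-hot"}.get(hot, "two-hot")


results = {}
for a_idle in (False, True):
    for a_run in (False, True):
        # Start in IDLE and apply one clock edge.
        state = next_state(True, False, a_idle, a_run)
        results[(a_idle, a_run)] = classify(state)
        print(f"idle flop sees a={a_idle!s:5}  run flop sees a={a_run!s:5}"
              f"  -> {classify(state)}")
```

The two cases where the flops agree leave the FSM one-hot. The two cases where they disagree give the two-hot and zero-hot states, and whether those cases can occur on a given build depends on the relative path delays, which is why the symptoms move with placement.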
Sometimes a visual aid can help in finding these problems (if you can't afford a CDC analyzer).
For example, if all signals on the PCIe clock domain are named p_something, and all signals
on the 48-MHz clock are named l_something, it can become very obvious when you use a
signal inappropriately in the opposite clock domain.
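Once the prefixes are in place, the same check can be roughed out with a script. The following is a minimal sketch, not a substitute for a CDC analyzer: it assumes simple one-line `target <= expression;` assignments, the p_/l_ prefixes described above, and a hypothetical `_meta` suffix marking the first flop of a synchroniser (the one place a crossing is intended). The function name `find_crossings` and the sample signal names are invented for the example.

```python
import re

# Assumed convention: p_* is clocked by the PCIe clock, l_* by the 48 MHz clock.
ASSIGN = re.compile(r"^\s*([pl])_(\w+)\s*<=\s*(.+?);")
SIGNAL = re.compile(r"\b([pl])_\w+")


def find_crossings(hdl_text):
    """Return (line number, target, source) for each assignment that
    reads a signal whose prefix differs from the target's prefix."""
    hits = []
    for lineno, line in enumerate(hdl_text.splitlines(), start=1):
        m = ASSIGN.match(line)
        if not m:
            continue
        domain, name, expr = m.groups()
        # Hypothetical exemption: first stage of a synchroniser.
        if name.endswith("_meta"):
            continue
        for sig in SIGNAL.finditer(expr):
            if sig.group(1) != domain:
                hits.append((lineno, f"{domain}_{name}", sig.group(0)))
    return hits


sample = """\
l_start_meta <= p_start;
l_start_sync <= l_start_meta;
l_busy <= l_start_sync and not l_done;
l_state_run <= l_state_idle and p_reset_done;
"""

crossings = find_crossings(sample)
for lineno, target, source in crossings:
    print(f"line {lineno}: {target} uses {source} from the other domain")
```

On the sample text only the last line is reported, because `p_reset_done` is used directly in 48-MHz logic without passing through a synchroniser. A regex this simple will miss multi-line statements and can be confused by `<=` used as a comparison, so treat any result as a prompt to read the code rather than as proof.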
If you find a placement where the design fails consistently, there are methods of using the
FPGA editor to route a particular signal out to an unused pin without changing the design
much. However, I think that another pass through the source code would be a better starting point.
06-21-2012 04:12 AM
Thank you for the excellent explanation. I was under the (false) impression that CDC issues would always eventually reveal themselves regardless of paths, especially when the two clocks are unrelated.
Since I don't have a CDC analyser, I took your advice and did a full check for CDC signals, which didn't turn up any CDC problems. It did show up a few questionable bits of code, written about 3 years ago, that made me cringe. It turns out the synthesiser wasn't understanding my garbage code and the resulting RTL was garbage. After rewriting my poorly written modules, my issues have evaporated.
I think there might still be a hidden CDC problem somewhere, which the newly written logic is dealing with better. I'm making sure that can't be the case by using naming conventions as you have suggested (as I have done for a year or so now).
Thanks again for your help, Gabor; it is greatly appreciated.