03-18-2021 11:08 PM
I have inherited an 8-byte-lane NAND controller design that instantiates the same CDC edge detector for each byte lane when reading DQ(S) data.
Each lane block samples its respective DQS signal with 300 MHz clock, samples four DQ bytes and presents to a shallow FIFO. A stretched "ready" flag passes through a typical edge detector circuit into the 100 MHz clock domain. The CDC edge detect output is the write enable to the FIFO.
When all byte lane FIFOs are not empty, data is read. The shallow FIFOs align all byte lanes.
The 300 MHz and 100 MHz clocks come from a PLL/MMCM. The 100 MHz clock also sources the NAND DQS (albeit phase delayed).
What happens with some builds is that no data is read from the FIFOs. Adding ChipScope/ILA, I see that all lanes have a stretched input to the CDC, but some of the lanes do not detect the edge. So some of the edge detectors are not working, and it's build dependent.
I see no reason for this circuit not to work. There are no ASYNC_REG placement attributes applied but I think CDC crossing will be handled by the tools as part of the 300-100 MHz related clock structure.
I have replaced the edge detectors with Xilinx xpm CDC_PULSE macros. Is this really needed? It's too early to tell if this is a definitive fix. I'm concerned because I can't see why some of the original CDCs should fail.
03-19-2021 05:36 AM
I have not understood all the details of your design. However, for a synchronizer, you must:
Keep in mind that the transit time of a signal through a synchronizer is uncertain. This is especially true if you have used set_false_path as the timing exception into the synchronizer. So, perhaps your design is suffering from the uncertainty?
Also, you show that the 100MHz and 300MHz clocks are related and that the stretched pulse originates in the 300MHz clock domain. You should not need a synchronizer to cross the stretched pulse into the 100MHz clock domain. The transit time of this direct crossing is still uncertain (i.e., it could take either 3, 2, or 1 cycles of the 300MHz clock), although there are ways to make the transit time of this direct crossing constant.
03-19-2021 06:52 AM
Thanks for the reply. Indeed, I don't "think" I need a true synchronizer. The circuit in the green is a standard rising edge detector. Yes the transit time will vary, and I have 8 of these in parallel each sampling DQS lane bits. But in chipscope I can see all 8 inputs to the edge detectors, but I don't see all the outputs. I expect to see all 8 outputs "eventually".
03-19-2021 07:31 AM
As firstname.lastname@example.org said, crossing between two clocks that come from the same MMCM (assuming they use the same kind of clock buffer) does not require an asynchronous clock domain crossing circuit. These clocks are synchronous; they come out of the MMCM with guaranteed low skew, if they use the same type of clock buffer (either both BUFGs or derivatives of BUFGs, or BUFHs if you are in 7 series), and (for UltraScale/UltraScale+/Versal) they have been placed in the same CLOCK_DELAY_GROUP, then the clock tree delay is balanced - therefore they arrive at the destinations with controlled skew. Furthermore the tools understand all of this and take the skews into account during static timing analysis. So you can cross synchronously between these two domains.
The only issue (again, as Mark said) is making sure that you handle the fact that the 100MHz clock has only one edge for every 3 of the 300MHz clock, so you need some mechanism for managing that. What you are doing would probably work, but it would be sufficient to just stretch your pulse to exactly three clock periods at 300MHz and sample that at 100MHz - you wouldn't even need the edge detector.
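As a rough illustration of the 3:1 point above, here is a minimal cycle-based Python model (not the poster's HDL). It assumes an ideal synchronous crossing in which the 100MHz rising edge coincides with every third 300MHz edge; the function names are made up for this sketch.

```python
def sample_at_100mhz(pulse_start, pulse_width, n_cycles=30, ratio=3):
    """Return what an ideal 100 MHz flop captures, one value per 100 MHz edge.

    Time is counted in 300 MHz cycles. Assumption: the 100 MHz rising edge
    lines up with every `ratio`-th 300 MHz edge (cycles 0, 3, 6, ...), and the
    stretched pulse is high for cycles [pulse_start, pulse_start + pulse_width).
    """
    src = [1 if pulse_start <= c < pulse_start + pulse_width else 0
           for c in range(n_cycles)]
    return [src[c] for c in range(0, n_cycles, ratio)]


def highs_seen(pulse_width):
    """Number of 100 MHz samples that are high, for each of the 3 start phases."""
    return {start % 3: sum(sample_at_100mhz(start, pulse_width))
            for start in (6, 7, 8)}


if __name__ == "__main__":
    for width in (2, 3, 6):
        print(width, highs_seen(width))
```

In this idealized model a 3-cycle pulse is sampled high exactly once whatever its start phase, a 6-cycle pulse exactly twice, and a 2-cycle pulse can be missed entirely.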
There is a possibility that your clock domain crossing circuit (CDCC) is messing this up. Normally a CDCC contains a timing exception on the actual path between the last flip-flop on the source domain and the first flip-flop on the destination domain; this constraint is either a set_false_path or a set_max_delay -datapath_only. Both of these constraints will disable the normal timing checks that are done between the 300MHz and 100MHz related clocks - turning this from a synchronous clock crossing to a mesochronous clock crossing. If the clock crossing is mesochronous or asynchronous, then the pulse stretching must be strictly larger than the ratio of the clock periods - it must be at least 4 clock periods - three is insufficient. If you set it to 3, then you could well see behavior like you describe - on some runs (depending on the length of the routing of this path) it might work and on others it wouldn't.
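To see why three periods can be insufficient once the path is no longer timed, here is a small Python sketch that sweeps the unknown arrival offset of the pulse relative to the 100MHz edge. The time units and the setup/hold numbers are hypothetical values chosen only for illustration, not device data.

```python
T_SRC = 1000       # one 300 MHz period, in arbitrary time units
T_DST = 3 * T_SRC  # one 100 MHz period
T_SU = 60          # hypothetical setup time of the capture flop
T_H = 40           # hypothetical hold time of the capture flop


def cleanly_captured(t0, width_cycles):
    """True if some 100 MHz edge has its whole setup/hold window inside the pulse.

    The pulse is high on [t0, t0 + width_cycles * T_SRC); destination edges
    fall at multiples of T_DST. t0 stands for the unknown (routing dependent)
    arrival time of the pulse at the capture flop.
    """
    t_end = t0 + width_cycles * T_SRC
    # First destination edge whose setup window starts at or after t0
    # (ceiling division on non-negative integers).
    first_edge = -(-(t0 + T_SU) // T_DST) * T_DST
    return first_edge + T_H <= t_end


def bad_offsets(width_cycles):
    """All arrival offsets within one 100 MHz period that are not cleanly captured."""
    return [t0 for t0 in range(T_DST) if not cleanly_captured(t0, width_cycles)]


if __name__ == "__main__":
    for width in (3, 4, 6):
        print(width, "cycles:", len(bad_offsets(width)), "bad offsets")
```

With these assumed numbers a 3-cycle pulse has a band of arrival offsets where no edge sees it cleanly, while 4 and 6 cycles have none. Since the offset is fixed by routing in a given implementation, landing in that band would look like a build-dependent failure.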
03-19-2021 07:56 AM
Thanks for the reply. I'm using Kintex UltraScale. I don't understand the last two paragraphs. Since the clocks are phase related, I think you could just have a pulse 3 clocks wide at 300 MHz and not have an edge detector at all. The fact that the pulse is 6 clocks wide at 300 MHz means the 100 MHz domain will see 0, 0, 1, 1, 0, 0. At some point, surely that is guaranteed.
The CDCC is just a few lines of VHDL, no attributes or constraints. So perhaps me using an xpm cdc_pulse is making things worse? Sorry, I don't see why the pulse has to be 4 clock cycle wide. It seems guaranteed to me that the 100 MHz clock is going to see a '1' at some point.
03-19-2021 08:01 AM
By the way, for event crossings (like this), I have moved away from the pulse stretching synchronizer to the toggle synchronizer - the characteristics of the toggle synchronizer are simply "better" - they can cross more events per second than the pulse stretching synchronizer. Furthermore, other than managing the number of events that cross the domain, the toggle synchronizer doesn't need to change based on the frequencies of the two clocks - it works on any ratio of frequencies (fast to slow or slow to fast) and doesn't need a parameter to tell it anything about the ratio (such as setting the number of clocks required for stretching).
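For readers who have not met the toggle synchronizer, a minimal behavioral sketch in Python follows. It is a cycle-based model, not synthesizable HDL and not the code from the linked post; the class name, the two-stage depth and the 3:1 clock ratio in `count_pulses` are assumptions made for the example.

```python
class ToggleSync:
    """Behavioral model of a toggle synchronizer (illustrative only)."""

    def __init__(self):
        self.src_toggle = 0   # flop in the source domain
        self.s0 = 0           # first synchronizer stage (destination domain)
        self.s1 = 0           # second synchronizer stage
        self.s2 = 0           # extra stage used for change detection

    def src_clock(self, event):
        """Source clock edge: flip the toggle flop once per event."""
        if event:
            self.src_toggle ^= 1

    def dst_clock(self):
        """Destination clock edge: shift the toggle in, pulse on any change."""
        self.s2, self.s1, self.s0 = self.s1, self.s0, self.src_toggle
        return self.s1 ^ self.s2


def count_pulses(event_cycles, n_cycles=40, src_per_dst=3):
    """Count destination pulses for events given in source-clock cycles."""
    sync = ToggleSync()
    pulses = 0
    for c in range(n_cycles):
        if c % src_per_dst == 0:
            pulses += sync.dst_clock()   # destination samples the old value
        sync.src_clock(c in event_cycles)
    return pulses


if __name__ == "__main__":
    print(count_pulses({1, 5, 12}))  # events spaced more than one dst period
    print(count_pulses({1, 2}))      # events too close together
```

Events spaced further apart than one destination period each produce one pulse, with no stretch length to tune. Two events inside one destination period toggle the flop back before it is sampled and are lost, which is the "managing the number of events" caveat.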
Take a look at this post on toggle synchronizers.
03-19-2021 08:04 AM
Thank you, yes. I was thinking exactly the same thing: use a toggle on the source clock. I do this too, because I lost track of slow-fast and fast-slow transitions when designs changed frequencies, so I changed to using a q bit toggle.
03-19-2021 09:19 AM
Why do you think you will always see 0, 0, 1, 1, 0, 0?
Could it be that occasionally you are seeing 001000, or 001110?
Would it be better to have your detection logic allow for this?
03-19-2021 09:53 AM
And if the clocks are just a few ps different, what will happen?
Just a slight power glitch could cause that.
I would always tend to code conservatively.
It's the same as programming C for a real-time app:
you would never test if A = B, but if A >= B, just in case a glitch causes the counter to jump.
It may be 1 in 10 billion, but it happens, and it's called defensive coding.
Learn to do the same in FPGAs, especially when it comes to clock edges.
If you had just detected the rising edge, then this would always be fine, even if you captured just 1 or 3 highs.
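A tiny Python sketch of that point: a rising edge detector only cares about a 0-to-1 transition between consecutive samples, so it gives one output pulse whether the 100 MHz domain captured one, two or three highs. The function name is invented for this example.

```python
def rising_edges(samples):
    """Model a registered rising edge detector: out = sample AND NOT previous."""
    prev = 0
    out = []
    for s in samples:
        out.append(1 if (s and not prev) else 0)
        prev = s
    return out


if __name__ == "__main__":
    for pattern in ("001100", "001000", "001110"):
        samples = [int(ch) for ch in pattern]
        print(pattern, "->", rising_edges(samples))
```

Each of the three patterns produces exactly one pulse, so the FIFO write enable would not depend on how many highs were sampled.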
03-19-2021 10:08 AM
The place and route tool knows the clocks are related so I'd expect clock skew to be taken care of as Avrum says. The pulse created on the 300 MHz domain will take one of 3 phases (see below). The 100 MHz rising edge detector only needs to see one of these logic level '1's in order to output a 100 MHz pulse. There are no attributes, constraints that will be overriding the tool's 100-300 and 300-100 MHz clock analysis. I will try the toggle method but I'm still baffled why this seemingly trivial circuit is not reliable.....
03-19-2021 10:38 AM
It's nothing to do with the clock phase,
it's down to the reality of a silicon circuit.
All buffers will have a finite rise / fall time.
All buffers will have a rise / fall propagation time that's dependent upon the Process / Voltage / Temperature of the chip you have, and it will be different across the chip.
Thus the two clocks will have jitter on both edges.
The tools know this, and if you have data on both clocks, which are synchronised, from say a register, to another, with a gate between, then the tools ensure that no matter what the tolerance build up is, that a signal leaves one domain and will be picked up on the other.
But as others have said, it depends how the clocks have been generated, and how the logic has been implemented by the tools,
It's much safer to design conservatively and detect the rising edge, rather than rely on the 001100 situation, as for all sorts of reasons this may or may not happen.
Never make a design that you want to be reliable depend on two clock edges being totally aligned, as at some point it will not happen and your design will fail.
Then you are in control.
03-19-2021 10:57 AM
Sorry, I don't see why the pulse has to be 4 clock cycle wide. It seems guaranteed to me that the 100 MHz clock is going to see a '1' at some point.
In this topic we are discussing both when the two clocks are synchronous and when they are asynchronous (or have an exception on the clock crossing path). For a synchronous clock crossing (the two clocks from the same MMCM with no exceptions) a pulse of 3 source clock periods wide with no metastability flip-flops and no edge detection is sufficient. There are exactly 3 synchronous rising edges of the source clock for every 1 synchronous rising edge of the destination clock. That means that one (exactly one) of three 300MHz clocks where the signal is high will line up with the rising edge of the 100MHz clock; no metastability flip-flops and no edge detector are necessary.
If the clocks are asynchronous or even mesochronous, it would require 4 source clocks, since your source clock is 3x your destination clock.
The real requirement is that the length of the pulse (Tpulse) satisfies
Tpulse >= Tdst_clk + Tsu + Th + Tjit
Where Tsu/Th are the setup and hold of the capture FF and Tjit is the cycle-cycle jitter.
But it is hard to know Tsu, Th and Tjit, so people, in general, just approximate this as one more clock period.
So here Tdst_clk is 3*Tsrc_clk (due to the 3:1 ratio of your clocks). It is also fairly safe to assume that Tsu+Th+Tjit < Tsrc_clk.
So that means that if Tpulse >= 4*Tsrc_clk then this should be sufficient at least for a mesochronous clock crossing.
For a truly asynchronous clock crossing you have to realize that Tdst_clk would not be exactly equal to 3*Tsrc_clk; for example, the source clock might be 300MHz + 500ppm and the destination clock 100MHz - 200ppm, so you would need to account for these differences as well. But, again, adding this difference to Tsu+Th+Tjit will still give less than Tsrc_clk, so the above formula would still work.
But certainly if Tpulse >= 2*Tdst_clk, you are even safer (since 2*Tdst_clk is 6*Tsrc_clk).
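The arithmetic above can be checked with a few lines of Python. This is only a worked example: the 1 ns figure used for Tsu+Th+Tjit is a hypothetical placeholder (the post only says it is fairly safe to assume the sum is below one source period), and the function name is made up.

```python
import math
from fractions import Fraction


def min_stretch_cycles(f_src_hz, f_dst_hz, margin_s, src_ppm=0, dst_ppm=0):
    """Smallest whole number of source cycles N with N*Tsrc >= Tdst + margin.

    margin_s stands in for Tsu + Th + Tjit. The ppm arguments shift each
    clock frequency; exact fractions avoid floating point rounding at the
    3.0 boundary.
    """
    f_src = Fraction(f_src_hz) * (1 + Fraction(src_ppm, 10**6))
    f_dst = Fraction(f_dst_hz) * (1 + Fraction(dst_ppm, 10**6))
    t_src = 1 / f_src
    t_dst = 1 / f_dst
    return math.ceil((t_dst + margin_s) / t_src)


if __name__ == "__main__":
    margin = Fraction(1, 10**9)  # hypothetical 1 ns for Tsu + Th + Tjit
    print(min_stretch_cycles(300_000_000, 100_000_000, margin))
    # Source 300 MHz + 500 ppm, destination 100 MHz - 200 ppm, as in the example.
    print(min_stretch_cycles(300_000_000, 100_000_000, margin,
                             src_ppm=500, dst_ppm=-200))
    # With no margin at all the ratio alone gives 3, which is the synchronous case.
    print(min_stretch_cycles(300_000_000, 100_000_000, Fraction(0)))
```

With that assumed margin, both the nominal and the ppm-shifted cases come out at 4 source cycles, and a zero margin gives 3.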
03-19-2021 11:05 AM
OK, I got all of that, thank you for the detailed explanation! Going back to the original query. Since my clocks are related and I have no constraints or attributes, I don't see why my 2 (100 MHz) clock cycle pulse with edge detect would not work 100% of the time rather than build dependent. I could just have made it 1 (100 MHz) clock cycle wide too as you've suggested.
03-19-2021 12:39 PM - edited 03-19-2021 12:41 PM
Can I suggest,
yes I see your confusion,
what I'd take from this for now is to get into the habit of safe design practices; it will never do you any harm.
03-19-2021 01:02 PM
If the receive clock FFs were replaced with an SRL16, could that explain it? I know they're not the same as FFs when it comes to metastability. The synthesis schematic shows them as FFs, but I'm not sure if PAR replaces them with SRLs. I could add a reset to ensure no SRLs are used. Is there a report I can read? I'm more familiar with the Actel tool flow.
03-19-2021 05:02 PM
So perhaps me using an xpm cdc_pulse is making things worse?
Internal to the xpm_cdc_pulse macro is the xpm_cdc_single macro. The xpm_cdc_single macro automatically writes a set_false_path constraint for the path ending on its input. So, Vivado is free to make this path any length during implementation with no guarantee that it will be the same length from implementation to implementation.
03-19-2021 09:03 PM
OK. So let's reset a bit.
If we make sure that these two clocks are from the same MMCM and that they use the same type of clock buffer and (if they are UltraScale/UltraScale+/Versal) the two clocks are in the same CLOCK_DELAY_GROUP, then this is not an asynchronous clock domain crossing circuit (CDCC), it is a synchronous CDCC (which not everyone even calls a CDCC).
Given that the two clocks are synchronous, what you have described should work and should not be implementation dependent. The two back to back flip-flops (that look like synchronizing flip-flops) are unnecessary but harmless, and the pulse stretching on the 300MHz domain and the rising edge detection on the 100MHz should work assuming
You need to be certain that there are no timing exceptions on these paths, and I would even have the tool report the paths between the 300MHz and 100MHz domain to see that they look "normal" (no exceptions, balanced clock skew, etc... - You can post the timing report here to be sure).
Next, I would ask - how did you look at this with an ILA? An ILA can only use one sample clock, and you are trying to look at signals on two clock domains - how did you do this? If your ILA clock is the 300MHz clock domain, then when you capture a signal from the 100MHz domain you are essentially doing a synchronous CDCC back from the 100MHz domain to the 300MHz domain. Again, this should work, but it is no more or no less suspect than your forward CDCC!
So, with what you are describing, we can't see anything. If all the information you gave us was correct, then this should work reliably and consistently.
You could try posting the code of the input capture up to the FIFO push, and also the capture from the ILA (as well as the timing reports I talked about earlier).
But without anything else, we can't find the problem.
03-20-2021 03:11 AM
Thank you for the reply. My PC has been blue screening recently, and yesterday it started corrupting Vivado projects; synthesis will not run, etc. It's not safe to continue on that PC. For now I've trimmed down the VHDL into a Microsemi Libero project. I won't be able to build in Vivado and get the timing reports for a while.
Attached is a behavioral simulation and the source code.
03-20-2021 03:14 AM - edited 03-20-2021 09:23 AM
Regarding ILA captures. I had two cores, one on 300 MHz domain and one on the 100 MHz domain. As mentioned, I have 8 identical data lanes. I don't have the ILA screenshots but I was definitely seeing all 8 lane stretched 6 cycle enable flags on the 300 MHz ILA domain, but not all 8 FIFO_WEN_S signals on the 100 MHz domain.
Yellow trace is the stretched 6 cycle valid from 300 MHz domain
Red trace is the rising edge detected version on the 100 MHz domain. Data is subsequently written to the FIFO and the FIFO_EMPTY flag de-asserts
03-29-2021 01:45 AM
My build machine had a dodgy DDR4 RAM stick, which got the PC into a bit of a state. I'm now using a temporary machine.
Rather than import the .xci files, I decided to rebuild the IP cores from scratch. When creating the PLL for the 100 MHz and 300 MHz clocks, I noticed the matched routing option and realized I had not had it ticked. I now have a clean project (new PC) with that option ticked, and using the 6x stretched signal (without moving to the toggle FF method), it has worked on the last 6 builds.
It always met timing in the old project even without the PLL matched routing option ticked. Not sure why it's now working... Even if unrelated, the 100 MHz first FF is going to see the stretched 300 MHz signal "eventually".