cbemlahe
Explorer
1,314 Views
Registered: ‎09-18-2007

Intermittent CDC Using Phase Matched Clocks

I have inherited an 8-byte-lane NAND controller design that instantiates the same CDC edge detector for each byte lane when reading DQ(S) data.

Each lane block samples its respective DQS signal with the 300 MHz clock, captures four DQ bytes, and presents them to a shallow FIFO. A stretched "ready" flag passes through a typical edge detector circuit into the 100 MHz clock domain. The CDC edge detect output is the write enable to the FIFO.

When all byte lane FIFOs are not empty, data is read. The shallow FIFOs align all byte lanes. 

The 300 MHz and 100 MHz clocks come from a PLL/MMCM. The 100 MHz clock also sources the NAND DQS (albeit phase delayed).

With some builds, no data is read from the FIFOs. After adding ChipScope/ILA, I see that all lanes have a stretched input to the CDC, but some of the lanes do not detect the edge. So some of the edge detectors are not working, and it's build-dependent.

I see no reason for this circuit not to work. There are no ASYNC_REG placement attributes applied but I think CDC crossing will be handled by the tools as part of the 300-100 MHz related clock structure.

I have replaced the edge detectors with Xilinx XPM xpm_cdc_pulse macros. Is this really needed? It's too early to tell if this is a definitive fix. I'm concerned because I can't see why some of the original CDCs should fail.

nand_cdc.JPG
20 Replies
1,274 Views
Registered: ‎01-22-2015

@cbemlahe 

I have not understood all the details of your design.  However, for a synchronizer, you must:

  • set ASYNC_REG=TRUE for each register used in the synchronizer
  • place a timing exception on the path coming into the first register of the synchronizer

Keep in mind that the transit time of a signal through a synchronizer is uncertain.  This is especially true if you have used set_false_path as the timing exception into the synchronizer.  So, perhaps your design is suffering from the uncertainty?

Also, you show that the 100MHz and 300MHz clocks are related and that the stretched pulse originates in the 300MHz clock domain.  You should not need a synchronizer to cross the stretched pulse into the 100MHz clock domain.  The transit time of this direct crossing is still uncertain (i.e. it could take either 3, 2, or 1 cycles of the 300MHz clock), although there are ways to make the transit time of this direct crossing constant.

Cheers,
Mark

cbemlahe
Explorer
1,259 Views
Registered: ‎09-18-2007

Thanks for the reply. Indeed, I don't "think" I need a true synchronizer. The circuit in the green is a standard rising edge detector. Yes the transit time will vary, and I have 8 of these in parallel each sampling DQS lane bits. But in chipscope I can see all 8 inputs to the edge detectors, but I don't see all the outputs. I expect to see all 8 outputs "eventually".

avrumw
Expert
1,246 Views
Registered: ‎01-23-2009

As markg@prosensing.com said, crossing between two clocks that come from the same MMCM does not require an asynchronous clock domain crossing circuit. These clocks are synchronous: they come out of the MMCM with guaranteed low skew. If they use the same type of clock buffer (either both BUFGs or derivatives of BUFGs, or BUFHs if you are in 7 series) and (for UltraScale/UltraScale+/Versal) they have been placed in the same CLOCK_DELAY_GROUP, then the clock tree delay is balanced, so they arrive at the destinations with controlled skew. Furthermore, the tools understand all of this and take the skews into account during static timing analysis. So you can cross synchronously between these two domains.

The only issue (again, as Mark said) is making sure that you handle the fact that the 100MHz clock has only one edge for every 3 of the 300MHz clock, so you need some mechanism for managing that. What you are doing would probably work, but it would be sufficient to just stretch your pulse to exactly three clock periods at 300MHz and sample that at 100MHz - you wouldn't even need the edge detector.
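As a rough illustration of the synchronous case, here is a minimal Python sketch (a behavioral model, not HDL, and not taken from the design in question). It assumes ideal, phase-aligned clocks where a 100 MHz edge coincides with every third 300 MHz edge, and that the 100 MHz flip-flop captures the value launched one 300 MHz cycle earlier. The name `captured_ones` and its parameters are made up for this example.

```python
def captured_ones(start, width, ratio=3, horizon=30):
    """Count destination-clock samples that see the stretched pulse high.

    Time is measured in source (300 MHz) cycles. The pulse is high for
    source cycles start .. start + width - 1. A destination edge occurs
    at every `ratio`-th source edge and captures the value launched one
    source cycle before it (idealized synchronous crossing).
    """
    count = 0
    for dst_edge in range(ratio, horizon, ratio):
        launched_cycle = dst_edge - 1
        if start <= launched_cycle < start + width:
            count += 1
    return count


# Sweep the three possible start phases relative to the 100 MHz edge.
for width in (2, 3, 6):
    print(width, [captured_ones(start, width) for start in (3, 4, 5)])
```

Under these assumptions a 3-cycle pulse is seen exactly once at every start phase, a 6-cycle pulse is seen twice (hence the need for an edge detector), and a 2-cycle pulse is missed at one of the phases.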

There is a possibility that your clock domain crossing circuit (CDCC) is messing this up. Normally a CDCC contains a timing exception on the actual path between the last flip-flop on the source domain and the first flip-flop on the destination domain; this constraint is either a set_false_path or a set_max_delay -datapath_only. Both of these constraints will disable the normal timing checks that are done between the 300MHz and 100MHz related clocks - turning this from a synchronous clock crossing to a mesochronous clock crossing. If the clock crossing is mesochronous or asynchronous, then the pulse stretching must be strictly larger than the ratio of the clock periods - it must be at least 4 clock periods - three is insufficient. If you set it to 3, then you could well see behavior like you describe - on some runs (depending on the length of the routing of this path) it might work and on others it wouldn't.
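The "strictly larger than the ratio" point can be sketched with a small Python model. This is only an illustration under stated assumptions: the phase between the two clocks is treated as unknown but fixed (the mesochronous case), the pulse must be stable-high for the whole setup/hold window of some destination edge, and the 0.1 ns setup and hold values are placeholders rather than device data. All names here are hypothetical.

```python
from fractions import Fraction

T_SRC = Fraction(10, 3)  # 300 MHz period in ns
T_DST = Fraction(10)     # 100 MHz period in ns


def cleanly_captured(width_cycles, phase_ns, t_su_ns, t_h_ns):
    """True if some destination edge sees the pulse stable-high for its
    whole setup/hold window. The pulse is high from t=0 to
    width_cycles*T_SRC; destination edges occur at phase_ns + k*T_DST."""
    pulse_start, pulse_end = Fraction(0), width_cycles * T_SRC
    edge = phase_ns
    while edge - t_su_ns < pulse_end:
        if pulse_start <= edge - t_su_ns and edge + t_h_ns <= pulse_end:
            return True
        edge += T_DST
    return False


def all_phases_ok(width_cycles, t_su_ns=Fraction(1, 10),
                  t_h_ns=Fraction(1, 10), steps=100):
    """Sweep the unknown phase offset across one destination period."""
    return all(
        cleanly_captured(width_cycles, T_DST * k / steps, t_su_ns, t_h_ns)
        for k in range(steps)
    )


print("3 cycles:", all_phases_ok(3))
print("4 cycles:", all_phases_ok(4))
```

With these placeholder numbers, a 3-cycle stretch fails for at least one phase offset in the sweep, while a 4-cycle stretch is captured cleanly at every offset, which matches the "depends on the routing length of this run" behavior described above.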

Avrum

cbemlahe
Explorer
1,239 Views
Registered: ‎09-18-2007

Thanks for the reply. I'm using Kintex UltraScale. I don't understand the last two paragraphs. Since the clocks are phase related, I think you could just have a pulse 3 clocks wide at 300 MHz and not have an edge detector at all. The fact that the pulse is 6 clocks wide at 300 MHz means the 100 MHz domain will see 0, 0, 1, 1, 0, 0. At some point surely that is guaranteed.

The CDCC is just a few lines of VHDL, no attributes or constraints. So perhaps me using an xpm cdc_pulse is making things worse? Sorry, I don't see why the pulse has to be 4 clock cycles wide. It seems guaranteed to me that the 100 MHz clock is going to see a '1' at some point.

 

avrumw
Expert
1,238 Views
Registered: ‎01-23-2009

By the way, for event crossings (like this), I have moved away from the pulse stretching synchronizer to the toggle synchronizer - the characteristics of the toggle synchronizer are simply "better" - they can cross more events per second than the pulse stretching synchronizer. Furthermore, other than managing the number of events that cross the domain, the toggle synchronizer doesn't need to change based on the frequencies of the two clocks - it works on any ratio of frequencies (fast to slow or slow to fast) and doesn't need a parameter to tell it anything about the ratio (such as setting the number of clocks required for stretching). 

Take a look at this post on toggle synchronizers.
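For readers who have not met the technique, here is a minimal behavioral sketch in Python of the general toggle-synchronizer idea (it is not the code from the linked post). The source domain flips a flag once per event; the destination domain passes the flag through two flip-flops and XORs the result with a one-cycle-delayed copy, so each change of level yields one pulse. The function names and the 3:1 sampling used in the demo are assumptions for illustration.

```python
def toggle_levels(events):
    """Source domain: flip the flag on every cycle that carries an event."""
    level, out = 0, []
    for event in events:
        if event:
            level ^= 1
        out.append(level)
    return out


def toggle_sync_pulses(toggle_samples):
    """Destination domain: two synchronizer flops plus a change detector.

    toggle_samples is the flag level seen at successive destination edges.
    """
    sync1 = sync2 = prev = 0
    pulses = []
    for level in toggle_samples:
        # All three registers update on the same destination clock edge.
        prev, sync2, sync1 = sync2, sync1, level
        pulses.append(sync2 ^ prev)
    return pulses


# Two events, 6 source cycles apart, followed by idle cycles.
events = [0, 1, 0, 0, 0, 0, 0, 1] + [0] * 12
levels = toggle_levels(events)
for phase in range(3):
    # Destination samples every third source cycle at an arbitrary phase.
    print(phase, sum(toggle_sync_pulses(levels[phase::3])))
```

In this model each sampling phase yields exactly two destination pulses for the two source events, and nothing in the destination logic depends on the clock ratio. The events still need to be spaced far enough apart that the destination sees every level change.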

Avrum

cbemlahe
Explorer
1,232 Views
Registered: ‎09-18-2007

Thank you, yes, I was thinking exactly the same thing: to use a toggle on the source clock. I do this too, because I lost track of slow-to-fast and fast-to-slow transitions when designs changed frequencies, so I changed to using a Q-bit toggle.

drjohnsmith
Teacher
1,186 Views
Registered: ‎07-09-2009

Why do you think you will always see 0, 0, 1, 1, 0, 0?

Could it be that occasionally you are seeing 001000, or 001110?

    Would it be better for your detect system to allow this?

 

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>
cbemlahe
Explorer
1,179 Views
Registered: ‎09-18-2007

Because the 100 MHz clock will see these two '1's. Some of the FIFO lanes see no write enable at all from the rising edge detect.

cbemlahe_0-1616171535599.png

 

drjohnsmith
Teacher
1,172 Views
Registered: ‎07-09-2009

And if the clocks are just a few ps different, what will happen?

    Just a slight power glitch could cause that.

I would always tend to code conservatively.

    It's the same as in programming C for a real-time app:

        you would never write if A == B, but if A >= B, just in case a glitch causes the counter to jump.

              It may be 1 in 10 billion, but it happens, and it's called defensive coding.

Learn to do the same in FPGAs, especially when it comes to clock edges.

If you had just detected the rising edge, then this would always be fine, even if you captured just 1 or 3 highs.
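The point about a rising-edge detector tolerating 1, 2 or 3 captured highs can be checked with a tiny Python model. This is a sketch of a generic "current sample high, previous sample low" detector, not the VHDL from the design; the function name is made up.

```python
def rising_edges(samples):
    """One-cycle pulse wherever a sample is 1 and the previous one was 0."""
    prev, out = 0, []
    for sample in samples:
        out.append(1 if (sample and not prev) else 0)
        prev = sample
    return out


for pattern in ("001000", "001100", "001110"):
    samples = [int(bit) for bit in pattern]
    print(pattern, rising_edges(samples))
```

All three patterns produce exactly one output pulse, so the detector itself does not care how many destination clocks see the stretched flag high, as long as at least one does and the flag returns low between events.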

 

cbemlahe
Explorer
1,158 Views
Registered: ‎09-18-2007

 

 

The place and route tool knows the clocks are related so I'd expect clock skew to be taken care of as Avrum says. The pulse created on the 300 MHz domain will take one of 3 phases (see below). The 100 MHz rising edge detector only needs to see one of these logic level '1's in order to output a 100 MHz pulse. There are no attributes, constraints that will be overriding the tool's 100-300 and 300-100 MHz clock analysis. I will try the toggle method but I'm still baffled why this seemingly trivial circuit is not reliable.....

cbemlahe_0-1616173369045.png

 

drjohnsmith
Teacher
1,141 Views
Registered: ‎07-09-2009

It's nothing to do with the clock phase;

   it's down to the reality of a silicon circuit.

All buffers will have a finite rise / fall time.

   All buffers will have a rise / fall propagation time that's dependent upon the Process / Voltage / Temperature of the chip you have, and will be different across the chip.

Thus the two clocks will have jitter on both edges.

The tools know this, and if you have data passing between the two synchronised clocks, say from one register to another with a gate between, then the tools ensure that no matter what the tolerance build-up is, a signal that leaves one domain will be picked up in the other.

But as others have said, it depends on how the clocks have been generated, and how the logic has been implemented by the tools.

It's much safer to design conservatively and detect the rising edge, rather than the 001100 situation, as for all sorts of reasons this may or may not happen.

Never make a design that you want to be reliable depend on two clock edges being totally aligned, as at some point it will not happen and your design will fail.

 

 

Then you are in control.

 

avrumw
Expert
1,134 Views
Registered: ‎01-23-2009

 Sorry, I don't see why the pulse has to be 4 clock cycle wide. It seems guaranteed to me that the 100 MHz clock is going to see a '1' at some point.

In this topic we are discussing both when the two clocks are synchronous and when they are asynchronous (or have an exception on the clock crossing path). For a synchronous clock crossing (the two clocks from the same MMCM with no exceptions) a pulse of 3 source clock periods wide with no metastability flip-flops and no edge detection is sufficient. There are exactly 3 synchronous rising edges of the source clock for every 1 synchronous rising edge of the destination clock. That means that one (exactly one) of the three 300MHz clocks where the signal is high will line up with the rising edge of the 100MHz clock; no metastability flip-flops and no edge detector are necessary.

If the clocks are asynchronous, or even mesochronous, it would require 4 source clocks, since your source clock is 3x your destination clock.

The real requirement for the pulse is that the length of the pulse (Tpulse) satisfies

Tpulse >= Tdst_clk + Tsu + Th + Tjit

Where Tsu/Th are the setup and hold of the capture FF and Tjit is the cycle-cycle jitter. 

But it is hard to know Tsu, Th and Tjit, so people, in general, just approximate this as one more clock period.

So here Tdst_clk is 3*Tsrc_clk (due to the 3:1 ratio of your clocks). It is also fairly safe to assume that Tsu+Th+Tjit < Tsrc_clk. 

So that means that if Tpulse >= 4*Tsrc_clk then this should be sufficient at least for a mesochronous clock crossing.

For a truly asynchronous clock crossing, you have to realize that Tdst_clk would not be exactly equal to 3*Tsrc_clk. If the source clock was 300MHz + 500ppm and the destination clock was 100MHz - 200ppm (just as an example), you would need to account for these differences as well. But, again, this difference added to Tsu+Th+Tjit will still be less than Tsrc_clk, so the above formula would still work.

But certainly if Tpulse >= 2*Tdst_clk, you are even safer (since 2*Tdst_clk is 6*Tsrc_clk).
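To put numbers on the inequality, here is a small Python sketch that turns Tpulse >= Tdst_clk + Tsu + Th + Tjit into a stretch count in source clocks. The setup, hold and jitter figures are placeholders chosen only so that their sum is below one source period; they are not taken from any datasheet, and `min_stretch_cycles` is a made-up helper name.

```python
import math
from fractions import Fraction


def min_stretch_cycles(t_src_ns, t_dst_ns, t_su_ns, t_h_ns, t_jit_ns):
    """Smallest whole number of source clocks N with
    N * Tsrc_clk >= Tdst_clk + Tsu + Th + Tjit."""
    t_pulse_min = t_dst_ns + t_su_ns + t_h_ns + t_jit_ns
    return math.ceil(t_pulse_min / t_src_ns)


T_SRC = Fraction(10, 3)  # 300 MHz period in ns
T_DST = Fraction(10)     # 100 MHz period in ns

# Placeholder margins (assumed): Tsu = Th = 0.1 ns, Tjit = 0.2 ns.
with_margin = min_stretch_cycles(T_SRC, T_DST, Fraction(1, 10),
                                 Fraction(1, 10), Fraction(2, 10))
# Idealized synchronous case with no margin terms at all.
no_margin = min_stretch_cycles(T_SRC, T_DST, 0, 0, 0)
print(with_margin, no_margin)
```

Any non-zero margin smaller than one source period pushes the result from 3 to 4 source clocks, which is the "one more clock period" approximation described above. Exact fractions are used so that the 3:1 ratio does not suffer floating-point rounding at the boundary.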

Avrum

cbemlahe
Explorer
1,121 Views
Registered: ‎09-18-2007

OK, I got all of that, thank you for the detailed explanation! Going back to the original query. Since my clocks are related and I have no constraints or attributes, I don't see why my 2 (100 MHz) clock cycle pulse with edge detect would not work 100% of the time rather than build dependent. I could just have made it 1 (100 MHz) clock cycle wide too as you've suggested.

drjohnsmith
Teacher
1,070 Views
Registered: ‎07-09-2009

Can I suggest, 

  yes I see your confusion,

What I'd take from this for now is to get into the habit of safe design practices; it will never do you any harm.

cbemlahe
Explorer
1,064 Views
Registered: ‎09-18-2007

If the receive clock FFs were replaced with an SRL16, could that explain it? I know they're not the same as FFs when it comes to metastability. Synthesis schematic shows them as FFs but not sure if PAR replaces with SRLs. I could add a reset to ensure no SRLs are used. Is there a report I can read? I'm more familiar with Actel tool flow

980 Views
Registered: ‎01-22-2015

@cbemlahe 

So perhaps me using an xpm cdc_pulse is making things worse? 

Internal to the xpm_cdc_pulse macro is the xpm_cdc_single macro.  The xpm_cdc_single macro automatically writes a set_false_path constraint for the path ending on its input.  So, Vivado is free to make this path any length during implementation with no guarantee that it will be the same length from implementation to implementation.

avrumw
Expert
899 Views
Registered: ‎01-23-2009

OK. So let's reset a bit.

If we make sure that these two clocks are from the same MMCM and that they use the same type of clock buffer and (if they are UltraScale/UltraScale+/Versal) the two clocks are in the same CLOCK_DELAY_GROUP, then this is not an asynchronous clock domain crossing circuit (CDCC), it is a synchronous CDCC (which not everyone even calls a CDCC).

Given that the two clocks are synchronous, what you have described should work and should not be implementation dependent. The two back to back flip-flops (that look like synchronizing flip-flops) are unnecessary but harmless, and the pulse stretching on the 300MHz domain and the rising edge detection on the 100MHz should work assuming

  • Your pulse stretcher is at least 3 x 300MHz clocks long (you say it is 6)
  • You always have at least 3 x 300MHz of 0 between pulses 
    • So your minimum profile is 3 cycles high followed by at least 3 cycles low
    • With your stretcher set to 6 cycles, this means 6 cycles high followed by at least 3 cycles low
      • One data transfer every 9 x 300MHz clocks minimum
  • Your data remains stable at least 12+ x 300MHz clocks after the rising edge of DQS on the 300MHz domain (maybe 1 for the pulse stretcher, up to 2 for waiting for the 100MHz edge, 3 for each of the two back-to-back flip-flops on the 100MHz domain, 3 more for the push into the FIFO)
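The cycle counts in the list above can be tallied in a few lines of Python. The per-stage figures are the ones quoted in the bullets; the dictionary keys and variable names are only labels for this sketch.

```python
SRC_MHZ = 300  # source clock frequency

# Data-hold budget after the DQS rising edge, in 300 MHz cycles.
hold_budget = {
    "pulse stretcher": 1,
    "wait for the 100 MHz edge (worst case)": 2,
    "two back-to-back 100 MHz flip-flops": 2 * 3,
    "push into the FIFO": 3,
}
hold_cycles = sum(hold_budget.values())

# Minimum spacing between transfers with the 6-cycle stretcher.
stretch_high, min_low = 6, 3
transfer_cycles = stretch_high + min_low


def cycles_to_ns(cycles):
    """Convert 300 MHz cycles to nanoseconds."""
    return cycles * 1000 / SRC_MHZ


print(hold_cycles, cycles_to_ns(hold_cycles))
print(transfer_cycles, cycles_to_ns(transfer_cycles))
```

That gives 12 source cycles (40 ns) of required data stability and at most one transfer every 9 source cycles (30 ns), which are the limits the DQ capture path would have to respect under these assumptions.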

You need to be certain that there are no timing exceptions on these paths, and I would even have the tool report the paths between the 300MHz and 100MHz domain to see that they look "normal" (no exceptions, balanced clock skew, etc... - You can post the timing report here to be sure).

Next, I would ask - how did you look at this with an ILA? An ILA can only use one sample clock, and you are trying to look at signals on two clock domains - how did you do this? If your ILA clock is the 300MHz clock domain, then when you capture a signal from the 100MHz domain you are essentially doing a synchronous CDCC back from the 100MHz domain to the 300MHz domain. Again, this should work, but it is no more or no less suspect than your forward CDCC!

So, with what you are describing, we can't see anything. If all the information you gave us was correct, then this should work reliably and consistently.

You could try posting the code of the input capture up to the FIFO push, and also the capture from the ILA (as well as the timing reports I talked about earlier).

But without anything else, we can't find the problem.

Avrum

cbemlahe
Explorer
761 Views
Registered: ‎09-18-2007

Thank you for the reply. My PC has been blue screening recently, and yesterday it started corrupting Vivado projects; synthesis will not run, etc. It's not safe to continue on that PC. For now I've trimmed down the VHDL into a Microsemi Libero project. I won't be able to build on Vivado and get the timing reports for a while.

Attached is a behavioral simulation and the source code.

 

 

NAND Capture.JPG
cbemlahe
Explorer
761 Views
Registered: ‎09-18-2007

Regarding ILA captures. I had two cores, one on 300 MHz domain and one on the 100 MHz domain. As mentioned, I have 8 identical data lanes. I don't have the ILA screenshots but I was definitely seeing all 8 lane stretched 6 cycle enable flags on the 300 MHz ILA domain, but not all 8 FIFO_WEN_S signals on the 100 MHz domain. 

Yellow trace is the stretched 6 cycle valid from 300 MHz domain

Red trace is the rising edge detected version on the 100 MHz domain. Data is subsequently written to the FIFO and the FIFO_EMPTY flag de-asserts

cbemlahe
Explorer
520 Views
Registered: ‎09-18-2007

My build machine had a dodgy DDR4 RAM stick, which got the PC into a bit of a state. I'm now using a temporary machine.

Rather than import the .xci files, I decided to rebuild the IP cores from scratch. When creating the PLL for the 100 MHz and 300 MHz clocks, the matched routing option made me realize I did not have this ticked. I now have a clean project (new PC) with that option ticked, and when using the 6x stretched signal (without moving to the toggle FF method), it has worked on the last 6 builds.

It always met timing in the old project even without the PLL matched routing option ticked. Not sure why it's now working... Even if unrelated, the 100 MHz first FF is going to see the stretched 300 MHz signal "eventually".

 

 

Xilinx PLL Matched Routing.JPG
Xilinx NAND Clock Interraction Report.png